You know the feeling. CI goes red. You open the report. “Step 4 of 7: FAILED.” You stare at it. Step 4 is “Given the workspace is configured.” What does that even do? You open the step definition. It’s client.post("/api/workspaces", json={...}) followed by three lines stuffing things into a shared context dict. You trace context['workspace_id'] back through steps 1–3 to figure out what went wrong. Twenty minutes gone, and you still haven’t found the bug.
I had 615 acceptance tests like this. Across 9 bounded contexts. Every single one passed. And I dreaded every failure.
The Problem: Green Tests, Useless Reports
BDD promises that your tests read like specifications. The Gherkin side delivers on that:
Given Neda has overridden "PII Protection Standard" to "block" at workspace level
And Neda overrides "PII Protection Standard" to "warn" for "Sales Report Agent"
When Farid starts a Run of "Sales Report Agent"
Then a PolicyViolationDetected event is recorded with violation_action "warn"
Beautiful. Clear actors, clear intent, clear expected outcome. But behind each line? Raw HTTP calls, response parsing, assertions scattered across mutable shared state. The Gherkin reads like a spec. The code behind it reads like a bash script.
When a test failed, the report told me which step broke. It told me nothing about why. No trail of what happened. No context about what each actor did between steps. Just “step 4: FAILED” and good luck.
The Solution: The Screenplay Pattern
The Screenplay pattern gives structure to what happens behind each Gherkin step. Actors perform Tasks through Interactions, and verify outcomes by asking Questions.
The procedural version:
@given("a workspace is created")
def step_create_workspace(client, context):
    resp = client.post("/api/workspaces", json={"name": "Acme"})
    context["workspace_id"] = resp.json()["id"]
The Screenplay version:
@given("a workspace is created")
def step_create_workspace(get_actor):
    actor = get_actor("admin")
    actor.attempts_to(CreateWorkspaceViaApi("Acme Corp"))
Every attempts_to call emits an event. Every event shows up in the test report. When step 4 fails, the report now shows me who did what — which actor, which interaction, and what went wrong. That’s a bug report, not a scavenger hunt.
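To make that concrete, here is a minimal sketch of how an actor can turn every attempt into a reportable event. The Event shape, the recorded_events list, and the who_can/ability helpers are assumptions for illustration, not the exact implementation behind these tests.

from dataclasses import dataclass, field
from typing import Protocol

class Interaction(Protocol):
    def perform_as(self, actor: "Actor") -> None: ...

@dataclass
class Event:
    # Who did what, and how it ended: the raw material for the report.
    actor: str
    interaction: str
    outcome: str

@dataclass
class Actor:
    name: str
    abilities: dict = field(default_factory=dict)
    recorded_events: list = field(default_factory=list)

    def who_can(self, *abilities) -> "Actor":
        # Abilities are keyed by class so interactions can look them up.
        for ability in abilities:
            self.abilities[type(ability)] = ability
        return self

    def ability(self, kind):
        return self.abilities[kind]

    def attempts_to(self, *interactions: Interaction) -> None:
        for interaction in interactions:
            outcome = "failed"
            try:
                interaction.perform_as(self)
                outcome = "ok"
            finally:
                # Every attempt leaves a trace, pass or fail.
                self.recorded_events.append(
                    Event(self.name, type(interaction).__name__, outcome)
                )

    # Verifications read better as "should"; it behaves identically.
    should = attempts_to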
Why This Matters With AI
When AI agents write your code, the test suite becomes the real documentation. Not the README, not the architecture diagram — the tests. If those tests are procedural scripts with shared mutable dicts, the AI learns how to call your API, not what your system is supposed to do.
Screenplay tests read as behavior specs. An AI agent looking at actor.attempts_to(FullyEnableTool("Slack Integration")) understands the domain action. An AI agent looking at client.post("/api/tool-registrations/", json={"name": "Slack Integration", "enabled": True}) understands an HTTP call. Same behavior, completely different signal.
After integrating screenwright for cinematic BDD reports, every Screenplay event became visible in the output. Each report showed the full narrative — which actor attempted which interaction, through which ability, with what result. Debugging went from reading source code to reading the report. My AI agents started producing better code too, because the test suite they learned from finally communicated intent.
What This Looked Like in Practice
I refactored all 615 tests in a single commit. Each bounded context got its own screenplay/ package:
- Abilities — what actors can do (ManageToolRegistrations, UseApi, InspectEventStore)
- Interactions — atomic API operations (RegisterToolViaApi, OverridePlatformDefault)
- Tasks — composed business operations (FullyEnableTool, CreateWorkspaceWithAgent)
- Questions — verification queries (TheToolConnectionStatus, TheResponseStatusCode)
A shared layer at tests/screenplay/ provides the generic pieces — UseApi wraps the test client, InspectEventStore wraps the in-memory event store. Domain-specific interactions in each bounded context wrap these with business language.
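A rough sketch of the resulting layout; the bounded-context and file names below are illustrative, not the actual tree:

tests/
    screenplay/                  # shared layer: generic building blocks
        abilities.py             # UseApi, InspectEventStore
        interactions.py          # PostToEndpoint, GetFromEndpoint
    tool_registration/           # one bounded context (name illustrative)
        screenplay/
            interactions.py      # RegisterToolViaApi, OverridePlatformDefault
            tasks.py             # FullyEnableTool
            questions.py         # TheToolConnectionStatus
        test_tool_management.py  # step definitions use only screenplay objects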
The result: 12,042 events captured across the suite — 987 interactions, 65 tasks, 12 questions. Every one of them shows up in the BDD report as narration between Gherkin steps. 47 test files rewritten, 85 new screenplay modules created. Not a single assertion changed — only how the tests expressed themselves.
How to Build This In
Start small. Define UseApi and generic interactions like PostToEndpoint and GetFromEndpoint. These cover 80% of what your step definitions already do.
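A minimal sketch of those generic pieces, continuing the Actor sketch from earlier; keeping the last response on the actor is an assumption made here so Questions can read it later:

class UseApi:
    # Ability: owns the test client so interactions never touch it directly.
    def __init__(self, client):
        self.client = client

class PostToEndpoint:
    def __init__(self, path: str, payload: dict):
        self.path = path
        self.payload = payload

    def perform_as(self, actor) -> None:
        api = actor.ability(UseApi)
        # Stash the response on the actor for later Questions.
        actor.last_response = api.client.post(self.path, json=self.payload)

class GetFromEndpoint:
    def __init__(self, path: str):
        self.path = path

    def perform_as(self, actor) -> None:
        actor.last_response = actor.ability(UseApi).client.get(self.path)

In a step definition, that replaces the raw client call:

actor = Actor("admin").who_can(UseApi(client))
actor.attempts_to(PostToEndpoint("/api/workspaces", {"name": "Acme Corp"}))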
Then wrap them with domain language per bounded context. RegisterToolViaApi internally calls PostToEndpoint("/api/tool-registrations/", ...) but shows up as a domain action in the report. The test reads better and the failure message is immediately useful.
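The domain wrapper can be as thin as this; the payload fields are assumed for illustration:

class RegisterToolViaApi:
    # Shows up in the report as a domain action, not as a bare POST.
    def __init__(self, name: str, enabled: bool = False):
        self.name = name
        self.enabled = enabled

    def perform_as(self, actor) -> None:
        PostToEndpoint(
            "/api/tool-registrations/",
            {"name": self.name, "enabled": self.enabled},
        ).perform_as(actor)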
Compose interactions into Tasks when multiple steps always go together. Replace raw assertions with Questions — instead of assert resp.status_code == 200, write actor.should(SeeThat(TheResponseStatusCode(), is_(200))). The report shows what was asked and what was found.
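Sketches of both, building on the pieces above. SeeThat here is a hand-rolled check that resolves a Question against a PyHamcrest matcher, actor.should is the attempts_to alias from the Actor sketch (so verifications are recorded like any other event), and the enable endpoint path is assumed:

from hamcrest import assert_that, is_

class FullyEnableTool:
    # Task: a composed business operation built from smaller interactions.
    def __init__(self, name: str):
        self.name = name

    def perform_as(self, actor) -> None:
        actor.attempts_to(
            RegisterToolViaApi(self.name, enabled=True),
            PostToEndpoint(f"/api/tools/{self.name}/enable", {}),
        )

class TheResponseStatusCode:
    # Question: reads state, never asserts on its own.
    def answered_by(self, actor) -> int:
        return actor.last_response.status_code

class SeeThat:
    def __init__(self, question, matcher):
        self.question = question
        self.matcher = matcher

    def perform_as(self, actor) -> None:
        # The failure message names what was asked and what was found.
        assert_that(self.question.answered_by(actor), self.matcher)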
The investment is upfront. But every new test after the refactor automatically generates rich reports. Every failure tells you what happened without opening a single source file.
The Takeaway
Passing tests are the minimum. Tests that tell you what went wrong — without reading the source — are the goal. The Screenplay pattern turned 615 acceptance tests from a CI checkbox into something I actually use when things break. And when AI agents read those tests, they finally understand what the system does, not just how to call it.
This article was written by AI and approved by Hossein for publication.