Getting started with Playwright test agents

Writing end-to-end tests has never been the hard part. Keeping them alive is. Selectors drift, flows change, and the suite that gave you confidence six months ago quietly becomes the thing everyone retries in CI until it goes green.

Playwright test agents are an attempt to attack that maintenance loop directly. There are three of them, and they map neatly onto the lifecycle of a test:

  • 🎭 planner explores your running app and produces a Markdown test plan.
  • 🎭 generator turns that plan into executable Playwright tests, verifying selectors and assertions against the live page as it goes.
  • 🎭 healer runs the suite and, when a test fails, works out whether the test or the app is at fault and patches the former.

You can use each one on its own, run them in sequence, or chain them together in an agentic loop. I’ve been trying them out and this post covers how they fit together.

Setting up

The agents aren’t a separate install. They’re definitions, a collection of instructions and MCP tools, that Playwright generates into your project for whichever coding agent you use:

npx playwright init-agents --loop=vscode
npx playwright init-agents --loop=claude
npx playwright init-agents --loop=opencode

I ran the claude loop since Claude Code is what I use day to day, but the VS Code experience (you’ll need v1.105 or later) and OpenCode are equivalent. One thing worth knowing up front: the definitions should be regenerated whenever you update Playwright, because new tools and instructions ship with new versions. It’s a one-line command, but it’s easy to forget, and if you do, you’ll be running agents against stale instructions.

The seed test

Before the agents can do anything useful, they need a seed test:

import { test, expect } from './fixtures';

test('seed', async ({ page }) => {
  // this test uses custom fixtures from ./fixtures
});

This looks like it does nothing, but it’s doing two jobs. It gives the planner an initialised page to explore from, with all of your custom fixtures applied, so authentication, test data and routing are already handled before the agent starts clicking around. And it acts as the reference example for every test the generator writes. If your tests import from a custom fixtures file, the generated ones will too.

It’s the same principle as showing a new team member one good example rather than writing them a style guide. The seed test is the style guide.
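For reference, here is a minimal sketch of what a ./fixtures file could look like. It uses Playwright's test.extend to override the built-in page fixture so that every test, the seed included, starts signed in. The /login route, the field labels, the button name and the environment variable names are all hypothetical; substitute whatever your app actually needs.

```typescript
// fixtures.ts (a sketch: the route, labels and env var names are assumptions)
import { test as base, expect } from '@playwright/test';

export const test = base.extend({
  // Override the built-in `page` fixture so each test starts signed in.
  page: async ({ page }, use) => {
    // Relative URL: assumes `use.baseURL` is set in playwright.config.ts.
    await page.goto('/login');
    await page.getByLabel('Email').fill(process.env.TEST_USER_EMAIL ?? '');
    await page.getByLabel('Password').fill(process.env.TEST_USER_PASSWORD ?? '');
    await page.getByRole('button', { name: 'Sign in' }).click();

    // Hand the signed-in page over to the test body.
    await use(page);
  },
});

// Re-export `expect` so tests need only one import line.
export { expect };
```

Because the seed test imports test and expect from this file, the planner starts exploring from a signed-in page, and generated tests that copy the seed's import pick up the same setup.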

The planner

You give the planner a clear request, something like “generate a plan for guest checkout”, and it explores the app from your seed test’s page context. What comes back is not code but a Markdown file, for example specs/basic-operations.md, containing an overview of the application and a set of scenarios with numbered steps, expected results and the data they need:

### 1. Add Valid Todo

**Steps:**
1. Navigate to the application
2. Click on the todo input field
3. Type "Buy groceries"
4. Press Enter

**Expected Results:**
- Todo appears in the list
- Counter shows "1 item left"

I like this artefact a lot more than I expected to. It’s human-readable, so it can be reviewed by someone who doesn’t write Playwright, but it’s precise enough to generate tests from. It also lives in your repository, which means your test coverage has a documented intent next to it rather than buried in the heads of whoever wrote the suite.
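To make "precise enough to generate tests from" concrete, here is a small sketch that parses the scenario layout shown above into a typed structure. It is only an illustration of the plan's shape: the Scenario type and parsePlan function are my own names, and this is not how the generator reads plans. The sample plan is abbreviated.

```typescript
interface Scenario {
  title: string;
  steps: string[];
  expectedResults: string[];
}

// Parses "### N. Title" scenarios with "**Steps:**" and "**Expected Results:**"
// sections. Lines outside those sections are ignored.
function parsePlan(markdown: string): Scenario[] {
  const scenarios: Scenario[] = [];
  let current: Scenario | undefined;
  let section: 'steps' | 'expected' | undefined;

  for (const rawLine of markdown.split(/\r?\n/)) {
    const line = rawLine.trim();

    const heading = line.match(/^###\s+\d+\.\s+(.+)$/);
    if (heading) {
      current = { title: heading[1], steps: [], expectedResults: [] };
      scenarios.push(current);
      section = undefined;
      continue;
    }
    if (!current) continue; // skip anything before the first scenario

    if (line === '**Steps:**') { section = 'steps'; continue; }
    if (line === '**Expected Results:**') { section = 'expected'; continue; }

    const step = line.match(/^\d+\.\s+(.+)$/);
    if (step && section === 'steps') { current.steps.push(step[1]); continue; }

    const bullet = line.match(/^-\s+(.+)$/);
    if (bullet && section === 'expected') current.expectedResults.push(bullet[1]);
  }
  return scenarios;
}

const samplePlan = [
  '### 1. Add Valid Todo',
  '',
  '**Steps:**',
  '1. Navigate to the application',
  '2. Click on the todo input field',
  '3. Type a todo title',
  '4. Press Enter',
  '',
  '**Expected Results:**',
  '- Todo appears in the list',
  '- Counter shows "1 item left"',
].join('\n');

console.log(JSON.stringify(parsePlan(samplePlan), null, 2));
```

The point is that steps and expected results are separable by structure alone, which is what lets a reviewer and a tool read the same file.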

The generator

The generator takes a plan and turns each scenario into a real test. Crucially, it doesn’t write the code blind. It performs the steps in a live browser as it generates, verifying that the selectors resolve and the assertions hold before committing them to the file:

// spec: specs/basic-operations.md
// seed: tests/seed.spec.ts
import { test, expect } from '../fixtures';

test.describe('Adding New Todos', () => {
  test('Add Valid Todo', async ({ page }) => {
    const todoInput = page.getByRole('textbox', { name: 'What needs to be done?' });
    await todoInput.click();
    await todoInput.fill('Buy groceries');
    await todoInput.press('Enter');
    await expect(page.getByText('Buy groceries')).toBeVisible();
  });
});

Note the comment header: each generated file records which spec and which seed it came from, so the chain from intent to plan to test stays traceable. The output is ordinary Playwright: role-based locators and web-first assertions, the same code you’d hope a careful human would write.
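Since the header is just two comments, the traceability is easy to check mechanically. The sketch below assumes the header always has the `// spec:` and `// seed:` form shown above; the Provenance type and readProvenance function are hypothetical names of mine, not part of Playwright.

```typescript
interface Provenance {
  spec?: string;
  seed?: string;
}

// Reads `// spec:` and `// seed:` comments from the top of a test file's source.
// Stops at the first line that is neither blank nor a comment, so only the
// header is considered.
function readProvenance(source: string): Provenance {
  const provenance: Provenance = {};
  for (const rawLine of source.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '') continue;
    if (!line.startsWith('//')) break;

    const match = line.match(/^\/\/\s*(spec|seed):\s*(\S.*)$/);
    if (match) provenance[match[1] as keyof Provenance] = match[2];
  }
  return provenance;
}

const generated = [
  '// spec: specs/basic-operations.md',
  '// seed: tests/seed.spec.ts',
  "import { test, expect } from '../fixtures';",
].join('\n');

const headerInfo = readProvenance(generated);
console.log(headerInfo.spec, headerInfo.seed);
```

A CI step could build on this to flag generated tests whose spec file has since been renamed or deleted.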

The docs are honest about the limits here: generated tests may include initial errors. Which is where the third agent comes in.

The healer

The healer is the one I suspect most teams will reach for first because it addresses the pain you already have. You point it at a failing test and it:

  1. Replays the failing steps
  2. Inspects the current UI to find the equivalent elements or flows
  3. Suggests a patch, such as a locator update, a wait adjustment or a data fix
  4. Re-runs the test until it passes, or until guardrails stop the loop

The output is either a passing test or, and this is the detail I appreciate most, a skipped test if the healer believes the functionality itself is broken. That distinction matters. A tool that mechanically rewrites assertions until they pass would be worse than useless because it would convert real regressions into green builds. Skipping and flagging is the right behaviour: the suite keeps running, and a human gets a clear signal to investigate.
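The control flow described above can be modelled in a few lines. This is a toy sketch, not the healer's implementation: the hook names and result shapes are hypothetical, and the real agent works through a browser and MCP tools rather than callbacks. I've also modelled "guardrails stopped the loop" as its own outcome, which is my assumption rather than documented behaviour.

```typescript
// A toy model of the heal loop. All names here are hypothetical.
type RunResult = { passed: true } | { passed: false; error: string };

type Diagnosis =
  | { kind: 'test-is-stale'; patch: string } // e.g. a locator update
  | { kind: 'app-is-broken'; reason: string };

type HealOutcome =
  | { status: 'passed'; patches: string[] }
  | { status: 'skipped'; reason: string }
  | { status: 'gave-up'; patches: string[] };

interface HealerHooks {
  run(): Promise<RunResult>;
  diagnose(error: string): Promise<Diagnosis>;
  applyPatch(patch: string): Promise<void>;
}

async function heal(hooks: HealerHooks, maxPatches = 3): Promise<HealOutcome> {
  const patches: string[] = [];
  while (true) {
    const result = await hooks.run();
    if (result.passed) return { status: 'passed', patches };

    // Guardrail: stop rather than keep patching forever.
    if (patches.length >= maxPatches) return { status: 'gave-up', patches };

    const diagnosis = await hooks.diagnose(result.error);
    if (diagnosis.kind === 'app-is-broken') {
      // Never patch around a suspected regression; skip and flag it instead.
      return { status: 'skipped', reason: diagnosis.reason };
    }

    await hooks.applyPatch(diagnosis.patch);
    patches.push(diagnosis.patch);
  }
}
```

The important branch is the early return on 'app-is-broken': the loop only ever edits the test when the diagnosis says the test is what drifted.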

That said, I’d still review every heal with the same suspicion I’d apply to any automated fix. “The healer believes the functionality is broken” is doing a lot of work in that sentence, and the judgement is only as good as what the agent can observe.

Where I’d start

If you want to try this on an existing project, I’d resist the temptation to do the full planner-to-healer pipeline on day one:

  1. Run init-agents and write a proper seed test first. Everything downstream inherits its quality. If your fixtures are messy, every generated test will faithfully reproduce the mess.
  2. Point the healer at your flakiest existing test. It works on tests you wrote by hand just as happily as generated ones, and it’s the quickest way to evaluate the whole idea against pain you actually have.
  3. Only then let the planner loose, on one well-understood flow, and read the spec it produces before generating anything from it.

The plans-in-Markdown approach is the part I think will outlast the hype cycle. Tests generated from a reviewed, version-controlled statement of intent feel like a genuinely better starting point than tests generated from a prompt that vanished with the chat session. The agents still need supervision, but they’re doing the part of the job nobody enjoyed anyway.