Do You Review AI-Generated Test Code?
Author: Tomotaka ASAGI | Published: Feb 22, 2026
Introduction
I've been asking AI to write automated test scripts more and more often. Honestly, don't you think the code that comes back is surprisingly well-crafted?
Until around the end of last year, I still found myself wanting to tweak things here and there. But now, I'd say AI writes exactly what I ask for. (It all comes down to how you ask, of course.)
On one project, I had AI generate a Page Object Model for Playwright. When I looked at the code, the selectors were completely different from what I was used to in my Selenium + Java days. Instead of CSS selectors or XPath, it was using getByRole and getByLabel — role-based selectors.
"What does this code actually mean?"
What would you do in this situation? Accept it because it works, or stop and dig deeper?
I always review AI-generated code. I never submit code I don't understand — no exceptions. So I looked into it. Turns out, these are Playwright's recommended selectors, tied to accessibility principles. AI had used a better approach that I simply didn't know about, while still delivering exactly what I asked for. I genuinely thought, "Things have really progressed."
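To make this concrete, here is a minimal sketch of the kind of Page Object the AI produced. The page structure, field labels, and class name are made up for illustration; the point is the role- and label-based locators.

```typescript
// login.page.ts: an illustrative Page Object (names and labels are invented)
import { type Locator, type Page } from '@playwright/test';

export class LoginPage {
  readonly emailInput: Locator;
  readonly passwordInput: Locator;
  readonly submitButton: Locator;

  constructor(private readonly page: Page) {
    // Role- and label-based locators instead of CSS selectors or XPath
    this.emailInput = page.getByLabel('Email');
    this.passwordInput = page.getByLabel('Password');
    this.submitButton = page.getByRole('button', { name: 'Log in' });
  }

  async goto(): Promise<void> {
    await this.page.goto('/login');
  }

  async login(email: string, password: string): Promise<void> {
    await this.emailInput.fill(email);
    await this.passwordInput.fill(password);
    await this.submitButton.click();
  }
}
```

Because these locators target what a user (or a screen reader) actually perceives, they tend to survive markup refactoring better than CSS or XPath selectors, which is part of why Playwright recommends them.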
But what AI writes isn't always "something I didn't know about that happens to be better." On another project, accepting AI's output without enough scrutiny led to a problem I only noticed later.
AI Built BDD Tests with Cucumber.js
For that project, I wanted to manage E2E tests in BDD format using Playwright. I asked Claude Code to set it up, and it generated BDD tests using Cucumber.js — almost exactly what I had in mind.
Faster than writing everything from scratch, with clean structure throughout. "Impressive," I thought.
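For context, the setup looked roughly like this: Cucumber.js runs the Gherkin features, and the step definitions drive Playwright's library API directly. This is a simplified sketch rather than the actual project code; the step wording and URL are invented, and browser handling is reduced to bare hooks.

```typescript
// steps/login.steps.ts: simplified Cucumber.js steps driving Playwright
// (step wording and URL are invented; a proper World and error handling are omitted)
import { After, Before, Given, Then, When } from '@cucumber/cucumber';
import { type Browser, type Page, chromium } from 'playwright';

let browser: Browser;
let page: Page;

// Cucumber.js is the runner here, so the browser lifecycle lives in hooks,
// not in Playwright Test fixtures.
Before(async () => {
  browser = await chromium.launch();
  page = await browser.newPage();
});

After(async () => {
  await browser.close();
});

Given('I am on the login page', async () => {
  await page.goto('https://example.com/login');
});

When('I log in as {string}', async (email: string) => {
  await page.getByLabel('Email').fill(email);
  await page.getByLabel('Password').fill('secret');
  await page.getByRole('button', { name: 'Log in' }).click();
});

Then('I see the dashboard', async () => {
  // waitFor() resolves once the heading is visible, or throws on timeout
  await page.getByRole('heading', { name: 'Dashboard' }).waitFor();
});
```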
Then, a Problem Came Up
A little later, a requirement came in for VRT (Visual Regression Testing). Playwright has a built-in toHaveScreenshot() assertion, so I assumed we could handle it right away.
But then I discovered that when you're using Cucumber.js, you can't use Playwright Test's test runner. That means no toHaveScreenshot(), no HTML Reporter, no Fixtures — none of the Playwright Test features. To implement VRT, we'd have to build it from scratch.
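For comparison, this is roughly what built-in VRT looks like when Playwright Test itself is the runner, which is exactly what the Cucumber.js setup gives up (the URL and baseline name are placeholders):

```typescript
// dashboard.visual.spec.ts: built-in VRT, only available under Playwright Test's runner
import { expect, test } from '@playwright/test';

test('dashboard matches the stored baseline', async ({ page }) => {
  await page.goto('https://example.com/dashboard');
  // Compares against a committed baseline image and fails with a diff when pixels change
  await expect(page).toHaveScreenshot('dashboard.png');
});
```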
The code AI wrote was good. It answered my request for "BDD tests" by using Cucumber.js, exactly as asked. But it didn't account for the VRT requirement that would come later, or the fact that Cucumber.js would make that difficult.
And of course it didn't. AI simply responded to the requirements I gave it.
The problem wasn't with AI's code — it was that I lacked the knowledge on my end.
AI's "Knowledge" vs Human "Wisdom"
The selector experience and the Cucumber.js experience — in both cases, my own knowledge hadn't kept up with what AI produced. The difference was in the outcome.
With the selectors, the thing I didn't know happened to be a step forward. With Cucumber.js, what I didn't know led to a real problem down the road.
AI has knowledge. It knows how to write Cucumber.js tests, and it knows Playwright-bdd just as well. When it comes to individual technologies, AI probably has broader knowledge than I do at this point.
But the judgment call — "This project might need VRT down the road, so let's go with Playwright-bdd instead of Cucumber.js" — that's hard for AI. It requires understanding the project context and anticipating future requirements.
AI has knowledge. But the wisdom to choose the right technology for the right situation — that's something humans need to bring.
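For reference, here is a rough sketch of the Playwright-bdd route I wish I had considered. The Gherkin features stay, but playwright-bdd generates spec files that run under Playwright Test, so toHaveScreenshot(), the HTML reporter, and fixtures remain available. Treat the option names as assumptions; they may differ between playwright-bdd versions.

```typescript
// playwright.config.ts: a sketch of a playwright-bdd setup
// (option names are assumptions and may differ between versions)
import { defineConfig } from '@playwright/test';
import { defineBddConfig } from 'playwright-bdd';

// Generates Playwright Test spec files from the .feature files,
// so Playwright Test remains the runner.
const testDir = defineBddConfig({
  features: 'features/**/*.feature',
  steps: 'steps/**/*.ts',
});

export default defineConfig({ testDir, reporter: 'html' });
```

Step definitions then come from createBdd() and receive Playwright Test fixtures such as page, so a VRT step can call toHaveScreenshot() directly (again, a hypothetical illustration):

```typescript
// steps/dashboard.steps.ts: hypothetical step definitions for illustration
import { expect } from '@playwright/test';
import { createBdd } from 'playwright-bdd';

const { Given, Then } = createBdd();

Given('I am on the dashboard', async ({ page }) => {
  await page.goto('https://example.com/dashboard');
});

Then('the dashboard matches the baseline', async ({ page }) => {
  // Available because Playwright Test is the runner
  await expect(page).toHaveScreenshot('dashboard.png');
});
```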
The Habit of Looking from Different Angles
Looking back, my thinking was one-directional. "I want BDD" → "Implement with Cucumber.js." That wasn't wrong, but I stopped there.
"Are there other ways to achieve BDD?" "How would each option handle future requirements?" If I'd had the habit of thinking sideways — not just digging deeper in one direction — I might have seen Playwright-bdd as an option from the start.
It's the same when identifying test perspectives. For any given condition, if there's a front, consider the back. If there's a right, think about the left. How far does the impact reach? What does general practice say? What about this specific domain? I've come to feel that the habit of thinking across multiple axes is essential — not just for test design, but for properly evaluating what AI gives us.
I'd like to explore this "way of broadening your thinking" more in the next article.
Summary
Having AI write test code is quickly becoming the norm. And AI does genuinely good work.
But if we stop at "it works, so it's fine," we may end up like I did — realizing too late that something was off. To evaluate AI's output, we need to understand the technology ourselves. At the very least, in our own area of expertise, I want to maintain knowledge equal to or greater than what AI has.
Gain knowledge. Turn it into wisdom.
That's a principle I want to keep holding onto as I continue working alongside AI.
(These are my thoughts as of late February 2026.)
This is Part 1 of the "AI × Software Testing" series. Next time, I'll take a closer look at how to broaden your thinking when identifying test perspectives.
Arrangility Sdn. Bhd.
https://www.arrangility.com