---
title: "Beyond E2E Tests: AI Personas That Navigate Your App Like Real Users"
slug: 024-testing-through-the-eyes-of-real-users
date: 2026-02-17
category: systems
tags: [ai-agents, testing, personas, ux, open-source, persona-probe, browser-automation]
meansEndsRatio: 0.65
projects: [persona-probe]
conversationExcerpts: false
description: "Unit tests verify your code works. E2E tests verify your flows work. Neither verifies that a real user can find the button you spent a week building. AI personas fill the gap."
draft: false
---
# Testing Your SaaS Through the Eyes of Real Users

On January 18, 2026, I wrote a Playwright script and named it `sarah-test.mjs`. The script navigated to a staging deployment of a financial SaaS application I was building for municipal water utilities. It logged in, found the dashboard, created a scenario, and took screenshots at each step. Structurally, it was an E2E test. But the file was not named after a feature or a flow. It was named after a person.

Sarah Martinez is the Director of Riverside Water District. She manages 12,000 service connections, an $8 million annual budget, and a $10 million EPA-mandated treatment plant upgrade that she has 36 months to deliver. She needs to present 4-5 rate increase scenarios to her city council in 60 days. Her council members are not financial experts — they are local business owners, teachers, and retirees who need simple visuals they can explain to angry constituents at a town hall. Sarah is competent with Excel and QuickBooks but is not a software developer. She expects things to just work.

Sarah is not real. She is a persona definition. But the application had to work for her or it would work for nobody, because she represented every actual customer I intended to serve. The first `sarah-test.mjs` was crude: a hardcoded Playwright script that navigated specific URLs and asserted that elements existed. It found real problems. The "Create Scenario" button was three clicks deep in a navigation menu that Sarah would never explore. The financial terminology assumed a CPA's vocabulary that a utility director might not have. The export function produced data that was correct and completely unpresentable to a city council.

These are not bugs. Every test passed. Every endpoint returned 200. Every component rendered. The application was functionally correct and practically unusable for its intended audience. The gap between those two conditions is what persona testing fills.

## From Script to Framework

Within a week, one script became three. Sarah was joined by Tom Kowalski, a veteran GM managing a $45 million capital improvement program across 28,000 connections, and Maria Chen, a newly promoted director serving 7,500 connections with no formal training on her predecessor's systems. I expected the three personas to converge on the same priorities. They did not. Each persona had different goals, different levels of technical comfort, and different definitions of success.

The personas were implemented as Claude Code subagent definitions — markdown files that gave the AI model an identity, a backstory, a set of goals, and instructions to navigate the application using browser automation. The key constraint: the persona knows nothing about the codebase. It knows its job title, its problems, its success criteria, and a list of known routes. It navigates the application the way a real user would — by looking at labels, clicking things that seem relevant, and getting frustrated when they're not.

The results were immediate and specific. Sarah couldn't find the DSCR calculator because it was behind a dropdown she had no reason to open. Tom expected multi-year capital project entries but could only enter single-amount line items. Maria wanted a getting-started wizard and got a dashboard that assumed she already knew what she was looking at. Each persona produced findings that the others missed because each persona brought different assumptions about what the application should do.

The scoring matrix crystallized this. Each persona rated feature requests on a 1-5 scale. "Bill Impact Calculator" scored 5 from Sarah (she needed it for council presentations), 3 from Tom (useful but not critical for his capital planning), and 2 from Maria (she didn't know what it was yet). "Bond Sizing Calculator" reversed: 5 from Tom, 2 from Sarah, 1 from Maria. The same application seen through three different lenses produced three different priority lists. All three were correct.
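A minimal sketch of that matrix, using only the two features and six scores quoted above. The dictionary layout and the helper names (`priorities`, `disagreement`) are illustrative assumptions, not the project's actual data format.

```python
# Scores copied from the two features named above; the layout is hypothetical.
SCORES = {
    "Bill Impact Calculator": {"Sarah": 5, "Tom": 3, "Maria": 2},
    "Bond Sizing Calculator": {"Sarah": 2, "Tom": 5, "Maria": 1},
}


def priorities(persona: str, scores: dict = SCORES) -> list[str]:
    """Return feature names ordered from highest to lowest score for one persona."""
    return sorted(scores, key=lambda feature: scores[feature][persona], reverse=True)


def disagreement(feature: str, scores: dict = SCORES) -> int:
    """Spread between the highest and lowest persona score for a feature."""
    values = scores[feature].values()
    return max(values) - min(values)


if __name__ == "__main__":
    for name in ("Sarah", "Tom", "Maria"):
        print(name, priorities(name))
```

Sorting per persona rather than averaging across personas keeps the disagreement visible: an averaged score would hide the fact that Tom and Sarah rank these two features in opposite order.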

## The Classification Problem

The first iteration of persona testing produced binary verdicts: PASS or FAIL. This was useless. A persona that fails because a page didn't render and a persona that fails because a label was confusing are not the same kind of failure. The first is a bug. The second is a design decision. Treating them identically means you either fix everything (expensive) or ignore everything (negligent).

The system evolved to classify every finding into one of three categories:

**Fixable**: CSS change, label update, navigation link, component tweak. Things that can be done in a pull request without architectural debate. The DSCR calculator being undiscoverable is fixable — add a navigation link. Toast notifications disappearing too fast is fixable — increase the duration.

**Tradeoff**: New feature, architectural change, significant design decision. Things that require deciding whether the effort is worth the return. Multi-year capital project entries require a new data model. A council presentation mode requires a new view. These are legitimate product decisions, not bugs.

**False positive**: Misunderstanding of an existing feature. The persona navigated to `/debt-service` and reported a 404, but the correct route was `/scenarios/{id}/debt`. The feature exists; the persona didn't find it. The fix is not to the application — it is to the persona's known routes list, injected as a clarification in subsequent runs.

This three-way classification changed what the system optimized for. Instead of minimizing failures, it minimized *fixable* failures. Tradeoffs were documented but deferred. False positives were fed back as context. The loop could exit cleanly only when there were zero fixable items remaining and all personas scored above 60% readiness — a threshold that meant the application was usable, not perfect.
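The three categories and the exit rule can be sketched as follows. The class and function names are hypothetical; only the categories, the "zero fixable items" condition, and the 60% threshold come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    FIXABLE = "fixable"                # small change, no architectural debate
    TRADEOFF = "tradeoff"              # product decision, documented and deferred
    FALSE_POSITIVE = "false_positive"  # feature exists; the persona missed it


@dataclass
class Finding:
    persona: str
    summary: str
    category: Category


READINESS_THRESHOLD = 60.0  # percent; every persona must score above this


def can_exit(findings: list, readiness: dict) -> bool:
    """True when no fixable findings remain and every persona is above the threshold."""
    no_fixable = not any(f.category is Category.FIXABLE for f in findings)
    # An empty readiness map is treated as "not ready" (an assumption of this sketch).
    all_ready = bool(readiness) and all(
        score > READINESS_THRESHOLD for score in readiness.values()
    )
    return no_fixable and all_ready
```

Note that tradeoffs and false positives do not block the exit: only fixable findings and low readiness scores keep the loop running.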

## The Iterative Loop

The persona tests became one phase of an eight-phase automated improvement cycle. The full sequence: plan the fix, review the plan with a second AI model, implement, review the code, deploy to staging, run a visual fidelity inspection, run persona tests, triage the results. Phases six through eight — visual inspection, persona testing, and triage — were mandatory and could never be skipped, because they produced the quality scores that determined whether the loop continued or exited.
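One way to encode the phase order and the "never skipped" rule is sketched below. The phase identifiers are made-up labels for the eight steps listed above, and the assumption that the first five phases may be skipped is inferred from the text rather than confirmed by it.

```python
# Hypothetical identifiers for the eight phases, in execution order.
PHASES = [
    "plan",
    "plan_review",
    "implement",
    "code_review",
    "deploy_staging",
    "visual_inspection",
    "persona_tests",
    "triage",
]

# Phases six through eight produce the quality scores, so they always run.
MANDATORY = set(PHASES[5:])


def phases_to_run(skip=()):
    """Return the ordered phases for one iteration, refusing to skip mandatory ones."""
    skip = set(skip)
    unknown = skip - set(PHASES)
    if unknown:
        raise ValueError(f"unknown phases: {sorted(unknown)}")
    blocked = skip & MANDATORY
    if blocked:
        raise ValueError(f"cannot skip mandatory phases: {sorted(blocked)}")
    return [phase for phase in PHASES if phase not in skip]
```

Raising an error instead of silently re-adding a mandatory phase makes a misconfigured run fail loudly before any work is done.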

The visual fidelity inspection was a separate concern from persona testing. Personas judge the application from a user's perspective: can I accomplish my goal? Visual inspection judges the application from a design perspective: are the colors accessible, is the text readable, are the elements aligned? A persona might pass because the feature works while the visual inspector flags that the button contrast ratio fails WCAG guidelines. Both assessments are necessary. Neither subsumes the other.
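As an illustration of the kind of check a visual inspector can make mechanically, here is the WCAG 2.x contrast-ratio formula in a few lines. This is a sketch of the published formula, not the inspector used in this project; the 4.5:1 cutoff is the WCAG AA level for normal-size text.

```python
def _linearize(channel: int) -> float:
    """Convert one 8-bit sRGB channel to linear light."""
    c = channel / 255
    # Older WCAG text uses 0.03928 here; for 8-bit colors the result is identical.
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb) -> float:
    """Relative luminance of an (r, g, b) color with 0-255 channels."""
    r, g, b = (_linearize(value) for value in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background) -> float:
    """Contrast ratio between two colors, from 1.0 (identical) to 21.0 (black/white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa_normal_text(foreground, background) -> bool:
    """WCAG AA requires at least 4.5:1 for normal-size text."""
    return contrast_ratio(foreground, background) >= 4.5
```

A button with light gray text on a white background can be fully functional, so a persona completes her task, while `passes_aa_normal_text` still returns `False`.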

The break-on-critical protocol was a practical concession to efficiency. Run the personas sequentially, not in parallel. If Maria Chen hits a critical bug — page fails to render, data doesn't load, authentication breaks — stop immediately. Fix it. Redeploy. Resume testing from where you stopped. Do not let Sarah and Tom discover the same critical bug independently. They will each report it differently, and you will triage three versions of the same problem.
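The protocol can be sketched as a sequential runner that stops at the first critical finding and reports where to resume. The function name, the `run_one` callback, and the `"critical"` report key are assumptions made for this example.

```python
def run_personas(personas, run_one, start_at=0):
    """Run personas in order and stop at the first critical finding.

    Returns (reports, resume_index). resume_index is None when every persona
    finished; otherwise it is the index to resume from after the fix is deployed.
    """
    reports = []
    for index in range(start_at, len(personas)):
        report = run_one(personas[index])
        reports.append(report)
        if report["critical"]:
            # Stop here so later personas do not rediscover the same bug.
            return reports, index
    return reports, None
```

After the fix is redeployed, calling `run_personas(personas, run_one, start_at=resume_index)` re-runs the persona that hit the bug and then continues with the rest.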

## What Persona Testing Is Not

It is not unit testing. Unit tests verify that your functions produce correct output for given input. Persona tests do not care about your functions. They care about whether a water utility director can find the rate comparison feature and produce something her council will approve.

It is not E2E testing. E2E tests verify that a defined user flow — login, navigate, create, save — works end to end. Persona tests do not follow defined flows. They follow the persona's intuition about where things should be. Sarah's flow is different from Tom's flow is different from Maria's flow, and none of them are the flow you designed.

It is not usability testing. Usability testing requires humans, takes weeks to schedule, and produces qualitative feedback that resists automation. Persona testing runs on every deploy, takes minutes, and produces structured JSON with readiness percentages and actionable items. It is worse than human usability testing at capturing nuance and better at catching regressions, which makes it complementary rather than competitive.

It is not a replacement for talking to actual users. The personas are approximations. They encode my assumptions about what a water utility director needs, filtered through an AI model's interpretation of those assumptions, expressed through browser automation's limited interaction vocabulary. Every layer adds distortion. But the distortion is consistent and automated, which means it catches the things that would otherwise slip through the gap between "works on my machine" and "works for Sarah."
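The structured JSON mentioned in the usability-testing comparison might look something like the output of the sketch below. Every field name here is hypothetical, and the readiness formula (points earned over points possible) is an assumption; the source only says that reports carry readiness percentages and actionable items.

```python
import json


def build_report(persona: str, criteria_scores: dict, findings: list) -> dict:
    """Assemble one persona's report.

    criteria_scores maps a criterion name to (points_earned, points_possible).
    findings is a list of dicts with at least a "category" key.
    """
    earned = sum(points for points, _ in criteria_scores.values())
    possible = sum(total for _, total in criteria_scores.values())
    readiness = round(100 * earned / possible, 1) if possible else 0.0
    return {
        "persona": persona,
        "readiness_percent": readiness,
        # Only fixable findings are treated as immediately actionable.
        "actionable_items": [f for f in findings if f["category"] == "fixable"],
        "findings": findings,
    }


if __name__ == "__main__":
    report = build_report(
        "Sarah Martinez",
        {"Onboarding Clarity": (3, 5), "Terminology": (4, 5)},
        [{"summary": "DSCR calculator undiscoverable", "category": "fixable"}],
    )
    print(json.dumps(report, indent=2))
```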

## The Persona Definition

A persona is a YAML file. The format was a deliberate choice: markdown agent definitions worked inside Claude Code, but YAML is portable across AI providers and easier to edit without understanding the agent framework.

```yaml
name: Sarah Martinez
title: Water Utility Director
experience: 15 years in utility management
technical_comfort: moderate

context: |
  Sarah needs to prepare a rate increase proposal for her city council.
  She has 30 days to build a convincing financial case.

goals:
  - Create a financial scenario for next fiscal year
  - Generate charts showing revenue projections
  - Export a presentation for council members

evaluation_criteria:
  - name: Onboarding Clarity
    weight: critical
    question: Did the app guide me without training?

  - name: Terminology
    weight: high
    question: Did the words make sense to me?

voice: |
  Speak as Sarah would: practical, time-pressured, not technical.
  "I don't have time to figure this out - show me the numbers."
```

The `voice` field is the one that surprised me. When the AI model adopts Sarah's voice, it doesn't just evaluate differently — it navigates differently. A persona with Sarah's impatient, time-pressured voice clicks faster, skips tooltips, and reports frustration where a patient persona would report a suggestion. The voice shapes the testing behavior, not just the report prose. This is a feature, not a bug: real users have temperaments, and those temperaments affect how they experience your software.

Two more fields, not shown in the excerpt above, do operational work. The `known_routes` field prevents false positives by telling the persona where existing features live. The `critical_bug_protocol` field defines what stops the test versus what gets noted and continued past. Both evolved from operational problems: personas reporting features as missing when they existed at different URLs, and personas spending thirty minutes documenting a blank page when they should have stopped immediately.
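To make the role of `voice` and `known_routes` concrete, here is a sketch of how a parsed persona might be turned into agent instructions, and how a false positive could be fed back as a route clarification. It assumes the YAML has already been loaded into a dictionary and that `known_routes` is a plain list of paths; both function names and the prompt wording are invented for illustration.

```python
def build_instructions(persona: dict) -> str:
    """Render a persona dictionary as plain-text instructions for the agent."""
    lines = [
        f"You are {persona['name']}, {persona['title']}.",
        persona["context"].strip(),
        "Goals:",
        *[f"- {goal}" for goal in persona["goals"]],
        "Voice:",
        persona["voice"].strip(),
    ]
    routes = persona.get("known_routes", [])
    if routes:
        lines.append("Known routes (these features exist; do not report them missing):")
        lines.extend(f"- {route}" for route in routes)
    return "\n".join(lines)


def add_route_clarification(persona: dict, route: str) -> dict:
    """Return a copy of the persona with one more known route; the input is unchanged."""
    routes = list(persona.get("known_routes", []))
    if route not in routes:
        routes.append(route)
    return {**persona, "known_routes": routes}
```

With this shape, the false positive described earlier becomes a one-line change: add `/scenarios/{id}/debt` to the persona before the next run instead of touching the application.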

## Why Open Source

When we researched the landscape during the pivot from a revenue pipeline to open-source projects, the finding was clear: no open-source framework combines persona definitions, AI-driven navigation, structured classification, and an iterative feedback loop into a single reusable system. Plenty of tools automate browser interactions. Plenty of frameworks define test scenarios. A few projects use AI models to drive exploratory testing. None package the full workflow: define a persona in YAML, point it at your app, classify findings as fixable or tradeoff, feed false positives back as context, and iterate until quality thresholds are met. That gap is what persona-probe fills.

The framework is extracted and generalized from the three water director personas that tested a specific financial SaaS application. The application-specific details — particular routes, particular features, particular financial terminology — are stripped. What remains is the schema, the report format, the classification system, and the iterative loop that ties them together.

Repository: [github.com/AshitaOrbis/persona-probe](https://github.com/AshitaOrbis/persona-probe)

The three water director personas found more UX issues in automated runs than months of manual testing found before them. Not because the issues were hidden — they were obvious the moment someone with Sarah's priorities tried to use the software. The test suite didn't see them because the test suite wasn't Sarah. The framework makes it possible to be Sarah, automatically, on every deploy. That is what was missing.