Test Your Chatbot with the Tester

The Tester is a portal page that lets you probe an agent's chatbot interactively, lock down expected behavior as assertions, save test cases as fixtures, replay them later to catch regressions, and run multi-turn scripted personas — all without writing any code.

It complements (does not replace) the persona-testing tools described in Test Personas: personas you define via the MCP h-chat-define-persona tool are reusable as scripted runs inside the Tester, and the Tester is the most convenient surface for interactive iteration.

When to use what

	Tester (this page)	Chat tab	Test Personas (MCP)
Goal	Iterate on routing, prompts, personas; lock behavior with fixtures	Hold a real chat as a user	Dry-run personas with Claude Code
Cost	One LLM call per probe / one per turn per scripted run	One per turn	Free (Claude Code's own model)
Saves history	No carry-over: every probe is a fresh session, though each one is still recorded as a chat in your history	Yes; chats persist	No
Best for	Authoring + regression suite	End-user demos	Free pre-flight checks during design

Start in the Tester. Promote to a live persona test when behavior looks right. Use the Chat tab when you need to demo the bot.

Prerequisites

An agent in an org you are a direct member of.
The agent has gone through the Create Chatbot Agent wizard (or was scaffolded some other way with the same layout), so its system memory contains the conversations:* / prompts:* structure.
The agent has an LLM provider configured (Settings → LLM provider on the agent detail page). Without one, the Tester is read-only and links you straight to that setting.

Opening the Tester

Open the agent detail page at /app/agents/<agentId>.
Switch to the Chatbot Control tab.
In the "Conversations" section header, click Open Tester → (next to "Open conversation editor →").

You land on /app/agents/<agentId>/test.

Probe a single message

The simplest loop: send one message, see the bot's reply plus a structured route trace.

Type a message in the Test message box (e.g. hello).
The Persona tag field is metadata-only on single-message probes — it appears in the route trace but does not change how the bot routes the message (a single HTTP call can't override the bot's persona routing). To actually exercise a defined persona, use the Scripted persona section further down the page; see Test Personas for defining the persona itself.
Leave Assertion on None (probe only).
Click Run.

You see:

The bot's Reply (rendered as markdown).
A Route trace card showing the conversation, the stage, the persona tag you supplied (if any), and any off-track signals the platform reported.

Each run starts a fresh chat session.

Add an assertion

The Tester supports three assertion modes against the bot's reply. Each adds a pass/fail badge next to the reply.

Literal contains

The reply must include a specified substring (case-sensitive).

Value: the substring (e.g. welcome).
Pass when: reply.includes(value) is true.

Use for behaviors you can pin to a single keyword or phrase ("the reply must mention 'photosynthesis'", "the reply must say 'sorry'").
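A minimal TypeScript sketch of the literal check, assuming the reply arrives as a plain string. `literalContains` and the `Verdict` shape are hypothetical names used for illustration, not the Tester's own code; the point it shows is that `String.prototype.includes` is case-sensitive, so `Welcome` does not satisfy `welcome`.

```typescript
// Hypothetical result shape for one assertion.
type Verdict = { pass: boolean; detail: string };

// Literal contains: a plain, case-sensitive substring test, as in reply.includes(value).
function literalContains(reply: string, value: string): Verdict {
  const pass = reply.includes(value);
  return {
    pass,
    detail: pass ? `reply contains "${value}"` : `reply does not contain "${value}"`,
  };
}

console.log(literalContains("welcome to the course!", "welcome").pass); // true
console.log(literalContains("Welcome to the course!", "welcome").pass); // false: capital W
```

If you need a case-insensitive keyword match, the Regex mode with the `i` flag covers it.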

Regex

The reply must match a JavaScript regular expression.

Pattern: the regex source (e.g. ^(Hi|Hello)\b).
Flags: standard regex flags (e.g. i for case-insensitive).

Malformed patterns are caught client-side before the request fires — you get an inline error, no wasted LLM call.
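The two-step flow (validate before sending, test after the reply arrives) can be sketched with the standard `RegExp` constructor, which throws a `SyntaxError` for a malformed pattern or an unknown flag. `validateRegex` and `regexPasses` are hypothetical helper names; this illustrates the behavior described above rather than reproducing the Tester's implementation.

```typescript
// Step 1, before any request: returns an error message, or null if the pattern compiles.
function validateRegex(pattern: string, flags: string): string | null {
  try {
    new RegExp(pattern, flags); // throws SyntaxError on a bad pattern or flag
    return null;
  } catch (e) {
    return (e as Error).message;
  }
}

// Step 2, after the reply arrives. A freshly built RegExp has lastIndex 0,
// so a single .test() call scans the whole reply even if "g" is set.
function regexPasses(reply: string, pattern: string, flags: string): boolean {
  return new RegExp(pattern, flags).test(reply);
}

const pattern = "^(Hi|Hello)\\b"; // the regex ^(Hi|Hello)\b, escaped for a TS string
console.log(validateRegex(pattern, "i"));              // null
console.log(regexPasses("hello there", pattern, "i")); // true
console.log(validateRegex("^(Hi|Hello", "") !== null); // true: unterminated group
```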

LLM-judge

A second LLM evaluates the reply against your criterion and returns a 0..1 score. The Tester compares that score against your threshold.

Criterion: a one-sentence rule, e.g. "The reply refuses the request politely and offers an alternative."
Threshold: a number between 0 and 1 (default 0.7). Pass when score >= threshold.

The judge uses the agent's own LLM. You see the score, the threshold the verdict was decided against, and a one-line rationale from the judge under the verdict badge.

If the judge returns malformed output (a model that wraps JSON in prose, for example), the verdict surfaces as a Fail with a "judge unreachable …" detail — never a silent pass.
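A sketch of deriving a verdict from raw judge output so that malformed output can never pass silently. It assumes, hypothetically, that the judge replies with a bare JSON object such as `{"score": 0.82, "rationale": "Declines politely."}`; the real field names and the exact failure wording are not documented on this page, so `judgeVerdict` and its detail strings are illustrative.

```typescript
interface JudgeVerdict {
  pass: boolean;
  score: number | null; // null when the judge output was unusable
  threshold: number;
  detail: string;
}

function judgeVerdict(rawJudgeOutput: string, threshold = 0.7): JudgeVerdict {
  let parsed: unknown;
  try {
    // Strict parse: prose wrapped around the JSON makes JSON.parse throw.
    parsed = JSON.parse(rawJudgeOutput);
  } catch {
    return { pass: false, score: null, threshold, detail: "judge unreachable: output was not JSON" };
  }
  const obj: Record<string, unknown> =
    typeof parsed === "object" && parsed !== null ? (parsed as Record<string, unknown>) : {};
  const score = obj.score;
  if (typeof score !== "number" || !Number.isFinite(score) || score < 0 || score > 1) {
    return { pass: false, score: null, threshold, detail: "judge unreachable: missing or out-of-range score" };
  }
  return {
    pass: score >= threshold, // inclusive, as described above
    score,
    threshold,
    detail: typeof obj.rationale === "string" ? obj.rationale : "",
  };
}

console.log(judgeVerdict('{"score": 0.82, "rationale": "Declines politely."}').pass); // true
console.log(judgeVerdict('Here is my verdict: {"score": 0.9}').pass);                 // false
```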

Save a fixture

Once a probe + assertion produces the verdict you want, lock it down:

With a result on screen, scroll to the Save as fixture row.
Enter a short, descriptive label (≤ 80 chars; e.g. welcome-greeting).
Click Save as fixture.

The fixture appears in the Saved fixtures list above. Internally it's stored as a system-type node at fixtures:<id> under the agent's system memory — it follows the agent's soft-delete lifecycle, so a soft-deleted agent's fixtures vanish at the same time.
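To make the parts of a fixture concrete, here is a hypothetical TypeScript data model: the saved test message, one of the three assertion modes, and an optional baseline for the drift report. The field names, `validateLabel`, and `fixtureNodeKey` are assumptions for illustration; only the 80-character label limit and the `fixtures:<id>` location come from this page.

```typescript
// Hypothetical data model; field names are illustrative.
type Assertion =
  | { mode: "literal"; value: string }
  | { mode: "regex"; pattern: string; flags: string }
  | { mode: "judge"; criterion: string; threshold: number };

interface Baseline {
  reply: string;
  conversation: string;
  stage: string;
  pass: boolean;
}

interface Fixture {
  id: string;
  label: string;        // short and descriptive, at most 80 characters
  message: string;      // the test message replayed on every run
  assertion: Assertion;
  baseline?: Baseline;  // optional: the drift report has a no-baseline state
}

const MAX_LABEL_LENGTH = 80;

// Returns an error message for an unusable label, or null if it is acceptable.
// Note: .length counts UTF-16 code units, which may differ from the Tester's own rule.
function validateLabel(label: string): string | null {
  const trimmed = label.trim();
  if (trimmed.length === 0) return "label is required";
  if (trimmed.length > MAX_LABEL_LENGTH) {
    return `label is ${trimmed.length} characters; the limit is ${MAX_LABEL_LENGTH}`;
  }
  return null;
}

// Fixtures are stored as system-type nodes at fixtures:<id> under the agent's system memory.
function fixtureNodeKey(fixture: Fixture): string {
  return `fixtures:${fixture.id}`;
}

const example: Fixture = {
  id: "f1", // hypothetical id
  label: "welcome-greeting",
  message: "hello",
  assertion: { mode: "literal", value: "welcome" },
};
console.log(validateLabel(example.label)); // null
console.log(fixtureNodeKey(example));      // fixtures:f1
```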

Replay a fixture

Click Run next to a fixture. The Tester:

Starts a fresh chat session against the agent.
Sends the saved test message.
Evaluates the saved assertion against the new reply.
Compares the new reply + verdict against the captured baseline.

You see a Drift report with one row per fixture:

Reply — unchanged or changed.
Route — unchanged or changed (e.g. the bot now lands on a different conversation).
Verdict — one of still-passing, still-failing, newly-passing, newly-failing, or no-baseline (the fixture had never been run).

Drift rows with a changed reply or a new verdict get an expandable Show new reply section, so you can inspect the actual text, and an Accept new baseline action that overwrites the saved baseline when the new behavior is the one you want.
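The replay comparison reduces to a small pure function. This sketch uses hypothetical field names, treats the route as the conversation plus the stage, and compares replies by exact string equality; those are assumptions, and the Tester's actual comparison rules may differ.

```typescript
// Hypothetical snapshot of one run: the values a drift row compares.
interface RunSnapshot {
  reply: string;
  conversation: string;
  stage: string;
  pass: boolean;
}

type VerdictDrift = "still-passing" | "still-failing" | "newly-passing" | "newly-failing" | "no-baseline";

interface DriftRow {
  reply: "unchanged" | "changed" | null; // null: nothing to compare against
  route: "unchanged" | "changed" | null;
  verdict: VerdictDrift;
}

function classifyVerdict(before: boolean, after: boolean): VerdictDrift {
  if (before && after) return "still-passing";
  if (!before && !after) return "still-failing";
  return after ? "newly-passing" : "newly-failing";
}

function compareToBaseline(baseline: RunSnapshot | undefined, current: RunSnapshot): DriftRow {
  if (!baseline) return { reply: null, route: null, verdict: "no-baseline" };
  // Assumption: exact equality, so any rewording counts as a change.
  const reply = baseline.reply === current.reply ? "unchanged" : "changed";
  // Assumption: the route is the conversation plus the stage the bot landed on.
  const sameRoute = baseline.conversation === current.conversation && baseline.stage === current.stage;
  return {
    reply,
    route: sameRoute ? "unchanged" : "changed",
    verdict: classifyVerdict(baseline.pass, current.pass),
  };
}

// Accept new baseline: the latest run simply replaces the stored baseline.
function acceptNewBaseline(current: RunSnapshot): RunSnapshot {
  return { ...current };
}
```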

Run all fixtures (regression sweep)

Click Run all on the Saved fixtures section. Fixtures run sequentially.

If any of your saved fixtures use the LLM-judge assertion mode, the Tester computes an estimated cost up front. If the estimate exceeds the per-run budget, a confirmation dialog shows the estimate against the cap; confirm to proceed. (The cap is a small default calibrated to a modest token spend. It cannot be changed yet; raising it depends on org-level configuration, which a future release is expected to wire up.)

The summary at the top of the drift report shows N passed / M failed. Use this as your "did the prompt change break anything" gate before merging changes to the agent's configuration.
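A sketch of the sweep under a hypothetical flat cost model: each LLM-judge fixture is priced at `costPerJudgeCall`, and the dialog appears only when at least one judge fixture exists and the estimate exceeds `cap`. The callbacks `runOne` and `confirm` stand in for the Tester's replay and dialog; these names and the cost model are assumptions, not the Tester's real estimator.

```typescript
type Mode = "literal" | "regex" | "judge";
interface SweepFixture { label: string; mode: Mode }

// Hypothetical cost model: one judge call per LLM-judge fixture at a flat price.
function estimateJudgeCost(fixtures: SweepFixture[], costPerJudgeCall: number): number {
  return fixtures.filter((f) => f.mode === "judge").length * costPerJudgeCall;
}

function needsConfirmation(fixtures: SweepFixture[], costPerJudgeCall: number, cap: number): boolean {
  const hasJudge = fixtures.some((f) => f.mode === "judge");
  return hasJudge && estimateJudgeCost(fixtures, costPerJudgeCall) > cap;
}

function summarize(results: boolean[]): string {
  const passed = results.filter(Boolean).length;
  return `${passed} passed / ${results.length - passed} failed`;
}

async function runAll(
  fixtures: SweepFixture[],
  runOne: (f: SweepFixture) => Promise<boolean>,                // hypothetical: replays one fixture
  confirm: (estimate: number, cap: number) => Promise<boolean>, // hypothetical: the cost dialog
  costPerJudgeCall: number,
  cap: number,
): Promise<string | null> {
  if (needsConfirmation(fixtures, costPerJudgeCall, cap)) {
    const proceed = await confirm(estimateJudgeCost(fixtures, costPerJudgeCall), cap);
    if (!proceed) return null; // cancelled: nothing ran
  }
  const results: boolean[] = [];
  for (const f of fixtures) {
    results.push(await runOne(f)); // sequential: each replay finishes before the next starts
  }
  return summarize(results);
}
```

Running fixtures one at a time keeps the drift report in a stable order and avoids firing many LLM calls at once.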

Run a scripted persona (multi-turn)

For flows that take several turns to complete (an onboarding conversation, a refund dialog, a multi-question survey), use the Scripted persona section at the bottom of the Tester page.

Expand the Scripted persona (multi-turn) card.
Persona name: a handle you defined via h-chat-define-persona (see Test Personas).
Opening message: the persona's first user message.
Follow-up messages: each subsequent user message in order. Click "+ Add follow-up" to add rows, "✕" to remove one.
Expected conversations (optional, comma-separated): if every named conversation is visited, the run passes; otherwise it fails with an issue listed.
Click Run script.

The transcript renders as alternating User / Assistant turns. Each assistant turn shows any stage-transition badge (e.g. → conversations:fallback:no-match (conversations:fallback)) and off-track flags. The summary at the top shows pass / fail and the list of conversations the bot actually visited.

Scripted-persona runs are not saveable as fixtures in this release — fixtures stay single-turn until a future spec extends them.
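The pass/fail rule for Expected conversations can be sketched as a set-membership check over the conversations the bot visited. The helper and conversation names below are hypothetical, and the sketch covers only this one rule. An empty field passes vacuously here; that is an assumption, since this page does not say how an empty field is treated.

```typescript
// Split the comma-separated field, trimming whitespace and dropping blanks.
function parseExpectedConversations(field: string): string[] {
  return field
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
}

interface ScriptCheck {
  pass: boolean;
  issues: string[];
}

// Passes only if every expected conversation appears among the visited ones.
function checkScriptedRun(expectedField: string, visited: string[]): ScriptCheck {
  const seen = new Set(visited);
  const issues = parseExpectedConversations(expectedField)
    .filter((name) => !seen.has(name))
    .map((name) => `expected conversation was never visited: ${name}`);
  return { pass: issues.length === 0, issues };
}

const visited = ["conversations:onboarding", "conversations:fallback"]; // hypothetical names
console.log(checkScriptedRun("conversations:onboarding, conversations:refund", visited).pass); // false
```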

What test runs do and do not write

The Tester drives the real chat platform. In v1, only one cross-cutting side effect is actually suppressed: the platform's AgentSubscription license grant. Running a test does not silently grant you a license to interact with the agent.

Test runs DO:

Create a real chat in your history under this agent (visible in the myChats list on the agent's Chat tab).
Write message rows (your test message and the bot's reply) to your per-agent personal memory.
Update the chat's route history and stage state on every turn.
Run the agent's extraction spec, so any facts the agent extracts from your test message land in your memory just as they would in a real chat.

Test runs DO NOT:

Grant an AgentSubscription row (the platform license-grant).

If you've been probing the same agent heavily, expect a list of test chats in your history. Cleaner test-only isolation is planned — see the design discussion tester-vs-production-memory-boundary in hadron-concept — but is intentionally out of v1 scope. For now, treat the Tester as "a real chat with a Test label," not as a sandbox.

Troubleshooting

"Agent not found or no longer available" — the agent is either soft-deleted or owned by an org you're not a direct member of. Subscription / marketplace access is out of scope for this surface.

"LLM provider not configured" — open the agent's detail page, Settings tab, configure a provider, then come back.

Persona run returns 500 with "persona not found" — define the persona first via h-chat-define-persona (see Test Personas §Defining a persona).

Judge always returns "judge unreachable" — the LLM is returning non-JSON output. Tighten your criterion wording so the model has less room to philosophize; the system prompt instructs it to return JSON only.

Cost-cap dialog keeps appearing — your saved-fixture corpus has enough LLM-judge entries to exceed the per-run budget. Either convert some to literal/regex assertions where possible, or click through the dialog to proceed.

See also

Test Personas — how to define personas and what they're used for elsewhere in Hadron.
Portal Agent Chat — Manual Test Checklist — the broader manual smoke test for the agent chat surface.
Building a Chatbot Agent — how to scaffold the agent the Tester runs against.