>_ Analyst Engineering

Acceptance Criteria for AI Systems: Testing the Non-Deterministic

Cover for a guide on writing acceptance criteria for AI systems and LLM features whose outputs are non-deterministic.

AI outputs are non-deterministic, so you cannot write acceptance criteria as exact-match assertions. You write them as properties that must always hold, guardrails that must never be crossed, and quality thresholds measured over an evaluation set. You test the distribution, not one answer.

To write acceptance criteria for an AI or large language model feature, you stop asserting a single expected output and start asserting bounds: the properties every valid output must hold, the guardrails it must never cross, the aggregate quality it must hit across a representative evaluation set, and the deterministic behavior of the system wrapped around the model. The reason is simple and it changes everything: the same input can produce different outputs, so the classic “given input X, assert output Y” acceptance criterion is meaningless. The analyst who keeps writing exact-match criteria for AI features is writing criteria that can never pass reliably, and the team will quietly stop trusting them.

This is a relatively new frontier for analysts, and it is squarely a QA analyst and functional analyst problem: how do you define done for something that does not give the same answer twice. The good news is that the discipline transfers. The same precision you bring to a payment rejection flow applies here; you just point it at properties and distributions instead of fixed values. The technical fluency to specify and test these systems is part of the modern analyst toolkit, and the prompt-level skills are in The Technical BA Prompt Toolkit.

Why exact-match acceptance criteria fail for AI

Exact-match criteria fail because AI outputs are non-deterministic: identical inputs yield different but often equally valid outputs. A criterion that asserts one specific answer will pass on one run and fail on the next, not because the system is broken but because it was never going to produce a fixed string.

Take a feature that summarizes a customer complaint for an analyst. Ask it to summarize the same complaint twice and you get two different summaries, both potentially correct. “Given this complaint, assert the summary equals this exact text” is a criterion that cannot hold. Worse, even pinning randomness to zero does not fully save you, because a prompt tweak, a model version bump, or a slightly different input phrasing all shift the output. The output space is large and fluid, and your criteria have to describe acceptable regions of that space rather than single points.

So the mental shift is from equality to invariance. Instead of “the output is exactly this,” you ask “what must be true of every acceptable output, what must never be true, and how good must the outputs be on average.” Those three questions map to properties, guardrails, and quality thresholds, and together with the deterministic wrapper they form a complete set of acceptance criteria for an AI feature. This is the same move from intent to precise behavior I describe for ordinary systems in user story vs specification, adapted to a probabilistic world.

What should AI acceptance criteria measure?

AI acceptance criteria should measure four distinct things: output properties, safety guardrails, aggregate quality, and the deterministic system around the model. Cover all four and you have specified done for an AI feature.

Output properties are invariants that must hold on every single run regardless of the exact words. For the summary feature: the output is valid for its purpose, stays under a length limit, contains no information absent from the source (no hallucinated facts), and is in the required language and format. These are testable on every output even though the output varies.

Guardrails are the lines the system must never cross. It must never reveal another customer’s data, never produce disallowed or offensive content, never output something that looks like financial advice if that is out of scope. Guardrail criteria are pass-or-fail and non-negotiable; a single crossing is a failure regardless of overall quality.

Aggregate quality is measured over an evaluation set, not a single input. You define a metric, accuracy, relevance, task success, and a threshold, for example “summaries are rated accurate on at least 95 percent of the evaluation set.” This is statistical acceptance, and it is the only honest way to express “good enough” for a non-deterministic system.

The deterministic wrapper is everything around the model that behaves normally: input validation, timeouts, retries, fallback when the model fails or returns low confidence, and logging. This part you test exactly as you would any software, and it is where a lot of the real reliability lives. Specifying these layers clearly is a functional analysis job, and the artifacts for capturing them are in Real-World BA Deliverables.

How do you build an evaluation set?

Build an evaluation set as a curated collection of representative inputs paired with expected properties or graded answers, then measure the feature’s quality across the whole set against a threshold. The evaluation set is to AI acceptance what the test suite is to ordinary software, and it is the single most important artifact.

A good evaluation set has three layers. Typical cases: the common inputs the feature will see, to measure everyday quality. Edge cases: the unusual but valid inputs, the very long complaint, the one in mixed languages, the one with almost no content, to find where quality degrades. Adversarial inputs: deliberate attempts to break the guardrails, prompt-injection attempts, requests for disallowed content, inputs designed to provoke a hallucination. Each entry records what a good output must satisfy, whether that is a property to check automatically or a human grade.

Once you have the set, your acceptance criterion becomes runnable: “across the evaluation set, accuracy is at least 95 percent, no guardrail is crossed, and all output properties hold.” And the set keeps paying off, because it doubles as a regression suite. When someone changes the prompt or upgrades the model, you re-run the set and immediately see whether quality dropped, which is the AI equivalent of catching a regression before release. This is the same happy-path-then-break-it test design I apply to systems in You Don’t Understand the System Until You Test It, pointed at a model.

How do you handle the things that will go wrong?

Specify the failure behavior explicitly, because a non-deterministic system has failure modes a deterministic one does not: it can be confidently wrong, it can produce a low-confidence output, and it can be slow or unavailable. The acceptance criteria for these are about the wrapper, and they are deterministic and fully testable.

Define what happens on low confidence: does the system fall back to a human, to a default response, or to a safe refusal. Define the timeout and what the user sees when the model is slow. Define how a flagged or guardrail-violating output is handled before it ever reaches a user. Define what gets logged so a bad output can be investigated, because in a regulated context you will be asked to explain a specific answer. In payments or any regulated domain, “the model decided” is not an acceptable answer to an auditor, so the criteria must require traceability: the input, the model version, the prompt, and the output all recorded.

These failure-handling criteria are where AI features become production-grade rather than demos. The model is the exciting part, but the deterministic scaffolding around it is what makes it safe to ship, and specifying that scaffolding is exactly the kind of precise, technical business analysis that separates an analyst who can ship AI features from one who can only prototype them. The broader skills to reason about these systems end to end are mapped in The Technical Skills Guide for BAs.

The takeaway

You cannot write acceptance criteria for an AI system as exact-match assertions, because the outputs are non-deterministic. Write them instead as four layers: properties every output must hold, guardrails it must never cross, aggregate quality measured over an evaluation set against a threshold, and the deterministic wrapper of validation, fallback, and logging that you test normally. You are specifying acceptable regions of an output space, not a single answer.

Build the evaluation set early, treat it as your regression suite, and specify the failure behavior as carefully as the happy path. Start with The Technical BA Prompt Toolkit and The Technical Skills Guide for BAs, or browse everything at The Tech BA Toolkit.

Ahmed is a Senior Technical Business Analyst with 10+ years in banking and payments. He builds practical guides and tools for analysts at The Tech BA Toolkit.

Tags: Business Analysis, Artificial Intelligence, Software Testing, Requirements, LLM

Newsletter

Subscribe

Practical, no-fluff playbooks for technical analysts who analyze, code, test, and support. New articles straight to your inbox.

No spam. Unsubscribe anytime.