Why do traditional acceptance criteria fail on AI projects?

Traditional acceptance criteria fail on AI projects because they assume a fixed input produces a fixed output, but AI and large language model outputs are non-deterministic, the same input can produce different valid outputs. An exact-match criterion that asserts one specific expected output will pass on one run and fail on the next, not because the system is broken but because it was never going to produce a fixed result.

How should acceptance criteria work for AI features?

Acceptance criteria for AI features should assert properties and bounds rather than exact matches: the properties every valid output must hold (format, grounding, no disallowed content), the guardrails it must never cross, the aggregate quality measured across an evaluation set against a threshold, and the deterministic wrapper around the model (validation, fallback, logging) tested normally. You test the distribution of outputs, not a single answer.

What is an evaluation set for AI testing?

An evaluation set is a curated collection of representative inputs with expected properties or graded answers, used to measure an AI feature's quality at scale. You run the whole set and measure metrics like accuracy or task success against a threshold, rather than testing one input. It includes typical cases, edge cases, and adversarial inputs, and doubles as a regression suite when the prompt or model changes.

Can you test AI systems at all if they are non-deterministic?

Yes. You test non-deterministic systems by asserting invariants and distributions instead of exact outputs: properties that must always hold, guardrails that must never be crossed, and aggregate quality over a representative evaluation set above a threshold. You also wrap the model in deterministic logic, validation, fallbacks, logging, that you test conventionally. The model's variability is handled statistically, while the surrounding system is tested precisely.

What is the biggest mistake teams make testing AI features?

The biggest mistake is applying deterministic, exact-match acceptance criteria to non-deterministic output, which produces criteria that cannot reliably pass and erode trust in the testing. The fix is to shift to property-based and statistical criteria measured over an evaluation set, and to invest early in that evaluation set, which becomes the foundation for both acceptance and regression.

Why Acceptance Criteria Failed on an AI Project

On the AI project where our acceptance criteria failed, the criteria were not wrong so much as the wrong shape. We had written “given this input, the output is this,” for a system that never produced the same output twice. The fix was to stop testing answers and start testing properties.

The acceptance criteria failed on that AI project because we wrote them the way we write criteria for every other system, “given input X, assert output Y”, and the system was non-deterministic, so the same input produced a different, often equally valid, output every time. The criteria passed on the demo and failed on the next run, not because anything was broken but because an exact-match assertion can never reliably hold against output that varies by design. We were measuring a moving target with a ruler, and it took a few embarrassing red builds to understand that the problem was our criteria, not the model. This is the lesson that reshaped how I write acceptance criteria for AI systems, and it sits at the intersection of QA and functional analysis. The prompt-level skills that came out of it are in The Technical BA Prompt Toolkit.

Here is what went wrong, why, and what we replaced the broken criteria with, because the failure was a genuinely useful one.

The criteria that looked right and were not

The feature summarized customer complaints for analysts, and our acceptance criteria looked perfectly reasonable on paper: given this complaint, the summary should be this. We wrote a set of them, each pairing a sample input with an expected output, exactly as we would for any deterministic feature, and everyone nodded because the format was familiar and the examples were good. The trouble started the moment we ran them more than once.

The same complaint, summarized twice, produced two different summaries, both accurate, both fine, neither matching our “expected output” string. So the criterion failed, even though the feature worked. We tried tightening the model’s settings to reduce variability, and it helped a little, but a prompt tweak or a slightly different input phrasing still shifted the output enough to fail the exact-match check. We were in the absurd position of a feature that demonstrably worked and a test suite that was red, and the suite was wrong, not the feature. That is a deeply uncomfortable place to be, because it erodes trust in the testing itself, and once people stop trusting the tests they stop running them.

The root mistake was treating a probabilistic system as a deterministic one. Our entire toolkit for acceptance criteria assumed a fixed input maps to a fixed output, which is true for a payment validation and false for a language model. We had brought the right discipline, precise, testable criteria, to the wrong model of the system, and the precision worked against us, because we had precisely specified an output the system was never going to reproduce. Recognizing that the format itself was the problem, not the specific examples, was the turning point. It is the same realization as understanding that AI output lives in a distribution, not a point.

What we should have been asserting

Once we understood the failure, the fix followed: stop asserting the exact output and start asserting the properties every acceptable output must hold, the guardrails it must never cross, and the aggregate quality across many inputs. We shifted from “the summary equals this” to “the summary is true to the source, under the length limit, in the right language, contains no invented facts, and reveals no other customer’s data.”

Those property-based criteria held on every run, because they described what acceptable outputs have in common rather than pinning one specific output. A summary could vary in wording and still satisfy “true to the source, under the length limit, no hallucinated facts,” so the criteria passed when the feature worked and failed only when it genuinely misbehaved, which is exactly what acceptance criteria are supposed to do. The guardrails were the easiest to agree and the most important: there were things the output must never do, leak data, produce disallowed content, and a single violation was a failure regardless of overall quality. The guardrails were pass-or-fail and non-negotiable, which felt familiar and reassuring after the disorientation of the exact-match failures.

The harder shift was accepting that overall quality could only be measured statistically, not per-output. “Is this one summary good” is partly subjective and varies; “are the summaries good across a representative set of complaints” is measurable. So quality became an aggregate: a metric, accuracy or task success, measured across many inputs against a threshold, rather than a verdict on a single output. That move from per-case judgment to distributional measurement was the conceptual leap, and it is the same one that separates functional analysis of deterministic systems from AI systems, you specify acceptable regions of an output space, not a single point.

The evaluation set was the thing we were missing

The artifact that actually fixed the project was an evaluation set, a curated collection of representative inputs with defined expected properties or graded answers, which we had not built and should have built first. Once we had it, “good enough” became something we could measure: run the whole set, measure quality against the threshold, check no guardrail was crossed, and that was the acceptance criterion.

Building the set was real work, and it was the work that mattered. We gathered typical complaints to measure everyday quality, edge cases, the very long complaint, the one in mixed languages, the nearly empty one, to find where quality degraded, and adversarial inputs, deliberate attempts to make the model leak data or hallucinate, to test the guardrails. Each entry recorded what a good output had to satisfy, whether a property we could check automatically or a human grade. With the set in place, our acceptance criterion became runnable and meaningful in a way the exact-match criteria never were: across the evaluation set, accuracy at least the target, no guardrail crossed, all properties hold.

The set kept paying off after acceptance, because it doubled as a regression suite, which solved a problem we had not yet realized we had. When someone changed the prompt or we considered a new model version, we re-ran the evaluation set and immediately saw whether quality dropped, the AI equivalent of catching a regression before release. Without it, every prompt tweak was a leap of faith; with it, we had a measurement. I now consider the evaluation set the single most important artifact on an AI project, and not having built it early was the real root cause behind the failed criteria. It is the AI version of the test design discipline, typical cases, edge cases, and adversarial inputs, applied to a model.

What I do differently now

The lasting lesson is to start every AI feature by deciding how I will know it works, and that means properties, guardrails, an evaluation set, and a deterministic wrapper, before writing a single exact-match criterion. The failure taught me that the testing approach for AI has to be designed up front, not retrofitted when the familiar criteria start failing, because by then trust is already damaged.

So now I specify four layers from the start. The properties every output must hold, checkable on every run. The guardrails it must never cross, pass-or-fail. The aggregate quality over an evaluation set against a threshold, measured statistically. And the deterministic wrapper, the validation, fallback, timeout, and logging around the model, which I test conventionally because that part is deterministic and is where a lot of real reliability lives. That last layer is easy to forget amid the excitement of the model, but a low-confidence fallback and proper logging are what make an AI feature shippable rather than a demo, and they are specified and tested exactly like any functional requirement.

The deeper shift is mindset: an AI feature is not done when it produces a good output once, it is done when it reliably holds its properties, never crosses its guardrails, hits its quality bar across a representative set, and degrades safely when it fails. Specifying that is harder than writing “given X, assert Y,” but it is the only honest way to define done for something non-deterministic, and getting it right is what lets an analyst ship AI features rather than just prototype them. That capability, defining and testing AI behavior rigorously, is increasingly part of the technical business analyst role, and the practical prompting and AI skills behind it are in The Technical BA Prompt Toolkit and the broader foundation in The Technical Skills Guide for BAs.

The takeaway

The acceptance criteria failed on that AI project because we wrote exact-match criteria for non-deterministic output, so they passed once and failed on the next run while the feature actually worked. The fix was to stop asserting answers and start asserting properties every output must hold, guardrails it must never cross, and aggregate quality measured across an evaluation set against a threshold, plus a deterministic wrapper tested conventionally. The evaluation set, which we should have built first, was the artifact that made acceptance and regression measurable.

The failure was worth it, because it rebuilt how I approach AI testing from the ground up: decide how you will know it works before you write a criterion. Start with The Technical BA Prompt Toolkit and The Technical Skills Guide for BAs, or browse everything at The Tech BA Toolkit.

Ahmed is a Senior Technical Business Analyst with 10+ years in banking and payments. He builds practical guides and tools for analysts at The Tech BA Toolkit.

Tags: Artificial Intelligence, Software Testing, Requirements, QA, LLM