TL;DR
V1 Step 4, Define Done, is the step where you write the acceptance criteria for the AI's output before you write the prompt. It is the contract between the thinking and the output. Most AI evaluation problems are actually undefined-done problems. Write the criteria first, and the same artifact becomes your production eval set.
- Define Done is the contract between thinking and output.
- Good criteria are specific, measurable, agreed, and written.
- The criteria become the eval set when the workflow goes live.
- Undefined-done is the root of most AI evaluation problems.
What Define Done is (and isn't)
Define Done is the fourth step of the V1 Framework. It is the work of writing the acceptance criteria for the AI's output before you write the prompt. It is the contract between the thinking and the output.
What Define Done is:
- A written list of what a good output looks like.
- Specific enough that two reviewers would agree on pass or fail.
- Owned by the workflow operator, negotiated with the engineer.
- Drafted before the prompt is written.
What Define Done is not:
- "It should be good." Not criteria. A wish.
- "It should sound like our brand." Closer, but still soft. Brand is a constraint, not an acceptance criterion. Define what about the output makes it brand-true: tone, sentence length, banned phrases, three good examples.
- A retrospective rationalization of what the AI happened to produce. If you wrote the criteria after seeing the output, you wrote the wrong criteria.
Define Done is the step that converts the upstream V1 work (Strip, Decompose, Constrain) into a target the AI can aim at and a target the humans can score against.
Why undefined-done kills AI work
If you walk into a stalled AI program and trace the failure backward, you will end up at undefined-done almost every time. The team built the workflow, ran the prompt, looked at the output, and could not agree on whether it was good.
That disagreement is not a model problem. It is not a prompt problem. It is a definition problem. The team never wrote down what success looks like, so every reviewer is scoring against a private rubric in their head. Three reviewers, three rubrics, three different verdicts. The program stalls.
Undefined-done shows up in three ways:
- The output is "almost there" forever. Every revision is closer, but no one can say what would make it land. The team iterates indefinitely because there is no finish line.
- The output is approved inconsistently. Reviewer A approves it, reviewer B rejects it, reviewer C asks for changes. The criteria are not shared.
- The output is approved, then disowned in production. It passed the pilot, but six weeks in, the team realizes the criteria they were using did not match the criteria the business actually needed.
All three failure modes share a root cause. The work of Step 4 was never done. The team skipped to Step 5, wrote a prompt, and started arguing about output.
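One rough way to detect the second symptom is to have several reviewers score the same outputs independently and count how often they all agree. The sketch below is only an illustration: the function name, the reviewer labels, and the verdicts are hypothetical, and it assumes every reviewer scored the same outputs in the same order.

```python
def unanimous_rate(verdicts: dict[str, list[bool]]) -> float:
    """Share of outputs on which every reviewer gave the same verdict."""
    # zip(*...) turns per-reviewer lists into one tuple of verdicts per output.
    per_output = list(zip(*verdicts.values()))
    if not per_output:
        raise ValueError("need at least one reviewed output")
    unanimous = sum(len(set(votes)) == 1 for votes in per_output)
    return unanimous / len(per_output)


# Hypothetical pass/fail verdicts from three reviewers on four outputs.
verdicts = {
    "reviewer_a": [True, True, False, True],
    "reviewer_b": [True, False, False, True],
    "reviewer_c": [True, True, False, False],
}
rate = unanimous_rate(verdicts)
print(rate)  # 0.5: reviewers agree on only two of the four outputs
```

A low rate on outputs the team has already reviewed suggests the reviewers are scoring against different rubrics, which is a definition problem and not a model problem.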
Undefined-done is the root of most AI evaluation problems. The model is doing what you asked. You asked badly.
The four properties of good AI acceptance criteria
Good acceptance criteria have four properties. The shorthand is SMAW: specific, measurable, agreed, written. If any of the four is missing, the criteria will not survive contact with production.
Specific
Specific means a reader can imagine the output the criterion is describing. "Sounds professional" is not specific. "No exclamation points, no second-person address, sentences under twenty words, three approved openers" is specific. The test is whether two engineers, given only the criterion, would build the same target.
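A criterion that is specific in this sense can usually be turned into a check. The following is a minimal sketch of the four-part example above, not a definitive implementation: the function name, the list of second-person words, the naive sentence splitter, and the three openers are all assumptions made for illustration.

```python
import re

# Assumed word list for "second-person address"; a real rubric would define its own.
SECOND_PERSON = re.compile(r"\b(you|your|yours|yourself)\b", re.IGNORECASE)


def check_professional(text: str, approved_openers: list[str]) -> dict[str, bool]:
    """Score one output against the four checks named in the example criterion."""
    # Naive sentence split on terminal punctuation; enough for a sketch.
    sentences = [s for s in re.split(r"[.!?]+\s*", text.strip()) if s]
    return {
        "no_exclamation_points": "!" not in text,
        "no_second_person": SECOND_PERSON.search(text) is None,
        "sentences_under_20_words": all(len(s.split()) < 20 for s in sentences),
        "approved_opener": any(text.startswith(o) for o in approved_openers),
    }


# Hypothetical approved openers.
openers = [
    "We have received the request.",
    "Thanks for the message.",
    "The team has reviewed the request.",
]
result = check_professional(
    "We have received the request. The refund is being reviewed.", openers
)
print(result)  # every check is True for this output
```

Two engineers handed this criterion would likely write very similar checks, which is the point of the test described above.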
Measurable
Measurable means there is a way to score the output against the criterion that does not require the original author to be in the room. Binary criteria are best (present or not present, allowed or not allowed). Numeric criteria work when the dimension is genuinely continuous. Qualitative criteria need three labeled examples to anchor the rubric.
Agreed
Agreed means the workflow owner and the engineer have both signed off on the criteria before the prompt is written. Not "I sent it to her." Agreed. If the criteria show up for the first time in a review meeting, you are not running V1. You are running theater.
Written
Written means in a document the team can point to. Not in a Slack thread. Not in a calendar invite. Not in someone's head. The criteria live in the same document as the rest of the V1 work for the workflow. They get versioned when they change. They survive the team member leaving.
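If the criteria live in a repository alongside the workflow, the "agreed" and "versioned" properties can be made explicit in a small record. This is a sketch under stated assumptions: the class, its field names, the two role labels, and the rule that a revision clears earlier sign-offs are all hypothetical choices, not part of the V1 Framework as described here.

```python
from dataclasses import dataclass, replace

REQUIRED_SIGN_OFFS = {"workflow_owner", "engineer"}  # assumed role labels


@dataclass(frozen=True)
class CriteriaDoc:
    workflow: str
    version: int
    criteria: tuple[str, ...]
    signed_off_by: frozenset[str] = frozenset()

    def sign_off(self, role: str) -> "CriteriaDoc":
        """Return a copy with one more sign-off recorded."""
        return replace(self, signed_off_by=self.signed_off_by | {role})

    def revise(self, criteria: tuple[str, ...]) -> "CriteriaDoc":
        """Changing the criteria bumps the version and (by assumption) clears sign-offs."""
        return replace(
            self,
            version=self.version + 1,
            criteria=criteria,
            signed_off_by=frozenset(),
        )

    def is_agreed(self) -> bool:
        return REQUIRED_SIGN_OFFS <= self.signed_off_by


v1 = CriteriaDoc("refund-drafts", 1, ("The draft is under 120 words.",))
v1 = v1.sign_off("workflow_owner").sign_off("engineer")
v2 = v1.revise(v1.criteria + ("The draft contains no exclamation points.",))
print(v1.is_agreed(), v2.version, v2.is_agreed())  # True 2 False
```

The immutable record means an older version of the criteria is never edited in place, so the team can always see which version a given output was scored against.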
The four properties feel pedantic in week one. They earn their weight in week six, when the workflow needs to be evaluated, scaled, or handed off, and the criteria still hold.
Define Done as the eval set for production
Here is the part most teams miss. The acceptance criteria you write in Step 4 are not just for the v1 launch. They are the seed of the production eval set.
An eval set is the artifact against which AI output is scored in production on an ongoing basis. The criteria are the rubric. The early labeled examples are the test cases. The agreed thresholds are the pass-fail line. The same document that gates the v1 launch gates every release after it.
The implication is significant. If you write the criteria well in Step 4, you do not have to build a separate evaluation harness later. You extend the one you already have. The team that started with three example outputs and a written rubric ends up, six months in, with three hundred labeled outputs and a continuous eval running against every deployed change.
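As a rough illustration of how the same rubric can gate a release, the sketch below scores a batch of outputs against named checks and compares the pass rate with an agreed threshold. The function names, the two example checks, the sample outputs, and the 0.75 threshold are assumptions chosen for the example, not values from any particular eval tool.

```python
from typing import Callable

Rubric = dict[str, Callable[[str], bool]]


def passes(output: str, rubric: Rubric) -> bool:
    """An output passes only if it satisfies every criterion in the rubric."""
    return all(check(output) for check in rubric.values())


def release_gate(outputs: list[str], rubric: Rubric, threshold: float) -> tuple[float, bool]:
    """Return the pass rate and whether it meets the agreed threshold."""
    if not outputs:
        raise ValueError("need at least one output to evaluate")
    pass_rate = sum(passes(o, rubric) for o in outputs) / len(outputs)
    return pass_rate, pass_rate >= threshold


# Hypothetical rubric with two binary criteria.
rubric: Rubric = {
    "under_120_words": lambda draft: len(draft.split()) < 120,
    "no_exclamation_points": lambda draft: "!" not in draft,
}
outputs = [
    "Hi Dana, the refund has been issued.",
    "Hi Sam, could you confirm the order number?",
    "Hi Lee, great news!",
    "Hi Ana, this request has been escalated to a senior agent.",
]
pass_rate, ok = release_gate(outputs, rubric, threshold=0.75)
print(pass_rate, ok)  # 0.75 True
```

Growing from three examples to three hundred means adding outputs and checks to the same structure, not building a second harness.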
The team that skipped Step 4 starts over when the workflow needs evaluation. They build an eval harness from scratch, often after the workflow has already shipped, and they spend the next quarter retrofitting criteria to outputs the system has already produced. That is harder, slower, and almost always produces criteria that ratify what the system does instead of disciplining what it should do.
The eval set is downstream of Define Done. Write the criteria in Step 4, and the launch gate and the seed of the production eval come from the same piece of work. See Production AI vs AI Demos for more on this distinction.
A worked example
Take a real workflow. A consumer brand wants an AI to draft the first version of customer-service responses to refund requests. Step 1 stripped the problem to the actual job: produce a draft response a CX agent can review and send in under fifteen seconds. Step 2 decomposed the workflow into intent classification, draft generation, and policy lookup. Step 3 constrained the AI's authority: it drafts, never sends, never approves refunds above a threshold, escalates anything ambiguous.
Now Step 4. What does done look like for a draft response?
A first draft of the acceptance criteria:
- The draft addresses the customer by name in the opening line.
- The draft acknowledges the specific issue the customer raised, in the customer's own words where possible.
- The draft does not commit the company to a refund amount or timeline that exceeds policy.
- The draft proposes one specific next step (issue refund, ask one clarifying question, escalate to senior agent).
- The draft is under 120 words.
- The draft addresses the customer in the second person, uses no exclamation points, and contains no apologies that admit fault on behalf of the company.
- The draft passes a brand-voice rubric, anchored by three approved example responses.
That is the contract. The engineer writes the prompt against that contract. The CX lead reviews against the same contract. The agent who uses the draft in production has a clear sense of what to expect. The eval set, six months later, is the same seven criteria, scored against a labeled set of three hundred drafts.
Notice what the criteria do not say. They do not say "the draft is helpful." Helpful is not measurable. They do not say "the draft sounds human." Human is not specific. The criteria define the observable properties of a good draft and trust the upstream V1 work to handle the rest.
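Several of the criteria above are observable enough to check automatically. The sketch below covers five of them and is only illustrative: it assumes the draft generator returns a structured result with hypothetical `next_step` and `committed_refund` fields, and the step labels and the refund limit are invented for the example. The remaining criteria (acknowledging the issue, the apology rule, the brand-voice rubric) would still need a human reviewer or a rubric anchored by the approved examples.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed labels for the three next steps listed in the criteria.
ALLOWED_NEXT_STEPS = {"issue_refund", "ask_clarifying_question", "escalate"}


@dataclass
class Draft:
    text: str
    next_step: str                            # hypothetical structured field
    committed_refund: Optional[float] = None  # None if no amount is promised


def score_draft(draft: Draft, customer_name: str, refund_limit: float) -> dict[str, bool]:
    """Score one draft against the automatically checkable criteria."""
    lines = draft.text.strip().splitlines()
    opening_line = lines[0] if lines else ""
    return {
        "addresses_customer_by_name": customer_name in opening_line,
        "within_refund_policy": (
            draft.committed_refund is None or draft.committed_refund <= refund_limit
        ),
        "one_allowed_next_step": draft.next_step in ALLOWED_NEXT_STEPS,
        "under_120_words": len(draft.text.split()) < 120,
        "no_exclamation_points": "!" not in draft.text,
    }


draft = Draft(
    text="Hi Dana,\nYour order arrived damaged, as you described. "
         "We can refund $40 to your original payment method.",
    next_step="issue_refund",
    committed_refund=40.0,
)
scores = score_draft(draft, customer_name="Dana", refund_limit=50.0)
print(scores)  # every check is True for this draft
```

Each key in the result maps to one line of the contract, so a failed draft points to the criterion it missed.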
Where Define Done fits in V1
Step 4 sits between Step 3 (Constrain) and Step 5 (Instruct). The order is not negotiable.
Constrain defines the boundaries of what the AI can do. Define Done defines what a good output looks like inside those boundaries. Instruct is the prompt that delivers the constraints and the criteria to the model.
If you skip Step 4 and go straight to Step 5, the prompt will be vague in a specific way: it will describe the task without describing the standard. The AI will produce output that meets the description but not the standard, and the team will spend the next two weeks arguing about it.
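One way to keep the standard from dropping out of the prompt is to assemble the prompt from the written artifacts. The sketch below is a hypothetical helper, not a prescribed V1 template: the function name, the section titles, and the choice to refuse to build a prompt without criteria are assumptions.

```python
def build_prompt(task: str, constraints: list[str], criteria: list[str]) -> str:
    """Assemble a prompt that states the task, the boundaries, and the standard."""
    if not criteria:
        # Without criteria the prompt would describe the task but not the standard.
        raise ValueError("no acceptance criteria: Define Done has not been completed")
    sections = [
        ("Task", [task]),
        ("Constraints", constraints),
        ("Acceptance criteria", criteria),
    ]
    blocks = []
    for title, items in sections:
        body = "\n".join(f"- {item}" for item in items)
        blocks.append(f"{title}:\n{body}")
    return "\n\n".join(blocks)


prompt = build_prompt(
    task="Draft a response to the customer's refund request.",
    constraints=["Draft only; never send.", "Escalate anything ambiguous."],
    criteria=["The draft is under 120 words.", "The draft proposes one specific next step."],
)
print(prompt)
```

Because the criteria are passed in from the same document the reviewers score against, the model and the humans are aiming at the same target.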
For the step before, see V1 Step 3: Constrain. For the step after, see V1 Step 5: Instruct. For the methodology overview, see The V1 Framework: An Introduction to Building with AI.
If you cannot describe what done looks like, you are not ready to write the prompt. That is useful information, not a failure.
The bottom line
Define Done is the contract between the thinking and the output. It is specific, measurable, agreed, and written. It is owned by the operator who runs the workflow, drafted before the prompt, and reused as the production eval set.
Most AI evaluation problems are not model problems. They are undefined-done problems. Spend an hour writing the criteria up front, and you save a quarter of arguing about output downstream. The same artifact gates the v1 launch and gates every release after it. Write it once. Maintain it forever. It is the highest-leverage hour in the V1 process, and it is the one teams skip most often. For how to anchor an AI program in measurable outcomes, see Measuring ROI on AI Initiatives.
FAQ
What is V1 Step 4?
V1 Step 4 is Define Done, the step in the V1 Framework where you write the acceptance criteria for the AI's output before you write the prompt. It is the contract between the thinking and the output, and it doubles as the eval set in production.
How specific should AI acceptance criteria be?
Acceptance criteria should be specific enough that two reviewers reading the same output would agree on whether it passed or failed. If reviewers disagree, the criteria are too vague. Good criteria are written, measurable, and agreed before the prompt is written.
Who decides what done looks like for AI work?
The owner of the workflow the AI is operating inside decides what done looks like. Not the AI engineer. Not the prompt writer. The person who owns the business outcome the AI is supporting. The criteria are negotiated with the engineer but owned by the operator.
Can you define done after building the AI workflow?
You can, but it produces worse results. Defining done after building means the criteria get bent to fit what the AI happens to produce. Defining done before building forces the AI to aim at the right target and gives the team a clean line for pass or fail.
How does Define Done relate to AI eval sets?
Define Done is the seed of the eval set. The acceptance criteria become the rubric. The example outputs become the labeled cases. The agreed thresholds become the pass-fail line. The same artifact serves the v1 launch and the ongoing production evaluation.
What if I cannot articulate what done looks like?
If you cannot articulate done, you are not ready to build. That is useful information. The right next step is more decomposition, more conversation with the workflow owner, or producing three example outputs that the team can rank and discuss until the pattern becomes writable.