TL;DR
AI agent workflows replace a sequence of human decisions with model-driven reasoning over tools. The first agent worth building is high-volume, judgment-light, and produces output a human can check in seconds. Scope tight, build an eval set before the agent, ship behind a human review pattern, and invest in observability from day one. The agents that fail in production are the ones with no kill switch.
- An agent is not a chatbot. It plans, acts, and decides.
- The three workflow patterns are linear, branching, and multi-agent.
- Pick a first agent that is repeatable and verifiable.
- Eval set first. Build second. Ship third.
- Every production agent needs a kill switch and a dashboard.
What an AI agent actually is
An AI agent is a program that uses a large language model as its reasoning core to plan, call tools, and take multi-step actions toward a defined goal. The three words that matter are plan, act, and decide.
Compare three things that often get conflated:
- A chatbot takes a message and returns a message. It does not take actions. It does not call your inventory system or open a ticket.
- A script or automation takes an input and runs a fixed sequence of steps written by a human. It does the same thing every time.
- An agent takes a goal, decides what step to take next, calls tools, observes the result, and decides again. The control flow lives inside the model.
That distinction matters because it changes what can go wrong. A script fails predictably. An agent can fail in ways the original designer did not anticipate, because the agent is making decisions in places the designer did not. This is why production agents require a different operating discipline than automation: eval sets, kill switches, observability, and a human review pattern.
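The decide, act, observe loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production implementation: `fake_model`, `lookup_order`, `TOOLS`, and the `max_steps` cap are hypothetical names, and the model is stubbed with a plain function so the sketch runs without any API.

```python
from typing import Callable

# Hypothetical tool: a real agent would query an order system here.
def lookup_order(order_id: str) -> str:
    return f"order {order_id}: shipped"

TOOLS: dict[str, Callable[[str], str]] = {"lookup_order": lookup_order}

def fake_model(goal: str, history: list[dict]) -> dict:
    """Stand-in for an LLM call: picks the next step from what has happened so far."""
    if not history:
        return {"action": "lookup_order", "input": "A123"}
    last = history[-1]["observation"]
    return {"action": "finish", "output": f"{goal} -> {last}"}

def run_agent(goal: str, model=fake_model, max_steps: int = 5):
    history: list[dict] = []
    for _ in range(max_steps):
        decision = model(goal, history)            # decide
        if decision["action"] == "finish":
            return decision["output"], history
        tool = TOOLS[decision["action"]]           # act
        observation = tool(decision["input"])
        history.append({**decision, "observation": observation})  # observe
    # The step cap keeps a confused model from looping forever.
    raise RuntimeError("step budget exhausted without a final answer")

answer, steps = run_agent("Where is order A123?")
```

The point of the sketch is where the control flow sits: `run_agent` never hard-codes which tool runs next. It asks the model each time, which is exactly what makes the failure modes harder to predict than a script's.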
The three workflow patterns
Almost every production agent I have shipped or advised on falls into one of three patterns. Knowing which one you are building changes the scoping, the tooling, and the failure modes.
1. Linear workflow
The agent runs through a sequence of steps in order, with the model deciding the content of each step but not the order. Think: read a document, classify it, extract fields, write a summary, post to a system. The control flow is pinned. The flexibility is inside each step.
Linear is the right pattern for a first agent. It is the easiest to test, the easiest to debug, and the cheapest to run. If you can express the workflow as a flowchart with no branches, you have a linear agent.
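A linear workflow can be sketched as a fixed list of steps applied in order. The step functions below are hypothetical stand-ins: in a real agent each one would call the model, and here they use simple string rules so the example runs on its own.

```python
from typing import Callable

Step = Callable[[dict], dict]

def classify(state: dict) -> dict:
    # Stand-in for a model call that labels the document.
    is_invoice = "invoice" in state["document"].lower()
    return {**state, "category": "invoice" if is_invoice else "other"}

def extract_fields(state: dict) -> dict:
    # Stand-in for model-driven extraction: take the first dollar amount.
    amounts = [w for w in state["document"].split() if w.startswith("$")]
    return {**state, "amount": amounts[0] if amounts else None}

def summarize(state: dict) -> dict:
    return {**state, "summary": f"{state['category']} for {state['amount']}"}

# The order is pinned in code. The model never chooses what runs next.
PIPELINE: list[Step] = [classify, extract_fields, summarize]

def run_linear(document: str, steps: list[Step] = PIPELINE) -> dict:
    state = {"document": document}
    for step in steps:
        state = step(state)
    return state

result = run_linear("Invoice 881 total $42.50 due Friday")
```

Because every step takes and returns the same state dictionary, each one can be tested in isolation, which is a large part of why this pattern is the easiest to debug.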
2. Branching workflow
The agent reads input, decides which path to take, and runs a different sequence depending on the decision. Think: triage a support ticket, route to the right specialist agent or to a human based on category, urgency, and sentiment.
Branching adds power and adds risk. Each branch needs its own eval. Each routing decision is a decision the model can get wrong. Most first agents do not need branching. Add it when the linear version is in production and the next constraint is the variety of inputs it has to handle.
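The routing step in a triage workflow might look like the sketch below. It assumes the model has already labeled the ticket with `category`, `urgency`, and `sentiment`. The label values and the destination names (`billing_agent`, `shipping_agent`, `human`) are hypothetical.

```python
def route_ticket(ticket: dict) -> str:
    """Pick a destination from model-produced labels on the ticket."""
    # Risky tickets go to a person regardless of category.
    if ticket["urgency"] == "high" or ticket["sentiment"] == "angry":
        return "human"
    if ticket["category"] == "billing":
        return "billing_agent"
    if ticket["category"] == "shipping":
        return "shipping_agent"
    # Unknown categories fall back to a human rather than a guess.
    return "human"

examples = [
    {"category": "billing", "urgency": "low", "sentiment": "neutral"},
    {"category": "shipping", "urgency": "high", "sentiment": "neutral"},
    {"category": "warranty", "urgency": "low", "sentiment": "neutral"},
]
routes = [route_ticket(t) for t in examples]
```

Keeping the routing rule in plain code while the model supplies only the labels is one possible design choice. It makes each branch easy to cover with its own eval cases, because a wrong route traces back to a wrong label.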
3. Multi-agent workflow
Multiple agents collaborate, each with its own role, tools, and prompts. A coordinator agent assigns work to specialist agents and aggregates results. This is the pattern that gets the most marketing attention and is, in my experience, the wrong pattern for almost any first agent.
Multi-agent is harder to debug, more expensive, and accumulates failure modes that linear and branching workflows do not have. Build it only when you have linear agents in production, a clear reason a single agent is insufficient, and the observability to track which agent did what.
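For a sense of what "track which agent did what" means in practice, here is a deliberately small coordinator sketch. The `Coordinator` class, the specialist functions, and the audit-log fields are all hypothetical, and the specialists are plain functions standing in for model-backed agents.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Coordinator:
    specialists: dict[str, Callable[[str], str]]   # role name -> specialist
    audit_log: list = field(default_factory=list)

    def run(self, tasks: list[tuple[str, str]]) -> list[str]:
        results = []
        for role, task in tasks:
            result = self.specialists[role](task)
            # Record attribution for every hand-off so failures can be traced.
            self.audit_log.append({"agent": role, "task": task, "result": result})
            results.append(result)
        return results

def researcher(task: str) -> str:
    return f"notes on {task}"

def writer(task: str) -> str:
    return f"draft about {task}"

coordinator = Coordinator({"researcher": researcher, "writer": writer})
outputs = coordinator.run([("researcher", "creator fit"), ("writer", "outreach message")])
```

Even in this toy version, the audit log is the only way to tell which specialist produced a bad result, which hints at why the pattern demands more observability than a single linear agent.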
The first agent should be linear. Multi-agent is what you graduate to after shipping three linear agents that work.
The first agent worth building
The first agent should have four properties. Skip any of them and you are buying a debugging quarter.
- High volume. The workflow runs at least dozens of times a day, ideally hundreds. Volume gives you signal, learning, and ROI.
- Judgment-light. The decisions involved are small, frequent, and rules-adjacent, not nuanced or ethically loaded. Save the hard decisions for humans for now.
- Defined output. The output has a clear shape: a JSON object, a structured email, a categorized ticket. "Write something useful" is not a defined output.
- Verifiable in seconds. A human can look at the output and tell quickly whether it is right. If verification takes longer than the original task, the agent is not saving anything.
Good first agents at consumer brands:
- Inbox triage and first-draft reply for low-stakes customer questions (human reviews before send).
- Invoice line-item extraction and routing.
- Influencer outreach research: pull profile, score fit, draft message for human approval.
- Content brief generation from a product spec and a target audience.
- Returns categorization and reason coding from open-text feedback.
- Lead qualification from a form submission plus enrichment data.
Bad first agents: anything autonomous on the customer surface, anything that touches money without controls, anything where a wrong answer is hard to detect. Those will come later. They are not the first one.
Tooling decisions for a first agent
A first agent is a small system. Resist the urge to over-tool. The decisions that matter are model selection (covered in how to pick the right LLM), the tools you give the agent, and whether you use a framework.
Tools should be:
- Few. Three to seven tools at most for a first agent. More tools means more decisions, which means more places to fail.
- Specific. Each tool does one thing, well-named, with a clear input and output schema.
- Idempotent where possible. Running the same tool call twice should not cause harm. This makes retries safe.
- Logged. Every tool call is captured in the dashboard with input, output, latency, and cost.
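The tool properties above can be sketched as a thin wrapper around each call. This is one possible shape, not a reference design: `Tool`, `ToolRunner`, the `idempotency_key` parameter, and the flat `cost_usd` per call are assumptions made for illustration, and the log is an in-memory list where a real system would write to the dashboard's store.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Tool:
    name: str
    func: Callable[[dict], dict]
    required_fields: tuple
    cost_usd: float = 0.0  # assumed flat cost per real execution

@dataclass
class ToolRunner:
    tools: dict
    log: list = field(default_factory=list)
    completed: dict = field(default_factory=dict)  # (tool, key) -> output

    def call(self, name: str, args: dict, idempotency_key: str) -> dict:
        tool = self.tools[name]
        missing = [f for f in tool.required_fields if f not in args]
        if missing:
            raise ValueError(f"{name} missing fields: {missing}")
        cache_key = (name, idempotency_key)
        replayed = cache_key in self.completed
        start = time.perf_counter()
        if replayed:
            output = self.completed[cache_key]   # retry: reuse the earlier result
        else:
            output = tool.func(args)
            self.completed[cache_key] = output
        self.log.append({
            "tool": name, "input": args, "output": output,
            "latency_s": time.perf_counter() - start,
            "cost_usd": 0.0 if replayed else tool.cost_usd,
            "replayed": replayed,
        })
        return output

tickets = []

def create_ticket(args: dict) -> dict:
    tickets.append(args["subject"])
    return {"ticket_id": len(tickets)}

runner = ToolRunner({"create_ticket": Tool("create_ticket", create_ticket, ("subject",), 0.002)})
first = runner.call("create_ticket", {"subject": "Late delivery"}, idempotency_key="task-17")
retry = runner.call("create_ticket", {"subject": "Late delivery"}, idempotency_key="task-17")
```

The retry with the same key returns the stored result instead of opening a second ticket, and both attempts still appear in the log.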
On frameworks: for a first agent, use whatever ships fastest for your team. Sometimes that is a framework. Sometimes that is two hundred lines of code calling the model directly. Frameworks shine on multi-agent orchestration, retry logic, and tool-routing. They also add abstraction that obscures debugging when something breaks. Decide based on your team's existing skills, not on what is trending.
The eval set for agents
An eval set for an agent is different from an eval set for a single prompt. You are not just evaluating the output. You are evaluating the trajectory: did the agent pick the right tools, in the right order, with the right inputs, to produce the right output?
A serviceable agent eval set has:
- 30 to 100 real scenarios pulled from your actual workflow.
- For each scenario: the input, the expected tool calls (or a permitted set of correct trajectories), the expected final output, and a grading rubric.
- A scoring system that captures both end-state correctness and step-level correctness. An agent that gets the right answer through the wrong steps is unstable in production.
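One way to represent a scenario and score it on both dimensions is sketched below. The `Scenario` fields and the exact-match comparisons are assumptions chosen to keep the example short. A real grading rubric is usually fuzzier than string equality.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    input: str
    allowed_trajectories: list   # each entry is an acceptable ordered list of tool names
    expected_output: str

def score(scenario: Scenario, actual_tools: list, actual_output: str) -> dict:
    end_state = actual_output == scenario.expected_output
    step_level = actual_tools in scenario.allowed_trajectories
    # A scenario passes only if the answer AND the path to it are acceptable.
    return {"end_state": end_state, "step_level": step_level,
            "passed": end_state and step_level}

refund = Scenario(
    name="refund request, item damaged",
    input="My blender arrived cracked, I want my money back.",
    allowed_trajectories=[["lookup_order", "categorize"],
                          ["categorize", "lookup_order"]],
    expected_output="category=damaged_item",
)

good = score(refund, ["lookup_order", "categorize"], "category=damaged_item")
lucky = score(refund, ["categorize"], "category=damaged_item")
```

The `lucky` run reaches the right answer while skipping a required tool call, so it scores as correct on end state but fails overall. That is the unstable case the step-level score exists to catch.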
Build the eval set before you build the agent. The set forces the team to define what "done" means before there is anything to defend, which is the core discipline of the V1 Framework. The eval set then survives every model upgrade and every prompt refactor.
The kill switch
Every production agent needs a kill switch and a documented escalation. The kill switch is the button (literal or metaphorical) the operating team can hit to take the agent offline immediately. The escalation is what happens when the agent encounters a scenario it should not handle.
A minimum-viable kill-switch package:
- A feature flag that disables the agent without a deploy.
- A confidence threshold below which the agent escalates to a human instead of acting.
- An allowlist of tool calls; anything outside the allowlist is blocked.
- Rate limits per user, per category, and per dollar amount where money is involved.
- A daily summary that surfaces unusual patterns to the operating owner.
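Most of the package above reduces to a check that runs before every tool call. The sketch below is a simplified illustration: the `Guardrails` class, its default values, and the tool names are hypothetical, the flag is a plain attribute where a real system would read it from a config or feature-flag service so it can change without a deploy, and the rate limit is a simple per-user counter rather than a time window.

```python
from dataclasses import dataclass, field

@dataclass
class Guardrails:
    agent_enabled: bool = True                # the kill switch itself
    min_confidence: float = 0.8               # below this, hand off to a human
    allowed_tools: frozenset = frozenset({"draft_reply", "categorize"})
    max_calls_per_user: int = 3
    calls_by_user: dict = field(default_factory=dict)

    def check(self, user_id: str, tool: str, confidence: float) -> str:
        if not self.agent_enabled:
            return "disabled"
        if tool not in self.allowed_tools:
            return "blocked"
        if confidence < self.min_confidence:
            return "escalate"
        used = self.calls_by_user.get(user_id, 0)
        if used >= self.max_calls_per_user:
            return "rate_limited"
        self.calls_by_user[user_id] = used + 1
        return "allow"

guard = Guardrails()
decisions = [
    guard.check("u1", "draft_reply", 0.93),
    guard.check("u1", "issue_refund", 0.99),   # not on the allowlist
    guard.check("u1", "draft_reply", 0.41),    # too uncertain to act
]
guard.agent_enabled = False                    # operator hits the switch
after_kill = guard.check("u1", "draft_reply", 0.93)
```

The flag is checked first, so flipping it stops every action regardless of how confident the agent is.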
The kill switch is what lets you ship the agent at all. Without it, the legal team is right to be nervous and the operating team is right to slow you down. With it, you can run faster because you can stop faster.
Production observability for agents
Agents need a different observability posture than ordinary services. The reason: failures in agents are often soft. The agent does not error. It just does the wrong thing, quietly, in a way that only shows up downstream.
The minimum observability stack for a production agent:
- Full-trace logging. Every prompt, tool call, model response, decision, and final output, stored for at least 90 days.
- Cost dashboard. Per-task and per-day spend, broken down by model and tool. Surfaces cost creep before it shows up on the bill.
- Success-rate tracking. Percent of tasks that produce a valid final output, computed daily.
- Escalation-rate tracking. Percent of tasks that hit the human review path or the kill-switch threshold.
- Regression checks. Re-run the eval set on every prompt change and every model upgrade. Block deploys on score regressions.
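The daily rates and the deploy gate in this list are simple to compute once traces are logged. The sketch below assumes each trace is a dictionary with hypothetical `valid_output`, `escalated`, and `cost_usd` fields, and that the eval set produces a single score between 0 and 1.

```python
def daily_metrics(traces: list[dict]) -> dict:
    total = len(traces)
    if total == 0:
        return {"success_rate": None, "escalation_rate": None, "cost_per_task": None}
    valid = sum(1 for t in traces if t["valid_output"])
    escalated = sum(1 for t in traces if t["escalated"])
    cost = sum(t["cost_usd"] for t in traces)
    return {
        "success_rate": valid / total,
        "escalation_rate": escalated / total,
        "cost_per_task": cost / total,
    }

def blocks_deploy(baseline_score: float, new_score: float, tolerance: float = 0.0) -> bool:
    """True when the eval score dropped by more than the allowed tolerance."""
    return new_score < baseline_score - tolerance

traces = [
    {"valid_output": True,  "escalated": False, "cost_usd": 0.25},
    {"valid_output": True,  "escalated": False, "cost_usd": 0.25},
    {"valid_output": True,  "escalated": True,  "cost_usd": 0.25},
    {"valid_output": False, "escalated": False, "cost_usd": 0.25},
]
today = daily_metrics(traces)
gate = blocks_deploy(baseline_score=0.90, new_score=0.84, tolerance=0.02)
```

Returning `None` for an empty day, rather than zero, keeps a day with no traffic from looking like a day where everything failed.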
The team that owns the agent owns this dashboard. If no one looks at it daily, the dashboard does not exist. This is the same principle that anchors a real AI transformation: activity is not progress, and operating cadence is what makes the program real.
An unobserved agent in production is a liability. An observed agent in production is a system.
The bottom line
Build a linear, high-volume, judgment-light, verifiable first agent. Scope it tight. Build the eval set before the agent. Ship behind a human review pattern. Invest in observability from day one. Add a kill switch before you turn it on.
The teams that ship a useful first agent in six weeks do this. The teams that spend six months on a multi-agent moonshot ship a deck. Pick the first one.
FAQ
What is an AI agent?
An AI agent is a program that uses a large language model as its reasoning core to plan, call tools, and take multi-step actions toward a goal. Unlike a chatbot, an agent acts. Unlike a script, an agent decides which step to take next based on the situation.
How is an AI agent different from automation?
Traditional automation follows a fixed sequence of rules. An AI agent makes decisions at each step using model reasoning, which lets it handle inputs the original designer did not anticipate. Agents trade predictability for flexibility, which is why production agents need eval sets and kill switches.
What should my first AI agent do?
The first agent worth building is high-volume, judgment-light, and produces a clearly defined output that a human can check fast. Good candidates: invoice processing, lead qualification, ticket triage, content briefs. Bad candidates: anything customer-facing without review, anything where a wrong output is hard to detect.
How long does it take to build a first AI agent?
A scoped first agent ships in two to six weeks. Scoping and eval-set construction take half of that time. The build itself is usually faster than the team expects. The integration and observability work takes longer than expected and is the part teams under-budget.
Should I use an agent framework?
For a first agent, use the simplest tool that ships. That can be a framework or a few hundred lines of code calling the model directly. Frameworks help with multi-agent orchestration and tool routing. They also add abstraction you may not need yet.
How do I monitor AI agents in production?
Log every step the agent takes: inputs, tool calls, outputs, decisions, latency, and cost. Aggregate into a dashboard your operating team checks daily. Alert on success rate, escalation rate, and cost per task. Treat the agent like a production system, because that is what it is.