You've built an agent that works in your terminal. It answers questions, calls the right tools, and produces sensible output. You demo it for the team. Everyone agrees it's ready to ship.
Three weeks into production, you're dealing with cascading retries that cost $400 in OpenAI credits, a state bug that corrupts user conversations, and an incident where a retry loop called a write API twelve times. The demo worked. Production is a different environment.
Here are the six things that separate a working demo from a production AI agent. None of them are about the model. All of them are about the chassis.
LLM APIs return transient errors. Rate limit errors (429), server overload errors (503), and network timeouts are normal operating conditions at scale. Your agent must handle them gracefully without hammering the API, without giving up prematurely, and without burning your budget on redundant retries.
The pattern: exponential backoff with jitter. After the first failure, wait 1–2 seconds. After the second, 2–4 seconds. After the third, 4–8 seconds. The jitter prevents a thundering herd: when all your agents retry at exactly the same moment, they compound the problem you're trying to solve.
```python
import random
import time

from openai import RateLimitError  # or the equivalent error type in your LLM SDK

def retry_with_backoff(fn, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; let the caller route it to the dead-letter queue
            # Exponential backoff with jitter: 1-2s, then 2-4s, then 4-8s.
            delay = base_delay * (2 ** attempt)
            time.sleep(random.uniform(delay, delay * 2))
```
In production, you also need a dead-letter queue for requests that exhaust all retries — somewhere to route failed runs so you can inspect and replay them later. Without it, failed agent runs disappear silently.
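A minimal sketch of that routing, assuming a Redis list as the queue and reusing retry_with_backoff from above; run_with_dlq and the agent:dlq key are illustrative names, not a specific library's API:

```python
import json
import time

import redis  # assumes redis-py and a reachable Redis instance

dlq = redis.Redis(host="localhost", port=6379)

def run_with_dlq(fn, payload, queue="agent:dlq"):
    try:
        return retry_with_backoff(lambda: fn(payload))
    except Exception as exc:
        # Retries exhausted: park the failed request for inspection and replay.
        dlq.rpush(queue, json.dumps({
            "payload": payload,
            "error": repr(exc),
            "failed_at": time.time(),
        }))
        raise
```

Replay is then a matter of popping entries back off the list with dlq.lpop and re-running them.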
Agent state includes conversation history, tool call results, and any context accumulated during a multi-step run. In development, storing this in memory works fine. In production, processes restart, containers restart, and nodes fail. Memory state disappears.
You need external state storage. The options are Redis (fast, ephemeral-by-default, needs persistence config), Postgres (reliable, slower for hot state), or a purpose-built agent state store. Whatever you choose, the agent runtime must be able to resume a partially-completed run from stored state — not just start over.
This matters especially for long-running agents. A research agent mid-way through a 20-step workflow shouldn't restart from step one because a pod was rescheduled.
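Here is roughly what that resume logic looks like, sketched against Redis; run_workflow and the run:{run_id} key scheme are invented for illustration, and step results are assumed to be JSON-serializable:

```python
import json

import redis  # assumes redis-py; a Postgres table works the same way

store = redis.Redis()

def run_workflow(run_id, steps):
    # Load the last checkpoint, if any, instead of restarting at step one.
    raw = store.get(f"run:{run_id}")
    state = json.loads(raw) if raw else {"next_step": 0, "results": []}

    for i in range(state["next_step"], len(steps)):
        state["results"].append(steps[i](state))
        state["next_step"] = i + 1
        # Checkpoint after every step: a pod reschedule loses at most one step.
        store.set(f"run:{run_id}", json.dumps(state))
    return state["results"]
```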
Every tool call is a potential failure point. The external API you're calling might return an unexpected schema. The input your model provides might be malformed. The service might be down. Your agent needs a structured way to handle all three cases.
Good tool call handling means: validating inputs against JSON Schema before calling, catching all exceptions and converting them to structured error responses the model can reason about, and never letting a tool failure cause an unhandled exception that crashes the agent run.
The model is remarkably good at recovering from tool errors if you tell it clearly what happened. A tool that returns {"error": "user_not_found", "message": "No user with id 12345"} is far better than one that raises an exception.
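A sketch of that pattern, assuming the jsonschema package for input validation; safe_tool_call is an illustrative wrapper, not a framework API:

```python
from jsonschema import ValidationError, validate  # assumes the jsonschema package

def safe_tool_call(tool_fn, args, schema):
    # Reject malformed model-provided arguments before touching the service.
    try:
        validate(instance=args, schema=schema)
    except ValidationError as e:
        return {"error": "invalid_arguments", "message": e.message}
    # Convert any downstream failure into a structured response the model
    # can reason about, rather than an exception that kills the run.
    try:
        return {"result": tool_fn(**args)}
    except Exception as e:
        return {"error": type(e).__name__, "message": str(e)}
```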
Your agent needs to call external services. Those services require API keys, JWTs, OAuth tokens, or other credentials. Where do those credentials live?
The wrong answer: hardcoded in the agent, passed as environment variables into the agent's prompt context, or logged as part of tool call inputs.
The right answer: injected at the chassis layer, before the agent runs. The agent receives a pre-authenticated client object. It never sees the credential, never logs it, and never makes unauthenticated calls by accident.
This also means credential rotation can happen without redeploying agent code. The credential lives in your secrets manager. The chassis fetches it. The agent doesn't care.
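A minimal sketch of that injection, with httpx standing in for any HTTP client; fetch_secret, CRM_API_TOKEN, and crm.example.com are hypothetical stand-ins for your secrets manager and downstream service:

```python
import os

import httpx  # any HTTP client works; httpx is just for illustration

def fetch_secret(name):
    # Stand-in for your secrets manager (Vault, AWS Secrets Manager, ...).
    return os.environ[name]

def build_crm_client():
    # The chassis resolves the credential and bakes it into the client.
    token = fetch_secret("CRM_API_TOKEN")
    return httpx.Client(
        base_url="https://crm.example.com/api",
        headers={"Authorization": f"Bearer {token}"},
    )

# The agent receives the client object and never sees the token itself.
```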
External rate limits (from OpenAI, from your own APIs) are necessary but not sufficient. You need rate limiting at the agent level too, for two reasons:
First, you want to protect your own infrastructure and budget. An agent with a bug that loops can consume unlimited tokens if there's no ceiling. A per-agent token budget that triggers an alert (or hard stop) at 10x expected consumption is a safety net for runaway loops.
Second, you want per-user fairness. In a multi-tenant deployment, one user shouldn't be able to exhaust your LLM budget and starve other users.
Rate limits at the chassis layer, separate from your business logic, mean you can tune them without touching agent code.
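A sketch of the per-agent budget described above; TokenBudget and its 10x default are illustrative, not a specific library's API:

```python
import logging

logger = logging.getLogger("agent.budget")

class BudgetExceededError(RuntimeError):
    pass

class TokenBudget:
    """Per-run token ceiling: warn past expected spend, hard-stop at a multiple."""

    def __init__(self, expected_tokens, hard_stop_multiplier=10):
        self.used = 0
        self.expected = expected_tokens
        self.ceiling = expected_tokens * hard_stop_multiplier

    def record(self, tokens):
        self.used += tokens
        if self.used > self.expected:
            logger.warning("over expected budget: %d/%d tokens", self.used, self.expected)
        if self.used > self.ceiling:
            # A looping agent hits this wall instead of burning tokens indefinitely.
            raise BudgetExceededError(f"hard stop at {self.used} tokens")
```

The chassis calls record after every model response; with the OpenAI SDK, for example, the count comes from response.usage.total_tokens.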
When something goes wrong — and it will — you need to answer questions like: What did the model see? What tool calls did it make? What were the outputs? How many retries happened? How long did each step take?
Print statements don't answer these questions. You need structured logging where every event — tool call, model request, retry, error — emits a JSON event with consistent fields: agent ID, run ID, step number, timestamp, duration, and the inputs/outputs of the operation.
With structured logs, a five-minute incident becomes a thirty-second query. Without them, you're reading megabytes of free-form text hoping to find the relevant line.
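A minimal sketch of such an emitter; the field names and the support-bot example values are illustrative:

```python
import json
import sys
import time

def emit_event(event, agent_id, run_id, step, duration_ms=None, **fields):
    # One JSON object per line, same fields every time: queryable, not grep-able.
    record = {
        "event": event,
        "agent_id": agent_id,
        "run_id": run_id,
        "step": step,
        "ts": time.time(),
        "duration_ms": duration_ms,
        **fields,
    }
    print(json.dumps(record), file=sys.stdout)

emit_event("tool_call", agent_id="support-bot", run_id="run-8f2c",
           step=3, duration_ms=212, tool="lookup_user",
           output={"error": "user_not_found"})
```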
Notice that none of these six things are about what your agent does. They're all about the infrastructure around what your agent does. That's the chassis — the frame every agent needs regardless of its purpose.
Building it once, properly, and reusing it across every agent you ship is the difference between a team that spends its time on agent capability and a team that spends its time maintaining agent infrastructure.
Agent Chassis ships all six of these as a single install. Get the framework →