Tool calls fail in production. This is not a bug in your agent. It's the normal operating environment. The question isn't whether your agent will encounter tool failures — it's whether it handles them gracefully or crashes.
Most naive agent implementations treat tool failures as exceptions that propagate up and kill the run. The right design treats them as values the agent can reason about and respond to.
When a tool call fails, the model needs to know about it to make a decision: try again, try a different tool, ask the user for clarification, or give up with an informative message. That decision requires information — specifically, what failed and why.
An exception that propagates to the top of the call stack gives the model none of that information. The run dies. The user sees an error. Whatever context had been accumulated in the conversation is lost.
A structured error response — a JSON object with an error code, human-readable message, and optional retry guidance — is something the model can reason about. The model has been trained on error handling patterns. Given a clear error message, it often tries the right fallback on its own.
Transient failures: network timeouts, rate limit errors, temporary server unavailability. These are temporary conditions — the same call will likely succeed if retried after a short backoff. The right response is: catch, wait, retry.
Examples: HTTP 429, HTTP 503, connection timeout, DNS failure.
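The catch-wait-retry response can be sketched as a small helper with exponential backoff. This is a minimal sketch: `TransientError` and the delay values are illustrative, not a real library's API.

```python
import time

class TransientError(Exception):
    """Stand-in for rate limits, timeouts, and 5xx responses."""

def retry_with_backoff(fn, max_attempts=3, base_delay=1.0):
    """Retry a callable on transient errors, doubling the delay each time."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure to the caller
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

The exponential schedule matters for rate limits in particular: a fixed short delay tends to hit the same rate window again, while doubling backs the agent off until the window clears.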
Input errors: the tool was called with malformed or invalid input. The same call will fail again no matter how many times you retry it. The right response is to return a structured error that tells the model what was wrong with the input.
Examples: invalid schema, missing required field, type mismatch, value out of range.
Permanent failures: the resource doesn't exist, the user lacks permission, or the action is impossible given current state. Neither retrying nor fixing the input will help. The right response is to return a clear error and let the model decide how to proceed.
Examples: 404 not found, 403 forbidden, resource deleted, logical impossibility.
Here's the pattern that works:
```python
def call_tool_safely(tool_fn, **kwargs):
    """Run a tool call and convert failures into structured results."""
    try:
        result = tool_fn(**kwargs)
        return {"success": True, "result": result}
    except RateLimitError as e:
        # Transient: the model (or the agent loop) can retry after a delay.
        return {
            "success": False,
            "error": "rate_limited",
            "message": f"API rate limit exceeded. Retry after {e.retry_after}s.",
            "retryable": True,
            "retry_after": e.retry_after,
        }
    except ValidationError as e:
        # Input error: retrying the same call will fail again.
        return {
            "success": False,
            "error": "invalid_input",
            "message": str(e),
            "retryable": False,
            "field": e.field,
        }
    except NotFoundError as e:
        # Permanent: the resource doesn't exist; let the model decide.
        return {
            "success": False,
            "error": "not_found",
            "message": f"Resource '{e.resource_id}' does not exist.",
            "retryable": False,
        }
    except Exception as e:
        # Catch-all: surface the exception type so the model has something
        # concrete to reason about instead of a generic message.
        return {
            "success": False,
            "error": "unknown",
            "message": f"Unexpected {type(e).__name__}: {e}",
            "retryable": False,
        }
```
The model receives these structured responses as tool output. Given a rate_limited response with a retry_after field, a capable model will often say "I need to wait a moment before trying again" and handle the retry itself — without any explicit retry logic in your agent code.
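Getting the structured dict in front of the model is just a matter of serializing it as the tool's output. A minimal sketch, assuming an OpenAI-style chat format (the `role: "tool"` message shape; your SDK may differ):

```python
import json

def tool_result_message(tool_call_id, payload):
    """Wrap a structured tool result (success or error) as a tool message."""
    # The same shape is used for both success and failure, so the model
    # always sees the error code, message, and retryable flag in context.
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": json.dumps(payload),
    }
```

Usage: `messages.append(tool_result_message(call.id, call_tool_safely(search, query=q)))` — the error becomes ordinary conversation context rather than a dead run.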
For transient failures, you want automatic retries. But you also want a circuit breaker — a mechanism that stops retrying if a tool has been failing for too long, preventing cascading failures.
```python
import time

class ToolCircuitBreaker:
    def __init__(self, threshold=5, timeout=60):
        self.failures = 0           # consecutive failures recorded
        self.threshold = threshold  # failures before the breaker opens
        self.opened_at = None       # timestamp when the breaker opened
        self.timeout = timeout      # seconds before calls are allowed again

    def is_open(self):
        if self.opened_at is None:
            return False
        if time.time() - self.opened_at > self.timeout:
            # Cool-down elapsed: close the breaker and allow calls again.
            self.reset()
            return False
        return self.failures >= self.threshold

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.time()

    def reset(self):
        self.failures = 0
        self.opened_at = None
```
When the circuit breaker is open, tool calls fail immediately without hitting the downstream service. This protects both your infrastructure and your LLM budget from runaway retry loops.
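One way to wire the breaker into the call path is a thin wrapper around the safe-call function. This is a sketch: `call_with_breaker` and the reset-on-success policy are choices of this example, not part of the class above.

```python
import time

class ToolCircuitBreaker:  # repeated from above so this sketch runs standalone
    def __init__(self, threshold=5, timeout=60):
        self.failures, self.threshold = 0, threshold
        self.opened_at, self.timeout = None, timeout
    def is_open(self):
        if self.opened_at is None:
            return False
        if time.time() - self.opened_at > self.timeout:
            self.reset()
            return False
        return self.failures >= self.threshold
    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.time()
    def reset(self):
        self.failures, self.opened_at = 0, None

def call_with_breaker(breaker, tool_fn, **kwargs):
    """Short-circuit calls while the breaker is open; track failures otherwise."""
    if breaker.is_open():
        # Fail fast: no network call, no retry, just a structured error.
        return {
            "success": False,
            "error": "tool_unavailable",
            "message": "Tool temporarily disabled after repeated failures.",
            "retryable": False,
        }
    result = tool_fn(**kwargs)  # expected to return the structured dict above
    if result["success"]:
        breaker.reset()  # a single success closes the breaker again
    else:
        breaker.record_failure()
    return result
```

Keeping one breaker per tool (rather than one global breaker) lets a flaky search API fail fast while unrelated tools keep working.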
When a tool has exhausted its retries and the circuit breaker is open, the model needs to know it can't use that tool. The response format matters:
```json
{
  "success": false,
  "error": "tool_unavailable",
  "message": "The search tool is currently unavailable after multiple retry attempts. You may want to inform the user that external search is temporarily unavailable and offer to help with information from your training data instead.",
  "retryable": false
}
```
That last sentence in the message is a hint to the model. It's optional, but models respond well to explicit fallback suggestions when a tool fails. You're not dictating behavior — you're providing context for a decision the model was going to make anyway.
Agent Chassis implements structured tool error handling, automatic retries with backoff, and circuit breakers out of the box. Get the framework →