Tool calls fail in production. This is not a bug in your agent. It's the normal operating environment. The question isn't whether your agent will encounter tool failures — it's whether it handles them gracefully or crashes.
Most naive agent implementations treat tool failures as exceptions that propagate up and kill the run. The right design treats them as values the agent can reason about and respond to.
When a tool call fails, the model needs to know about it to make a decision: try again, try a different tool, ask the user for clarification, or give up with an informative message. That decision requires information — specifically, what failed and why.
An exception that propagates to the top of the call stack gives the model none of that information. The run dies. The user sees an error. Whatever context had been accumulated in the conversation is lost.
A structured error response — a JSON object with an error code, human-readable message, and optional retry guidance — is something the model can reason about. The model has been trained on error handling patterns. Given a clear error message, it often tries the right fallback on its own.
Transient failures: network timeouts, rate limit errors, temporary server unavailability. These are temporary conditions — the same call will likely succeed if retried after a short backoff. The right response is: catch, wait, retry.
Examples: HTTP 429, HTTP 503, connection timeout, DNS failure.
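The catch-wait-retry response can be sketched as a small helper with exponential backoff. This is a minimal sketch: `TransientError` and the delay values are illustrative, not a real library's API.

```python
import time

class TransientError(Exception):
    """Stand-in for rate limits, timeouts, and 5xx responses."""

def retry_with_backoff(fn, max_attempts=3, base_delay=1.0):
    """Retry a callable on transient errors, doubling the delay each time."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure to the caller
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

The exponential schedule matters for rate limits in particular: a fixed short delay tends to hit the same rate window again, while doubling backs the agent off until the window clears.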
Input errors: the tool was called with malformed or invalid input. The same call will fail again no matter how many times you retry it. The right response is to return a structured error that tells the model what was wrong with the input.
Examples: invalid schema, missing required field, type mismatch, value out of range.
Permanent failures: the resource doesn't exist, the user lacks permission, or the action is impossible given current state. Neither retrying nor fixing the input will help. The right response is to return a clear error and let the model decide how to proceed.
Examples: 404 not found, 403 forbidden, resource deleted, logical impossibility.
Here's the pattern that works:
```python
def call_tool_safely(tool_fn, **kwargs):
    """Run a tool call and convert failures into structured results."""
    try:
        result = tool_fn(**kwargs)
        return {"success": True, "result": result}
    except RateLimitError as e:
        # Transient: the model (or the agent loop) can retry after a delay.
        return {
            "success": False,
            "error": "rate_limited",
            "message": f"API rate limit exceeded. Retry after {e.retry_after}s.",
            "retryable": True,
            "retry_after": e.retry_after,
        }
    except ValidationError as e:
        # Input error: retrying the same call will fail again.
        return {
            "success": False,
            "error": "invalid_input",
            "message": str(e),
            "retryable": False,
            "field": e.field,
        }
    except NotFoundError as e:
        # Permanent: the resource doesn't exist; let the model decide.
        return {
            "success": False,
            "error": "not_found",
            "message": f"Resource '{e.resource_id}' does not exist.",
            "retryable": False,
        }
    except Exception as e:
        # Catch-all: surface the exception type so the model has something
        # concrete to reason about instead of a generic message.
        return {
            "success": False,
            "error": "unknown",
            "message": f"Unexpected {type(e).__name__}: {e}",
            "retryable": False,
        }
```
The model receives these structured responses as tool output. Given a rate_limited response with a retry_after field, a capable model will often say "I need to wait a moment before trying again" and handle the retry itself — without any explicit retry logic in your agent code.
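Getting the structured dict in front of the model is just a matter of serializing it as the tool's output. A minimal sketch, assuming an OpenAI-style chat format (the `role: "tool"` message shape; your SDK may differ):

```python
import json

def tool_result_message(tool_call_id, payload):
    """Wrap a structured tool result (success or error) as a tool message."""
    # The same shape is used for both success and failure, so the model
    # always sees the error code, message, and retryable flag in context.
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": json.dumps(payload),
    }
```

Usage: `messages.append(tool_result_message(call.id, call_tool_safely(search, query=q)))` — the error becomes ordinary conversation context rather than a dead run.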
For transient failures, you want automatic retries. But you also want a circuit breaker — a mechanism that stops retrying if a tool has been failing for too long, preventing cascading failures.
```python
import time

class ToolCircuitBreaker:
    def __init__(self, threshold=5, timeout=60):
        self.failures = 0           # consecutive failures recorded
        self.threshold = threshold  # failures before the breaker opens
        self.opened_at = None       # timestamp when the breaker opened
        self.timeout = timeout      # seconds before calls are allowed again

    def is_open(self):
        if self.opened_at is None:
            return False
        if time.time() - self.opened_at > self.timeout:
            # Cool-down elapsed: close the breaker and allow calls again.
            self.reset()
            return False
        return self.failures >= self.threshold

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.time()

    def reset(self):
        self.failures = 0
        self.opened_at = None
```
When the circuit breaker is open, tool calls fail immediately without hitting the downstream service. This protects both your infrastructure and your LLM budget from runaway retry loops.
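One way to wire the breaker into the call path is a thin wrapper around the safe-call function. This is a sketch: `call_with_breaker` and the reset-on-success policy are choices of this example, not part of the class above.

```python
import time

class ToolCircuitBreaker:  # repeated from above so this sketch runs standalone
    def __init__(self, threshold=5, timeout=60):
        self.failures, self.threshold = 0, threshold
        self.opened_at, self.timeout = None, timeout
    def is_open(self):
        if self.opened_at is None:
            return False
        if time.time() - self.opened_at > self.timeout:
            self.reset()
            return False
        return self.failures >= self.threshold
    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.time()
    def reset(self):
        self.failures, self.opened_at = 0, None

def call_with_breaker(breaker, tool_fn, **kwargs):
    """Short-circuit calls while the breaker is open; track failures otherwise."""
    if breaker.is_open():
        # Fail fast: no network call, no retry, just a structured error.
        return {
            "success": False,
            "error": "tool_unavailable",
            "message": "Tool temporarily disabled after repeated failures.",
            "retryable": False,
        }
    result = tool_fn(**kwargs)  # expected to return the structured dict above
    if result["success"]:
        breaker.reset()  # a single success closes the breaker again
    else:
        breaker.record_failure()
    return result
```

Keeping one breaker per tool (rather than one global breaker) lets a flaky search API fail fast while unrelated tools keep working.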
When a tool has exhausted its retries and the circuit breaker is open, the model needs to know it can't use that tool. The response format matters:
```json
{
  "success": false,
  "error": "tool_unavailable",
  "message": "The search tool is currently unavailable after multiple retry attempts. You may want to inform the user that external search is temporarily unavailable and offer to help with information from your training data instead.",
  "retryable": false
}
```
That last sentence in the message is a hint to the model. It's optional, but models respond well to explicit fallback suggestions when a tool fails. You're not dictating behavior — you're providing context for a decision the model was going to make anyway.
Agent Chassis implements structured tool error handling, automatic retries with backoff, and circuit breakers out of the box. Get the framework →