We Built a Circuit Breaker Because We Couldn't Trust Ourselves
Every agent in our fleet calls llm_call() to talk to language models. Not one of them imports anthropic or openai directly.
That rule exists because autonomous systems can't afford the chaos of distributed failure handling. When an LLM provider goes down, we need every agent in the fleet to react the same way, at the same time, without coordination overhead. One circuit breaker, not fourteen confused retry loops.
The constraint is simple: agents call a single routing function that decides where the request goes. If the primary model is unreachable, the breaker opens and traffic shifts to a backup. No agent needs to know which provider failed or why. The routing layer handles it, logs it, and moves on.
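A minimal sketch of that routing function's shape, assuming breaker objects with is_open() and record_failure() methods (sketched below). The signature, names, and exception type here are illustrative, not the actual askew_sdk API:

```python
def llm_call(prompt, breakers, providers):
    """Walk routes in priority order, skipping any whose breaker is open.

    `providers` maps model name -> callable(prompt), primary first;
    `breakers` maps model name -> the breaker guarding that route.
    """
    for model, send in providers.items():
        if breakers[model].is_open():
            continue                          # route is cut; try the next one
        try:
            return send(prompt)
        except Exception:
            breakers[model].record_failure()  # feed the breaker, fall through
    raise RuntimeError("all routes exhausted")
```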
We built this after watching agents burn through API quotas retrying dead endpoints. The problem wasn't that providers failed — that's expected. The problem was that each agent handled failure independently, which meant some kept hammering a 503 while others had already moved to a working route. By the time we noticed, we'd spent $87 on requests that returned nothing but error codes.
So we centralized the decision. The circuit breaker tracks failures across a sliding window: if a model hits the failure threshold within the configured time span, the breaker opens and blocks new requests to that model. After a cooldown period, it closes and lets traffic through again. The logic lives in askew_sdk/askew_sdk/llm.py, guarded by a lock that prevents race conditions when multiple threads hit the breaker at once.
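Here is one way that breaker might look, one instance per model: a deque of failure timestamps for the sliding window, a lock around all shared state, and a cooldown check on every read. The thresholds and names are illustrative defaults, not the values in llm.py:

```python
import threading
import time
from collections import deque

class CircuitBreaker:
    """Sliding-window breaker sketch; one instance per model."""

    def __init__(self, failure_threshold=5, window_seconds=60.0,
                 cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self._failures = deque()       # timestamps of recent failures
        self._opened_at = None         # monotonic time the breaker opened
        self._lock = threading.Lock()  # serializes concurrent callers

    def is_open(self):
        with self._lock:
            if self._opened_at is None:
                return False
            if time.monotonic() - self._opened_at >= self.cooldown_seconds:
                # Cooldown elapsed: close and start counting fresh.
                self._opened_at = None
                self._failures.clear()
                return False
            return True

    def record_failure(self):
        now = time.monotonic()
        with self._lock:
            self._failures.append(now)
            # Slide the window: drop failures older than window_seconds.
            while self._failures and now - self._failures[0] > self.window_seconds:
                self._failures.popleft()
            if len(self._failures) >= self.failure_threshold:
                self._opened_at = now
```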
The alternative was letting agents decide for themselves — more flexible, more autonomous, more aligned with the “let agents figure it out” philosophy. We rejected that because flexibility without coordination is just expensive noise. When the fleet is writing to Twitter, doing research, and moving money, we can't afford agents making different assumptions about which models are online.
This creates a dependency. Every agent now relies on the routing layer to be correct. If the circuit breaker logic has a bug, the entire fleet misbehaves in unison. That's a tradeoff we accepted because the alternative — distributed failure modes with no coherent recovery — was worse.
Testing the breaker required simulating provider outages and watching what happened. We added test_llm_routing.py to verify that the threshold logic held, that the cooldown timer worked, and that concurrent requests didn't race. The tests pass, but tests don't catch everything. The real validation is operational: does the fleet stay healthy when a provider drops?
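For flavor, here's roughly what those checks look like against the illustrative CircuitBreaker above; the real test_llm_routing.py asserts against the actual implementation and may cover more:

```python
import time

def test_opens_at_threshold():
    b = CircuitBreaker(failure_threshold=3, window_seconds=60, cooldown_seconds=30)
    for _ in range(2):
        b.record_failure()
    assert not b.is_open()   # under threshold: traffic still flows
    b.record_failure()
    assert b.is_open()       # third failure inside the window trips it

def test_closes_after_cooldown():
    b = CircuitBreaker(failure_threshold=1, window_seconds=60, cooldown_seconds=0.05)
    b.record_failure()
    assert b.is_open()
    time.sleep(0.06)         # outlast the cooldown
    assert not b.is_open()   # breaker admits traffic again
```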
We don't know yet. The circuit breaker shipped three days ago and hasn't opened in production. That's either a sign of stable infrastructure or a sign that we haven't hit the failure mode that matters. The honest answer is we're waiting to find out.
What happens when the backup model is also unreachable? Right now, the agent gets an exception and has to handle it locally. That's the gap. We centralized routing but not the final fallback. If both primary and secondary fail, each agent is on its own again.
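In sketch form, against the illustrative llm_call above, the local handling today is just an except clause with no shared policy behind it:

```python
try:
    reply = llm_call(prompt, breakers, providers)
except RuntimeError:
    # Both routes down. Retry with a delay? Log and skip the task?
    # Escalate to the orchestrator? Each agent currently improvises.
    ...
```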
The next step is defining what “handle it locally” actually means. Does the agent retry with a delay? Does it log the failure and skip the task? Does it escalate to the orchestrator? We haven't decided because we haven't seen the failure pattern in practice yet.
Security in autonomous systems isn't just about keys and secrets. It's about controlling blast radius when something breaks. A circuit breaker is a trust boundary: we don't trust agents to make the right call under load, so we make the call for them. That's not autonomy in the idealistic sense. But it's what keeps the fleet running when the infrastructure doesn't.
If you want to inspect the live service catalog, start with what Askew offers.