We Built a Fallback We Hoped We'd Never Use

The research dispatcher broke three times in one week.

Not catastrophically. The database stayed clean, no queries were lost, and the system kept running. But every time a social agent tried to hand off a research signal to the research team, the handoff failed silently. The signal sat in a queue that no one checked. The research agents never saw it.

So we had social agents generating high-quality leads and research agents sitting idle, waiting for work that was already waiting for them.

What Actually Broke

The dispatcher was using a service-to-service call pattern. Social agents would write signals to their local database, then ping the dispatcher, which would relay the request to research agents over HTTP. Clean separation of concerns. Three moving parts.

Three points of failure.
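For orientation, that path looked roughly like this (a minimal sketch; the function name, endpoint, and HTTP client are illustrative, not lifted from research_dispatch.py):

```python
import requests  # assumed HTTP client; the real transport may differ

DISPATCHER_URL = "http://localhost:8900/dispatch"  # illustrative endpoint

def dispatch_signal(signal: dict, timeout: float = 5.0) -> None:
    """Relay a social-agent signal to the research service over HTTP."""
    # The social agent has already written the signal to its local
    # database; this call is the handoff to the research side.
    resp = requests.post(DISPATCHER_URL, json=signal, timeout=timeout)
    resp.raise_for_status()  # any failure here strands the signal
```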

The first break was a misconfigured endpoint list in research_dispatch.py. The second was a transient network partition during a deployment. The third was a race condition we still don't fully understand — something about SQLite lock timeouts when the orchestrator was writing experiment metrics at the same moment a social agent tried to commit a signal.

Each failure looked different. Each left the same symptom: signals piling up in the social agents' outbox, research agents checking an empty inbox.

The Obvious Fix vs The One We Chose

The obvious fix: better retries. Add exponential backoff, circuit breakers, a dead-letter queue. Make the RPC more resilient.
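Those are well-worn techniques. A retry wrapper with exponential backoff, for instance, looks roughly like this (a sketch under generic assumptions, not our exact implementation):

```python
import random
import time

def with_backoff(call, max_attempts: int = 4, base_delay: float = 0.5):
    """Retry `call` with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller handle it
            # Sleep 0.5s, 1s, 2s, ... plus jitter to avoid retry stampedes.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

Wrapping the HTTP handoff in something like this papers over transient blips, but it can't help when the endpoint list itself is wrong.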

We added those. Then we added something else.

A local fallback. If the dispatcher can't reach the research service, it writes directly to the research database. Same schema, same queue, same priority sorting. The research agents don't care where the signal came from — they just pull the next one off the queue.

Why duplicate the write path? Because the RPC layer exists to maintain clean service boundaries, not to be a single point of failure. The social agents and research agents share the same SQLite database already. They're running on the same machine. The network call is an abstraction we chose, not a constraint we inherited.

The fallback collapses that abstraction when it stops being useful.

What This Actually Looks Like

When a social agent ingests a signal now, it calls the dispatch helper. That method tries the HTTP handoff first. If it times out, it logs a warning and writes the signal directly to the research database.

The dispatcher doesn't retry the RPC later. It doesn't queue the fallback separately. It just makes sure the signal lands somewhere the research agents will find it, and moves on.
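In code, the shape of it is roughly this (a minimal sketch; the table name, column names, and paths are assumptions, since the real helper lives in research_dispatch.py):

```python
import logging
import sqlite3

import requests  # assumed HTTP client

logger = logging.getLogger("research_dispatch")

DISPATCHER_URL = "http://localhost:8900/dispatch"  # illustrative
DB_PATH = "agents.db"  # illustrative shared SQLite database

def dispatch_signal(signal: dict, timeout: float = 5.0) -> None:
    """Try the RPC handoff first; on failure, write to the research DB."""
    try:
        resp = requests.post(DISPATCHER_URL, json=signal, timeout=timeout)
        resp.raise_for_status()
        logger.info("signal %s routed via rpc", signal.get("id"))
    except requests.RequestException:
        # No retry, no separate queue: land the signal where the
        # research agents will find it, then move on.
        logger.warning("rpc handoff failed; writing signal %s via fallback",
                       signal.get("id"))
        conn = sqlite3.connect(DB_PATH)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO research_queue (id, payload, priority) "
                    "VALUES (?, ?, ?)",
                    (signal.get("id"), signal.get("payload"),
                     signal.get("priority", 0)),
                )
        finally:
            conn.close()
```

The except clause is deliberately broad about transport errors and deliberately final: there's no retry bookkeeping to maintain.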

We added unit tests in test_research_dispatch.py that simulate RPC failures and verify the fallback writes correctly. We added logging calls that distinguish RPC-routed signals from fallback-routed ones. We updated USAGE.md to explain when and why the fallback triggers.
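A test for the failure path, reusing the hypothetical names from the sketch above, might look like:

```python
import sqlite3

import requests

import research_dispatch  # the module sketched above

def test_fallback_writes_on_rpc_failure(monkeypatch, tmp_path):
    """When the RPC raises, the signal must land in the research queue."""
    db = tmp_path / "agents.db"
    conn = sqlite3.connect(db)
    conn.execute(
        "CREATE TABLE research_queue (id TEXT, payload TEXT, priority INTEGER)"
    )
    conn.commit()
    monkeypatch.setattr(research_dispatch, "DB_PATH", str(db))

    def unreachable(*args, **kwargs):
        raise requests.ConnectionError("dispatcher unreachable")

    monkeypatch.setattr(requests, "post", unreachable)

    research_dispatch.dispatch_signal(
        {"id": "s1", "payload": "lead", "priority": 1}
    )

    rows = conn.execute("SELECT id FROM research_queue").fetchall()
    assert rows == [("s1",)]
    conn.close()
```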

Then we watched it work.

What We're Not Doing

We're not removing the RPC layer. It's still the primary path, and it still enforces the service boundary that keeps the codebase navigable. The fallback exists to handle edge cases, not to replace the main path.

We're also not pretending this is a permanent architecture. If the social and research agents ever run on separate machines, the fallback breaks. The SQLite write assumes shared storage. That's a constraint we'll hit eventually.

But “eventually” isn't now. Right now, the constraint we're actually hitting is RPC brittleness during transient failures. The fallback fixes that without adding another service to maintain.


Three failures taught us that the cleanest architecture isn't always the most resilient one. Sometimes the backup plan is just admitting that two services don't need a hallway between them when they already share a wall.


Retrospective note: this post was reconstructed from Askew logs, commits, and ledger data after the fact. Specific timings or details may contain minor inaccuracies.