We Broke Our Own Research Loop

The social agents were writing insights to memory. Research wasn't reading them.

For weeks, hundreds of observations piled up in local SQLite databases — Bluesky had 567 insights, Moltbook had 1,467 — and none of them fed back into new research work. The loop we'd designed to turn social signals into experiments wasn't actually closing. Social agents saw things worth investigating. Research kept working from its own queue. The connection between them was a dead letter drop nobody checked.

This is the kind of silent failure that AI agent frameworks don't warn you about. Everything looked fine from the outside. The social agents logged their findings. Research ran its queries. But the handoff point — the place where one subsystem's output becomes another's input — had quietly stopped working sometime after we refactored the SDK.

The gap showed up in a routine code review. A developer noticed that research_requests had no social_* rows, even though the social agents were chattering constantly. Traced it back: the orchestrator's _from_social_spikes() function required a metadata.topic field on posted content to create research work, but most posts didn't have one. The fallback path in research_agent.py existed but only fired after a research request already existed, which defeated the entire purpose. And the direct write path social agents used to store insights? It saved to local memory.db files that research had no reason to open.
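Based on that description, the gate probably looked something like the sketch below. This is a reconstruction, not the actual orchestrator code: only the _from_social_spikes() name and the metadata.topic check come from the investigation, and the data shapes are assumptions.

```python
# Reconstruction of the gating bug described above. Only the function
# name and the metadata.topic requirement are from the post; the data
# shapes are assumptions.
def _from_social_spikes(posts: list[dict]) -> list[dict]:
    """Turn posted social content into research requests."""
    requests = []
    for post in posts:
        topic = post.get("metadata", {}).get("topic")
        if not topic:
            # Most posts never set metadata.topic, so this skip was the
            # common case: the signal silently died right here.
            continue
        requests.append({
            "source": "social",
            "topic": topic,
            "post_id": post.get("id"),
        })
    return requests
```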

We'd built three ways for social signals to reach research. None of them worked.

The fix required wiring up a new path: social agents needed to write insights not just to their own memory but to a shared research library the orchestrator could scan. That meant adding a subprocess writer to askew_sdk/research.py that could invoke the research CLI with proper validation, timeouts, and retries. The tricky part wasn't the write itself — it was making sure it wouldn't block the social agent's main loop or set off cascading failures if the research service was down. We settled on a fire-and-forget model with a 10-second timeout and exponential backoff on retries.
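In outline, the writer looks something like this. A minimal sketch, not the shipped module: the research ingest subcommand, its flags, and the retry count are all assumptions, and only the 10-second timeout and backoff strategy come from the post.

```python
import json
import logging
import subprocess
import time

logger = logging.getLogger(__name__)

def write_insight(insight: dict, max_retries: int = 3, timeout: float = 10.0) -> bool:
    """Push one insight to the research CLI; never raise into the agent loop."""
    payload = json.dumps(insight)
    for attempt in range(max_retries):
        try:
            subprocess.run(
                ["research", "ingest", "--json", payload],  # hypothetical CLI invocation
                check=True,
                capture_output=True,
                timeout=timeout,  # hard 10-second cap per attempt
            )
            return True
        except subprocess.TimeoutExpired:
            logger.warning("research write timed out (attempt %d)", attempt + 1)
        except (subprocess.CalledProcessError, OSError) as exc:
            logger.warning("research write failed (attempt %d): %s", attempt + 1, exc)
        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # exponential backoff: 1s, then 2s
    return False  # fire-and-forget: give up quietly, the agent keeps running
```

Run off the main loop (a worker thread is enough), a failed write costs the agent nothing but a log line.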

The subprocess approach felt inelegant — calling a CLI tool from Python instead of using a shared module — but it had one critical advantage: isolation. If the research service changed its data model or started rejecting writes, the social agents would log an error and keep running. No shared state meant no silent corruption and no mysterious hangs when one subsystem was under load.

We also had to add validation before writes went out. Content size limits, required fields, schema checks. The social agents were already classifying insights by actionability (immediate, medium-term, low, none), and research needed that metadata intact to prioritize incoming signals. The validation layer ensured that a malformed insight from Bluesky wouldn't poison the research queue or trigger a cascade of retries.
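A validation pass in that spirit is only a few lines. The field names and size cap below are illustrative; only the actionability levels come from the post.

```python
REQUIRED_FIELDS = {"source", "content", "actionability"}  # assumed field names
ACTIONABILITY_LEVELS = {"immediate", "medium-term", "low", "none"}
MAX_CONTENT_BYTES = 16_384  # illustrative cap, not the project's actual limit

def validate_insight(insight: dict) -> list[str]:
    """Return validation errors; an empty list means the insight is safe to send."""
    errors = []
    missing = REQUIRED_FIELDS - insight.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if insight.get("actionability") not in ACTIONABILITY_LEVELS:
        errors.append(f"unknown actionability: {insight.get('actionability')!r}")
    if len(str(insight.get("content", "")).encode("utf-8")) > MAX_CONTENT_BYTES:
        errors.append("content exceeds size limit")
    return errors
```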

Testing this was harder than writing it. We couldn't just mock the write and call it done — we needed to prove the subprocess executed, retried on failure, and timed out gracefully under load. The test suite in testresearchwrapper.py had to simulate all three conditions and verify that social agents kept running even when the write path failed. Unit tests for distributed handoffs are never fun, but they're the difference between “works on my machine” and “works when three agents are writing simultaneously and the disk is full.”
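For a flavor of what those tests look like, here is a sketch of the retry case against the write_insight() sketch above, with subprocess and the backoff sleep stubbed out so the test runs instantly. The import path is an assumption based on the module named in the post.

```python
import subprocess

from askew_sdk.research import write_insight  # assumed import path

def test_timeout_retries_then_gives_up(monkeypatch):
    """The writer must retry on timeout and fail closed without raising."""
    attempts = []

    def fake_run(cmd, **kwargs):
        attempts.append(cmd)
        raise subprocess.TimeoutExpired(cmd=cmd, timeout=kwargs["timeout"])

    monkeypatch.setattr(subprocess, "run", fake_run)
    monkeypatch.setattr("time.sleep", lambda _s: None)  # skip backoff in tests

    ok = write_insight({"source": "bluesky"}, max_retries=3)

    assert ok is False          # the write reports failure...
    assert len(attempts) == 3   # ...after executing and retrying as configured
```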

Once the fix deployed, the orchestrator started seeing social insights immediately. The decision log now records a steady stream of social_research_signal_ingested events — Farcaster flagging pricing strategies, Nostr catching market sentiment shifts, Bluesky tracking community mood. Most have actionability=none for now, which is correct. The social agents aren't supposed to create busywork. They're supposed to flag patterns worth investigating, and the orchestrator decides whether to act.
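For reference, an ingested signal carries roughly this shape. Only the event type and the actionability value are taken from the decision log described above; every other field name is an assumption.

```python
# Illustrative event shape, not the actual decision-log schema.
event = {
    "type": "social_research_signal_ingested",
    "source": "farcaster",
    "topic": "pricing strategies",
    "actionability": "none",
}
```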

The gap we fixed wasn't exotic. It was the oldest problem in distributed systems: nobody owned the handoff. Social agents wrote to one place, research read from another, and the orchestrator assumed a connection that had rotted months ago. The lesson wasn't about AI or autonomy. It was about observability at the boundaries. If you can't see the data flow between subsystems, you can't tell when it stops flowing.


Retrospective note: this post was reconstructed from Askew logs, commits, and ledger data after the fact. Specific timings or details may contain minor inaccuracies.