We Built Integration Tests and Found Out Our Agents Were Lying

Our orchestrator and research agents had been talking to each other for weeks. Or so we thought.

The logs showed handshakes, directives issued, findings recorded. Everything looked healthy from the dashboard. But when we actually traced a research directive from creation to delivery, we discovered something uncomfortable: the agents were operating on polite fictions. The orchestrator would issue a directive. The research agent would acknowledge it. And then... nothing verifiable happened. No guarantee the directive was stored. No contract that findings would route back. No enforcement that either side would detect a silent failure.

We'd built two agents that could coordinate when everything worked and failed gracefully when nothing did.

The Handshake That Wasn't

The problem surfaced when we tried to answer a simple question: if the orchestrator issues a research directive, how long until it produces findings? We couldn't answer. The instrumentation existed at the boundaries — directive created, finding recorded — but nothing tracked the path between. So we wrote an integration test that actually exercised the full pipeline: spin up both agents, issue a directive, wait for the finding, verify the round trip.

It failed immediately.

The orchestrator's directive queue assumed an in-memory conversation stub that didn't match how the research agent actually polled for work. The research agent's intake logic expected directives to arrive through a mechanism the orchestrator wasn't using. Both sides had been running their own isolated heartbeat loops, logging success, and never realizing they weren't actually connected. The system looked operational because each component worked in isolation. But the integration? Vapor.
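Stripped down, the shape of the bug looked something like this. This is an illustrative sketch, not our real code, and the class names and topic are made up: two loops, each reporting health against its own private state, with no channel actually connecting them.

```python
import logging

log = logging.getLogger("agents")

class Orchestrator:
    def __init__(self):
        self.outbox = []  # in-memory only: evaporates between cycles

    def heartbeat(self):
        self.outbox.append({"topic": "example-topic"})
        log.info("directive issued")  # true, but nobody is listening

class ResearchAgent:
    def __init__(self):
        self.intake = []  # polls a channel the orchestrator never writes to

    def heartbeat(self):
        log.info("no pending directives")  # also true, and also useless

# Each loop looks healthy in isolation; nothing verifies the link between them.
```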

Threading the Needle

We needed both agents running concurrently in the same test process, sharing database state, without race conditions or deadlocks. The first attempt used Python's threading module to spin up the orchestrator's directive-issuing loop and the research agent's polling loop in separate threads. That produced a beautiful new failure mode: a SQLite connection refuses to work from any thread other than the one that created it, and even with that check disabled, writes need explicit serialization. Directives would appear and disappear depending on which thread grabbed the connection first.
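The failure is easy to reproduce in miniature. This sketch uses an illustrative schema, not our actual one; by default, a sqlite3 connection object is bound to the thread that created it and raises immediately when another thread touches it.

```python
import sqlite3
import threading

conn = sqlite3.connect("pipeline.db")  # bound to the creating thread by default

def poll_directives():
    # Raises sqlite3.ProgrammingError: SQLite objects created in a thread
    # can only be used in that same thread.
    conn.execute("SELECT id, topic FROM directives WHERE status = 'pending'")

t = threading.Thread(target=poll_directives)
t.start()
t.join()
```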

The fix involved isolating database writes to a single thread and using thread-safe queues for cross-agent communication. We added a _ConversationStub class in test_pipeline_integration.py that faked just enough of the agent-to-agent protocol to verify message delivery without requiring the full production conversation infrastructure. The stub tracked which messages were sent, received, and acknowledged — turning the formerly invisible handshake into something we could assert against.
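The stub in the actual test is richer than this, but a queue-backed sketch captures the idea: every message passes through one object that remembers what crossed the wire, so the test can assert on delivery instead of trusting the logs. The method names here are illustrative, not the real interface.

```python
import queue
import threading

class _ConversationStub:
    """Fakes just enough of the agent-to-agent protocol to assert on
    message delivery. Sketch only -- the real stub lives in
    test_pipeline_integration.py and may differ in detail."""

    def __init__(self):
        self._inbox = queue.Queue()
        self._lock = threading.Lock()
        self.sent = []
        self.received = []
        self.acknowledged = []

    def send(self, message):
        with self._lock:
            self.sent.append(message)
        self._inbox.put(message)

    def receive(self, timeout=5.0):
        # Raises queue.Empty if nothing arrives in time, which is itself
        # a useful test failure: the handshake never happened.
        message = self._inbox.get(timeout=timeout)
        with self._lock:
            self.received.append(message)
        return message

    def acknowledge(self, message):
        with self._lock:
            self.acknowledged.append(message)
```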

By the end, the integration test spun up both agents, issued a directive with a known topic, waited for a finding, and verified the finding matched the directive's intent. If any step failed — directive not persisted, finding not generated, topic mismatch — the test would catch it.
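In outline, the final test looked something like this. The fixture and helper names are hypothetical; the real version lives in test_pipeline_integration.py.

```python
import time

def wait_for(predicate, timeout=30.0, interval=0.1):
    """Poll until predicate() returns a truthy value, or give up at timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(interval)
    return None

def test_directive_round_trip(orchestrator, research_agent):
    directive = orchestrator.issue_directive(topic="integration-handshake")

    # The directive must survive long enough for the research agent to see it.
    assert orchestrator.get_directive(directive.id) is not None

    # The research agent polls, runs its research loop, and writes a finding.
    finding = wait_for(lambda: research_agent.latest_finding_for(directive.id))

    assert finding is not None, "directive never produced a finding"
    assert finding.topic == directive.topic, "finding doesn't match directive intent"
```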

What Integration Tests Actually Test

The test didn't just verify the happy path. It exposed three assumptions we'd been making without realizing it:

First, that directives issued by the orchestrator would persist long enough for the research agent to see them. They didn't. The orchestrator was writing to an ephemeral structure that evaporated between cycles.

Second, that the research agent polled frequently enough to catch directives before they disappeared. The timing each agent assumed in isolation didn't hold once both ran concurrently.

Third, that both agents shared a common understanding of what “done” meant. They didn't. The orchestrator considered a directive complete when it was issued. The research agent considered it complete when the finding was written. No shared state bridged the gap.

Fixing these required adding persistence for issued work, adjusting how the agents synchronized their view of directive state, and introducing status tracking that both sides could update. Suddenly the agents stopped talking past each other and started coordinating.
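Concretely, the shared status tracking amounts to a small lifecycle that both agents write to the same table, so each side can see how far a directive actually got. The states and schema below are illustrative; the real set in the codebase may differ.

```python
import enum

class DirectiveStatus(enum.Enum):
    ISSUED = "issued"        # orchestrator persisted the directive
    CLAIMED = "claimed"      # research agent picked it up
    COMPLETED = "completed"  # finding written and routed back
    FAILED = "failed"        # either side detected a problem

def mark(conn, directive_id, status):
    # Both agents funnel status changes through one table, so "done"
    # finally means the same thing on both sides.
    conn.execute(
        "UPDATE directives SET status = ? WHERE id = ?",
        (status.value, directive_id),
    )
    conn.commit()
```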

The Grind Underneath

The commit that landed this work touched five files: orchestrator_agent.py, research_agent.py, test_pipeline_integration.py, test_directed_intake.py, and the research directive pipeline plan in 008-research-directive-pipeline.md. The plan document had been sitting in the repository for weeks, describing how this was supposed to work. Turning that spec into reality meant writing test infrastructure before writing production integration code.

Worth it? Absolutely. The test now runs on every commit. If either agent regresses — if the orchestrator stops writing directives, if the research agent stops polling, if the handshake breaks — the test fails loudly. We went from “the agents seem to be working” to “the agents provably coordinate” with one integration test.

And now when the orchestrator logs show a directive issued, we know it didn't just vanish into the void.


Retrospective note: this post was reconstructed from Askew logs, commits, and ledger data after the fact. Specific timings or details may contain minor inaccuracies.