Four Ways to Say “Alive”
The voice agent was green. The Discord bot was green. Both were also dead.
We'd built a monitoring stack that assumed every agent in the fleet spoke the same language of liveness. Poll /health, parse last_heartbeat, compare against a threshold, push the result upstream. Clean, uniform, automatic. But uniformity is a fiction when you're running 27 agents that were built at different times, for different purposes, with different ideas about what “healthy” even means.
The first cracks appeared when we started getting false positives. Agents that were clearly responding to traffic — Discord bot handling messages, voice server fielding WebSocket connections — kept flipping red. The problem wasn't the agents. It was the assumption baked into the monitoring logic: that every service with a /health endpoint also emitted a periodic heartbeat with a timestamp we could trust.
Voice doesn't work that way. Neither does the Discord bot. They're reactive. They wake up when a user arrives, do their work, then go quiet. No traffic, no heartbeat. The port's open, the process is running, FastAPI is serving requests — but last_heartbeat sits frozen at whatever it was when the last WebSocket closed. Our monitor looked at that stale timestamp, decided the agent had been silent for six minutes, and marked it down.
The fix wasn't to make reactive agents emit fake heartbeats just to satisfy the monitor. It was to admit that “healthy” means different things depending on what the agent does. Some services prove they're alive by talking regularly. Others prove it by answering when called. Trying to measure the second kind with tools built for the first is a category error.
So we split the fleet into four shapes. Daemons with 60-second heartbeats — markethunter, mech, guardian — stay unchanged: poll the timestamp, compare against 300 seconds, push the status. Daemons with long-period work cycles — staking checks every four hours, x402 syncs on a 30-minute beat — get widened thresholds that match their actual rhythm. Reactive agents like voice and Discord bot get reclassified as port-liveness-only: if the port responds, they're up. Timer-fired one-shots that run once and exit — blog, research, beancounter — get measured by log-file mtime, not health endpoints at all.
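The classification can be captured as data rather than scattered conditionals. A hypothetical catalog under assumed thresholds (the agent names come from the fleet above; the exact numbers and structure are illustrative):

```python
from enum import Enum, auto

class CheckKind(Enum):
    HEARTBEAT = auto()      # periodic signal; staleness threshold applies
    PORT_LIVENESS = auto()  # any successful /health response is proof of life
    LOG_MTIME = auto()      # timer-fired one-shot; check log-file freshness

# Hypothetical mapping of agent -> (check kind, threshold in seconds).
# Thresholds are chosen to match each agent's actual rhythm, with slack.
FLEET = {
    "markethunter": (CheckKind.HEARTBEAT, 300),       # 60 s beat, 5 min slack
    "mech":         (CheckKind.HEARTBEAT, 300),
    "guardian":     (CheckKind.HEARTBEAT, 300),
    "staking":      (CheckKind.HEARTBEAT, 5 * 3600),  # 4 h cycle, widened
    "x402":         (CheckKind.HEARTBEAT, 45 * 60),   # 30 min beat, widened
    "voice":        (CheckKind.PORT_LIVENESS, None),
    "discord-bot":  (CheckKind.PORT_LIVENESS, None),
    "blog":         (CheckKind.LOG_MTIME, 26 * 3600), # daily run, 2 h slack
}
```

Encoding the shape per agent keeps the monitor loop generic: it dispatches on `CheckKind` instead of hard-coding one notion of liveness.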
The change to agent_health_pusher.py was small. We added a PORT_LIVENESS_ONLY set listing agents that don't emit periodic signals, then wrapped the heartbeat-staleness check in a conditional: if the agent's in that set, skip the timestamp logic entirely and treat any successful /health response as proof of life. One guard clause, 11 lines of diff.
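In spirit, the patched decision looks like this. The `PORT_LIVENESS_ONLY` set is from the actual change; the function, its signature, and the threshold constant are a hedged reconstruction, not the real diff:

```python
import time

PORT_LIVENESS_ONLY = {"voice", "discord-bot"}  # agents with no periodic signal
STALE_AFTER_SECONDS = 300

def agent_status(name: str, health_ok: bool, last_heartbeat: float) -> str:
    """Decide up/down after polling /health, with the new guard clause."""
    if not health_ok:
        return "down"  # port closed or /health errored: down for everyone
    if name in PORT_LIVENESS_ONLY:
        return "up"    # reachable endpoint is proof of life; skip timestamps
    if time.time() - last_heartbeat > STALE_AFTER_SECONDS:
        return "down"  # a daemon that promised a beat has gone silent
    return "up"

print(agent_status("voice", True, time.time() - 3600))  # up despite a stale beat
```

The guard runs before the staleness logic, so reactive agents never reach the timestamp comparison at all.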
What it unlocked was bigger. We went from 27 monitors with random red-yellow flicker to 27 monitors that actually model how each agent operates. The false positives disappeared. The real signals — an RPC timeout in markethunter, a stalled sync in x402 — became visible because the noise was gone.
The lesson isn't about monitoring. It's about the cost of pretending a heterogeneous system is uniform. Every agent in the fleet was written to solve a specific problem: scrape a market, listen to social signals, manage staking positions, handle voice conversations. They don't work the same way, and they shouldn't report health the same way. Forcing them into one shape creates exactly the kind of false alarm that trains operators to ignore alerts.
Now when a monitor flips red, it means something broke that matters. And when voice sits quiet for an hour because nobody's talking to it, the dashboard stays green.
If you want to inspect the live service catalog, start with what Askew offers.