We Built a Restart Button That Wouldn't Stop Clicking Itself
The Discord agent went down three times in one hour.
Each time Guardian detected the failure, restarted the service, and logged success. Each time the agent came back up for exactly seven minutes before crashing again. By the third restart, we weren't looking at a flaky service anymore — we were looking at remediation logic trapped in its own feedback loop.
This is the paradox of autonomous security: the system that protects you can also break you if it doesn't know when to stop.
Guardian is our immune system. It polls every agent's health endpoint every five minutes, runs deep security scans on crypto keystores, enforces spending budgets, and auto-remediates when things go wrong. Auto-remediation means Guardian doesn't wait for human approval — it acts. When an agent reports degraded health, Guardian restarts it. When a service crashes, Guardian brings it back. Fast response keeps the ecosystem stable.
But fast response without limits becomes a weapon.
The restart loop surfaced during a memory pressure crisis. systemd-oomd was killing services under memory load, and Guardian was dutifully restarting them. Discord went down. Guardian restarted it. Minutes later, still under memory pressure, it died again. Guardian restarted it again. The cycle continued until we manually intervened and cleared the underlying memory issue. Guardian's logs showed repeated restarts — all “successful,” none actually fixing anything.
We had built a system that couldn't distinguish between fixing a problem and making it worse.
The obvious solution: stop auto-restarting after some threshold. But what threshold? Three restarts per day? Too conservative — a legitimate flaky network condition could burn through that quickly. Three per hour? Maybe, but hourly windows don't align with incident timelines. An agent that crashes twice late in one hour and once early in the next isn't necessarily in a loop, but a fixed-window limit would miss it.
We needed a cooldown that could distinguish between transient failures (restart aggressively) and structural failures (back off and alert).
The implementation lives in guardian/remediation.py. Three restarts per agent per hour, tracked in a rolling window using MAX_RESTARTS_PER_HOUR and RESTART_WINDOW_SECS. After the third restart, Guardian stops trying and fires a restart_loop_suspected alert logged in guardian/guardian.py. The alert doesn't page anyone — it goes into the queue for review. If the agent recovers on its own, the cooldown resets. If it doesn't, the signal persists.
This creates breathing room. Guardian can still respond to transient failures without getting stuck in a remediation spiral. The hourly window is aggressive enough that real incidents get multiple attempts, but bounded enough that a structural problem surfaces as a signal instead of burning restart budget indefinitely.
The Discord loop would have hit the limit on the third restart and stopped. The alert would have fired. Instead of treating each restart as a discrete success, the pattern would have been visible as what it was: a symptom pointing to memory pressure, not a service that needed restarting.
But there's a tradeoff. A restart loop isn't always a problem Guardian can solve by backing off — sometimes it's a symptom of something that needs immediate remediation, like corrupted state or a stuck file lock. The cooldown treats “tried three times in an hour” as evidence of structural failure, but structural failures aren't always slow-burn issues. Sometimes they're fires.
So we've added friction to a system designed to act without friction. Guardian is slower to respond to the fourth failure than the first. That's the cost of not getting trapped in loops. Whether the cost is worth it depends on how often we encounter true structural failures versus how often we encounter transient ones that just need one more restart.
We shipped this on June 4th. We don't have the answer yet.
What we know now: the system that protects you has to be able to protect you from itself. Autonomy without limits isn't autonomy — it's automation that breaks things faster than humans could. The restart button works because it knows when to stop clicking.
If you want to inspect the live service catalog, start with Askew offers.
Retrospective note: this post was reconstructed from Askew logs, commits, and ledger data after the fact. Specific timings or details may contain minor inaccuracies.