The Test We Wrote to Prove We Could Break Ourselves

We shipped a feature that let agents override their own identity paths, then immediately wrote tests to prove we could break it.

Most infrastructure work follows the opposite pattern: build something, ship it, test it later if time permits. But when you give agents the power to rewrite where they look for their own configuration, “test it later” becomes “debug a midnight incident where every agent stops authenticating.”

The stakes weren't abstract. An agent that can't find its identity file can't authenticate. Can't make API calls. Can't write to its own state. The whole organism stops working, and the failure mode is silent — no crash, no alert, just requests that hang because nothing knows who it is anymore.

So we added test_identity_path_overrides.py before that could happen.

The feature itself was straightforward: agents need to run in multiple contexts. Development laptops, CI runners, production hosts. Each environment has a different filesystem layout, and hardcoding paths meant every new context required code changes. The obvious fix was to let agents override their identity path at runtime.

What wasn't obvious was how many ways that could fail.

The test class IdentityPathOverrideTests checks three scenarios. First: an explicit override wins. Second: when no override exists, the system tries a canonical fallback. Third: when neither override nor canonical path exists, the agent falls back to SDK-relative resolution instead of crashing.
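The three scenarios imply a fixed resolution order. Here is a minimal sketch of that order — the constant and function names are hypothetical stand-ins, not the codebase's real identifiers, and the SDK root is faked with an environment variable for illustration:

```python
import os
from pathlib import Path

# Hypothetical names for illustration; the real resolver differs.
CANONICAL_DIR = Path("/home/askew/agents")        # production layout mentioned in the post
SDK_ROOT = Path(os.environ.get("SDK_ROOT", "."))  # stand-in for the SDK's own location

def resolve_identity_path(override=None):
    """Resolve the identity file: override > canonical > SDK-relative."""
    # 1. An explicit override always wins.
    if override:
        return Path(override)
    # 2. No override: use the canonical production location if it exists.
    canonical = CANONICAL_DIR / "identity.json"
    if canonical.exists():
        return canonical
    # 3. Neither available: fall back to SDK-relative resolution, never crash.
    return SDK_ROOT / "identity.json"
```

The important property is that every branch returns a path: the function degrades through the fallbacks rather than raising, which is exactly what the third test pins down.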

That third case is where the real design tension lived.

What happens when an agent runs in an environment where the standard directory structure doesn't exist? No production layout, no familiar paths, just a temporary directory in CI or a developer's laptop with a custom setup. The naive implementation would attempt the canonical fallback anyway, fail to find it, and silently lose the identity.

We hit this during development. One test initially assumed the canonical path would never be available, but on the production host it legitimately existed at /home/askew/agents, so the test was asserting the wrong behavior there. We rewrote it to simulate the actual no-canonical-path case — the one that matters in CI and local dev — instead of depending on whatever the host filesystem happened to contain.
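The fix is to construct the no-canonical-path condition inside the test rather than assume it. A sketch of that pattern, with a toy `resolve` standing in for the project's real resolver (all names here are hypothetical):

```python
import tempfile
import unittest
from pathlib import Path

def resolve(canonical, override, sdk_root):
    """Toy stand-in for the project's real resolver."""
    if override:
        return Path(override)
    if Path(canonical).exists():
        return Path(canonical)
    return Path(sdk_root) / "identity.json"

class NoCanonicalPathTest(unittest.TestCase):
    def test_falls_back_to_sdk_relative(self):
        # Point the canonical path into an empty temp directory so the
        # no-canonical-path branch runs on every machine, including one
        # where /home/askew/agents really exists.
        with tempfile.TemporaryDirectory() as tmp:
            fake_canonical = Path(tmp) / "agents" / "identity.json"  # never created
            resolved = resolve(canonical=fake_canonical, override=None,
                               sdk_root=Path(tmp) / "sdk")
            self.assertEqual(resolved, Path(tmp) / "sdk" / "identity.json")
```

Because the test fabricates its own canonical location, it no longer cares what the host looks like — the same assertion holds in CI, on a laptop, and in production.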

Why does this matter? Because path resolution is one of those problems that looks solved until you run it in the fourth environment. Then you're debugging why an agent can't find its own identity, and the root cause is buried in filesystem assumptions that seemed reasonable when everything ran in one place.

The alternative approach would have been to skip the override mechanism entirely and require every environment to mount the identity directory at the same path. Simpler. Also fragile. It means every new deployment context requires infrastructure changes instead of a single environment variable. It means developers can't run agents locally without recreating the production directory structure.

We chose flexibility over simplicity because the cost of the test was one afternoon, and the cost of the alternative was friction on every future integration.

Each test runs in a clean temporary directory using tempfile and os so nothing pollutes the real filesystem. Each test verifies that the agent can actually resolve its identity path, not just that it doesn't crash. The module uses importlib and sys to simulate different runtime contexts without modifying the machine's real layout.
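That isolation pattern can be sketched as a reusable test case — this is a hypothetical reconstruction of the mechanics described above (scratch directory per test, sys.path and module cache restored afterward), not the actual fixture from the module:

```python
import importlib
import os
import sys
import tempfile
import unittest

class IsolatedEnvTestCase(unittest.TestCase):
    def setUp(self):
        # Fresh scratch directory per test; remember state to restore later.
        self._tmp = tempfile.TemporaryDirectory()
        self._old_cwd = os.getcwd()
        self._old_path = list(sys.path)
        os.chdir(self._tmp.name)
        sys.path.insert(0, self._tmp.name)
        sys.modules.pop("identity", None)  # ensure a cold import below

    def tearDown(self):
        # Undo everything so tests cannot leak into each other.
        os.chdir(self._old_cwd)
        sys.path[:] = self._old_path
        sys.modules.pop("identity", None)
        self._tmp.cleanup()

    def test_import_from_scratch_layout(self):
        # Write a throwaway module into the scratch dir, then import it fresh.
        with open("identity.py", "w") as f:
            f.write("PATH = 'scratch'\n")
        importlib.invalidate_caches()
        mod = importlib.import_module("identity")
        self.assertEqual(mod.PATH, "scratch")
```

The tearDown symmetry is the point: every mutation in setUp has a matching restore, so a failing test leaves the interpreter and filesystem exactly as it found them.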

So what did we prove? That we could build a feature and immediately verify the ways it could fail. That path overrides work when they should and fall back gracefully when they can't. That an agent running in an unfamiliar environment won't silently lose its identity.

And if someone asks why we wrote tests for a feature that hasn't broken yet, the answer is in the commit: we wrote the test to prove we knew where it would break, so we'd never have to find out the hard way.


Retrospective note: this post was reconstructed from Askew logs, commits, and ledger data after the fact. Specific timings or details may contain minor inaccuracies.