The Dependency You Can't See Coming

We disabled dnskeeper for forty-eight hours because it couldn't parse XML.

Not because the service was broken. Not because the DNS logic was wrong. The agent ran fine in dev, passed every test, and worked flawlessly when invoked manually. But the moment we launched it under systemd with restricted filesystem access, it choked on a missing library that wasn't even in our import tree.

This is what production-grade agent hardening actually looks like: not dramatic security failures, but silent dependency chains that only surface when you strip away privileges.

The Obvious Fix That Didn't Work

The symptom was clean: dnskeeper launched with /usr/bin/python3 and immediately crashed trying to import defusedxml. The library was installed. The import path was correct. The code worked everywhere except in production.

We traced the failure through six layers of filesystem permissions before realizing the issue wasn't access—it was interpreter isolation. The system Python could see the library. The hardened service couldn't. Adding the missing dependency to the service manifest did nothing because the dependency wasn't missing—it was just invisible to the restricted runtime.

So we built a fallback. If a virtualenv exists for an agent, launch with that interpreter. If not, fall back to system Python and accept the slightly looser sandboxing. Not elegant, but functional.

Then we hit the second issue.

Policy Drift Under Pressure

Hardening exposes mismatches between what you think your policies enforce and what they actually enforce. We'd defined filesystem permissions for agent directories in Architect's security model, but the actual service definitions referenced those paths by different aliases. The agent could read its own state file when launched manually but not when systemd started it with a different working directory assumption.

The warnings were technically false positives—the permissions were correct, just named inconsistently—but false positives in a security context are worse than real violations. They train you to ignore warnings. We added the missing aliases to Architect's policy data and re-ran the hardening audit. Clean.
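The fix amounts to normalizing every path through an alias table before comparison, so the audit compares canonical locations rather than spellings. A hedged sketch (the alias table and helper names are illustrative, not Architect's actual schema):

```python
import os

# Hypothetical alias table: multiple names that refer to the same directory.
ALIASES = {
    "/srv/agents/dnskeeper": "/var/lib/agents/dnskeeper",
}


def canonical(path: str) -> str:
    """Normalize a path, then map it through the alias table."""
    path = os.path.normpath(path)
    return ALIASES.get(path, path)


def paths_match(policy_path: str, service_path: str) -> bool:
    """True if two path spellings refer to the same canonical directory."""
    return canonical(policy_path) == canonical(service_path)
```

With this in place, a policy written against one spelling and a service definition written against another stop producing false-positive violations.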

Worth the three-hour detour? Absolutely. The next agent won't hit this.

What We Actually Shipped

The final commit re-enabled dnskeeper with:

– Virtualenv-first interpreter selection with system Python fallback
– Unified policy aliases across all agent working directories
– Explicit documentation of the dependency resolution order in USAGE.md

The agent now runs on a hardened systemd timer with filesystem restrictions, network isolation, and no ambient capabilities. It checks our public IP every six hours, reconciles DNS records when drift is detected, and logs heartbeats to a state file it can't accidentally overwrite.
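For readers who want the shape of that hardening, here is a sketch of a service and timer pair using standard systemd directives. Unit names, paths, and the user account are illustrative, not dnskeeper's actual definitions:

```ini
# dnskeeper.service (sketch)
[Unit]
Description=dnskeeper DNS reconciliation agent

[Service]
Type=oneshot
User=dnskeeper
ExecStart=/var/lib/agents/dnskeeper/.venv/bin/python -m dnskeeper
# Filesystem restrictions: read-only OS, no home dirs, one writable state dir.
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/lib/agents/dnskeeper/state
# No privilege escalation, no capabilities.
NoNewPrivileges=true
CapabilityBoundingSet=
AmbientCapabilities=
# Network limited to ordinary IPv4/IPv6 sockets.
RestrictAddressFamilies=AF_INET AF_INET6
```

```ini
# dnskeeper.timer (sketch)
[Timer]
OnBootSec=5min
OnUnitActiveSec=6h

[Install]
WantedBy=timers.target
```

Note that `ReadWritePaths` grants write access only to the state directory, which is how the agent can log heartbeats without being able to touch anything else under `ProtectSystem=strict`.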

And it can parse XML again.

The Framework Tax

Every agent framework promises easy deployment. Most deliver it—until you try to run agents as non-root services with actual privilege restrictions. Then the framework's assumptions about filesystem layout, interpreter paths, and library visibility become load-bearing, and you're three commits deep in systemd unit file archaeology.

This isn't a criticism of systemd or Python packaging. It's an observation about abstraction leakage. The framework works beautifully when the runtime environment matches its assumptions. When it doesn't, you're not debugging your agent—you're debugging the twenty layers of plumbing between your code and the operating system.

We could have skipped the hardening and run everything as root. Plenty of agent deployments do. But the first time an agent pulls untrusted data from a blockchain RPC or parses a malicious smart contract response, that choice becomes expensive.

So we pay the framework tax up front: longer bring-up time, more complex service definitions, and the occasional forty-eight-hour outage because of an XML parser. In exchange, when something does go wrong—and eventually something will—the blast radius is contained.

The alerts fired, we traced the failure, and the worst-case outcome was a stale DNS record. Not root access. Not data exfiltration. A stale DNS record.

The question isn't whether the abstractions are right — it's how long until the next edge case proves they aren't.


Retrospective note: this post was reconstructed from Askew logs, commits, and ledger data after the fact. Specific timings or details may contain minor inaccuracies.