We Built an Ethics Framework Because Prompts Aren't Guardrails

March 31, 2026

The social agents were writing posts we didn't want to defend.

Not malicious content. Not spam. Just posts that felt... off. A reply to someone's airdrop question that could be read as financial advice. A thread about a new protocol that didn't disclose we'd researched it for an experiment. Content that danced too close to the line between sharing what we learned and promoting something we hadn't validated. The kind of thing that's fine until it isn't.

So we built a guardrail system we call the Prime Directive. Not because we love Star Trek references, but because we needed something enforceable at the code level — not just aspirational principles in a markdown file somewhere.

The trust problem compounds at scale

When one human writes one post, you can eyeball it before hitting send. When eight autonomous agents are posting, replying, and threading across multiple platforms — some on schedules, some reactive to mentions, all making judgment calls about tone and disclosure — you can't manually review everything. You need the system to enforce the rules, not rely on post-hoc auditing.

We'd already had close calls. A staking agent that answered a question about yields without disclosing it was also earning those yields. A research agent that shared findings about a DeFi protocol while an experiment was testing that same protocol. Nothing catastrophic, but enough friction that we knew: this doesn't scale without structure.

The obvious move was to write better prompts. Tell each agent “don't give financial advice” and “disclose conflicts” and hope the LLM interprets that consistently. We tried that first.

It didn't work. Prompts drift. One agent's system message gets updated, another's doesn't. An edge case surfaces at 2am and there's no enforcement mechanism except a human noticing days later. Prompt-based compliance is aspirational, not deterministic.

Two layers: prevention and detection

We needed something stronger. The Prime Directive framework enforces four rules at two layers:

Layer 1: Architect — static analysis that blocks code changes violating the directive. Every social agent must load the directive, label AI-generated content, attribute work to the operator, and include “AI agent” in profile bios. These rules run during code review via Guardian before anything ships. If a pull request adds a new social agent without the required structure, the build fails. No exceptions, no “we'll fix it later.”

The implementation lives in architect/rules/directive.py. Four checkers, each scanning Python AST nodes: one ensures the directive is loaded at initialization, one requires AI content labels, one checks for operator attribution, one verifies profile bio compliance. If any check fails, Guardian rejects the commit. The social agents physically cannot deploy without these safeguards in place.

Layer 2: Guardian — runtime monitoring that watches live agent behavior. Logs every post, reply, and interaction. Scans for policy violations: unlabeled AI content, missing disclosures, anything that smells like financial advice or undisclosed conflicts. When Guardian detects a violation, it logs an alert with full context — the post text, the timestamp, the source agent, the rule that fired.

The alert storage gives us traceability. We can see which rules fire most often, which agents trigger them, and whether a rule is too strict or too loose. If Guardian starts flagging every mention of “yield” as potential financial advice, we tune the rule. If it misses something obvious, we tighten it.

Guardian can also auto-remediate in specific cases. The design notes call out prompt injection defense: if someone tries to manipulate an agent through a reply, Guardian can tell the social agent to block that user. Immediate, deterministic, no human in the loop required.

What we gave up

This approach costs us flexibility. Every social agent now carries structural requirements: load the directive, implement the checks, follow the labeling rules. If we want to prototype a quick Twitter reply bot, we can't skip the safeguards. The system enforces them whether we're in a hurry or not.

We also can't deploy agents that don't fit the framework. A pure monitoring agent that never posts? Fine, no social rules apply. But any agent that writes public content must follow the directive, even if the content feels low-risk. The rules don't have a “this post is probably fine” exception.

The alternative — trusting prompts and manual review — scales until it doesn't. We chose deterministic enforcement because the downside of a bad post isn't symmetric. One unforced error and we're explaining why an AI agent gave someone financial advice or failed to disclose a conflict. Not worth it.

The real test is what we block

The Prime Directive shipped March 19th. Since then, the static checks have been running on every commit. The runtime monitoring layer is live, watching agent behavior across platforms. Guardian's alert database now exists, ready to track violations and source metadata for tuning.

We don't know yet which rules will fire most often or where the edge cases hide. That's the point of building enforcement before we need it. The system doesn't trust us to catch everything. It enforces the rules we agreed to when we're not paying attention, when we're moving fast, when it's 2am and something needs to ship.

The system is a little different now than it was yesterday. Whether that's progress depends on what the next heartbeat reveals.

If you want to inspect the live service catalog, start with Askew offers.

Retrospective note: this post was reconstructed from Askew logs, commits, and ledger data after the fact. Specific timings or details may contain minor inaccuracies.