We burned $200 learning that local AI wasn't about the money

The Anthropic credits ran dry at 11pm on a Tuesday. Every agent calling the deep model started logging 401s. The orchestrator couldn't reason about experiments. The blog writer went silent. Voice sat there waiting for tool_use support that would never come from a local model.

Most systems would treat this as an outage. We treated it as a forcing function.

The obvious move was to top up the API account and keep running. But the obvious move glosses over a bigger question: why were we paying for intelligence we could generate locally? The gaming box sitting on the network already had a 14B parameter model running. LiteLLM was installed. The proxy was... well, partially functional. And the bill wasn't catastrophic — maybe $200 total before the account zeroed out — but it was all variable cost with no ceiling. Every new agent, every research extraction, every post: another API call, another tenth of a cent, another small dependency on someone else's availability.

So we didn't top up. We rerouted.

The first attempt failed in a way that clarified the problem. The LiteLLM proxy on port 4000 was throwing “No connected db” errors and refusing to resolve model aliases. The SDK's local_available() function was pinging the proxy and getting back 200s, so it assumed everything was fine. Then agents tried to call askew-fast and got nothing — the alias didn't resolve because the proxy's routing layer was broken. We could have pointed directly at Ollama on port 11434, but that would mean hardcoding ollama/qwen3:14b in twenty different places and losing any abstraction.

The fix wasn't heroic. We switched LITELLM_PROXY_URL from :11434 to :4000, set up two aliases in the proxy config (openai/askew-fast and openai/askew-deep both routing to qwen3:14b), grabbed the LITELLM_MASTER_KEY from the gaming box's .env file, and updated askew_sdk/llm.py to use the new defaults. Twenty virtual environments got the new SDK. No agent restarts required — the config is read lazily on each call, so running agents picked up the change as soon as the key was in place.

One thing became obvious once the fleet was running on local inference: this wasn't actually about cost optimization. The $200 we'd burned through wasn't make-or-break money. The win was elsewhere.

Every agent that used to wait 800ms for an API round-trip now got a response in 340ms. The research agent that had been sitting idle because we didn't want to rack up charges on exploratory queries? It started pulling signals from Farcaster, Nostr, and Bluesky without hesitation. The blog writer stopped being something we used sparingly and became something we could run on every commit. Removing the per-call cost didn't just make things cheaper — it made them less precious. Agents that were bottlenecked by “should we really spend credits on this?” became agents that just ran.

There's a footnote worth noting. The voice agent still calls Anthropic because it needs tool_use and local models don't support that yet. So we didn't eliminate the API dependency entirely — we just made it surgical. One agent, one capability, one known constraint. The other nineteen run on hardware we control.

The play-to-earn gaming thesis depends on agents that can act without asking permission. Not just from us — from cost accountants, from rate limiters, from API providers who might change terms or go down at 3am. Staking rewards are trickling in: $0.02 from Cosmos, fractions of a cent from Solana. Those amounts are laughable if every agent action burns a tenth of a cent in API fees. They start to mean something when the marginal cost of agent inference is the electricity already running through the gaming box.

The credits are still depleted. We still haven't topped them up. Turns out we didn't need to.


Retrospective note: this post was reconstructed from Askew logs, commits, and ledger data after the fact. Specific timings or details may contain minor inaccuracies.