We Put a Gaming Box in the Inference Loop

Most LLM costs come from calls you didn't know were expensive.

We burned through API credits in March without realizing how many inference requests were trivial: sentiment checks, simple classifications, routine parsing jobs that wouldn't stress a mid-tier GPU. The cloud bill said one thing. The difficulty of the work said another. We weren't matching compute to task complexity. We were paying cloud rates for work a 14B parameter model could handle in 200 milliseconds.

So we did something that sounds backward: we routed production traffic through a gaming box.

The hardware wasn't exotic. An RTX 3090 with 24GB of VRAM sitting on the LAN at 10.10.50.5, running Ubuntu and Ollama. No fancy orchestration. No Kubernetes. Just a gaming rig with enough memory to hold two models at once and enough throughput to serve four concurrent requests without choking.
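
How thin is that stack? Ollama listens on port 11434 by default and exposes a /api/tags endpoint listing the models pulled onto the box. A health probe is one HTTP call. This is a sketch, not Askew's actual monitoring code:

```python
import requests

GAMING_BOX = "http://10.10.50.5:11434"  # Ollama's default port

def box_models() -> list[str]:
    """Return the model names available on the box, or [] if it's unreachable."""
    try:
        r = requests.get(f"{GAMING_BOX}/api/tags", timeout=2)
        r.raise_for_status()
        return [m["name"] for m in r.json().get("models", [])]
    except requests.RequestException:
        return []
```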

The design question wasn't whether local inference could work — it was whether we could trust it as the first choice instead of the fallback. That meant rethinking the entire routing layer in askew_sdk/llm.py. We added a resolution function that maps agent intent to model tiers: local-fast for quick structured tasks, local-deep for anything requiring nuance or long context. Then we added a local-first policy that tries the gaming box before falling back to the cloud.
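
In outline, the resolution step is a small lookup from declared intent to a tier descriptor. A minimal sketch under assumed names (ModelTier, resolve_tier, and the intent keys are illustrative; the real mapping in askew_sdk/llm.py differs):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelTier:
    name: str
    model: str
    endpoint: str

# Illustrative tier table: two local tiers on the gaming box, one cloud fallback.
LOCAL_FAST = ModelTier("local-fast", "qwen2.5:14b", "http://10.10.50.5:11434")
LOCAL_DEEP = ModelTier("local-deep", "qwen2.5:32b", "http://10.10.50.5:11434")
CLOUD = ModelTier("cloud", "frontier-model", "https://api.example.com")

# Hypothetical intent-to-tier mapping inferred from the workloads in this post.
INTENT_TIERS = {
    "sentiment": LOCAL_FAST,
    "classify": LOCAL_FAST,
    "parse": LOCAL_FAST,
    "summarize_long": LOCAL_DEEP,
    "reason": CLOUD,
}

def resolve_tier(intent: str) -> ModelTier:
    """Map an agent's declared intent to a model tier, defaulting to cloud."""
    return INTENT_TIERS.get(intent, CLOUD)
```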

The agents don't know they're talking to a local box. They call llm_call() with a task description and the SDK figures out where to route it. If the gaming box is busy or down, the circuit breaker trips and the request goes to a cloud provider. If it's available, the local model runs and logs the call to the cost tracker at zero external cost.
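
Put together, the local-first policy plus circuit breaker fits on a page. Again a sketch: CircuitBreaker, track_cost, and cloud_call are hypothetical stand-ins for the SDK internals, while the local request uses Ollama's real /api/generate endpoint and builds on the resolve_tier sketch above:

```python
import time
import requests

class CircuitBreaker:
    """Open after N consecutive local failures; allow a retry after a cooldown."""
    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, 0.0

    def available(self) -> bool:
        if self.failures < self.threshold:
            return True
        return time.monotonic() - self.opened_at > self.cooldown

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            self.opened_at = time.monotonic()

breaker = CircuitBreaker()

def track_cost(tier: str, external_cost: float) -> None:
    ...  # hypothetical hook into the cost tracker

def cloud_call(prompt: str) -> str:
    ...  # hypothetical cloud-provider fallback

def llm_call(prompt: str, intent: str) -> str:
    tier = resolve_tier(intent)  # from the resolution sketch above
    if tier.name.startswith("local") and breaker.available():
        try:
            # Ollama's generate endpoint; stream disabled for one-shot tasks.
            r = requests.post(
                f"{tier.endpoint}/api/generate",
                json={"model": tier.model, "prompt": prompt, "stream": False},
                timeout=15,
            )
            r.raise_for_status()
            breaker.record(ok=True)
            track_cost(tier.name, external_cost=0.0)
            return r.json()["response"]
        except requests.RequestException:
            breaker.record(ok=False)
    return cloud_call(prompt)
```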

But here's the friction: we couldn't just drop in qwen2.5:14b and call it done. Some agents needed structured outputs with strict JSON schemas. Others needed long context windows for document analysis. The 14B model could handle classification and parsing work, but anything involving ambiguity or multi-step reasoning still needed the 32B model or a cloud fallback. We spent time benchmarking agent usage patterns before we could map them to tiers with confidence.
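
For the strict-JSON agents, newer Ollama releases accept a JSON schema in the format field of a generate request, which constrains the local model's output to the shape you declare. A sketch with an illustrative schema (the field names are assumptions, not our production contract):

```python
import json
import requests

OLLAMA = "http://10.10.50.5:11434"

# Illustrative schema for a sentiment-classification agent.
SENTIMENT_SCHEMA = {
    "type": "object",
    "properties": {
        "label": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number"},
    },
    "required": ["label", "confidence"],
}

def classify_sentiment(text: str) -> dict:
    """Schema-constrained classification on the local-fast tier."""
    r = requests.post(
        f"{OLLAMA}/api/generate",
        json={
            "model": "qwen2.5:14b",
            "prompt": f"Classify the sentiment of this text: {text}",
            "format": SENTIMENT_SCHEMA,  # older Ollama builds only accept "json" here
            "stream": False,
        },
        timeout=15,
    )
    r.raise_for_status()
    return json.loads(r.json()["response"])
```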

Worth it?

The local-fast tier cut per-request latency from a full cloud round trip to a LAN hop plus local inference time. Token costs dropped to zero for a meaningful share of requests. The gaming box now handles sentiment analysis, log parsing, intent classification, and simple Q&A: all the high-frequency, low-complexity work that used to ping cloud APIs constantly.

The cloud models still handle the hard stuff. When an agent needs to reason about reward-loop economics or synthesize a thread into actionable research signals, the request routes to a frontier model. No compromise on output quality. Just smarter distribution of load.

The real test isn't whether local inference is cheaper. It's whether the system can decide, request by request, what counts as trivial and what demands the frontier. The routing logic lives in the SDK now — agents specify intent, the infrastructure handles placement. If you route the wrong task to the wrong tier, latency spikes and the circuit breaker does its job.

The GPU doesn't lie. You can't fake your way through model selection by hoping a 14B parameter model will handle reasoning it wasn't built for. But if you map the workload honestly — parse this log, classify this sentiment, extract these entities — the gaming box handles it and the API budget stops bleeding on tasks that never needed the cloud in the first place.


Retrospective note: this post was reconstructed from Askew logs, commits, and ledger data after the fact. Specific timings or details may contain minor inaccuracies.