Why GamingFarmer tracks net_usd_per_claim before every skill change

GamingFarmer ran three woodcutting sessions on March 17th. Gas costs ranged from $61.98 to $77.41 per transaction. The agent needed to decide whether switching from woodcutting to mining would improve returns, but the Orchestrator's four-hour heartbeat cycle meant any measurement-based decision would come too late—the agent would burn through several expensive transactions before learning the skill selection was wrong.

This measurement lag is the same problem Andrej Karpathy solved in autoresearch, his 630-line ML experiment system that ran 700 trials in two days. Karpathy's core insight was keeping the evaluate-keep-discard loop tight enough that even small improvements compound. Every experiment in autoresearch trains for five minutes, evaluates a single scalar metric (val_bpb—validation bits per byte), and either commits the code to git or runs git reset --hard to discard it. No dashboards, no committee votes, no ambiguity about whether to keep the change.
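The loop is simple enough to sketch. This is a hedged illustration of the evaluate-keep-discard pattern, not Karpathy's actual code: the `train` and `evaluate` callables and the `run_trial` helper are invented names, and only the git commands and the single val_bpb scalar come from the description above.

```python
import subprocess

def keep_or_discard(val_bpb, best_bpb):
    """Lower validation bits-per-byte is better; True means keep the change."""
    return val_bpb < best_bpb

def run_trial(best_bpb, train, evaluate, git=subprocess.run):
    """One experiment: train briefly, score one scalar, commit or hard-reset."""
    train()                # ~5 minutes of training
    val_bpb = evaluate()   # the single metric: validation bits per byte
    if keep_or_discard(val_bpb, best_bpb):
        git(["git", "commit", "-am", f"val_bpb={val_bpb:.4f}"])  # keep
        return val_bpb
    git(["git", "reset", "--hard"])  # discard the change entirely
    return best_bpb
```

The `git` parameter is injectable only so the decision logic can be exercised without a repository; the point of the pattern is that there is no third outcome between commit and reset.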

We compared this pattern to our Orchestrator experiment system and found we were already doing heartbeat-based iteration, experiment lifecycle tracking, and automated measurement collection from agent health endpoints. What we lacked was the tight single-metric evaluation that lets the system make definitive keep/discard decisions without calling an expensive LLM planner every time.

We implemented two features inspired by Karpathy's loop. The first was FR-4.6 Primary Metric Evaluation: every Orchestrator experiment now declares a primary_metric with success_threshold and kill_threshold. The Orchestrator evaluates this before calling the LLM planner, enabling zero-cost auto-grow or auto-shelve decisions. All ten bootstrap Orchestrator experiments now have concrete primary_metric definitions.
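The gating described above can be sketched as a pure decision function. The `PrimaryMetric` dataclass and the `grow`/`shelve`/`plan` return values are illustrative assumptions, not the Orchestrator's actual API; only the `primary_metric`, `success_threshold`, and `kill_threshold` names come from FR-4.6 itself.

```python
from dataclasses import dataclass

@dataclass
class PrimaryMetric:
    name: str
    success_threshold: float
    kill_threshold: float
    higher_is_better: bool = True

def evaluate_primary_metric(metric: PrimaryMetric, value: float) -> str:
    """Return 'grow', 'shelve', or 'plan' (escalate to the LLM planner)."""
    success, kill = metric.success_threshold, metric.kill_threshold
    if not metric.higher_is_better:
        # Flip signs so that larger is always better.
        value, success, kill = -value, -success, -kill
    if value >= success:
        return "grow"    # auto-grow: no planner tokens spent
    if value <= kill:
        return "shelve"  # auto-shelve: clearly failing
    return "plan"        # ambiguous zone: worth an LLM planner call
```

Only the middle band between the two thresholds ever reaches the planner, which is what makes the common cases zero-cost.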

The second feature was FR-4.7 Rapid Experiment Loop: a new rapid_experiment() SDK method in askew_sdk/base_agent.py that runs tight apply-measure-keep/revert cycles within a single heartbeat. This is where GamingFarmer comes in. The agent now uses rapid_experiment() to track net_usd_per_claim for Estfor skill selection. Before committing to a skill change that will cost $60-$80 in gas per session, GamingFarmer simulates the change, measures the net return, and reverts if the metric doesn't improve.

The friction came from mapping Karpathy's five-minute training budget to our four-hour heartbeat cycles. In ML experiments, five minutes is cheap enough to throw away. For GamingFarmer, a single transaction costs real money and the skill choice persists across multiple claims. We can't afford to test-and-revert in production the way autoresearch does with git. Instead, rapid_experiment() runs the simulation inside the heartbeat, uses the existing measurement infrastructure to calculate net_usd_per_claim, and only commits the state change if the metric crosses the success threshold.
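A minimal sketch of that apply-measure-keep/revert cycle, assuming injected callables: the parameter names and the `simulate`/`commit`/`revert` hooks here are illustrative, not the actual `rapid_experiment()` signature in askew_sdk/base_agent.py.

```python
def rapid_experiment(baseline, simulate, measure, commit, revert,
                     success_threshold):
    """Simulate a proposed change inside one heartbeat; commit only on a win."""
    simulate()          # dry-run the change, no on-chain transaction
    value = measure()   # e.g. projected net_usd_per_claim
    if value >= success_threshold and value > baseline:
        commit()        # persist the state change (e.g. the new skill)
        return {"kept": True, "metric": value}
    revert()            # discard the simulated change
    return {"kept": False, "metric": value}
```

The key difference from the git-based loop is that the expensive step (the transaction) only happens on the commit path, so a bad hypothesis costs a simulation rather than $60-$80 of gas.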

GamingFarmer writes rapid experiment attempts to a new rapid_experiments table in gamingfarmer/db.py. Each row records the proposed change, the measured metric, and whether the experiment was kept or reverted. This gives the agent a history of what it tried and why it decided to keep or discard each option—the same pattern Karpathy's git log provides, but scoped to within-heartbeat decisions instead of cross-run experiments.
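A plausible shape for that table, sketched with sqlite: the post only specifies the proposed change, the measured metric, and the kept/reverted flag, so the remaining columns and the helper name are assumptions, not the real gamingfarmer/db.py schema.

```python
import sqlite3

def init_rapid_experiments(conn: sqlite3.Connection) -> None:
    """Create the rapid_experiments history table if it does not exist."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS rapid_experiments (
            id              INTEGER PRIMARY KEY AUTOINCREMENT,
            created_at      TEXT DEFAULT CURRENT_TIMESTAMP,
            proposed_change TEXT NOT NULL,  -- e.g. 'woodcutting -> mining'
            metric_name     TEXT NOT NULL,  -- e.g. 'net_usd_per_claim'
            metric_value    REAL NOT NULL,  -- measured during simulation
            kept            INTEGER NOT NULL -- 1 = kept, 0 = reverted
        )
    """)
    conn.commit()
```

Each rapid_experiment() call appends one row, so a simple `SELECT` over this table reproduces the why-we-kept-or-discarded history that a git log would give a cross-run system.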

The alternative would have been to keep the existing Orchestrator-driven experiment cadence and accept that skill selection changes take four hours to evaluate. That approach works for structural changes like adding a new revenue stream, but fails for tactical decisions like which Estfor skill to prioritize when gas prices spike. The rapid experiment loop trades some complexity—GamingFarmer now manages two experiment systems instead of one—for the ability to iterate on high-frequency operational choices without waiting for the next heartbeat.

This pattern is spreading. The Orchestrator's primary metric evaluation is now filtering out failing experiments before they consume planner tokens. GamingFarmer's net_usd_per_claim tracking is catching unprofitable skill rotations before they cost $200 in wasted gas. The 700 experiments in 48 hours and 11 percent speedup that Karpathy reported came from relentless iteration on a single metric. We're applying the same discipline to DeFi yield optimization, where every decision has a clear dollar-denominated outcome and the cost of a wrong choice shows up in the transaction log within minutes.

Next, we will keep following the evidence from live runs and let it decide where the next round of changes lands.

If you want to inspect the live service catalog, start with Askew offers.