This is the second article in a series on agentic engineering and AI-driven development. Read part one here, and look for the next article on April 2 on O’Reilly Radar.
The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time.
—Tom Cargill, Bell Labs
One of the experiments I’ve been running as part of my work on agentic engineering and AI-driven development is a blackjack simulation in which an LLM plays hundreds of hands against blackjack strategies written in plain English. The AI uses these strategy descriptions to make hit/stand/double-down decisions for each hand, while deterministic code deals the cards, checks the math, and verifies that the rules were followed correctly.
Early runs of my simulation had a 37% pass rate. The LLM would add up card totals wrong, skip the dealer’s turn entirely, or ignore the strategy it was supposed to follow. The big problem was that these errors compounded: If the model miscounted the player’s total on the third card, every decision after that was based on a wrong number, so the whole hand was garbage even if the rest of the logic was fine.
There’s a useful way to think about reliability problems like that: the March of Nines. Getting an LLM-based system to 90% reliability is the first nine, and it’s the “easy” one. Getting from 90% to 99% takes roughly the same amount of engineering effort. So does getting from 99% to 99.9%. Each nine costs about as much as the last, and you never stop marching. Andrej Karpathy coined the term from his experience building self-driving systems at Tesla, where they spent years earning two or three nines and still had more to go.
Here’s a small exercise that shows how that kind of failure compounding works. Open any AI chatbot running an early 2026 model (I used ChatGPT 5.3 Instant) and paste the following eight prompts one at a time, each in a separate message. Go ahead, I’ll wait.
Prompt 1: Track a running “score” through a 7-step game. Don’t use code, Python, or tools. Do this entirely in your head. For each step, I will give you a sentence and a rule.
CRITICAL INSTRUCTION: You must reply with ONLY the mathematical equation showing how you updated the score. Example format: 10 + 5 = 15 or 20 / 2 = 10. Don’t list the words you counted, don’t explain your reasoning, and don’t write any other text. Just the equation.
Start with a score of 10. I’ll give you the first step in the next prompt.
Prompt 2: “The sudden blizzard chilled the small village communities.” Add the number of words containing double letters (two of the exact same letter back-to-back, like ‘tt’ or ‘mm’).
Prompt 3: “The clever engineer needed seven perfect pieces of cheese.” If your score is ODD, add the number of words that contain EXACTLY two ‘e’s. If your score is EVEN, subtract the number of words that contain EXACTLY two ‘e’s. (Don’t count words with one, three, or zero ‘e’s.)
Prompt 4: “The good sailor joined the eager crew aboard the wooden boat.” If your score is greater than 10, subtract the number of words containing consecutive vowels (two different or identical vowels back-to-back, like ‘ea’, ‘oo’, or ‘oi’). If your score is 10 or less, multiply your score by this number.
Prompt 5: “The quick brown fox jumps over the lazy dog.” Add the number of words where the THIRD letter is a vowel (a, e, i, o, u).
Prompt 6: “Three brave kings stand under black skies.” If your score is an ODD number, subtract the number of words that have exactly 5 letters. If your score is an EVEN number, multiply your score by the number of words that have exactly 5 letters.
Prompt 7: “Look down, you shy owl, go fly away.” Subtract the number of words that contain NONE of these letters: a, e, or i.
Prompt 8: “Green apples fall from tall trees.” If your score is greater than 15, subtract the number of words containing the letter ‘a’. If your score is 15 or less, add the number of words containing the letter ‘l’.
The exercise tracks a running score through seven steps. Each step gives the model a sentence and a counting rule, and the score carries forward. The correct final score is 60. Here’s the answer key: start at 10, then 16 (10+6), 12 (16−4), 5 (12−7), 10 (5+5), 70 (10×7), 63 (70−7), 60 (63−3).
I ran this twice at the same time (using ChatGPT 5.3 Instant) and got two completely different wrong answers the first time I tried it. Neither run reached the correct score of 60:
| Step | Correct | Run 1 (transcript) | Run 2 (transcript) |
|---|---|---|---|
| 1. Double letters | 10 + 6 = 16 | 10 + 2 = 12 ❌ | 10 + 5 = 15 ❌ |
| 2. Exactly two ‘e’s | 16 − 4 = 12 | 12 − 4 = 8 ❌ | 15 + 4 = 19 ❌ |
| 3. Consecutive vowels | 12 − 7 = 5 | 8 × 7 = 56 ❌ | 19 − 5 = 14 ❌ |
| 4. Third letter vowel | 5 + 5 = 10 | 56 + 5 = 61 ❌ | 14 + 3 = 17 ❌ |
| 5. Exactly 5 letters | 10 × 7 = 70 | 61 − 7 = 54 ❌ | 17 − 4 = 13 ❌ |
| 6. No a, e, or i | 70 − 7 = 63 | 54 − 7 = 47 ❌ | 13 − 3 = 10 ❌ |
| 7. Words with ‘a’ or ‘l’ | 63 − 3 = 60 | 47 − 3 = 44 ❌ | 10 + 4 = 14 ❌ |
The two runs tell very different stories. In Run 1, the model miscounted in Step 1 (it found 2 double-letter words instead of 6) but actually got the later counts right. It didn’t matter. The wrong score in Step 1 flipped a branch in Step 3, triggering a multiply instead of a subtract, and the score never recovered. One early mistake threw off the entire chain, even though the model was doing good work after that.
Run 2 was a disaster. The model miscounted at almost every step, compounding errors on top of errors. It ended at 14 instead of 60. That’s closer to what Karpathy is describing with the March of Nines: Each step has its own reliability ceiling, and the longer the chain, the higher the chance that at least one step fails and corrupts everything downstream.
What makes this insidious: Both runs look the same from the outside. Every step produced a plausible answer, and both runs produced final results. Without the answer key (or some tedious manual checking), you’d have no way of knowing that Run 1 was a near-miss derailed by a single early error and Run 2 was wrong at nearly every step. This is typical of any process where the output of one LLM call becomes the input for the next one.
These failures don’t prove the March of Nines itself; that term is specifically about the engineering effort to push reliability from 90% to 99% to 99.9%. (It’s possible to reproduce the full compounding-reliability problem in a chat, but a prompt that did it reliably would be far too long to put in an article.) Instead, I opted for a shorter exercise, which you can easily try yourself, that demonstrates the underlying problem that makes the march so hard: cascading failures. Each step asks the model to count letters within words, which is deterministic work that a short Python script handles perfectly. LLMs, on the other hand, don’t actually treat words as strings of characters; they see them as tokens. Spotting double letters means unpacking a token into its characters, and the model gets that wrong just often enough to reliably screw it up. I added branching logic where each step’s result determines the next step’s operation, so a single miscount in Step 1 cascades through the entire sequence.
I also want to be clear about exactly what a deterministic version of this simulation looks like. Fortunately, the AI can help us with that. Go to either run (or your own) and paste one more prompt into the chat:
Prompt 9: Now write a short Python script that does exactly what you just did: start with a score of 10, apply each of the seven rules to the seven sentences, and print the equation at each step.
Run the script. It should print the correct answer for every step, ending at 60. The same AI that just failed the exercise can write code that does it flawlessly, because now it’s generating deterministic logic instead of trying to count characters through its tokenizer.
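For reference, here’s one version of what such a script might look like. This is my own sketch, not output from either run, and any reasonable implementation of the seven rules lands on the same final score:

```python
# A deterministic version of the seven-step exercise. Plain string
# operations replace the LLM's in-head letter counting.
VOWELS = set("aeiou")

def tokenize(sentence):
    """Lowercase the sentence and strip punctuation from each word."""
    return [w.strip(".,").lower() for w in sentence.split()]

score = 10

# Step 1: add words containing a doubled letter (e.g., 'zz' in "blizzard").
w = tokenize("The sudden blizzard chilled the small village communities.")
score += sum(any(a == b for a, b in zip(x, x[1:])) for x in w)   # 10 + 6 = 16

# Step 2: odd score adds, even score subtracts, words with exactly two 'e's.
w = tokenize("The clever engineer needed seven perfect pieces of cheese.")
n = sum(x.count("e") == 2 for x in w)
score = score + n if score % 2 else score - n                    # 16 - 4 = 12

# Step 3: score > 10 subtracts consecutive-vowel words; otherwise multiply.
w = tokenize("The good sailor joined the eager crew aboard the wooden boat.")
n = sum(any(a in VOWELS and b in VOWELS for a, b in zip(x, x[1:])) for x in w)
score = score - n if score > 10 else score * n                   # 12 - 7 = 5

# Step 4: add words whose third letter is a vowel.
w = tokenize("The quick brown fox jumps over the lazy dog.")
score += sum(len(x) >= 3 and x[2] in VOWELS for x in w)          # 5 + 5 = 10

# Step 5: odd score subtracts five-letter words; even score multiplies by them.
w = tokenize("Three brave kings stand under black skies.")
n = sum(len(x) == 5 for x in w)
score = score - n if score % 2 else score * n                    # 10 * 7 = 70

# Step 6: subtract words containing none of 'a', 'e', or 'i'.
w = tokenize("Look down, you shy owl, go fly away.")
score -= sum(not set(x) & set("aei") for x in w)                 # 70 - 7 = 63

# Step 7: score > 15 subtracts words with 'a'; otherwise add words with 'l'.
w = tokenize("Green apples fall from tall trees.")
if score > 15:
    score -= sum("a" in x for x in w)                            # 63 - 3 = 60
else:
    score += sum("l" in x for x in w)

print(score)  # 60
```

Thirty-odd lines of ordinary Python, correct on every run, with no tokens spent.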
Reproducing a cascading failure in a chat
I deliberately engineered the exercise above to give you a way to experience the cascading failure problem behind the March of Nines yourself. I took advantage of something current LLMs genuinely suck at: parsing characters within tokens. Future models might do a much better job with this specific kind of failure, but the cascading failure problem doesn’t go away when the model gets smarter. As long as LLMs are nondeterministic, any step that relies on them has a reliability ceiling below 100%, and those ceilings still multiply. The specific weakness changes; the math doesn’t.
I also specifically asked the model to show only the equation and skip all intermediate reasoning, to prevent it from using chain of thought (or CoT) to self-correct. Chain of thought is a technique where you require the model to show its work step-by-step (for example, listing the words it counted and explaining why each one qualifies), which helps it catch its own mistakes along the way. CoT is a common way to improve LLM accuracy, and it works. As you’ll see later when I talk about the evolution of my blackjack simulation, CoT cut certain errors roughly in half. But “half as many errors” is still not zero. Plus, it’s expensive: It costs more tokens and more time. A Python script that counts double letters gets the right answer on every run, instantly, for zero AI API costs (or, if you’re running the AI locally, for orders of magnitude less CPU usage). That’s the core tension: You can spend engineering effort making the LLM better at deterministic work, or you can just hand that work to code.
Every step in this exercise is deterministic work that code handles flawlessly. But most interesting LLM tasks aren’t like that. You can’t write a deterministic script that plays a hand of blackjack using natural-language strategy rules, or decides how a character should respond in dialogue. Real work requires chaining multiple steps together into a pipeline: a reproducible series of steps (some deterministic, some requiring an LLM) that leads to a single result, where each step’s output feeds the next. If that sounds like what you just saw in the exercise, it is. Except real pipelines are longer, more complex, and much harder to debug when something goes wrong in the middle.
LLM pipelines are especially prone to the March of Nines
I’ve been spending a lot of time thinking about LLM pipelines, and I suspect I’m in the minority. Most people using LLMs are working with single prompts or short conversations. But once you start building multistep workflows where the AI generates structured data that feeds into the next step (whether that’s a content generation pipeline, a data processing chain, or a simulation), you run straight into the March of Nines. Each step has a reliability ceiling, and those ceilings multiply. The exercise you just tried had seven steps. The blackjack pipeline has more, and I’ve been running it hundreds of times per iteration.
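The arithmetic behind those multiplying ceilings is worth seeing once. With illustrative numbers (not measurements from my pipeline):

```python
# If each step in a chain succeeds independently, the chain's overall
# reliability is the product of the per-step reliabilities.
per_step = 0.95   # a step that "almost always" works
steps = 7         # the same length as the exercise above
chain = per_step ** steps
print(f"{chain:.1%}")     # roughly 69.8%: seven "95% reliable" steps lose almost a third of runs

# Conversely, for a 7-step chain to hit 99% overall, each step
# needs to be right about 99.86% of the time.
required = 0.99 ** (1 / steps)
print(f"{required:.2%}")
```

That second number is why each nine gets harder: chain-level nines have to be earned at every single step.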

That’s a screenshot of the blackjack pipeline in Octobatch, the tool I built to run these pipelines at scale. The pipeline deals cards deterministically, asks the LLM to play each hand following a strategy described in plain English, then validates the results with deterministic code. Octobatch makes it easy to change the pipeline and rerun hundreds of hands, which is how I iterated through eight versions, and how I learned the hard way that the March of Nines wasn’t just a theoretical problem but something I could watch happening in real time across hundreds of data points.
Running pipelines at scale made the failures obvious and fast, which, for me, really underscored an effective approach to minimizing the cascading failure problem: make deterministic work deterministic. That means asking whether every step in the pipeline actually needs to be an LLM call. Checking that a jack, a five, and an eight add up to 23 doesn’t require a language model. Neither does looking up whether standing on 15 against a dealer 10 follows basic strategy. That’s arithmetic and a lookup table, work that ordinary code does perfectly every time. And as I learned over the course of improving the failure rate for the pipeline, every step you pull out of the LLM and make deterministic goes to 100% reliability, which stops it from contributing to the compound failure rate.
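To make “arithmetic and a lookup table” concrete, here’s a toy sketch. The table covers only a few hard totals, not the full basic-strategy chart, and the names are mine, not the pipeline’s:

```python
# A tiny, hypothetical slice of a basic-strategy lookup table:
# (player hard total, dealer upcard value) -> correct action.
BASIC_STRATEGY = {
    (15, 10): "hit",     # standing on 15 vs. a dealer 10 is a strategy error
    (15, 6): "stand",
    (13, 2): "stand",
    (11, 10): "double",
}

def compliant(player_total: int, dealer_upcard: int, action: str) -> bool:
    """Deterministically check a play against the table. No LLM judgment needed."""
    return BASIC_STRATEGY.get((player_total, dealer_upcard)) == action

print(compliant(15, 10, "stand"))  # False
print(compliant(15, 10, "hit"))    # True
```

A dictionary lookup either matches or it doesn’t; there is no reliability ceiling to march against.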
Relying on the AI for deterministic work is the computation side of a pattern I wrote about for data in “AI, MCP, and the Hidden Costs of Data Hoarding.” Teams dump everything into the AI’s context because the AI can handle it, right up until it can’t. The same thing happens with computation: Teams let the AI do arithmetic, string matching, or rule evaluation because it mostly works. But “mostly works” is expensive and slow, and a short script does it perfectly. Better yet, the AI can write that script for you, which is exactly what Prompt 9 demonstrated.
Getting cascading failures out of the blackjack pipeline
I pushed the blackjack pipeline through eight iterations, and the results taught me more about earning nines than I expected. That’s why I’m writing this article: the iteration arc turned out to be one of the clearest illustrations I’ve found of how the principle works in practice.
I addressed failures in two ways, and the distinction matters.
Some failures called for making work deterministic. Card dealing runs as a local expression step, which doesn’t require an API call, so it’s free, instant, and 100% reproducible. There’s a math verification step that uses code to recalculate totals from the actual cards dealt and compare them against what the LLM reported, and a strategy compliance step that checks the player’s first action against a deterministic lookup table. Neither of those steps requires the AI to make a judgment call; when I initially ran them as LLM calls, they introduced errors that were hard to detect and expensive to debug.
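Here’s a minimal sketch of what a math verification step like that might look like. The card encoding and function names are my assumptions for illustration, not the pipeline’s actual code:

```python
# Recompute a hand total from the actual cards dealt and compare it
# against the total the LLM reported. Aces start at 11 and drop to 1
# as needed to avoid busting.
CARD_VALUES = {str(n): n for n in range(2, 11)} | {"J": 10, "Q": 10, "K": 10, "A": 11}

def hand_total(cards: list[str]) -> int:
    total = sum(CARD_VALUES[c] for c in cards)
    soft_aces = cards.count("A")
    while total > 21 and soft_aces:  # demote aces from 11 to 1
        total -= 10
        soft_aces -= 1
    return total

def math_checks_out(cards: list[str], reported_total: int) -> bool:
    """Deterministically verify the total the LLM reported for this hand."""
    return hand_total(cards) == reported_total

print(math_checks_out(["J", "5", "8"], 23))  # True: a jack, a five, and an eight total 23
print(math_checks_out(["A", "9", "5"], 25))  # False: the ace must count as 1, so the total is 15
```

The check is a recomputation, not a judgment, so it catches exactly the miscounting errors that cascade.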
Other failures called for structural constraints that made specific error patterns harder to produce. Chain of thought format forced the LLM to show its work instead of jumping to conclusions. A rigid dealer output structure made it mechanically difficult to skip the dealer’s turn. Explicit warnings about counterintuitive rules gave the LLM a reason to override its training priors. These don’t remove the LLM from the step; they make the LLM more reliable within it.
But before any of that mattered, I had to face the uncomfortable fact that measurements themselves can be wrong, especially when you rely on AI to take those measurements. For example, the first run reported a 57% pass rate, which was great! But when I looked at the data myself, a lot of runs were clearly wrong. It turned out that the pipeline had a bug: Verification steps were running, but the AI step that was supposed to enforce them didn’t have adequate guardrails, so almost every hand passed regardless of the actual data. I asked three AI advisors to review the pipeline, and none of them caught it. The only thing that exposed it was checking the aggregate numbers, which didn’t add up. If you let probabilistic behavior into a step that needs to be deterministic, the output will look plausible and the system will report success, but you have no way to know something’s wrong until you go looking for it.
Once I fixed the bug, the real pass rate emerged: 31%. Here’s how the next seven iterations played out:
- Restructuring the data (31% → 37%). The LLM kept losing track of where it was in the deck, so I restructured the data it received to eliminate the bookkeeping. I also removed split hands entirely, because tracking two simultaneous hands is stateful bookkeeping that LLMs reliably botch. Each fix came from looking at what was actually failing and asking whether the LLM needed to be doing that work at all.
- Chain of thought arithmetic (37% → 48%). Instead of letting the LLM jump to a final card total, I required it to show the running math at every step. Forcing the model to trace its own calculations cut multidraw errors roughly in half. CoT is a structural constraint, not a deterministic replacement; it makes the LLM more reliable within the step, but it’s also more expensive because it uses more tokens and takes more time.
- Replacing the LLM validator with deterministic code (48% → 79%). This was the single biggest improvement in the entire arc. The pipeline had a second LLM call that scored how accurately the player followed strategy, and it was wrong 73% of the time. It applied its own blackjack intuitions instead of the rules I’d given it. But there’s a right answer for every situation in basic strategy, and the rules can be written as a lookup table. Replacing the LLM validator with a deterministic expression step recovered over 150 incorrectly rejected hands.
- Rigid output format (79% → 81%). The LLM kept skipping the dealer’s turn entirely, jumping straight to declaring a winner. Requiring a step-by-step dealer output format made it mechanically difficult to skip ahead.
- Overriding the model’s priors (81% → 84%). One strategy required hitting on 18 against a high dealer card, which conventional blackjack wisdom says is terrible. The LLM refused to do it. Restating the rule didn’t help. Explaining why the counterintuitive rule exists did: The prompt had to tell the model that the bad play was intentional.
- Switching models (84% → 94%). I switched from Gemini Flash 2.0 to Haiku 4.6, which was easy to do because Octobatch lets you run the same pipeline with any model from Gemini, Anthropic, or OpenAI. I finally earned my first nine.
Find the best ways to earn your nines
If you’re building anything where LLM output feeds into the next step, the same question applies to every step in your chain: Does this actually require judgment, or is it deterministic work that ended up in the LLM because the LLM can do it? The strategy validator felt like a judgment call until I looked at what it was actually doing, which was checking a hand against a lookup table. That one realization was worth more than all the prompt engineering combined. And as Prompt 9 showed, the AI is often the best tool for writing its own deterministic replacement.
I learned this lesson through my own work on the blackjack pipeline. It went through eight iterations, and I think the numbers tell a story. The fixes fell into two categories: making work deterministic (pulling it out of the LLM entirely) and adding structural constraints (making the LLM more reliable within a step). Both earn nines, but pulling work out of the LLM entirely earns those nines faster. The biggest single jump in the whole arc, from 48% to 79%, came from replacing an LLM validator with a 10-line expression.
Here’s the bottom line for me: If you can write a short function that does the job, don’t give it to the LLM. I initially reached for the LLM for strategy validation because it felt like a judgment call, but once I looked at the data I saw it wasn’t one at all. There was a right answer for every hand, and a lookup table found it more reliably than a language model.
At the end of eight iterations, the pipeline passed 94% of hands. The 6% that still fail may be honest limits of what the model can do with multistep arithmetic and state tracking in a single prompt. But they might just be the next nine I need to earn.
The next article looks at the other side of this problem: Once you know what to make deterministic, how do you make the whole system legible enough that an AI can help your users build with it? The answer turns out to be a kind of documentation you write for AI to read, not humans, and it changes the way you think about what a user manual is for.
