Karpathy’s March of Nines shows why 90% AI reliability isn’t even close to enough

“When you get a demo and something works 90% of the time, that’s just the first nine.” — Andrej Karpathy

The “March of Nines” frames a common production reality: You can reach the first 90% reliability with a strong demo, and each additional nine often requires comparable engineering effort. For enterprise teams, the distance between “usually works” and “operates like dependable software” determines adoption.

The compounding math behind the March of Nines

“Every single nine is the same amount of work.” — Andrej Karpathy

Agentic workflows compound failure. A typical enterprise flow might include: intent parsing, context retrieval, planning, one or more tool calls, validation, formatting, and audit logging. If a workflow has n steps and each step succeeds with probability p, end-to-end success is approximately p^n.

In a 10-step workflow, per-step failure rates compound: even small misses multiply into frequent end-to-end failures. Correlated outages (auth, rate limits, connectors) will dominate unless you harden shared dependencies.
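The compounding is easy to verify directly. A minimal sketch (the step count and per-step probabilities are illustrative):

```python
def end_to_end_success(p: float, n_steps: int) -> float:
    """Probability that all n independent steps succeed."""
    return p ** n_steps

# A 10-step workflow at several per-step reliabilities
for p in (0.90, 0.99, 0.999, 0.9999):
    rate = end_to_end_success(p, 10)
    print(f"p={p:.4%} -> end-to-end {rate:.2%}, failure {1 - rate:.2%}")
```

Note the independence assumption: correlated failures make the real picture worse, not better.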

| Per-step success (p) | 10-step success (p^10) | Workflow failure rate | At 10 workflows/day | What this means in practice |
|---|---|---|---|---|
| 90.00% | 34.87% | 65.13% | ~6.5 interruptions/day | Prototype territory. Most workflows get interrupted. |
| 99.00% | 90.44% | 9.56% | ~1 every 1.0 days | Fine for a demo, but interruptions are still frequent in real use. |
| 99.90% | 99.00% | 1.00% | ~1 every 10.0 days | Still feels unreliable because misses remain common. |
| 99.99% | 99.90% | 0.10% | ~1 every 3.3 months | This is where it starts to feel like dependable enterprise-grade software. |

Define reliability as measurable SLOs

“It makes a lot more sense to spend a bit more time to be more concrete in your prompts.” — Andrej Karpathy

Teams achieve higher nines by turning reliability into measurable objectives, then investing in controls that reduce variance. Start with a small set of SLIs that describe both model behavior and the surrounding system, such as end-to-end task success rate, validator pass rate, tool-call error rate, and latency.

Set SLO targets per workflow tier (low/medium/high impact) and manage an error budget so experiments stay controlled.
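Error-budget accounting for a tiered SLO can be sketched in a few lines (the tier names and targets here are illustrative, not prescriptive):

```python
SLO_TARGETS = {"low": 0.99, "medium": 0.999, "high": 0.9999}  # illustrative tiers

def error_budget_remaining(tier: str, total_runs: int, failures: int) -> float:
    """Fraction of the period's error budget still unspent (negative = burned through)."""
    allowed = (1 - SLO_TARGETS[tier]) * total_runs  # failures the SLO permits
    if allowed == 0:
        return 0.0
    return 1 - failures / allowed

# A medium-tier workflow: 10,000 runs, 5 failures -> half the budget left
print(error_budget_remaining("medium", 10_000, 5))
```

When the remaining budget goes negative, experiments pause and reliability work takes priority, which is what keeps experimentation controlled.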

Nine levers that reliably add nines

1) Constrain autonomy with an explicit workflow graph

Reliability rises when the system has bounded states and deterministic handling for retries, timeouts, and terminal outcomes.
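One way to bound the state space is to make legal transitions explicit and force anything else into escalation. A minimal sketch (the state names are hypothetical):

```python
# Allowed transitions for a bounded workflow; anything else is a policy violation.
TRANSITIONS = {
    "parse_intent": {"retrieve_context", "escalate"},
    "retrieve_context": {"plan", "escalate"},
    "plan": {"call_tool", "escalate"},
    "call_tool": {"validate", "retry_tool", "escalate"},
    "retry_tool": {"validate", "escalate"},
    "validate": {"format_output", "escalate"},
    "format_output": {"done"},
    "escalate": {"done"},  # terminal handoff to a human
}

def next_state(current: str, proposed: str) -> str:
    """Accept a transition only if the graph allows it; otherwise force escalation."""
    return proposed if proposed in TRANSITIONS.get(current, set()) else "escalate"
```

The model can propose the next step, but the graph decides whether the proposal is legal, which makes retries, timeouts, and terminal outcomes enumerable and testable.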

2) Enforce contracts at every boundary

Most production failures start as interface drift: malformed JSON, missing fields, wrong units, or invented identifiers.
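A stdlib-only sketch of enforcing a contract at one boundary (the field names are hypothetical; a schema library would do the same job with less code):

```python
import json

REQUIRED = {"order_id": str, "quantity": int, "currency": str}

def parse_tool_output(raw: str) -> dict:
    """Reject interface drift early: malformed JSON, missing fields, wrong types."""
    data = json.loads(raw)  # raises on malformed JSON
    for field, typ in REQUIRED.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], typ):
            raise ValueError(f"wrong type for {field}: expected {typ.__name__}")
    return data
```

Catching drift at the boundary turns a downstream data-corruption incident into a retryable validation error.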

3) Layer validators: syntax, semantics, business rules

Schema validation catches formatting. Semantic and business-rule checks prevent plausible answers that break systems.
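The layering can be expressed as an ordered pipeline; a sketch with hypothetical bounds and rules:

```python
import json

def check_syntax(raw: str) -> dict:
    return json.loads(raw)  # layer 1: is it well-formed at all?

def check_semantics(data: dict) -> dict:
    # layer 2: plausible values (bounds here are hypothetical)
    if not (0 < data["quantity"] <= 1_000):
        raise ValueError("quantity out of range")
    return data

def check_business_rules(data: dict) -> dict:
    # layer 3: invariants that protect downstream systems
    if data["currency"] not in {"USD", "EUR", "GBP"}:
        raise ValueError("unsupported currency")
    return data

def validate(raw: str) -> dict:
    return check_business_rules(check_semantics(check_syntax(raw)))
```

Each layer catches a failure class the previous one cannot: valid JSON can still carry an absurd quantity, and a plausible quantity can still violate a business invariant.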

4) Route by risk using uncertainty signals

High-impact actions deserve higher assurance. Risk-based routing turns uncertainty into a product feature.
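Routing by risk can be as simple as comparing a confidence signal against tier-specific thresholds. A sketch (the thresholds are illustrative, not tuned values):

```python
def route(action_impact: str, confidence: float) -> str:
    """Route by risk: higher-impact actions demand higher confidence to run unattended."""
    thresholds = {"low": 0.70, "medium": 0.90, "high": 0.99}
    if confidence >= thresholds[action_impact]:
        return "auto_execute"
    if confidence >= thresholds[action_impact] - 0.10:
        return "execute_with_review"  # human approves before commit
    return "escalate_to_human"
```

The confidence signal itself can come from log-probabilities, self-consistency sampling, or a calibrated verifier model; the routing policy stays the same either way.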

5) Engineer tool calls like distributed systems

Connectors and dependencies often dominate failure rates in agentic systems.
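Treating a connector like a remote dependency means, among other things, a circuit breaker so a flapping upstream fails fast instead of dragging every workflow down with it. A minimal sketch:

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; probe again after `reset_s`."""

    def __init__(self, max_failures: int = 5, reset_s: float = 30.0):
        self.max_failures, self.reset_s = max_failures, reset_s
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Combined with bounded retries and jittered backoff (as in the step wrapper below in spirit), this keeps one bad connector from consuming the whole error budget.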

6) Make retrieval predictable and observable

Retrieval quality determines how grounded your application will be. Treat it like a versioned data product with coverage metrics.
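One concrete coverage metric is recall@k against a labeled query set. A sketch (the labeling scheme is an assumption; any gold set of relevant document IDs works):

```python
def recall_at_k(retrieved: list, relevant: set, k: int = 5) -> float:
    """Fraction of known-relevant documents that appear in the top-k results."""
    if not relevant:
        return 1.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)
```

Tracked per index version, this turns "retrieval got worse" from a vibe into a number you can alert on.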

7) Build a production evaluation pipeline

The later nines depend on finding rare failures quickly and preventing regressions.
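A release gate over a fixed evaluation set is one piece of that pipeline; a sketch (the tolerance is a hypothetical policy choice):

```python
def regression_gate(results: dict, baseline_pass_rate: float,
                    max_drop: float = 0.01) -> bool:
    """Block a release if pass rate drops more than `max_drop` below baseline.

    `results` maps eval-case id -> bool (passed or not).
    """
    pass_rate = sum(results.values()) / len(results)
    return pass_rate >= baseline_pass_rate - max_drop
```

The eval set itself should grow from production: every rare failure found in the field becomes a permanent regression case.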

8) Invest in observability and operational response

Once failures become rare, the speed of diagnosis and remediation becomes the limiting factor.

9) Ship an autonomy slider with deterministic fallbacks

Fallible systems need supervision, and production software needs a safe way to dial autonomy up over time. Treat autonomy as a knob, not a switch, and make the safe path the default.
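The slider can be encoded so that autonomy is earned from measured reliability rather than granted by default. A sketch (the levels and thresholds are illustrative):

```python
from enum import IntEnum

class Autonomy(IntEnum):
    SUGGEST = 0         # model drafts, human executes
    EXECUTE_REVIEW = 1  # model executes, human approves before commit
    AUTO = 2            # model executes, humans audit after the fact

def allowed_autonomy(workflow_tier: str, observed_success: float) -> Autonomy:
    """Dial autonomy up only when measured reliability earns it."""
    needed = {"low": 0.99, "medium": 0.999, "high": 0.9999}[workflow_tier]
    if observed_success >= needed:
        return Autonomy.AUTO
    if observed_success >= needed - 0.01:
        return Autonomy.EXECUTE_REVIEW
    return Autonomy.SUGGEST
```

Because the levels are ordered, a deterministic fallback is just a step down the enum: when an incident opens, drop every affected workflow one level until the error budget recovers.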

Implementation sketch: a bounded step wrapper

A small wrapper around each model/tool step converts unpredictability into policy-driven control: strict validation, bounded retries, timeouts, telemetry, and explicit fallbacks.

```python
def run_step(name, attempt_fn, validate_fn, *, max_attempts=3, timeout_s=15):
    # helpers (start_span, deadline, metric, sleep, jittered_backoff, error
    # types) are assumed to come from the surrounding platform
    span = start_span(name)  # trace all retries under one span
    mode = "default"
    for attempt in range(1, max_attempts + 1):
        try:
            # bound latency so one step can't stall the workflow
            with deadline(timeout_s):
                out = attempt_fn(mode=mode)
            # gate: schema + semantic + business invariants
            validate_fn(out)
            # success path
            metric("step_success", name, attempt=attempt)
            return out
        except (TimeoutError, UpstreamError) as e:
            # transient: retry with jitter to avoid retry storms
            span.log({"attempt": attempt, "err": str(e)})
            sleep(jittered_backoff(attempt))
        except ValidationError as e:
            # bad output: retry in "safer" mode (lower temp / stricter prompt),
            # and keep validating each retry instead of trusting it blindly
            span.log({"attempt": attempt, "err": str(e)})
            mode = "safer"
    # fallback: keep the system safe when retries are exhausted
    metric("step_fallback", name)
    return EscalateToHuman(reason=f"{name} failed")
```

Why enterprises insist on the later nines

Reliability gaps translate into business risk. McKinsey’s 2025 global survey reports that 51% of organizations using AI experienced at least one negative consequence, and nearly one-third reported consequences tied to AI inaccuracy. These outcomes drive demand for stronger measurement, guardrails, and operational controls.

Closing checklist

The nines arrive through disciplined engineering:

- Bounded workflows
- Strict interfaces
- Resilient dependencies
- Fast operational learning loops

Nikhil Mungel has been building distributed systems and AI teams at SaaS companies for more than 15 years.