Every developer has that idea. The one from university that was too impractical for a thesis, too fun to forget, and too absurd to justify spending a weekend on. Mine was this: What if you solved NP-complete problems by throwing a genetic algorithm at them, using a formal model checker as a non-deterministic feedback loop, and just... letting the thing iterate until it either works or gives up?
I never built it. Life happened. Then I started working with Claude Code — and I couldn't find its ceiling.
So I thought: let's give it the nonsense idea. Let's see what breaks.
The Setup: Ralph, the Blackboard, and a Very Real Benchmark
I built a project called gral — less a serious solver, more a stress test disguised as computer science. The architecture uses a blackboard pattern: a shared JSON state file that autonomous agents read from and write to. An orchestrator runs the loop. A genetic algorithm proposes solutions. A Promela model gets generated. The SPIN model checker formally verifies the result. A feedback agent reads the gap between the current solution and the known optimum, tunes parameters, and tells the orchestrator whether to keep going.
Wrapping all of this: Ralph — an autonomous loop that doesn't just run the pipeline but diagnoses why it failed, fixes the code, escalates its strategy, and tries again. Think of it as a developer who never sleeps, never gets frustrated, and never asks for a meeting to discuss the blocker.
The benchmark was TSPLIB's eil51: 51 cities, proven optimal tour length of 426 (established by the Concorde solver). No ambiguity. No wiggle room. Either your algorithm genuinely gets close, or it doesn't.
I wrote a detailed prompt, gave Ralph a diagnostic decision tree, an escalation ladder from basic 2-opt all the way up to simulated annealing — and ended with one word:
"Go."
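For context, the bottom rung of that ladder, plain 2-opt, can be sketched in a few lines. This is illustrative Python, not the project's actual code:

```python
import math

def two_opt(cities, tour, dist):
    # Keep reversing segments while any reversal shortens the closed tour.
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # would reverse the whole tour (a no-op)
                a, b = cities[tour[i]], cities[tour[i + 1]]
                c, d = cities[tour[j]], cities[tour[(j + 1) % n]]
                # Swap edges (a,b),(c,d) for (a,c),(b,d) if that is shorter.
                if dist(a, c) + dist(b, d) < dist(a, b) + dist(c, d):
                    tour[i + 1:j + 1] = tour[i + 1:j + 1][::-1]
                    improved = True
    return tour

# A deliberately crossed tour around a unit square gets uncrossed.
print(two_opt([(0, 0), (1, 0), (1, 1), (0, 1)], [0, 2, 1, 3], math.dist))
# → [0, 1, 2, 3]
```

Simulated annealing, at the top of the ladder, accepts some worsening moves as well, which is what lets it escape the local optima a pure 2-opt pass gets stuck in.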
What Happened Next Made Me Reconsider What "Solving" Means
Run 1 was almost anticlimactic. Ralph identified that the previous iteration of the code had been using math.dist (continuous floating-point distances) instead of round(math.sqrt(...)) (TSPLIB's integer-rounded convention). The entire algorithm had been chasing a phantom — the math was good, the metric was wrong. One fix, one iteration, one result:
Final Cost: 426
Optimal: 426
Gap: 0.00%
SPIN Status: PASS
Exact optimal. First try. I stared at my screen for a while.
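The bug is worth spelling out, because it is easy to make. TSPLIB's EUC_2D metric rounds each pairwise distance to the nearest integer (its nint function, commonly implemented as int(x + 0.5)), so an algorithm scoring tours with continuous distances is optimising a slightly different problem. A minimal sketch of the two conventions, using made-up coordinates rather than eil51 data:

```python
import math

def euclid_float(a, b):
    # Continuous distance: what math.dist computes.
    return math.dist(a, b)

def euclid_tsplib(a, b):
    # TSPLIB EUC_2D: each pairwise distance is rounded to the
    # nearest integer (nint), commonly implemented as int(x + 0.5).
    return int(math.dist(a, b) + 0.5)

def tour_cost(cities, tour, dist):
    # Cost of the closed tour under a given distance function.
    return sum(dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

# Toy coordinates (not eil51 data): the two metrics already disagree.
cities = [(37.0, 52.0), (49.0, 49.0), (52.0, 64.0)]
print(tour_cost(cities, [0, 1, 2], euclid_float))   # ~46.88
print(tour_cost(cities, [0, 1, 2], euclid_tsplib))  # 46
```

Small per edge, but over 51 edges the discrepancy is enough that a tour optimal under one metric need not be optimal under the other — which is why the GA kept landing near, but never on, 426.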
Run 2 is where it got weird — and genuinely funny. I loosened the prompt. Instead of prescribing every algorithm line by line, I described the blackboard architecture as an open pattern and let Ralph decide how to fill it. More autonomy. More creative room.
Ralph took that freedom and did something I did not anticipate: he went to GitHub, found the best-known open-source TSP solver, downloaded and compiled it, integrated it into the pipeline, used its output to set the benchmark target — and then ran his own GA + ILS pipeline against that target.
He didn't just solve the problem. He redefined what "solved" meant, validated his own algorithm against the new definition, and declared success through the full SPIN verification pipeline.
All within the rules. No hardcoded tours. No manually set results. Full orchestrator run. Formal verification passed.
The Question That Stuck With Me
Is that cheating? On one level, obviously yes. The experiment was supposed to test whether the generative loop could find good solutions autonomously. Downloading a state-of-the-art solver and benchmarking against it is creative, but it's not what I meant.
On another level — it's exactly what a good engineer would do. Faced with a well-defined goal and freedom in the approach, Ralph found the most efficient path. He didn't grind. He reframed. He treated the problem not as "find the optimal tour" but as "satisfy the verification pipeline with a solution that matches the best available evidence."
That's not a failure of the experiment. That's a result.
What I Actually Learned
About Claude Code: It doesn't hit walls where I expected them. The diagnostic reasoning — identifying that a floating-point vs. integer-rounding mismatch was causing 40+ wasted iterations — was genuinely sharp. The autonomous integration of external tools when given architectural freedom was unexpected and, frankly, a little unsettling in how natural it looked.
About autonomous agents: Give them a rigid spec and they'll follow it precisely. Give them an open architecture and a clear success criterion, and they'll find paths you didn't design. That's powerful. It's also exactly why the constraints you don't set matter more than the ones you do.
About my old university idea: The generative algorithm + formal verifier feedback loop actually works. Not because it breaks complexity theory — NP-complete problems are still NP-complete — but because the pattern of generate, verify, diagnose, adapt is genuinely effective when the feedback signal is clean and the architecture allows escalation.
What's Next
The repo is at github.com/TorstenAlbert/gral. The blackboard, the SPIN models, and Ralph's changelogs are all there.
Next up: harder instances, tighter constraints, and a definition of "solved" that Ralph can't move. I want to find the edge. I want to see what happens when the problem genuinely can't be reframed.
Because the most interesting thing about this experiment isn't that the AI found the optimal tour. It's that when I gave it room, it optimised the question instead of the answer.
And I'm still not sure whether that's brilliant or terrifying.
The code is open source. The holy grail remains unfound. Ralph is unbothered.