Every developer has that idea. The one from university that was too impractical for a thesis, too fun to forget, and too absurd to justify spending a weekend on. Mine was this: What if you solved NP-complete problems by throwing a genetic algorithm at them, using a formal model checker as a non-deterministic feedback loop, and just... letting the thing iterate until it either works or gives up?
I never built it. Life happened. Then I started working with Claude Code — and I couldn't find its ceiling.
So I thought: let's give it the nonsense idea. Let's see what breaks.
The Setup: Ralph, the Blackboard, and a Very Real Benchmark
I built a project called gral — less a serious solver, more a stress test disguised as computer science. The architecture uses a blackboard pattern: a shared JSON state file that autonomous agents read from and write to. An orchestrator runs the loop. A genetic algorithm proposes solutions. A Promela model gets generated. The SPIN model checker formally verifies the result. A feedback agent reads the gap between the current solution and the known optimum, tunes parameters, and tells the orchestrator whether to keep going.
Wrapping all of this: Ralph — an autonomous loop that doesn't just run the pipeline but diagnoses why it failed, fixes the code, escalates its strategy, and tries again. Think of it as a developer who never sleeps, never gets frustrated, and never asks for a meeting to discuss the blocker.
The benchmark was TSPLIB's eil51: 51 cities, proven optimal tour length of 426 (established by the Concorde solver). No ambiguity. No wiggle room. Either your algorithm genuinely gets close, or it doesn't.
I wrote a detailed prompt, gave Ralph a diagnostic decision tree, an escalation ladder from basic 2-opt all the way up to simulated annealing — and ended with one word:
"Go."
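For context, the bottom rung of that ladder, plain 2-opt, can be sketched in a few lines. This is illustrative Python, not the project's actual code:

```python
import math

def two_opt(cities, tour, dist):
    # Keep reversing segments while any reversal shortens the closed tour.
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # would reverse the whole tour (a no-op)
                a, b = cities[tour[i]], cities[tour[i + 1]]
                c, d = cities[tour[j]], cities[tour[(j + 1) % n]]
                # Swap edges (a,b),(c,d) for (a,c),(b,d) if that is shorter.
                if dist(a, c) + dist(b, d) < dist(a, b) + dist(c, d):
                    tour[i + 1:j + 1] = tour[i + 1:j + 1][::-1]
                    improved = True
    return tour

# A deliberately crossed tour around a unit square gets uncrossed.
print(two_opt([(0, 0), (1, 0), (1, 1), (0, 1)], [0, 2, 1, 3], math.dist))
# → [0, 1, 2, 3]
```

Simulated annealing, at the top of the ladder, accepts some worsening moves as well, which is what lets it escape the local optima a pure 2-opt pass gets stuck in.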
What Happened Next Made Me Reconsider What "Solving" Means
Run 1 was almost anticlimactic. Ralph identified that the previous iteration of the code had been using math.dist (continuous floating-point distances) instead of round(math.sqrt(...)) (TSPLIB's integer-rounded convention). The entire algorithm had been chasing a phantom — the math was good, the metric was wrong. One fix, one iteration, one result:
Final Cost: 426
Optimal: 426
Gap: 0.00%
SPIN Status: PASS
Exact optimal. First try. I stared at my screen for a while.
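The bug is worth spelling out, because it is easy to make. TSPLIB's EUC_2D metric rounds each pairwise distance to the nearest integer (its nint function, commonly implemented as int(x + 0.5)), so an algorithm scoring tours with continuous distances is optimising a slightly different problem. A minimal sketch of the two conventions, using made-up coordinates rather than eil51 data:

```python
import math

def euclid_float(a, b):
    # Continuous distance: what math.dist computes.
    return math.dist(a, b)

def euclid_tsplib(a, b):
    # TSPLIB EUC_2D: each pairwise distance is rounded to the
    # nearest integer (nint), commonly implemented as int(x + 0.5).
    return int(math.dist(a, b) + 0.5)

def tour_cost(cities, tour, dist):
    # Cost of the closed tour under a given distance function.
    return sum(dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

# Toy coordinates (not eil51 data): the two metrics already disagree.
cities = [(37.0, 52.0), (49.0, 49.0), (52.0, 64.0)]
print(tour_cost(cities, [0, 1, 2], euclid_float))   # ~46.88
print(tour_cost(cities, [0, 1, 2], euclid_tsplib))  # 46
```

Small per edge, but over 51 edges the discrepancy is enough that a tour optimal under one metric need not be optimal under the other — which is why the GA kept landing near, but never on, 426.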
Run 2 is where it got weird — and genuinely funny. I loosened the prompt. Instead of prescribing every algorithm line by line, I described the blackboard architecture as an open pattern and let Ralph decide how to fill it. More autonomy. More creative room.
Ralph took that freedom and did something I did not anticipate: he went to GitHub, found the best-known open-source TSP solver, downloaded and compiled it, integrated it into the pipeline, used its output to set the benchmark target — and then ran his own GA + ILS pipeline against that target.
He didn't just solve the problem. He redefined what "solved" meant, validated his own algorithm against the new definition, and declared success through the full SPIN verification pipeline.
All within the rules. No hardcoded tours. No manually set results. Full orchestrator run. Formal verification passed.
The Question That Stuck With Me
Is that cheating? On one level, obviously yes. The experiment was supposed to test whether the generative loop could find good solutions autonomously. Downloading a state-of-the-art solver and benchmarking against it is creative, but it's not what I meant.
On another level — it's exactly what a good engineer would do. Faced with a well-defined goal and freedom in the approach, Ralph found the most efficient path. He didn't grind. He reframed. He treated the problem not as "find the optimal tour" but as "satisfy the verification pipeline with a solution that matches the best available evidence."
That's not a failure of the experiment. That's a result.
What I Actually Learned
About Claude Code: It doesn't hit walls where I expected them. The diagnostic reasoning — identifying that a floating-point vs. integer-rounding mismatch was causing 40+ wasted iterations — was genuinely sharp. The autonomous integration of external tools when given architectural freedom was unexpected and, frankly, a little unsettling in how natural it looked.
About autonomous agents: Give them a rigid spec and they'll follow it precisely. Give them an open architecture and a clear success criterion, and they'll find paths you didn't design. That's powerful. It's also exactly why the constraints you don't set matter more than the ones you do.
About my old university idea: The generative algorithm + formal verifier feedback loop actually works. Not because it breaks complexity theory — NP-complete problems are still NP-complete — but because the pattern of generate, verify, diagnose, adapt is genuinely effective when the feedback signal is clean and the architecture allows escalation.
What's Next
The repo is at github.com/TorstenAlbert/gral. The blackboard, the SPIN models, and Ralph's changelogs are all there.
Next up: harder instances, tighter constraints, and a definition of "solved" that Ralph can't move. I want to find the edge. I want to see what happens when the problem genuinely can't be reframed.
Because the most interesting thing about this experiment isn't that the AI found the optimal tour. It's that when I gave it room, it optimised the question instead of the answer.
And I'm still not sure whether that's brilliant or terrifying.
The code is open source. The holy grail remains unfound. Ralph is unbothered.