Dicebag

Improving 15 LLMs at Coding in One Afternoon: Only the Harness Changed

“The model is the moat. The harness is the bridge. Burning bridges just means fewer people bother to cross.”

Can Bölük spent an afternoon changing one thing—the edit tool in his coding harness—and improved 15 LLMs by up to 10x. Not new models. Not more training. Just a different way to tell the model “change this line.”

His fix? Hashline. Every line gets tagged with a 2-3 character hash. When the model edits, it references those hashes instead of reproducing the original text. No more “String to replace not found” errors. No more retry loops burning tokens.
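Bölük's exact tag scheme and edit grammar aren't reproduced here, but the core idea is easy to sketch. Below is a minimal Python illustration under stated assumptions: the hash function, the 3-character tag length, the index salting, and the tag-to-replacement edit shape are all mine for demonstration, not his implementation.

```python
import hashlib

# Minimal sketch of a hashline-style edit format. The hash choice,
# tag length, and edit shape are assumptions for illustration.


def line_tag(index: int, line: str) -> str:
    # Salt with the line index so duplicate lines (blank lines,
    # repeated closing braces) still get distinct tags.
    return hashlib.sha1(f"{index}:{line}".encode()).hexdigest()[:3]


def render_for_model(text: str) -> str:
    """Prefix every line with its tag; this is what the model sees."""
    return "\n".join(
        f"{line_tag(i, line)}| {line}"
        for i, line in enumerate(text.splitlines())
    )


def apply_edits(text: str, edits: dict[str, str]) -> str:
    """Apply {tag: replacement_line} edits proposed by the model.

    A stale or mistyped tag raises KeyError instead of silently
    matching the wrong span -- the failure mode that search/replace
    formats surface as "String to replace not found".
    """
    lines = text.splitlines()
    by_tag = {line_tag(i, ln): i for i, ln in enumerate(lines)}
    for tag, replacement in edits.items():
        lines[by_tag[tag]] = replacement
    return "\n".join(lines)


buggy = "def add(a, b):\n    return a - b"
print(render_for_model(buggy))
# The model replies with edits keyed by tag; it never has to
# reproduce the original line text byte-for-byte.
fix = {line_tag(1, "    return a - b"): "    return a + b"}
print(apply_edits(buggy, fix))
```

The design point is that the tag is short enough to be cheap to emit but anchored to the file's actual content, so an edit either lands exactly or fails loudly.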

The numbers are absurd. Grok Code Fast went from a 6.7% to a 68.3% success rate, which is roughly where the 10x headline comes from. Gemini improved 8%, a bigger gain than most model upgrades deliver, and its output tokens dropped 61% because the model stopped failing at mechanical edits and retrying.

Here’s what’s frustrating: vendors don’t care. Anthropic blocked OpenCode from their API for “reverse-engineering a private API” (i.e., building their own harness). Google banned Bölük’s account entirely for running benchmarks. These companies want you to use their tools, not understand them.

But the math is undeniable. A better edit format beat model improvements across the board. The harness isn’t a solved problem—it’s the highest-leverage optimization nobody’s paying attention to.

The gap between “cool demo” and “reliable tool” isn’t model magic. It’s boring engineering at the tool boundary. And right now, that engineering is happening in open-source side projects, not in the AI labs cashing the checks.

Go read the benchmark results. They’re damning.

_Source: Hacker News_