Really interesting study. One thing I keep coming back to is that tests have no way of catching this sort of tech debt. The agent can introduce something that will make you rip your hair out in 6 months, but the tests stay green...
My theory is that at least some of this is solvable with prompting / orchestration - the open question is how to measure it and improve it. I.e. how do we know which of Claude/Codex/Cursor/whoever is going to produce the best, most maintainable code *in our codebase*? And how do we measure how that changes over time, with model/harness updates?
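To make that concrete, here's a rough sketch of what a "maintainability drift" check could look like - stdlib only, and the metrics (function count, mean function length, branch count) are crude stand-ins for whatever you'd actually track, not a real answer:

```python
# A minimal sketch, assuming Python sources and very crude proxy metrics.
# "src" and the metric choices are placeholders, not a recommendation.
import ast
from pathlib import Path

def crude_maintainability(source: str) -> dict:
    """Return rough proxies: function count, mean function length, branch nodes."""
    tree = ast.parse(source)
    funcs = [n for n in ast.walk(tree)
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    lengths = [n.end_lineno - n.lineno + 1 for n in funcs]
    branches = sum(isinstance(n, (ast.If, ast.For, ast.While, ast.Try))
                   for n in ast.walk(tree))
    return {
        "functions": len(funcs),
        "mean_func_len": sum(lengths) / len(lengths) if lengths else 0.0,
        "branches": branches,
    }

if __name__ == "__main__":
    # Run per commit (e.g. in CI), log the numbers alongside the model/harness
    # version that produced the change, then watch the trend over time.
    for path in Path("src").rglob("*.py"):  # "src" is a hypothetical layout
        print(path, crude_maintainability(path.read_text()))
```

Something like this at least gives you a per-commit trend line you can segment by which agent (and which model version) wrote the change, even if the metrics themselves are debatable.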