I would be curious to see on how it fares against a constant harness.
There were thread claiming that Claude Code got worse with 2.0.76, with some people going back to 2.0.62. https://github.com/anthropics/claude-code/issues/16157
So it would be wonderful to measure these.
I wouldn't be surprised if the thing this is actually testing is benchmarking just claude codes constant system prompt changes.
I wouldn't really trust this to be able to benchmark opus itself.
I would be curious to see on how it fares against a constant harness.
There were thread claiming that Claude Code got worse with 2.0.76, with some people going back to 2.0.62. https://github.com/anthropics/claude-code/issues/16157
So it would be wonderful to measure these.