Does it benchmark the underlying code (Opus 4.5) or Claude Code harness? If the ... | Hacker News

stared 36 days ago | on: Claude Code daily benchmarks for degradation track...

Does it benchmark the underlying model (Opus 4.5) or the Claude Code harness? If the latter, I would love to see the CC versions involved. I would be curious to see how it fares against a constant harness. There were threads claiming that Claude Code got worse with 2.0.76, with some people going back to 2.0.62. https://github.com/anthropics/claude-code/issues/16157 So it would be wonderful to measure these.

Jcampuzano2 36 days ago

Claude Code. They mention they are using Claude Code's CLI in the benchmark, and Claude Code changes constantly.

I wouldn't be surprised if what this is actually benchmarking is just Claude Code's constant system prompt changes.

I wouldn't really trust this to be able to benchmark Opus itself.
