There should be a way to turn the questions we ask LLMs into benchmarks. That wa... | Hacker News


amelius 2 days ago | on: Qwen3.5 122B and 35B models offer Sonnet 4.5 perfo...

There should be a way to turn the questions we ask LLMs into benchmarks. That way, we can have a benchmark that is always up to date.


lurkshark 1 day ago

There are a few “updating” benchmarks out there. I periodically take a look at these two:

https://swe-rebench.com/

https://livebench.ai/
