
My suspicion (unconfirmed, so take it with a grain of salt) is that they either trained on some or all of the test data, or that there was some other leakage from the benchmark set into their training set.

That said, Sonnet 4.5 isn't new, and there have been loads of innovations recently.

Exciting to see open models nipping at the heels of the big end of town. Let’s see what shakes out over the coming days.

None of these open-source models can actually compete with Sonnet when it comes to real-life usage. They're all benchmaxxed, so in reality they're not "nipping at the heels". Which is a shame.

M2.1 comes close. I'm using it now instead of Sonnet for real work every day, since the price drop is much bigger than the quality drop. And the quality isn't that far off anyway; they're likely one update away from being genuinely better. Also, if you're not in a rush, letting it run in OpenCode a few extra minutes to solve any remaining issues costs only a couple of cents and will likely get you the same end result as Sonnet. That's especially nice on really large tasks, like "document everything about feature X in this large codebase, write the docs, now create an independent app that just does X", which can take a very long time.

I agree. I use Opus 4.5 daily and I'm often trying new models to see how they compare. I didn't think GLM 4.7 was very good, but MiniMax 2.1 is the closest to Sonnet 4.5 I've used. Still not at the same level, and still very much behind Opus, but it is impressive nonetheless.

FYI, I use CC (Claude Code) for Anthropic models and OpenCode for everything else.


M2.1 is extremely bad at writing tests and at following instructions from a .md file, I've found.

It's a shame, but it's also understandable that they can't compete with SOTA models like Sonnet and Opus.

They’re focused almost entirely on benchmarks. I think Grok is doing the same thing. I wonder if people could figure out a type of benchmark that cannot be optimized for, like having multiple models compete against each other in something.


You can let them play complete-information games (1 or 2 player) with randomly created rulesets. It's very objective, but the thing is that anything can be optimized for. This benchmark would favor models that are good at logic puzzles / chess-style games, possibly at the expense of other capabilities.
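To make that concrete, here's a rough sketch of what such a harness could look like. The ask_model function is just a stand-in for a real model call, and the randomized Nim-style ruleset is my own illustration, not an existing benchmark:

    # Sketch of the "random ruleset" idea: generate a fresh Nim-style game per
    # seed, let two agents play, and score by wins. ask_model() is a stub that
    # would be replaced by a prompt to each model with the rules and position.
    import random

    def make_ruleset(seed):
        rng = random.Random(seed)
        return {
            "heaps": [rng.randint(3, 9) for _ in range(rng.randint(2, 4))],
            "max_take": rng.randint(2, 4),   # max stones removable per turn
            "misere": rng.random() < 0.5,    # True: taking the last stone loses
        }

    def legal_moves(heaps, max_take):
        return [(i, n) for i, h in enumerate(heaps)
                for n in range(1, min(h, max_take) + 1)]

    def ask_model(name, rules, heaps):
        # Stand-in for an LLM call; picks a random legal move so the harness
        # runs end to end without any external API.
        return random.choice(legal_moves(heaps, rules["max_take"]))

    def play(rules, players):
        heaps = list(rules["heaps"])
        turn = 0
        while any(heaps):
            move = ask_model(players[turn], rules, heaps)
            if move not in legal_moves(heaps, rules["max_take"]):
                return players[1 - turn]     # illegal move forfeits the game
            i, n = move
            heaps[i] -= n
            if not any(heaps):
                # Normal play: the mover wins; misere: the mover loses.
                return players[1 - turn] if rules["misere"] else players[turn]
            turn = 1 - turn

    wins = {"model_a": 0, "model_b": 0}
    for seed in range(100):                  # 100 fresh rulesets per match
        wins[play(make_ruleset(seed), ("model_a", "model_b"))] += 1
    print(wins)

The point is that every seed produces a ruleset nobody has seen before, so there's no fixed test set to train against; the weakness is exactly what I said above, it only measures this one kind of skill.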

swe-rebench is a pretty good indicator. They collect "new" tasks every month and test the models on those. For open models it's a good measure of real task performance, since the tasks are gathered after the models are released. It's trickier for evaluating API-based models, but it's the best concept yet.

That's lmarena.

You are correct on the leakage, as other comments describe.


