I don't think they are missing the point, because they're pointing out that the tools are still the most useful for patterns that are extremely widely known and repeated. I use Gemini 2.5 Pro every day for coding, and even that one still falls over on tasks that aren't well known to it (which is why I break the problem down into small parts that I know it'll be able to handle properly).
It's kind of funny, because sometimes these tools are magical and incredible, and sometimes they are extremely stupid in obvious ways.
Yes, these are impressive, and especially so for local models that you can run yourself, but there is a gap between "absolutely magical" and "pretty cool, but needs heavy guiding" depending on how heavily the ground you're treading has been walked upon.
For a heavily explored space, it's like being impressed that your 2.5-year-old M2 with 64 GB RAM can extract some source code from a zip file. It's worth being impressed and excited about the space and the pace of improvement, but it's also worth stepping back and thinking rationally about the specific benchmark at hand.
> because they're pointing out that the tools are still the most useful for patterns that are extremely widely known and repeated
I agree with you, but your take is much more nuanced than what the GP comment said! These models don't simply regurgitate the training set. That was my point with GPT-3. The models have advanced since then, and can now "generalise" over the context in ways they could not ~3 years ago. We are now at a point where you can write a detailed spec (10-20k tokens) for an unseen scripting language, and have SotA models a) write a parser and b) start writing scripts for you in that language, even though they never saw that particular scripting language anywhere in their training sets. Try it. You'll be surprised.