Qwen3.5 122B and 35B models offer Sonnet 4.5 performance on local computers

Aurornis · 2026-03-01T00:09:55 1772323795

If you're new to this: All of the open source models are playing benchmark optimization games. Every new open weight model comes with promises of being as good as something SOTA from a few months ago then they always disappoint in actual use.

I've been playing with Qwen3-Coder-Next and the Qwen3.5 models since they were each released.

They are impressive, but they are not performing at Sonnet 4.5 level in my experience.

I have observed that they're configured to be very tenacious. If you can carefully constrain the goal with some tests they need to pass and frame it in a way to keep them on track, they will just keep trying things over and over. They'll "solve" a lot of these problems in the way that a broken clock is right twice a day, but there's a lot of fumbling to get there.
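The "constrain it with tests and let it keep trying" workflow can be sketched roughly as below. This is only an illustration: `fake_model` is a hypothetical stand-in for an LLM that yields successive attempts, and `run_tests`, `retry_until_green`, and the `add` task are made-up names, not part of any real agent harness.

```python
from typing import Callable, Iterable, Iterator, Optional

Candidate = Callable[[int, int], int]


def run_tests(candidate: Candidate) -> list[str]:
    """Return failure messages for a toy `add` task; an empty list means green."""
    cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
    failures = []
    for args, expected in cases:
        try:
            got = candidate(*args)
        except Exception as exc:  # a crashing attempt counts as a failure
            failures.append(f"add{args} raised {exc!r}")
            continue
        if got != expected:
            failures.append(f"add{args} returned {got}, expected {expected}")
    return failures


def fake_model() -> Iterator[Candidate]:
    """Hypothetical stand-in for an LLM: two wrong attempts, then a right one."""
    yield lambda a, b: a - b
    yield lambda a, b: a * b
    yield lambda a, b: a + b


def retry_until_green(
    attempts: Iterable[Candidate], max_attempts: int = 5
) -> tuple[Optional[int], list[int]]:
    """Try candidates until the tests pass or the attempt budget runs out.

    Returns (attempt number that passed or None, failure count per attempt).
    """
    failure_counts = []
    for i, candidate in enumerate(attempts, start=1):
        if i > max_attempts:
            break
        failures = run_tests(candidate)
        failure_counts.append(len(failures))
        if not failures:
            return i, failure_counts
        # A real harness would feed `failures` back into the next prompt here.
    return None, failure_counts


print(retry_until_green(fake_model()))
```

The attempt budget is the important knob: without it, a tenacious model can burn tokens indefinitely on a goal the tests do not actually pin down.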

That said, they are impressive for open source models. It's amazing what you can do with self-hosted now. Just don't believe the hype that these are Sonnet 4.5 level models because you're going to be very disappointed once you get into anything complex.

kir-gadjello · 2026-03-01T01:14:30 1772327670

Respectfully, from my experience and a few billions of tokens consumed, some opensource models really are strong and useful. Specifically StepFun-3.5-flash https://github.com/stepfun-ai/Step-3.5-Flash

I'm working on a pretty complex Rust codebase right now, with hundreds of integration tests and nontrivial concurrency, and stepfun powers through.

I have no relation to stepfun, and I'm saying this purely out of deep respect for the team that managed to pack this performance into a 196B/11B-active envelope.

jasonni · 2026-03-01T08:16:22 1772352982

What coding agent do you use with StepFun-3.5-flash? I just tried it from siliconflow's api with opencode. The toolcalling is broken: AI_InvalidResponseDataError: Expected 'function.name' to be a string.

kir-gadjello · 2026-03-01T16:53:12 1772383992

I use pi, but I'm almost done writing a better alternative that doesn't have pi's stability issues. 80K Rust SLOC and a few hundred tests btw.

echion · 2026-03-02T00:05:22 1772409922

Any place we can look for you to release this?

kir-gadjello · 2026-03-02T01:59:54 1772416794

Yeah, my github is in the profile. Soon (tm). Feel free to follow.

copperx · 2026-03-01T04:33:47 1772339627

Are you using stepfun mostly because it's free, or is it better than other models at some things?

kir-gadjello · 2026-03-01T07:19:00 1772349540

I think we are at this point where the hard ceiling of a strong model is pretty hard to delineate reliably (at least in coding, in research work it's clearer ofc) - and in a good sense, meaning with suitable task decomposition or a test harness or a good abstraction you can make the model do what you thought it could not. StepFun is a strong model and I really enjoyed studying and comparing it to others by coding pretty complex projects semi-autonomously (will do a write up on this soon tm).

Even purely pragmatically, StepFun covers 95% of my research+SWE coding needs, and for the remaining 5% I can access the large frontier models. I was surprised StepFun is even decent at planning and research, so it is possible to get by with it and nothing else (1), but ofc for minmaxing the best frontier model is still the best planner (although the latest deepseek is surprisingly good too).

Finally we are at a point where there is a clear separation of labor between frontier & strong+fast models, but tbh shoehorning StepFun into this "strong+fast" category feels limiting, I think it has greater potential.

CapsAdmin · 2026-03-01T07:33:36 1772350416

I pay for copilot to access anthropic, google and openai models.

Claude code always gives me rate limits. Claude through copilot is a bit slow, and copilot has constant network request issues or something, but at least I don't get rate limited as often.

At least local models always work, are faster (50+ tps with qwen3.5 35b a4b on a 4090) and most importantly never hit a rate limit.

acchow · 2026-03-01T09:45:32 1772358332

> Claude code always gives me rate limits

> 50+ tps with qwen3.5 35b a4b on a 4090

But qwen3.5 35b is worse than even Claude Haiku 4.5. You could switch your Claude Code to use Haiku and never hit rate limits. Also gets similar 50tps.

CapsAdmin · 2026-03-01T13:17:05 1772371025

I haven't tried 4.5 haiku much, but i was not impressed with previous haiku versions.

My goto proprietary model in copilot for general tasks is gemini 3 flash which is priced the same as haiku.

The qwen model is in my experience close to gemini 3 flash, but gemini flash is still better.

Maybe it's somewhat related to what we're using them for. In my case I'm mostly using llms to code Lua. One case is a typed luajit language and the other is a 3d luajit framework written entirely in luajit.

I forgot exactly how many tps i get with qwen, but glm 4.7 flash, which is really good (for a local model), gets me 120tps and a 120k context.

Don't get me wrong, proprietary models are superior, but local models are getting really good AND useful for a lot of real work.

nodakai · 2026-03-01T06:18:02 1772345882

I also started playing with 3.5 Flash and was impressed.

It’s 2× faster than its competitors. For tasks where “one-shotting” is unrealistic, a fast iteration loop makes a measurable difference in productivity.

mycall · 2026-03-01T14:51:59 1772376719

TDD is really the delineation between being successful or not when using [local] LLMs.

Aurornis · 2026-03-01T16:27:40 1772382460

> some opensource models really are strong and useful

To be clear I never said they weren’t strong or useful. I use them for some small tasks too.

I said they’re not equivalent to SOTA models from 6 months ago, which is what is always claimed.

Then it turns into a Motte and Bailey game where that argument is replaced with the simpler argument that they’re useful for open weights models. I’m not disagreeing with that part. I’m disagreeing with the first assertion that they’re equivalent to Sonnet 4.5.

kir-gadjello · 2026-03-01T16:52:05 1772383925

They are not equivalent 1:1, esp. in knowledge coverage (given OOM param size difference) and in taste (Sonnet wins, but for taste one can also use Kimi K2.5), but in my hardcore use (high-performance realtime simulations of various kinds) I would prefer StepFun-3.5-Flash to Sonnet 4 strongly and to 4.5 often enough without a decisive advantage in using exclusively Sonnet 4.5. For truly hard tasks or specifications I would turn to 5.2 or 5.3-codex of course - but one KPI for quality of my work as a lead engineer is to ensure that truly hard tasks are known, bounded and planned-for in advance.

Maybe my detailed, requirement-based/spec-based prompting style makes the difference between anthropic's and OSS models smaller and people just like how good Anthropic's models are at reading the programmer's intent from short concise prompts.

Frankly, I think the 1:1 equivalent is an impossible standard given the set of priorities and decisions frontier labs make when setting up their pre-, mid- and post-training pipelines, and benchmark-wise it is achievable for a smaller OSS model to align with Sonnet 4.5 even on hard benchmarks.

Given the relatively underwhelming Sonnet 4.5 benchmarks [1], I think StepFun might have an edge over it esp. in Math/STEM [2] - even an old deepseek-3.2 (not speciale!) had a similar aggregate score. With 4.6 Anthropic ofc vastly improved their benchmark game, and it now truly looks like a frontier model.

1. https://artificialanalysis.ai/models/claude-4-5-sonnet-think...

2. https://matharena.ai/models/stepfun_3_5_flash

aappleby · 2026-03-01T01:15:22 1772327722

What are you running that model on?

kir-gadjello · 2026-03-01T01:18:20 1772327900

I just use openrouter, it's free for now. But I would pay 30-100$ to use it 24/7.

aappleby · 2026-03-01T01:36:09 1772328969

Ah, I thought you meant you were running it locally.

Aerroon · 2026-03-01T18:08:28 1772388508

Have you tried Minimax M2.5? How did it compare?

kir-gadjello · 2026-03-02T19:17:18 1772479038

Much worse - from my experience minimax is not suitable for high autonomy on hard projects. The real distant second in my experience is mimo flash v2 (but I did not try the latest version, might be closer to parity). I would not use minimax for serious work.

StepFun 3.5 Flash is better compared to google's gemini 3 flash which is surprisingly good and pretty costly, and to GLM-5.

I find this outcome ironic given minimax's more aggressive marketing and large-scale distillation accusations from Anthropic specifically accusing minimax but not StepFun.

I can only wonder about the true underlying reasons, but deducing from public information I suspect that minimax simply has weaker, benchmaxx-targeting post-training R&D and leans more on distillation of western frontier models, while StepFun has extensive post-training with lots of hard-won custom R&D and internal large-scale distillation teachers.

Aerroon · 2026-03-02T19:41:49 1772480509

Interesting. I'm surprised you feel that it's better than GLM 5 - these models are in different weight classes after all.

I tried it out a bunch and it seems good. I can't really tell if it's better or worse than most of these other models in such a short time though.

kir-gadjello · 2026-03-02T19:53:53 1772481233

I don't think it's strictly better than GLM 5, more like they are peers (but in math competitions StepFun is stronger than most), and in my experience have similar coding/bugfix ceiling where world knowledge is not the deciding factor. But I didn't test GLM 5 for more than 30 hours, and my agentic harness (opencode) might be suboptimal - I'm open to the idea that GLM 5 with the right agentic harness is ready for ultra-long autonomy, but I have yet to see it myself.

Where GLM 5 is strictly worse for me though, compared to StepFun, is long-form content generation (planning, research documents) - but this can be said about geminis too and these are obviously very smart models.

Given the free option I'd explore GLM 5 more, but if I had to pay for it myself ofc I'd choose stepfun every time. Basically I think right now the optimal configuration for maximizing output of correct software features per dollar involves using StepFun or its future class competitor for bulk coding and first stage code review.

Maybe I need to write a blogpost about it after all.

Aerroon · 2026-03-03T02:17:11 1772504231

I tried them both out with a task of creating a todo-like web app (you can use the chat interface for GLM 5 for free if there's capacity). GLM 5 ended up with a working version. Sadly StepFun didn't quite function right. The main issue was that it ended up putting everything that should be in different columns into a single one. I didn't prompt it further to fix it, but it seems relatively capable. I think it beat what the big Qwen model came up with.

What's really surprising to me is the cost of the model. It's definitely very good for its price. DeepSeek is the only one that offers any competition to it at that price point (GLM 5 is literally 10x more expensive).

FuckButtons · 2026-03-01T02:38:48 1772332728

A 3 bit quant will run on a 128gb MacBook Pro, it works pretty well.
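For a rough sense of why a 3-bit quant fits in 128 GB: weight memory is approximately parameter count times bits per weight, divided by 8. The sketch below assumes the model in question is the 196B-parameter StepFun model mentioned upthread. It ignores the KV cache, activations, and the per-block scale overhead that real quant formats add, so treat the numbers as lower bounds.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB: params * bits / 8 bytes."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 196e9  # assumption: the 196B total quoted upthread

for bits in (16, 8, 4, 3):
    gb = weight_memory_gb(TOTAL_PARAMS, bits)
    fits = "fits" if gb < 128 else "does not fit"
    print(f"{bits:>2}-bit: {gb:6.1f} GB of weights ({fits} in 128 GB before overhead)")
```

By this estimate the 3-bit weights alone take about 73.5 GB, which leaves headroom for context on a 128 GB machine, while 8-bit and 16-bit do not fit at all.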

nl · 2026-03-01T02:43:20 1772333000

A 3 bit quant is quite a lot weaker than the OpenRouter version the OP is using.

lend000 · 2026-03-01T06:08:00 1772345280

Yes and no. "Last-gen" (like, from 6 months ago) frontier models do still tend to outperform the best open source models. But some models, especially GLM-5, really have captured whatever circuitry drives pattern matching in the models they were trained off of.

I like this benchmark that competes models against one another in competitive environments, which seems like it can't really be gamed: https://gertlabs.com

Aurornis · 2026-03-01T16:29:37 1772382577

> Yes and no. "Last-gen" (like, from 6 months ago) frontier models do still tend to outperform the best open source models

That’s exactly what I said, though. The headline we’re commenting under claims they’re Sonnet 4.5 level but they’re not.

I don’t disagree that they’re powerful for open models. I’m pointing out that anyone reading these headlines who expects a cheap or local Sonnet 4.5 is going to discover that it’s not true.

wolvoleo · 2026-03-01T01:03:54 1772327034

All models are doing that. Not only the open source ones.

I bet the cloud ones are doing it a lot more because they can also affect the runtime side which the open source ones can't.

red75prime · 2026-03-01T05:15:05 1772342105

I wouldn't mind them benchmaxing my queries.

dimgl · 2026-03-01T02:42:51 1772332971

I'm using Qwen 3.5 27b on my 4090 and let me tell you. This is the first time I am seriously blown away by coding performance on a local model. They are almost always unusable. Not this time though...

jonaustin · 2026-03-05T02:16:22 1772676982

122b is probably better; especially on a mac with 128gb memory.

localllama thread on this: https://www.reddit.com/r/LocalLLaMA/comments/1rk01ea/qwen351... (see comments for actual real usage rather than benchmarks)

But for nvidia gpus 27b on a 3090 or similar is where it's at for sure.

smahs · 2026-03-01T13:43:08 1772372588

27B dense model is probably the best in the 3.5 lot, not absolutely but for perf:size. It's also pretty good at prose, which is a rarity for a Qwen.

bibstha · 2026-03-01T06:56:11 1772348171

You don't need a coding version of the model from Qwen? The 3.5 works?

rudhdb773b · 2026-03-01T02:01:11 1772330471

Are there any up-to-date offline/private agentic coding benchmark leaderboards?

If the tests haven't been published anywhere and are sufficiently different from standard problems, I would think the benchmarks would be robust to intentional over optimization.

Edit: These look decent and generally match my expectations:

https://www.apex-testing.org/

chaboud · 2026-03-01T01:28:38 1772328518

"When a measure becomes a target, it ceases to be a good measure."

Goodhart's law shows up with people, in system design, in processor design, in education...

Models are going to be over-fit to the tests unless scruples or practical application realities intervene. It's a tale as old as machine learning.

spwa4 · 2026-03-01T13:28:32 1772371712

This is because of the forbidden argument in statistics. Any statistic, even something so basic as an average, ONLY works if you can guarantee the independence of the individual facts it measures.

But there's a problem with that: of course the existence of the statistical measure itself is very much a link between all those individual facts. In other words: if there is ANY causal link between the statistical measure and the events measured ... it has now become bullshit (because the law of large numbers doesn't apply anymore).

So let's put it in practice, say there's a running contest, and you display the minimum, maximum and average time of all runners that have had their turns. We all know what happens: of course the result is that the average trends up. And yet, that's exactly what statistics guarantees won't happen. The average should go up and down with roughly 50% odds when a new runner is added. This is because showing the average causes behavior changes in the next runner.

This means, of course, that basing a decision on something as trivial as what the average running time was last year can only be mathematically defensible ONCE. The second time the average is wrong, and you're basing your decision on wrong information.

But of course, not only will most people actually deny this is the case, this is also how 99.9% of human policy making works. And it's mathematically wrong! Simple, fast ... and wrong.
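The "roughly 50% odds" baseline in the running example can be checked with a small simulation. This is a sketch under stated assumptions: finish times are drawn independently from a symmetric (normal) distribution with made-up parameters (60 s mean, 5 s spread), so no runner reacts to the displayed average. The running mean rises exactly when the new time exceeds the current mean.

```python
import random


def fraction_mean_increases(n_runners: int, seed: int = 0) -> float:
    """Fraction of new runners whose time pushes the running average up.

    Times are i.i.d. normal draws (hypothetical parameters), i.e. the
    independent case where displaying the average changes nobody's behavior.
    """
    rng = random.Random(seed)
    total = rng.gauss(60.0, 5.0)  # first runner's time
    count = 1
    ups = 0
    for _ in range(n_runners - 1):
        t = rng.gauss(60.0, 5.0)
        # New mean (total + t) / (count + 1) exceeds total / count iff t > old mean.
        if t > total / count:
            ups += 1
        total += t
        count += 1
    return ups / (n_runners - 1)


print(f"share of runners raising the average: {fraction_mean_increases(20_000):.3f}")
```

Under independence the share hovers near one half; a sustained drift in one direction is the signature of runners reacting to the displayed number. For skewed distributions the independent baseline is not exactly 50%, so the symmetric draw is doing real work in this sketch.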

ekianjo · 2026-03-01T04:26:54 1772339214

> That said, they are impressive for open source models.

there is nothing open "source" about them. They are open weights, that's all.

crystal_revenge · 2026-03-01T01:41:52 1772329312

> they always disappoint in actual use.

I’ve switched to using Kimi 2.5 for all of my personal usage and am far from disappointed.

Aside from being much cheaper than the big names (yes, I’m not running it locally, but like that I could) it just works and isn’t a sycophant. Nice to get coding problems solved without any “That’s a fantastic idea!”/“great point” comments.

At least with Kimi my understanding is that beating benchmarks was a secondary goal to good developer experience.

regularfry · 2026-03-01T12:37:37 1772368657

Just going to echo this. Been using K2.5 in opencode as a switch away from Opus because it was too expensive for the sorts of things I was playing with, and it's been great. There's definitely a bit of learning to get the hang of what sort of prompts to give it and to make sure there's enough documentation in the project for it, but it's remarkably capable once you're in the swing of it.

amelius · 2026-03-01T00:14:18 1772324058

Are you saying that the benchmarks are flawed?

And could quantization maybe partially explain the worse than expected results?

TrainedMonkey · 2026-03-01T00:29:21 1772324961

No, what he is saying is that benchmarks are static and there is tremendous reputational and financial pressure to make benchmark number go up. So you add specific problems to training data... The result is that the model is smarter, but the benchmarks overstate the progress. Sure there are problem sets designed to be secret, but keeping secrets is hard given the fraction of planetary resources we are dedicating to making the AI numbers go up.

I have two of my own comments to add to that. First one is that there is problem alignment at play. Specifically - the benchmarks are mostly self-contained problems with well defined solutions and specific prompt language, human tasks are open ended with messy prompts and much steerage. Second is that it would be interesting to test older models on brand new benchmarks to see how those compare.

Aurornis · 2026-03-01T00:33:45 1772325225

> No, what he is saying is that benchmarks are static and there is tremendous reputational and financial pressure to make benchmark number go up.

That's a much better way to say it than I did.

These models are known for being open weights but they're still products that Alibaba Cloud is trying to sell. They have Product Managers and PR and marketing people under pressure to get people using them.

This Venture Beat article is basically a PR piece for the models and Alibaba Cloud hosting. The pricing table is right in the article.

It's cool that they release the models for us to use, but don't think they're operating entirely altruistically. They're playing a business game just like everyone else.

amelius · 2026-03-01T01:24:13 1772328253

There should be a way to turn the questions we ask LLMs into benchmarks.

That way, we can have a benchmark that is always up to date.

lurkshark · 2026-03-01T15:14:05 1772378045

There are a few “updating” benchmarks out there. I periodically take a look at these two:

https://swe-rebench.com/

https://livebench.ai/

Aurornis · 2026-03-01T00:18:55 1772324335

The models outperform on the benchmarks relative to general tasks.

The benchmarks are public. They're guaranteed to be in the training sets by now. So the benchmarks are no longer an indicator of general performance because the specific tasks have been seen before.

> And could quantization maybe explain the worse than expected results?

You can use the models through various providers on OpenRouter cheaply without quantization.

girvo · 2026-03-01T00:21:32 1772324492

Flawed? Possibly, but I think it's more that any kind of benchmark then becomes a target, and is inherently going to be a "lossy" signal as to the model's actual ability in practice.

Quantisation doesn't help, but even running full fat versions of these models through various cloud providers, they still don't match Sonnet in actual agentic coding uses: at least in my experience.

noosphr · 2026-03-01T01:14:47 1772327687

It's not just the open source ones.

The only benchmarks worth anything are dynamic ones which can be scaled up.

jackblemming · 2026-03-01T00:37:19 1772325439

Death by KPIs. Management makes it too risky to do anything but benchmaxx. It will be the death of American AI companies too. Eventually, people will notice models aren’t actually getting better and the money will stop flowing. However, this might be a golden age of research as cheap GPUs flood the market and universities have their own clusters.

warpspin · 2026-03-01T12:13:02 1772367182

Hmm, I second this. Haven't compared Qwen3.5 122B yet, but played around with OpenCode + Qwen3-Coder-Next yesterday and did manual comparisons with Claude Code and Claude Code is still far ahead in general felt "intelligence quality".

neop1x · 2026-03-02T09:25:58 1772443558

Depends on what you expect from the model. For coding/agentic tasks there is SWE Bench https://www.swebench.com/ which gives a better picture. MiniMax, GLM and Kimi K2 seem to be better models for this purpose than Qwen. And it matches my (limited) actual experience.

ekjhgkejhgk · 2026-03-01T14:03:14 1772373794

I've been trying to get these things to run locally and use tools. Am I right in understanding that it's impossible for these things to use tools from within llama.cpp? Do I need another "thing" to run the models? What exactly is the mechanism by which the models become aware that they're somewhere where they have tools available? So many questions...
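On the mechanism question, a minimal sketch of the common OpenAI-style convention: the client sends tool schemas alongside the messages, the server's chat template renders them into the prompt, the model replies with a structured tool call, and the client (not the model) executes it and sends the result back. llama.cpp's `llama-server` exposes an OpenAI-compatible chat endpoint; as far as I know tool calls also depend on the model's chat template being applied, so check the llama.cpp function-calling docs for your build. Everything below is illustrative: `word_count`, `build_request`, and the mocked response are hypothetical, and no server is contacted.

```python
import json
from typing import Any, Callable


def word_count(text: str) -> int:
    """Hypothetical tool: count whitespace-separated words."""
    return len(text.split())


# Local registry mapping tool names to Python callables.
TOOLS: dict[str, Callable[..., Any]] = {"word_count": word_count}

# JSON-schema description sent to the server so the model knows the tool exists.
TOOL_SCHEMAS = [{
    "type": "function",
    "function": {
        "name": "word_count",
        "description": "Count whitespace-separated words in a string.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}]


def build_request(model: str, user_message: str) -> dict:
    """Body for an OpenAI-style chat completion request with tools attached."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": TOOL_SCHEMAS,
    }


def execute_tool_call(call: dict) -> dict:
    """Validate one tool call from the model, run it, and build the reply message."""
    fn = call.get("function") or {}
    name = fn.get("name")
    if not isinstance(name, str):
        # Same class of failure as the error quoted upthread.
        raise ValueError("Expected 'function.name' to be a string")
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    args = json.loads(fn.get("arguments") or "{}")  # arguments arrive as a JSON string
    result = TOOLS[name](**args)
    return {"role": "tool", "tool_call_id": call.get("id"), "content": json.dumps(result)}


# Mocked tool call, shaped like an entry in an assistant message's `tool_calls`.
mock_call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "word_count", "arguments": '{"text": "local models use tools"}'},
}
print(execute_tool_call(mock_call))
```

The returned `tool` message is appended to the conversation and the request is sent again, which is the loop a coding agent runs on your behalf.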

baq · 2026-03-01T10:12:39 1772359959

they're distilling claude and openai obviously.

that said, sonnet 4.5 is not a good model today, March 1st 2026. (it blew my mind on its release day, September 29th, 2025.)

weirdmantis69 · 2026-03-02T23:04:28 1772492668

How much computer do you need to make them work like Sonnet 4.5 from claude but locally?

eurekin · 2026-03-01T01:05:04 1772327104

Very good point. I'm playing with them too and got to the same conclusion.

mstaoru · 2026-02-28T22:59:05 1772319545

I periodically try to run these models on my MBP M3 Max 128G (which I bought with a mind to run local AI). I have a certain deep research question (in a field that is deeply familiar to me) that I ask when I want to gauge a model's knowledge.

So far Opus 4.6 and Gemini Pro are very satisfactory, producing great answers fairly fast. Gemini is very fast at 30-50 sec, Opus is very detailed and comes at about 2-3 minutes.

Today I ran the question against local qwen3.5:35b-a3b - it puffed for 45 (!) minutes, produced a very generic answer with errors, and made my laptop sound like it's going to take off any moment.

Wonder what am I doing wrong?.. How am I supposed to use this for any agentic coding on a large enough codebase? It will take days (and a 3M Peltor X5A) to produce anything useful.

lm28469 · 2026-02-28T23:14:21 1772320461

> Wonder what am I doing wrong?

You're comparing 100b parameters open models running on a consumer laptop VS private models with at the very least 1t parameters running on racks of bleeding edge professional gpus

Local agentic coding is closer to "shit me the boiler plate for an android app" not "deep research questions", especially on your machine

vlovich123 · 2026-02-28T23:26:43 1772321203

The hardware difference explains runtime performance differences, not task performance.

Speculation is that the frontier models are all below 200B parameters but a 2x size difference wouldn’t fully explain task performance differences

nl · 2026-03-01T03:01:32 1772334092

> Speculation is that the frontier models are all below 200B parameters

Some versions of some of the models are around that size, which you might hit for example with the ChatGPT auto-router.

But the frontier models are all over 1T parameters. Source: watch interviews with people who have left one of the big three labs and now work at the Chinese labs and are talking about how to train 1T+ models.

NamlchakKhandro · 2026-03-01T02:08:58 1772330938

> The hardware difference explains runtime performance differences, not task performance.

Yes it does.

MrDrMcCoy · 2026-03-01T05:11:29 1772341889

Care to elaborate?

BoredomIsFun · 2026-03-01T10:17:02 1772360222

Certainly not Opus. That beast feels very heavy - the coherence of longer form prose is usually a good marker, and it is able to spit 4000 words coherent short stories from a single shot.

827a · 2026-03-01T03:20:29 1772335229

He's running a 35B parameter model. Frontier models are well over a trillion parameters at this point. Parameters = smarts. There are 1T+ open source models (e.g. GLM5), and they're actually getting to the point of being comparable with the closed source models; but you cannot remotely run them on any hardware available to us.

Core speed/count and memory bandwidth determine your performance. Memory size determines your model size which determines your smarts. Broadly speaking.

regularfry · 2026-03-01T15:38:17 1772379497

The architecture is also important: there's a trade-off for MoE. There used to be a rough rule of thumb that a 35bxa3b model would be equivalent in smarts to an 11b dense model, give or take, but that's not been accurate for a while.
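One commonly cited version of that rule of thumb is the geometric mean of total and active parameters. Whether that is the exact rule meant here is an assumption, and as the comment says, the heuristic is considered outdated, so this only shows where a figure near 11b would come from.

```python
import math


def dense_equivalent_b(total_b: float, active_b: float) -> float:
    """Geometric-mean heuristic: sqrt(total * active), in billions of parameters.

    A folk rule of thumb, not a law; shown only to reproduce the rough figure.
    """
    return math.sqrt(total_b * active_b)


# (total, active) in billions, using sizes quoted in this thread.
for name, total_b, active_b in [("35b-a3b", 35, 3), ("196B/11B active", 196, 11)]:
    est = dense_equivalent_b(total_b, active_b)
    print(f"{name}: ~{est:.1f}b dense-equivalent by this heuristic")
```

For 35b-a3b this gives a little over 10b, in line with the "11b, give or take" figure.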

BoredomIsFun · 2026-03-01T10:18:16 1772360296

> There are 1T+ open source models (e.g. GLM5),

GLM-5 is ~750B model.

ses1984 · 2026-02-28T23:30:06 1772321406

Who would have thought ai labs with billions upon billions of r&d budget would have better models than a free alternative.

shlomo_z · 2026-03-01T06:22:08 1772346128

I'll add, AI Labs put a lot of resources into allowing the AI to search the web.. that makes a big difference

mstaoru · 2026-03-01T09:06:27 1772355987

I use search as well via openwebui + searxng.

delaminator · 2026-02-28T23:32:44 1772321564

Looks at the headline: Qwen3.5 122B and 35B models offer Sonnet 4.5 performance on local computers

lm28469 · 2026-02-28T23:40:41 1772322041

Yes and Devstral 2 24b q4 is supposed to be 90% as good but it can't even reliably write to a file on my machine.

There are the benchmarks, the promises, and what everybody can try at home

8note · 2026-02-28T23:59:42 1772323182

maybe a harness problem?

SyneRyder · 2026-03-01T15:20:31 1772378431

Having tried the Mistral Vibe harness that was supposedly designed for Devstral, that thing is abysmal. I feel sorry for whatever they did to that model, it didn't deserve it.

The thing I most noticed was asking it for help with configuring local MCP servers in Mistral Vibe - something it supports, it literally shows how many MCP servers are connected on the startup screen - it then begins scanning my local machine for servers running "MineCraft Protocol".

I want Mistral to do well, and I use their Voxtral Transcribe 2, that one has been useful. I'd even like a well made Mistral Vibe (c'mon, "oui oui baguette" is a hilarious replacement for "thinking"). But Mistral are so far behind, and they don't seem to even know or accept that they are.

aspenmartin · 2026-02-28T23:08:48 1772320128

Well Opus and Gemini are probably running on multiple H200 equivalents, maybe multiple hundreds of thousands of dollars of inference equipment. Local models are inherently inferior; even the best Mac that money can buy will never hold a candle to latest generation Nvidia inference hardware, and the local models, even the largest, are still not quite at the frontier, least of all the ones you can plausibly run on a laptop (where "plausible" really is "45 minutes and making my laptop sound like it is going to take off at any moment"). Like they said, you're getting sonnet 4.5 performance, which is 2 generations ago; speaking from experience, opus 4.6 is night and day compared to sonnet 4.5.

zozbot234 · 2026-02-28T23:15:20 1772320520

> Well Opus and Gemini are probably running on multiple H200 equivalents, maybe multiple hundreds of thousands of dollars of inference equipment.

But if you've got that kind of equipment, you aren't using it to support a single user. It gets the best utilization by running very large batches with massive parallelism across GPUs, so you're going to do that. There is such a thing as a useful middle ground that may not give you the absolute best in performance but will be found broadly acceptable and still be quite viable for a home lab.

aspenmartin · 2026-02-28T23:50:38 1772322638

Batching helps with efficiency but you can’t fit opus into anything less than hundreds of thousands of dollars in equipment

Local models are more than a useful middle ground: they are essential and will never go away. I was just addressing the OP's question about why he observed the difference he did. One is an API call to the world's most advanced compute infrastructure and another is running on a $500 CPU.

Lots of uses for small, medium, and larger models; they all have important places!!

holoduke · 2026-03-01T06:51:25 1772347885

Your Gemini or Opus question got sent to a Texas datacenter where it got queued and processed by a subunit of 80 h200 140gb 1000w cards running a many billion or trillion parameter model. It took less than 200ms to process a single request. Your Claude client decided to spawn 30 sub agents and iterated in a total of 90 requests totalling about 45000ms. Now compare that to your 100b transistor cpu doing something similar. Yes that would be slow.

mstaoru · 2026-03-01T09:13:29 1772356409

Right, it was more of a rhetorical question :) With my point being - how are these local models really useful to me now? Is the Only Way ™ to sell my house and build an 8x5090 monster?.. How does that compare to $20/month Opus? (Privacy aside.)

The second order thought from this is... will we get a value-based price leveling soon? If the alternative to a hosted LLM is to build $10-20k+ machine with $500+ monthly energy bills, will hosted price asymptotically climb up to reflect this reality?

Something to think about.

regularfry · 2026-03-01T14:03:41 1772373821

Looked at from the other end of the telescope, the other factor is how fast low-end local models can gain capability. This 35b model is absolutely fine on a 4090 in a machine that was about £3000 when I bought it three years ago. Where will what you can run on a 4090, or a 5090, be in six months? That's the interesting question, but we're already well past the point where the uses to which you will be able to put a local model dramatically increase within the depreciation lifespan of the hardware.

etyhhgfff · 2026-03-01T10:13:30 1772360010

We would need a super high end AI accelerator with specialised cooling for less than 3k bucks to make it happen. Consumer gaming graphics cards won't fit the bill. Problem is all TSMC capacity is already booked for years to come by the big players to build data center grade hardware with price tags and setup requirements out of consumer reach.

wolvoleo · 2026-03-01T01:10:27 1772327427

Well first of all you're running a long intense task on a thermally constrained machine. Your MacBook Pro is optimised for portability and battery life, not max performance under load. And apple's obsession with thinness overrules thermal performance for them. Short peaks will be ok but a 45 minute task will thoroughly saturate the cooling system.

Even on servers this can happen. At work we have a 2U sized server with two 250W class GPUs. And I found that by pinning the case fans at 100% I can get 30% more performance out of GPU tasks which translates to several days faster for our usecase. It does mean I can literally hear the fans screaming in the hallway outside the equipment room but ok lol. Who cares. But a laptop just can't compare.

Something with a desktop GPU or even better something with HBM3 would run much better. Local models get slow when you use a ton of context and the memory bandwidth of a MacBook Pro while better than a pc is still not amazing.

And yeah the heaviest tasks are not great on local models. I tend to run the low hanging fruit locally and the stuff where I really need the best in the cloud. I don't agree local models are on par, however I don't think they really need to be for a lot of tasks.

pamcake · 2026-03-01T01:41:20 1772329280

To your point, one can get a great performance boost by propping the laptop onto a roost-like stand in front of a large fan. Nothing like a cooling system actually built for sustained load but still.

meatmanek · 2026-03-01T03:54:16 1772337256

I've seen reports of qwen3.5-35b-a3b spending a ton of time reasoning if the context window is nearly empty-- supposedly it reasons less if you provide a long system prompt or some file contents, like if you use it in a coding agent.

I'm too GPU-poor to run it, but r/LocalLLaMA is full of people using it.

boutell · 2026-03-01T12:14:54 1772367294

Can confirm. I gave it a variant of the car wash question on a MacBook M4 with 32 GB of RAM. It produced output at a conversational speed, sure, but that started with 6 minutes of thinking output. 6 minutes.

On the plus side, it did figure out the question even without the first sentence that's intended as a bit of a giveaway.

regularfry · 2026-03-01T13:54:32 1772373272

There's definitely something wrong with the thinking mode on this one. I wouldn't be surprised if it gets fixed, either by qwen themselves or with a fine-tune.

adam_patarino · 2026-03-01T12:48:14 1772369294

The biggest gaps are not in hardware or model size. There is a lot of logical fallacy in the industry. Most people believe bigger is better. For model size, compute, tools, etc.

The reality in ML is that small models can perform better at a narrow problem set than large ones.

The key is the narrow problem set. Opus can write you a poem, create a shopping list, and analyze your massive code base.

We trained our model to only focus on coding with our specific agent harness, tools, and context engine. And it’s small enough to fit on an M2 16GB. It’s as good as sonnet 4.5 and way better than qwen3.5:35b-a3b

Our beta will be out soon / rig.ai

amritananda · 2026-03-02T00:44:42 1772412282

No benchmarks, no information about training methods/datasets, template placeholder vibe-coded website. Waste of time.

__mharrison__ · 2026-02-28T23:37:00 1772321820

Were you using mlx-lm? I've had good performance with that on Macs. (Sadly, the lead developer just left Apple.)

Admittedly, I haven't tried these models on my Mac, but I have on my DGX Spark, and they ran fine. I didn't see the slowdown you're mentioning.

mstaoru · 2026-03-01T09:09:11 1772356151

(I think) yes, via the latest openwebui + ollama.

CamperBob2 · 2026-02-28T23:56:39 1772322999

Try the 27B dense model. It will likely do much better than the 35b MoE with only 3B active parameters.

Also, performance on research-y questions isn't always a good indicator of how the model will do for code generation or agent orchestration.

regularfry · 2026-03-01T14:05:10 1772373910

Currently sat waiting for the unsloth fixed quants to drop, but I'm on the edge of my seat for this.

Balinares · 2026-03-01T21:56:51 1772402211

Wait, didn't they drop like two days ago?

regularfry · 2026-03-02T15:22:50 1772464970

The 35b did but not the 27b. Looks like the latter has been updated in the last half hour.

Balinares · 2026-03-02T22:22:16 1772490136

Neat! Thanks for correcting me there. I'll go and take a look.

zozbot234 · 2026-02-28T23:01:57 1772319717

Running local AI models on a laptop is a weird choice. The Mini and especially the Studio form factor will have better cooling, lower prices for comparable specs and a much higher ceiling in performance and memory capacity.

stavros · 2026-02-28T23:09:49 1772320189

I can never see the point, though. Performance isn't anywhere near Opus, and even that gets confused following instructions or making tool calls in demanding scenarios. Open weights models are just light years behind.

I really, really want open weights models to be great, but I've been disappointed with them. I don't even run them locally, I try them from providers, but they're never as good as even the current Sonnet.

vunderba · 2026-03-01T00:16:29 1772324189

I can't speak to using local models as agentic coding assistants, but I have a headless 128GB RAM machine serving llama.cpp with a number of local models that I use on a daily basis.

- Qwen3-VL picks up new images in a NAS, auto captions and adds the text descriptions as a hidden EXIF layer into the image, which is used for fast search and organization in conjunction with a Qdrant vector database.

- Gemma3:27b is used for personal translation work (mostly English and Chinese).

- Llama3.1 spins up for sentiment analysis on text.

stavros · 2026-03-01T00:34:47 1772325287

Ah yeah, self-contained tasks like these are ideal, true. I'm more using it for coding, or for running a personal assistant, or for doing research, where open weights models aren't as strong yet.

vunderba · 2026-03-01T00:58:23 1772326703

Understood. Research would make me especially leery; I’d be afraid of losing any potential gains as I'd feel compelled to always go and validate its claims (though I suppose you could mitigate it a little bit with search engine tooling like Kagi's MCP system).

andoando · 2026-02-28T23:18:45 1772320725

They're great for some product use cases where you don't need frontier models.

stavros · 2026-02-28T23:21:40 1772320900

Yeah, for sure, I just don't have many of those. For example, the only use I have for Haiku is for summarizing webpages, or Sonnet for coding something after Opus produces a very detailed plan.

Maybe I should try local models for home automation, Qwen must be great at that.

lm28469 · 2026-02-28T23:17:28 1772320648

They're like 6 months away on most benchmarks, people already claimed coding was solved 6 months ago, so which is it? The current version is the baseline that solves everything but as soon as the new version is out it becomes utter trash and barely usable.

zozbot234 · 2026-02-28T23:21:46 1772320906

That's very large models at full quantization though. Stuff that will crawl even on a decent homelab, despite being largely MoE based and even quantization-aware, hence reducing the amount and size of active parameters.

stavros · 2026-02-28T23:20:01 1772320801

That's just a straw man. Each frontier model version is better than the previous one, and I use it for harder and harder things, so I have very little use for a version that's six months behind. Maybe for simple scripts they're great, but for a personal assistant bot, even Opus 4.6 isn't as good as I'd like.

mstaoru · 2026-03-01T15:32:27 1772379147

So it's back to the original question, why spend $5-10k on the Studio, when it will still be 10x slower and half the intelligence vs. $20 Sonnet?.. What is the point (besides privacy) to use local models now for coding?

PS: I can understand that isolated "valuable" problems like sorting photo collection or feeding a cat via ESPHome can be solved with local models.

NorwegianDude · 2026-03-01T22:37:20 1772404640

At least for me, it's cheap. Even Claude Haiku 4.5 would cost over $60 each day for the same token amount, after accounting for electricity costs. I have the hardware for other reasons anyway, so why not use it, avoid privacy issues and save money.

Are the LLMs very useful? That is a whole other discussion...

zozbot234 · 2026-03-01T17:43:11 1772386991

You can't use a $20 Sonnet subscription for general agentic use cases, you have to pay for API use on a per-token basis. The $20 and $200 subscriptions are widely considered unsustainable as such. If anything, the real competition is third-party cheap inference providers.

wat10000 · 2026-03-01T00:04:35 1772323475

I have a laptop already, so that's what I'm going to use.

satvikpendem · 2026-02-28T23:38:57 1772321937

I can take a laptop on the train.

notreallya · 2026-02-28T23:04:33 1772319873

Sonnet 4.5 level isn't Opus 4.6 level, simple as

xtn · 2026-03-01T03:57:49 1772337469

I think knowledge of frontier research certainly scales with the number of parameters. Also, US labs can pay more money to have researchers provide training data on these frontier research areas.

On the other hand, if indeed open source models and Macbooks could be as powerful as those SOTA models from Google, etc, then the stock prices of many companies would have already collapsed.

jonaustin · 2026-03-08T22:38:10 1773009490

Try with qwen 3.5 122b; it has more parameters so a larger corpus of knowledge to draw from than 35b.

gigatexal · 2026-03-01T03:53:54 1772337234

I have the exact same hardware. Was going to do the same thing with the 122B model … I’ll just keep paying Anthropic; the models are just that good. Trying out Gemini too. But won’t pay OpenAI as they’re going to be helping Pete Hegseth to develop autonomous killing machines.

muyuu · 2026-03-01T07:12:51 1772349171

Depending on the specificity of the research, having a model with fewer parameters will come with a higher penalty. If you want a model to perform better at something specific while staying smaller, generally it will take specific training to achieve that.

culi · 2026-02-28T23:08:31 1772320111

Well you can't run Gemini Pro or Opus 4.6 locally so are you comparing a locally run model to cloud platforms?

furyofantares · 2026-02-28T23:08:53 1772320133

Can you try asking Sonnet 4.5 the same question, since that is what this model is claimed to be on par with?

rienko · 2026-02-28T23:39:45 1772321985

use a larger model like Qwen3.5-122B-A10B quantized to 4/5/6 bits depending on how much context you desire, MLX versions if you want best tok/s on Mac HW.

if you are able to run something like mlx-community/MiniMax-M2.5-3bit (~100gb), my guess is the results are much better than 35b-a3b.

andxor · 2026-02-28T23:55:49 1772322949

You're not doing anything wrong. The Chinese models are not as good as advertised. Surprise surprise!

alexpotato · 2026-02-28T22:23:33 1772317413

I recently wrote a guide on getting:

- llama.cpp

- OpenCode

- Qwen3-Coder-30B-A3B-Instruct in GGUF format (Q4_K_M quantization)

working on a M1 MacBook Pro (e.g. using brew).

It was a bit finicky to get all of the pieces together so hopefully this can be used with these newer models.

https://gist.github.com/alexpotato/5b76989c24593962898294038...

freeone3000 · 2026-02-28T22:52:21 1772319141

We can also run LM Studio and get it installed with one search and one click, exposed through an OpenAI-compatible API.

kpw94 · 2026-02-28T23:07:01 1772320021

On my 32GB Ryzen desktop (recently upgraded from 16GB before the RAM prices went up another +40%), did the same setup of llama.cpp (with Vulkan extra steps) and also converged on Qwen3-Coder-30B-A3B-Instruct (also Q4_K_M quantization)

On the model choice: I've tried latest gemma, ministral, and a bunch of others. But qwen was definitely the most impressive (and much faster inference thanks to MoE architecture), so can't wait to try Qwen3.5-35B-A3B if it fits.

I've no clue about which quantization to pick though ... I picked Q4_K_M at random, was your choice of quantization more educated?

zargon · 2026-03-01T02:23:03 1772331783

Quant choice depends on your vram, use case, need for speed, etc. For coding I would not go below Q4_K_M (though for Q4, unsloth XL or ik_llama IQ quants are usually better at the same size). Preferably Q5 or even Q6.

robby_w_g · 2026-02-28T22:51:57 1772319117

Does your MBP have 32 GB of ram? I’m waiting on a local model that can run decently on 16 GB

copperx · 2026-02-28T22:24:51 1772317491

How fast does it run on your M1?

jackcosgrove · 2026-03-01T02:02:42 1772330562

I am a total neophyte when it comes to LLMs, and only recently started poking around into the internals of them. The first thing that struck me was that float32 dimensions seemed very generous.

I then discovered what quantization is by reading a blog post about binary quantization. That seemed too good to be true. I asked Claude to design an analysis assessing the fidelity of 1, 2, 4, and 8 bit quantization. Claude did a good job, downloading 10,000 embeddings from a public source and computing a similarity score and correlation coefficient for each level of quantization against the float32 SoT. 1 and 2 bit quantizations were about 90% similar and 8 bit quantization was lossless given the precision Claude used to display the results. 4 bit was interesting as it was 99% similar (almost lossless) yet half the size of 8 bit. It seemed like the sweet spot.

This analysis took me all of an hour so I thought, "That's cool but is it real?" It's gratifying to see that 4 bit quantization is actually being used by professionals in this field.
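
A minimal sketch of that kind of fidelity check, for anyone who wants to repeat it. It is not the analysis Claude wrote: it assumes random Gaussian vectors instead of real embeddings, plain min-max uniform quantization (sign-only for 1 bit), and cosine similarity between each vector and its quantized copy as the fidelity score. All function names are made up for illustration.

```python
import math
import random

def quantize(vec, bits):
    """Quantize a vector to `bits` bits per dimension and map it back to floats."""
    if bits == 1:
        # Binary quantization keeps only the sign of each dimension.
        return [1.0 if x >= 0 else -1.0 for x in vec]
    levels = 2 ** bits - 1
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / levels
    if step == 0:
        return list(vec)  # constant vector, nothing to quantize
    # Snap every value to the nearest of the evenly spaced levels in [lo, hi].
    return [lo + round((x - lo) / step) * step for x in vec]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_fidelity(bits, n_vectors=200, dim=256, seed=0):
    """Average cosine similarity between random vectors and their quantized copies."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_vectors):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += cosine(v, quantize(v, bits))
    return total / n_vectors

if __name__ == "__main__":
    for bits in (1, 2, 4, 8):
        print(bits, round(mean_fidelity(bits), 4))
```

Real embeddings are not Gaussian and real quantizers use per-block scales, so the exact numbers will differ from anything measured on a public dataset; the point is only that the measurement itself is a few lines.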

deepsquirrelnet · 2026-03-01T02:22:15 1772331735

4-bit quantization on newer nvidia hardware is being supported in training as well these days. I believe the gpt-oss models were trained natively in MXFP4, which is a 4-bit floating point / e2m1 (2-exponent, 1 bit mantissa, 1 bit sign).

It doesn't seem terribly common yet though. I think it is challenging to keep it stable.

[1] https://www.opencompute.org/blog/amd-arm-intel-meta-microsof...

[2] https://www.opencompute.org/documents/ocp-microscaling-forma...

zozbot234 · 2026-03-01T17:53:45 1772387625

mxfp4 is a block-based floating point format. The E2M1 format applies to individual values, but each 32-value block also has a shared 8-bit exponent that provides scaling information for the whole block.
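
To make the block structure concrete, here is a rough decoding sketch. The 32-value block and the shared 8-bit scale come from the comment above; the exponent bias of 1 for E2M1, the bias of 127 for the scale, and the bit layout of the 4-bit code are my reading of the OCP microscaling spec, so treat them as assumptions. The function names are hypothetical, and the special NaN scale encoding is ignored.

```python
# The eight non-negative magnitudes an E2M1 value can take
# (2 exponent bits, 1 mantissa bit, assumed exponent bias 1).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_e2m1(code):
    """Decode a 4-bit code laid out as sign, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0b0111]

def decode_block(scale_byte, codes):
    """Decode one block: every element is multiplied by the shared power-of-two scale."""
    scale = 2.0 ** (scale_byte - 127)  # exponent-only scale, assumed bias 127
    return [decode_e2m1(c) * scale for c in codes]

def bits_per_value(block_size=32, element_bits=4, scale_bits=8):
    """Storage cost per value once the shared scale is spread over the block."""
    return element_bits + scale_bits / block_size
```

With a 32-value block the shared scale adds a quarter of a bit per value, which is why the format costs slightly more than a flat 4 bits.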

regularfry · 2026-03-01T14:07:18 1772374038

There's also work on ternary models that's quite interesting, because the arithmetic operations are super fast and they're extremely cache efficient. Well worth looking into if that's the sort of thing that interests you.

silisili · 2026-03-01T02:34:23 1772332463

Mind sharing any resources? I've been thinking about trying to understand them better myself.

jackcosgrove · 2026-03-01T02:55:36 1772333736

This is an ongoing course at CMU you can shadow.

https://modernaicourse.org/

tymscar · 2026-03-01T02:06:38 1772330798

That's cool.

I do wonder where that extra acuity you get from 1% more shows up in practice. I hate how I have basically no way to intuitively tell that because of how much of a black box the system is

doctorpangloss · 2026-03-01T02:08:39 1772330919

Well why would Claude know any of this? Obviously it's the wrong criteria. If you have your own dataset to benchmark, create your own calibration for quantization with it. Scientifically, you wouldn't really believe in the whole process of gradient descent if you didn't think tiny differences in these values matter. So...

tymscar · 2026-03-01T02:18:12 1772331492

I think you might be answering to a different person or misunderstanding what I said but you are right that just as I don’t have an intuition for where the acuity shows up in the corpus, I don’t think Claude does either

solarkraft · 2026-02-28T22:05:07 1772316307

Smells like hyperbole. A lot of people making such claims don’t seem to have continued real world experience with these models or seem to have very weird standards for what they consider usable.

Up until relatively recently, while people had already long been making these claims, they came with the asterisk of “oh, but you can’t practically use more than a few K tokens of context”.

derekp7 · 2026-02-28T23:11:47 1772320307

"Create a single page web app scientific RPN calculator"

Qwen 3.5 122b/a10b (at q3 using unsloth's dynamic quant) is so far the first model I've tried locally that gets a really usable RPN calculator app. Other models (even larger ones that I can run on my Strix Halo box) tend to either not implement the stack right, have non-functional operation buttons, or most commonly the keypad looks like a Picasso painting (i.e., the 10-key pad portion has buttons missing or mapped all over the keypad area).

This seems like such a simple test, but I even just tried it in chatgpt (whatever model they serve up when you don't log in), and it didn't even have any numerical input buttons. Claude Sonnet 4.6 did get it correct too, but that is the only other model I've used that gets this question right.

rienko · 2026-02-28T23:44:28 1772322268

We tend to find Qwen3-Coder-Next better at coding, at least on our anecdotal examples from our codebases. It's somewhat better at tool calling, maybe because the current templates for Qwen3.5 don't yet enjoy support as "mature" as Qwen3's on vllm. I can say that in my team MiniMax2.5 is currently the favorite.

airstrike · 2026-03-01T02:14:08 1772331248

is your prompt literally 1-sentence?

if so, a better approach would be to ask it to first plan that entire task and give it some specific guidance

then once it has the plan, ask it to execute it, preferably by letting it call other subagents that take care of different phases of the implementation while the main loop just merges those worktrees back

it's how you should be using claude code too, btw

nl · 2026-03-01T03:05:19 1772334319

Claude Sonnet can easily one-shot that without specifically asking for plan first.

airstrike · 2026-03-01T15:22:01 1772378521

I believe you, but performance on 10-word prompts is pretty useless as a metric

nl · 2026-03-02T01:25:57 1772414757

Why? Seems like a valid requirement to me?

I build micro apps from 10-word prompts multiple times a day.

tempest_ · 2026-02-28T22:26:03 1772317563

Qwen3-Coder-30B-A3B-Instruct is good I think for in line IDE integration or operating on small functions or library code but I don't think you will get too far with the one shot feature implementation that people are currently doing with Claude or whatever.

rubyn00bie · 2026-02-28T23:30:37 1772321437

I could be doing something wrong, but I have not had any success with one shot feature implementations for any of the current models. There are always weird quirks, undesired behaviors, bad practices, or just egregiously broken implementations. A week or so ago, I had instructed Claude to do something at compile-time and it instead burned a phenomenal amount of tokens before yeeting the most absurd, and convoluted, runtime implementation—- that didn’t even work. At work I use it (or Codex) for specific tasks, delegating specific steps of the feature implementation.

The more I use the cloud based frontier models, the more virtue I find in using local, open source/weights, models because they tend to create much simpler code. They require more direct interaction from me, but the end result tends to be less buggy, easier to refactor/clean up, and more precisely what I wanted. I am personally excited to try this new model out here shortly on my 5090. If I read the article correctly, it sounds like even the quantized versions have a “million”[1] token context window.

And to note, I’m sure I could use the same interaction loop for Claude or GPT, but the local models are free (minus the power) to run.

[1] I’m dubious it won’t shite itself at even 50% of that. But even 250k would be amazing for a local model when I “only” have 32GB of VRAM.

andy_ppp · 2026-02-28T22:53:08 1772319188

I have been adding a one shot feature to a codebase with ChatGPT 5.3 Codex in Cursor and it worked out of the box but then I realised everything it had done was super weird and it didn't work under a load of edge cases. I've tried being super clear about how to fix it but the model is lost. This was not a complex feature at all so hopefully I'm employed for a few more years yet.

HarHarVeryFunny · 2026-03-01T18:40:06 1772390406

Presumably not, but the better approach is anyways to first plan using a powerful/expensive model like Opus, then you can use something less capable and cheaper for the coding part. This would be true even if you just want to use Anthropic models, but makes even more sense if you want to use something cheaper like Qwen3.5 or Kimi K2.5 for the coding part.

__mharrison__ · 2026-02-28T23:41:44 1772322104

I used the 35b model to create a polars implementation of PCA (no sklearn or imports other than math and polars). In less than 10 minutes I had the code. This is impressive to me considering how poorly all models were with polars until very recently. (They always hallucinated pandas code.)

pram · 2026-03-01T08:55:45 1772355345

I decided to try Qwen3.5 122B in LM Studio with Opencode and I am impressed. It's not super slow (M4 Max/128GB) and it's pretty close to how Claude Code feels. Getting pretty good code analysis, definitely feels Sonnet-esque. I'm hyped completely local alternatives are getting so good.

jjcm · 2026-03-01T00:55:25 1772326525

Getting better, but definitely not there yet, nor near Sonnet 4.5 performance.

What these open models are great for are for narrow, constrained domains, with good input/output examples. I typically use them for things like prompt expansion, sentiment analysis, reformatting or re-arranging flow of code.

What I found they have trouble with is going from ambiguous description -> solved problem. Qwen 3.5 is certainly the best of the OSS models I've found (beating out GPT 120b OSS which was the previous king), and it's just starting to demonstrate true intelligence in unbound situations, but it isn't quite there yet. I have a RTX 6000 pro, so Qwen 3.5 is free for me to run, but I tend to default to Composer 1.5 if I want to be cheap.

The trend however is super encouraging. I bought my vid card with the full expectation that we'll have a locally running GPT 5.2 equiv by EoY, and I think we're on track.

oscord · 2026-02-28T23:55:37 1772322937

SWE chart is missing Claude on front page, interesting way to present your data. Mix and match at will. Grown up people showing public school level sneakiness. That fact alone disqualifies your LLM. Business/marketing leaders usually are brighter than average developers... so there.

solarkraft · 2026-02-28T22:02:22 1772316142

What are the recommended 4 bit quants for the 35B model? I don’t see official ones: https://huggingface.co/models?other=base_model:quantized:Qwe...

Edit: The unsloth quants seem to have been fixed, so they are probably the go-to again: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

nu11ptr · 2026-02-28T23:10:32 1772320232

Thinking about getting a new MBP M5 Max 128GB (assuming they are released next week). I know "future proofing" at this stage is near impossible, but for writing Rust code locally (likely using Qwen 3.5 for now on MLX), the AIs have convinced me this is probably my best choice for immediate use with some level of longevity, while retaining portability (not strictly needed, but nice to have). Alternatively I was considering RTX options or a Mac Studio, but was leaning towards Apple for the unified memory. What does HN think?

cmenge · 2026-03-01T00:48:39 1772326119

I've been mulling the same, but decided against (for now)

Using Claude Code Max 20 so ROI would be maybe 2+ years.

CC gives me unlimited coding in 4-6 windows in parallel. Unsure if any model would beat (or even match) that, both in terms of quality and speed.

I wouldn't gamble on that now. With a subscription, I can change any time. With the machine, you risk that this great insane model comes out but you need 138GB and then you'll pay for both.
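
The payback estimate above is just a division, sketched here so the inputs are explicit. The function name and the example prices are hypothetical placeholders, not quotes for any real machine or plan, and the sketch ignores resale value and any difference in output quality.

```python
def payback_months(hardware_cost, monthly_subscription, monthly_power_cost=0.0):
    """Months of subscription fees it takes to equal the up-front hardware cost."""
    monthly_saving = monthly_subscription - monthly_power_cost
    if monthly_saving <= 0:
        # Running the machine costs as much as the subscription: it never pays back.
        return float("inf")
    return hardware_cost / monthly_saving

if __name__ == "__main__":
    # Hypothetical figures: a 5000 unit machine vs a 200 per month plan.
    print(payback_months(5000, 200))
```

Under those made-up numbers the machine takes a little over two years to pay for itself, before counting the risk that a better model needs more memory than it has.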

nu11ptr · 2026-03-01T14:32:28 1772375548

We are on the same wavelength. I'm thinking maybe a pass for now.

pamcake · 2026-03-01T01:49:23 1772329763

> What does HN think?

Thermals. Your workloads will be throttled hard once it inevitably runs hot. See comments elsewhere in thread about why running LLMs on laptops like the MBP is underwhelming. The same chips in even a studio form factor would perform much better.

nl · 2026-03-01T03:13:26 1772334806

Strix Halo machines are a good option too if you are at all price sensitive. AMD (with all the downsides of that for AI work) but people are getting decent performance from them.

Also Nvidia Spark.

shell0x · 2026-03-01T03:07:54 1772334474

I have a Mac Studio with 128GB and a M4 Max and I'd recommend it. The power usage is also pretty good, but you may not care if you live somewhere where energy is cheap.

nu11ptr · 2026-03-01T14:33:20 1772375600

Have you used this for Rust coding by chance? I'm curious how it compares to Opus 4.6. I realize it isn't going to think to the same level, but curious how code quality is for a more straightforward task.

shell0x · 2026-03-01T03:05:59 1772334359

Can't wait to try that out locally. Keen to reduce my dependence on American products and services.

xmddmx · 2026-03-01T00:13:14 1772323994

Ollama users: there are notable bugs with ollama and Qwen3.5 so don't let your first impression be the last.

Theory is that some of the model parameters aren't set properly and this encourages endless looping behavior when run under ollama:

https://github.com/ollama/ollama/issues?q=is%3Aissue%20state... (a bunch of them)

sunkeeh · 2026-02-28T22:31:40 1772317900

Qwen3.5-122B-A10B BF16 GGUF = 224GB. The "80Gb VRAM" mentioned here will barely fit Q4_K_S (70GB), which will NOT perform as shown on benchmarks.

Quite misleading, really.
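
The "barely fits" point can be checked with two lines of arithmetic. This sketch uses only the figures quoted in the comment above (a 70 GB Q4_K_S file, a 122B parameter model, 80 GB of VRAM); the helper names are hypothetical, and how much headroom a given context length actually needs depends on the runtime and the model's attention layout, which is not modelled here.

```python
def bits_per_weight(file_size_gb, n_params_billion):
    """Average bits stored per parameter, from file size (decimal GB) and parameter count."""
    return file_size_gb * 8 / n_params_billion

def headroom_gb(vram_gb, weights_gb):
    """Memory left over for the KV cache and runtime buffers after loading the weights."""
    return vram_gb - weights_gb

if __name__ == "__main__":
    # Figures quoted in the comment: 70 GB file, 122B parameters, 80 GB of VRAM.
    print(round(bits_per_weight(70, 122), 2))
    print(headroom_gb(80, 70))
```

Whatever is left after the weights has to hold the context, so a quant that technically loads may still leave too little room for long prompts.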

CamperBob2 · 2026-03-01T00:06:42 1772323602

The larger 3.5 quants are actually pretty close to the full-blown 397B model's performance, at least looking at the numbers. Qwen 3.5 seems more tolerant of quantization than most.

syntaxing · 2026-03-01T01:26:51 1772328411

A big part that a lot of local users forget is inference is hard. Maybe you have the wrong temperature. Maybe you have the wrong min P. Maybe you have the wrong template. Maybe the implementation in llama.cpp has a bug. Maybe Q4 or even Q8 just won’t compare to BF16. Reality is, there are so many knobs to LLM inferencing and any can make the experience worse. It’s not always the model’s fault.
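
For anyone unsure what two of those knobs do, here is a small sketch of temperature scaling and min-p filtering over a toy set of logits. The function names are made up, and the order in which a real runtime applies its samplers is configurable and differs between implementations, so this shows the individual operations, not any particular engine's pipeline.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_filter(probs, min_p):
    """Drop tokens below min_p times the top token's probability, then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

if __name__ == "__main__":
    probs = softmax_with_temperature([2.0, 1.0, -3.0], temperature=1.0)
    print(min_p_filter(probs, min_p=0.05))
```

A temperature that is too high flattens the distribution and a min-p that is too low lets the unlikely tail through, which is one way a good model ends up looking bad.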

mark_l_watson · 2026-02-28T21:50:15 1772315415

The new 35b model is great. That said, it has slight incompatibilities with Claude Code. It is very good for tool use.

johnnyApplePRNG · 2026-02-28T22:13:17 1772316797

Claude code is designed for anthropic models. Try it with opencode!

mark_l_watson · 2026-03-01T01:37:01 1772329021

I will, right now.

EDIT: opencode was a bit slow with qwen3.5:35b using Ollama. Faster/nicer to use with Liquid lfm2:latest

johnnyApplePRNG · 2026-03-01T02:44:18 1772333058

Try llama.cpp - it usually excels with these MoE models imho.

kristianpaul · 2026-02-28T22:23:19 1772317399

Or Pi

copperx · 2026-02-28T22:25:11 1772317511

Or Oh My Pi

stavros · 2026-02-28T23:11:26 1772320286

Have you tried the 122B one?

erelong · 2026-02-28T21:51:18 1772315478

What kind of hardware does HN recommend or like to run these models?

xienze · 2026-02-28T21:55:30 1772315730

It's less than you'd think. I'm using the 35B-A3B model on an A5000, which is something like a slightly faster 3080 with 24GB VRAM. I'm able to fit the entire Q4 model in memory with 128K context (and I think I would probably be able to do 256K since I still have like 4GB of VRAM free). The prompt processing is something like 1K tokens/second and generates around 100 tokens/second. Plenty fast for agentic use via Opencode.

rahimnathwani · 2026-02-28T22:06:21 1772316381

There seem to be a lot of different Q4s of this model: https://www.reddit.com/r/LocalLLaMA/s/kHUnFWZXom

I'm curious which one you're using.

suprjami · 2026-02-28T22:11:35 1772316695

Unsloth Dynamic. Don't bother with anything else.

rahimnathwani · 2026-03-01T15:34:36 1772379276

For anyone else trying to run this on a Mac with 32GB unified RAM, this is what worked for me:

First, make sure enough memory is allocated to the gpu:

  sudo sysctl -w iogpu.wired_limit_mb=24000

Then run llama.cpp but reduce RAM needs by limiting the context window and turning off vision support. (And turn off reasoning for now as it's not needed for simple queries.)

  llama-server \
    -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL \
    --jinja \
    --no-mmproj \
    --no-warmup \
    -np 1 \
    -c 8192 \
    -b 512 \
    --chat-template-kwargs '{"enable_thinking": false}'

You can also enable/disable thinking on a per-request basis:

  curl 'http://localhost:8080/v1/chat/completions' \
  --data-raw '{"messages":[{"role":"user","content":"hello"}],"stream":false,"return_progress":false,"reasoning_format":"auto","temperature":0.8,"max_tokens":-1,"dynatemp_range":0,"dynatemp_exponent":1,"top_k":40,"top_p":0.95,"min_p":0.05,"xtc_probability":0,"xtc_threshold":0.1,"typ_p":1,"repeat_last_n":64,"repeat_penalty":1,"presence_penalty":0,"frequency_penalty":0,"dry_multiplier":0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":-1,"samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"chat_template_kwargs": { "enable_thinking": true }}'|jq .

If anyone has any better suggestions, please comment :)
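
The same per-request toggle can be driven from a script. This is a minimal sketch using only the Python standard library; it assumes the server from the command above is listening on localhost:8080 and reuses just the `messages`, `stream` and `chat_template_kwargs` fields from the curl call, leaving every sampler setting at the server's defaults. The function names are made up for illustration.

```python
import json
import urllib.request

def build_chat_request(prompt, enable_thinking, base_url="http://localhost:8080"):
    """Build a POST request for llama-server's OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        # Per-request override of the template flag set on the command line.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt, enable_thinking=False):
    """Send the request and return the reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, enable_thinking)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("hello")` gives a quick answer with reasoning off, and `ask("hello", enable_thinking=True)` turns it back on for that one request.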

suprjami · 2026-03-02T04:59:03 1772427543

Shouldn't you be using MLX because it's optimised for Apple Silicon?

Many user benchmarks report up to 30% better memory usage and up to 50% higher token generation speed:

https://reddit.com/r/LocalLLaMA/comments/1fz6z79/lm_studio_s...

As the post says, LM Studio has an MLX backend which makes it easy to use.

If you still want to stick with llama-server and GGUF, look at llama-swap which allows you to run one frontend which provides a list of models and dynamically starts a llama-server process with the right model:

https://github.com/mostlygeek/llama-swap

(actually you could run any OpenAI-compatible server process with llama-swap)

rahimnathwani · 2026-03-02T05:04:44 1772427884

I didn't know about llama-swap until yesterday. Apparently you can set it up such that it gives different 'model' choices which are the same model with different parameters. So, e.g. you can have 'thinking high', 'thinking medium' and 'no reasoning' versions of the same model, but only one copy of the model weights would be loaded into llama server's RAM.

Regarding mlx, I haven't tried it with this model. Does it work with unsloth dynamic quantization? I looked at mlx-community and found this one, but I'm not sure how it was quantized. The weights are about the same size as unsloth's 4-bit XL model: https://huggingface.co/mlx-community/Qwen3.5-35B-A3B-4bit/tr...

suprjami · 2026-03-02T08:40:51 1772440851

Yes that's right. The config is described by the developer here:

https://www.reddit.com/r/LocalLLaMA/comments/1rhohqk/comment...

And is in the sample config too:

https://github.com/mostlygeek/llama-swap/blob/main/config.ex...

iiuc MLX quants are not GGUFs for llama.cpp. They are a different file format which you use with the MLX inference server. LM Studio abstracts all that away so you can just pick an MLX quant and it does all the hard work for you. I don't have a Mac so I have not looked into this in detail.

BoredomIsFun · 2026-03-01T10:21:50 1772360510

FYI UD quants of 3.5-35BA3B are broken, use bartowski or AesSedai ones.

regularfry · 2026-03-01T14:15:42 1772374542

They've uploaded the fix. If those are still broken something bad has happened.

rahimnathwani · 2026-02-28T22:36:08 1772318168

UD-Q4_K_XL?

msuniverse2026 · 2026-02-28T22:13:41 1772316821

I've had an AMD card for the last 5 years, so I kinda just tuned out of local LLM releases because AMD seemed to abandon rocm for my card (6900xt) - Is AMD capable of anything these days?

pja · 2026-02-28T23:07:41 1772320061

> I've had an AMD card for the last 5 years, so I kinda just tuned out of local LLM releases because AMD seemed to abandon rocm for my card (6900xt) - Is AMD capable of anything these days?

Sure. Llama.cpp will happily run these kinds of LLMs using either HIP or Vulkan.

Vulkan is easier to get going using the Mesa OSS drivers under Linux, HIP might give you slightly better performance.

wirybeige · 2026-02-28T22:32:46 1772317966

The vulkan backend for llama.cpp isn't that far behind rocm for pp and tp speeds

mmis1000 · 2026-03-03T12:58:08 1772542688

I think AMD just added ROCm support for RDNA2 recently? I can run torch and aisudio with it just fine.

They also finally fixed building all the AI-related stuff on Windows, so you are no longer limited to Linux for these.

suprjami · 2026-02-28T22:01:50 1772316110

The cheapest option is two 3060 12G cards. You'll be able to fit the Q4 of the 27B or 35B with an okay context window.

If you want to spend twice as much for more speed, get a 3090/4090/5090.

If you want long context, get two of them.

If you have enough spare cash to buy a car, get an RTX Ada with 96G VRAM.
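
For a rough sanity check on what fits where, here is a back-of-the-envelope sketch. It only counts the weights, and the 4.5 bits per weight figure is my assumption for a typical "Q4"-class quant, not a number from any model card. KV cache and runtime overhead come on top, so treat the result as a lower bound.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billion: float, bits_per_weight: float, vram_gib: float) -> bool:
    """True if the weights alone fit in the given VRAM (no KV cache counted)."""
    return weight_gib(params_billion, bits_per_weight) < vram_gib


if __name__ == "__main__":
    ASSUMED_Q4_BITS = 4.5  # assumption: average bits per weight for a Q4-class quant
    for name, params in [("27B", 27), ("35B", 35)]:
        size = weight_gib(params, ASSUMED_Q4_BITS)
        print(f"{name}: ~{size:.1f} GiB of weights, "
              f"fits in 2x12 GiB: {fits(params, ASSUMED_Q4_BITS, 24)}")
```

Whatever is left after the weights is what you get for context, which is why the two-card setup only gives an "okay" window.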

barrkel · 2026-02-28T22:34:33 1772318073

RTX 6000 Pro Blackwell, not Ada, for 96 GB.

suprjami · 2026-03-01T13:17:22 1772371042

Ah thanks.

The names are so good and not repetitious.

No not the RTX 6000. No not the A6000...

chr15m · 2026-03-01T03:36:12 1772336172

Thanks, this is a great summary of the tradeoffs!

dajonker · 2026-02-28T22:13:14 1772316794

Radeon R9700 with 32 GB VRAM is relatively affordable for the amount of RAM and with llama.cpp it runs fast enough for most things. These are workstation cards with blower fans and they are LOUD. Otherwise if you have the money to burn get a 5090 for speeeed and relatively low noise, especially if you limit power usage.

cyberax · 2026-03-01T01:20:34 1772328034

I have a pair of Radeon AI PRO R9700 with 32 GB, and so far they have been a pleasure to use. Drivers work out-of-the-box, and they are completely quiet when unused. They are capped at 300W power, so even at 100% utilization they are not too loud.

I was thinking about adding after-market liquid cooling for them, but they're fine without it.

rubiquity · 2026-03-01T07:03:03 1772348583

This is great to hear! Out of curiosity, which brand did you go with? I tend to stick to Sapphire but the prices are within $200 of each other.

cyberax · 2026-03-02T02:13:11 1772417591

I got Sapphires because they were the ones available at the time of purchase :)

CamperBob2 · 2026-02-28T22:24:29 1772317469

I think the 27B dense model at full precision and 122B MoE at 4- or 6-bit quantization are legitimate killer apps for the 96 GB RTX 6000 Pro Blackwell, if the budget supports it.

I imagine any 24 GB card can run the lower quants at a reasonable rate, though, and those are still very good models.

Big fan of Qwen 3.5. It actually delivers on some of the hype that the previous wave of open models never lived up to.

MarsIronPI · 2026-02-28T22:34:07 1772318047

I've had good experience with GLM-4.7 and GLM-5.0. How would you compare them with Qwen 3.5? (If you have any experience with them.)

CamperBob2 · 2026-02-28T23:48:08 1772322488

No experience with 5 and not much with 4.7, but they both have quite a few advocates over on /r/localllama.

Unsloth's GLM-4.7-Flash-BF16.gguf is quite fast on the 6000, at around 100 t/s, but definitely not as smart as the Qwen 3.5 MoE or dense models of similar size. As far as I'm concerned Qwen 3.5 renders most other open models short of perhaps Kimi 2.5 obsolete for general queries, although other models are still said to be better for local agentic use. That, I haven't tried.

andsoitis · 2026-02-28T22:20:28 1772317228

For fast inference, you’d be hard pressed to beat an Nvidia RTX 5090 GPU.

Check out the HP Omen 45L Max: https://www.hp.com/us-en/shop/pdp/omen-max-45l-gaming-dt-gt2...

laweijfmvo · 2026-02-28T22:37:10 1772318230

I never would have guessed that in 2026, data centers would be measured in Watts and desktop PCs measured in liters.

andsoitis · 2026-02-28T23:41:45 1772322105

The Omen was nigh.

zozbot234 · 2026-02-28T22:25:23 1772317523

It depends. How long are you willing to wait for an answer? Also, how far are you willing to push quantization, given the risk of degraded answers at more extreme quantization levels?

throwdbaaway · 2026-03-01T03:47:23 1772336843

For 27B, just get a used 3090 and hop on to r/LocalLLaMA. You can run a 4bpw quant at full context with Q8 KV cache.
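
To see why the Q8 KV cache matters, here is a rough sketch of the usual KV cache size formula for a plain transformer with full attention on every layer. The layer count, KV head count, head dimension and context length below are hypothetical placeholders, not the real config of any Qwen model, and Q8 is approximated as 1 byte per element (the real format carries a bit of extra scale data). Models that mix in other attention types will need less.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_element: float) -> float:
    """Approximate KV cache size in GiB for full attention on every layer."""
    # Factor of 2: one K tensor and one V tensor are stored per layer.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_element)
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(n_layers=64, n_kv_heads=8, head_dim=128, context_tokens=131072)
    print(f"FP16 cache: {kv_cache_gib(**shape, bytes_per_element=2):.1f} GiB")
    print(f"Q8 cache (approx.): {kv_cache_gib(**shape, bytes_per_element=1):.1f} GiB")
```

With this formula, halving the bytes per element halves the cache, which can be the difference between full context fitting next to the weights on a 24 GB card or not.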

elorant · 2026-02-28T22:25:32 1772317532

Macs or a Strix Halo, unless you're willing to go below 8-bit quantization, in which case any GPU with 24 GB of VRAM would probably run it.

plastic3169 · 2026-03-01T07:44:29 1772351069

Anyone have recommendations on EU services where one could run open models before buying expensive hardware?

zos_kia · 2026-03-01T08:12:17 1772352737

Koyeb (recently acquired by Mistral if I'm not mistaken) have GPUs you can rent by the minute and they also have one-click deploy of some open models.