Just ordered a $12k mac studio w/ 512GB of integrated RAM.
Can't wait for it to arrive and crank up LM Studio. It's literally the first install. I'm going to download it with Safari.
LM Studio is newish, and it's not a perfect interface yet, but it's fantastic at what it does, which is bringing local LLMs to the masses w/o them having to know much.
Exo is this radically cool tool that automatically clusters all hosts on your network running Exo and uses their combined GPUs for increased throughput.
Like in HPC environments, you're going to need ultra-fast interconnects, but here it's all just IP-based.
Because waiting forever for initial prompt processing, with a realistic number of MCP tools enabled on a prompt, is going to suck without the most bandwidth possible.
And you are never going to sit around waiting for anything larger than the 96+ GB of RAM that the RTX Pro has.
If you’re using it for background tasks and not coding it’s a different story
If the MCP tools come first in the conversation, it should be technically possible to cache the activations so you do not have to recompute them each time.
> And you are never going to sit around waiting for anything larger than the 96+ GB of RAM that the RTX Pro has.
Am I the only person that gives aider instructions and leaves it alone for a few hours? This doesn't seem that difficult to integrate into my workflow.
> Am I the only person that gives aider instructions and leaves it alone for a few hours?
Probably not, but in my experience, if it takes longer than 10-15 minutes it's either stuck in a loop or down the wrong rabbit hole. That said, I don't use it for vibe coding or anything "big scope" like that, just more focused changes/refactors, so YMMV.
Initial prompt processing with a large static context (system prompt + tools + whatever) could technically be improved by checkpointing the model state and reusing for future prompts. Not sure if any tools support this.
Dropping in late to this discussion, but is there any way to "comfortably" use multiple precomputed kv-caches with current models, in the style of this work: https://arxiv.org/abs/2212.10947 ?
Meaning, I pre-parse multiple documents, and the prompt and completion attend to all of them, but there is no attention between the documents (they are all encoded in the same overlapping positions).
This way you can include a basically unlimited amount of data in the prompt, paying for it with performance.
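A toy version of that masking scheme (pure Python, just to illustrate which tokens may attend to which; a real implementation would also reuse the same position ids across documents):

```python
# Build an attention mask for "parallel context windows": N documents
# plus a query. Each document attends causally only within itself; the
# query attends to every document and causally within itself. Purely
# illustrative; layout and positions are simplified.
def parallel_mask(doc_lens, query_len):
    n_doc = sum(doc_lens)
    total = n_doc + query_len
    mask = [[False] * total for _ in range(total)]
    start = 0
    for length in doc_lens:
        for i in range(start, start + length):
            for j in range(start, i + 1):   # causal, within this doc only
                mask[i][j] = True
        start += length
    for i in range(n_doc, total):           # query tokens...
        for j in range(n_doc):              # ...see all documents
            mask[i][j] = True
        for j in range(n_doc, i + 1):       # ...and are causal over the query
            mask[i][j] = True
    return mask

m = parallel_mask([2, 2], 1)
assert not m[2][0]          # second doc cannot see the first doc
assert m[4][0] and m[4][2]  # the query token sees both docs
```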
You are correct that inference speed per $ is not optimized with this purchase.
What is optimized is the ability to fine-tune medium-size models (~200GB) per dollar.
You just can't get 500GB of VRAM for less than $100k. Even with $9k Blackwell cards, you're at $10k for a barebones GPU server before the cards go in. You can't use commodity hardware and cluster it, because you need fast interconnects; I'm talking 200-400GB/s. And those take yet another PCIe slot and require expensive InfiniBand switches.
Shit gets costly fast. I agonized over this purchase for weeks, eventually deciding that it's the easiest path to success for my purposes. Not for everyone's, but for mine.
I can run full DeepSeek R1 on an M1 Max with 64GB of RAM. Around 0.5 t/s with a small quant. A Q4 quant of Maverick (253 GB) runs at 2.3 t/s on it (no GPU offload).
Practically, a last-gen or even ES/QS EPYC or Xeon (with AMX), enough RAM to fill all 8 or 12 channels, plus fast storage (4 Gen5 NVMe drives are almost 60 GB/s) looks, on paper at least, like the cheapest way to run these huge MoE models at hobbyist speeds.
If you're talking about DeepSeek R1 with llama.cpp and mmap, then at this point you can run DeepSeek R1 on a Raspberry Pi Zero with a 256GB micro SD card and a phone charger. The only metric left to know is one's patience.
RTX is nice, but it's memory-limited and requires a full desktop machine to run in. I'd take slower inference (as long as it's not less than 15 tk/s) for more memory any day!
I'd love to see more very-large-memory Mac Studio benchmarks for prompt processing and inference. The few benchmarks I've seen either failed to take prompt processing into account, didn't share the exact weights+setup used, or showed really abysmal performance.
If the primary use case is input heavy, which is true of agentic tools, there’s a world where partial GPU offload with many channels of DDR5 system RAM leads to an overall better experience. A good GPU will process input many times faster, and with good RAM you might end up with decent output speed still. Seems like that would come in close to $12k?
And there would be no competition for models that do fit entirely inside that VRAM, for example Qwen3 32B.
RTX Pro 6000 can't do DeepSeek R1 671B Q4; you'd need 5-6 of them, which makes it way more expensive. Moreover, a Mac Studio will do it at 150W whereas Pro 6000s would start at 1500W.
> Moreover, a Mac Studio will do it at 150W whereas Pro 6000s would start at 1500W.
No, Pro 6000 pulls max 600W, not sure where you get 1500W from, that's more than double the specification.
Besides, what are the tokens/second (or seconds/token) and prompt processing speed for running DeepSeek R1 671B on a Mac Studio with Q4? Curious about those numbers, because I have a feeling they're very far apart.
I'm using it on MacBook Air M1 / 8 GB RAM with Qwen3-4B to generate summaries and tags for my vibe-coded Bloomberg Terminal-style RSS reader :-) It works fine (the laptop gets hot and slow, but fine).
Probably should just use llama.cpp server/ollama and not waste a gig of memory on Electron, but I like GUIs.
8 GB of RAM is iffy for local LLMs in general: an 8-bit quantized Qwen3-4B is 4.2GB on disk and likely more in memory. 16 GB is usually the minimum to run decent models without compromising via heavy quantization.
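As rough sizing arithmetic (illustrative only; real GGUF files add metadata, and KV cache plus runtime overhead come on top of the weights):

```python
# Back-of-envelope weight memory for a quantized model:
# bytes ≈ parameter_count * bits_per_weight / 8.
def weight_gb(params_billions, bits):
    return params_billions * 1e9 * bits / 8 / 1e9  # decimal GB

assert weight_gb(4, 8) == 4.0   # a 4B model at 8-bit: ~4 GB of weights
assert weight_gb(4, 4) == 2.0   # the same model at 4-bit: ~2 GB
```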
I've heard good things about how macOS handles memory relative to other operating systems. But Linux and Windows both have memory compression nowadays. So the claim is then not that memory compression makes your RAM twice as effective, but that macOS' memory compression is twice as good as the real and existing memory compression available on other operating systems.
It's 4-bit quantized (Q4_K_M, 2.5 GB) and still works well for this task. It's amazing. I've been running various small models on this 8 GB Air since the first Llama and GPT-J, and they improved so much!
macOS virtual memory does a good job of swapping stuff in and out to the SSD.
I'd love to host my own LLMs but I keep getting held back from the quality and affordability of Cloud LLMs. Why go local unless there's private data involved?
There are some use cases I use LLMs for where I don't care a lot about the data being private (although that's a plus) but I don't want to pay XXX€ for classifying some data and I particularly don't want to worry about having to pay that again if I want to redo it with some changes.
Using local LLMs for this I don't worry about the price at all, I can leave it doing three tries per "task" without tripling the cost if I wanted to.
It's true that there is an upfront cost but way easier to get over that hump than on-demand/per-token costs, at least for me.
Same. For 'sovereignty' reasons I will eventually move to local processing, but for now, in development/prototyping, the gap with hosted LLMs seems too wide.
VPN providers are first and foremost trust businesses. Why would you choose and pay for one that is not well established and trusted? Mine has been around for more than a decade now.
Alternatively, you could just set up your own (cheaper?) VPN relay on the tiniest VPS you can rent on AWS or IBM Cloud, right?
The VPN providers that get you over the GFW in China are Chinese, and China is not yet a high-trust society, just like how they'll take your payment for one year of gym fees and then disappear the next week (sigh). If AWS or IBM Cloud find out you are using them as a VPN to jump the GFW, they will ban you for life; Microsoft, IBM, and Amazon aren't interested in having their whole cloud added to the GFW block list. Many people have tried this (including Microsofties in China with free Azure credits) and they've all been dealt with harshly by the cloud providers.
The $3000 that an MBP M3 Max with 64GB of RAM costs might cover a round-trip business class ticket for a transpacific flight…if it is on sale (on a Chinese carrier, probably, with GFW internet).
Some of us don't have the most reliable ISPs or even network infrastructure, and I say that as someone who lives in Spain :) I live outside a huge metropolitan area and Vodafone fiber went down twice this year, not even counting the time the country's electricity grid was down for like 24 hours.
I did this a month ago and don't regret it one bit. I had a long laundry list of ML "stuff" I wanted to play with or questions to answer. There's no world in which I'm paying by the request, or token, or whatever, for hacking on fun projects. Keeping an eye on the meter is the opposite of having fun and I have absolutely nowhere I can put a loud, hot GPU (that probably has "gamer" lighting no less) in my fam's small apartment.
Right on. I also have a laundry list of ML things I want to do, starting with fine-tuning models.
I don't mind paying for models to do things like code. I like to move really fast when I'm coding. But for other things, I just didn't want to spend a week or two coming up to speed on the hardware needed to build a GPU system. You can just order a big GPU box, but it's going to cost you astronomically right now. Building a system with 4-5 PCIe 5.0 x16 slots, enough power, enough PCIe lanes... it's a lot to learn. You can't go on PCPartPicker and just hunt for a motherboard with 6 double-wide slots.
This is a machine to let me do some things with local models. My first goal is to run some quantized version of the new V3 model and try to use it for coding tasks.
I expect it will be slow for sure, but I just want to know what it's capable of.
I genuinely cannot wrap my head around spending this much money on hardware that is dramatically inferior to hardware that costs half the price. MacOS is not even great anymore, they stopped improving their UX like a decade ago.
If the rumors about splitting CPU/GPU in new Macs are true, your Mac Studio will be the last one capable of running DeepSeek R1 671B Q4. It looks like Apple had an accidental winner that will go away with the end of unified RAM.
Seems Apple is waking up to the fact that if it's too easy to run weights locally, there really isn't much sense to having their own remote inference endpoints, so time to stop the party :)
No, I think Apple has been clear from the beginning that they won't be able to do everything on the devices themselves; that's why they're building the infrastructure/software for their "cloud intelligence system" or whatever they call it.
Not OP, but with LM Studio I get a chat interface out-of-the-box for local models, while with openwebui I'd need to configure it to point to an OpenAI API-compatible server (like LM Studio). It can also help determine which models will work well with your hardware.
LM Studio isn't FOSS though.
I did enjoy hooking up OpenWebUI to Firefox's experimental AI Chatbot. (browser.ml.chat.hideLocalhost to false, browser.ml.chat.provider to localhost:${openwebui-port})
I recently tried OpenWebUI but it was so painful to get it to run with a local model.
That "first run experience" of LM Studio is pretty fire in comparison. Can't really talk about actually working with it yet though, still waiting for the 8GB download.
I think that's the thing: just running Ollama (fiddling around with terminals) is more complicated than the full E2E of chatting with LM Studio.
Of course, for folks used to terminals, daemons and so on it makes sense from the get go, but for others it seemingly doesn't, and it doesn't help that Ollama refuses to communicate what people should understand before trying to use it.
Yup, I'm spoiled by Claude 3.7 Sonnet right now. I had to stop using Opus for plan mode in my agent because it is just so expensive. I'm using Gemini 2.5 Pro for that now.
I was using Claude 3.7 exclusively for coding, but it sure seems like it got worse suddenly about 2–3 weeks back. It went from writing pretty solid code I had to make only minor changes to, to being completely off the rails: altering files unrelated to my prompt, undoing fixes from the same conversation, reinventing db access, and ignoring coding 'standards' established in the existing codebase. It became so untrustworthy I finally gave OpenAI o3 a try and, honestly, I was pretty surprised how solid it has been. I've been using o3 since, and I find it generally does exactly what I ask, especially if you have a well-established project with plenty of code for it to reference.
Just wondering if Claude 3.7 has seemed different lately for anyone else? It was my go-to for several months, and I'm no fan of OpenAI, but o3 has been rock solid.
Could be the prompt and/or tool descriptions in whatever tool you are using Claude in that degraded. Have definitely noticed variance across Cursor, Claude Code, etc even with the exact same models.
Cursor became awful over the last few weeks, so it's likely them; no idea what they did to their prompt, but it's just been incredibly poor at most tasks regardless of which model you pick.
There is another project that people should be aware of: https://github.com/exo-explore/exo