Just ordered a $12k mac studio w/ 512GB of integrated RAM.
Can't wait for it to arrive and crank up LM Studio. It's literally the first install. I'm going to download it with Safari.
LM Studio is newish, and it's not a perfect interface yet, but it's fantastic at what it does, which is bringing local LLMs to the masses w/o them having to know much.
Exo is this radically cool tool that automatically clusters all hosts on your network running Exo and uses their combined GPUs for increased throughput.
Like in HPC environments, you're going to need ultra-fast interconnects, but here it's all just IP-based.
Because waiting forever for initial prompt processing, with a realistic number of MCP tools enabled on a prompt, is going to suck without the most bandwidth possible.
And you are never going to sit around waiting for anything larger than the 96+ GB of RAM that the RTX Pro has.
If you’re using it for background tasks and not coding it’s a different story
If the MCP tools come first in the conversation, it should be technically possible to cache the activations so you do not have to recompute them each time.
> And you are never going to sit around waiting for anything larger than the 96+ GB of RAM that the RTX Pro has.
Am I the only person that gives aider instructions and leaves it alone for a few hours? This doesn't seem that difficult to integrate into my workflow.
> Am I the only person that gives aider instructions and leaves it alone for a few hours?
Probably not, but in my experience, if it takes longer than 10-15 minutes it's either stuck in a loop or down the wrong rabbit hole. That said, I don't use it for vibe coding or anything "big scope" like that, just more focused changes/refactors, so YMMV.
Initial prompt processing with a large static context (system prompt + tools + whatever) could technically be improved by checkpointing the model state and reusing for future prompts. Not sure if any tools support this.
Dropping in late to this discussion, but is there any way to "comfortably" use multiple precomputed kv-caches with current models, in the style of this work: https://arxiv.org/abs/2212.10947 ?
Meaning, I pre-parse multiple documents, and the prompt and completion attend to all of them, but there is no attention between the documents (they are all encoded in the same overlapping positions).
This way you can include a basically unlimited amount of data in the prompt, paying for it with performance.
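A toy version of that masking scheme (pure Python, just to illustrate which tokens may attend to which; a real implementation would also reuse the same position ids across documents):

```python
# Build an attention mask for "parallel context windows": N documents
# plus a query. Each document attends causally only within itself; the
# query attends to every document and causally within itself. Purely
# illustrative; layout and positions are simplified.
def parallel_mask(doc_lens, query_len):
    n_doc = sum(doc_lens)
    total = n_doc + query_len
    mask = [[False] * total for _ in range(total)]
    start = 0
    for length in doc_lens:
        for i in range(start, start + length):
            for j in range(start, i + 1):   # causal, within this doc only
                mask[i][j] = True
        start += length
    for i in range(n_doc, total):           # query tokens...
        for j in range(n_doc):              # ...see all documents
            mask[i][j] = True
        for j in range(n_doc, i + 1):       # ...and are causal over the query
            mask[i][j] = True
    return mask

m = parallel_mask([2, 2], 1)
assert not m[2][0]          # second doc cannot see the first doc
assert m[4][0] and m[4][2]  # the query token sees both docs
```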
You are correct that inference speed per $ is not optimized with this purchase.
What is optimized is the ability to fine-tune medium-size models (~200GB) per dollar.
You just can't get 500GB of VRAM for less than $100k. Even with $9k Blackwell cards, you're at $10k for a barebones GPU server before the cards go in. You can't use commodity hardware and cluster it, because you need fast interconnects; I'm talking 200-400GB/s. And those take yet another PCIe slot and require expensive InfiniBand switches.
Shit gets costly fast. I agonized over this purchase for weeks, eventually deciding that it's the easiest path to success for my purposes. Not for everyone's, but for mine.
I can run full DeepSeek R1 on an M1 Max with 64GB of RAM. Around 0.5 t/s with a small quant. A Q4 quant of Maverick (253 GB) runs at 2.3 t/s on it (no GPU offload).
Practically, a last-gen or even ES/QS EPYC or Xeon (with AMX), enough RAM to fill all 8 or 12 channels, plus fast storage (4 Gen5 NVMe drives are almost 60 GB/s) looks, on paper at least, like the cheapest way to run these huge MoE models at hobbyist speeds.
If you're talking about DeepSeek R1 with llama.cpp and mmap, then at this point you can run DeepSeek R1 on a Raspberry Pi Zero with a 256GB micro SD card and a phone charger. The only metric left to know is one's patience.
RTX is nice, but it's memory-limited and requires a full desktop machine to run in. I'd take slower inference (as long as it's not less than 15 tk/s) for more memory any day!
I'd love to see more very-large-memory Mac Studio benchmarks for prompt processing and inference. The few benchmarks I've seen either failed to take prompt processing into account, didn't share the exact weights+setup used, or showed really abysmal performance.
If the primary use case is input heavy, which is true of agentic tools, there’s a world where partial GPU offload with many channels of DDR5 system RAM leads to an overall better experience. A good GPU will process input many times faster, and with good RAM you might end up with decent output speed still. Seems like that would come in close to $12k?
And there would be no competition for models that do fit entirely inside that VRAM, for example Qwen3 32B.
RTX Pro 6000 can't do DeepSeek R1 671B Q4; you'd need 5-6 of them, which makes it way more expensive. Moreover, a Mac Studio will do it at 150W whereas Pro 6000s would start at 1500W.
> Moreover, a Mac Studio will do it at 150W whereas Pro 6000s would start at 1500W.
No, Pro 6000 pulls max 600W, not sure where you get 1500W from, that's more than double the specification.
Besides, what are the tokens/second (or seconds/token) and prompt processing speed for running DeepSeek R1 671B on a Mac Studio with Q4? Curious about those numbers, because I have a feeling they're very far apart.
I'm using it on MacBook Air M1 / 8 GB RAM with Qwen3-4B to generate summaries and tags for my vibe-coded Bloomberg Terminal-style RSS reader :-) It works fine (the laptop gets hot and slow, but fine).
Probably should just use llama.cpp server/ollama and not waste a gig of memory on Electron, but I like GUIs.
8 GB of RAM is iffy for local LLMs in general: an 8-bit quantized Qwen3-4B is 4.2GB on disk and likely more in memory. 16 GB is usually the minimum to run decent models without compromising via heavy quantization.
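As rough sizing arithmetic (illustrative only; real GGUF files add metadata, and KV cache plus runtime overhead come on top of the weights):

```python
# Back-of-envelope weight memory for a quantized model:
# bytes ≈ parameter_count * bits_per_weight / 8.
def weight_gb(params_billions, bits):
    return params_billions * 1e9 * bits / 8 / 1e9  # decimal GB

assert weight_gb(4, 8) == 4.0   # a 4B model at 8-bit: ~4 GB of weights
assert weight_gb(4, 4) == 2.0   # the same model at 4-bit: ~2 GB
```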
I've heard good things about how macOS handles memory relative to other operating systems. But Linux and Windows both have memory compression nowadays. So the claim is then not that memory compression makes your RAM twice as effective, but that macOS' memory compression is twice as good as the real and existing memory compression available on other operating systems.
It's 4-bit quantized (Q4_K_M, 2.5 GB) and still works well for this task. It's amazing. I've been running various small models on this 8 GB Air since the first Llama and GPT-J, and they improved so much!
macOS virtual memory does a good job of swapping stuff in and out to the SSD.
I'd love to host my own LLMs but I keep getting held back from the quality and affordability of Cloud LLMs. Why go local unless there's private data involved?
There are some use cases I use LLMs for where I don't care a lot about the data being private (although that's a plus) but I don't want to pay XXX€ for classifying some data and I particularly don't want to worry about having to pay that again if I want to redo it with some changes.
Using local LLMs for this I don't worry about the price at all, I can leave it doing three tries per "task" without tripling the cost if I wanted to.
It's true that there is an upfront cost but way easier to get over that hump than on-demand/per-token costs, at least for me.
Same. For 'sovereignty' reasons I will eventually move to local processing, but for now, in development/prototyping, the gap with hosted LLMs seems too wide.
VPN providers are first and foremost trust businesses. Why would you choose and pay for one that is not well established and trusted? Mine has been around for more than a decade now.
Alternatively, you could just set up your own (cheaper?) VPN relay on the tiniest VPS you can rent on AWS or IBM Cloud, right?
The VPN providers that get you over the GFW in China are Chinese, and China is not yet a high-trust society, just like how they'll take your payment for one year of gym fees and then disappear the next week (sigh). If AWS or IBM Cloud find out you are using them as a VPN to jump the GFW, they will ban you for life; Microsoft, IBM, and Amazon aren't interested in having their whole cloud added to the GFW block list. Many people have tried this (including Microsofties in China with free Azure credits) and they've all been dealt with harshly by the cloud providers.
The $3000 that an MBP M3 Max with 64GB of RAM costs might cover a round-trip business class ticket for a transpacific flight…if it is on sale (on a Chinese carrier, probably, with GFW internet).
Some of us don't have the most reliable ISPs or even network infrastructure, and I say that as someone who lives in Spain :) I live outside a huge metropolitan area and Vodafone fiber went down twice this year, not even counting the time the country's electricity grid was down for like 24 hours.
I did this a month ago and don't regret it one bit. I had a long laundry list of ML "stuff" I wanted to play with or questions to answer. There's no world in which I'm paying by the request, or token, or whatever, for hacking on fun projects. Keeping an eye on the meter is the opposite of having fun and I have absolutely nowhere I can put a loud, hot GPU (that probably has "gamer" lighting no less) in my fam's small apartment.
Right on. I also have a laundry list of ML things I want to do, starting with fine-tuning models.
I don't mind paying for models to do things like code. I like to move really fast when I'm coding. But for other things, I just didn't want to spend a week or two coming up to speed on the hardware needed to build a GPU system. You can just order a big GPU box, but it's going to cost you astronomically right now. Building a system with 4-5 PCIe 5.0 x16 slots, enough power, enough PCIe lanes... it's a lot to learn. You can't go on PCPartPicker and just hunt for a motherboard with 6 double-wide slots.
This is a machine to let me do some things with local models. My first goal is to run some quantized version of the new V3 model and try to use it for coding tasks.
I expect it will be slow for sure, but I just want to know what it's capable of.
I genuinely cannot wrap my head around spending this much money on hardware that is dramatically inferior to hardware that costs half the price. MacOS is not even great anymore, they stopped improving their UX like a decade ago.
If the rumors about splitting CPU/GPU in new Macs are true, your Mac Studio will be the last one capable of running DeepSeek R1 671B Q4. It looks like Apple had an accidental winner that will go away with the end of unified RAM.
Seems Apple is waking up to the fact that if it's too easy to run weights locally, there really isn't much sense to having their own remote inference endpoints, so time to stop the party :)
No, I think Apple has been clear from the beginning that they won't be able to do everything on the devices themselves; that's why they're building the infrastructure/software for their "cloud intelligence system" or whatever they call it.
Not OP, but with LM Studio I get a chat interface out-of-the-box for local models, while with openwebui I'd need to configure it to point to an OpenAI API-compatible server (like LM Studio). It can also help determine which models will work well with your hardware.
LM Studio isn't FOSS though.
I did enjoy hooking up OpenWebUI to Firefox's experimental AI Chatbot. (browser.ml.chat.hideLocalhost to false, browser.ml.chat.provider to localhost:${openwebui-port})
I recently tried OpenWebUI but it was so painful to get it to run with a local model.
That "first run experience" of LM Studio is pretty fire in comparison. Can't really talk about actually working with it yet though, still waiting for the 8GB download.
I think that's the thing: just running Ollama (fiddling around with terminals) is more complicated than the full E2E of chatting with LM Studio.
Of course, for folks used to terminals, daemons and so on it makes sense from the get go, but for others it seemingly doesn't, and it doesn't help that Ollama refuses to communicate what people should understand before trying to use it.
Yup, I'm spoiled by Claude 3.7 Sonnet right now. I had to stop using Opus for plan mode in my agent because it is just so expensive. I'm using Gemini 2.5 Pro for that now.
I was using Claude 3.7 exclusively for coding, but it sure seems like it got worse suddenly about 2–3 weeks back. It went from writing pretty solid code I had to make only minor changes to, to being completely off the rails: altering files unrelated to my prompt, undoing fixes from the same conversation, reinventing db access, and ignoring coding 'standards' established in the existing codebase. It became so untrustworthy I finally gave OpenAI o3 a try and, honestly, I was pretty surprised how solid it has been. I've been using o3 since, and I find it generally does exactly what I ask, especially if you have a well-established project with plenty of code for it to reference.
Just wondering if Claude 3.7 has seemed different lately for anyone else? It was my go-to for several months, and I'm no fan of OpenAI, but o3 has been rock solid.
Could be the prompt and/or tool descriptions in whatever tool you are using Claude in that degraded. Have definitely noticed variance across Cursor, Claude Code, etc even with the exact same models.
Cursor became awful over the last few weeks, so it's likely them; no idea what they did to their prompt, but it's just been incredibly poor at most tasks regardless of which model you pick.
There is another project that people should be aware of: https://github.com/exo-explore/exo