Hacker News

Because waiting forever for initial prompt processing, with a realistic number of MCP tools enabled on a prompt, is going to suck without as much bandwidth as possible.

And you are never going to sit around waiting for anything larger than the 96+ GB of RAM that the RTX Pro has.

If you’re using it for background tasks and not coding, it’s a different story.



If the MCP tools come first in the conversation, it should be technically possible to cache the activations, so you do not have to recompute them each time.
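The idea is ordinary prefix (KV) caching: because attention in a causal model only looks backwards, the activations for a fixed leading span of tokens never change, so they can be computed once and reused. A toy sketch of the bookkeeping (the `encode_tokens` function here is a hypothetical stand-in for the real per-token transformer work, not any actual library API):

```python
# Toy sketch of prefix KV caching. If the MCP tool definitions form a
# stable prefix of every prompt, their activations can be computed once
# and reused; only the per-request suffix is processed each time.

CALLS = {"tokens_encoded": 0}  # instrumentation to show the savings

def encode_tokens(tokens: tuple) -> list:
    # Stand-in for the expensive attention/MLP work; one "KV entry" per token.
    CALLS["tokens_encoded"] += len(tokens)
    return [hash(t) for t in tokens]

class PrefixCachedModel:
    def __init__(self):
        self._cache = {}  # prefix tokens -> precomputed KV entries

    def process(self, prefix: tuple, suffix: tuple) -> list:
        if prefix not in self._cache:
            self._cache[prefix] = encode_tokens(prefix)  # paid only once
        # Only the suffix (the user's actual request) is processed now.
        return self._cache[prefix] + encode_tokens(suffix)

model = PrefixCachedModel()
tools = ("tool_def_a", "tool_def_b")   # fixed MCP tool prefix
out1 = model.process(tools, ("question_1",))
out2 = model.process(tools, ("question_2",))
```

After both calls, only 4 tokens were encoded (2 prefix tokens once, plus 1 suffix token per request) instead of 6, and the cached prefix entries are byte-identical across requests.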


> And you are never going to sit around waiting for anything larger than the 96+ GB of RAM that the RTX Pro has.

Am I the only person that gives aider instructions and leaves it alone for a few hours? This doesn't seem that difficult to integrate into my workflow.


> Am I the only person that gives aider instructions and leaves it alone for a few hours?

Probably not, but in my experience, if it takes longer than 10-15 minutes it's either stuck in a loop or down the wrong rabbit hole. I don't use it for vibe coding or anything "big scope" like that, though, just more focused changes/refactors, so YMMV.



The M3 Ultra GPU is around a 3070-3080 for initial token processing. Not great, not terrible.


Initial prompt processing with a large static context (system prompt + tools + whatever) could technically be improved by checkpointing the model state and reusing it for future prompts. Not sure if any tools support this.


Dropping in late into this discussion, but is there any way to "comfortably" use multiple precomputed kv-caches with current models, in the style of this work: https://arxiv.org/abs/2212.10947 ?

Meaning, I pre-parse multiple documents, and the prompt and completion attention sees all of them, but there is no attention between the documents (they are all encoded in the same overlapping positions).

This way you can include a basically unlimited amount of data in the prompt, paying for it with performance.
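The "no attention between documents" part of that paper amounts to a block-diagonal attention mask: each document attends causally within itself, while the prompt/completion attends to everything. A minimal NumPy sketch of such a mask (positions are simplified here; in the actual parallel-context scheme the documents would also share overlapping position IDs, which this sketch does not model):

```python
import numpy as np

def parallel_context_mask(doc_lens, prompt_len):
    """Attention mask for parallel context windows.

    Each document attends only within itself (causally); the trailing
    prompt attends to all documents and causally to itself.
    1 = may attend, 0 = masked.
    """
    total = sum(doc_lens) + prompt_len
    mask = np.zeros((total, total), dtype=int)
    start = 0
    for n in doc_lens:
        # causal block for this document; no cross-document entries
        mask[start:start + n, start:start + n] = np.tril(np.ones((n, n), dtype=int))
        start += n
    p0 = start
    mask[p0:, :p0] = 1  # prompt sees every document token
    mask[p0:, p0:] = np.tril(np.ones((prompt_len, prompt_len), dtype=int))
    return mask

# Two documents of 2 tokens each, followed by a 2-token prompt.
m = parallel_context_mask([2, 2], 2)
```

Rows 0-1 and 2-3 are independent causal blocks (so the documents' KV caches can be precomputed separately), while rows 4-5 attend across all of them.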


You are correct that inference speed per $ is not optimized with this purchase.

What is optimized is the ability to fine-tune medium-size models (~200 GB) / $.

You just can't get 500 GB of VRAM for less than $100k. Even with $9k Blackwell cards, you still have $10k in a barebones GPU server. You can't use commodity hardware and cluster it because you need fast interconnects. I'm talking 200-400 GB/s interconnects. And those take yet another PCIe slot and require expensive InfiniBand switches.

Shit gets costly fast. I agonized over this purchase for weeks, eventually deciding that it's the easiest path to success for my purposes. Not for everyone's, but for mine.



