
For anyone running locally, could you please describe your hardware setup? CPU only? CPU+GPU(s)? How much memory? What type of CPU? Particularly interested in larger models (say >30b params).

For transparency, I work for an x86 motherboard manufacturer and the LLM-on-local-hw space is very interesting. If you're having trouble finding the right HW, would love to hear those pain points.



The most popular performant desktop LLM runtimes are pure GPU (ExLlama) or GPU + CPU (llama.cpp). People stuff the biggest models that will fit into the collective RAM + VRAM pool, up to ~48GB for Llama 70B. Sometimes users will split models across two 24GB CUDA GPUs.
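As a rough illustration of the "biggest model that fits" reasoning above, here is a minimal sketch that estimates weight memory from parameter count and quantization width. The function names and the fixed 4 GB overhead for context and buffers are assumptions for illustration, not figures from any runtime.

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return n_params_billion * bits_per_weight / 8


def fits_in_pool(n_params_billion: float, bits_per_weight: float,
                 vram_gb: float, ram_gb: float,
                 overhead_gb: float = 4.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in VRAM + RAM."""
    needed = weight_memory_gb(n_params_billion, bits_per_weight) + overhead_gb
    return needed <= vram_gb + ram_gb


# Compare a 70B model at a few quantization widths against two 24GB GPUs.
for bits in (16, 8, 4):
    size = weight_memory_gb(70, bits)
    ok = fits_in_pool(70, bits, vram_gb=48, ram_gb=0)
    print(f"70B at {bits}-bit: ~{size:.0f} GB of weights, fits in 48 GB: {ok}")
```

Real quantized files carry extra metadata and mixed-precision layers, so treat the result as a lower bound on what you need.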

Inference for me is bottlenecked by my GPU and CPU RAM bandwidth. TBH the biggest frustration is that OEMs like y'all can't double up VRAM like you could in the old days, or sell platforms with a beefy IGP, and that quad channel+ CPUs are too expensive.

Vulkan is a popular target that runtimes seem to be heading for. IGPs with access to lots of memory capacity + bandwidth will be very desirable, and I hear Intel/AMD are cooking up quad channel IGPs.

On the server side, everyone is running Nvidia boxes, I guess. But I had a dream about an affordable llama.cpp host: The cheapest Sapphire Rapids HBM SKUs, with no DIMM slots, on a tiny, dirt cheap motherboard you can pack into a rack like sardines. Llama.cpp is bottlenecked by bandwidth, and ~64GB is perfect.
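To make the bandwidth point concrete: a common rule of thumb is that generating one token requires reading roughly all the weights once, so memory bandwidth divided by model size gives an upper bound on tokens per second. The sketch below assumes that rule, a 64-bit (8-byte) bus per DDR channel, and an illustrative 35 GB model; all names are hypothetical.

```python
def ddr_bandwidth_gb_s(channels: int, mega_transfers_per_s: int,
                       bus_bytes: int = 8) -> float:
    """Theoretical peak DDR bandwidth in GB/s (assumes 64-bit channels)."""
    return channels * mega_transfers_per_s * bus_bytes / 1000


def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_s / model_size_gb


model_gb = 35.0  # assumed: a 70B model at about 4 bits per weight
for label, channels in (("dual channel", 2), ("quad channel", 4)):
    bw = ddr_bandwidth_gb_s(channels, 3200)
    print(f"DDR4-3200 {label}: {bw:.1f} GB/s, "
          f"at most {max_tokens_per_second(bw, model_gb):.1f} tokens/s")
```

Real throughput lands below this bound, but the estimate shows why more memory channels or HBM matter more than core count for this workload.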


If you're at a motherboard manufacturer I have some definite points for you to hear.

One is: there are essentially zero motherboards that space out x16 PCIe slots so that you can properly use more than two triple-slot GPUs. 3090s and 4090s are all triple-slot cards, but motherboards often put x16 slots two apart, with x8 or smaller slots in between. There may be a few that let you fit 2x cards, but none that would support 3x, I don't think, and definitely none that do 4x. Obviously that would result in a non-standard length motherboard (much taller). But in the ML world it would be appreciated, because it would be possible to build quad card systems without watercooling cards or using A5000/A6000 or other dual-slot, expensive datacenter cards.

And then, even for dual-slot cards like the A5000/A6000, again there are very few motherboards with the x16 slots spaced appropriately. The Supermicro H12SSL-i is about the only one that gets 4 x16 slots double-slot spaced appropriately and in a way that you could run 4 blower or WC'd cards and not overlap something else. And then, even when you do, you have the problem of the pin headers on the bottom of the motherboard interfering with the last card. That location of the pin headers is archaic and annoying and just needs to die.

Remember those mining-rig specialty motherboards, with all the wide-spaced PCIe slots for like 8x GPUs at once? We need that, but with x16-bandwidth slots. Those mining boards typically had only x1 bandwidth slots (even if they were x16 length) because for mining, bandwidth between cards and CPU isn't a problem, but for ML it is.

Sure, these won't fit the typical ATX case standards. But if you build it, they will come.


This would have to be a server board, or at least a HEDT board.

And yeah, as said below, 4x 4090s would trip most circuit breakers and require some contortions to power with a regular PSU. And it would be so expensive that you might as well buy 2x A6000s.
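A back-of-the-envelope check of the circuit breaker claim, as a sketch only. The assumptions are labeled in the code: 450 W per 4090 (its rated board power), 250 W for the rest of the system, 90% PSU efficiency, and an 80% derating for continuous load on a household circuit.

```python
def circuit_capacity_w(volts: float, amps: float,
                       continuous_derate: float = 0.8) -> float:
    """Usable continuous power on a circuit (assumed 80% derating)."""
    return volts * amps * continuous_derate


def wall_draw_w(n_gpus: int, gpu_w: float,
                rest_of_system_w: float = 250.0,
                psu_efficiency: float = 0.9) -> float:
    """Estimated draw at the wall; all defaults are illustrative guesses."""
    return (n_gpus * gpu_w + rest_of_system_w) / psu_efficiency


draw = wall_draw_w(4, 450.0)
for label, volts, amps in (("120 V / 15 A", 120, 15), ("240 V / 16 A", 240, 16)):
    cap = circuit_capacity_w(volts, amps)
    verdict = "over budget" if draw > cap else "within budget"
    print(f"{label}: {cap:.0f} W usable vs {draw:.0f} W draw -> {verdict}")
```

Power-limiting the cards changes the picture considerably, so plug in your own per-GPU figure.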

Really, the problem is no one will sell fast, sanely priced 32GB+ GPUs. I am hoping Intel will disrupt this awful status quo with Battlemage.


The thought of the power draw of four 4090s going through a commercial PSU, combined with insisting on air cooling in a case that is now horribly cramped and has no airflow, probably keeps a firefighter awake at night sometimes.

There's no reasonable use of consumer motherboards having the space for that. Even SLI is more or less abandoned. Needless to say:

* Making a new motherboard form factor, incompatible with _everything in the market_

* Having to therefore make new cases, incompatible with _everything in the market_ (or, well, it'll be compatible. It'll just be extremely empty.)

* Having to probably make your own PSUs because I wouldn't trust a constant 1200W draw from just GPUs on your average Seasonic PSU

If you build it, not only will no one come, but they also won't have the money to pay for what you'd charge to even just offset costs.


There are cases on the market today with more than 7 slots next to each other like the Fractal Design Define 7 XL and the Fractal Design Meshify 2 XL. These could fit three 3-slot cards with the right mainboard.

I think there is a market for it and I hope these products will arrive soon.


I totally agree!

There are a few more boards with 4 x16 slots at 2 slot spacing:

GIGABYTE MU92-TU0

ASRock Rack SPC621D8 (3 variants)

ASRock C621A WS

Supermicro X12SPA-TF


I use two 3090s to run the 70b model at a good speed. Takes 32 gigs of VRAM, more depending on context. I tried CPU+GPU (5900X + 3090) but with extended context it's slow enough that I wouldn't recommend it (~1 token/s). CPU only gets "let it run overnight" slow. Works ok-ish with a small context though (even if it's still "non-interactive" slow).


What’s the difference in output quality between that and the 33b parameter model? That would fit entirely in vram, right?


The 33B model is llama V1. Facebook reportedly held back 34B llama v2 because it failed some safety metrics.

So... Generally the quality is worse, but the available set of finetunes is totally different. Some llama v1 33b finetunes are not available in 70B, and extremely good at their niche.

Also 70B should get more than 1 token/sec on a single 3090 with the rest offloaded to CPU. I dunno what framework OP is using.


Any chance you could point me in the right direction on how to set something like this up?

Right now, I'm using pure CPU Llama but only the 17B version, based on I believe llama.cpp. How do I mix both CPU and GPU together for more performance?


The easy way: download koboldcpp. Otherwise you have to compile llama.cpp (or koboldcpp) with OpenCL or CUDA support. There are instructions for this on the git page.

Then offload as many layers as you can to the GPU with the GPU layers flag. You will have to play with this and observe your GPU's VRAM.
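For a starting value for that flag (`--n-gpu-layers` / `-ngl` in llama.cpp, `--gpulayers` in koboldcpp; check the current README, since flag names can change), here is a rough sketch. It assumes every layer is about the same size and reserves some VRAM for context and scratch buffers; the function name, the 2 GB reserve, and the 38 GB file size are illustrative assumptions.

```python
def gpu_layers_that_fit(model_file_gb: float, n_layers: int,
                        free_vram_gb: float, reserve_gb: float = 2.0) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = model_file_gb / n_layers
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Assumed example: a 38 GB quantized 70B file with 80 layers on a 24 GB card.
print(gpu_layers_that_fit(model_file_gb=38, n_layers=80, free_vram_gb=24))
```

Use the result as a first guess, then raise or lower it while watching VRAM usage, as described above.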


what example niche?


- roleplaying (chronos merged with airoboros)

- therapeutic/friend style chat (Samantha)

- translation (various single language finetunes)

- medical advice (can't remember this one)

This is non-exhaustive. And Llama V2's extended native context does really help some niches (like storytelling) that a few 33B models are still pretty good at.


> medical advice (can't remember this one)

You're probably thinking of Clinical Camel: https://huggingface.co/augtoma/qCammel-70-x


This is a 70B tune. And it's new to me. Looks interesting!


I'm running two RTX 3090s at PCIe 4.0 x8 on an X570 board w/ 128GB DDR4 @ 3200.

Going beyond that is very expensive right now.

The AMD X670 chipset offers 28 PCIe 5.0 lanes, can't you make a mainboard with three x16 PCIe 4.0 slots out of that? Ideally two models: One with 2-slot spacing (for watercooling) and another (oversized) board with 3-slot spacing for cases like the Fractal Design Meshify 2 XL.
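To put numbers on the lane question, here is a small sketch comparing theoretical per-direction PCIe bandwidth across generations, using the published transfer rates and 128b/130b encoding for PCIe 3.0 through 5.0. Note the assumption behind the comparison: three electrically full x16 slots need 48 lanes, so turning 28 faster lanes into that layout would take a PCIe switch, which the sketch does not model.

```python
# Transfer rate per lane in GT/s for each PCIe generation.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}
ENCODING = 128 / 130  # 128b/130b line encoding used by PCIe 3.0 and later


def pcie_bandwidth_gb_s(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    return GT_PER_S[gen] * ENCODING / 8 * lanes


print(f"PCIe 4.0 x8 : {pcie_bandwidth_gb_s(4, 8):.2f} GB/s")
print(f"PCIe 4.0 x16: {pcie_bandwidth_gb_s(4, 16):.2f} GB/s")
print(f"PCIe 5.0 x8 : {pcie_bandwidth_gb_s(5, 8):.2f} GB/s")
print(f"Lanes needed for three x16 slots: {3 * 16}")
```

A PCIe 5.0 x8 link matches a PCIe 4.0 x16 link on paper, but only if the card itself negotiates PCIe 5.0.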


Slot spacing on motherboards is a challenge due to high frequency signal attenuation. You have finite limits on how far your slots can be from the CPU. Your signal budget/allowable distances are decreasing as each successive PCIe generation runs at a higher frequency.

Yes, you can space out slots widely, however this means you have to use PCIe redrivers/retimers which adds cost to the board. You can also use different materials for the motherboard but again, this adds cost.

We'd love to provide better slot spacing configs, but there are technical and commercial tradeoffs to be made.


There are boards on the market that offer four PCIe 4.0 x16 slots at 2x spacing. Offering three PCIe 4.0 x16 slots at 3x spacing means just going one slot position further. I hope it's possible.


In EdgeChains, we run models using DeepJavaLibrary (DJL). Preferably CPU only, since we're more focused on edge + embedding use cases.


I would love to see an AMD MI300A board for hobbyists :D



