Hacker News

Why would you use this over vLLM?


We have vLLM in certain production instances, and it is a pain on most non-NVIDIA architectures. After a bit of digging around we realized that most of it is just a wrapper on top of PyTorch function calls. If we could do away with the batch processing that vLLM supports, we would be fine, and that is what we did here.
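A minimal sketch of what such a "simple inference script" on top of plain PyTorch might look like: a greedy decode loop, one token at a time, with no batching. The ToyLM model, sizes, and token ids are illustrative stand-ins (not the commenter's actual code); in practice you would load a real causal LM, but the loop is the same shape.

```python
import torch
import torch.nn as nn

vocab, hidden = 100, 32
torch.manual_seed(0)

class ToyLM(nn.Module):
    """Stand-in for a real causal LM: embedding + output head."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, ids):
        return self.head(self.embed(ids))  # (batch, seq, vocab) logits

model = ToyLM().eval()
ids = torch.tensor([[1, 2, 3]])            # prompt token ids
with torch.no_grad():
    for _ in range(5):                     # greedy decoding, one token per step
        logits = model(ids)
        nxt = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, nxt], dim=1)

print(ids.shape)  # (1, 8): 3 prompt tokens + 5 generated
```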


Batching is how you get ~350 tokens/sec with Qwen 14B on vLLM (7900 XTX): by running 15 requests at once.

Also, there is a Dockerfile.rocm at the root of vLLM's repo. How is it a pain?
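The throughput claim above comes down to amortization: one forward pass over a batch of 15 sequences produces a new token for all 15 requests at roughly the per-step cost of one. A toy sketch of that idea, where ToyLM-style layers and all sizes are illustrative assumptions (not vLLM internals):

```python
import torch
import torch.nn as nn

vocab, hidden, batch, seq = 100, 32, 15, 16
# Stand-in for a causal LM: embedding + output head.
model = nn.Sequential(nn.Embedding(vocab, hidden), nn.Linear(hidden, vocab)).eval()

tokens = torch.randint(0, vocab, (batch, seq))  # 15 requests padded into one batch
with torch.no_grad():
    logits = model(tokens)                      # single batched forward pass
next_tokens = logits[:, -1, :].argmax(dim=-1)   # one new token per request

print(logits.shape)       # torch.Size([15, 16, 100])
print(next_tokens.shape)  # torch.Size([15]): 15 tokens for the cost of one step
```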


Driver mismatch issues. We mostly use publicly available instances, so the drivers change as the instances (and their base images) change. Not saying it won't work, but it was more painful to figure out vLLM than to write a simple inference script and do it ourselves.



