Hacker News

Why would you use this over vLLM?


We have vLLM in certain production instances, and it is a pain on most non-NVIDIA architectures. After a bit of digging around we realized that most of it is just a wrapper on top of PyTorch function calls. If we could do away with the batch processing that vLLM supports, we would be fine, and that is what we did here.
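A minimal sketch of what such a "simple inference script" on top of plain PyTorch might look like: a greedy decode loop, one token at a time, with no batching. The ToyLM model, sizes, and token ids are illustrative stand-ins (not the commenter's actual code); in practice you would load a real causal LM, but the loop is the same shape.

```python
import torch
import torch.nn as nn

vocab, hidden = 100, 32
torch.manual_seed(0)

class ToyLM(nn.Module):
    """Stand-in for a real causal LM: embedding + output head."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, ids):
        return self.head(self.embed(ids))  # (batch, seq, vocab) logits

model = ToyLM().eval()
ids = torch.tensor([[1, 2, 3]])            # prompt token ids
with torch.no_grad():
    for _ in range(5):                     # greedy decoding, one token per step
        logits = model(ids)
        nxt = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, nxt], dim=1)

print(ids.shape)  # (1, 8): 3 prompt tokens + 5 generated
```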


Batching is how you get ~350 tokens/sec with Qwen 14B on vLLM (7900 XTX): by running 15 requests at once.

Also, there is a Dockerfile.rocm at the root of vLLM's repo. How is it a pain?
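The throughput claim above comes down to amortization: one forward pass over a batch of 15 sequences produces a new token for all 15 requests at roughly the per-step cost of one. A toy sketch of that idea, where ToyLM-style layers and all sizes are illustrative assumptions (not vLLM internals):

```python
import torch
import torch.nn as nn

vocab, hidden, batch, seq = 100, 32, 15, 16
# Stand-in for a causal LM: embedding + output head.
model = nn.Sequential(nn.Embedding(vocab, hidden), nn.Linear(hidden, vocab)).eval()

tokens = torch.randint(0, vocab, (batch, seq))  # 15 requests padded into one batch
with torch.no_grad():
    logits = model(tokens)                      # single batched forward pass
next_tokens = logits[:, -1, :].argmax(dim=-1)   # one new token per request

print(logits.shape)       # torch.Size([15, 16, 100])
print(next_tokens.shape)  # torch.Size([15]): 15 tokens for the cost of one step
```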


Driver mismatch issues. We mostly use publicly available instances, so the drivers change as the instances (and their base images) change. Not saying it won't work, but it was more painful to figure out vLLM than to write a simple inference script and do it ourselves.



