13 results for "vllm"
vLLM-Compile: Bringing Compiler Optimizations to LLM Inference
vLLM-compile: Bringing Compiler Optimizations to LLM Inference. Luka Govedič, vLLM Committer; Senior Machine Learning Engineer, Red Hat…
Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19
Qwen3.6-27B has been out for a few days, and the NVFP4 with MTP was dropped earlier on HF. You can follow the same recipe I used for Qwen3.5-27B to achieve ~80 tps on a single RTX 5090 at a 218k context window via …
Comparison: vLLM 0.6 vs. Text Generation Inference 1.4 for Serving Code LLMs
Serving code LLMs at production scale is 3.2x more expensive than serving general-purpose LLMs when using…
Disaggregated Serving for Hybrid SSM Models in vLLM
Hybrid architectures that interleave Mamba-style SSM layers with standard full-attention (FA) layers — such as NVIDIA Nemotron-H — are gaining traction as a way…
Simple-to-use vLLM Docker container for Qwen3.6 27B with Lorbus AutoRound INT4 quant and MTP speculative decoding - 118 tokens/second on 2x 3090s
Qwen3.6-27B-INT4 clocking 100 tps with 256k context length on 1x RTX 5090 via vllm 0.19
Thanks to the community, Qwen3.6-27B speed keeps getting better. The following improves upon my recipe from yesterday and delivers a whopping 100+ tps (TG). Model: - MTP supported - KLD is decent …
Ubuntu 26.04 vs 24.04 speed improvements for inference?
I'm curious if any brave soul has upgraded their computer (especially if it's Strix Halo) from Ubuntu 24.04 -> 26.04 and seen a significant performance improvement for inference with VLLM, llama-serve…
Intel B70: LLama.cpp SYCL vs LLama.cpp OpenVino vs LLM-Scaler
In case anyone is interested, I decided to test out LLama.cpp's new OpenVino backend to see how it compares on Intel GPUs. At first glance, it stomps all over the previous best-case, SYCL, but lags be…
The exact KV cache usage of DeepSeek V4
Figure 1 of the DSV4 paper seems to imply that DSV3.2 uses ~50 GB at 1M context and DSV4 uses ~5 GB: ***Numbers updated with the KV cache breakdown from vllm*** From my own calculations, the correct FP16 KV…
Qwen 3.6 27B in Claude Code says it will do something, then stops and prompts for a user reply (not failing a tool call)
I'm running Qwen/Qwen3.6-27B-FP8 via vLLM using this command: vllm serve Qwen/Qwen3.6-27B-FP8 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --max-num-seqs 8 \ --enable-auto-tool-choice --tool…
Will llama.cpp multislot improve speed?
I've heard mostly bad opinions about multiple slots with llama.cpp (--parallel > 1). I guess compared to vLLM it might be worse at this, but I recently tried vLLM on 4 slots and it indeed improved th…
Qwen3.6-35B-A3B KLDs - INTs and NVFPs
KLD for INTs and NVFP4s. AS ALWAYS - Use Case is important. Accuracy versus speed versus native kernels on your GPUs. Things to note again: This is done in VLLM, with REAL logits. My Repo ( ) has made…
your daily driver stack, what's it look like? and why?
What it says in the title, I'm interested in hearing what you all have landed on as a workable / useful stack for you. Mine looks like this: back end inference servers - llama.cpp, vLLM | V hermes-age…