13 results for "vllm"
vLLM-Compile: Bringing Compiler Optimizations to LLM Inference
vLLM-compile: Bringing Compiler Optimizations to LLM Inference. Luka Govedič, vLLM Committer; Senior Machine Learning Engineer, Red Hat…
Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19
Qwen3.6-27B has been out for a few days, and the NVFP4 with MTP was dropped earlier on HF. You can follow the same recipe I used for Qwen3.5-27B to achieve ~80 tps on a single RTX 5090 at a 218k context window via …
Comparison: vLLM 0.6 vs. Text Generation Inference 1.4 for Serving Code LLMs
Serving code LLMs at production scale is 3.2x more expensive than serving general-purpose LLMs when using…
Disaggregated Serving for Hybrid SSM Models in vLLM
Hybrid architectures that interleave Mamba-style SSM layers with standard full-attention (FA) layers — such as NVIDIA Nemotron-H — are gaining traction as a way…
Simple-to-use vLLM Docker container for Qwen3.6 27B with Lorbus AutoRound INT4 quant and MTP speculative decoding - 118 tokens/second on 2x 3090s
Qwen3.6-27B-INT4 clocking 100 tps with 256k context length on 1x RTX 5090 via vllm 0.19
Thanks to the community, Qwen3.6-27B speed keeps getting better. The following improves upon my recipe from yesterday and delivers a whopping 100+ tps (TG). Model: - MTP supported - KLD is decent …
Ubuntu 26.04 vs 24.04 speed improvements for inference?
I'm curious if any brave soul has upgraded their computer (especially if it's Strix Halo) from Ubuntu 24.04 -> 26.04 and seen a significant performance improvement for inference with VLLM, llama-serve…
Intel B70: LLama.cpp SYCL vs LLama.cpp OpenVino vs LLM-Scaler
In case anyone is interested, I decided to test out LLama.cpp's new OpenVino backend to see how it compares on Intel GPUs. At first glance, it stomps all over the previous best-case, SYCL, but lags be…
The exact KV cache usage of DeepSeek V4
Figure 1 of the DSV4 paper seems to imply that DSV3.2 uses ~50 GB at 1M context and DSV4 uses ~5 GB: ***Numbers updated with the KV cache breakdown from vllm*** From my own calculations, the correct FP16 KV…
Qwen 3.6 27B in Claude Code says it will do something, then stops and prompts for a user reply (not failing a tool call)
I'm running Qwen/Qwen3.6-27B-FP8 via vLLM using this command: vllm serve Qwen/Qwen3.6-27B-FP8 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --max-num-seqs 8 \ --enable-auto-tool-choice --tool…
Will llama.cpp multislot improve speed?
I've heard mostly bad opinions about multiple slots with llama.cpp (--parallel > 1). I guess compared to vLLM it might be worse at this, but I recently tried vLLM on 4 slots and it indeed improved th…
Qwen3.6-35B-A3B KLDs - INTs and NVFPs
KLD for INTs and NVFP4s. AS ALWAYS - Use Case is important. Accuracy versus speed versus native kernels on your GPUs. Things to note again: This is done in VLLM, with REAL logits. My Repo ( ) has made…
your daily driver stack, what's it look like? and why?
What it says in the title, I'm interested in hearing what you all have landed on as a workable / useful stack for you. Mine looks like this: back end inference servers - llama.cpp, vLLM | V hermes-age…