WeSearch
TAG · #VLLM

vLLM coverage.

Every story in the WeSearch catalog tagged with #vllm, in chronological order, with view counts. Subscribe to the per-tag RSS feed to follow this topic in your reader of choice.

6 stories tagged with #vllm, in publish-time order across the WeSearch catalog. Tag pages update as new stories are ingested.

RSS feed for this tag → or search "vllm"

RELATED TAGS
#disaggregated-serving (1) · #mamba (1) · #state-space-models (1) · #rdma (1)
DEV.TO (TOP)

Comparison: vLLM 0.6 vs. Text Generation Inference 1.4 for Serving Code LLMs

Serving code LLMs at production scale is 3.2x more expensive than general-purpose LLMs when using…

5 views
GOOGLE DOCS

vLLM-Compile: Bringing Compiler Optimizations to LLM Inference

vLLM-Compile: Bringing Compiler Optimizations to LLM Inference. Luka Govedič, vLLM Committer, Senior Machine Learning Engineer, Red Hat…

6 views
VERCEL

Disaggregated Serving for Hybrid SSM Models in vLLM

Hybrid architectures that interleave Mamba-style SSM layers with standard full-attention (FA) layers — such as NVIDIA Nemotron-H — are gaining traction as a way…

5 views
#disaggregated-serving · #mamba
LOCALLLAMA

Simple to use vLLM Docker Container for Qwen3.6 27b with Lorbus AutoRound INT4 quant and MTP speculative decoding - 118 tokens/second on 2x 3090s

7 views
REDDIT

Qwen3.6-27B-INT4 clocking 100 tps with 256k context length on 1x RTX 5090 via vllm 0.19

Thanks to the community, the Qwen3.6-27B speed keeps getting better. The following improves upon my recipe from yesterday and delivered a whopping 100+ tps (TG). Model: - MTP suppor…

10 views
REDDIT

Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19

Qwen3.6-27B has been out for a few days and the NVFP4 with MTP was dropped earlier on HF: you can follow the same recipe I used for Qwen3.5-27B to achieve ~80 tps on a single RTX 5090 at 218k…

10 views