8 stories tagged with #vllm, in publish-time order across the WeSearch catalog. Tag pages update as new stories are ingested.
llama.cpp b9455 Finally Caught vLLM: 70t/s on 2x3090 Qwen 27B UQ8
Test post…
Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA
Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM - jmaczan/tiny-vllm…
Prefix caching in vLLM under multi-tenant agent traffic
TL;DR: We turned on vLLM's prefix cache for our agent workloads at Nexus Labs and watched TTFT drop…
End-to-End Observability for vLLM and TGI: from DCGM to Tokens
Running large language model inference servers in production exposes gaps that neither stock…
Intel llm-scaler-vllm PV 1.4 Released With Updated Components, Arc Pro B70 Support
Intel software engineers today rolled out the llm-scaler-vllm PV v1.4 as the Docker build of their latest software stack for those wishing to run vLLM in a pre-configured, performa…
Ollama vs llama.cpp vs vLLM: Which Should You Use in 2026?
Ollama vs llama.cpp vs vLLM compared — ease of use, speed, GPU needs. Which inference engine is right for your workflow?…
Comparison: vLLM 0.6 vs. Text Generation Inference 1.4 for Serving Code LLMs
Serving code LLMs at production scale is 3.2x more expensive than general-purpose LLMs when using…
Disaggregated Serving for Hybrid SSM Models in vLLM
Hybrid architectures that interleave Mamba-style SSM layers with standard full-attention (FA) layers — such as NVIDIA Nemotron-H — are gaining traction as a way…