20 stories tagged with #llm-inference, in publish-time order across the WeSearch catalog. Tag pages update as new stories are ingested.
TensorSharp: Open-Source Local LLM Inference Engine
A C# inference engine for running large language models (LLMs) locally using GGUF model files. TensorSharp provides a console application, a web-based chatbot interface, and Ollama…
Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA
Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM - jmaczan/tiny-vllm…
How Many GPUs? A simple LLM inference sizing calculator
Real-time LLM Inference on Standard GPUs: 3k tokens/s per request
Today, Kog AI launches a tech preview of the Kog Inference Engine (KIE): 3,000 output tokens/s per request on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative d…
Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks
As Large Language Models (LLMs) transition from research environments to production deployments, evaluating their performance against strict Service Level Objectives (SLOs) has bec…
Distributing LLM Inference in DwarfStar
Show HN: YieldOS-Lite – A simulator for LLM inference control-plane governance
Contribute to nikitph/yieldos development by creating an account on GitHub.…
RTX 6000 Ada vs RTX PRO Blackwell for local LLM inference?
SSV: Sparse Speculative Verification for Efficient LLM Inference
Speculative decoding and dynamic sparse attention are two complementary approaches for accelerating long-context LLM inference: the former amortizes target-model execution across m…
Characterization of machine learning compilers for LLM inference on NVIDIA GPUs
AI inference is conflicted between Performance, developer Productivity, and device Portability–the P3 problem. Machine learning compilers (MLCs) aim to address this, but their ecos…
Show HN: BonzAI – self-sovereign, local LLM inference in the browser
Generate unlimited AI content offline. Train custom models and earn crypto by serving them on our decentralized P2P network powered by Chainlink.…
Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU
Running language models in the browser presents a unique opportunity to build efficient, private, and portable AI applications, but requires contending with constrained memory avai…
SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference
SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips.…
AI Foundry – Flat-Fee Unlimited LLM Inference on Blackwell GPUs in NZ
NZ's Sovereign AI Inference Platform…
Agentic LLM Inference Parameters Reference for Qwen and Gemma
This page is a practical reference for agentic LLM inference tuning (temperature, top_p, top_k,...…
ClickBook – Offline Android eReader with local LLM inference via llama.rn
Tap any word to instantly understand it. Offline AI ereader for EPUB and PDFs.…
Ada-MK: Adaptive MegaKernel Optimization via DAG-Based Search for LLM Inference
When large language models (LLMs) serve real-time inference in commercial online advertising systems, end-to-end latency must be strictly bounded to the millisecond range. Yet ever…
I've updated my glorified Llama fork (LLM Inference Server) for P40's to utilise MTP + TurboQuant + DFlash
Show HN: AI/ML benchmark for local LLM inference and XGBoost training on GPU/CPU
A suite to benchmark CPU/GPU Python performance in training ML models and running local LLMs - albedan/ai-ml-gpu-bench…
Asynchronicity in Continuous Batching