#llm-inference — Tagged Stories

Every story in the WeSearch catalog tagged with #llm-inference, listed newest first with view counts. 20 stories carry this tag, and the page updates as new stories are ingested. Subscribe to the per-tag RSS feed to follow this topic in your reader of choice.

RELATED TAGS

#gpu-optimization (2) #asynchronous-processing (1) #cuda (1) #performance-optimization (1) #ml (1) #compiler-design (1) #high-performance-computing (1) #ada-mk (1) #wenxin-dong (1) #mingqing-hu (1) #guanghui-yu (1) #qiang-fu (1)

GITHUB

TensorSharp: Open-Source Local LLM Inference Engine

A C# inference engine for running large language models (LLMs) locally using GGUF model files. TensorSharp provides a console application, a web-based chatbot interface, and Ollama…

29 views · Thu, 04 Jun 2026 01:25:03 GMT

#technology #software #open-source

GITHUB

Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA

Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM - jmaczan/tiny-vllm…

17 views · Fri, 29 May 2026 19:45:02 GMT

#technology #programming #machine learning

HACKER NEWS (AI / LLM)

How Many GPUs? A simple LLM inference sizing calculator

16 views · Fri, 29 May 2026 17:05:05 GMT

KOG LABS

Real-time LLM Inference on Standard GPUs: 3k tokens/s per request

Today, Kog AI launches a tech preview of the Kog Inference Engine (KIE): 3,000 output tokens/s per request on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative d…

18 views · Fri, 29 May 2026 10:00:00 GMT

#ai #technology #gpu

ARXIV CS.AI

Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks

As Large Language Models (LLMs) transition from research environments to production deployments, evaluating their performance against strict Service Level Objectives (SLOs) has bec…

26 views · Tue, 26 May 2026 04:07:43 GMT

#artificial intelligence #machine learning #performance evaluation

ANTIREZ

Distributing LLM Inference in DwarfStar

16 views · Mon, 25 May 2026 15:27:38 GMT

GITHUB

Show HN: YieldOS-Lite – A simulator for LLM inference control-plane governance

17 views · Mon, 25 May 2026 04:42:36 GMT

#technology #research #simulation

R/HOMELAB

RTX 6000 Ada vs RTX PRO Blackwell for local LLM inference?

20 views · Mon, 25 May 2026 00:37:38 GMT

ARXIV.ORG

SSV: Sparse Speculative Verification for Efficient LLM Inference

Speculative decoding and dynamic sparse attention are two complementary approaches for accelerating long-context LLM inference: the former amortizes target-model execution across m…

20 views · Sun, 24 May 2026 02:07:28 GMT

#computer science #machine learning #operating systems

SPRINGER

Characterization of machine learning compilers for LLM inference on NVIDIA GPUs

AI inference is conflicted between Performance, developer Productivity, and device Portability–the P3 problem. Machine learning compilers (MLCs) aim to address this, but their ecos…

20 views · Sun, 24 May 2026 02:07:28 GMT

#machine learning #nvidia #artificial intelligence

BONZAI

Show HN: BonzAI – self-sovereign, local LLM inference in the browser

Generate unlimited AI content offline. Train custom models and earn crypto by serving them on our decentralized P2P network powered by Chainlink.…

15 views · Fri, 22 May 2026 22:47:03 GMT

ARXIV CS.AI

Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU

Running language models in the browser presents a unique opportunity to build efficient, private, and portable AI applications, but requires contending with constrained memory avai…

21 views · Fri, 22 May 2026 04:02:00 GMT

#artificial intelligence #machine learning #distributed computing

GITHUB

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips.…

19 views · Tue, 19 May 2026 01:29:57 GMT

#technology #artificial intelligence #machine learning

AI Foundry – Flat-Fee Unlimited LLM Inference on Blackwell GPUs in NZ

NZ's Sovereign AI Inference Platform…

14 views · Mon, 18 May 2026 07:24:56 GMT

DEV.TO (TOP)

Agentic LLM Inference Parameters Reference for Qwen and Gemma

This page is a practical reference for agentic LLM inference tuning (temperature, top_p, top_k,...…

20 views · Sun, 17 May 2026 02:40:19 GMT

#llm #tuning #qwen

GOOGLE

ClickBook – Offline Android eReader with local LLM inference via llama.rn

Tap any word to instantly understand it. Offline AI ereader for EPUB and PDFs.…

15 views · Sat, 16 May 2026 20:15:19 GMT

ARXIV.ORG

Ada-MK: Adaptive MegaKernel Optimization via DAG-Based Search for LLM Inference

When large language models (LLMs) serve real-time inference in commercial online advertising systems, end-to-end latency must be strictly bounded to the millisecond range. Yet ever…

21 views · Sat, 16 May 2026 17:00:18 GMT

#machine learning #gpu optimization

R/LOCALLLAMA