42 stories tagged with #inference, in publish-time order across the WeSearch catalog. Tag pages update as new stories ingest.
Inference is giving AI chip startups a second chance to make their mark
In a disaggregated AI world, Nvidia can be both a friend and an enemy. AI adoption is reaching an inflection point as the focus shifts from training new models to serving them. For …
Built a local LLM inference engine on CachyOS — runs faster than llama.cpp on my 9070 XT
Hey folks, we've been hacking on a Vulkan-based LLM engine the last few weeks, figured I'd share since I'm running it exclusively on CachyOS with Mesa RADV. It's called VulkanForge…
VulkanForge – 14 MB Vulkan LLM engine that runs native FP8 models on AMD (Rust)
Inference in Rust and Vulkan. Contribute to maeddesg/vulkanforge development by creating an account on GitHub.…
Anthropic in early talks to buy DRAM-less AI inference chips from UK startup — Fractile's SRAM architecture reduces the need for pricey memory amid an extreme pricing and shortage crunch
Anthropic has reportedly held early discussions with London-based chip startup Fractile about purchasing the company's inference accelerators.…
[Paper on Hummingbird+: low-cost FPGAs for LLM inference] Qwen3-30B-A3B Q4 at 18 t/s token-gen, 24GB, expected $150 mass production cost
Inference Scaling (Test-Time Compute): Why Reasoning Models Raise Your Compute Bill
Why reasoning models dramatically increase token usage, latency, and infrastructure costs in production systems The post Inference Scaling (Test-Time Compute): Why Reasoning Models…
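The teaser's point can be put in back-of-envelope numbers. A minimal sketch with hypothetical prices and token counts (none of these figures are from the post): reasoning ("thinking") tokens are billed as output tokens, so they multiply the cost of every request.

```python
# Back-of-envelope cost arithmetic with hypothetical prices and token counts;
# the point is that reasoning tokens multiply the billed output.
PRICE_PER_M_OUTPUT = 10.00   # $ per 1M output tokens (hypothetical)

def request_cost(answer_tokens, reasoning_tokens=0):
    """Cost of one request: reasoning tokens are billed like answer tokens."""
    return (answer_tokens + reasoning_tokens) * PRICE_PER_M_OUTPUT / 1_000_000

plain = request_cost(answer_tokens=300)
reasoning = request_cost(answer_tokens=300, reasoning_tokens=4700)
print(f"plain: ${plain:.4f}, reasoning: ${reasoning:.4f}, "
      f"multiplier: {reasoning / plain:.1f}x")
```

The same arithmetic applies to latency: a short visible answer can hide thousands of generated-but-unshown tokens.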
2026.18: Long-term, Peripheral & Myopic Visions
The best Stratechery content from the week of April 27, 2026, including Amazon and AI, the future of AR devices, and Beijing's myopia.…
Sources: Anthropic is in early talks to buy AI inference chips from UK-based Fractile when they become available in 2027 (The Information)
The Information : Sources: Anthropic is in early talks to buy AI inference chips from UK-based Fractile when they become available in 2027 — As Anthropic's sales explode, straining…
Hybrid on-device inference on Android: llama.cpp + LiteRT + NPU/GPU routing
Hi everyone, I’m the maintainer of Box — a fork of Google’s AI Edge Gallery that I’ve been extending into a fully offline AI assistant for Android. Full disclosure: I built this pr…
Chapter 12: Inference - Generating New Text
What You'll Build: a sampling loop that generates new names from the trained model. Depends On: Chapter 11 (the trained model). How Generation Works: After training, the parameters ar…
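The sampling loop the teaser describes can be sketched roughly like this. `VOCAB`, `next_char_probs`, and `sample_name` are hypothetical stand-ins, not the chapter's actual code, and the "model" here is a dummy distribution rather than trained parameters:

```python
import random

# Hypothetical stand-ins for the chapter's artifacts: a character vocabulary
# and a next-character distribution. '.' marks the start/end of a name.
VOCAB = list(".abcdefghijklmnopqrstuvwxyz")

def next_char_probs(context):
    """Dummy stand-in for the trained model: return P(next char | context).
    Here it is a fixed pseudo-random distribution seeded by the context."""
    rng = random.Random(context)
    weights = [rng.random() for _ in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]

def sample_name(max_len=10, seed=0):
    """The sampling loop: start from '.', repeatedly draw the next character
    from the model's distribution, stop at '.' or after max_len characters."""
    rng = random.Random(seed)
    out, ch = [], "."
    for _ in range(max_len):
        probs = next_char_probs(ch)
        ch = rng.choices(VOCAB, weights=probs)[0]
        if ch == ".":
            break
        out.append(ch)
    return "".join(out)

print(sample_name(seed=1))
```

Swapping the dummy `next_char_probs` for a real model's softmax output turns this into the chapter's generator.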
Welcome to Actual Computer
Actual Computer is building software for mesh inference across heterogeneous hardware, abstracting device communication, topology, OS compatibility, and provider API equivalency so…
Product Experimentation with Propensity Scores: Causal Inference for LLM-Based Features in Python
Every product experimentation team running causal inference on LLM-based features eventually hits the same wall: when users click "Try our AI assistant," the volunteers aren't a ra…
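The self-selection problem the teaser describes can be illustrated with inverse propensity weighting on synthetic data. Everything below is an illustrative assumption, not from the post: one binary covariate, a stratified propensity estimate, and a planted true lift of 0.1.

```python
import random

random.seed(0)

# Hypothetical synthetic data: heavy users self-select into trying the AI
# feature, and heavy users also convert more regardless of the feature.
rows = []
for _ in range(10000):
    heavy = random.random() < 0.3
    treated = random.random() < (0.7 if heavy else 0.2)   # self-selection
    base = 0.5 if heavy else 0.2
    converted = random.random() < base + (0.1 if treated else 0.0)  # true lift 0.1
    rows.append((heavy, treated, converted))

# Propensity = P(treated | covariates); with one binary covariate we can
# estimate it by simple stratification instead of fitting a model.
def propensity(heavy):
    stratum = [r for r in rows if r[0] == heavy]
    return sum(r[1] for r in stratum) / len(stratum)

# Inverse-propensity-weighted (Hajek) estimate of the average treatment effect.
num_t = sum(c / propensity(h) for h, t, c in rows if t)
den_t = sum(1 / propensity(h) for h, t, c in rows if t)
num_c = sum(c / (1 - propensity(h)) for h, t, c in rows if not t)
den_c = sum(1 / (1 - propensity(h)) for h, t, c in rows if not t)
ate = num_t / den_t - num_c / den_c

# Naive treated-vs-untreated difference, biased by self-selection.
naive = (sum(c for _, t, c in rows if t) / sum(1 for _, t, c in rows if t)
         - sum(c for _, t, c in rows if not t) / sum(1 for _, t, c in rows if not t))
print(f"naive diff: {naive:.3f}, IPW estimate: {ate:.3f}")
```

On this data the naive difference lands well above the planted 0.1 lift, while the reweighted estimate recovers it, which is the wall the teaser is pointing at.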
Cloud computing provider Nebius agrees to buy Eigen AI, which optimizes the performance of chips running AI inference tasks, for $615M in stock and cash (Dina Bass/Bloomberg)
Dina Bass / Bloomberg : Cloud computing provider Nebius agrees to buy Eigen AI, which optimizes the performance of chips running AI inference tasks, for $615M in stock and cash — C…
Nebius Agrees to Acquire Eigen AI, Strengthening Nebius Token Factory as a Frontier Inference Platform - Morningstar
Comprehensive up-to-date news coverage, aggregated from sources all over the world by Google News.…
Openpi-flash: Real-time inference engine for openpi
Real-time inference engine for openpi. Contribute to Hebbian-Robotics/openpi-flash development by creating an account on GitHub.…
More Tokens Isn't More Intelligence
Cost vs. benefit imbalance: Biology vs. AI scaling…
Video Demo: How Does Model Compression Change AI Reasoning?
In this video, I benchmark Mistral-7B-Instruct-v0.2 on an NVIDIA H200 DigitalOcean GPU in three…
Serverless inference platform Featherless.ai raised a $20M Series A co-led by AMD Ventures and Airbus Ventures; the startup supports over 30,000 open models (Cate Lawrence/Tech.eu)
By Cate Lawrence / Tech.eu. View the full context on Techmeme.…
Amazon Earnings, Trainium and Commodity Markets, Additional Amazon Notes
Amazon’s earnings suggest that the shift away from training towards inference and agents means their bet on Trainium is paying off. Plus, additional notes on ads, agents, and sport…
Ask HN: What are you doing during inference?
Comparison: vLLM 0.6 vs. Text Generation Inference 1.4 for Serving Code LLMs
Serving code LLMs at production scale is 3.2x more expensive than general-purpose LLMs when using…
Applied AI-Enhanced RF Interference Rejection
AI-enhanced interference rejection in radio frequency (RF) transmissions has recently attracted interest because deep learning approaches trained on both the signal of interest (SO…
Complete Cyclic Subtask Graphs for Tool-Using LLM Agents: Flexibility, Cost, and Bottlenecks in Multi-Agent Workflows
Long-horizon tool-using tasks sometimes benefit from revisiting earlier subtasks for recovery and exploration, but added multi-agent workflow flexibility can also introduce coordin…
Benchmarking Inference Engines on Agentic Workloads
vLLM-Compile: Bringing Compiler Optimizations to LLM Inference
vLLM-Compile: Bringing Compiler Optimizations to LLM Inference. Luka Govedič, vLLM Committer and Senior Machine Learning Engineer, Red Hat…
DigitalOcean launches AI-Native Cloud platform for inference workloads
Private Decentralized Inference on Consumer Hardware [pdf]
Decentralized Private Inference. Contribute to Layr-Labs/d-inference development by creating an account on GitHub.…
PAVO-Bench – 50K voice turns and an 85K-param router for ASR→LLM→TTS
A 50K-turn voice pipeline benchmark and an 85K-param meta-controller that cuts P95 latency 10.3% and energy 71% vs fixed cloud. TMLR 2026. - vnmoorthy/pavo-bench…
Ubuntu's AI roadmap revealed, universal AI 'kill switch' and forced AI integration are not part of the plan — cloud tracking, local inference, and agentic system tools take center stage
AI is coming to Ubuntu…
AI is coming to Linux, but not in the obnoxious way that grinds your gears
Ubuntu is bringing AI into the OS carefully, focusing on optional features, local processing, and tools that enhance workflows without disrupting the traditional Linux experience.…
DigitalOcean launches AI inference engine with routing capabilities
How to Use Transformers.js in a Chrome Extension
We’re on a journey to advance and democratize artificial intelligence through open source and open science.…
AMD: Inference And Agentic AI Are Expanding Its Runway
Advanced Micro Devices is Buy-rated on expanding AI demand, strong EPYC/data center momentum, and discounted valuation. Learn more about AMD stock here.…
Active Inference: A method for Phenotyping Agency in AI systems?
The proliferation of agentic artificial intelligence has outpaced the conceptual tools needed to characterize agency in computational systems. Prevailing definitions mainly rely on…
Causal Discovery as Dialectical Aggregation: A Quantitative Argumentation Framework
Constraint-based causal discovery is brittle in finite-sample regimes because erroneous conditional-independence (CI) decisions can cascade into substantial structural errors. We p…
We benchmarked gpt-oss-120b across 6 inference providers and found a 10x throughput spread
We ran a benchmark across 10+ LLM routers, providers, and inference backends to answer the questions that come up every time someone picks a provider. Key findings: Do LLM routers …
Skymizer Taiwan Inc. Unveils Breakthrough Architecture Enabling Ultra-Large LLM Inference on a Single Card
Article excerpt: With a single PCIe card — powered by six HTX301 chips and 384 GB of memory — enterprises can now run 700B-parameter model inference locally at just ~240W pe…
Ubuntu 26.04 vs 24.04 speed improvements for inference?
I'm curious if any brave soul has upgraded their computer (especially if it's Strix Halo) from Ubuntu 24.04 -> 26.04 and seen a significant performance improvement for inference wi…
AMD Hipfire - a new inference engine optimized for AMD GPUs
Came across hipfire the other day. It's a brand new inference engine focused on all AMD GPUs (not just the latest). Github. It uses a special mq4 quantization method. The hipfire …
llama.cpp DeepSeek v4 Flash experimental inference
Hi, here you can find experimental llama.cpp support for DeepSeek v4, and here there is the GGUF you can use to run the inference with "just" (lol) 128GB of RAM. The model, even qu…
DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles
We are thrilled to announce Day-0 support for DeepSeek-V4 across both inference and RL training. SGLang and Miles form the first open-source stack to serve and train DeepSeek-V4 on…
FP4 inference in llama.cpp (NVFP4) and ik_llama.cpp (MXFP4) landed - Finally
Both llama.cpp and ik_llama.cpp now have FP4 support — but with different flavors worth knowing about. llama.cpp recently merged NVFP4 (Nvidia's block-scaled FP4, `GGML_TYPE_NVFP4 …
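The idea shared by both flavors, block-scaled FP4, can be sketched like this. The block size, the scale encoding, and the use of the E2M1 magnitude grid below are illustrative assumptions, not the actual NVFP4 or MXFP4 layouts or GGML kernels:

```python
# Illustrative block-scaled FP4 quantization: each block of values shares one
# scale, and each value is snapped to the nearest representable E2M1 (FP4)
# magnitude. Block size 16 is an assumption here; NVFP4 and MXFP4 differ in
# block size and in how the per-block scale is encoded.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes

def quantize_block(values):
    """Pick a shared scale so the largest value maps to the grid max (6.0),
    then snap each value to the nearest grid magnitude plus a sign bit."""
    scale = max(abs(v) for v in values) / 6.0 or 1.0
    codes = []
    for v in values:
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append((mag, v < 0))
    return scale, codes

def dequantize_block(scale, codes):
    """Reverse the mapping: reapply sign and multiply by the block scale."""
    return [(-mag if neg else mag) * scale for mag, neg in codes]

block = [0.01 * i - 0.07 for i in range(16)]   # toy weights
scale, codes = quantize_block(block)
recon = dequantize_block(scale, codes)
err = max(abs(a - b) for a, b in zip(block, recon))
print(f"max abs error: {err:.4f}")
```

The per-block scale is what keeps 4-bit values usable: the worst-case rounding error is bounded by the scale times the widest gap in the grid, so outliers only hurt their own block.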