WeSearch
SEARCH · CPP

Results for "cpp".

24 stories match your query across our 700+ source catalog. Ranked by relevance and recency.


REDDIT

Intel B70: llama.cpp SYCL vs llama.cpp OpenVINO vs LLM-Scaler

In case anyone is interested, I decided to test out llama.cpp's new OpenVINO backend to see how it compares on Intel GPUs. At first glance, it stomps all over the previous best case, SYCL, but lags be…

· 6 views
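For readers who want to reproduce a backend comparison like this one: llama.cpp selects backends at build time. A minimal sketch follows; the SYCL options are the documented ones, while the OpenVINO flag name is an assumption (the backend is new, so check its docs).

    # SYCL build (documented: Intel oneAPI compilers)
    cmake -B build-sycl -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
    cmake --build build-sycl --config Release

    # OpenVINO build (flag name assumed)
    cmake -B build-ov -DGGML_OPENVINO=ON
    cmake --build build-ov --config Release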
REDDIT

Benchmark: Windows 11 vs Lubuntu 26.04 on Llama.cpp (RTX 5080 + i9-14900KF). I didn't expect the gap to be this big.

UPDATE: Vulkan benches are now included. And yes, I used AI to help me write this post. As a life-long Windows user (don't hate me, I was exposed to it at a young age) I was wondering how much (if an…

· 6 views
REDDIT

llama.cpp DeepSeek v4 Flash experimental inference

Hi, here you can find experimental llama.cpp support for DeepSeek v4, and here is the GGUF you can use to run inference with "just" (lol) 128GB of RAM. The model, even quantized at 2 bit, lo…

· 6 views
REDDIT

Will llama.cpp multislot improve speed?

I've heard mostly bad opinions about multiple slots with llama.cpp (--parallel > 1). I guess that, compared to vLLM, it might be worse at this, but I recently tried vLLM on 4 slots and it indeed improved th…

· 6 views
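For reference, --parallel is a stock llama-server flag, and the total context set with -c is shared across slots, so each of N slots gets roughly 1/N of it. A minimal sketch with a placeholder model path:

    # 4 decoding slots sharing a 32768-token context (~8192 tokens per slot)
    llama-server -m ./model.gguf -c 32768 --parallel 4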
REDDIT

Expert volunteers needed for Vulkan on ik_llama.cpp

ik_llama.cpp is great for both CPU & CUDA. Need legends to make Vulkan better as well. So, after bringing the Vulkan back-end up to speed some time ago, I felt that I simply don't have the bandwidth t…

· 6 views
REDDIT

FP4 inference in llama.cpp (NVFP4) and ik_llama.cpp (MXFP4) landed - Finally

Both llama.cpp and ik_llama.cpp now have FP4 support — but with different flavors worth knowing about. llama.cpp recently merged NVFP4 (Nvidia's block-scaled FP4, `GGML_TYPE_NVFP4 = 40`), with CUDA ke…

· 11 views
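For the curious, quantizing to the new types should go through the usual llama-quantize tool; the CLI type names below are assumptions inferred from the enum names in the post, so verify with llama-quantize --help on a current build.

    # assumed type names; input must be an f16/bf16 GGUF
    llama-quantize model-f16.gguf model-nvfp4.gguf NVFP4    # mainline llama.cpp
    llama-quantize model-f16.gguf model-mxfp4.gguf MXFP4    # ik_llama.cpp build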
THE GLOBE AND MAIL

Highlights from the spring economic update, including CPP contribution cuts and new sports funding

· 0 views
LOCALLLAMA

convert : add support for Nemotron Nano 3 Omni by danbev · Pull Request #22481 · ggml-org/llama.cpp

NVIDIA Nemotron 3 Nano Omni is a multimodal large language model that unifies video, audio, image, and text understanding to support enterprise-grade Q&A, summarization, transcription, and document in…

· 3 views
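The PR extends llama.cpp's standard conversion path, which for a Hugging Face checkpoint generally looks like the sketch below; the paths are placeholders, and the --mmproj pass for the multimodal projector exists in recent convert scripts but is worth verifying for this model.

    python convert_hf_to_gguf.py /path/to/Nemotron-Nano-3-Omni --outfile nemotron.gguf --outtype f16
    # the multimodal projector goes into a separate GGUF
    python convert_hf_to_gguf.py /path/to/Nemotron-Nano-3-Omni --mmproj --outfile mmproj-nemotron.gguf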
THE GLOBE AND MAIL

CPPIB among investors looking to sell down stakes in India’s NSE IPO, sources say

Shareholders including the Canada Pension Plan Investment Board, LIC, SBI, Temasek Holdings and Morgan Stanley will offload a 5% stake, according to the sources…

· 3 views
REDDIT

VRAM.cpp: Running llama-fit-params directly in your browser

Lots of people are always asking on this subreddit if their system can run a certain model. A lot of the "VRAM calculators" that I've found either provide very rough estimates or are severely lim…

· 7 views
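The back-of-envelope math such a calculator refines: weight memory is roughly parameter count × bits per weight / 8, and KV cache is roughly 2 × layers × KV heads × head dim × context length × bytes per element. A worked example for a Llama-3-8B-shaped model (32 layers, 8 KV heads, head dim 128; numbers assumed for illustration) at ~4.8 bits per weight with a 32k f16 KV cache:

    weights:  8e9 × 4.8 / 8 bytes                ≈ 4.8 GB
    KV cache: 2 × 32 × 8 × 128 × 32768 × 2 bytes ≈ 4.3 GB
    total:    ≈ 9.1 GB, plus compute buffers and runtime overhead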
LOCALLLAMA

Mesa PR with 37-130% llama.cpp prompt-processing (pp) perf gain for Vulkan on Linux on Intel Xe2

· 6 views
REDDIT

CUDA: reduce MMQ stream-k overhead by JohannesGaessler · Pull Request #22298 · ggml-org/llama.cpp

CUDA prompt processing speedup on MoE models; check this…

· 6 views
LOCALLLAMA

Qwen 3.6-35B-A3B KV cache bench: f16 vs q8_0 vs turbo3 vs turbo4 from 0 to 1M context on M5 Max

Took TheTom's TurboQuant Metal fork of llama.cpp (github.com/TheTom/llama-cpp-turboquant, the feature/turboquant-kv-cache branch) and ran a depth sweep on Qwen 3.6-35B-A3B Q8. TheTom had already publi…

· 2 views
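For anyone replicating the f16 and q8_0 legs of this bench on stock llama.cpp (the turbo3/turbo4 types exist only in the fork): cache types are set separately for K and V, and a quantized V cache requires flash attention. A minimal sketch with a placeholder model file:

    # f16 KV cache is the default
    llama-server -m qwen3.6-35b-a3b-q8.gguf -c 131072
    # q8_0 KV cache; flash attention is required for quantized V (older builds take bare -fa)
    llama-server -m qwen3.6-35b-a3b-q8.gguf -c 131072 -fa on -ctk q8_0 -ctv q8_0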
REDDIT

AMD Radeon RX 6900 XT - ROCm vs Vulkan - Gemma 4 and Qwen 3.5 speed benchmarks

Did some quick tests after building llama.cpp with ROCm 6.4.2 and latest Vulkan for my 6900 XT.

gemma4 E2B Q4_K:
ubatch | ROCm pp512 | Vulkan pp512 | ROCm tg128 | Vulkan tg128
32 | 1536.60 | 1423.49 | 151.92 | 174.59
6…

· 5 views
LOCALLLAMA

Most efficient way of running Gemma 4 E4B with multimodal capabilities on a laptop?

The gemma 4 E4B and E2B models have built-in multimodal capabilities. However, as far as I am aware, llama.cpp does not have proper support for vision and audio inputs (especially audio) for these mode…

· 5 views
REDDIT

How to run a local coding agent with Gemma 4 and Pi | Patrick Loeber

Tutorial from the Google guy; I use a very similar setup (llama.cpp instead of LM Studio)…

· 11 views
REDDIT

GBNF grammar tweak for faster Qwen3.6 35B-A3B and Qwen3.6 27B

Hi folks, enjoy an optimised Qwen3.6 35B-A3B and Qwen3.6 27B for coding and general-purpose use - they're able to solve puzzles correctly more often too. The initial intent was to optimise the 35B-A3B reason…

· 6 views
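For readers unfamiliar with GBNF: it is llama.cpp's grammar format for constraining sampling, passed via --grammar-file. A deliberately trivial example (not the optimised grammar from the post), with a placeholder model path:

    # answer.gbnf: constrain generation to a yes/no answer
    root ::= "yes" | "no"

    llama-cli -m qwen3.6-27b.gguf --grammar-file answer.gbnf -p "Is 97 prime? Answer yes or no."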
REDDIT

Brief Ngram-Mod Test Results - R9700/Qwen3.6 27B

Decided to try out the new --spec-type ngram-mod feature in llama.cpp using Qwen3.6 27B during an OpenCode bug chasing session. TLDR: Performance is variable, but so far it seems to provide a nice spe…

· 8 views
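Note that --spec-type ngram-mod is quoted from the post and does not exist in every build; the invocation below is a sketch of the described setup with a placeholder model path, so verify the flag against llama-server --help.

    # speculative n-gram decoding as described in the post
    llama-server -m qwen3.6-27b.gguf --spec-type ngram-mod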
LOCALLLAMA

What is the best coding agent (CLI) like Claude Code for Local Development

Hey all: I am trying to set up Claude Code to work with llama.cpp; I am using Qwen3.6-35B-A3B. I usually use Claude Code + a ZLM subscription (I got lucky with $30 yearly) - the setup is very simple …

· 8 views
REDDIT

Using PaddleOCR-VL-1.5 with llama-server for book OCR

I've been running PaddleOCR-VL-1.5 via llama.cpp's server for OCR on book pages. It handles complex layouts, tables, and mixed text/figure pages surprisingly well. Setup:
- Model: PaddleOCR-VL-1.5-GGU…

· 9 views
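The general shape of such a setup: llama-server loads the language model plus a multimodal projector, then serves an OpenAI-compatible endpoint that accepts images. File names below are placeholders:

    llama-server -m paddleocr-vl-1.5.gguf --mmproj mmproj-paddleocr.gguf --port 8080

    curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
      "messages": [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,<PAGE_SCAN>"}},
        {"type": "text", "text": "Transcribe this page."}
      ]}]
    }'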
LOCALLLAMA

Is there a way to mitigate performance as context grows?

In my local LLM setup I get from 30 to 80 t/s generation at the beginning, but it drops quite a lot as context grows. I use llama.cpp/Vulkan with an MI50 and a V100; are there some command-line flags t…

· 6 views
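Common mitigations on stock llama.cpp, offered as general advice rather than a guaranteed fix for this MI50 + V100 setup: enable flash attention, quantize the KV cache, and reuse cached prompt prefixes between requests. A sketch with a placeholder model path:

    # -fa: flash attention (older builds take bare -fa); -ctk/-ctv: quantized KV cache;
    # --cache-reuse: reuse matching prompt prefixes across requests
    llama-server -m model.gguf -c 32768 -fa on -ctk q8_0 -ctv q8_0 --cache-reuse 256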
LOCALLLAMA

Quant Qwen3.6-27B on 16GB VRAM with 100k context length

I have experimented with running Qwen3.6-27B on my laptop with an A5000 16GB GPU. I created my own IQ4_XS GGUF, "qwen3.6-27b-IQ4_XS-pure.gguf", with the Unsloth imatrix and compared the mean KLD of i…

· 5 views
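The workflow described here maps onto three stock llama.cpp tools; the model paths and calibration/test texts below are placeholders:

    # 1. build an importance matrix from calibration text
    llama-imatrix -m qwen3.6-27b-f16.gguf -f calibration.txt -o imatrix.dat
    # 2. quantize using the imatrix
    llama-quantize --imatrix imatrix.dat qwen3.6-27b-f16.gguf qwen3.6-27b-IQ4_XS.gguf IQ4_XS
    # 3. save f16 logits, then measure mean KLD of the quant against them
    llama-perplexity -m qwen3.6-27b-f16.gguf -f test.txt --kl-divergence-base logits.bin
    llama-perplexity -m qwen3.6-27b-IQ4_XS.gguf -f test.txt --kl-divergence-base logits.bin --kl-divergence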
REDDIT

Field report: coding with Qwen 3.6 35B-A3B on an M2 Macbook Pro with 32GB RAM

TL;DR: I finally have this working and doing real work within the tight specs of my 32GB RAM Mac. So for those who would like to fly like Julien Chaumond, here's an updated HOW-TO, an explanation of …

· 6 views
REDDIT

your daily driver stack, what's it look like? and why?

What it says in the title, I'm interested in hearing what you all have landed on as a workable / useful stack for you. Mine looks like this: back-end inference servers - llama.cpp, vLLM | V hermes-age…

· 6 views