29 stories tagged with #llama, in publish-time order across the WeSearch catalog. Tag pages update as new stories are ingested.
Show HN-style: Blue Arrow – modular orchestration system with state-driven execution, local LLaMA integration and post-execution verification
Normalized Categories: One Filter for "Polos" Across Every Supplier
If you've ever tried to search "polos under $10 in navy" across more than one supplier, you already…
Meta abandons open-source Llama for proprietary Muse Spark
Meta has shifted from Llama to its new proprietary AI model Muse Spark, leaving open-source developers searching for alternatives and migration paths.…
Mastering On-Device GenAI: How to Fine-Tune LLMs for Android Using LoRA and Kotlin 2.x
The dream of a truly personal AI—one that lives entirely on your smartphone, understands your medical…
llama.cpp's Preliminary SM120 Native NVFP4 MMQ Is Merged
And somehow we already got some GGUFs for it! (the one below is from the PR author himself)…
Pebble – Menu-bar text polisher running on local Ollama
Menu-bar text-polish tool that rewrites your clipboard with a local Ollama model. One global shortcut, seven presets, no cloud. - gashiartim/pebble…
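A minimal sketch of the pattern such a tool sits on, assuming Ollama's standard /api/generate endpoint on its default port; the preset names and prompt wording below are invented for illustration, not taken from Pebble:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

# Hypothetical presets; Pebble's actual seven presets aren't listed in the post.
PRESETS = {
    "concise": "Rewrite the following text to be shorter and clearer:",
    "formal": "Rewrite the following text in a formal register:",
}

def polish(text: str, preset: str = "concise", model: str = "llama3.2") -> str:
    """Send clipboard text to a local Ollama model and return the rewrite."""
    payload = {
        "model": model,
        "prompt": f"{PRESETS[preset]}\n\n{text}",
        "stream": False,  # get one JSON object back instead of a token stream
    }
    resp = requests.post(OLLAMA_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["response"].strip()

if __name__ == "__main__":
    print(polish("me and him was discussing the roadmap yesterday"))
```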
Arc Gate — LLM proxy that hits P=1.00 R=1.00 F1=1.00 on indirect/roleplay prompt injection (beats OpenAI Moderation and LlamaGuard)
Benchmarked on 40 out-of-distribution prompts: indirect requests, roleplay framings, hypothetical scenarios, technical phrasings. The stuff that slips past everything else. Arc Gat…
convert : add support for Nemotron Nano 3 Omni by danbev · Pull Request #22481 · ggml-org/llama.cpp
NVIDIA Nemotron 3 Nano Omni is a multimodal large language model that unifies video, audio, image, and text understanding to support enterprise-grade Q&A, summarization, transcript…
Arc Gate — LLM proxy that catches 100% of indirect/roleplay prompt injection attacks (beats OpenAI Moderation and LlamaGuard)
Built an LLM proxy that sits in front of any OpenAI-compatible endpoint and blocks prompt injection before it reaches your model. Benchmarked against OpenAI Moderation API and Llam…
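The post doesn't reveal Arc Gate's detector, but the proxy pattern it describes is simple to sketch; the phrase-matching check below is a crude stand-in for whatever classifier Arc Gate actually runs:

```python
import requests

UPSTREAM = "http://localhost:8000/v1/chat/completions"  # any OpenAI-compatible server

# Stand-in heuristic, NOT Arc Gate's method: it only marks where the check
# sits in the request path before traffic reaches the model.
SUSPECT_PHRASES = ("ignore previous instructions", "pretend you are", "hypothetically, as")

def looks_injected(messages: list[dict]) -> bool:
    text = " ".join(str(m.get("content", "")) for m in messages).lower()
    return any(p in text for p in SUSPECT_PHRASES)

def proxy_chat(payload: dict) -> dict:
    """Screen the prompt; forward to the upstream model only if it passes."""
    if looks_injected(payload.get("messages", [])):
        return {"error": "blocked: possible prompt injection"}
    resp = requests.post(UPSTREAM, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()
```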
Step-by-Step Guide to Building RAG with LlamaIndex 0.10 and Vector 0.4 for Docs Search
80% of engineering teams building RAG pipelines for internal documentation search waste 3+ weeks…
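The core loop of a LlamaIndex 0.10 docs-search pipeline is short; a minimal sketch, assuming the default in-memory vector store and an already-configured embedding/LLM backend (e.g. an OpenAI key in the environment) rather than the article's specific "Vector 0.4" store:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load internal docs from a local folder (path is illustrative)
documents = SimpleDirectoryReader("./docs").load_data()

# Embed the documents and build an in-memory vector index
index = VectorStoreIndex.from_documents(documents)

# Retrieve relevant chunks and synthesize an answer
query_engine = index.as_query_engine()
print(query_engine.query("How do we rotate the staging API keys?"))
```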
Show HN: DeadNet – Watch AI agents debate, play games, and write stories live
DeadNet is a live arena where AI agents debate, play games, and write stories while humans watch and vote. Watch matches or build your own agent.…
A Primer on LLM Post-Training
Duality of r/LocalLLaMA
Step-by-Step Guide to Setting Up Local AI Code Review with Continue.dev 0.9, Ollama 0.5, and ESLint 9
82% of engineering teams report that cloud-based AI code review tools leak sensitive IP, cost 4x more…
Offline Agentic Coding
Offline Agentic Coding: Ollama and Claude Code…
VRAM.cpp: Running llama-fit-params directly in your browser
Lots of people are always asking on this subreddit if their system can run a certain model. A lot of the "VRAM calculators" that I've found only provide either very rough estimates…
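For contrast, this is the back-of-envelope math rough calculators start from: quantized weights plus KV cache, with everything else ignored. The shapes below are Llama-3-8B-like assumptions; real usage adds compute buffers and backend overhead, which is exactly the gap a tool like llama-fit-params tries to close:

```python
def estimate_vram_gib(n_params_b: float, quant_bits: float,
                      n_layers: int, n_kv_heads: int, head_dim: int,
                      ctx_len: int, kv_bytes: int = 2) -> float:
    """Rough VRAM estimate: quantized weights + FP16 KV cache, nothing else."""
    weights = n_params_b * 1e9 * quant_bits / 8                            # model weights
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * ctx_len * kv_bytes   # K and V tensors
    return (weights + kv_cache) / 1024**3

# Example: an 8B model at ~4.5 bits/weight with 8K context
# (32 layers, 8 KV heads, head_dim 128 -- Llama-3-8B-like shapes).
print(f"{estimate_vram_gib(8, 4.5, 32, 8, 128, 8192):.1f} GiB")  # ~5.2 GiB
```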
Intel B70: llama.cpp SYCL vs llama.cpp OpenVINO vs LLM-Scaler
In case anyone is interested, I decided to test out llama.cpp's new OpenVINO backend to see how it compares on Intel GPUs. At first glance, it stomps all over the previous best-cas…
The cost math behind routing Claude Code through Ollama (~90% cut)
Pair Claude Desktop on Anthropic with Claude Code routed through Ollama. Visual walkthrough + copy-paste prompt that cuts your Claude Code bill ~90%. - Coherence-Daddy/use-ollama-t…
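One way the arithmetic can come out to ~90%, with placeholder token prices and volumes (these are assumptions for illustration, not current Anthropic rates):

```python
# Assumed prices in USD per million tokens -- placeholders, not real rates.
PRICE_IN, PRICE_OUT = 3.00, 15.00

def monthly_cost(in_mtok: float, out_mtok: float, local_share: float) -> float:
    """API spend when `local_share` of traffic is routed to a free local Ollama model."""
    cloud_fraction = 1.0 - local_share
    return cloud_fraction * (in_mtok * PRICE_IN + out_mtok * PRICE_OUT)

baseline = monthly_cost(50, 10, 0.0)  # everything on the API: $300
routed   = monthly_cost(50, 10, 0.9)  # ~90% of tokens handled locally: $30
print(f"${baseline:.0f} -> ${routed:.0f} ({1 - routed / baseline:.0%} cut)")
```

The cut tracks the routed share directly, so the real question the post has to answer is how much of Claude Code's traffic a local model can absorb without degrading results.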
Mesa PR with 37–130% llama.cpp pp perf gain for Vulkan on Linux on Intel Xe2
Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model
Big claims from Qwen about their latest open-weight model: Qwen3.6-27B delivers flagship-level agentic coding performance, s…
r/LocalLLaMA Rule Updates
As the sub has grown to over 1M weekly visitors (and as AI-based tools have gotten better), we've seen a marked increase in slop, spam, etc. This has been on the mod team's mind …
Using PaddleOCR-VL-1.5 with llama-server for book OCR
I've been running PaddleOCR-VL-1.5 via llama.cpp's server for OCR on book pages. It handles complex layouts, tables, and mixed text/figure pages surprisingly well. Setup: - Model: …
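A minimal client for that setup could look like the sketch below, assuming llama-server was launched with the model plus its multimodal projector so its OpenAI-compatible endpoint accepts images; the prompt wording is illustrative:

```python
import base64
import requests

SERVER = "http://localhost:8080/v1/chat/completions"  # llama-server default port

def ocr_page(image_path: str) -> str:
    """Send one scanned book page to the vision model behind llama-server."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    payload = {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe all text on this page, preserving layout."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "temperature": 0.0,  # deterministic decoding helps OCR consistency
    }
    resp = requests.post(SERVER, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(ocr_page("page_001.png"))
```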
Benchmark: Windows 11 vs Lubuntu 26.04 on Llama.cpp (RTX 5080 + i9-14900KF). I didn't expect the gap to be this big.
UPDATE: Vulkan benches are now included. And yes, I used AI to help me write this post. As a life-long Windows user (don't hate me, I was exposed to it at a young age) I was wonde…
llama.cpp DeepSeek v4 Flash experimental inference
Hi, here you can find experimental llama.cpp support for DeepSeek v4, and here is the GGUF you can use to run the inference with "just" (lol) 128GB of RAM. The model, even qu…
Will llama.cpp multislot improve speed?
I've heard mostly bad opinions about multiple slots with llama.cpp (--parallel > 1). I guess compared to vLLM it might be worse at this, but I recently tried vLLM on 4 slots and i…
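An easy way to answer this for your own hardware is to fire the same batch of requests sequentially and then concurrently and compare aggregate throughput. A rough sketch, assuming llama-server is running with --parallel 4 (keep in mind the total context set by -c is shared across slots):

```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

SERVER = "http://localhost:8080/completion"  # llama-server's native completion endpoint

def one_request(prompt: str) -> int:
    """Run one completion and return the number of generated tokens."""
    r = requests.post(SERVER, json={"prompt": prompt, "n_predict": 128}, timeout=300)
    r.raise_for_status()
    return r.json()["tokens_predicted"]

prompts = [f"Write a limerick about GPU #{i}." for i in range(4)]

# Same four prompts: first one at a time, then all in flight at once.
for workers in (1, 4):
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        tokens = sum(pool.map(one_request, prompts))
    print(f"{workers} in flight: {tokens / (time.time() - start):.1f} tok/s aggregate")
```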
Experts/volunteers needed for Vulkan on ik_llama.cpp
ik_llama.cpp is great for both CPU & CUDA. Need legends to make Vulkan better as well. So, after bringing the Vulkan back-end up to speed some time ago, I felt that I simply don't …
This is where we are right now, LocalLLaMA
the future is now…
CUDA: reduce MMQ stream-k overhead by JohannesGaessler · Pull Request #22298 · ggml-org/llama.cpp
CUDA prompt processing speedup on MoE. Check this…
FP4 inference in llama.cpp (NVFP4) and ik_llama.cpp (MXFP4) landed - Finally
Both llama.cpp and ik_llama.cpp now have FP4 support — but with different flavors worth knowing about. llama.cpp recently merged NVFP4 (Nvidia's block-scaled FP4, `GGML_TYPE_NVFP4 …
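As a toy illustration of what block-scaled FP4 means (not the actual GGML bit packing), here is the rounding behaviour with a shared per-block scale; the power-of-two variant mimics MXFP4's coarse scale format, while the unconstrained variant approximates NVFP4's finer-grained scales:

```python
import numpy as np

E2M1_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # FP4 (E2M1) magnitudes

def fake_quantize_block(x: np.ndarray, pow2_scale: bool) -> np.ndarray:
    """Quantize one block to FP4 against a shared scale, then dequantize."""
    scale = np.abs(x).max() / E2M1_LEVELS[-1]          # map block max onto FP4's max (6.0)
    if pow2_scale:
        scale = 2.0 ** np.ceil(np.log2(scale))         # MXFP4-style power-of-two scale
    codebook = np.concatenate([-E2M1_LEVELS[::-1], E2M1_LEVELS])  # signed FP4 values
    idx = np.abs(x[:, None] / scale - codebook[None, :]).argmin(axis=1)
    return codebook[idx] * scale

block = np.random.randn(32).astype(np.float32)  # MXFP4 blocks are 32 wide, NVFP4 16 wide
err_mx = np.abs(block - fake_quantize_block(block, pow2_scale=True)).mean()
err_nv = np.abs(block - fake_quantize_block(block, pow2_scale=False)).mean()
print(f"mean abs error  pow2 scale: {err_mx:.4f}   free scale: {err_nv:.4f}")
```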