16 stories tagged with #quantization, in publish-time order across the WeSearch catalog. Tag pages update as new stories are ingested.
Why your quantized LLM loses its MTP heads and how to keep them
Quantizing a model with multi-token prediction heads? Here's why standard conversion pipelines drop them silently, and how to preserve and calibrate them.…
Tail-Aware HiFloat4: W4A4 Post-Training Quantization for Wan2.2
This report describes Tail-Aware HiFloat4, our submission to the low-bit text-to-video generation quantization challenge. Our method adapts the public ViDiT-Q post-training quantiz…
InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization
Low-bit activation quantization remains a major bottleneck in efficient large language model (LLM) deployment. The difficulty is not only that activations contain outliers, but tha…
OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
We added W8A8 activation quantization to MLX — prefill went from 2.84s to 2.52s on M5 Pro
Preconditioning Vectors: Making Elasticsearch VectorDB BBQ Work for Every Vector
Learn when to use vector preconditioning to improve recall for Better Binary Quantization (BBQ) in Elasticsearch…
Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization
Large language models (LLMs) are increasingly deployed on mobile devices, where Neural Processing Units (NPUs) necessitate fully static quantization for optimal inference efficienc…
Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor
MXFP4 arithmetic can dramatically accelerate reinforcement learning (RL) post-training of large language models (LLMs), yet the quantization error introduces severe accuracy degrad…
3.125-Bit LLM quantization bypassing tensor cores
By trading heavy FP16 MatMuls for SRAM lookups and 1-bit additions, our custom quantization pipeline squeezes state-of-the-art models down to approx. 3 bits per weight with minimal…
ggufy: easy quantization for the GPU poor
Cohere cracks lossless quantization and native citations with first full Apache 2.0 licensed open model Command A+
Theory-optimal Quantization Based on Flatness
Post-training quantization has emerged as a widely adopted technique for compressing and accelerating the inference of Large Language Models (LLMs). The primary challenges in LLMs …
Evaluation of Various MLX Quantizations
Utilities to evaluate MLX quantizations (GitHub: deepsweet/mlx-eval).
Quantization Undoes Alignment: Bias Emergence in Compressed LLMs Across Models and Precision Levels
Large Language Models are routinely compressed via post-training quantization to reduce inference costs and memory footprint for cloud and edge deployment, yet the impact of this c…
PrismQuant: Rate-Distortion-Optimal Vector Quantization for Gaussian-Mixture Sources
For a Gaussian source under mean-squared error (MSE), classical transform coding is rate-distortion (RD) optimal: the Karhunen-Loève transform (KLT) diagonalizes the covariance, …
Scalar and Binary Quantization for Pgvector Vector Search and Storage (2024)
Quantization can reduce vector sizes, but how does it impact query performance and quality?…