Xiaomi open-sources MiMo-V2.5: 310B A15B 1M-context omnimodal model
Xiaomi has open-sourced MiMo-V2.5, a 310-billion-parameter omnimodal AI model with 15 billion activated parameters, supporting text, image, video, and audio understanding within a unified architecture. The model features a 1-million-token context window, a hybrid attention mechanism, and specialized vision and audio encoders. It is designed for strong performance in long-context reasoning, multimodal tasks, and agentic workflows. MiMo-V2.5 is available on Hugging Face and ModelScope under the MIT license.
- MiMo-V2.5 is a sparse Mixture-of-Experts (MoE) model with 310 billion total parameters and 15 billion activated per token.
- It supports a context length of up to 1 million tokens and integrates text, image, video, and audio understanding natively.
- The model uses a hybrid attention architecture combining sliding window and global attention to optimize KV-cache efficiency.
- It includes a 729M-parameter Vision Transformer and a 261M-parameter Audio Transformer for multimodal processing.
- MiMo-V2.5 was trained on approximately 48 trillion tokens using FP8 precision and includes multi-token prediction for faster inference.
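The KV-cache savings from the hybrid attention design can be sanity-checked with back-of-envelope arithmetic. The sketch below uses the layer counts, KV-head counts, and window size reported in the model card (9 global + 39 sliding-window layers, 8 vs. 4 KV heads, window 128); the all-global-attention baseline is our own assumption, which is likely why the estimate lands slightly under the reported "nearly 6×" figure.

```python
# Back-of-envelope KV-cache estimate for MiMo-V2.5's hybrid attention.
# Layer/head/window figures are from the model card; the comparison
# baseline (every layer using global attention) is an assumption.

CONTEXT = 1_000_000      # tokens in the 1M context window
GA_LAYERS = 9            # global-attention layers, 8 KV heads each
SWA_LAYERS = 39          # sliding-window layers, 4 KV heads each
WINDOW = 128             # SWA window size
GA_KV_HEADS, SWA_KV_HEADS = 8, 4

# Cached key/value positions summed over layers and KV heads
# (head dim is identical everywhere, so it cancels in the ratio).
hybrid = GA_LAYERS * GA_KV_HEADS * CONTEXT + SWA_LAYERS * SWA_KV_HEADS * WINDOW
full = (GA_LAYERS + SWA_LAYERS) * GA_KV_HEADS * CONTEXT  # all-global baseline

print(f"reduction: {full / hybrid:.1f}x")  # prints: reduction: 5.3x
```

At 1M tokens the SWA layers' contribution is negligible, so the ratio is dominated by 48×8 versus 9×8 head-layers of full-context cache.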
XiaomiMiMo / MiMo-V2.5 · License: MIT

🤗 HuggingFace | 📰 Blog | 🎨 Xiaomi MiMo API Platform | 🗨️ Xiaomi MiMo Studio | Community WeChat Group | Discord | Telegram | Reddit

1. Introduction

MiMo-V2.5 is a native omnimodal model with strong agentic capabilities, supporting text, image, video, and audio understanding within a unified architecture. Built upon the MiMo-V2-Flash backbone and extended with dedicated vision and audio encoders, it delivers robust performance across multimodal perception, long-context reasoning, and agentic workflows. Key features include:

- Hybrid Attention Architecture: Inherits the hybrid design from MiMo-V2-Flash, interleaving Sliding Window Attention (SWA) and Global Attention (GA) at a 5:1 ratio with a 128-token sliding window. This reduces KV-cache storage by nearly 6× while maintaining long-context performance via a learnable attention-sink bias.
- Native Omnimodal Encoders: Equipped with a 729M-parameter Vision Transformer (ViT) featuring hybrid window attention, and a dedicated audio encoder initialized from the weights of MiMo-Audio, enabling high-quality image, video, and audio understanding.
- Multi-Token Prediction (MTP): Three lightweight MTP modules with dense FFNs accelerate inference via speculative decoding and improve RL training efficiency.
- Efficient Pre-Training: Trained on a total of ~48T tokens using FP8 mixed precision, with a context window of up to 1M tokens.
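To make the Sliding Window Attention idea concrete, here is a minimal NumPy sketch of a causal sliding-window mask. This is illustrative only, not Xiaomi's implementation: each query attends to at most `window` recent keys, which is what bounds per-layer KV-cache growth in the SWA layers.

```python
import numpy as np

def swa_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal sliding-window mask: query i may attend to keys j
    with i - window < j <= i. True = attention allowed."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# Each row has at most `window` True entries, so per-layer KV cost
# is bounded by the window size rather than the full sequence length.
m = swa_mask(6, 3)
print(m.astype(int))
```

A global-attention layer corresponds to the same mask with `window >= seq_len`, i.e. a plain causal mask.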
- Agentic Capabilities: Post-training incorporates SFT, large-scale agentic RL, and Multi-Teacher On-Policy Distillation (MOPD), achieving strong performance on agentic tasks and multimodal understanding benchmarks.

Model Summary

- Architecture: Sparse MoE (Mixture of Experts), 310B total / 15B activated parameters
- Context Length: up to 1M tokens
- Modalities: text, image, video, audio
- Vision Encoder: 729M-param ViT (28 layers: 24 SWA + 4 full)
- Audio Encoder: 261M-param Audio Transformer (24 layers: 12 SWA + 12 full)
- Multi-Token Prediction (MTP): 329M parameters, 3 layers

2. Downloads

| Model | Context Length | Download |
| --- | --- | --- |
| MiMo-V2.5-Base | 256K | 🤗 HuggingFace · 🤖 ModelScope |
| MiMo-V2.5 | 1M | 🤗 HuggingFace · 🤖 ModelScope |

3. Evaluation Results

Multimodal Benchmarks · Coding & Agent Benchmarks · Long Context Benchmarks (benchmark tables not included in this excerpt)

4. Model Architecture

LLM Backbone

MiMo-V2.5's core language backbone inherits from the MiMo-V2-Flash architecture, a sparse MoE model with hybrid sliding-window attention.

| Component | MiMo-V2.5-Pro | MiMo-V2.5 |
| --- | --- | --- |
| Total Parameters | 1.02T | 310B |
| Activated Parameters | 42B | 15B |
| Hidden Size | 6144 | 4096 |
| Num Layers | 70 (1 dense + 69 MoE) | 48 (1 dense + 47 MoE) |
| Full Attention Layers | 10 | 9 |
| SWA Layers | 60 | 39 |
| Num Attention Heads | 128 | 64 |
| Num KV Heads | 8 (GQA) | 8 (GA) / 4 (SWA) |
| Head Dim (QK / V) | 192 / 128 | 192 / 128 |
| Routed Experts | 384 | 256 |
| Experts per Token | 8 | 8 |
| MoE Intermediate Size | 2048 | 2048 |
| Dense Intermediate Size | 16384 (layer 0 only) | 16384 (layer 0 only) |
| SWA Window Size | 128 | 128 |
| Max Context Length | 1M | 1M |
| MTP Layers | 3 | 3 |

Vision Encoder

We train a dedicated MiMo ViT that adopts…
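The MoE figures above (256 routed experts, 8 active per token) can be illustrated with a toy top-k router. The gating function below (softmax over the top-k logits) is a common design but an assumption here; the model card does not specify MiMo-V2.5's actual routing details.

```python
import numpy as np

# Toy top-k MoE router matching MiMo-V2.5's reported config
# (256 routed experts, 8 active per token). Gating scheme is assumed.
NUM_EXPERTS, TOP_K, HIDDEN = 256, 8, 16
rng = np.random.default_rng(0)
router_w = rng.standard_normal((HIDDEN, NUM_EXPERTS))

def route(x: np.ndarray):
    """Return the indices of the top-k experts for token vector x,
    plus softmax gate weights renormalized over those k experts."""
    logits = x @ router_w                 # (NUM_EXPERTS,) router scores
    top = np.argsort(logits)[-TOP_K:]     # indices of the 8 chosen experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                  # gates over selected experts sum to 1
    return top, gates

experts, gates = route(rng.standard_normal(HIDDEN))
print(len(experts), round(gates.sum(), 6))  # prints: 8 1.0
```

Only the 8 selected experts' FFNs run per token, which is how a 310B-parameter model activates just 15B parameters per forward pass.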
This excerpt is published under fair use for community discussion. Read the full model card at Hugging Face.