Google unveils Gemini Omni, a multimodal AI model that generates video from text, images, and audio
Google has introduced Gemini Omni, a multimodal AI model that generates video from text, image, and audio inputs. The model produces short clips with synchronized audio. It is set to replace the earlier Veo model and adds a conversational interface for editing.
- Gemini Omni can generate short video clips of approximately 10 seconds in length.
- The model emphasizes improvements in world understanding, physics simulation, and character consistency.
- Google plans to expand the clip length capabilities over time, although no specific timeline has been provided.
Opening excerpt (first ~120 words)
The multimodal model turns text, images, audio, and existing footage into realistic video clips, with implications that ripple well beyond Mountain View. By Editorial Team, May 23, 2026. Google DeepMind just dropped what might be the most capable video generation model…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Crypto Briefing.