55 stories tagged with #multimodal, in publish-time order across the WeSearch catalog. Tag pages update as new stories ingest.
Introducing Gemma 4 12B: a unified, encoder-free multimodal model
Alibaba releases Qwen3.7-Plus, a multimodal proprietary model with a 1M-token context window, costing $2 per 1M tokens, 60% less than text-only Qwen3.7-Max (Carl Franzen/VentureBeat)
CORE: Conflict-Oriented Reasoning for General Multimodal Manipulation Detection
The rapid rise of generative AI has made multimodal fake news increasingly realistic and pervasive, posing severe threats to public trust and social stability. Existing detection m…
CP-Agent: Context-Aware Multimodal Reasoning for Cellular Morphological Profiling under Chemical Perturbations
Cell Painting combines multiplexed fluorescent staining, high-content imaging, and quantitative analysis to generate high-dimensional phenotypic readouts to support diverse downstr…
Tempus AI presents multimodal foundation model results at ASCO
India needs 216 multimodal logistics parks by 2047 for smooth freight movement: CII
A Union government official had told HT that the government had now downscaled its approach to build only nine of them, primarily due to land acquisition constraints.
Step 3.7 Flash – Open-source multimodal model for speed and agents
Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions
Multimodal large language model (MLLM)-based embodied agents have shown strong potential for solving complex tasks in physical environments. However, personalized assistance requir…
Advancing Creative Physical Intelligence in Large Multimodal Models
Large multimodal models (LMMs) have rapidly advanced in perception and reasoning; however, it remains unclear whether these capabilities generalize to discovering visually grounded…
PolyFusionAgent: A Multimodal Foundation Model and Autonomous AI Assistant for Polymer Property Prediction and Inverse Design
Polymer discovery is central to fields ranging from energy storage to biomedicine, but it is hindered by an astronomically large chemical design space and fragmented representation…
LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?
Advanced Large Multimodal Models (LMMs) have demonstrated impressive performance in K-12 reasoning tasks, exhibiting great promise as intelligent tutors. Realizing this potential r…
Quick Tip: Benchmarking Multimodal APIs in Under 10 Minutes
Look, I’m a backend engineer. I don’t have time to read through 40 pages of model cards before...…
ConceptM$^3$oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology
Healthcare models are transitioning from unimodal prediction toward multimodal reasoning over heterogeneous diagnostic inputs. In computational pathology, for complex tumor subtype…
PANDO: Efficient Multimodal AI Agents via Online Skill Distillation
Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist mode…
Show HN: Gemini Omni – A curated list of native multimodal guides and showcases
A curated list of awesome Google Gemini Omni prompt guides, interactive platforms, and creative showcases. - cnemri/awesome-gemini-omni…
Beyond Binary Edits: Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment
Multimodal large language models (MLLMs) need efficient mechanisms to update knowledge without degrading existing capabilities. While intrinsic multimodal knowledge editing achieve…
LFRAG: Layout-oriented Fine-grained Retrieval-Augmented Generation on Multimodal Document Understanding
Multimodal Retrieval-Augmented Generation (RAG) has emerged as an effective paradigm for enhancing Large Language Models (LLMs) with external knowledge. However, existing multimoda…
RAG4Outcome: A Retrieval-Augmented Multimodal Framework for Prognostic Prediction in Chronic Osteomyelitis
Chronic osteomyelitis presents substantial prognostic challenges due to its high recurrence risk and complex postoperative recovery trajectories. Traditional assessment often relie…
Gemma 4: The 128K Multimodal Powerhouse in Your Terminal
A raw, developer-first look at Google’s new open-weight Gemma 4 family—featuring a hands-on local...…
Google unveils Gemini Omni, a multimodal AI model that generates video from text, images, and audio
Google DeepMind unveiled Gemini Omni at Google I/O, a multimodal AI model family for video generation with implications for decentralized compute and Web3 media.…
When AI Reads Blueprints: The Hidden Attack Surface of Multimodal Engineering Intelligence
A security analysis of steganographic prompt injection and data poisoning...…
Gemma 4 is Here: The Dawn of Local Multimodal Reasoning
This is a submission for the Gemma 4 Challenge: Write About Gemma 4. Gemma 4 is Here: The...…
The Edge AI Revolution: Why Gemma 4 E4B is a Game-Changer for Offline Multimodality
This is a submission for the Gemma 4 Challenge: Write About Gemma 4. The Cloud is Great, But...…
Replicating a Language-Learning Comedy Short with Claude Code — Gemini as a Multimodal Sub-Agent
Building a local GPU + Gemini 3.1 Pro hybrid pipeline that generates publishable comedy Shorts from a single line of text in under 60 seconds.…
Evaluating multimodal emotion recognition in proactive conversational agents: A user study
This article presents a multimodal emotion recognition module integrated into a proactive Socially Interactive Agent (SIA) powered by generative artificial intelligence. The system…
Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding
Real-world time series come with text: metadata, descriptions, news, reports. Yet time series foundation models process numerical sequences in isolation, and the multimodal text-an…
JUDO: A Juxtaposed Domain-Oriented Multimodal Reasoner for Industrial Anomaly QA
Industrial anomaly detection has been significantly advanced by Large Multimodal Models (LMMs), enabling diverse human instructions beyond detection, particularly through visually …
Latent Space Guided Scenario Sampling for Multimodal Segmentation Under Missing Modalities
Multimodal semantic segmentation benefits remote sensing analysis by combining complementary information from different sensor modalities. In real-world remote sensing applications…
SAVER: Selective As-Needed Vision Evidence for Multimodal Information Extraction
Multimodal IE in social media is difficult because a post may attach multiple images that are weakly related, redundant, or even misleading with respect to the text. In this settin…
Hallucination as Exploit: Evidence-Carrying Multimodal Agents
Multimodal agents use screenshots, documents, and webpages to choose tool calls. When a false visual claim triggers a click, email, extraction, or transfer, hallucination becomes a…
GroupAffect-4: A Multimodal Dataset of Four-Person Collaborative Interaction
Existing affective-computing, social-signal-processing, and meeting corpora capture important parts of human interaction, but they rarely support analysis of affect in co-located g…
Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking
Checkpoint selection for multimodal large language models (MLLMs) presents significant challenges when performance differentials are marginal and evaluation signals are prone to no…
Google unveils Gemini Omni, its first native multimodal AI model built for enterprises
Google unveiled Gemini Omni at I/O, its first native multimodal AI model for enterprises that processes video, audio, images, and text from a single architecture.…
Deploying a Multistage Multimodal Recommender System on Amazon Elastic Kubernetes Service
A practical walkthrough of building and deploying a multistage, multimodal recommender system on Amazon EKS, covering data pipelines, model training, Bloom filters, feature caching…
Gemini Omni Flash can create and edit videos with your voice and it feels like the future of multimodal AI
Gemini Omni Flash sounds like it’ll be an essential new AI content creation tool…
Google launches the Gemini Omni multimodal model, saying it can "create anything from any input", starting with video generation, for Google AI subscribers (Carl Franzen/VentureBeat)
Google Introduces Gemini Omni, a Multimodal AI That Knows the World
Starting with video, Omni will eventually be able to create any output from any input.…
TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens
Recent research has demonstrated that Universal Multimodal Embedding (UME) benefits significantly from Chain-of-Thought (CoT) reasoning. In this paradigm, a generative model produc…
Learning to Learn from Multimodal Experience
Experience-driven learning has emerged as a promising paradigm for enabling agents to improve from interaction trajectories by accumulating and reusing past experience. However, ex…
F2IND-IT! -- Multimodal Fuzzy Fake Indian News Detection using Images and Text
Biased manipulation of facts across regional and national media outlets complicates misinformation detection in diverse landscapes like India. This paper introduces a novel multimo…
CatalyticMLLM: A Graph-Text Multimodal Large Language Model for Catalytic Materials
Property prediction and inverse structural design of catalytic materials are typically modeled as two independent tasks: the former predicts target properties from given structures…
Multimodal Cultural Heritage Knowledge Graph Extension with Language and Vision Models
The preservation and interpretation of cultural heritage increasingly rely on digital technologies, among which Knowledge Graphs (KGs) stand out for their ability to structure vast…
EGI: A Multimodal Emotional AI Framework for Enhancing Scrum Master Real-time Self-Awareness
While increasing research focuses on the emotional well-being of agile team members, a significant gap remains in emotion monitoring studies for Scrum Masters and meeting organizer…
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieve…
Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction
Multimodal large language models (MLLMs) often fail to transfer safety capabilities learned in the text modality to semantically equivalent non-text inputs, revealing a persistent …
MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory
Agent4POI: Agentic Context-Conditioned Affordance Reasoning for Multimodal Point-of-Interest Recommendation
We introduce Agent4POI, the first POI recommendation framework that generates context-conditioned multimodal representations at recommendation time, rather than relying on static P…
DeltaPrompts: Escaping the Zero-Delta Trap in Multimodal Distillation
Distillation enables compact Vision-Language Models (VLMs) to obtain strong reasoning capabilities, yet the prompts driving this process are typically chosen via simple heuristics …
ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models
Multimodal large language models (MLLMs) may memorize sensitive cross-modal information during pretraining, making machine unlearning (MU) crucial. Existing methods typically evalu…
Single-mode fiber (SMF) vs. multimode fiber (MMF): advice.
Gemma.Witness - Offline Multimodal Evidence Capture with Gemma 4
An offline-first multimodal evidence capture system built on Gemma 4, designed for environments where cloud access and chain-of-custody assumptions fail.…
Gemma 4: From Raspberry Pi to Research Workstation — One Architecture, No Quality Compromise
This is a submission for the Gemma 4 Challenge: Write About Gemma 4. TL;DR: Gemma 4 is four...…
Nvidia launches Nemotron 3 Nano Omni, an open multimodal model with a 30B-A3B hybrid MoE architecture; the Nemotron 3 family saw 50M+ downloads in the past year (Kyt Dotson/SiliconANGLE)
Xiaomi open-sources MiMo-V2.5: 311B A15B 1M-context omnimodal model