VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation
The paper introduces VISAFF, a framework for Emotion Recognition in Conversation (ERC). It aims to identify emotional states in dialogues more accurately by focusing on the active speaker's visual cues. VISAFF is designed to be computationally efficient and avoids expensive fine-tuning of large Vision-Language Models.
- VISAFF stands for Speaker-Centered VISual AFFective feature learning for Emotion Recognition in Conversation.
- The framework consists of two stages: Speaker-Centered Affective Grounding and Reliability-Guided Affective Complementation.
- VISAFF achieves competitive performance while eliminating the need for expensive fine-tuning of large Vision-Language Models.
Opening excerpt (first ~120 words)
arXiv:2605.18547 (cs) [Submitted on 18 May 2026]
Title: VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation
Authors: Linan ZHU, Zihao Zhai, Xiao Han, Yuqian Fu, Xiangfan Chen, Xiangjie Kong, Guojiang Shen
Abstract: Emotion Recognition in Conversation (ERC) is essential for effective human-machine interaction, aiming to identify speakers' emotional states in multi-turn dialogues. Early text-based methods struggle with complex scenarios like sarcasm because they inherently neglect vital non-verbal information.
…
Excerpt limited to ~120 words for fair-use compliance. The full paper is available on arXiv (cs.AI).