SimInsert: Seamless Video Object Insertion via Regional Sparse Attention Fusion
The paper presents SimInsert, a training-free approach to video object insertion that improves spatio-temporal coherence and realism. It splits the task into single-frame editing and semantic motion description. The authors report that SimInsert outperforms existing methods on key quality metrics.
- SimInsert efficiently decouples video object insertion into intuitive single-frame editing and semantic motion description.
- The approach leverages image-to-video diffusion models to ensure background invariance and plausible interactions.
- SimInsert outperforms state-of-the-art methods, achieving an 18.8% gain in PSNR and a 44.1% decrease in LPIPS.
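For readers unfamiliar with the PSNR figure quoted above, the following is a minimal sketch of how PSNR is conventionally computed and how a relative percentage change between two scores can be derived. It is not the authors' evaluation code. The function names `psnr` and `relative_change`, the 8-bit peak value of 255, and the toy pixel values are all assumptions made for illustration. The summary does not say how the paper's percentages were computed, and LPIPS is not covered here because it requires a learned network.

```python
import math


def psnr(reference, candidate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)


def relative_change(baseline, value):
    """Signed change of `value` relative to `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


# Toy flattened "frames" (hypothetical pixel values, not data from the paper).
background = [10, 50, 90, 130, 170, 210]
edited_a = [12, 47, 95, 126, 172, 205]  # larger drift from the background
edited_b = [11, 50, 91, 129, 170, 209]  # smaller drift from the background

psnr_a = psnr(background, edited_a)
psnr_b = psnr(background, edited_b)
print(f"PSNR A: {psnr_a:.2f} dB, PSNR B: {psnr_b:.2f} dB")
print(f"Relative PSNR change of B over A: {relative_change(psnr_a, psnr_b):+.1f}%")
```

Higher PSNR means the edited frame is closer to the reference, so a method that preserves the background better scores higher. LPIPS is a distance, so a decrease is the improvement.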
arXiv:2605.23245 (cs), Computer Vision and Pattern Recognition. Submitted on 22 May 2026.
Authors: Xinyu Chen, Yuyi Qian, Jiang Lin, Shenyi Wang, Gao Wang, Zhiqiu Zhang, Jizhi Zhang, Mingjie Wang, Qiang Tang, Qian Wang, Song Wu, Zili Yi
Abstract: Video object insertion requires ensuring spatio-temporal coherence and interactive realism, extending far beyond simple content placement.
…