Rethinking Cross-Layer Information Routing in Diffusion Transformers
The paper proposes Diffusion-Adaptive Routing (DAR), a change to how Diffusion Transformers (DiTs) pass information between layers. DAR targets problems caused by the standard residual addition in DiTs and is intended to improve information flow across layers. The authors report that DAR improves performance while reducing training time.
- Diffusion Transformers have become essential in visual generation, but their residual stream design has not been significantly altered.
- The paper identifies three issues with traditional residual addition: monotonic forward magnitude inflation, sharp backward gradient decay, and block-wise redundancy.
- The proposed DAR method allows for learnable, timestep-adaptive aggregation of sublayer outputs, improving training efficiency and model performance.
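To make the contrast in the bullets above concrete, here is a minimal NumPy sketch of plain residual addition next to one possible form of learnable, timestep-adaptive aggregation. It is not the paper's formulation: the softmax weighting, the linear map from timestep embedding to weights, and the names `adaptive_aggregate`, `t_emb`, `W`, and `b` are all assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def residual_sum(outputs):
    # Standard residual stream: every earlier sublayer output is added
    # with a fixed weight of 1, so the sum can grow with depth.
    # outputs has shape (n_sublayers, tokens, dim).
    return np.sum(outputs, axis=0)


def adaptive_aggregate(outputs, t_emb, W, b):
    # Hypothetical timestep-adaptive aggregation (an assumption, not the
    # paper's equations): a learnable linear map turns the timestep
    # embedding into one logit per sublayer output, and a softmax turns
    # the logits into mixing weights.
    logits = t_emb @ W + b              # shape (n_sublayers,)
    weights = softmax(logits)           # non-negative, sums to 1
    mixed = np.tensordot(weights, outputs, axes=1)  # shape (tokens, dim)
    return mixed, weights


# Toy usage with random data: 4 sublayer outputs, 3 tokens, width 8.
rng = np.random.default_rng(0)
outputs = rng.normal(size=(4, 3, 8))
t_emb = rng.normal(size=(5,))           # hypothetical timestep embedding
W = rng.normal(size=(5, 4))             # learnable in a real model
b = np.zeros(4)

mixed, weights = adaptive_aggregate(outputs, t_emb, W, b)
```

Because the weights in this sketch depend on `t_emb`, different diffusion timesteps can emphasize different sublayers, whereas `residual_sum` treats every sublayer identically at every timestep. With `W` and `b` set to zero the weights become uniform and the result is simply the mean of the sublayer outputs.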
Opening excerpt (first ~120 words)
arXiv:2605.20708 (cs), Computer Science > Computer Vision and Pattern Recognition. Submitted on 20 May 2026.
Title: Rethinking Cross-Layer Information Routing in Diffusion Transformers
Authors: Chao Xu, Maohua Li, Qirui Li, Yixuan Xu, Yanke Zhou, Yunhe Li, Cuifeng Shen, Hanlin Tang, Kan Liu, Tao Lan, Lin Qu, Shao-Qun Zhang
Abstract: Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design -- tokenization, attention, conditioning, objectives, and latent autoencoders -- has been extensively revisited.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.