5 results for "transformer architecture"
Learning to Rotate: Temporal and Semantic Rotary Encoding for Sequential Modeling
Every Transformer architecture dedicates enormous capacity to learning rich representations in semantic embedding space -- yet the rotation manifold acted upon by Rotary Positional Embeddings (RoPE) h…
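For context on what "the rotation manifold acted upon by RoPE" refers to, here is a minimal sketch of the standard, non-learned RoPE rotation (the paper above proposes learning this part; that variant is not shown here). Names and the interleaved-pair convention are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply standard RoPE to a single query/key vector x of even dimension d at position pos.

    Each consecutive pair of dimensions is rotated by a position-dependent angle
    pos * theta_i, with theta_i = base^(-2i/d).
    """
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE pairs dimensions, so d must be even"
    half = d // 2
    theta = base ** (-2.0 * np.arange(half) / d)   # per-pair frequency
    angles = pos * theta                           # rotation angle for each 2D pair
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                      # interleaved pairing convention
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                # standard 2D rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

# Example: rotate an 8-dim query vector as if it sits at sequence position 5.
q = np.random.randn(8)
q_rot = rope_rotate(q, pos=5)
```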
The Randomness Floor: Measuring Intrinsic Non-Randomness in Language Model Token Distributions
Language models cannot be random. This paper introduces Entropic Deviation (ED), the normalised KL divergence between a model's token distribution and the uniform distribution, and measures it systema…
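A minimal sketch of the Entropic Deviation idea as stated in the snippet: the KL divergence between a model's next-token distribution and the uniform distribution over the vocabulary. The normalisation by log(V) (so the score lies in [0, 1]) is an assumption here; the paper's exact normalisation is not visible in the truncated abstract.

```python
import numpy as np

def entropic_deviation(p: np.ndarray, eps: float = 1e-12) -> float:
    """Normalised KL(p || uniform) for a probability vector p over V tokens."""
    v = p.shape[-1]
    p = np.clip(p, eps, 1.0)
    p = p / p.sum()                                   # re-normalise after clipping
    kl = np.sum(p * (np.log(p) - np.log(1.0 / v)))    # KL(p || U) = log V - H(p)
    return float(kl / np.log(v))                      # assumed normalisation to [0, 1]

# Example: a uniform distribution scores ~0, a one-hot (fully deterministic) one ~1.
vocab = 50257
uniform = np.full(vocab, 1.0 / vocab)
peaked = np.zeros(vocab); peaked[0] = 1.0
print(entropic_deviation(uniform))   # ~0.0
print(entropic_deviation(peaked))    # ~1.0 (up to the clipping epsilon)
```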
MAE-Based Self-Supervised Pretraining for Data-Efficient Medical Image Segmentation Using nnFormer
Transformer architectures, including nnFormer, have demonstrated promising results in volumetric medical image segmentation by capturing long-range spatial interactions. Although they have …
Modeling Induced Pleasure through Cognitive Appraisal Prediction via Multimodal Fusion
Multimodal affective computing analyzes user-generated social media content to predict emotional states. However, a critical gap remains in understanding how visual content shapes cognitive interpreta…
Going from 3B/7B dense to Nemotron 3 Nano (hybrid Mamba-MoE) for multi-task reasoning — what changes in the fine-tuning playbook? [D]
Following up on something I posted a few days back about fine-tuning for multi-task reasoning. Read a lot since then, and I've moved past the dense 3B vs 7B question — landing on Nemotron 3 Nano (the …