Softmax in front of CrossEntropyLoss: 16 other bugs PyTorch won't catch
PyTorch does not catch certain architectural bugs at model-design time, so the resulting problems surface only during or after training. Neurarch includes a 17-rule design-time linter that detects common structural failure modes in neural networks before training begins. These include incorrect layer ordering, missing components, and inefficient configurations that degrade performance or stability.
- The nn.CrossEntropyLoss function in PyTorch applies log-softmax internally, so adding an explicit Softmax layer causes double application and harms training stability.
- The linter checks for issues such as incorrect normalization order, missing residual connections, and absence of positional encoding in attention layers.
- Rules also flag performance problems like excessive dropout rates and large activation tensors that increase memory usage unnecessarily.
- Some bugs, like placing Dropout before BatchNorm, cancel intended regularization effects and are only detectable through static analysis of the model graph.
- The tool operates on the model's architecture graph before any forward pass, aiming to prevent wasted computation and debugging time.
- Transformer-specific rules catch errors such as incorrect GQA head divisibility and missing auxiliary losses in MoE layers.
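To make the idea of a design-time check concrete, here is a minimal sketch of three of the rules named above, run over a plain description of a sequential model. It is not Neurarch's API: the `Layer` class, the `lint` function, the `"GQA"` layer kind, and the warning strings are all hypothetical names chosen for illustration. The GQA rule assumes the usual constraint that the number of query heads must be a multiple of the number of key/value heads.

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    """Hypothetical node in a design-time architecture description."""
    kind: str                                   # e.g. "Linear", "Softmax", "Dropout"
    params: dict = field(default_factory=dict)  # layer hyperparameters, if any


def lint(layers, loss):
    """Return warning names for a sequential stack of layers and a loss name.

    Nothing is executed here: the checks only look at layer kinds,
    their order, and their hyperparameters.
    """
    warnings = []

    # Rule: CrossEntropyLoss expects raw logits, so a trailing Softmax
    # means softmax is applied twice.
    if loss == "CrossEntropyLoss" and layers and layers[-1].kind == "Softmax":
        warnings.append("softmax-before-cross-entropy")

    # Rule: Dropout feeding directly into a BatchNorm layer.
    for prev, cur in zip(layers, layers[1:]):
        if prev.kind == "Dropout" and cur.kind.startswith("BatchNorm"):
            warnings.append("dropout-before-batchnorm")

    # Rule: grouped-query attention needs num_heads divisible by num_kv_heads.
    for layer in layers:
        if layer.kind == "GQA":
            if layer.params["num_heads"] % layer.params["num_kv_heads"] != 0:
                warnings.append("gqa-head-divisibility")

    return warnings


model = [
    Layer("Linear"),
    Layer("Dropout", {"p": 0.1}),
    Layer("BatchNorm1d"),
    Layer("GQA", {"num_heads": 8, "num_kv_heads": 3}),
    Layer("Linear"),
    Layer("Softmax"),
]
for warning in lint(model, "CrossEntropyLoss"):
    print(warning)
```

Because the rules read only the structure of the model, they can run before any tensor is allocated, which is the property the summary attributes to the linter.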
Opening excerpt (first ~120 words)
You can put a Softmax in front of CrossEntropyLoss. PyTorch won’t stop you. Here are 16 other architecture bugs it won’t catch.

A walkthrough of the 17-rule design-time linter inside Neurarch: what each rule catches, why it matters, and where static analysis stops being useful for neural networks.

Xin Gao, May 17, 2026

The bug that started this

You can put a Softmax in front of CrossEntropyLoss in PyTorch. The model trains. The loss curve looks fine. You ship it. Accuracy is bad, and you spend the next day finding out why.

The bug is that nn.CrossEntropyLoss applies log-softmax internally, so the explicit Softmax causes double-application and degrades training stability. The bug is visible from the architecture diagram in two seconds.
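The effect of the double application can be checked by hand. The sketch below is an illustration written for this summary, not code from the article. It reimplements the single-sample computation in plain Python (softmax, then negative log of the target probability) instead of calling PyTorch, and the three-class logits are made up. It compares the loss on raw logits with the loss when a Softmax has already been applied.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(inputs, target):
    """Single-sample cross-entropy that, like nn.CrossEntropyLoss,
    applies (log-)softmax to its inputs internally."""
    return -math.log(softmax(inputs)[target])


# Made-up example: a confident, correct prediction for class 0.
logits = [10.0, 0.0, 0.0]

correct = cross_entropy(logits, 0)           # loss fed with raw logits
doubled = cross_entropy(softmax(logits), 0)  # loss fed with probabilities

# With probabilities as input, every value lies in [0, 1], so the second
# softmax can never get sharper than softmax([1, 0, 0]).
floor = math.log(1 + 2 / math.e)

print(f"raw logits:     {correct:.6f}")
print(f"double softmax: {doubled:.6f}")
print(f"3-class floor:  {floor:.6f}")
```

With raw logits the loss on this sample is close to zero. With the extra Softmax it cannot drop below about 0.55 for three classes, however confident the model becomes. The inputs to the loss are squeezed into [0, 1], which also shrinks the gradients, so the model still trains while learning less than it should.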
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Hacker News (Newest).