Show HN: GPT-2 inference in pure C#, 0 bytes allocated per token
Overfit is a deep-learning and optimization engine written in pure C# that runs GPT-2 inference with zero allocations on the hot path. It targets predictable CPU performance and has no native binaries, no Python runtime, and no ONNX Runtime dependency. Models can be trained in PyTorch or .NET, and the project describes its CPU inference as competitive with ONNX Runtime.
- Models can be loaded or built in .NET, and CPU inference runs on preallocated buffers with no per-call GC pressure.
- It loads GPT-2 Small (124M parameters) weights from HuggingFace, and KV-cache decode allocates 0 bytes per token.
- ONNX import loads PyTorch-exported models directly, covering 14 operators and branching DAGs such as ResNet skip connections.
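To make the "preallocated buffers" and "KV-cache decode" ideas concrete, here is a minimal single-head sketch in Python. It is not Overfit's code (the project is C# and its internals are not shown in this summary). The names `KVCache` and `decode_step` are hypothetical, and CPython still creates temporary float objects, so this illustrates the buffer-reuse pattern only, not a measured 0-byte result.

```python
import math
from array import array


class KVCache:
    """Single-head attention cache with every buffer allocated once, up front."""

    def __init__(self, max_tokens: int, head_dim: int) -> None:
        self.max_tokens = max_tokens
        self.head_dim = head_dim
        self.length = 0  # number of tokens cached so far
        # Flat row-major buffers: row t holds the key/value of token t.
        self.keys = array("d", [0.0]) * (max_tokens * head_dim)
        self.values = array("d", [0.0]) * (max_tokens * head_dim)
        # Scratch space reused by every decode step.
        self.scores = array("d", [0.0]) * max_tokens
        self.out = array("d", [0.0]) * head_dim

    def decode_step(self, q, k, v):
        """Append (k, v) for the new token, then attend q over the cache.

        Work per call is O(length * head_dim), i.e. linear in the tokens
        seen so far. Returns self.out, which the next call overwrites.
        """
        if self.length >= self.max_tokens:
            raise IndexError("KV cache is full")
        d = self.head_dim
        base = self.length * d
        for i in range(d):
            self.keys[base + i] = k[i]
            self.values[base + i] = v[i]
        self.length += 1
        n = self.length

        # Scaled dot-product scores against every cached key.
        scale = 1.0 / math.sqrt(d)
        max_score = -math.inf
        for t in range(n):
            row = t * d
            s = 0.0
            for i in range(d):
                s += q[i] * self.keys[row + i]
            s *= scale
            self.scores[t] = s
            if s > max_score:
                max_score = s

        # Numerically stable softmax, computed in place.
        total = 0.0
        for t in range(n):
            e = math.exp(self.scores[t] - max_score)
            self.scores[t] = e
            total += e

        # Weighted sum of cached values, written into the reused output buffer.
        for i in range(d):
            self.out[i] = 0.0
        for t in range(n):
            w = self.scores[t] / total
            row = t * d
            for i in range(d):
                self.out[i] += w * self.values[row + i]
        return self.out
```

Each call touches only the rows already in the cache, so earlier keys and values are never recomputed. Because `decode_step` returns the shared `out` buffer, a caller that needs to keep a result must copy it before the next step.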
Opening excerpt (first ~120 words)
Overfit

Pure C# deep-learning and optimization engine. Predictable CPU performance, explicit memory ownership, zero-allocation inference hot paths. No native binaries. No Python runtime. No ONNX Runtime dependency.

What it does

Train in PyTorch or .NET. Load or build a model. Run predictable, allocation-free inference in .NET.

- Zero-allocation CPU inference: preallocated buffers, no per-call GC pressure, competitive with ONNX Runtime.
- GPT-2 inference: load GPT-2 Small (124M params) weights from HuggingFace. KV-cache decode: 0 bytes allocated per token, O(N) scaling. Top-10 logit overlap 10/10 vs PyTorch, maxAbsDiff=0.000107.
- ONNX import: load PyTorch-exported models directly. 14 operators, branching DAGs (ResNet skip connections), output matches PyTorch within 1e-4.
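The excerpt reports two parity figures against PyTorch: top-10 logit overlap and maxAbsDiff. The summary does not say how the project computes them, so the following Python sketch is one plausible reading, with hypothetical helper names (`max_abs_diff`, `top_k_indices`, `top_k_overlap`).

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two logit vectors."""
    if len(reference) != len(candidate):
        raise ValueError("logit vectors must have the same length")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def top_k_indices(logits, k):
    """Indices of the k largest logits, highest first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def top_k_overlap(reference, candidate, k=10):
    """How many token ids appear in both top-k sets (order is ignored)."""
    ref_top = set(top_k_indices(reference, k))
    cand_top = set(top_k_indices(candidate, k))
    return len(ref_top & cand_top)
```

Under this reading, an overlap of 10/10 means both engines agree on the ten most likely next tokens, while maxAbsDiff bounds how far any single logit drifts between the two implementations.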
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at GitHub.