Show HN: GPT-2 inference in pure C#, 0 bytes allocated per token

May 17, 2026 · 7:17 PM UTC · 13 min read

#deep-learning #gpt-2 #csharp #inference #optimization

⚡ TL;DR · AI summary

Overfit is a deep-learning and optimization engine written in pure C# that runs GPT-2 inference with zero allocations on the hot path. It targets predictable CPU performance and needs no native binaries, no Python runtime, and no ONNX Runtime dependency. Models can be trained in PyTorch or .NET and then loaded for inference that the project describes as competitive with ONNX Runtime.

Key facts

▪ Models can be loaded or built in .NET; CPU inference runs on preallocated buffers, so there is no per-call GC pressure.
▪ It loads GPT-2 Small (124M params) weights from HuggingFace; KV-cache decode allocates 0 bytes per token, and the top-10 logits match PyTorch 10/10 (maxAbsDiff=0.000107).
▪ ONNX import loads PyTorch-exported models directly: 14 operators, branching DAGs (ResNet skip connections), output within 1e-4 of PyTorch.

Original article

GitHub

Opening excerpt (first ~120 words)

Overfit

Pure C# deep-learning and optimization engine. Predictable CPU performance, explicit memory ownership, zero-allocation inference hot paths. No native binaries. No Python runtime. No ONNX Runtime dependency.

What it does

Train in PyTorch or .NET. Load or build a model. Run predictable, allocation-free inference in .NET.

▪ Zero-allocation CPU inference: preallocated buffers, no per-call GC pressure, competitive with ONNX Runtime.
▪ GPT-2 inference: load GPT-2 Small (124M params) weights from HuggingFace. KV-cache decode: 0 bytes allocated per token, O(N) scaling. Top-10 logit overlap 10/10 vs PyTorch, maxAbsDiff=0.000107.
▪ ONNX import: load PyTorch-exported models directly. 14 operators, branching DAGs (ResNet skip connections), output matches PyTorch within 1e-4.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at GitHub.
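
The excerpt's "KV-cache decode: 0 bytes allocated per token, O(N) scaling" can be illustrated with a rough sketch. The code below is not Overfit's API or implementation; `KvCacheSketch`, `Append`, and `Attend` are hypothetical names, and the sketch covers a single attention head only. It assumes that every buffer is allocated once in the constructor, so the per-token path only copies into and reads from existing arrays, and the work per decoded token grows linearly with the number of cached tokens.

```csharp
using System;

// Hypothetical single-head KV cache; names are illustrative and not Overfit's API.
public sealed class KvCacheSketch
{
    private readonly float[] _keys;    // [maxTokens * headDim], row-major
    private readonly float[] _values;  // [maxTokens * headDim], row-major
    private readonly float[] _scores;  // [maxTokens], scratch space for attention weights
    private readonly int _headDim;
    private readonly int _maxTokens;

    public int Count { get; private set; }

    // All buffers are allocated here, once, before any decoding starts.
    public KvCacheSketch(int maxTokens, int headDim)
    {
        _maxTokens = maxTokens;
        _headDim = headDim;
        _keys = new float[maxTokens * headDim];
        _values = new float[maxTokens * headDim];
        _scores = new float[maxTokens];
    }

    // Copies this token's key and value into the next free cache row.
    public void Append(ReadOnlySpan<float> key, ReadOnlySpan<float> value)
    {
        if (Count >= _maxTokens) throw new InvalidOperationException("KV cache is full.");
        key.CopyTo(_keys.AsSpan(Count * _headDim, _headDim));
        value.CopyTo(_values.AsSpan(Count * _headDim, _headDim));
        Count++;
    }

    // Scaled dot-product attention of one query over all cached tokens.
    // Cost is O(Count * headDim); the result is written into caller-owned memory.
    public void Attend(ReadOnlySpan<float> query, Span<float> output)
    {
        if (Count == 0) throw new InvalidOperationException("KV cache is empty.");
        Span<float> scores = _scores.AsSpan(0, Count);
        float scale = 1f / MathF.Sqrt(_headDim);
        float max = float.NegativeInfinity;

        for (int t = 0; t < Count; t++)
        {
            ReadOnlySpan<float> k = _keys.AsSpan(t * _headDim, _headDim);
            float dot = 0f;
            for (int d = 0; d < _headDim; d++) dot += query[d] * k[d];
            scores[t] = dot * scale;
            if (scores[t] > max) max = scores[t];
        }

        // Numerically stable softmax, computed in place in the scratch buffer.
        float sum = 0f;
        for (int t = 0; t < Count; t++)
        {
            scores[t] = MathF.Exp(scores[t] - max);
            sum += scores[t];
        }

        // Weighted sum of the cached values.
        output.Clear();
        for (int t = 0; t < Count; t++)
        {
            float weight = scores[t] / sum;
            ReadOnlySpan<float> v = _values.AsSpan(t * _headDim, _headDim);
            for (int d = 0; d < _headDim; d++) output[d] += weight * v[d];
        }
    }
}
```

One way to check a claim like this in your own code is to read `GC.GetAllocatedBytesForCurrentThread()` before and after a decode loop (after a warm-up call) and compare the two values. Spans over existing arrays are stack-only views, which is why slicing the cache rows above should not add managed allocations.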
