
Show HN: GPT-2 inference in pure C#, 0 bytes allocated per token

#deep-learning #gpt-2 #csharp #inference #optimization
TL;DR (AI summary)

A new deep-learning engine written in pure C# runs GPT-2 inference with zero bytes allocated per generated token. It targets predictable CPU performance and needs no native binaries, Python runtime, or ONNX Runtime. Models can be trained in PyTorch or .NET and then run in .NET, and the project reports CPU inference performance competitive with ONNX Runtime.

Original article: GitHub
Opening excerpt (first ~120 words)

Overfit
Pure C# deep-learning and optimization engine. Predictable CPU performance, explicit memory ownership, zero-allocation inference hot paths. No native binaries. No Python runtime. No ONNX Runtime dependency.

What it does
Train in PyTorch or .NET. Load or build a model. Run predictable, allocation-free inference in .NET.

Zero-allocation CPU inference: preallocated buffers, no per-call GC pressure, competitive with ONNX Runtime.

GPT-2 inference: load GPT-2 Small (124M params) weights from HuggingFace. KV-cache decode: 0 bytes allocated per token, O(N) scaling. Top-10 logit overlap 10/10 vs PyTorch, maxAbsDiff=0.000107.

ONNX import: load PyTorch-exported models directly. 14 operators, branching DAGs (ResNet skip connections), output matches PyTorch within 1e-4.
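The "0 bytes allocated per token, O(N) scaling" claim for KV-cache decoding can be illustrated with a minimal sketch. It is written in Java as a stand-in for C# (both are garbage-collected, and the preallocated-array idiom carries over directly). It assumes a single attention head, and the names (`KvCacheAttention`, `decodeStep`) are hypothetical; it is not the project's implementation. Every buffer is allocated once in the constructor, so a decode step only writes into existing arrays, and its cost grows linearly with the number N of cached tokens.

```java
import java.util.Arrays;

// Hypothetical sketch: single-head attention decode with a preallocated KV cache.
// All arrays are allocated in the constructor; decodeStep allocates nothing on its normal path.
final class KvCacheAttention {
    private final int dim;        // head dimension (assumption: one head)
    private final int maxSeq;     // maximum context length, fixed up front
    private final float[] keys;   // maxSeq * dim; row t holds the key of token t
    private final float[] values; // maxSeq * dim; row t holds the value of token t
    private final float[] scores; // reused attention-score scratch buffer
    private int length = 0;       // number of tokens currently cached

    KvCacheAttention(int dim, int maxSeq) {
        this.dim = dim;
        this.maxSeq = maxSeq;
        this.keys = new float[maxSeq * dim];
        this.values = new float[maxSeq * dim];
        this.scores = new float[maxSeq];
    }

    public int length() { return length; }

    /** Caches k and v for the new token, then writes the attention output for q into out. */
    public void decodeStep(float[] q, float[] k, float[] v, float[] out) {
        if (length == maxSeq) throw new IllegalStateException("KV cache full");
        System.arraycopy(k, 0, keys, length * dim, dim);
        System.arraycopy(v, 0, values, length * dim, dim);
        length++;

        // One scaled dot product per cached position: O(N) work per generated token.
        float scale = (float) (1.0 / Math.sqrt(dim));
        float max = Float.NEGATIVE_INFINITY;
        for (int t = 0; t < length; t++) {
            float s = 0f;
            int base = t * dim;
            for (int i = 0; i < dim; i++) s += q[i] * keys[base + i];
            s *= scale;
            scores[t] = s;
            if (s > max) max = s;
        }
        // Numerically stable softmax, computed in place in the scratch buffer.
        float sum = 0f;
        for (int t = 0; t < length; t++) {
            scores[t] = (float) Math.exp(scores[t] - max);
            sum += scores[t];
        }
        // Weighted sum of cached values, written into the caller's buffer.
        Arrays.fill(out, 0, dim, 0f);
        for (int t = 0; t < length; t++) {
            float w = scores[t] / sum;
            int base = t * dim;
            for (int i = 0; i < dim; i++) out[i] += w * values[base + i];
        }
    }
}
```

Because the cache is sized for `maxSeq` up front, memory is paid once per session rather than per token; the trade-off is a hard context limit, which this sketch surfaces as an exception.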

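The accuracy figures in the excerpt (top-10 logit overlap 10/10, maxAbsDiff=0.000107) compare the engine's output logits with PyTorch's for the same input. Below is a minimal Java sketch of two such metrics. The class and method names are hypothetical, and the exact definitions the project uses are an assumption: here, overlap counts how many token ids appear in both top-k sets regardless of order, and maxAbsDiff is the largest elementwise absolute difference between the two logit vectors. This is test-harness code, so it allocates freely.

```java
import java.util.Arrays;

// Hypothetical comparison helpers; the project's own verification code is not shown in the excerpt.
final class LogitCompare {
    /** Indices of the k largest logits, highest first (ties broken by lower index). */
    static int[] topK(float[] logits, int k) {
        Integer[] idx = new Integer[logits.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (x, y) -> {
            int c = Float.compare(logits[y], logits[x]); // descending by logit
            return c != 0 ? c : Integer.compare(x, y);
        });
        int[] top = new int[Math.min(k, idx.length)];
        for (int i = 0; i < top.length; i++) top[i] = idx[i];
        return top;
    }

    /** Number of token ids present in both top-k sets (order ignored). */
    static int topKOverlap(float[] a, float[] b, int k) {
        int[] ta = topK(a, k), tb = topK(b, k);
        int count = 0;
        for (int x : ta) {
            for (int y : tb) {
                if (x == y) { count++; break; }
            }
        }
        return count;
    }

    /** Largest elementwise absolute difference between two logit vectors. */
    static float maxAbsDiff(float[] a, float[] b) {
        if (a.length != b.length) throw new IllegalArgumentException("length mismatch");
        float m = 0f;
        for (int i = 0; i < a.length; i++) m = Math.max(m, Math.abs(a[i] - b[i]));
        return m;
    }
}
```

The two metrics catch different failures: top-k overlap checks that the candidate tokens a sampler would choose from agree, while maxAbsDiff bounds the numerical drift across the whole vocabulary.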
