Semantic IDs for finding vulnerable code at scale
SecSid is a tool for finding vulnerable C/C++ code across projects. It uses a Residual-Quantized Variational Autoencoder (RQ-VAE) to assign a Semantic ID to each function, which makes functions with similar vulnerability characteristics easy to look up. On a 5000-function CVE registry, SecSid found 112 cross-project clones, compared with 1 found by the earlier method, VUDDY.
- SecSid uses a 3-level RQ-VAE on top of a code embedder to produce Semantic IDs for vulnerable functions.
- The tool was trained on a dataset of 5000 C/C++ functions with known vulnerabilities.
- SecSid's approach allows for efficient lookup of functions sharing the same vulnerability characteristics.
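The lookup described in the last bullet can be pictured as a plain dictionary keyed by Semantic ID. The sketch below is only an illustration of that idea, not SecSid's actual index: the records, project names, function names, and ID values are all made up, and the helper names `build_index` and `cross_project_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: (project, function name, Semantic ID).
# These IDs are invented for illustration; they are not SecSid output.
RECORDS = [
    ("libfoo", "parse_header", (12, 3, 7)),
    ("libbar", "read_hdr", (12, 3, 7)),
    ("libbar", "copy_block", (12, 3, 1)),
    ("libbaz", "init_ctx", (40, 9, 2)),
    ("libfoo", "parse_header_v2", (12, 3, 7)),
]


def build_index(records):
    """Group (project, function) pairs by their full Semantic ID."""
    index = defaultdict(list)
    for project, name, sid in records:
        index[sid].append((project, name))
    return index


def cross_project_pairs(index):
    """Return pairs that share a Semantic ID but come from different projects."""
    pairs = []
    for sid, members in index.items():
        for a, b in combinations(members, 2):
            if a[0] != b[0]:  # keep only pairs that span two projects
                pairs.append((sid, a, b))
    return pairs


index = build_index(RECORDS)
pairs = cross_project_pairs(index)
for sid, a, b in pairs:
    print(sid, a, b)
```

Because the ID is a short tuple of integers, finding every function that shares it is a single dictionary access, with no nearest-neighbour search over dense embeddings. Matching on a prefix such as `(12, 3)` would be one way to loosen the match, though whether SecSid does this is not stated in the summary.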
Opening excerpt (first ~120 words)
Essay May 16, 2026 11 min read
Semantic IDs for vulnerable code: finding 100× more cross-project clones than VUDDY
Learned RQ-VAE Semantic IDs for C/C++ vulnerability clones. Borrowing the TIGER substrate from recsys: on a 5000-function CVE registry SecSid finds 112 cross-project clones; VUDDY finds 1.
AI Native Development Security Evals
Semantic IDs are the interesting recsys idea I wanted to try out for security. In 2023 a paper called TIGER (Rajput et al.) rewired recommendation systems away from “every item gets a learned high-dim embedding” and toward “every item gets a short tuple of discrete codes.” Train an encoder over your items, train a Residual-Quantized VAE on top, and the output is a [c1, c2, c3] per item, where c1 captures broad signal and later levels refine.
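To make the "c1 captures broad signal and later levels refine" idea concrete, here is a minimal sketch of the residual quantization step alone. It assumes small hand-written codebooks and a 2-dimensional input so the arithmetic is easy to follow. In a real RQ-VAE the codebooks are learned together with an encoder and decoder, which this sketch leaves out. The names `nearest`, `residual_quantize`, and `CODEBOOKS` are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def residual_quantize(vec, codebooks):
    """Pick one code per level, each level quantizing what the previous left over."""
    codes = []
    residual = list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry; the next level only sees the remainder.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


# Toy codebooks (assumed, not learned): coarse, medium, and fine.
CODEBOOKS = [
    [[0.0, 0.0], [4.0, 4.0], [-4.0, 4.0]],
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    [[0.0, 0.0], [0.25, 0.25], [-0.25, 0.25], [0.25, -0.25]],
]

codes, residual = residual_quantize([4.9, 4.3], CODEBOOKS)
print(codes)  # [1, 1, 2]
```

The first code places the vector in a coarse region, and each later code corrects a smaller leftover error, which is why two items that agree on c1 but differ on c3 can still be treated as broadly similar.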
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Shrikar Archak.