Semantic IDs for finding vulnerable code at scale

May 18, 2026 · 4:41 AM UTC · 11 min read

#security #vulnerabilities #machine learning


TL;DR · WeSearch summary

SecSid is a tool for finding clones of vulnerable code across projects. It trains a Residual-Quantized Variational Autoencoder (RQ-VAE) on top of a code embedder to assign each C/C++ function a Semantic ID, a short tuple of discrete codes. On a 5000-function CVE registry, SecSid found 112 cross-project clones; VUDDY, the method it is compared against, found 1.

Key facts

▪ SecSid uses a 3-level RQ-VAE on top of a code embedder to produce Semantic IDs for vulnerable functions.
▪ The tool was trained on a dataset of 5000 C/C++ functions with known vulnerabilities.
▪ SecSid's approach allows for efficient lookup of functions sharing the same vulnerability characteristics.
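
The lookup described in the last point can be pictured as a plain hash index keyed by Semantic ID. The sketch below is not SecSid's implementation: the class name `SemanticIdIndex`, the function names, and the ID values are all hypothetical. It only assumes that each function has already been assigned a 3-level ID.

```python
from collections import defaultdict


class SemanticIdIndex:
    """Hypothetical index from Semantic ID prefixes to function names."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def add(self, semantic_id, function_name):
        # Register the function under every prefix of its ID, so both
        # exact matches and coarser matches are single dictionary lookups.
        sid = tuple(semantic_id)
        for depth in range(1, len(sid) + 1):
            self._buckets[sid[:depth]].append(function_name)

    def lookup(self, semantic_id, depth=None):
        # depth=None means an exact match on the full ID;
        # depth=k matches on the first k levels only.
        sid = tuple(semantic_id)
        prefix = sid if depth is None else sid[:depth]
        return list(self._buckets.get(prefix, []))


# Made-up IDs and names, for illustration only.
index = SemanticIdIndex()
index.add((12, 7, 3), "projA/parse_header")
index.add((12, 7, 3), "projB/read_hdr")
index.add((12, 7, 9), "projC/decode")
index.add((4, 1, 0), "projD/alloc_buf")

exact = index.lookup((12, 7, 3))            # same full ID
coarse = index.lookup((12, 7, 5), depth=2)  # same first two levels
```

Matching on a shorter prefix widens the candidate set, which is one plausible way to trade precision for recall when searching for clones.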

Original article

Shrikar Archak

Opening excerpt (first ~120 words)

Essay · May 16, 2026 · 11 min read

Semantic IDs for vulnerable code: finding 100× more cross-project clones than VUDDY

Learned RQ-VAE Semantic IDs for C/C++ vulnerability clones. Borrowing the TIGER substrate from recsys: on a 5000-function CVE registry SecSid finds 112 cross-project clones; VUDDY finds 1.

AI Native Development Security Evals

Semantic IDs are the interesting recsys idea I wanted to try out for security. In 2023 a paper called TIGER (Rajput et al.) rewired recommendation systems away from “every item gets a learned high-dim embedding” and toward “every item gets a short tuple of discrete codes.” Train an encoder over your items, train a Residual-Quantized VAE on top, and the output is a [c1, c2, c3] per item, where c1 captures broad signal and later levels refine.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at Shrikar Archak.
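
The quantization step the excerpt describes, where c1 captures broad signal and later levels refine, can be sketched as follows. This is a toy illustration of residual quantization only, not the article's code: the codebooks are small hand-picked 2-D values, whereas a real RQ-VAE learns them jointly with its encoder and decoder, and the names `nearest`, `residual_quantize`, and `CODEBOOKS` are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    def sq_dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(embedding, codebooks):
    """Return ([c1, c2, ...], final_residual) for one embedding."""
    residual = list(embedding)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry; the next level quantizes what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


# Toy codebooks (assumed values): coarse, medium, and fine levels.
CODEBOOKS = [
    [[0.0, 0.0], [10.0, 10.0]],
    [[-2.0, -2.0], [2.0, 2.0]],
    [[-0.5, 0.0], [0.5, 0.0]],
]

codes_a, residual_a = residual_quantize([11.5, 12.0], CODEBOOKS)
codes_b, residual_b = residual_quantize([1.0, 1.5], CODEBOOKS)
print(codes_a, codes_b)  # [1, 1, 0] [0, 1, 0]
```

The two toy embeddings differ at the first level, which reflects the coarse-to-fine reading of the tuple: items far apart in embedding space diverge early, while near neighbours share a prefix and differ only in later codes.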
