
Semantic IDs for finding vulnerable code at scale

Tags: security, vulnerabilities, machine learning
⚡ TL;DR · AI summary

SecSid is a tool for finding vulnerable code that has been cloned across projects. It trains a Residual-Quantized Variational Autoencoder (RQ-VAE) to assign each C/C++ function a Semantic ID, a short tuple of discrete codes. On a 5000-function CVE registry, SecSid found 112 cross-project clones; VUDDY, the baseline it is compared against, found 1.
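The summary does not say how SecSid turns Semantic IDs into clone matches. As a rough sketch of one plausible use, assuming each function already has a Semantic ID in the form of a short tuple of discrete codes, candidates can be bucketed by a shared ID prefix and kept only when a bucket spans more than one project. The function `cross_project_candidates`, the project names, the function names, and the IDs below are all hypothetical, made up for illustration.

```python
from collections import defaultdict

def cross_project_candidates(functions, prefix_len=3):
    """Group functions by Semantic ID prefix; keep groups spanning 2+ projects.

    `functions` is an iterable of (project, function_name, semantic_id).
    This is an assumed matching rule, not SecSid's documented method.
    """
    buckets = defaultdict(list)
    for project, name, sid in functions:
        buckets[tuple(sid[:prefix_len])].append((project, name))

    candidates = {}
    for prefix, members in buckets.items():
        projects = {project for project, _ in members}
        if len(projects) > 1:  # same-project matches are not cross-project clones
            candidates[prefix] = members
    return candidates

# Hypothetical data: projects, names, and IDs are invented for this sketch.
FUNCTIONS = [
    ("proj_a", "parse_header", (3, 7, 1)),
    ("proj_b", "read_hdr", (3, 7, 1)),
    ("proj_a", "copy_buf", (5, 2, 0)),
    ("proj_a", "copy_buf_v2", (5, 2, 0)),
    ("proj_c", "init_ctx", (3, 7, 4)),
]

exact = cross_project_candidates(FUNCTIONS)                 # full 3-code match
coarse = cross_project_candidates(FUNCTIONS, prefix_len=2)  # looser match on first two codes
```

Shortening `prefix_len` trades precision for recall: matching on fewer codes pulls in functions that are only broadly similar, which is the appeal of a coarse-to-fine ID over a single exact fingerprint.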

Original article
Shrikar Archak
Opening excerpt (first ~120 words)

Essay · May 16, 2026 · 11 min read

Semantic IDs for vulnerable code: finding 100× more cross-project clones than VUDDY

Learned RQ-VAE Semantic IDs for C/C++ vulnerability clones. Borrowing the TIGER substrate from recsys: on a 5000-function CVE registry SecSid finds 112 cross-project clones; VUDDY finds 1.

AI Native Development Security Evals

Semantic IDs are the interesting recsys idea I wanted to try out for security. In 2023 a paper called TIGER (Rajput et al.) rewired recommendation systems away from “every item gets a learned high-dim embedding” and toward “every item gets a short tuple of discrete codes.” Train an encoder over your items, train a Residual-Quantized VAE on top, and the output is a [c1, c2, c3] per item, where c1 captures broad signal and later levels refine.
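To make the "c1 captures broad signal and later levels refine" idea concrete, here is a minimal sketch of the residual quantization step alone, assuming the codebooks are already trained. It is not the article's implementation: a real RQ-VAE learns the encoder and codebooks jointly, whereas the two-dimensional `CODEBOOKS` below are hand-picked toy values, and the helper names `nearest_code` and `semantic_id` are hypothetical.

```python
from typing import List, Sequence, Tuple

def nearest_code(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, centroid in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def semantic_id(embedding: Sequence[float],
                codebooks: Sequence[Sequence[Sequence[float]]]) -> Tuple[int, ...]:
    """Quantize an embedding level by level, each level coding the leftover residual."""
    residual: List[float] = list(embedding)
    codes: List[int] = []
    for codebook in codebooks:
        index = nearest_code(residual, codebook)
        codes.append(index)
        # Subtract the chosen centroid; the next level only sees what is left.
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return tuple(codes)

# Toy, hand-picked codebooks (assumption): coarse, medium, fine.
CODEBOOKS = [
    [[0.0, 0.0], [4.0, 4.0]],
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25]],
]

example_id = semantic_id([5.0, 4.25], CODEBOOKS)
print(example_id)  # (1, 1, 2)
```

Here the first code places the vector near `[4.0, 4.0]`, the second codes the remaining `[1.0, 0.25]`, and the third codes the remaining `[0.0, 0.25]`, so each later code refines a smaller residual than the one before it.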

Excerpt limited to ~120 words for fair-use compliance. The full article is at Shrikar Archak.
