I Built a Multilingual Spam Detection Dataset with 149K+ Messages Across 23 Languages

May 25, 2026 · 8:13 PM UTC · 4 min read

#spam #dataset #nlp #multilingual #technology

TL;DR · WeSearch summary

SpamShield Datasets is a new multilingual spam detection dataset containing more than 149,000 messages across 23 languages. It aims to address the shortcomings of existing spam datasets, which often cover only English and lack real-world spam patterns. It supports both binary spam detection and category-level classification, so the same data can be used for spam filtering and for finer-grained spam-type classification.

Key facts

▪ SpamShield Datasets includes 149,359 messages across 23 languages.
▪ The dataset supports both binary spam detection and category-level classification.
▪ About 20% of the dataset is synthetically augmented to enhance robustness against real-world spam.
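
To make the "binary plus category-level" point concrete, here is a minimal sketch of how one labeled record could serve both tasks. The summary does not give the actual SpamShield schema, so the field names (`text`, `language`, `category`), the `ham` label, and the category names below are assumptions for illustration only.

```python
from collections import Counter
from dataclasses import dataclass


# Hypothetical record layout: the real SpamShield column names are not
# stated in this summary, so these fields are illustrative assumptions.
@dataclass(frozen=True)
class Message:
    text: str
    language: str  # assumed to be a short language code such as "en"
    category: str  # assumed: "ham" or a spam category (names made up here)


def to_binary(category: str) -> int:
    """Collapse a category label into a binary target: 0 = ham, 1 = any spam."""
    return 0 if category == "ham" else 1


def label_counts(messages):
    """Count labels for both tasks so class balance can be inspected."""
    binary = Counter(to_binary(m.category) for m in messages)
    categories = Counter(m.category for m in messages)
    return binary, categories


# Made-up sample rows, not taken from the dataset.
sample = [
    Message("See you at 6?", "en", "ham"),
    Message("You WON a prize, click here", "en", "phishing"),
    Message("Gana dinero rapido desde casa", "es", "scam"),
]

binary, categories = label_counts(sample)
print(binary)
print(categories)
```

Deriving the binary target from the category label keeps the two tasks consistent by construction. If the real dataset ships separate binary and category columns, that derivation would not be needed.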

Original article

DEV.to (Top)

Opening excerpt (first ~120 words)

Arjun M · Posted on May 25

Spam detection datasets are surprisingly bad once you move outside English. Most public datasets are tiny, outdated, English-only, SMS-only, or missing real-world spam patterns. Meanwhile, actual spam today is multilingual, code-mixed, obfuscated, and platform-adaptive.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to (Top).
