I Built a Multilingual Spam Detection Dataset with 149K+ Messages Across 23 Languages
A new multilingual spam detection dataset called SpamShield Datasets has been created, containing over 149,000 messages across 23 languages. The dataset aims to address the shortcomings of existing spam datasets, which often focus solely on English and lack real-world spam patterns. It includes features for both binary spam detection and category-level classification, making it suitable for various NLP applications.
- SpamShield Datasets includes 149,359 messages across 23 languages.
- The dataset supports both binary spam detection and category-level classification.
- About 20% of the dataset is synthetically augmented to enhance robustness against real-world spam.
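To make the difference between category-level and binary labels concrete, here is a minimal Python sketch. The field names (`text`, `language`, `category`), the `"ham"` value for legitimate messages, and the toy rows are all assumptions made for illustration; they are not the dataset's documented schema or contents.

```python
from collections import Counter
from dataclasses import dataclass


# Hypothetical record layout: field names and category values are
# assumptions for illustration, not the dataset's documented columns.
@dataclass(frozen=True)
class Message:
    text: str
    language: str  # assumed to be a short language code such as "en"
    category: str  # assumed: "ham" for legitimate messages, else a spam category


def to_binary_label(category: str) -> int:
    """Collapse a category-level label into binary spam (1) / not spam (0)."""
    return 0 if category == "ham" else 1


def spam_rate_by_language(messages):
    """Return the fraction of messages labeled spam for each language."""
    totals = Counter()
    spam = Counter()
    for m in messages:
        totals[m.language] += 1
        spam[m.language] += to_binary_label(m.category)
    return {lang: spam[lang] / totals[lang] for lang in totals}


# Made-up toy rows, not taken from the dataset.
SAMPLE = [
    Message("See you at 6 for dinner?", "en", "ham"),
    Message("Your account is locked, verify at once", "en", "phishing"),
    Message("Gana dinero rapido desde casa", "es", "promotion"),
]

if __name__ == "__main__":
    print(spam_rate_by_language(SAMPLE))
```

Keeping the category label as the stored value and deriving the binary label from it means one set of annotations can serve both tasks.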
Opening excerpt (first ~120 words)
Arjun M
Posted on May 25

Spam detection datasets are surprisingly bad once you move outside English. Most public datasets are tiny, outdated, English-only, SMS-only, or missing real-world spam patterns. Meanwhile, actual spam today is multilingual, code-mixed, obfuscated, and platform-adaptive.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to.