WeSearch

I Built a Multilingual Spam Detection Dataset with 149K+ Messages Across 23 Languages

#spam #dataset #nlp #multilingual #technology
⚡ TL;DR · AI summary

SpamShield Datasets is a new multilingual spam detection dataset containing more than 149,000 messages across 23 languages. It aims to address the shortcomings of existing spam datasets, which often cover only English and lack real-world spam patterns. It supports both binary spam detection and category-level classification, making it suitable for a range of NLP applications.
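
To make the two labeling levels concrete, here is a minimal sketch of how one record could carry both a binary label and a category label, and how each training target could be derived from it. The field names (`text`, `language`, `is_spam`, `category`), the label strings, and the sample rows are all hypothetical; the actual SpamShield schema is not shown in this excerpt and may differ.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    # All field names are hypothetical; the real dataset schema may differ.
    text: str
    language: str            # e.g. a language code such as "en" or "es"
    is_spam: bool            # binary label
    category: Optional[str]  # category label, None for non-spam


def binary_targets(messages):
    """Return 1 for spam and 0 for non-spam, in input order."""
    return [1 if m.is_spam else 0 for m in messages]


def category_targets(messages, ham_label="ham", unknown_label="unknown"):
    """Return one category per message.

    Non-spam maps to ham_label; spam without a category maps to unknown_label.
    """
    targets = []
    for m in messages:
        if not m.is_spam:
            targets.append(ham_label)
        else:
            targets.append(m.category or unknown_label)
    return targets


def spam_rate_by_language(messages):
    """Return the fraction of spam messages for each language."""
    totals, spam = Counter(), Counter()
    for m in messages:
        totals[m.language] += 1
        spam[m.language] += int(m.is_spam)
    return {lang: spam[lang] / totals[lang] for lang in totals}


# Made-up rows for illustration only; they are not taken from the dataset.
sample = [
    Message("Win a free phone, click now", "en", True, "phishing"),
    Message("See you at lunch tomorrow", "en", False, None),
    Message("Gana dinero rapido desde casa", "es", True, "scam"),
]
```

Keeping both labels on the same record means a binary detector and a category classifier can be trained from one file, and a per-language spam rate is a quick check for class imbalance before training.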

Original article
DEV.to (Top)
Opening excerpt (first ~120 words)

Arjun M · Posted on May 25

Spam detection datasets are surprisingly bad once you move outside English. Most public datasets are tiny, outdated, English-only, SMS-only, or missing real-world spam patterns. Meanwhile, actual spam today is multilingual, code-mixed, obfuscated, and platform-adaptive.

Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to (Top).
