Slop Bucket Idea – a dataset of AI slop (train AI what not to do)

May 18, 2026 · 2:11 AM UTC · 1 min read

#artificial intelligence #data #research #Microsoft #arXiv

via Ycombinator

TL;DR · WeSearch summary

The article discusses the prevalence of low-quality AI-generated content, often referred to as 'AI slop.' It proposes the creation of a public dataset to catalog and explain these issues, potentially aiding in the training of better language models. The author expresses uncertainty about the technical feasibility of this idea.

Key facts

▪ People are reportedly being banned from arXiv for a year for submitting science papers that contain AI slop.
▪ The idea is to create a public dataset, open to public submissions, that catalogs the different types of AI slop and explains why each is slop.
▪ The author is unsure about the technical details of training a language model with this dataset.

Original article

Opening excerpt (first ~120 words)

I just had this idea. You read it all the time: AI slop is so prevalent that people are getting banned for a year for submitting science papers to arXiv with it, there are moans of angst from developers, Microsoft is even doing its own study where AI degrades the quality of simple documents, and then there is the beloved em-dash.

I don't really have the know-how or the time, but it occurred to me: if we created a public data set that anyone could submit to, we could catalog and organize all the AI slop, the different types, with explanations of why it is slop and why not to do it, and then train a large language model with this data set included, to help it correct itself.

I don't really know the technical details of training a large language model. Is this even possible?

Excerpt limited to ~120 words for fair-use compliance. The full article is at Ycombinator.
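
To make the proposal concrete, here is a minimal sketch of what one entry in such a dataset might look like. Nothing here comes from the original post: the `SlopRecord` fields, the category labels, and the sample texts are all hypothetical. The sketch stores each submission as one JSON object per line (JSON Lines), and shows how an entry that includes a rewrite could be turned into a chosen/rejected pair, the shape that preference-based fine-tuning methods generally expect. Whether training on such pairs would actually reduce slop is an open question that this sketch does not answer.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class SlopRecord:
    """One hypothetical submission to the proposed public dataset."""
    text: str                      # the passage flagged as slop
    category: str                  # hypothetical label, e.g. "filler" or "hype"
    explanation: str               # why it is slop and why not to do it
    rewrite: Optional[str] = None  # optional improved version from the submitter


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


def to_preference_pair(record):
    """Turn a record with a rewrite into a chosen/rejected pair.

    Returns None when no rewrite exists, since a pair needs both sides.
    """
    if record.rewrite is None:
        return None
    return {"chosen": record.rewrite, "rejected": record.text}


# Invented sample entries, for illustration only.
records = [
    SlopRecord(
        text="In today's fast-paced world, it is important to note that data matters.",
        category="filler",
        explanation="Opens with a generic phrase and says nothing specific.",
        rewrite="Data quality determines what the model learns.",
    ),
    SlopRecord(
        text="This groundbreaking result changes everything.",
        category="hype",
        explanation="Makes a sweeping claim with no evidence.",
    ),
]

print(to_jsonl(records))
```

Keeping the explanation next to each example matches the post's emphasis on recording why something is slop, and the optional rewrite lets entries without a fix still be cataloged.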
