Abroad They Hand Over Datasets, at Home They Just Brag: A Blunt Review of the Nüwa, Musk, and Jobs Skill
The article critiques certain Chinese AI projects for lacking genuine openness, emphasizing that while they present elaborate documentation, they often fail to release actual training data. It contrasts this with Western practices where datasets and code are fully shared to enable replication. The author argues that true open-source contribution requires transparency, not just polished narratives.
- Many Chinese AI projects publish detailed README files but do not release raw training data, undermining true open-source principles.
- Western AI initiatives like EleutherAI and LAION provide full datasets and replication scripts, enabling verifiable research.
- Publicly available datasets for figures like Musk and Jobs exist on platforms like Zenodo and Hugging Face, but some projects ignore them in favor of AI-generated summaries.
- The article compares these projects to presenting 'archaeological reports' without disclosing excavation data, highlighting a lack of methodological transparency.
- Real data curation involves labor-intensive work such as annotation and timestamping, which many current 'open-source' efforts skip entirely.
Opening excerpt (first ~120 words)
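GokuScraper悟空爬虫
Posted on May 2

At the risk of offending people: some projects in China's AI scene are redefining the word "open source." They write a README that reads like an epic, yet do not dare release a single piece of raw data.

This is not a gap in technology; it is a gap in sincerity.

1. "Open source" abroad shows up with its makeup off; our "open source" chants scripture in heavy makeup

AI open-source projects abroad are in the business of "delivering the goods." What does delivering the goods mean?

You say you open-sourced a model? Fine, give me the data. Every line of JSON and every CSV in the training data, all of it gets put out. When EleutherAI released The Pile, 800 GB of raw text, it even wrote the download scripts for you, for fear you might not be able to reproduce it. When LAION released its image-text pair datasets, it published not only the data but also the scripts for filtering out NSFW content. The logic is simple: open-sourcing without handing over the data is like selling a car without the engine. Am I supposed to push the damn thing?

Now look at certain domestic projects, which are in the business of "handing in homework." What does handing in homework mean?

You click in and find the data/ folder empty: no raw corpus, no training data, no annotation files.

Not a single grain of rice, yet the README has already recited the full menu of a Manchu-Han Imperial Feast. …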
Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV Community.