50 stories tagged with #dataset, in publish-time order across the WeSearch catalog. Tag pages update as new stories are ingested.
Running Python code in a sandbox with MicroPython and WASM
I've been experimenting with different approaches to running code in a sandbox for several years now, but my latest attempt feels like it might finally have all of the characterist…
HETDEX opens massive Cosmic Noon dataset to scientists, novices and AI
"what if you don't have the dataset?"
Discovering the LLM's curious and remarkable world knowledge of open data on the web.…
Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection
In nature, events that affect some individuals or groups but not others constitute an implicit intervention and are known as natural experiments. For example, the COVID-19 pandemic…
The DeepSpeak-Agentic Dataset
We present DeepSpeak-Agentic, a dataset of videos comprising over 37 hours of semi-structured conversations between a human and an embodied AI agent. We use this dataset to evaluat…
May 2026 newsletter
I just sent out the May edition of my sponsors-only monthly newsletter. If you are a sponsor (or if you start a sponsorship now) you can access it here. This month: AI got expens…
datasette 1.0a32
Release: datasette 1.0a32. A minor bugfix release. Fixes a bug with INSERT ... RETURNING queries via the new /db/-/execute-write endpoint and a bunch of base_url issues which showed…
Running Python ASGI apps in the browser via Pyodide + a service worker
By running Python ASGI web applications entirely in the browser using Pyodide and a dedicated service worker, this project intercepts all same-origin requests under `/app/` and exe…
Before we spend months processing open-source robotics datasets, tell us why this is a bad idea [D]
Just out of curiosity: How are consistency Loras trained, what dataset?
Does anyone have a copy of the ICDAR2013 Chinese Handwriting Competition Dataset? [R]
Hugging Face Dataset Lineage Explorer
noisekit - CLI for generating realistic degraded speech datasets for ASR benchmarking [P]
A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks
Large Language Models (LLMs) have brought huge improvements to Artificial Intelligence (AI), which can be applied to general-purpose tasks. However, their application to textual or…
Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering
Retrieval-Augmented Generation (RAG) systems for question answering typically retrieve evidence by semantic similarity between the query and document chunks. While effective for un…
Silicon Valley VC Backs Startup That Gathers AI Datasets From Head-Mounted Cameras on Workers in India
Human Archive believes its technology "will become foundational infrastructure for automating manual labor."…
How are people doing prompt optimization with datasets safely?
I Built a Multilingual Spam Detection Dataset with 149K+ Messages Across 23 Languages
Spam detection datasets are surprisingly bad once you move outside English. Most public datasets…
WhatsApp users on alert after hacker drops massive dataset
Auditing Model Bias with Balanced Datasets with Mimesis
Learn how to use Mimesis library to generate a balanced, counterfactual dataset that helps analyze potential bias in your models.…
datasette 1.0a30
An open source multi-tool for exploring and publishing data…
Testing a Cold War-Era AI on Satellite Image Datasets
Show HN: CRED-1 – Open domain credibility dataset for on-device pre-bunking
CRED-1: An Open Multi-Signal Domain Credibility Dataset (2,672 domains) - aloth/cred-1…
Crucible - local open source application for dataset handling
Data Fundamentals Primer for Learning LLM
The minimum data plumbing every ML pipeline needs — samples, features and labels, the train/val/test split, text encoding (ASCII and UTF-8), and preprocessing.…
IMG Dataset Refiner v4.3 Pro is here! 🚀 The ultimate dataset prep tool for LoRAs
Help with dataset for lora training
Low-level coding dataset
Less Data, Faster Training: repeating smaller datasets speeds up learning via sampling biases
This work investigates the "small-vs-large gap", where repeating on fewer samples can lead to compute saving during training compared to using a larger dataset. This is observed …
ShadeBench: A Benchmark Dataset for Building Shade Simulation in Sustainable Society
Urban heat exposure is becoming an increasingly critical challenge due to the intensifying urban heat island effect. Fine-grained shade patterns, especially those induced by urban …
TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design
Text-to-image models produce graphic design at production scale, but their supervision comes from photo-style preference data with a single overall verdict per comparison. Designer…
Datasette Agent
We just announced the first release of Datasette Agent, a new extensible AI assistant for Datasette. I’ve been working on my LLM Python library for just over three years now, …
using .npy dataset with 3D models [R]
I benchmarked my AI agent runtime firewall against 3 public academic datasets — here are the honest results including where it fails
AI tool fuses five satellite datasets to help track harmful algal blooms
Testing NGB Platform Beyond a Small Demo Dataset with k6 and TypeScript
How NGB Platform v1.1.1 adds a reusable performance testing framework for validating real business workloads, not just isolated endpoints.…
GroupAffect-4: A Multimodal Dataset of Four-Person Collaborative Interaction
Existing affective-computing, social-signal-processing, and meeting corpora capture important parts of human interaction, but they rarely support analysis of affect in co-located g…
No Dataset? No Problem. How I Curated a Custom AI Dataset From Instagram & Pinterest to Build a Pose Suggester
When you start a new Machine Learning project, you pray there’s a clean, ready-to-use dataset on…
Show HN: Dataset for AI training and fine tuning
Article 10 and Annex IV-ready CC0 training data for EU high-risk AI compliance. IP indemnity included on Compliance tier. Enforcement starts 2 August 2026.…
Built an address-level Calgary civic data explorer by connecting multiple public datasets
Take-Two's CEO says AI's not in the business of making hits, 'datasets by their very nature are backward looking', but that doesn't mean AI can't be 'super helpful'
"Clones don't sell".…
Designing a plotting Dataset for Rust: Balancing Polars support with zero-dependency weight
Slop Bucket Idea – a dataset of AI slop (train AI what not to do)
How are you handling training data when public datasets don't match your use case? [D]
🧞‍♂️ Transform unstructured PDFs Job Offers into a dataset w. gemma4:2b
This is a submission for the Gemma 4 Challenge: Build with Gemma 4 🤔 About the power of…
Generated 1000 liminal/dreamcore images with GPT Image 2 and put them in a dataset - could be useful for training
But the trends in this dataset are loud enough to cut …
[Academic] We need Data Annotators or Someone who Prepares Dataset [R]
What's new in pip 26.1 - lockfiles and dependency cooldowns!
Richard Si describes an excellent set of upgrades to Python's default pip tool for installing dependencies. This version drops support for Python 3.9 - fair enough, since it's been…