WeSearch
TAG · #DATASET

Dataset coverage.

All 50 stories in the WeSearch catalog tagged with #dataset, listed in publish-time order with view counts. Tag pages update as new stories are ingested. Subscribe to the per-tag RSS feed to follow this topic in your reader of choice.

RELATED TAGS
#datasette (5), #ai (4), #python (3), #technology (3), #webassembly (3), #programming (2), #software (2), #llm (2), #ml (2), #datasets (2), #richard-si (1), #collaboration (1)
SIMON WILLISON'S WEBLOG

Running Python code in a sandbox with MicroPython and WASM

I've been experimenting with different approaches to running code in a sandbox for several years now, but my latest attempt feels like it might finally have all of the characterist…

50 views ·
#programming#python#webassembly
PHYS.ORG

HETDEX opens massive Cosmic Noon dataset to scientists, novices and AI

13 views ·
CHRIS-PARMER

"what if you don't have the dataset?"

Discovering the LLM's curious and remarkable world knowledge of open data on the web.

7 views ·
#technology#data#analytics
ARXIV CS.AI

Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection

In nature, events that affect some individuals or groups but not others constitute an implicit intervention and are known as natural experiments. For example, the COVID-19 pandemic…

13 views ·
#artificial intelligence#machine learning#causal inference
ARXIV CS.AI

The DeepSpeak-Agentic Dataset

We present DeepSpeak-Agentic, a dataset of videos comprising over 37 hours of semi-structured conversations between a human and an embodied AI agent. We use this dataset to evaluat…

10 views ·
#artificial intelligence#human-agent interaction
SIMON WILLISON'S WEBLOG

May 2026 newsletter

I just sent out the May edition of my sponsors-only monthly newsletter. If you are a sponsor (or if you start a sponsorship now) you can access it here. This month: AI got expens…

18 views ·
#technology#ai#newsletter
SIMON WILLISON'S WEBLOG

datasette 1.0a32

Release: datasette 1.0a32 A minor bugfix release. Fixes a bug with INSERT ... RETURNING queries via the new /db/-/execute-write endpoint and a bunch of base_url issues which showed…

16 views ·
SIMON WILLISON

Running Python ASGI apps in the browser via Pyodide + a service worker

By running Python ASGI web applications entirely in the browser using Pyodide and a dedicated service worker, this project intercepts all same-origin requests under `/app/` and exe…

13 views ·
#python#webassembly#datasette
R/MACHINELEARNING

Before we spend months processing open-source robotics datasets, tell us why this is a bad idea [D]

10 views ·
R/STABLEDIFFUSION

Just out of curiosity: How are consistency Loras trained, what dataset?

8 views ·
R/MACHINELEARNING

Does anyone have a copy of the ICDAR2013 Chinese Handwriting Competition Dataset? [R]

12 views ·
R/LOCALLLAMA

Hugging Face Dataset Lineage Explorer

19 views ·
R/MACHINELEARNING

noisekit - CLI for generating realistic degraded speech datasets for ASR benchmarking [P]

13 views ·
ARXIV CS.AI

A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks

Large Language Models (LLMs) have brought huge improvements to Artificial Intelligence (AI), which can be applied to general-purpose tasks. However, their application to textual or…

15 views ·
#artificial intelligence#healthcare#machine learning
ARXIV CS.AI

Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering

Retrieval-Augmented Generation (RAG) systems for question answering typically retrieve evidence by semantic similarity between the query and document chunks. While effective for un…

13 views ·
#artificial intelligence#question answering#data science
GIZMODO

Silicon Valley VC Backs Startup That Gathers AI Datasets From Head-Mounted Cameras on Workers in India

Human Archive believes its technology "will become foundational infrastructure for automating manual labor."

22 views ·
#artificial-intelligence#automation#technology
R/PROMPTENGINEERING

How are people doing prompt optimization with datasets safely?

17 views ·
DEV.TO (TOP)

I Built a Multilingual Spam Detection Dataset with 149K+ Messages Across 23 Languages

Spam detection datasets are surprisingly bad once you move outside English. Most public datasets…

13 views ·
#spam#nlp
R/CYBERSECURITY

WhatsApp users on alert after hacker drops massive dataset

14 views ·
KDNUGGETS

Auditing Model Bias with Balanced Datasets with Mimesis

Learn how to use the Mimesis library to generate a balanced, counterfactual dataset that helps analyze potential bias in your models.

13 views ·
#machine learning#bias#data science
SIMON WILLISON

datasette 1.0a30

An open source multi-tool for exploring and publishing data…

21 views ·
#datasette#software#technology
R/ARTIFICIAL

Testing a Cold War-Era AI on Satellite Image Datasets

14 views ·
GITHUB

Show HN: CRED-1 – Open domain credibility dataset for on-device pre-bunking

CRED-1: An Open Multi-Signal Domain Credibility Dataset (2,672 domains) - aloth/cred-1…

12 views ·
#data#misinformation#credibility
R/STABLEDIFFUSION

Crucible - local open source application for dataset handling

12 views ·
ALGORHYTHM

Data Fundamentals Primer for Learning LLM

The minimum data plumbing every ML pipeline needs: samples, features and labels, the train/val/test split, text encoding (ASCII and UTF-8), and preprocessing.

12 views ·
#machine learning#data science#datasets
R/STABLEDIFFUSION

IMG Dataset Refiner v4.3 Pro is here! 🚀 The ultimate dataset prep tool for LoRAs

12 views ·
R/STABLEDIFFUSION

Help with dataset for lora training

8 views ·
R/LOCALLLAMA

Low-level coding dataset

8 views ·
R/CPP

Low-level coding dataset

13 views ·
ARXIV CS.AI

Less Data, Faster Training: repeating smaller datasets speeds up learning via sampling biases

This work investigates the "small-vs-large gap", where repeating on fewer samples can lead to compute saving during training compared to using a larger dataset. This is observed …

14 views ·
#machine learning#artificial intelligence#data science
ARXIV CS.AI

ShadeBench: A Benchmark Dataset for Building Shade Simulation in Sustainable Society

Urban heat exposure is becoming an increasingly critical challenge due to the intensifying urban heat island effect. Fine-grained shade patterns, especially those induced by urban …

14 views ·
#urban planning#climate#computer vision
ARXIV CS.AI

TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design

Text-to-image models produce graphic design at production scale, but their supervision comes from photo-style preference data with a single overall verdict per comparison. Designer…

18 views ·
#artificial intelligence#graphic design#computer vision
SIMON WILLISON

Datasette Agent

We just announced the first release of Datasette Agent, a new extensible AI assistant for Datasette. I’ve been working on my LLM Python library for just over three years now, …

15 views ·
#technology#ai#data
R/MACHINELEARNING

using .npy dataset with 3D models [R]

16 views ·
R/OPENAI

I benchmarked my AI agent runtime firewall against 3 public academic datasets — here are the honest results including where it fails

17 views ·
PHYS.ORG

AI tool fuses five satellite datasets to help track harmful algal blooms

19 views ·
DEV.TO (TOP)

Testing NGB Platform Beyond a Small Demo Dataset with k6 and TypeScript

How NGB Platform v1.1.1 adds a reusable performance testing framework for validating real business workloads, not just isolated endpoints.

10 views ·
#opensource#performance#testing
ARXIV CS.AI

GroupAffect-4: A Multimodal Dataset of Four-Person Collaborative Interaction

Existing affective-computing, social-signal-processing, and meeting corpora capture important parts of human interaction, but they rarely support analysis of affect in co-located g…

13 views ·
#artificial intelligence#datasets#collaboration
DEV.TO (TOP)

No Dataset? No Problem. How I Curated a Custom AI Dataset From Instagram & Pinterest to Build a Pose Suggester

When you start a new Machine Learning project, you pray there’s a clean, ready-to-use dataset on…

12 views ·
#ai#machine learning
NEURVANCE

Show HN: Dataset for AI training and fine tuning

Article 10 and Annex IV-ready CC0 training data for EU high-risk AI compliance. IP indemnity included on Compliance tier. Enforcement starts 2 August 2026.

16 views ·
#ai#data#training
R/DATABASE

Built an address-level Calgary civic data explorer by connecting multiple public datasets

13 views ·
PC GAMER

Take-Two's CEO says AI's not in the business of making hits, 'datasets by their very nature are backward looking', but that doesn't mean AI can't be 'super helpful'

"Clones don't sell".…

13 views ·
#gaming#technology#ai
R/RUST

Designing a plotting Dataset for Rust: Balancing Polars support with zero-dependency weight

14 views ·
YCOMBINATOR

Slop Bucket Idea – a dataset of AI slop (train AI what not to do)

18 views ·
#artificial intelligence#data#research
R/MACHINELEARNING

How are you handling training data when public datasets don't match your use case? [D]

15 views ·
DEV.TO (TOP)

🧞‍♂️Transform unstructured PDFs Job Offers into a dataset w. gemma4:2b

This is a submission for the Gemma 4 Challenge: Build with Gemma 4 🤔 About the power of…

12 views ·
#dataengineering#openai#joboffers
R/STABLEDIFFUSION

Generated 1000 liminal/dreamcore images with GPT Image 2 and put them in a dataset - could be useful for training

19 views ·
YAHOO SPORTS

But the trends in this dataset are loud enough to cut …

12 views ·
#nba#injuries#sports
R/MACHINELEARNING

[Academic] We need Data Annotators or Someone who Prepares Dataset [R]

10 views ·
SIMON WILLISON

What's new in pip 26.1 - lockfiles and dependency cooldowns!

Richard Si describes an excellent set of upgrades to Python's default pip tool for installing dependencies. This version drops support for Python 3.9 - fair enough, since it's been…

9 views ·
#python#programming#software