Your recurring scraper is re-downloading data that didn't change. Here's the 15-line fix (conditional GET)
The article discusses a solution for scrapers that repeatedly download unchanged data. It emphasizes the importance of using conditional GET requests to minimize server load and avoid unnecessary data processing. The author shares insights from extensive scraping experience, advocating for a polite approach to web scraping.
- The article highlights the ethical scraping debate surrounding robots.txt and terms of service.
- It suggests that conditional GET requests combined with a sensible rate limit are key to avoiding bans.
- The author has conducted over 2,190 scrapes across 32 different scrapers.
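
To make the conditional GET idea concrete, here is a minimal Python sketch using only the standard library. It is not the author's 15-line fix, which is not reproduced in this summary. The names `conditional_headers` and `fetch_if_changed`, the in-memory `cache` dict, and the one-second default pause are all assumptions made for illustration. The sketch stores the `ETag` and `Last-Modified` values from the previous response, sends them back as `If-None-Match` and `If-Modified-Since`, and treats a `304 Not Modified` reply as "nothing to re-download".

```python
import time
import urllib.error
import urllib.request


def conditional_headers(cached):
    """Build validator headers from a previously stored response (hypothetical helper)."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def fetch_if_changed(url, cache, min_interval=1.0, opener=urllib.request.urlopen):
    """Return the response body, or None when the server answers 304 Not Modified.

    `cache` is a plain dict keyed by URL (an assumption; a real scraper would
    persist it between runs). `min_interval` is a crude fixed pause after each
    request, standing in for a proper rate limiter.
    """
    request = urllib.request.Request(url, headers=conditional_headers(cache.get(url, {})))
    try:
        with opener(request) as response:
            body = response.read()
            # Remember the validators so the next run can ask "has this changed?"
            cache[url] = {
                "etag": response.headers.get("ETag"),
                "last_modified": response.headers.get("Last-Modified"),
            }
            return body
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError; here it means the cached copy is current.
        if err.code == 304:
            return None
        raise
    finally:
        time.sleep(min_interval)
```

On a recurring run, a `None` result means the parsing and storage steps can be skipped for that URL. Servers that send neither `ETag` nor `Last-Modified` will simply return the full body every time, so the sketch degrades to a plain GET in that case.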
Opening excerpt (first ~120 words)
Alex Spinov. Posted on May 26. Originally published at blog.spinov.online.
Tags: #webscraping #python #ai #apify
Note: This is a cross-post. Canonical version (full long-form) lives on my blog: https://blog.spinov.online/blog/ethical-scraping-is-a-rate-limit-question/
TL;DR The "ethical scraping" debate keeps arguing about robots.txt and ToS.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to (Top).