Three memory-leak patterns in long-running scrapers (and how I caught them after 968 Trustpilot runs)
This article covers three memory-leak patterns common in long-running web scrapers. The leaks do not crash a run; they raise its memory limit and its cost, and often go unnoticed until the invoice arrives. The author draws on 968 Trustpilot scraper runs and offers ways to mitigate each pattern.
- Memory leaks can quietly push a scraper's Apify Memory limit from 1 GB to 2 GB to 4 GB, doubling the per-run cost.
- The most common leak pattern is an unbounded asyncio queue that grows linearly with runtime.
- Dynamically built regex patterns can cause cache misses and increased memory usage during long runs.
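
The excerpt available here does not include the author's own fixes, so the following is only a minimal sketch of how the two named patterns are commonly avoided in Python. It assumes a producer/consumer scraper built on `asyncio.Queue`, and a per-company text match as the source of dynamic regexes. The names `producer`, `consumer`, `run`, `REVIEW_COUNT`, and `review_count`, the example URLs, and the "has N reviews" text format are all hypothetical.

```python
import asyncio
import re
from typing import List, Optional, Tuple

# Pattern 1: bound the queue. With maxsize set, put() waits once the queue
# is full, so a fast producer cannot grow memory linearly with runtime.
async def producer(queue: asyncio.Queue, n: int) -> None:
    for i in range(n):
        # Hypothetical URLs; put() applies backpressure when the queue is full.
        await queue.put(f"https://example.com/page/{i}")
    await queue.put(None)  # sentinel: no more work

async def consumer(queue: asyncio.Queue, sizes: List[int]) -> None:
    while True:
        url = await queue.get()
        if url is None:
            break
        sizes.append(queue.qsize())  # record backlog for inspection
        await asyncio.sleep(0)       # stand-in for the real fetch/parse step

async def run(n: int = 50, maxsize: int = 5) -> Tuple[int, int]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    sizes: List[int] = []
    await asyncio.gather(producer(queue, n), consumer(queue, sizes))
    return len(sizes), max(sizes)  # (items processed, peak backlog)

# Pattern 2: avoid building a new regex per item. One fixed pattern is
# compiled once at import time; the variable part (the company name) is
# captured and compared as a plain string instead of being interpolated.
REVIEW_COUNT = re.compile(r"(?P<name>.+?)\s+has\s+(?P<count>\d+)\s+reviews")

def review_count(text: str, company: str) -> Optional[int]:
    match = REVIEW_COUNT.search(text)
    if match and match.group("name") == company:
        return int(match.group("count"))
    return None
```

Calling `asyncio.run(run(50, 5))` processes all 50 items while the backlog never exceeds the chosen `maxsize`. The regex half works because Python's `re` module keeps an internal cache of compiled patterns, and a pattern string that changes with every item cannot benefit from it; a single module-level pattern sidesteps the cache entirely.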
Opening excerpt (first ~120 words)
Alex Spinov. Posted on May 18. Originally published at blog.spinov.online.
#webscraping #python #ai #apify
Memory leaks in scrapers do not crash the run. They quietly bump the Apify Memory limit from 1 GB to 2 GB to 4 GB, double the per-run cost, and only get spotted weeks later on a compute-unit invoice.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is on DEV.to.