How well does S3 checkpointing hold up when running Airflow on spot?
The article explores running Apache Airflow on Rackspace spot instances to reduce infrastructure costs, focusing on how S3 checkpointing helps maintain workflow reliability during instance preemptions. It details a deployment using KubernetesExecutor with control plane and worker nodes on spot instances, and evaluates failure resilience in a machine learning pipeline. Airflow's retry mechanisms and S3 for intermediate storage help mitigate disruptions caused by spot instance interruptions. The setup demonstrates cost efficiency and acceptable fault tolerance when properly configured.
- Running Airflow on Rackspace spot instances can cut monthly costs sharply: two spot instances cost roughly $1.44, versus about $21 for a single on-demand instance with similar CPU and memory.
- Rackspace spot instances have preemption rates below 1%, so interruptions are infrequent.
- The Airflow setup uses KubernetesExecutor, with the control plane and worker nodes on separate spot instances, and relies on S3 for task checkpointing and data persistence.
- The credit default prediction DAG stores intermediate data in S3, so tasks can recover after a spot interruption.
- Airflow’s built-in retries, task isolation, and S3 integration help keep the pipeline reliable despite spot instance volatility.
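
The checkpoint-and-retry pattern described in these points can be illustrated with a minimal sketch. It is not the article's code: `InMemoryStore` is a hypothetical stand-in for an S3 bucket so the example runs without AWS credentials, and `run_with_checkpoint`, the object key, and the `clean_data` step are illustrative names. In a real deployment the store would presumably be backed by S3 reads and writes.

```python
import json
from typing import Callable, Dict, Optional


class InMemoryStore:
    """Hypothetical stand-in for an S3 bucket (assumption, not a real S3 client)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        # Returns None when no object exists under this key.
        return self._objects.get(key)

    def put(self, key: str, body: bytes) -> None:
        self._objects[key] = body


def run_with_checkpoint(store: InMemoryStore, key: str,
                        compute: Callable[[], dict]) -> dict:
    """Return the checkpointed result for `key`, computing it only if absent."""
    cached = store.get(key)
    if cached is not None:
        # An earlier attempt finished this step before the node was reclaimed.
        return json.loads(cached)
    result = compute()
    # Write only after the work succeeds, so an interrupted attempt
    # never leaves a partial checkpoint behind.
    store.put(key, json.dumps(result).encode("utf-8"))
    return result


# Simulate a first attempt followed by a retry of the same task.
store = InMemoryStore()
calls = []


def clean_data() -> dict:
    calls.append(1)            # track how often the expensive step really runs
    return {"rows": 1000}      # illustrative value only


key = "credit_default/clean.json"  # hypothetical object key
first = run_with_checkpoint(store, key, clean_data)
retry = run_with_checkpoint(store, key, clean_data)
```

Because the retry finds the checkpoint, `clean_data` runs only once. The same idea applies per task: if each task writes its output to durable storage under a deterministic key, an Airflow retry after a preemption can skip work that already completed.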
Opening excerpt (first ~120 words)
Introduction

Airflow can be deployed on spot instances to significantly reduce infrastructure costs. Based on current Rackspace Spot pricing, two spot instances could cost around $1.44 per month, while a single on-demand instance with similar CPU and memory specifications comes to roughly $21 per month.

That difference largely comes from how the Rackspace Spot auction-based market works. Pricing is driven by competitive bids, which allows users to access unused capacity at much lower prices, in some cases as low as $0.001 per hour. You can find more context in this article on spot instance history and market dynamics.

This auction-based market maintains preemption rates below 1%, meaning interruptions tend to be infrequent.
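
As a quick sanity check on these figures, the sketch below shows one way the monthly spot cost could be derived. It assumes a 30-day month and that both instances run at the $0.001 per hour floor price; the excerpt does not state how the $1.44 figure was calculated, so treat this as an illustration rather than the article's method.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day billing month


def monthly_cost(hourly_rate: float, instances: int = 1) -> float:
    """Monthly cost in dollars for `instances` machines at a flat hourly rate."""
    return round(hourly_rate * HOURS_PER_MONTH * instances, 2)


spot_monthly = monthly_cost(0.001, instances=2)  # two spot instances
on_demand_monthly = 21.0                         # single on-demand instance, from the excerpt
savings_pct = round(100 * (1 - spot_monthly / on_demand_monthly), 1)

print(f"spot: ${spot_monthly}/month, savings vs on-demand: {savings_pct}%")
```

Under these assumptions the two spot instances come to $1.44 per month, which matches the quoted figure, and the saving relative to one on-demand instance is a little over 93%. Actual bids vary with the auction, so real costs may be higher.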
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Rackspace.