Read-Write ETL on NAS Data with EMR Serverless Spark — No Cluster, No Copy
This article shows how to run read-write ETL on NAS data with EMR Serverless Spark, with no cluster to manage and no data to copy. Spark reads from and writes back to FSx for ONTAP storage directly through S3 Access Points. A full ETL pipeline completes in 37 seconds of total job time, cold start included, at a cost of approximately $0.05 per job.
- EMR Serverless Spark can read, transform, and write back Parquet files on FSx for ONTAP via S3 Access Points.
- Spark execution for the full ETL pipeline takes 16 seconds; total job time is 37 seconds including cold start.
- The serverless approach eliminates cluster management and brings the cost down to approximately $0.05 per job.
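The read, transform, and write-back flow described above can be sketched in PySpark roughly as follows. This is a minimal illustration, not the article's actual job: it assumes the S3 Access Point alias can stand in for a bucket name in an `s3://` URI, and the helper names, prefixes, and the placeholder transform are all hypothetical. Whether a particular write mode or committer behaves correctly against an FSx for ONTAP access point is not confirmed here.

```python
def s3_uri(access_point_alias: str, key_prefix: str) -> str:
    """Build an s3:// URI, treating the access point alias like a bucket name.

    Assumption: the alias is accepted wherever a bucket name is expected.
    """
    return f"s3://{access_point_alias}/{key_prefix.strip('/')}/"


def run_etl(access_point_alias: str, src_prefix: str, dst_prefix: str) -> None:
    """Hypothetical ETL entry point: read Parquet, transform, write back."""
    # Imported lazily so the path helper works without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("nas-etl-sketch").getOrCreate()

    # Read Parquet in place from the NAS volume through the access point.
    df = spark.read.parquet(s3_uri(access_point_alias, src_prefix))

    # Placeholder transform: drop exact duplicate rows and stamp the load time.
    out = df.dropDuplicates().withColumn("processed_at", F.current_timestamp())

    # Write the result back to the same volume under a different prefix.
    out.write.mode("overwrite").parquet(s3_uri(access_point_alias, dst_prefix))

    spark.stop()
```

Writing to a separate prefix rather than overwriting the source path keeps the input readable while the job runs, which is a common precaution for in-place pipelines.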
Yoshiki Fujiwara (藤原 善基) @AWS Community Builder, for AWS Community Builders. Posted on May 26.
#aws #spark #emr #amazonfsxfornetappontap
FSx for ONTAP S3 Access Points × Lakehouse Deep Dive (7 Part Series)
1 Query NAS Data In Place with Athena and FSx for ONTAP S3 Access Points
2 FSx for ONTAP S3 Access Points Lakehouse — What Works, What Doesn't, and Why
... 3 more parts...
…
The full article is at DEV.to.