
Lowering costs for shuffle-heavy Apache Spark workloads with serverless storage for Amazon EMR Serverless


At re:Invent 2025, we introduced serverless storage for Amazon EMR Serverless, eliminating the need to provision local disk storage for Apache Spark workloads. Serverless storage for Amazon EMR Serverless reduces data processing costs by up to 20% while helping prevent job failures caused by disk capacity constraints.

In this post, we explore the cost improvements we observed when benchmarking Apache Spark jobs with serverless storage on EMR Serverless. We take a deeper look at how serverless storage helps reduce costs for shuffle-heavy Spark workloads, and we outline practical guidance on identifying the types of queries that benefit most from enabling serverless storage in your EMR Serverless Spark jobs.

Benchmark results for EMR 7.12 with serverless storage versus standard disks

We conducted the performance and cost savings benchmarking using the TPC-DS dataset at 3 TB scale, running 100+ queries that included a mix of high and low shuffle operations. The test configuration used Dynamic Resource Allocation (DRA) with no pre-initialized capacity. The system was set up with 20 GB of disk space, and Spark configurations included 4 cores and 14 GB of memory for both driver and executor, with dynamic allocation starting at 3 initial executors (spark.dynamicAllocation.initialExecutors = 3). A comparative analysis was performed between local disk storage and serverless storage configurations. The goal was to assess both total and average cost implications of these storage approaches.
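For reference, the following is a minimal sketch of how this configuration could be passed when starting a job run with boto3; the application ID, role ARN, and script location are placeholders, and the sizing properties mirror the settings described above.

import boto3

client = boto3.client("emr-serverless")

# Driver and executor sizing from the benchmark: 4 cores and 14 GB of memory
# for both, 20 GB of disk, and DRA enabled with 3 initial executors.
spark_submit_parameters = " ".join([
    "--conf spark.driver.cores=4",
    "--conf spark.driver.memory=14g",
    "--conf spark.executor.cores=4",
    "--conf spark.executor.memory=14g",
    "--conf spark.emr-serverless.driver.disk=20g",
    "--conf spark.emr-serverless.executor.disk=20g",
    "--conf spark.dynamicAllocation.enabled=true",
    "--conf spark.dynamicAllocation.initialExecutors=3",
])

response = client.start_job_run(
    applicationId="00example0application0id",  # placeholder
    executionRoleArn="arn:aws:iam::111122223333:role/EMRServerlessJobRole",  # placeholder
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://amzn-s3-demo-bucket/scripts/tpcds_queries.py",  # placeholder
            "sparkSubmitParameters": spark_submit_parameters,
        }
    },
)
print(response["jobRunId"])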

The following table and chart compare the cost reduction we observed in the testing environment described above. Based on us-east-1 pricing, we observed cost savings of more than 26% when using serverless storage.

Shuffle              Serverless storage   Standard disks   Savings
Total cost ($)       24.28                33.10            26.65%
Average cost ($)     0.233                0.318            26.73%
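The savings column follows directly from the table; as a quick check in Python:

# Cost savings relative to the standard-disk baseline, from the table above.
total_serverless, total_standard = 24.28, 33.10
avg_serverless, avg_standard = 0.233, 0.318
print(f"{(total_standard - total_serverless) / total_standard:.2%}")  # 26.65%
print(f"{(avg_standard - avg_serverless) / avg_standard:.2%}")        # 26.73%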

Average cost comparison between standard disks and serverless storage

% Relative savings (per query) of serverless storage compared to standard disk shuffle

In this testing, we observed that serverless storage in EMR Serverless reduces cost for about 80% of TPC-DS queries. For the queries where it provides benefits, it delivers an average cost saving of roughly 47%, with savings of up to 85%. Queries that regress typically have low shuffle intensity, maintain high parallelism throughout execution, or complete quickly enough that executor scale-down opportunities are minimal. The following figure shows the percentage cost difference for each of the TPC-DS queries when serverless storage was enabled, compared to the baseline configuration without serverless storage. Positive values indicate cost savings (higher is better), while negative values indicate cost regressions.


Percentage cost savings per TPC-DS query with serverless storage enabled

Runtime comparison

There are significant cost savings due to the increased elasticity from terminating executors earlier. However, job completion time may increase because the shuffle data is stored in serverless storage rather than locally on the executors. The additional read and write latency for shuffle data contributes to the longer runtime. The following table and chart show the runtime comparison we observed in our testing environment.

Shuffle                  Serverless storage   Standard disks   Runtime delta
Total duration (sec)     6770.63              4908.52          -37.94%
Average duration (sec)   65.1                 47.2             -37.92%
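The runtime column is derived the same way; the negative value reflects the longer runtime with serverless storage:

# Runtime delta relative to the standard-disk baseline, from the table above.
total_serverless_sec, total_standard_sec = 6770.63, 4908.52
print(f"{(total_standard_sec - total_serverless_sec) / total_standard_sec:.2%}")  # -37.94%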

Runtime comparison

Storing shuffle data externally and decoupling it from compute gives EMR Serverless the flexibility to turn off unused resources dynamically, because the state data has been offloaded from the compute. However, these cost savings can be realized only when DRA is on. If DRA is turned off, Spark keeps those unused resources alive, adding to the total cost.
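DRA itself is configured through standard Spark properties. The following is a minimal sketch of the settings that govern this scale-down behavior; the idle timeout value shown is illustrative, not taken from our benchmark.

# DRA must be on for the savings described above; these are standard Spark
# properties, shown as spark-submit parameters.
dra_params = " ".join([
    "--conf spark.dynamicAllocation.enabled=true",
    "--conf spark.dynamicAllocation.minExecutors=0",           # allow idle executors to be fully released
    "--conf spark.dynamicAllocation.executorIdleTimeout=60s",  # release executors after 60s of idling
])
print(dra_params)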

Query patterns that benefit from serverless storage

The cost savings from serverless storage depend heavily on how executor demand changes across the stages of a job. In this section, we examine common execution patterns and explain which query shapes are most likely to benefit from serverless storage on EMR Serverless and which query patterns may not benefit from shuffle externalization.

Inverted triangle pattern queries

To understand why externalizing the shuffle data can allow such large cost savings, consider a simplified query. The following query calculates annual total sales from the TPC-DS dataset by joining the store_sales and date_dim tables, summing the sales amounts per year, and ordering the results.

SELECT d_year, SUM(ss_net_paid) AS total_sales
FROM store_sales
JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
GROUP BY d_year
ORDER BY d_year;

This query demonstrates high executor demand during the map phase and low executor demand during the reduce phase: it is an aggregation query with a high-cardinality input and a low-cardinality group by.

  • Stage 1 (High Executor Demand)

The join and read steps scan the entire store_sales and date_dim tables. This often involves billions of rows in large-scale TPC-DS datasets, so Spark will try to parallelize the scan across many executors to maximize read throughput and compute efficiency.

  • Stage 2 (Low Executor Demand)

The aggregation is on d_year, which typically has few unique values, such as only a handful of years in the data. This means that after the shuffle stage, the reduce phase combines the partial aggregates into a number of keys equal to the number of years (often < 10). Only a few Spark tasks are needed to finish the final aggregation, so most executors become idle.

With shuffle data stored on local disk, the compute resources associated with these idle executors would still be running in order to keep the shuffle data available. With shuffle data offloaded from the nodes running the executors, and with DRA enabled, the nodes with idle executors are released immediately.

Because early stages process high-cardinality inputs and later stages collapse data into a small number of keys, these queries form an "inverted triangle" execution pattern: wide parallelism at the top and narrow parallelism at the bottom, as shown in the following image:

Inverted triangle pattern queries

Hourglass pattern queries

Depending on the complexity of the job, there can be multiple stages with varying demand on the number of executors needed per stage. Such jobs can benefit from the greater elasticity obtained by offloading shuffle data to external serverless storage. One such pattern is the hourglass pattern. The following figure shows a workload pattern where executor demand expands, contracts during shuffle-heavy stages, and expands again. Serverless storage for EMR Serverless decouples shuffle data from compute, enabling more efficient scale-down during narrow stages and helping improve cost optimization for elastic workloads.

Hourglass pattern in Spark stage execution

To identify queries of this class, consider the following example. The query progresses through three stages:

  • Stage 1: The initial join and filter between store_sales and item produces a wide, high-cardinality intermediate dataset, requiring high parallelism (many executors).
  • Stage 2: Aggregation groups by a small set of categories such as "Home" or "Electronics", resulting in a drastic drop in output partitions. This stage runs efficiently with just a few executors, as there is little data to parallelize.
  • Stage 3: The small result is joined (usually a broadcast join) back to a large fact table with a date dimension, again producing a large result that is well parallelized, causing Spark to ramp up executor usage for this stage.
WITH stage1_large_scan AS (
-- Stage 1: Scan and wide join generate a lot of parallelism and need many executors
SELECT ss_item_sk, ss_sold_date_sk, ss_net_paid, i_category
FROM store_sales
JOIN item ON store_sales.ss_item_sk = item.i_item_sk
WHERE item.i_category IN ('Home', 'Electronics')
),
stage2_small_agg AS (
-- Stage 2: Aggregate on a low-cardinality column (category), reducing to a few groups, so few executors are needed
SELECT i_category, SUM(ss_net_paid) AS total_cat_sales
FROM stage1_large_scan
GROUP BY i_category
),
stage3_broadcast_filter AS (
-- Stage 3: Join back to the high-cardinality fact table, pushing parallelism up again
-- (item is joined in again so that i_category is available for the final join)
SELECT s.ss_item_sk, i.i_category, d.d_year
FROM store_sales s
JOIN item i ON s.ss_item_sk = i.i_item_sk
JOIN date_dim d ON s.ss_sold_date_sk = d.d_date_sk
)

SELECT DISTINCT s3.d_year, s2.i_category, s2.total_cat_sales
FROM stage2_small_agg s2
JOIN stage3_broadcast_filter s3 ON s2.i_category = s3.i_category
ORDER BY s3.d_year, s2.i_category;

This pattern is common in reporting and dimensional analysis scenarios and is effective for demonstrating how Spark dynamically adjusts resource usage across job stages based on cardinality and parallelism needs. Such queries can also benefit from the elasticity enabled by external serverless storage.

Rectangle pattern queries

Not all queries benefit from externalizing the shuffle. Consider a query where cardinality is high throughout, meaning both stages operate on a large number of partitions and keys. Typically, queries that group by high-cardinality columns (such as item or customer) cause most stages to require similar amounts of parallelism. The following figure illustrates a Spark workload where parallelism remains consistently high across stages. In this pattern, both Stage 1 and Stage 2 operate on a large number of partitions and keys, resulting in sustained executor demand throughout the job lifecycle.

Rectangle pattern: high-cardinality execution with sustained parallelism

The following query is the same query that we used in the inverted triangle pattern earlier, with one change: we have replaced the date_dim table (low cardinality) with item (high cardinality).

SELECT i_item_id, SUM(ss_net_paid) AS total_sales
FROM store_sales
JOIN item ON store_sales.ss_item_sk = item.i_item_sk
GROUP BY i_item_id
ORDER BY i_item_id
LIMIT 100;

  • Stage 1: Reads the rows from store_sales and joins with item, spreading data across many partitions, similar to the original query's first stage.
  • Stage 2: The aggregation is by i_item_id, which typically has thousands to millions of distinct values in real datasets. This keeps parallelism high; many tasks handle non-overlapping keys, and shuffle outputs remain large.

There is no significant drop in cardinality: because neither stage reduces to a small group set, most executors stay busy throughout the job's main phases, with little idle time even after the shuffle. This type of query results in a flatter executor utilization profile, because each stage processes a similar amount of work, minimizing variation in resource usage. These rectangle pattern queries won't see the cost benefit from the elasticity obtained by offloading shuffle data. However, there may still be other benefits, such as reduction of job failures and performance bottlenecks from disk constraints, and freedom from capacity planning, sizing, and provisioning of storage for intermediate data operations.

Conclusion

Serverless storage for Amazon EMR Serverless can deliver substantial cost savings for workloads with dynamic resource patterns, as seen in the 26% average cost savings we observed in our testing environment. By externalizing shuffle data, you gain the elasticity to release idle executors immediately, demonstrated by savings reaching up to 85% in our testing environment on queries following inverted triangle and hourglass patterns when Dynamic Resource Allocation is enabled. Understanding your workload characteristics is essential. While rectangle pattern queries may not see dramatic cost reductions, they can still benefit from improved reliability and removal of capacity planning overhead.

To get started, analyze your job execution patterns, enable Dynamic Resource Allocation, and pilot serverless storage on shuffle-heavy workloads. Looking to reduce your Amazon EMR Serverless costs for Spark workloads? Explore serverless storage for EMR Serverless today.


About the authors

Sekar Srinivasan

Sekar has over 20 years of experience working with data. He is passionate about helping customers build scalable solutions, modernizing their architecture, and generating insights from their data. In his spare time he likes to work on non-profit projects, especially those focused on underprivileged children's education.

Praveen Mohan Prasad

Praveen is a Data and AI Specialist with 10+ years of experience in distributed data systems and machine learning, specializing in information retrieval and vector search systems. He is an active open-source contributor and technical speaker in the ML search and agentic AI space.
