Amazon Athena added support for reading Parquet Column Indexes in Apache Iceberg tables on November 21, 2025. With this optimization, Athena can perform page-level data pruning to skip unnecessary data within Parquet row groups, potentially reducing the amount of data scanned and improving query runtime for queries with selective filters. For data teams, this can help enable faster insights and reduce costs when analyzing large-scale data lakes.
Data teams building data lakes often choose Apache Iceberg for its ACID transactions, schema evolution, and metadata management capabilities. Athena is a serverless query engine that lets you query Amazon S3-based data lakes using SQL without managing infrastructure. Based on the type of data and query logic, Athena can apply several query optimizations to improve performance and reduce costs.
In this blog post, we use Athena and Amazon SageMaker Unified Studio to explore Parquet Column Indexes and demonstrate how they can improve Iceberg query performance. We explain what Parquet Column Indexes are, demonstrate their performance benefits, and show you how to use them in your applications.
Overview of Parquet Column Indexes
Parquet Column Indexes store metadata that query engines can use to skip irrelevant data with greater precision than row group statistics alone. To understand how they work, consider how data is structured within Parquet files and how engines like Athena process them.
Parquet files organize data hierarchically by dividing data into row groups (typically 128-512 MB each) and further subdividing them into pages (typically 1 MB each). Traditionally, Parquet maintains metadata about the contents of each row group in the form of min/max statistics, allowing engines like Athena to skip row groups that don’t satisfy query predicates. Although this approach reduces bytes scanned and query runtime, it has limitations. If even a single page within a row group overlaps with the values you are searching for, Athena scans all pages within that row group.
Parquet Column Indexes help address this problem by storing page-level min/max statistics in the Parquet file footer. Row group statistics provide coarse-grained filtering, but Parquet Column Indexes enable finer-grained filtering by allowing query engines like Athena to skip individual pages within a row group. Consider a Parquet file with a single row group containing five pages for a column. The row group has min/max statistics of (1, 20), and each page for that column has the following min/max statistics.
When Athena runs a query filtering for values equal to 2, it first checks the row group statistics and confirms that 2 falls within the range (1, 20). Athena then plans to scan the pages within that row group. Without Parquet Column Indexes, Athena scans each of the five pages in the row group. With Parquet Column Indexes, Athena examines the page-level statistics and determines that only page-0 and page-1 need to be read, skipping the remaining three pages.
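The pruning logic described above can be sketched in a few lines of Python. This is a minimal illustration of the technique, not Athena’s implementation; the page-level min/max pairs below are hypothetical values chosen to match the five-page example.

```python
# Minimal sketch (not the Athena implementation) of page pruning with
# page-level min/max statistics from a Parquet Column Index.

def pages_to_read(pages, value):
    """Return the indexes of pages whose [min, max] range may contain value."""
    return [i for i, (lo, hi) in enumerate(pages) if lo <= value <= hi]

# Hypothetical page-level min/max stats for one column in one row group.
page_stats = [(1, 2), (2, 4), (5, 9), (10, 14), (15, 20)]

row_group_min = min(lo for lo, _ in page_stats)  # 1
row_group_max = max(hi for _, hi in page_stats)  # 20

# Row-group-level stats alone: value 2 falls inside (1, 20), so without a
# column index all 5 pages are scanned. With page-level stats, only the
# pages whose range covers 2 are read.
print(pages_to_read(page_stats, 2))  # → [0, 1]
```

With only row group statistics, the check against (1, 20) succeeds and the whole row group is read; the page-level check narrows the read set to two pages.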
How to use Parquet Column Indexes with Athena
Athena uses Parquet Column Indexes based on table type:
- Amazon S3 Tables: Athena automatically uses Parquet Column Indexes by default when they are present.
- Iceberg tables in S3 general purpose buckets: Athena doesn’t use Parquet Column Indexes by default. To allow Athena to use Parquet Column Indexes, add an AWS Glue table property named use_iceberg_parquet_column_index and set it to true. Use the AWS Glue console or the AWS Glue UpdateTable API to perform these actions.
Read more about how to use this feature in Use Parquet column indexing.
Measuring Athena performance gains when using Parquet Column Indexes
Now that we understand what Parquet Column Indexes are, we’ll demonstrate their performance benefits by analyzing the catalog_sales table from a 3TB TPC-DS dataset. This table contains ecommerce transaction data including order dates, sales amounts, customer IDs, and product information. This dataset is a good proxy for the types of business analysis you might perform on your own data, such as identifying sales trends, analyzing customer purchasing patterns, and calculating revenue metrics. We compare query execution statistics with and without Parquet Column Indexes to quantify the performance improvement.
Prerequisites
Before you begin, you must have the following resources:
- A SageMaker Unified Studio IAM-based domain.
- An execution IAM role configured within the SageMaker Unified Studio IAM-based domain with access to Amazon S3, the AWS Glue Data Catalog, and Athena.
- An S3 bucket in your account to store Iceberg table data and Athena query results.
Create the catalog_sales Iceberg table
Complete the following steps using SageMaker Unified Studio notebooks. There, you can use SageMaker Unified Studio’s multi-dialect notebook functionality to interact with your data using the Athena SQL and Spark engines. To create a catalog_sales Iceberg table in your account, follow these steps:
- Navigate to Amazon SageMaker in the AWS Management Console and choose Open under Get started with Amazon SageMaker Unified Studio.
- From the side navigation, select Notebooks and choose Create Notebook. The following steps in this post will execute scripts in this notebook.
- Create a new SQL cell in the notebook and set the connection type to Athena (Spark). Execute the following query to create a database for the tables in this post.
- Create a new SQL cell in the notebook and verify the connection type is Athena (Spark). Execute the following query to create a Hive table pointing to the location of the TPC-DS catalog_sales table data in the public S3 bucket.
- Create a new SQL cell in the notebook and verify the connection type is Athena (Spark). Execute the following query to add the Hive partitions to the AWS Glue metadata.
- Create a new SQL cell in the notebook and verify the connection type is Athena (Spark). Replace s3://amzn-s3-demo-bucket/athena_parquet_column_index_blog/catalog_sales/ with the S3 URI where you want to store your Iceberg table data, then execute the following query to create the catalog_sales Iceberg table from the Hive table.
- Create a new SQL cell in the notebook and verify the connection type is Athena (Spark). Execute the following query to delete the catalog_sales_hive table, which was only needed to create the catalog_sales Iceberg table.
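The queries referenced in the steps above can be sketched roughly as follows. This is a hedged reconstruction, not the exact DDL from the original post: the full TPC-DS column list is elided, and the public TPC-DS bucket location is a placeholder. The database name (parquet_column_index_blog), the intermediate table name (catalog_sales_hive), and the destination S3 URI are taken from names that appear elsewhere in this post.

```sql
-- Hedged sketch of the setup queries; the exact DDL in the original post may differ.

-- 1. Database for the tables in this post.
CREATE DATABASE IF NOT EXISTS parquet_column_index_blog;

-- 2. External Hive table over the public TPC-DS data (placeholder location).
CREATE EXTERNAL TABLE parquet_column_index_blog.catalog_sales_hive (
  cs_item_sk BIGINT,
  cs_order_number BIGINT,
  cs_quantity INT
  -- ... remaining TPC-DS catalog_sales columns ...
)
PARTITIONED BY (cs_sold_date_sk BIGINT)
STORED AS PARQUET
LOCATION 's3://<public-tpcds-bucket>/catalog_sales/';

-- 3. Load the Hive partitions into the AWS Glue metadata.
MSCK REPAIR TABLE parquet_column_index_blog.catalog_sales_hive;

-- 4. Create the Iceberg table from the Hive table with CTAS.
CREATE TABLE parquet_column_index_blog.catalog_sales
WITH (
  table_type = 'ICEBERG',
  format = 'PARQUET',
  is_external = false,
  location = 's3://amzn-s3-demo-bucket/athena_parquet_column_index_blog/catalog_sales/'
) AS
SELECT * FROM parquet_column_index_blog.catalog_sales_hive;

-- 5. Drop the intermediate Hive table.
DROP TABLE parquet_column_index_blog.catalog_sales_hive;
```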
Run an Athena query without Parquet Column Indexes
After creating the catalog_sales Iceberg table in the preceding steps, we run a simple query that analyzes shipping delays for the top 10 most ordered items. This kind of analysis can be critical for ecommerce and retail operations. By identifying which popular items experience the longest delays, fulfillment teams can focus resources where they matter most. For example, you can adjust inventory placement, change warehouse assignments, or address supplier issues. Additionally, popular items with significant shipping delays are more likely to result in order cancellations or returns, so proactively identifying these issues helps protect revenue.
Additionally, this query is a good candidate for demonstrating the effectiveness of Parquet Column Indexes because it has a selective filter predicate on a single column, cs_item_sk. When Athena executes this query, it first identifies row groups whose min/max ranges overlap with the top 10 most ordered items. Without Parquet Column Indexes, Athena has to scan every page of data within these matched row groups. However, with Parquet Column Indexes, Athena can prune data further by skipping individual pages within these row groups whose min/max ranges don’t overlap with the IDs. Complete the following steps to establish baseline query performance when Athena doesn’t use Parquet Column Indexes during the query.
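A query of this shape could look like the following. This is a hedged sketch, not the exact query from the original post: the TPC-DS column names are standard, but computing the shipping delay as cs_ship_date_sk minus cs_sold_date_sk and ranking items by order count are assumptions based on the description above.

```sql
-- Hedged sketch: average shipping delay for the top 10 most ordered items,
-- with a dynamic filter (subquery) on cs_item_sk.
SELECT
  cs_item_sk,
  AVG(cs_ship_date_sk - cs_sold_date_sk) AS avg_shipping_delay_days
FROM parquet_column_index_blog.catalog_sales
WHERE cs_item_sk IN (
  -- Dynamic filter: the top 10 most ordered items.
  SELECT cs_item_sk
  FROM parquet_column_index_blog.catalog_sales
  GROUP BY cs_item_sk
  ORDER BY COUNT(*) DESC
  LIMIT 10
)
GROUP BY cs_item_sk
ORDER BY avg_shipping_delay_days DESC;
```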
- Create a new Python cell in the notebook. Replace s3://amzn-s3-demo-bucket/athena_parquet_column_index_blog/query_results/ with the S3 URI where you want to store your Athena query results, then execute the following script. Note the runtime and bytes scanned that will be printed. The script runs the query five times with query result reuse disabled and chooses the minimum runtime and the corresponding bytes scanned among those iterations. See our numbers in the Run an Athena query with Parquet Column Indexes section.
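A benchmark script like the one described could be sketched as follows. This is a hedged sketch, not the original post’s script: the SQL text is passed in by the caller, the database name and output S3 URI are taken from names used in this post, and the retry count defaults to five as described above.

```python
# Minimal sketch of a benchmark script: run an Athena query several times
# with result reuse disabled and keep the fastest run's statistics.
import time

def best_run(runs):
    """Pick the (runtime_ms, bytes_scanned) pair with the lowest runtime."""
    return min(runs, key=lambda r: r[0])

def benchmark(sql, n_runs=5):
    import boto3
    athena = boto3.client("athena")
    runs = []
    for _ in range(n_runs):
        qid = athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": "parquet_column_index_blog"},
            ResultConfiguration={
                "OutputLocation": "s3://amzn-s3-demo-bucket/athena_parquet_column_index_blog/query_results/"
            },
            # Disable query result reuse so every iteration actually scans data.
            ResultReuseConfiguration={
                "ResultReuseByAgeConfiguration": {"Enabled": False}
            },
        )["QueryExecutionId"]
        # Poll until the query reaches a terminal state.
        while True:
            qe = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
            if qe["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
                break
            time.sleep(1)
        stats = qe["Statistics"]
        runs.append((stats["TotalExecutionTimeInMillis"], stats["DataScannedInBytes"]))
    return best_run(runs)
```

Taking the minimum runtime across iterations reduces noise from warm-up effects and transient variance, which is why the post reports the best of five runs rather than a single measurement.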
Sort the catalog_sales table
Before rerunning the query with Athena using Parquet Column Indexes, you need to sort the catalog_sales table by the cs_item_sk column. In the preceding query, there is a dynamic filter as a subquery on the cs_item_sk column:
When executing this query, Athena pushes down the filter predicate to the data source level, fetching only rows that match the top 10 most ordered items. To maximize page pruning with Parquet Column Indexes, rows with the same cs_item_sk values should be stored near one another in the Parquet file. Without sorting, matching values can be scattered across many pages, forcing Athena to read more data. Sorting the table by cs_item_sk clusters similar values together, enabling Athena to read fewer pages.
Let’s examine the Parquet Column Indexes in one of the Parquet files to understand how the data in the catalog_sales table is currently organized. First, download the Parquet file from the cs_sold_date_sk = 2450815 partition and install the open-source parquet-cli tool on your local machine. Replace the placeholder with the path to the downloaded Parquet file, then run the following command on your local machine:
This displays Parquet Column Indexes for all columns. For brevity, only the first 11 pages of the cs_item_sk column from the first row group are shown in the following example:
Notice that nearly every page contains a wide range of values. This overlap means Athena cannot eliminate pages when filtering with Parquet Column Indexes on cs_item_sk. For example, searching for cs_item_sk = 100 requires scanning each of the 11 pages because the value 100 falls within every page’s min/max range. With this overlap, enabling Athena to use Parquet Column Indexes would provide no performance benefit. Sorting the data by cs_item_sk eliminates this overlap, creating distinct, non-overlapping ranges for each page. To make Parquet Column Indexes more effective, sort the table by completing the following step:
- Create a new SQL cell in the notebook and verify the connection type is Athena (Spark). Execute the query to sort the cs_item_sk column values of the catalog_sales table in ascending order and to place all the null values in the last few Parquet pages. New Iceberg data files will be generated by this query.
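One way to achieve this with Spark SQL and the Iceberg Spark extensions is sketched below. This is a hedged reconstruction, not the original post’s query; the exact mechanism used there may differ.

```sql
-- Hedged sketch. Set a table write order (ascending, nulls last), then
-- rewrite the existing data files so they follow that sort order.
ALTER TABLE parquet_column_index_blog.catalog_sales
  WRITE ORDERED BY cs_item_sk ASC NULLS LAST;

CALL system.rewrite_data_files(
  table => 'parquet_column_index_blog.catalog_sales',
  strategy => 'sort',
  sort_order => 'cs_item_sk ASC NULLS LAST'
);
```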
Running the parquet column-index command on the sorted data file from the cs_sold_date_sk = 2450815 partition shows that the Parquet Column Indexes are now sorted and have non-overlapping ranges. The first 11 pages of the cs_item_sk column from the first row group are shown in the following example:
Now when searching for cs_item_sk = 100, Athena only needs to read page-0, skipping the remaining 10 pages entirely.
Run an Athena query with Parquet Column Indexes
Now that the data is sorted to eliminate overlapping pages within the row groups for the cs_item_sk column, we run two experiments on the sorted data. The first measures the impact of sorting alone, and the second measures the combined effect of sorting with Parquet Column Indexes.
- Create a new Python cell in the notebook. Execute the same script from the section Run an Athena query without Parquet Column Indexes and take note of the query runtime and bytes scanned. This measures the performance of querying sorted data without using Parquet Column Indexes.
- Create a new Python cell in the notebook. Execute the following Python script to set the use_iceberg_parquet_column_index table property to true for the catalog_sales table in the AWS Glue Data Catalog.
- Create a new Python cell in the notebook. Execute the same script from the section Run an Athena query without Parquet Column Indexes and take note of the query runtime and bytes scanned. This measures the performance of querying sorted data using Parquet Column Indexes.
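A script to set the table property could be sketched as follows. This is a hedged sketch, not the original post’s script: it reads the current table definition, copies only the fields that the UpdateTable API accepts, and adds the property. The database and table names match those used in this post.

```python
# Minimal sketch: set use_iceberg_parquet_column_index = 'true' on a Glue
# table via the UpdateTable API.

def with_column_index_enabled(table):
    """Build an UpdateTable TableInput with the property set to 'true'."""
    # UpdateTable accepts only TableInput fields, so copy the allowed keys
    # from the existing table definition returned by GetTable.
    allowed = {
        "Name", "Description", "Owner", "Retention", "StorageDescriptor",
        "PartitionKeys", "TableType", "Parameters",
    }
    table_input = {k: v for k, v in table.items() if k in allowed}
    params = dict(table_input.get("Parameters", {}))
    params["use_iceberg_parquet_column_index"] = "true"
    table_input["Parameters"] = params
    return table_input

def enable_column_index(database="parquet_column_index_blog",
                        table_name="catalog_sales"):
    import boto3
    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.update_table(DatabaseName=database,
                      TableInput=with_column_index_enabled(table))
```

Copying the existing definition first matters: UpdateTable replaces the table definition rather than patching it, so omitting fields like StorageDescriptor would drop them.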
Athena query time and bytes scanned improvement
The following table summarizes the results from each experiment. The percentage improvements for the sorted experiments are measured against the unsorted baseline.
| Experiment | Runtime (sec) | Bytes Scanned (GB) |
| --- | --- | --- |
| Unsorted without Parquet Column Indexes | 20.6 | 45.2 |
| Sorted without Parquet Column Indexes | 15.4 (25.2% faster) | 27.8 (38.5% fewer bytes) |
| Sorted with Parquet Column Indexes | 10.3 (50.0% faster) | 13.0 (71.2% fewer bytes) |
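The percentages in the table are straightforward to reproduce from the raw runtimes and bytes scanned, each measured against the unsorted baseline (20.6 s, 45.2 GB):

```python
# Reproduce the percentage improvements from the results table above.
def pct_improvement(baseline, value):
    return round(100 * (baseline - value) / baseline, 1)

print(pct_improvement(20.6, 15.4))  # runtime, sorted only      → 25.2
print(pct_improvement(45.2, 27.8))  # bytes,   sorted only      → 38.5
print(pct_improvement(20.6, 10.3))  # runtime, sorted + indexes → 50.0
print(pct_improvement(45.2, 13.0))  # bytes,   sorted + indexes → 71.2
```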
Recommendations
To maximize Athena’s ability to use Parquet Column Indexes and achieve optimal query performance, we recommend the following.
- Sort data by frequently filtered columns. This allows Athena to efficiently read Parquet Column Indexes and skip irrelevant pages, potentially reducing scan time. When data is sorted by a filter column, similar values are clustered together within pages. Because Parquet Column Indexes store min/max values for each page, Athena can quickly determine which pages contain matching values and skip the rest.
- Sort data by high-cardinality columns. This creates distinct value ranges between pages, maximizing the opportunity for Athena to skip pages during query execution. High-cardinality columns (those with many distinct values) produce non-overlapping min/max ranges across pages, allowing Athena to more effectively filter out irrelevant pages. In contrast, low-cardinality columns such as boolean or status fields result in overlapping ranges across many pages, reducing the number of skipped pages.
Clean up
If you have finished the steps in this post, complete the following cleanup actions to avoid incurring ongoing charges:
- Create a new SQL cell in the notebook and set the connection type to Athena (Spark). Execute the following command to drop the parquet_column_index_blog database and the catalog_sales table.
- Delete the Iceberg table data and the Athena query results from your S3 bucket.
- Delete the SageMaker Unified Studio IAM-based domain if it is no longer needed.
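The drop commands referenced in the first cleanup step can be sketched as follows (a hedged sketch; the original post’s exact command may differ):

```sql
-- Drop the Iceberg table first, then the database.
DROP TABLE IF EXISTS parquet_column_index_blog.catalog_sales;
DROP DATABASE IF EXISTS parquet_column_index_blog;
```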
Conclusion
In this post, we showed you how Athena uses Parquet Column Indexes to speed up queries and reduce the number of bytes scanned. By using Parquet Column Indexes, Athena can skip irrelevant data pages to improve query performance, especially for queries with selective filters on sorted data. Refer to Optimize Iceberg tables to learn more about this feature and try it out on your own queries.
About the Author
