Accelerate query performance with Apache Iceberg statistics on the AWS Glue Data Catalog


Today, we're pleased to announce a new capability for the AWS Glue Data Catalog: generating column-level aggregation statistics for Apache Iceberg tables to accelerate queries. These statistics are used by the cost-based optimizer (CBO) in Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings.

Apache Iceberg is an open table format that provides ACID transaction capabilities on your data lakes. It's designed to process large analytics datasets and is efficient even for small row-level operations. It also enables useful features such as time travel, schema evolution, hidden partitioning, and more.

AWS has invested in service integrations with Iceberg to enable Iceberg workloads based on customer feedback. One example is the AWS Glue Data Catalog. The Data Catalog is a centralized repository that stores metadata about your organization's datasets, making the data visible, searchable, and queryable for users. The Data Catalog supports Iceberg tables and tracks each table's current metadata. It also enables automatic compaction of the individual small files produced by each transactional write into a smaller number of large files for faster read and scan operations.

In 2023, the Data Catalog announced support for column-level statistics for non-Iceberg tables. That feature collects table statistics used by the query engine's CBO. Now, the Data Catalog extends this support to Iceberg tables. The Iceberg column statistics that the Data Catalog generates are based on the Puffin spec and stored on Amazon Simple Storage Service (Amazon S3) alongside other table data. This way, the various engines that support Iceberg can use and update them.

This post demonstrates how column-level statistics for Iceberg tables work with Redshift Spectrum. Additionally, we showcase the performance benefit of Iceberg column statistics with the TPC-DS dataset.

How Iceberg table column statistics work

The AWS Glue Data Catalog generates table column statistics using the Theta Sketch algorithm from Apache DataSketches to estimate the number of distinct values (NDV) and stores them in a Puffin file.

For SQL planners, NDV is an important statistic for optimizing query planning. There are several scenarios where NDV statistics can potentially improve query performance. For example, when joining two tables on a column, the optimizer can use the NDV to estimate the selectivity of the join. If one table has a low NDV for the join column compared to the other table, the optimizer may choose a broadcast join instead of a shuffle join, reducing data movement and improving query performance. Moreover, when more than two tables are joined, the optimizer can estimate the output size of each join and plan an efficient join order. Additionally, NDV can be used for various other optimizations, such as GROUP BY, DISTINCT, and COUNT queries.

However, repeatedly calculating NDV with 100% accuracy requires O(N) space. Instead, Theta Sketch is an efficient algorithm that lets you estimate the NDV of a dataset without keeping all of the distinct values in memory or storage. The key idea behind Theta Sketch is to hash the data into a range between 0 and 1, and then retain only a small portion of the hashed values based on a threshold (denoted as θ). By analyzing this small subset of the data, the Theta Sketch algorithm can provide an accurate estimate of the NDV of the original dataset.
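To make this concrete, the following is a minimal sketch of NDV estimation with a Theta Sketch using the open source Apache DataSketches Python bindings (the datasketches package). The simulated column values and the default sketch parameters are illustrative only.

# Minimal sketch of NDV estimation with a Theta Sketch (pip install datasketches).
# The simulated column below is an illustrative assumption, not data from this post.
import random

from datasketches import update_theta_sketch

# Simulate a column with roughly 100,000 distinct values among 1 million rows.
values = [random.randint(0, 99_999) for _ in range(1_000_000)]

sketch = update_theta_sketch()   # default lg_k controls the accuracy/size trade-off
for v in values:
    sketch.update(v)             # hash each value; only a small sample is retained

exact_ndv = len(set(values))
print(f"exact NDV:     {exact_ndv}")
print(f"estimated NDV: {sketch.get_estimate():.0f}")  # typically within a few percent

The sketch itself stays small regardless of how many rows are processed, which is what makes it practical to persist alongside table metadata.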

Iceberg's Puffin file format is designed to store information such as indexes and statistics as blobs. One of the representative blob types is apache-datasketches-theta-v1, which contains the serialized Theta Sketch used to estimate the NDV. Puffin files are linked to a snapshot ID in Iceberg's metadata and are used by the query engine's CBO to optimize query plans.

Use Iceberg column statistics through Amazon Redshift

To demonstrate the performance benefit of this capability, we use the industry-standard TPC-DS 3 TB dataset. We compare query performance with and without Iceberg column statistics by running queries in Redshift Spectrum. We have included the queries used in this post, and we recommend trying your own queries by following the workflow.

The overall steps are as follows:

  1. Run an AWS Glue job that extracts the TPC-DS dataset from the public S3 bucket and saves it as Iceberg tables in your S3 bucket. The AWS Glue Data Catalog stores these tables' metadata location. Query these tables using Redshift Spectrum.
  2. Generate column statistics: Use the enhanced capabilities of the AWS Glue Data Catalog to generate column statistics for each table. This produces Puffin files that store Theta Sketches.
  3. Query with Redshift Spectrum: Evaluate the impact of column statistics on query performance by using Redshift Spectrum to run queries on the dataset.

The following diagram illustrates the architecture.


To try this new capability, we complete the following steps:

  1. Set up resources with AWS CloudFormation.
  2. Run an AWS Glue job to create Iceberg tables for the 3 TB TPC-DS dataset in your S3 bucket. The Data Catalog stores these tables' metadata location.
  3. Run queries on Redshift Spectrum and note the query duration.
  4. Generate Iceberg column statistics for the Data Catalog tables.
  5. Run queries on Redshift Spectrum and compare the query duration with the previous run.
  6. Optionally, schedule AWS Glue column statistics jobs using AWS Lambda and an Amazon EventBridge schedule.

Set up resources with AWS CloudFormation

This post includes a CloudFormation template for a quick setup. You can review and customize it to suit your needs. Note that this CloudFormation template requires a Region with at least three Availability Zones. The template generates the following resources:

  • A virtual private cloud (VPC), public subnet, private subnets, and route tables
  • An Amazon Redshift Serverless workgroup and namespace
  • An S3 bucket to store the TPC-DS dataset, column statistics, job scripts, and so on
  • Data Catalog databases
  • An AWS Glue job to extract the TPC-DS dataset from the public S3 bucket and save the data as Iceberg tables in your S3 bucket
  • AWS Identity and Access Management (IAM) roles and policies
  • A Lambda function and EventBridge schedule to run the AWS Glue column statistics generation on a schedule

To launch the CloudFormation stack, complete the following steps:

  1. Sign in to the AWS CloudFormation console.
  2. Choose Launch Stack.
  3. Choose Next.
  4. Leave the parameters as default or make appropriate changes based on your requirements, then choose Next.
  5. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Create.

This stack can take around 10 minutes to complete, after which you can view the deployed stack on the AWS CloudFormation console.

Run an AWS Glue job to create Iceberg tables for the 3 TB TPC-DS dataset

When the CloudFormation stack creation is complete, run the AWS Glue job to create Iceberg tables for the TPC-DS dataset. This AWS Glue job extracts the TPC-DS dataset from the public S3 bucket and transforms the data into Iceberg tables. The tables are loaded into your S3 bucket and registered in the Data Catalog.
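As a rough illustration of what such a job does, the following is a minimal PySpark sketch that rewrites one source table as an Iceberg table registered in the Data Catalog. It assumes a Spark environment with the Iceberg runtime and AWS integrations available; the catalog name, bucket paths, and table name are illustrative placeholders, not the exact code of the provided job.

# Minimal PySpark sketch (not the provided job's exact code). Assumes the Iceberg
# Spark runtime and AWS bundle are on the classpath; paths and names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://<your-bucket>/iceberg-warehouse/")
    .getOrCreate()
)

# Read one TPC-DS table from the public source location and rewrite it as an
# Iceberg table; the Data Catalog tracks the resulting table's metadata location.
source_df = spark.read.parquet("s3://<public-tpcds-bucket>/catalog_sales/")
source_df.writeTo("glue_catalog.tpcdsdbwithstats.catalog_sales").using("iceberg").createOrReplace()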

To run the AWS Glue job, complete the following steps:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Choose InitialDataLoadJob-<your-stack-name>.
  3. Choose Run.

This AWS Glue job can take around 30 minutes to complete. The process is complete when the job status shows as Succeeded.

The AWS Glue job creates tables storing the TPC-DS dataset in two identical databases: tpcdsdbnostats and tpcdsdbwithstats. The tables in tpcdsdbnostats will have no generated statistics, and we use them as a reference. We generate statistics on the tables in tpcdsdbwithstats. Confirm the creation of these two databases and the underlying tables on the AWS Glue console. At this point, both databases hold the same data and no statistics have been generated on the tables.

Run queries on Redshift Spectrum without statistics

In the previous steps, you set up a Redshift Serverless workgroup with the given RPU capacity (128 by default), prepared the TPC-DS 3 TB dataset in your S3 bucket, and created Iceberg tables (which currently have no statistics).

To run your queries in Amazon Redshift, complete the following steps:

  1. Download the Amazon Redshift queries.
  2. In the Redshift query editor v2, run the queries listed in the Redshift Query for tables without column statistics section of the downloaded file redshift-tpcds-sample.sql.
  3. Note the runtime of each query. If you prefer to script this step, see the sketch after this list.
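The following is a minimal sketch that runs one query through the Amazon Redshift Data API and reads back its duration; the workgroup name, database name, and query text are illustrative assumptions for this walkthrough.

# Minimal sketch using the Amazon Redshift Data API (boto3). The workgroup name,
# database name, and query text are illustrative assumptions.
import time

import boto3

client = boto3.client("redshift-data")

resp = client.execute_statement(
    WorkgroupName="<your-redshift-serverless-workgroup>",
    Database="dev",
    Sql='select count(*) from "awsdatacatalog"."tpcdsdbnostats"."call_center";',
)
statement_id = resp["Id"]

# Poll until the statement finishes, then report its duration.
while True:
    desc = client.describe_statement(Id=statement_id)
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(5)

print(f"status: {desc['Status']}, duration: {desc.get('Duration', 0) / 1e9:.2f} s")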

Generate Iceberg column statistics

To generate statistics on the Data Catalog tables, complete the following steps:

  1. On the AWS Glue console, choose Databases under Data Catalog in the navigation pane.
  2. Choose the tpcdsdbwithstats database to view all available tables.
  3. Select any of these tables (for example, call_center).
  4. Go to Column statistics – new and choose Generate statistics.
  5. Keep the default options:
    1. For Choose columns, select Table (All columns).
    2. For Row sampling options, select All rows.
    3. For IAM role, choose AWSGluestats-blog-<your-stack-name>.
  6. Choose Generate statistics.

You can see the status of the statistics generation run, as shown in the following screenshot.

After you generate the Iceberg table column statistics, you should be able to see detailed column statistics for that table.

After the statistics are generated, you will find an <id>.stat file in the AWS Glue table's underlying data location in Amazon S3. This is a Puffin file that stores the Theta Sketch data structure. Query engines can use this Theta Sketch to efficiently estimate the NDV when operating on the table, which helps optimize query performance.

Repeat the previous steps to generate statistics for all of the tables, such as catalog_sales, catalog_returns, warehouse, item, date_dim, store_sales, customer, customer_address, web_sales, time_dim, ship_mode, web_site, and web_returns. Alternatively, you can manually run the Lambda function that instructs AWS Glue to generate column statistics for all tables, as shown in the sketch that follows. We discuss the details of this function later in this post.
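One way to trigger that Lambda function manually from a script is shown in the following minimal sketch; the function name matches the one created by this post's CloudFormation template, and the asynchronous invocation type is an assumption.

# Minimal sketch: manually invoke the statistics-generation Lambda function created
# by the CloudFormation template. Asynchronous ("Event") invocation is an assumption;
# a synchronous "RequestResponse" invocation would also work.
import boto3

lambda_client = boto3.client("lambda")

response = lambda_client.invoke(
    FunctionName="GlueTableStatisticsFunctionv1",
    InvocationType="Event",  # fire-and-forget; the function loops over the tables
)
print(f"invocation accepted with HTTP status {response['StatusCode']}")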

After you generate statistics for all the tables, you can assess the query performance for each query.

Run queries on Redshift Spectrum with statistics

In the previous steps, you set up a Redshift Serverless workgroup with the given RPU capacity (128 by default), prepared the TPC-DS 3 TB dataset in your S3 bucket, and created Iceberg tables with column statistics.

To run the provided queries using Redshift Spectrum on the tables with statistics, complete the following steps:

  1. In the Redshift query editor v2, run the queries listed in the Redshift Query for tables with column statistics section of the downloaded file redshift-tpcds-sample.sql.
  2. Note the runtime of each query.

With Redshift Serverless at 128 RPU and the TPC-DS 3 TB dataset, we performed sample runs for 10 selected TPC-DS queries where NDV information was expected to be useful. We ran each query 10 times. The results shown in the following table are sorted by the percentage of the performance improvement for the queries with column statistics.

TPC-DS 3 TB Query   Without Column Statistics   With Column Statistics   Performance Improvement (%)
Query 16            305.0284                    51.7807                  489.1
Query 75            398.0643                    110.8366                 259.1
Query 78            169.8358                    52.8951                  221.1
Query 95            35.2996                     11.1047                  217.9
Query 94            160.52                      57.0321                  181.5
Query 68            14.6517                     7.4745                   96
Query 4             217.8954                    121.996                  78.6
Query 72            123.8698                    76.215                   62.5
Query 29            22.0769                     14.8697                  48.5
Query 25            43.2164                     32.8602                  31.5

The results demonstrate clear performance benefits ranging from 31.5% to 489.1%.

To dive deep, let's explore query 16, which showed the biggest performance benefit:

TPC-DS Query 16:

select
   count(distinct cs_order_number) as "order count"
  ,sum(cs_ext_ship_cost) as "total shipping cost"
  ,sum(cs_net_profit) as "total net profit"
from
   "awsdatacatalog"."tpcdsdbwithstats"."catalog_sales" cs1
  ,"awsdatacatalog"."tpcdsdbwithstats"."date_dim"
  ,"awsdatacatalog"."tpcdsdbwithstats"."customer_address"
  ,"awsdatacatalog"."tpcdsdbwithstats"."call_center"
where
    d_date between '2000-2-01'
    and dateadd(day, 60, cast('2000-2-01' as date))
    and cs1.cs_ship_date_sk = d_date_sk
    and cs1.cs_ship_addr_sk = ca_address_sk
    and ca_state = 'AL'
    and cs1.cs_call_center_sk = cc_call_center_sk
    and cc_county in ('Dauphin County','Levy County','Luce County','Jackson County',
                    'Daviess County')
and exists (select *
            from "awsdatacatalog"."tpcdsdbwithstats"."catalog_sales" cs2
            where cs1.cs_order_number = cs2.cs_order_number
            and cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk)
and not exists(select *
               from "awsdatacatalog"."tpcdsdbwithstats"."catalog_returns" cr1
               where cs1.cs_order_number = cr1.cr_order_number)
order by count(distinct cs_order_number)
limit 100;

You can compare the query plans with and without column statistics by running the EXPLAIN command.

The following screenshot shows the results without column statistics.

The following screenshot shows the results with column statistics.

You can observe some notable differences as a result of using column statistics. At a high level, the overall estimated cost of the query is significantly reduced, from 20633217995813352.00 to 331727324110.36.

The two query plans chose different join strategies.

The following is one line included in the query plan without column statistics:

XN Hash Join DS_DIST_BOTH (cost=45365031.50 rows=10764790749 width=44)
  Outer Dist Key: "outer".cs_order_number
  Inner Dist Key: volt_tt_61c54ae740984.cs_order_number
  Hash Cond: (("outer".cs_order_number = "inner".cs_order_number) AND ("outer".cs_warehouse_sk = "inner".cs_warehouse_sk))

The following is the corresponding line in the query plan with column statistics:

XN Hash Join DS_BCAST_INNER (cost=307193250965.64..327130154786.68 rows=17509398 width=32)
  Hash Cond: (("outer".cs_order_number = "inner".cs_order_number) AND ("outer".cs_warehouse_sk = "inner".cs_warehouse_sk))

The query plan for the table without column statistics used DS_DIST_BOTH when joining large tables, whereas the query plan for the table with column statistics chose DS_BCAST_INNER. The join order has also changed based on the column statistics. These join strategy and join order changes are primarily driven by more accurate join cardinality estimates, which are possible with column statistics, and they result in a more optimized query plan.

Schedule AWS Glue column statistics runs

Maintaining up-to-date column statistics is crucial for optimal query performance. This section guides you through automating the generation of Iceberg table column statistics using Lambda and EventBridge Scheduler. This automation keeps your column statistics up to date without manual intervention.

The required Lambda function and EventBridge schedule are already created by the CloudFormation template. The Lambda function is used to invoke the AWS Glue column statistics run. First, complete the following steps to explore how the Lambda function is configured:

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Open the function GlueTableStatisticsFunctionv1.

For a clearer understanding of the Lambda function, we recommend reviewing the code in the Code section and examining the environment variables under Configuration.

The Lambda function invokes the start_column_statistics_task_run API through the AWS SDK for Python (Boto3).
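The function's actual code ships with the CloudFormation template; the following is a minimal sketch of a handler that makes the same call for every table in a database. The DATABASE_NAME and ROLE_ARN environment variable names are chosen here for illustration and are not necessarily those used by the template.

# Minimal sketch of a Lambda handler that starts an AWS Glue column statistics run
# for every table in a database. Environment variable names are illustrative assumptions.
import os

import boto3

glue = boto3.client("glue")


def lambda_handler(event, context):
    database_name = os.environ["DATABASE_NAME"]   # e.g. "tpcdsdbwithstats"
    role_arn = os.environ["ROLE_ARN"]             # IAM role AWS Glue assumes for the run

    started = []
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database_name):
        for table in page["TableList"]:
            # Kick off a column statistics task run for each table (all columns).
            run = glue.start_column_statistics_task_run(
                DatabaseName=database_name,
                TableName=table["Name"],
                Role=role_arn,
            )
            started.append(run["ColumnStatisticsTaskRunId"])

    return {"startedTaskRuns": started}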

Next, complete the following steps to explore how the EventBridge schedule is configured:

  1. On the EventBridge console, choose Schedules under Scheduler in the navigation pane.
  2. Locate the schedule created by the CloudFormation template.

This page is where you manage and configure the schedules for your events. As shown in the following screenshot, the schedule is configured to invoke the Lambda function daily at a specific time (in this case, 08:27 PM UTC). This ensures the AWS Glue column statistics run on a regular and predictable basis.
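If you want to create a similar schedule outside of CloudFormation, the following is a minimal sketch using the EventBridge Scheduler API; the schedule name, Lambda function ARN, and scheduler execution role ARN are illustrative placeholders.

# Minimal sketch: create a daily EventBridge Scheduler schedule that invokes the
# statistics Lambda function at 20:27 UTC. Names and ARNs below are placeholders.
import boto3

scheduler = boto3.client("scheduler")

scheduler.create_schedule(
    Name="glue-column-statistics-daily",
    ScheduleExpression="cron(27 20 * * ? *)",      # every day at 20:27 UTC
    ScheduleExpressionTimezone="UTC",
    FlexibleTimeWindow={"Mode": "OFF"},
    Target={
        "Arn": "arn:aws:lambda:<region>:<account-id>:function:GlueTableStatisticsFunctionv1",
        "RoleArn": "arn:aws:iam::<account-id>:role/<scheduler-execution-role>",
    },
)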

Clean up

When you have finished all of the preceding steps, remember to clean up the AWS resources you created using AWS CloudFormation:

  1. Delete the CloudFormation stack.
  2. Delete the S3 bucket storing the Iceberg tables for the TPC-DS dataset and the AWS Glue job script.

Conclusion

This post introduced a new feature in the Data Catalog that lets you create column-level statistics for Iceberg tables. The Iceberg table stores a Theta Sketch, which can be used to estimate NDV efficiently, in a Puffin file. The Redshift Spectrum CBO can use these statistics to optimize the query plan, resulting in improved query performance and potential cost savings.

Try out this new feature in the Data Catalog to generate column-level statistics and improve query performance, and let us know your feedback in the comments section. Visit the AWS Glue Data Catalog documentation to learn more.


About the Authors

Sotaro Hikita is a Solutions Architect. He supports customers in a wide range of industries, especially the financial industry, to build better solutions. He is particularly passionate about big data technologies and open source software.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Kyle Duong is a Senior Software Development Engineer on the AWS Glue and AWS Lake Formation team. He is passionate about building big data technologies and distributed systems.

Kalaiselvi Kamaraj is a Senior Software Development Engineer with Amazon. She has worked on several projects within the Amazon Redshift query processing team and is currently focusing on performance-related projects for Redshift data lakes.

Sandeep Adwankar is a Senior Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.
