
Capture data lineage from dbt, Apache Airflow, and Apache Spark with Amazon SageMaker


The next generation of Amazon SageMaker is the center for your data, analytics, and AI. SageMaker brings together AWS artificial intelligence and machine learning (AI/ML) and analytics capabilities and delivers an integrated experience for analytics and AI with unified access to data. From Amazon SageMaker Unified Studio, a single interface, you can access your data and use a suite of powerful tools for data processing, SQL analytics, model development, training and inference, as well as generative AI development. This unified experience is assisted by Amazon Q and Amazon SageMaker Catalog (powered by Amazon DataZone), which deliver an embedded generative AI and governance experience at every step.

With data lineage, now part of SageMaker Catalog, domain administrators and data producers can centralize lineage metadata for their data assets in one place. You can track the flow of data over time, giving you a clear understanding of where it originated, how it has changed, and its ultimate use across the business. By providing this level of transparency around the origin of data, data lineage helps data consumers gain trust that the data is suitable for their use case. Because data lineage is captured at the table, column, and job level, data producers can also conduct impact analysis and respond to data issues when needed.

Capture of data lineage in SageMaker begins after connections and data sources are configured, and lineage events are generated when data is transformed in AWS Glue or Amazon Redshift. This capability is also fully compatible with OpenLineage, so you can further extend data lineage capture to other data processing tools. This post walks you through how to use the OpenLineage-compatible API of SageMaker or Amazon DataZone to push data lineage events programmatically from tools supporting the OpenLineage standard like dbt, Apache Airflow, and Apache Spark.

Solution overview

Many third-party and open source tools that are used today to orchestrate and run data pipelines, like dbt, Airflow, and Spark, actively support the OpenLineage standard to offer interoperability across environments. With this capability, you only need to include and configure the appropriate library for your environment to be able to stream lineage events from jobs running on the tool directly to their corresponding output logs or to a target HTTP endpoint that you specify.

With the target HTTP endpoint option, you can introduce a pattern to post lineage events from these tools into SageMaker or Amazon DataZone to further help you centralize governance of your data assets and processes in one place. This pattern takes the form of a proxy, and its simplified architecture is illustrated in the following figure.

The way that the proxy for OpenLineage works is simple:

  • Amazon API Gateway exposes an HTTP endpoint and path. Jobs running with the OpenLineage package on top of the supported data processing tools can be set up with the HTTP transport option pointing to this endpoint and path. If connectivity allows, lineage events will be streamed into this endpoint as the job runs.
  • An Amazon Simple Queue Service (Amazon SQS) queue buffers the events as they arrive. By storing them in a queue, you have the option to implement strategies for retries and errors when needed. For cases where event order is required, we recommend using first-in-first-out (FIFO) queues; however, SageMaker and Amazon DataZone are able to map incoming OpenLineage events even when they are out of order.
  • An AWS Lambda function retrieves events from the queue in batches. For every event in a batch, the function can perform transformations when needed and post the resulting event to the target SageMaker or Amazon DataZone domain (see the sketch after this list).
  • Although it's not shown in the architecture, AWS Identity and Access Management (IAM) and Amazon CloudWatch are key capabilities that allow secure interaction between resources with minimal permissions, plus logging for troubleshooting and observability.
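
To make the Lambda step more concrete, the following is a minimal sketch of such a handler, not the sample's actual code: it drains an SQS batch and forwards each OpenLineage event to the domain through the Amazon DataZone PostLineageEvent API. The DOMAIN_ID environment variable and the assumption that each SQS record body carries one raw OpenLineage event are illustrative choices.

import os

import boto3

# Hypothetical environment variable holding the target SageMaker or Amazon DataZone domain ID
DOMAIN_ID = os.environ["DOMAIN_ID"]

datazone = boto3.client("datazone")


def handler(event, context):
    # Assumes each SQS record body is one raw OpenLineage run event,
    # as posted to the queue through the API Gateway endpoint
    for record in event["Records"]:
        lineage_event = record["body"]
        # Transformations, retries, or dead-lettering would go here when needed
        datazone.post_lineage_event(
            domainIdentifier=DOMAIN_ID,
            event=lineage_event.encode("utf-8"),
        )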

The AWS sample OpenLineage HTTP Proxy for Amazon SageMaker Governance and Amazon DataZone provides a working implementation of this simplified architecture that you can test and customize as needed. To deploy it in a test environment, follow the steps described in the repository. We use an AWS CloudFormation template to deploy the solution resources.

After you have deployed the OpenLineage HTTP Proxy solution, you can use it to post lineage events from data processing tools like dbt, Airflow, and Spark into a SageMaker or Amazon DataZone domain, as shown in the following examples.

Set up the OpenLineage package for Spark in AWS Glue 4.0

AWS Glue added native support for OpenLineage with AWS Glue 5.0 (to learn more, see Introducing AWS Glue 5.0 for Apache Spark). For jobs that are still running on AWS Glue 4.0, you can still stream OpenLineage events into SageMaker or Amazon DataZone by using the OpenLineage HTTP Proxy solution. This serves as an example that can be applied to other platforms running Spark, like Amazon EMR, third-party solutions, or self-managed clusters.
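
As an illustration of the same configuration outside of AWS Glue, the following is a minimal PySpark sketch for a self-managed cluster, under stated assumptions: <url> and <endpoint> stand for the outputs of the OpenLineage HTTP Proxy stack, the package coordinate is the Scala 2.12 build of the listener, and the namespace label is an arbitrary example.

from pyspark.sql import SparkSession

# Minimal sketch: attach the OpenLineage Spark listener on a self-managed
# cluster and point its HTTP transport at the proxy. Replace <url> and
# <endpoint> with the outputs of the OpenLineage HTTP Proxy CloudFormation stack.
spark = (
    SparkSession.builder.appName("openlineage-example")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark_2.12:1.9.1")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "<url>")
    .config("spark.openlineage.transport.endpoint", "/<endpoint>")
    .config("spark.openlineage.namespace", "my-spark-cluster")  # illustrative namespace
    .getOrCreate()
)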

To properly add OpenLineage capabilities to an AWS Glue 4.0 job and configure it to stream lineage events into the OpenLineage HTTP Proxy solution, complete the following steps:

  1. Download the official OpenLineage package for Spark. For our example, we used the JAR package for Scala 2.12, release 1.9.1.
  2. Store the JAR file in an Amazon Simple Storage Service (Amazon S3) bucket that can be accessed by your AWS Glue job.
  3. On the AWS Glue console, open your job.
  4. Under Libraries, for Dependent JARs path, enter the path of the JAR package stored in your S3 bucket.

  5. In the Job parameters section, add the following parameters:
    1. Enable the OpenLineage package:
      1. Key: --user-jars-first
      2. Value: true
    2. Configure how the OpenLineage package will be used to stream lineage events. Replace <url> and <endpoint> with the corresponding values of the OpenLineage HTTP Proxy solution. These values can be found as outputs of the deployed CloudFormation stack. Replace <account-id> with your AWS account ID.
      1. Key: --conf
      2. Value:
        spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener
        --conf spark.openlineage.transport.type=http
        --conf spark.openlineage.transport.url=<url>
        --conf spark.openlineage.transport.endpoint=/<endpoint>
        --conf spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;]
        --conf spark.glue.accountId=<account-id>

With this setup, the AWS Glue 4.0 job will use the HTTP transport option of the OpenLineage package to stream lineage events into the OpenLineage proxy, which will post the events to the SageMaker or Amazon DataZone domain.

  6. Run the AWS Glue 4.0 job.

The job's resulting datasets need to be sourced into SageMaker or Amazon DataZone so that OpenLineage events are mapped to them. As you explore the sourced dataset in SageMaker Unified Studio, you can follow its origin path as described by the OpenLineage events streamed through the OpenLineage proxy.

When working with Amazon DataZone, you get the same result.

The origin path in this example is extensive and maps the resulting dataset down to its origin, in this case a couple of tables hosted in a relational database and transformed through a data pipeline with two AWS Glue 4.0 (Spark) jobs.

Set up the OpenLineage package for dbt

dbt has quickly become a popular framework to build data pipelines on top of data processing and data warehouse tools like Amazon Redshift, Amazon EMR, and AWS Glue, as well as other traditional and third-party solutions. This framework supports OpenLineage as a way to standardize the generation of lineage events and integrate with the growing data governance ecosystem. dbt deployments might vary per environment, which is why we don't dive into the specifics in this post. However, to simply configure your dbt project to use the OpenLineage HTTP Proxy solution, complete the following steps:

  1. Install the OpenLineage package for dbt. You can learn more in the OpenLineage documentation.
  2. In the root folder of your dbt project, create an openlineage.yml file where you can specify the transport configuration. Replace <url> and <endpoint> with the values of the OpenLineage HTTP Proxy solution. These values can be found as outputs of the deployed CloudFormation stack.

transport:
  type: http
  url: <url>
  endpoint: <endpoint>
  timeout: 5

  3. Run your dbt pipeline. As explained in the OpenLineage documentation, instead of running the standard dbt run command, you run the dbt-ol run command. The latter is just a wrapper on top of the standard dbt run command so that lineage events are captured and streamed as configured.

The job's resulting datasets need to be sourced into SageMaker or Amazon DataZone so that OpenLineage events are mapped to them. As you explore the sourced dataset in SageMaker Unified Studio, you can follow its lineage path as described by the OpenLineage events streamed through the OpenLineage proxy.

When working with Amazon DataZone, you get the same result.

In this example, the dbt project is running on top of Amazon Redshift, which is a common use case among customers. Amazon Redshift is integrated with SageMaker and Amazon DataZone for automatic lineage capture, but those capabilities were not used as part of this example, to illustrate how you can still integrate OpenLineage events from dbt using the pattern implemented in the OpenLineage HTTP Proxy solution. The dbt pipeline is made of two stages running sequentially, which are illustrated in the origin path as the nodes with the dbt type.

Set up the OpenLineage package for Airflow

Airflow is a well-positioned tool to orchestrate data pipelines at any scale. AWS provides Amazon Managed Workflows for Apache Airflow (Amazon MWAA) as a managed alternative for customers that want to reduce management overhead and accelerate the development of their data strategy with Airflow in a cost-effective way. Airflow also supports OpenLineage, so you can centralize lineage with tools like SageMaker and Amazon DataZone.

The following steps are specific to Amazon MWAA, but they can be extrapolated to other forms of Airflow deployment:

  1. Install the OpenLineage package for Airflow. You can learn more in the OpenLineage documentation. For versions 2.7 and later, it's recommended to use the native Airflow OpenLineage package (apache-airflow-providers-openlineage), which is what this example uses.
  2. To install the package, add it to the requirements.txt file that you store in Amazon S3 and point to when provisioning your Amazon MWAA environment. To learn more, refer to Managing Python dependencies in requirements.txt.
  3. As you install the OpenLineage package or afterward, you can configure it to send lineage events to the OpenLineage proxy:
    1. When filling the form to create a new Amazon MWAA environment or edit an existing one, in the Airflow configuration options section, add the following. Replace <url> and <endpoint> with the values of the OpenLineage HTTP Proxy solution. These values can be found as outputs of the deployed CloudFormation stack:
      1. Configuration option: openlineage.transport
      2. Custom value: {"type": "http", "url": "<url>", "endpoint": "<endpoint>"}

  4. Run your pipeline. No lineage-specific code is needed in the DAG itself, as the sketch after these steps shows.
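
The following minimal DAG sketch illustrates this, under stated assumptions: the DAG ID, the redshift_default connection, and the SQL statement are placeholders, and lineage coverage depends on the operator being supported by the OpenLineage provider.

from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

# Minimal sketch: the apache-airflow-providers-openlineage package emits
# lineage events for supported operators automatically, using the transport
# configured at the environment level; the DAG contains no lineage code.
with DAG(
    dag_id="load_sales_summary",  # hypothetical pipeline name
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
):
    SQLExecuteQueryOperator(
        task_id="load_summary_table",
        conn_id="redshift_default",  # assumed Airflow connection to Amazon Redshift
        sql="INSERT INTO analytics.sales_summary SELECT * FROM staging.sales",
    )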

The Airflow tasks will automatically use the transport configuration to stream lineage events into the OpenLineage proxy as they run. The tasks' resulting datasets need to be sourced into SageMaker or Amazon DataZone so that OpenLineage events are mapped to them. As you explore the sourced dataset in SageMaker Unified Studio, you can follow its origin path as described by the OpenLineage events streamed through the OpenLineage proxy.

When working with Amazon DataZone, you get the same result.

In this example, the Amazon MWAA Directed Acyclic Graph (DAG) is working on top of Amazon Redshift, similar to the dbt example before. However, it is still not using the native integration for automatic lineage capture between Amazon Redshift and SageMaker or Amazon DataZone. This way, we can illustrate how you can still integrate OpenLineage events from Airflow using the pattern implemented in the OpenLineage HTTP Proxy solution. The Airflow DAG is made of a single task that outputs the resulting dataset by using datasets that were created as part of the dbt pipeline in the previous example. This is illustrated in the origin path, which includes nodes with the dbt type and a node with the AIRFLOW type. With this final example, note how SageMaker and Amazon DataZone map all datasets and jobs to reflect the reality of your data pipelines.

Additional considerations when implementing the OpenLineage proxy pattern

The OpenLineage proxy pattern implemented in the sample OpenLineage HTTP Proxy solution and presented in this post has proven to be a practical approach to integrate a growing set of data processing tools into a centralized data governance strategy on top of SageMaker. We encourage you to dive into it and use it in your test environments to learn how it can best serve your specific setup. If you are interested in taking this pattern to production, we suggest you first review it thoroughly and customize it to your particular needs. The following are some items worth reviewing as you evaluate this pattern implementation:

  • The solution used in the examples of this post uses a public API endpoint with no authentication or authorization mechanism. For a production workload, we recommend limiting access to the endpoint to a minimum so that only authorized sources are able to stream messages into it. To learn more, refer to Control and manage access to HTTP APIs in API Gateway.
  • The logic implemented in the Lambda function is intended to be customized depending on your use case. You might need to implement transformation logic, depending on how OpenLineage events are created by the tool you are using. As a reference, for the Amazon MWAA example in this post, some minor transformations were required on the name and namespace fields of the inputs and outputs elements of the event for full compatibility with the format expected for Amazon Redshift datasets, as described in the dataset naming conventions of OpenLineage (see the sketch after this list). You might also want to change how the function logs execution details, or include retry and error logic, and more.
  • The SQS queue used in the OpenLineage HTTP Proxy solution is a standard queue, which means that events aren't delivered in order. If ordering is a requirement, you could use FIFO queues instead.
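
As a hedged sketch of that kind of adjustment, the following normalization function rewrites dataset identifiers to follow the OpenLineage dataset naming convention for Amazon Redshift, namespace redshift://{cluster}.{region}:{port} and name {database}.{schema}.{table}. The exact fields you need to touch depend on how your tool emits events; the assumption here, for illustration only, is that the tool emitted schema.table names without the database qualifier.

def normalize_event(event: dict, cluster: str, region: str, database: str, port: int = 5439) -> dict:
    # Sketch of a transformation the Lambda function might apply before
    # posting an event; the input shape is an assumption for illustration
    namespace = f"redshift://{cluster}.{region}:{port}"
    for io_key in ("inputs", "outputs"):
        for dataset in event.get(io_key, []):
            dataset["namespace"] = namespace
            name = dataset.get("name", "")
            # Assume the tool emitted "schema.table"; prefix the database to
            # match the expected "{database}.{schema}.{table}" format
            if name.count(".") == 1:
                dataset["name"] = f"{database}.{name}"
    return event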

For cases where you want to post OpenLineage events directly into SageMaker or Amazon DataZone, without using the proxy pattern explained in this post, a custom transport is now available as an extension of the OpenLineage project as of version 1.33.0. Use this feature in cases where you don't need additional controls over your OpenLineage event stream, for example, if you don't need any custom transformation logic.
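
For reference, posting an event directly with the AWS SDK might look like the following minimal boto3 sketch of the OpenLineage-compatible PostLineageEvent API. The domain ID, run ID, and job names are illustrative placeholders, and a real producer would emit matching START and COMPLETE events.

import json

import boto3

datazone = boto3.client("datazone")

# Minimal OpenLineage run event; all identifiers are illustrative placeholders
run_event = {
    "eventType": "START",
    "eventTime": "2025-08-23T12:00:00.000Z",
    "run": {"runId": "01890a5d-ac98-7d24-b542-aaaaaaaaaaaa"},
    "job": {"namespace": "example-namespace", "name": "example-job"},
    "inputs": [],
    "outputs": [],
    "producer": "https://example.com/lineage-producer",
    "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/definitions/RunEvent",
}

datazone.post_lineage_event(
    domainIdentifier="dzd_exampledomainid",  # placeholder domain ID
    event=json.dumps(run_event).encode("utf-8"),
)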

Summary

In this post, we showed how to use the OpenLineage-compatible APIs of SageMaker to capture data lineage from any tool supporting this standard, by following an architectural pattern introduced as the OpenLineage proxy. We provided some examples of how you can set up tools like dbt, Airflow, and Spark to stream lineage events to the OpenLineage proxy, which subsequently posts them to a SageMaker or Amazon DataZone domain. Finally, we introduced a working implementation of this pattern that you can test, and discussed some considerations when taking this same pattern to production.

The SageMaker compatibility with OpenLineage can help simplify governance of your data assets and increase trust in your data. This capability is among the features that are now available to build a comprehensive governance strategy powered by data lineage, data quality, business metadata, data discovery, access automation, and more. By bundling data governance capabilities with the growing set of tools available for data and AI development, you can derive value from your data faster and get closer to consolidating a data-driven culture. Try out this solution and get started with SageMaker to join the growing set of customers that are modernizing their data platform.


About the authors

Jose Romero is a Senior Solutions Architect for Startups at AWS, based in Austin, Texas. He is passionate about helping customers architect modern platforms at scale for data, AI, and ML. As a former senior architect with AWS Professional Services, he enjoys building and sharing solutions for common complex problems so that customers can accelerate their cloud journey and adopt best practices. Connect with him on LinkedIn.

Priya Tiruthani is a Senior Technical Product Manager with Amazon SageMaker Catalog (Amazon DataZone) at AWS. She focuses on building products and capabilities in data analytics and governance. She is passionate about building innovative products to address and simplify customers' challenges in their end-to-end data journey. Outside of work, she enjoys being outdoors to hike and capture nature's beauty. Connect with her on LinkedIn.
