Effective collaboration and scalability are essential for building efficient data pipelines. However, data modeling teams often face challenges with complex extract, transform, and load (ETL) tools that require programming expertise and a deep understanding of infrastructure. This complexity can lead to operational inefficiencies and difficulties in maintaining data quality at scale.
dbt addresses these challenges by providing a simpler approach in which data teams build robust data models using SQL, a language they already know. When integrated with modern development practices, dbt projects can use version control for collaboration, incorporate testing for data quality, and take advantage of reusable components through macros. dbt also manages dependencies automatically, making sure data transformations execute in the correct sequence.
In this post, we explore a streamlined, configuration-driven approach to orchestrating dbt Core jobs using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) and Cosmos, an open source package. These jobs run transformations on Amazon Redshift, a fully managed data warehouse that enables fast, scalable analytics using standard SQL. With this setup, teams can collaborate effectively while maintaining data quality, operational efficiency, and observability. Key steps covered include:
- Creating a sample dbt project
- Enabling auditing within the dbt project to capture runtime metrics for each model
- Creating a GitHub Actions workflow to automate deployments
- Setting up Amazon Simple Notification Service (Amazon SNS) to proactively alert on failures
These enhancements enable model-level auditing, automated deployments, and real-time failure alerts. By the end of this post, you will have a practical and scalable framework for running dbt Core jobs with Cosmos on Amazon MWAA, so your team can deliver reliable data workflows faster.
Solution overview
The following diagram illustrates the solution architecture.
The workflow contains the following steps:
- Analytics engineers manage their dbt project in their version control tool. In this post, we use GitHub as an example.
- We configure an Apache Airflow Directed Acyclic Graph (DAG) to use the Cosmos library to create an Airflow task group that contains all the dbt models that are part of the dbt project.
- We use a GitHub Actions workflow to sync the dbt project files and the DAG to an Amazon Simple Storage Service (Amazon S3) bucket.
- During the DAG run, dbt converts the models, tests, and macros to Amazon Redshift SQL statements, which run directly on the Redshift cluster.
- If a task in the DAG fails, the DAG invokes an AWS Lambda function to send a notification using Amazon SNS.
Prerequisites
You must have the following prerequisites:
Create a dbt project
A dbt project is structured to facilitate modular, scalable, and maintainable data transformations. The following is a sample dbt project structure that this post follows:
MY_SAMPLE_DBT_PROJECT
├── .github
│   └── workflows
│       └── publish_assets.yml
└── src
    ├── dags
    │   └── dbt_sample_dag.py
    └── my_sample_dbt_project
        ├── macros
        ├── models
        └── dbt_project.yml
dbt uses the following YAML files (minimal example sketches follow this list):
- dbt_project.yml – Serves as the main configuration for your project. Objects in this project inherit settings defined here unless overridden at the model level.
- sources.yml – Defines the external data sources that your dbt models reference.
- schema.yml – Outlines the schema of your models and the data quality tests. In the following example, we have defined two columns: full_name for the model model1 and sales_id for model2. We have declared them as the primary key and defined data quality tests to check that the two columns are unique and not null.
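The original file contents aren't reproduced here, so the following is a minimal sketch of what these three files might look like for this sample project. The profile name, source names, and column descriptions are assumptions based on the structure described above.

```yaml
# dbt_project.yml (sketch)
name: my_sample_dbt_project
version: "1.0.0"
profile: my_sample_dbt_project
model-paths: ["models"]
macro-paths: ["macros"]

# sources.yml (sketch) -- declares the raw tables the models read from
version: 2
sources:
  - name: raw_sales            # hypothetical source schema
    schema: raw_sales
    tables:
      - name: customers
      - name: sales

# schema.yml (sketch) -- documents columns and data quality tests
version: 2
models:
  - name: model1
    columns:
      - name: full_name
        description: Primary key for model1
        tests:
          - unique
          - not_null
  - name: model2
    columns:
      - name: sales_id
        description: Primary key for model2
        tests:
          - unique
          - not_null
```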
Enable auditing within the dbt project
Enabling auditing within your dbt project is key to facilitating transparency, traceability, and operational oversight across your data pipeline. You can capture run metrics at the model level for each execution in an audit table. By capturing detailed run metrics such as the load identifier, runtime, and number of rows affected, teams can systematically monitor the health and performance of each load, quickly identify issues, and trace changes back to specific runs.
The audit table consists of the following attributes:
- load_id – An identifier for each model run executed as part of the load
- database_name – The name of the database into which data is being loaded
- schema_name – The name of the schema into which data is being loaded
- name – The name of the object into which data is being loaded
- resource_type – The type of object into which data is being loaded
- execution_time – The time taken for each dbt model to complete execution as part of each load
- rows_affected – The number of rows affected in the dbt model as part of the load
Complete the following steps to enable auditing within your dbt project (hedged sketches of the files created in these steps follow this list):
- Navigate to the models directory (src/my_sample_dbt_project/models) and create the audit_table.sql model file.
- Navigate to the macros directory (src/my_sample_dbt_project/macros) and create the parse_dbt_results.sql macro file.
- Navigate to the macros directory (src/my_sample_dbt_project/macros) and create the log_audit_table.sql macro file.
- Append the required lines to the dbt_project.yml file (in the sketch that follows, this is an on-run-end hook that calls log_audit_table).
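The original file contents aren't shown above, so the following is a minimal sketch of one common way to implement this pattern: an empty incremental audit model, a macro that parses the dbt results object, and a macro that inserts the parsed rows into the audit table. Column types and the target schema are assumptions.

```sql
-- models/audit_table.sql (sketch): empty shell that the on-run-end hook appends to
{{ config(materialized='incremental', tags=['audit']) }}
select
    cast(null as varchar(512))  as load_id,
    cast(null as varchar(128))  as database_name,
    cast(null as varchar(128))  as schema_name,
    cast(null as varchar(128))  as name,
    cast(null as varchar(64))   as resource_type,
    cast(null as float)         as execution_time,
    cast(null as int)           as rows_affected
where 1 = 0

-- macros/parse_dbt_results.sql (sketch): flatten the dbt `results` object
{% macro parse_dbt_results(results) %}
    {% set parsed_results = [] %}
    {% for run_result in results %}
        {% if run_result.status == 'success' and run_result.node.resource_type in ['model', 'seed', 'snapshot'] %}
            {% do parsed_results.append({
                'load_id': invocation_id,
                'database_name': run_result.node.database,
                'schema_name': run_result.node.schema,
                'name': run_result.node.name,
                'resource_type': run_result.node.resource_type,
                'execution_time': run_result.execution_time,
                'rows_affected': run_result.adapter_response.get('rows_affected', 0)
            }) %}
        {% endif %}
    {% endfor %}
    {{ return(parsed_results) }}
{% endmacro %}

-- macros/log_audit_table.sql (sketch): insert one row per successful model run
{% macro log_audit_table(results) %}
    {% if execute %}
        {% set parsed_results = parse_dbt_results(results) %}
        {% if parsed_results | length > 0 %}
            {% set insert_query %}
                insert into {{ target.schema }}.audit_table
                    (load_id, database_name, schema_name, name, resource_type, execution_time, rows_affected)
                values
                {% for r in parsed_results %}
                    ('{{ r["load_id"] }}', '{{ r["database_name"] }}', '{{ r["schema_name"] }}',
                     '{{ r["name"] }}', '{{ r["resource_type"] }}', {{ r["execution_time"] }}, {{ r["rows_affected"] }}){{ "," if not loop.last }}
                {% endfor %}
            {% endset %}
            {% do run_query(insert_query) %}
        {% endif %}
    {% endif %}
{% endmacro %}
```

The lines appended to dbt_project.yml would then register the hook, for example an on-run-end entry that calls log_audit_table(results), so the macro runs after every dbt invocation.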
Create a GitHub Actions workflow
This step is optional. If you prefer, you can skip it and instead upload your files directly to your S3 bucket.
The following GitHub Actions workflow automates the deployment of the dbt project files and the DAG file to Amazon S3. Replace the placeholders {s3_bucket_name}, {account_id}, {role_name}, and {region} with your S3 bucket name, account ID, IAM role name, and AWS Region in the workflow file.
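The workflow file itself isn't reproduced above, so the following is a minimal sketch of a publish_assets.yml that syncs both directories to Amazon S3 using OIDC-based authentication. The branch name, action versions, and S3 key prefixes are assumptions; adjust them to your repository and bucket layout.

```yaml
# .github/workflows/publish_assets.yml (sketch)
name: Publish dbt project and DAG to Amazon S3

on:
  push:
    branches: [main]

permissions:
  id-token: write   # required for OIDC authentication with IAM
  contents: read

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository
        uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::{account_id}:role/{role_name}
          aws-region: "{region}"

      - name: Sync dbt project files to S3
        run: aws s3 sync src/my_sample_dbt_project s3://{s3_bucket_name}/dags/my_sample_dbt_project --delete

      - name: Sync DAG file to S3
        run: aws s3 sync src/dags s3://{s3_bucket_name}/dags
```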
To enhance security, we recommend using OpenID Connect (OIDC) for authentication with IAM roles in GitHub Actions instead of relying on long-lived access keys.
GitHub has the following security requirements:
- Branch protection rules – Before proceeding with the GitHub Actions workflow, make sure branch protection rules are in place. These rules enforce required status checks before merging code into protected branches (such as main).
- Code review guidelines – Implement code review processes to confirm changes undergo review. This can include requiring at least one approving review before code is merged into the protected branch.
- Incorporate security scanning tools – These can help detect vulnerabilities in your repository.
Make sure you also adhere to dbt-specific security best practices:
- Pay attention to dbt macros with variables and validate their inputs.
- When adding new packages to your dbt project, evaluate their security, compatibility, and maintenance status to confirm they don't introduce vulnerabilities or conflicts into your project.
- Review dynamically generated SQL to safeguard against issues like SQL injection.
Update the Amazon MWAA instance
Complete the following steps to update the Amazon MWAA instance:
- Install the Cosmos library on Amazon MWAA by adding astronomer-cosmos to the requirements.txt file. Make sure to check version compatibility between Amazon MWAA and the Cosmos library.
- Add the following entries to your startup.sh script (a hedged sketch follows this list):
  - In this code, DBT_VENV_PATH specifies the location where the Python virtual environment for dbt will be created. DBT_PROJECT_PATH points to the location of your dbt project within Amazon MWAA.
  - The script creates a Python virtual environment at the path ${DBT_VENV_PATH} and installs the dbt-redshift adapter to run dbt transformations on Amazon Redshift.
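The startup script entries aren't reproduced above, so the following is a minimal sketch of what they might look like. The exact paths are assumptions; adjust them to match where your dbt project lands in the Amazon MWAA environment.

```sh
#!/bin/sh
# startup.sh (sketch)

# Location of the dbt virtual environment and the dbt project inside Amazon MWAA
export DBT_VENV_PATH="${AIRFLOW_HOME}/dbt_venv"
export DBT_PROJECT_PATH="${AIRFLOW_HOME}/dags/my_sample_dbt_project"

# Create an isolated Python virtual environment and install the Redshift adapter
python3 -m venv "${DBT_VENV_PATH}"
"${DBT_VENV_PATH}/bin/pip" install dbt-redshift
```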
Create a dbt user in Amazon Redshift and store credentials
To create dbt models in Amazon Redshift, you need to set up a native Redshift user with the necessary permissions to access source tables and create new tables. It's essential to create separate database users with minimal permissions to follow the principle of least privilege. The dbt user shouldn't be granted admin privileges; instead, it should only have access to the specific schemas required for its tasks.
Complete the following steps:
- Open the Amazon Redshift console and connect as an admin (for more details, refer to Connecting to an Amazon Redshift database).
- Run the following command in the query editor v2 to create a native user, and note down the values for dbt_user_name and password_value (hedged SQL sketches for these steps follow this list).
- Run the following commands in the query editor v2 to grant permissions to the native user:
  - Connect to the database from which you want to source tables and run the grant statements.
  - To allow the user to create tables within a schema, run the corresponding grant statement.
- Optionally, create a secret in AWS Secrets Manager and store the values for dbt_user_name and password_value from the previous step as plaintext.
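The SQL for these steps isn't reproduced above; the following is a minimal sketch of the kind of statements involved. The user name, password, and schema names are placeholders, so substitute your own values.

```sql
-- Create the native dbt user (run as admin); note the user name and password
CREATE USER dbt_user_name PASSWORD 'password_value';

-- In the database that contains the source tables, grant read access
GRANT USAGE ON SCHEMA source_schema TO dbt_user_name;
GRANT SELECT ON ALL TABLES IN SCHEMA source_schema TO dbt_user_name;

-- Allow the user to create tables (dbt models) in the target schema
GRANT USAGE ON SCHEMA target_schema TO dbt_user_name;
GRANT CREATE ON SCHEMA target_schema TO dbt_user_name;
```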
Creating a Secrets Manager secret is optional, but it's recommended for securely storing your credentials instead of hardcoding them. To learn more, refer to AWS Secrets Manager best practices.
Create a Redshift connection in Amazon MWAA
We create one Redshift connection in Amazon MWAA for each Redshift database, making sure that each data pipeline (DAG) can only access one database. This approach provides distinct access controls for each pipeline, helping prevent unauthorized access to data. Complete the following steps:
- Log in to the Amazon MWAA UI.
- On the Admin menu, choose Connections.
- Choose Add a new record.
- For Connection Id, enter a name for this connection.
- For Connection Type, choose Amazon Redshift.
- For Host, enter the endpoint of the Redshift cluster without the port and database name (for example, redshift-cluster-1.xxxxxx.us-east-1.redshift.amazonaws.com).
- For Database, enter the database of the Redshift cluster.
- For Port, enter the port of the Redshift cluster.
Set up an SNS notification
Setting up SNS notifications is optional, but they can be a helpful enhancement for receiving alerts on failures. Complete the following steps:
- Create an SNS topic.
- Create a subscription to the SNS topic.
- Create a Lambda function with the Python runtime.
- Modify the function code of your Lambda function, and replace {topic_arn} with your SNS topic Amazon Resource Name (ARN). A hedged sketch of the function follows this list.
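The function code isn't reproduced above, so the following is a minimal sketch of a handler that publishes a failure alert to the SNS topic. The event fields (such as dag_name) are assumptions about the payload the DAG passes when it invokes the function.

```python
# Lambda handler that publishes a dbt failure alert to Amazon SNS (sketch)
import json

import boto3

TOPIC_ARN = "{topic_arn}"  # replace with your SNS topic ARN

sns_client = boto3.client("sns")


def lambda_handler(event, context):
    # dag_name is an assumed field in the payload sent by the DAG
    dag_name = event.get("dag_name", "unknown DAG")
    message = f"A dbt task failed in DAG {dag_name}. Check the Amazon MWAA task logs for details."

    sns_client.publish(
        TopicArn=TOPIC_ARN,
        Subject=f"dbt failure alert: {dag_name}",
        Message=message,
    )
    return {"statusCode": 200, "body": json.dumps("Notification sent")}
```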
Configure a DAG
The following sample DAG orchestrates a dbt workflow for processing and auditing data models in Amazon Redshift. It retrieves credentials from Secrets Manager, runs dbt tasks in a virtual environment, and sends an SNS notification if a failure occurs. The workflow consists of the following steps:
- It begins with the audit_dbt_task task group, which creates the audit model.
- The transform_data task group executes the other dbt models, excluding the audit-tagged one. Inside the transform_data group, there are two dbt models, model1 and model2, and each is followed by a corresponding test task that runs the data quality tests defined in the schema.yml file.
- To properly detect and handle failures, the DAG includes a dbt_check Python task that runs a custom function, check_dbt_failures. This is necessary because when using DbtTaskGroup, individual model-level failures inside the group don't automatically propagate to the task group level. Consequently, downstream tasks (such as the Lambda operator sns_notification_for_failure) configured with trigger_rule="one_failed" will not be triggered unless a failure is explicitly raised.
The check_dbt_failures function addresses this by inspecting the results of each dbt model and test, and raising an AirflowException if a failure is found. When an AirflowException is raised, the sns_notification_for_failure task is triggered.
- If a failure occurs, the sns_notification_for_failure task invokes a Lambda function to send an SNS notification. If no failures are detected, this task is skipped.
The following diagram illustrates this workflow.
Configure DAG variables
To customize this DAG for your environment, configure the following variables (a hedged sketch follows this list):
- project_name – Make sure the project_name matches the S3 prefix of your dbt project
- secret_name – Provide the name of the secret that stores the dbt user credentials
- target_database and target_schema – Update these variables to reflect where you want to land your dbt models in Amazon Redshift
- redshift_connection_id – Set this to match the connection configured in Amazon MWAA for this Redshift database
- sns_lambda_function_name – Provide the Lambda function name used to send SNS notifications
- dag_name – Provide the DAG name that will be passed to the SNS notification Lambda function
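As a concrete reference, the following sketch shows one way these variables could be declared at the top of the DAG file; every value shown is a placeholder, not a value from the original project.

```python
# DAG-level configuration (sketch); replace each placeholder with your own values
project_name = "my_sample_dbt_project"        # must match the S3 prefix of the dbt project
secret_name = "my_dbt_credentials"            # Secrets Manager secret with dbt user credentials
target_database = "dev"                       # Redshift database where dbt models land
target_schema = "analytics"                   # Redshift schema where dbt models land
redshift_connection_id = "redshift_default"   # Amazon MWAA connection for this database
sns_lambda_function_name = "dbt-failure-sns-notifier"
dag_name = "dbt_sample_dag"
```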
Incorporate DAG components
After setting the variables, you can now incorporate the following components to complete the DAG.
Secrets Manager
The DAG retrieves the dbt user credentials from Secrets Manager:
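The retrieval code isn't shown above; a minimal sketch, assuming the secret stores dbt_user_name and password_value as JSON keys, might look like the following.

```python
# Retrieve the dbt user's credentials from AWS Secrets Manager (sketch)
import json

import boto3

secrets_client = boto3.client("secretsmanager")
secret = json.loads(
    secrets_client.get_secret_value(SecretId=secret_name)["SecretString"]
)
dbt_user_name = secret["dbt_user_name"]
password_value = secret["password_value"]
```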
Redshift connection configuration
It uses RedshiftUserPasswordProfileMapping to authenticate (a minimal sketch follows):
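The following sketch shows one way to build the Cosmos profile configuration; the profile name and profile_args values are assumptions, so adjust them to your project and target schema.

```python
# Map the Amazon MWAA Redshift connection to a dbt profile (sketch)
from cosmos import ProfileConfig
from cosmos.profiles import RedshiftUserPasswordProfileMapping

profile_config = ProfileConfig(
    profile_name="my_sample_dbt_project",
    target_name="dev",
    profile_mapping=RedshiftUserPasswordProfileMapping(
        conn_id=redshift_connection_id,   # connection created in Amazon MWAA
        profile_args={
            "dbname": target_database,    # database where the dbt models land
            "schema": target_schema,
        },
    ),
)
```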
dbt execution setup
This code contains the following variables:
- dbt executable path – Uses a virtual environment
- dbt project path – Is located in the environment variable DBT_PROJECT_PATH under your project (see the sketch after this list)
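A minimal sketch of this setup, assuming the environment variables exported in the startup.sh script shown earlier:

```python
# Point Cosmos at the dbt executable in the virtual environment and at the project (sketch)
import os

from cosmos import ExecutionConfig, ProjectConfig

dbt_executable_path = f"{os.environ['DBT_VENV_PATH']}/bin/dbt"
dbt_project_path = os.environ["DBT_PROJECT_PATH"]

execution_config = ExecutionConfig(dbt_executable_path=dbt_executable_path)
project_config = ProjectConfig(dbt_project_path)
```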
Tasks and execution flow
This step consists of the following components (a hedged sketch of the task wiring follows these items):
- Audit dbt task group (audit_dbt_task) – Runs the dbt model tagged with audit
- dbt task group (transform_data) – Runs the dbt models tagged with operations, excluding the audit model
In dbt, tags are labels that you can assign to models, tests, seeds, and other dbt resources to organize and selectively run subsets of your dbt project. In your render_config, you have exclude=["tag:audit"]. This means dbt will exclude models that have the tag audit, because the audit model runs separately.
- Failure check (dbt_check) – Checks for dbt model failures and raises an AirflowException if upstream dbt tasks fail
- SNS notification on failure (sns_notification_for_failure) – Invokes a Lambda function to send an SNS notification upon a dbt task failure (for example, a dbt model in the task group)
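The DAG code itself isn't reproduced above, so the following sketch shows one way the pieces described here could be wired together. It assumes the variables and the project_config, profile_config, and execution_config objects from the earlier sketches, and the body of check_dbt_failures is an assumption about how upstream failures might be detected.

```python
# Task wiring for the dbt workflow (sketch)
import json

from pendulum import datetime

from airflow.decorators import dag
from airflow.exceptions import AirflowException
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.lambda_function import LambdaInvokeFunctionOperator
from airflow.utils.state import TaskInstanceState
from airflow.utils.trigger_rule import TriggerRule
from cosmos import DbtTaskGroup, RenderConfig


def check_dbt_failures(**context):
    """Raise if any task in this DAG run failed, so downstream alerting fires."""
    failed_tasks = [
        ti.task_id
        for ti in context["dag_run"].get_task_instances(state=[TaskInstanceState.FAILED])
    ]
    if failed_tasks:
        raise AirflowException(f"dbt failures detected in tasks: {failed_tasks}")


@dag(dag_id=dag_name, start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def dbt_sample_dag():
    # Runs only the model tagged "audit"
    audit_dbt_task = DbtTaskGroup(
        group_id="audit_dbt_task",
        project_config=project_config,
        profile_config=profile_config,
        execution_config=execution_config,
        render_config=RenderConfig(select=["tag:audit"]),
    )

    # Runs the remaining models and their tests, excluding the audit model
    transform_data = DbtTaskGroup(
        group_id="transform_data",
        project_config=project_config,
        profile_config=profile_config,
        execution_config=execution_config,
        render_config=RenderConfig(exclude=["tag:audit"]),
    )

    # Surfaces model-level failures that DbtTaskGroup does not propagate on its own
    dbt_check = PythonOperator(
        task_id="dbt_check",
        python_callable=check_dbt_failures,
        trigger_rule=TriggerRule.ALL_DONE,
    )

    # Invokes the Lambda function that publishes to SNS; skipped when nothing failed
    sns_notification_for_failure = LambdaInvokeFunctionOperator(
        task_id="sns_notification_for_failure",
        function_name=sns_lambda_function_name,
        payload=json.dumps({"dag_name": dag_name}),
        trigger_rule=TriggerRule.ONE_FAILED,
    )

    audit_dbt_task >> transform_data >> dbt_check >> sns_notification_for_failure


dbt_sample_dag()
```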
The sample DAG orchestrates a dbt workflow in Amazon Redshift, starting with an audit task followed by a task group that processes the data models. It includes a failure handling mechanism that checks for failures and raises an exception to trigger an SNS notification through Lambda if a failure occurs. If no failures are detected, the SNS notification task is skipped.
Clean up
If you no longer need the resources you created, delete them to avoid additional charges. This includes the following:
- Amazon MWAA environment
- S3 bucket
- IAM role
- Redshift cluster or serverless workgroup
- Secrets Manager secret
- SNS topic
- Lambda function
Conclusion
By integrating dbt with Amazon Redshift and orchestrating workflows using Amazon MWAA and the Cosmos library, you can simplify data transformation workflows while maintaining strong engineering practices. The sample dbt project structure, combined with automated deployments through GitHub Actions and proactive monitoring using Amazon SNS, provides a foundation for building reliable data pipelines. The addition of audit logging facilitates transparency across your transformations, so teams can maintain high data quality standards.
You can use this solution as a starting point for your own dbt implementation on Amazon MWAA. The approach we outlined emphasizes SQL-based transformations while incorporating essential operational capabilities like deployment automation and failure alerting. Get started by adapting the configuration to your environment, and build on these practices as your data needs evolve.
For more resources, refer to Manage data transformations with dbt in Amazon Redshift and Redshift setup.
About the authors
Cindy Li is an Associate Cloud Architect at AWS Professional Services, specialising in Data Analytics. Cindy works with customers to design and implement scalable data analytics solutions on AWS. When Cindy isn't diving into tech, you can find her out on walks with her playful toy poodle Mocha.
Akhil B is a Data Analytics Consultant at AWS Professional Services, specializing in cloud-based data solutions. He partners with customers to design and implement scalable data analytics platforms, helping organizations transform their traditional data infrastructure into modern, cloud-based solutions on AWS. His expertise helps organizations optimize their data ecosystems and maximize business value through modern analytics capabilities.
Joao Palma is a Senior Data Architect at Amazon Web Services, where he partners with enterprise customers to design and implement comprehensive data platform solutions. He focuses on helping organizations transform their data into strategic business assets and enabling data-driven decision making.
Harshana Nanayakkara is a Delivery Consultant at AWS Professional Services, where he helps customers tackle complex business challenges using AWS Cloud technology. He focuses on data and analytics, data governance, and AI/ML implementations.