Customers with data engineers and data scientists are using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) as a central orchestration platform for running data pipelines and machine learning (ML) workloads. To support these pipelines, they often require additional Python packages, such as Apache Airflow Providers. For example, a pipeline may require the Snowflake provider package for interacting with a Snowflake warehouse, or the Kubernetes provider package for provisioning Kubernetes workloads. As a result, they need to manage these Python dependencies efficiently and reliably, ensuring compatibility with each other and with the base Apache Airflow installation.
Python includes the tool pip to handle package installations. To install a package, you add its name to a special file named requirements.txt. The pip install command instructs pip to read the contents of your requirements file, determine dependencies, and install the packages. Amazon MWAA runs the pip install command using this requirements.txt file during initial environment startup and during subsequent updates. For more information, see How it works.
Creating a reproducible and stable requirements file is key to reducing pip installation and DAG errors. Additionally, this defined set of requirements provides consistency across nodes in an Amazon MWAA environment. This is most important during worker auto scaling, where additional worker nodes are provisioned; if those nodes installed different dependencies, the result could be inconsistencies and task failures. This strategy also promotes consistency across different Amazon MWAA environments, such as dev, qa, and prod.
This post describes best practices for managing the requirements file in your Amazon MWAA environment. It defines the steps needed to determine your required packages and package versions, create and verify your requirements.txt file with package versions, and package your dependencies.
Best practices
The following sections describe the best practices for managing Python dependencies.
Specify package versions in the requirements.txt file
When creating a Python requirements.txt file, you can specify just the package name, or the package name and a specific version. Adding a package without version information instructs the pip installer to download and install the latest available version, subject to compatibility with other installed packages and any constraints. The package versions selected during environment creation may differ from the versions selected during a later auto scaling event. This version change can create package conflicts leading to pip install errors. Even if the updated package installs properly, code changes in the package can affect task behavior, leading to inconsistencies in output. To avoid these risks, it's best practice to add the version number to each package in your requirements.txt file.
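For example, the following requirements.txt entries illustrate the difference (the version numbers shown are illustrative; use the versions pip resolves for your environment):

```text
# Unpinned -- pip may resolve a different version at each install:
# apache-airflow-providers-snowflake

# Pinned -- every node installs the same version:
apache-airflow-providers-snowflake==5.2.1
apache-airflow-providers-dbt-cloud==3.5.1
```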
Use the constraints file for your Apache Airflow version
A constraints file contains the packages, with versions, verified to be compatible with your Apache Airflow version. This file adds an additional validation layer to prevent package conflicts. Because the constraints file plays such an important role in preventing conflicts, beginning with Apache Airflow v2.7.2 on Amazon MWAA, your requirements file must include a --constraint statement. If a --constraint statement is not supplied, Amazon MWAA will specify a compatible constraints file for you.
Constraints files are available for each Airflow version and Python version combination. The URLs have the following form:
https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt
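For example, substituting the versions used later in this walkthrough (Apache Airflow v2.8.1 and Python 3.11) produces the concrete URL; the following snippet simply performs that substitution:

```shell
# Substitute your environment's versions into the constraints URL template
AIRFLOW_VERSION="2.8.1"
PYTHON_VERSION="3.11"
CONSTRAINTS_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
echo "${CONSTRAINTS_URL}"
# -> https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt
```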
The official Apache Airflow constraints are guidelines, and if your workflows require newer versions of a provider package, you may need to modify your constraints file and include it in your DAG folder. When doing so, the best practices outlined in this post become even more important to guard against package conflicts.
Create a .zip archive of all dependencies
Creating a .zip file containing the packages in your requirements file, and specifying it as the package repository source, makes sure the exact same wheel files are used during your initial environment setup and subsequent node configurations. The pip installer will use these local files for installation rather than connecting to the external PyPI repository.
Test the requirements.txt file and dependency .zip file
Testing your requirements file before release to production is key to avoiding installation and DAG errors. Testing both locally, with the MWAA local runner, and in a dev or staging Amazon MWAA environment, are best practices before deploying to production. You can use continuous integration and delivery (CI/CD) deployment strategies to perform the requirements and package installation testing, as described in Automating a DAG deployment with Amazon Managed Workflows for Apache Airflow.
Solution overview
This solution uses the MWAA local runner, an open source utility that replicates an Amazon MWAA environment locally. You use the local runner to build and validate your requirements file, and to package the dependencies. In this example, you install the snowflake and dbt-cloud provider packages. You then use the MWAA local runner and a constraints file to determine the exact version of each package compatible with Apache Airflow. With this information, you update the requirements file, pinning each package to a version, and retest the installation. When you have a successful installation, you package your dependencies and test in a non-production Amazon MWAA environment.
We use MWAA local runner v2.8.1 for this walkthrough and walk through the following steps:
- Download and build the MWAA local runner.
- Create and test a requirements file with package versions.
- Package dependencies.
- Deploy the requirements file and dependencies to a non-production Amazon MWAA environment.
Prerequisites
For this walkthrough, you should have the following prerequisites:
Set up the MWAA local runner
First, you download the MWAA local runner version matching your target MWAA environment, then you build the image.
Complete the following steps to configure the local runner:
- Clone the MWAA local runner repository with the following command:
- With Docker running, build the container with the following command:
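The original post's command blocks are not reproduced in this version; the following is a sketch of the typical sequence, assuming the official aws-mwaa-local-runner repository (the v2.8.1 branch name is an assumption; check the repository's branches and tags for the version matching your environment):

```shell
# Clone the MWAA local runner and check out the version matching your environment
# (the branch name v2.8.1 is an assumption -- verify it exists in the repository)
git clone https://github.com/aws/aws-mwaa-local-runner.git
cd aws-mwaa-local-runner
git checkout v2.8.1

# With Docker running, build the local runner container image
./mwaa-local-env build-image
```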
Create and test a requirements file with package versions
Building a versioned requirements file makes sure all Amazon MWAA components have the same package versions installed. To determine the compatible version of each package, you start with a constraints file and an un-versioned requirements file, allowing pip to resolve the dependencies. You then create your versioned requirements file from pip's installation output.
The following diagram illustrates this workflow.
To build an initial requirements file, complete the following steps:
- In your MWAA local runner directory, open requirements/requirements.txt in your preferred editor.
The default requirements file will look similar to the following:
- Replace the current packages with the following package list:
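The package list from the original post is not reproduced here; for the two providers in this example, an un-versioned requirements file would look like the following sketch (the constraints URL matches Apache Airflow v2.8.1 and Python 3.11):

```text
--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt"
apache-airflow-providers-snowflake
apache-airflow-providers-dbt-cloud
```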
- Save requirements.txt.
- In a terminal, run the following command to generate the pip install output:
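Assuming the standard mwaa-local-env helper script from the local runner repository, the command is:

```shell
# Runs pip install against requirements/requirements.txt inside the local runner container
./mwaa-local-env test-requirements
```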
test-requirements runs pip install, which handles resolving the compatible package versions. Using a constraints file makes sure the selected packages are compatible with your Airflow version. The output will look similar to the following:
The message beginning with Successfully installed is the output of interest. This shows which dependencies, and their specific versions, pip installed. You use this list to create your final versioned requirements file.
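One way to turn that output into pinned entries is to rewrite each pkg-1.2.3 token from the Successfully installed line as pkg==1.2.3. The following helper is a hypothetical sketch, not part of the original post; the package names and versions in the echo line are illustrative:

```shell
# Convert a pip "Successfully installed" line into pinned requirements entries,
# rewriting each trailing "-<version>" as "==<version>"
echo "Successfully installed apache-airflow-providers-dbt-cloud-3.5.1 apache-airflow-providers-snowflake-5.2.1" \
  | sed -E 's/^Successfully installed //' \
  | tr ' ' '\n' \
  | sed -E 's/-([0-9][0-9a-z.]*)$/==\1/'
# -> apache-airflow-providers-dbt-cloud==3.5.1
# -> apache-airflow-providers-snowflake==5.2.1
```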
Your output will also contain Requirement already satisfied messages for packages already available in the base Amazon MWAA environment. You don't add these packages to your requirements.txt file.
- Update the requirements file with the list of versioned packages from the test-requirements command. The updated file will look similar to the following code:
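With the versions resolved by pip pinned in place, the file takes the following shape (5.2.1 and 3.5.1 are illustrative; use the versions from your own pip output):

```text
--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt"
apache-airflow-providers-snowflake==5.2.1
apache-airflow-providers-dbt-cloud==3.5.1
```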
Next, you test the updated requirements file to confirm no conflicts exist.
- Rerun the test-requirements command:
A successful test will not produce any errors. If you encounter dependency conflicts, return to the previous step and update the requirements file with additional packages, or different package versions, based on pip's output.
Package dependencies
If your Amazon MWAA environment has a private webserver, you must package your dependencies into a .zip file, upload the file to your S3 bucket, and specify the package location in your Amazon MWAA instance configuration. Because a private webserver can't access the PyPI repository through the internet, pip will install the dependencies from the .zip file.
If you're using a public webserver configuration, you also benefit from a static .zip file, which makes sure the package information remains unchanged until it's explicitly rebuilt.
This process uses the versioned requirements file created in the previous section and the package-requirements feature in the MWAA local runner.
To package your dependencies, complete the following steps:
- In a terminal, navigate to the directory where you installed the local runner.
- Download the constraints file for your Python version and your version of Apache Airflow and place it in the plugins directory. For this post, we use Python 3.11 and Apache Airflow v2.8.1:
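A typical way to download the file (the curl invocation and target filename are a sketch, not taken from the original post):

```shell
# Download the Airflow v2.8.1 / Python 3.11 constraints file into the plugins directory
curl -sSL -o plugins/constraints-3.11.txt \
  "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt"
```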
- In your requirements file, update the constraints URL to the locally downloaded file.
The --constraint statement instructs pip to compare the package versions in your requirements.txt file to the allowed versions in the constraints file. Downloading a specific constraints file to your plugins directory enables you to control the constraints file location and contents.
The updated requirements file will look like the following code:
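Assuming the constraints file was saved as plugins/constraints-3.11.txt (the plugins directory is mounted at /usr/local/airflow/plugins inside the environment), the updated file would look like this sketch:

```text
--constraint "/usr/local/airflow/plugins/constraints-3.11.txt"
apache-airflow-providers-snowflake==5.2.1
apache-airflow-providers-dbt-cloud==3.5.1
```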
- Run the following command to create the .zip file:
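Assuming the standard mwaa-local-env helper script, the command is:

```shell
# Downloads the wheel files for your requirements and zips them into plugins.zip
./mwaa-local-env package-requirements
```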
package-requirements creates an updated requirements file named packaged_requirements.txt and zips all dependencies into plugins.zip. The updated requirements file looks like the following code:
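The generated file is not reproduced in this version of the post; based on the description that follows, it would resemble this sketch (exact contents depend on your local runner version):

```text
--find-links /usr/local/airflow/plugins
--constraint "/usr/local/airflow/plugins/constraints-3.11.txt"
apache-airflow-providers-snowflake==5.2.1
apache-airflow-providers-dbt-cloud==3.5.1
```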
Note the reference to the local constraints file and the plugins directory. The --find-links statement instructs pip to install packages from /usr/local/airflow/plugins rather than the public PyPI repository.
Deploy the requirements file
After you achieve an error-free requirements installation and package your dependencies, you're ready to deploy the assets to a non-production Amazon MWAA environment. Even when verifying and testing requirements with the MWAA local runner, it's best practice to deploy and test the changes in a non-prod Amazon MWAA environment before deploying to production. For more information about creating a CI/CD pipeline to test changes, refer to Deploying to Amazon Managed Workflows for Apache Airflow.
To deploy your changes, complete the following steps:
- Upload your requirements.txt file and plugins.zip file to your Amazon MWAA environment's S3 bucket.
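For example, using the AWS CLI (the bucket name is a placeholder for your environment's bucket):

```shell
# Upload the requirements file and the dependency archive to the environment's S3 bucket
aws s3 cp requirements/requirements.txt s3://your-mwaa-bucket/requirements.txt
aws s3 cp plugins.zip s3://your-mwaa-bucket/plugins.zip
```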
For instructions on specifying a requirements.txt version, refer to Specifying the requirements.txt version on the Amazon MWAA console. For instructions on specifying a plugins.zip file, refer to Installing custom plugins on your environment.
The Amazon MWAA environment will update and install the packages in your plugins.zip file.
After the update is complete, verify the provider package installation in the Apache Airflow UI.
- Access the Apache Airflow UI in Amazon MWAA.
- From the Apache Airflow menu bar, choose Admin, then Providers.
The list of providers, and their versions, is shown in a table. In this example, the page reflects the installation of apache-airflow-providers-dbt-cloud version 3.5.1 and apache-airflow-providers-snowflake version 5.2.1. This list only contains the provider packages installed, not all supporting Python packages. Provider packages that are part of the base Apache Airflow installation will also appear in the list. The following image is an example of the package list; note the apache-airflow-providers-dbt-cloud and apache-airflow-providers-snowflake packages and their versions.
To verify all package installations, view the results in Amazon CloudWatch Logs. Amazon MWAA creates a log stream for the requirements installation, and the stream contains the pip install output. For instructions, refer to Viewing logs for your requirements.txt.
A successful installation results in the following message:
If you encounter any installation errors, you should determine the package conflict, update the requirements file, run the local runner test, re-package the plugins, and deploy the updated files.
Clean up
If you created an Amazon MWAA environment specifically for this post, delete the environment and S3 objects to avoid incurring additional charges.
Conclusion
In this post, we discussed several best practices for managing Python dependencies in Amazon MWAA and how to use the MWAA local runner to implement these practices. These best practices reduce DAG and pip installation errors in your Amazon MWAA environment. For more details and code examples on Amazon MWAA, visit the Amazon MWAA User Guide and the Amazon MWAA examples GitHub repo.
Apache, Apache Airflow, and Airflow are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
About the Author
Mike Ellis is a Technical Account Manager at AWS and an Amazon MWAA specialist. In addition to assisting customers with Amazon MWAA, he contributes to the Apache Airflow open source project.