Amazon MWAA best practices for managing Python dependencies


Customers with data engineers and data scientists are using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) as a central orchestration platform for running data pipelines and machine learning (ML) workloads. To support these pipelines, they often require additional Python packages, such as Apache Airflow Providers. For example, a pipeline may require the Snowflake provider package for interacting with a Snowflake warehouse, or the Kubernetes provider package for provisioning Kubernetes workloads. As a result, they need to manage these Python dependencies efficiently and reliably, ensuring compatibility with each other and the base Apache Airflow installation.

Python includes the tool pip to handle package installations. To install a package, you add the name to a special file named requirements.txt. The pip install command instructs pip to read the contents of your requirements file, determine dependencies, and install the packages. Amazon MWAA runs the pip install command using this requirements.txt file during initial environment startup and subsequent updates. For more information, see How it works.

Creating a reproducible and stable requirements file is key to reducing pip installation and DAG errors. Additionally, this defined set of requirements provides consistency across nodes in an Amazon MWAA environment. This is most important during worker auto scaling, where additional worker nodes are provisioned, and having different dependencies could lead to inconsistencies and task failures. Additionally, this strategy promotes consistency across different Amazon MWAA environments, such as dev, qa, and prod.

This post describes best practices for managing your requirements file in your Amazon MWAA environment. It defines the steps needed to determine your required packages and package versions, create and verify your requirements.txt file with package versions, and package your dependencies.

Best practices

The following sections describe the best practices for managing Python dependencies.

Specify package versions in the requirements.txt file

When creating a Python requirements.txt file, you can specify just the package name, or the package name and a specific version. Adding a package without version information instructs the pip installer to download and install the latest available version, subject to compatibility with other installed packages and any constraints. The package versions selected during environment creation may be different than the versions selected during an auto scaling event later on. This version change can create package conflicts leading to pip install errors. Even if the updated package installs properly, code changes in the package can affect task behavior, leading to inconsistencies in output. To avoid these risks, it's best practice to add the version number to each package in your requirements.txt file.
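For example, the difference looks like this in a requirements.txt file (the version shown is illustrative; use the version your own testing resolves):

# Unpinned: pip may resolve a different version on each install
apache-airflow-providers-snowflake

# Pinned: every node installs the exact same version
apache-airflow-providers-snowflake==5.2.1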

Use the constraints file for your Apache Airflow version

A constraints file contains the packages, with versions, verified to be compatible with your Apache Airflow version. This file adds an additional validation layer to prevent package conflicts. Because the constraints file plays such an important role in preventing conflicts, beginning with Apache Airflow v2.7.2 on Amazon MWAA, your requirements file must include a --constraint statement. If a --constraint statement is not supplied, Amazon MWAA will specify a compatible constraints file for you.

Constraints files are available for each Airflow version and Python version combination. The URLs have the following form:

https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt
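For example, for Apache Airflow v2.8.1 on Python 3.11, the combination used later in this walkthrough, the URL is:

https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt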

The official Apache Airflow constraints are guidelines, and if your workflows require newer versions of a provider package, you may need to modify your constraints file and include it in your DAG folder. When doing so, the best practices outlined in this post become even more important to guard against package conflicts.

Create a .zip archive of all dependencies

Creating a .zip file containing the packages in your requirements file and specifying this as the package repository source makes sure the exact same wheel files are used during your initial environment setup and subsequent node configurations. The pip installer will use these local files for installation rather than connecting to the external PyPI repository.
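The walkthrough later in this post automates this packaging with the local runner's package-requirements command; as a minimal sketch of the underlying idea only, you could download and archive the wheel files yourself (file names and paths here are illustrative):

# Download the wheel files that satisfy requirements.txt into a local directory
pip3 download -r requirements.txt -d plugins/

# Archive the downloaded files so every node installs from the same wheels
zip -j plugins.zip plugins/*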

Test the requirements.txt file and dependency .zip file

Testing your requirements file before release to production is key to avoiding installation and DAG errors. Testing both locally, with the MWAA local runner, and in a dev or staging Amazon MWAA environment, are best practices before deploying to production. You can use continuous integration and delivery (CI/CD) deployment strategies to perform the requirements and package installation testing, as described in Automating a DAG deployment with Amazon Managed Workflows for Apache Airflow.
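As one hedged example, a CI job could build the local runner image and fail the pipeline whenever the requirements test fails; the repository layout and copy path below are assumptions for illustration:

#!/bin/bash
set -e  # abort the CI job on the first failing command

# Fetch the local runner version matching the target MWAA environment
git clone https://github.com/aws/aws-mwaa-local-runner.git -b v2.8.1
cd aws-mwaa-local-runner

# Copy the requirements file under test into the runner (source path is hypothetical)
cp ../requirements.txt requirements/requirements.txt

# Build the image and run the installation test; a non-zero exit code fails the job
./mwaa-local-env build-image
./mwaa-local-env test-requirements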

Solution overview

This solution uses the MWAA local runner, an open source utility that replicates an Amazon MWAA environment locally. You use the local runner to build and validate your requirements file, and package the dependencies. In this example, you install the snowflake and dbt-cloud provider packages. You then use the MWAA local runner and a constraints file to determine the exact version of each package compatible with Apache Airflow. With this information, you then update the requirements file, pinning each package to a version, and retest the installation. When you have a successful installation, you package your dependencies and test in a non-production Amazon MWAA environment.

We use MWAA local runner v2.8.1 for this walkthrough and walk through the following steps:

  1. Download and build the MWAA local runner.
  2. Create and test a requirements file with package versions.
  3. Package dependencies.
  4. Deploy the requirements file and dependencies to a non-production Amazon MWAA environment.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Set up the MWAA local runner

First, you download the MWAA local runner version matching your target MWAA environment, then you build the image.

Complete the following steps to configure the local runner:

  1. Clone the MWAA local runner repository with the following command:
    git clone git@github.com:aws/aws-mwaa-local-runner.git -b v2.8.1

  2. With Docker running, build the container with the following command:
    cd aws-mwaa-local-runner
    ./mwaa-local-env build-image

Create and test a requirements file with package versions

Building a versioned requirements file makes sure all Amazon MWAA components have the same package versions installed. To determine the compatible versions for each package, you start with a constraints file and an un-versioned requirements file, allowing pip to resolve the dependencies. You then create your versioned requirements file from pip's installation output.

The following diagram illustrates this workflow.

Requirements file testing process

To build an initial requirements file, complete the following steps:

  1. In your MWAA local runner directory, open requirements/requirements.txt in your preferred editor.

The default requirements file will look similar to the following:

--constraint "https://uncooked.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt"
apache-airflow-providers-snowflake==5.2.1
apache-airflow-providers-mysql==5.5.1

  2. Replace the existing packages with the following package list:
--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt"
apache-airflow-providers-snowflake
apache-airflow-providers-dbt-cloud[http]

  3. Save requirements.txt.
  4. In a terminal, run the following command to generate the pip install output:
./mwaa-local-env test-requirements

test-requirements runs pip install, which handles resolving the compatible package versions. Using a constraints file makes sure the selected packages are compatible with your Airflow version. The output will look similar to the following:

Successfully installed apache-airflow-providers-dbt-cloud-3.5.1 apache-airflow-providers-snowflake-5.2.1 pyOpenSSL-23.3.0 snowflake-connector-python-3.6.0 snowflake-sqlalchemy-1.5.1 sortedcontainers-2.4.0

The message beginning with Successfully installed is the output of interest. It shows which dependencies, and their specific versions, pip installed. You use this list to create your final versioned requirements file.

Your output will also contain Requirement already satisfied messages for packages already available in the base Amazon MWAA environment. You don't add these packages to your requirements.txt file.
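Such a message looks similar to the following (the package and path shown are illustrative):

Requirement already satisfied: boto3 in /usr/local/lib/python3.11/site-packages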

  5. Update the requirements file with the list of versioned packages from the test-requirements command. The updated file will look similar to the following code:
--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt"
apache-airflow-providers-snowflake==5.2.1
apache-airflow-providers-dbt-cloud[http]==3.5.1
pyOpenSSL==23.3.0
snowflake-connector-python==3.6.0
snowflake-sqlalchemy==1.5.1
sortedcontainers==2.4.0

Next, you test the updated requirements file to make sure no conflicts exist.

  6. Rerun the test-requirements command:
./mwaa-local-env test-requirements

A successful test will not produce any errors. If you encounter dependency conflicts, return to the previous step and update the requirements file with additional packages, or package versions, based on pip's output.

Package dependencies

If your Amazon MWAA environment has a private webserver, you must package your dependencies into a .zip file, upload the file to your S3 bucket, and specify the package location in your Amazon MWAA instance configuration. Because a private webserver can't access the PyPI repository over the internet, pip will install the dependencies from the .zip file.

If you're using a public webserver configuration, you also benefit from a static .zip file, which makes sure the package information remains unchanged until it's explicitly rebuilt.

This process uses the versioned requirements file created in the previous section and the package-requirements feature in the MWAA local runner.

To package your dependencies, complete the following steps:

  1. In a terminal, navigate to the directory where you installed the local runner.
  2. Download the constraints file for your Python version and your version of Apache Airflow and place it in the plugins directory. For this post, we use Python 3.11 and Apache Airflow v2.8.1:
curl -o plugins/constraints-2.8.1-3.11.txt https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt

  3. In your requirements file, update the constraints URL to the locally downloaded file.

The --constraint statement instructs pip to compare the package versions in your requirements.txt file to the allowed versions in the constraints file. Downloading a specific constraints file to your plugins directory enables you to control the constraints file location and contents.

The updated requirements file will look like the following code:

--constraint "/usr/native/airflow/plugins/constraints-2.8.1-3.11.txt"
apache-airflow-providers-snowflake==5.2.1
apache-airflow-providers-dbt-cloud[http]==3.5.1
pyOpenSSL==23.3.0
snowflake-connector-python==3.6.0
snowflake-sqlalchemy==1.5.1
sortedcontainers==2.4.0

  4. Run the following command to create the .zip file:
./mwaa-local-env package-requirements

package-requirements creates an updated requirements file named packaged_requirements.txt and zips all dependencies into plugins.zip. The updated requirements file looks like the following code:

--find-links /usr/local/airflow/plugins
--no-index
--constraint "/usr/native/airflow/plugins/constraints-2.8.1-3.11.txt"
apache-airflow-providers-snowflake==5.2.1
apache-airflow-providers-dbt-cloud[http]==3.5.1
pyOpenSSL==23.3.0
snowflake-connector-python==3.6.0
snowflake-sqlalchemy==1.5.1
sortedcontainers==2.4.0

Note the reference to the local constraints file and the plugins directory. The --find-links statement instructs pip to install packages from /usr/local/airflow/plugins rather than the public PyPI repository.
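As a quick optional check that the expected wheels made it into the archive, you can list its contents (the package name searched for is just an example):

# Confirm a pinned package's wheel file is present in the archive
unzip -l plugins.zip | grep snowflake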

Deploy the requirements file

After you achieve an error-free requirements installation and have packaged your dependencies, you're ready to deploy the assets to a non-production Amazon MWAA environment. Even when verifying and testing requirements with the MWAA local runner, it's best practice to deploy and test the changes in a non-prod Amazon MWAA environment before deploying to production. For more information about creating a CI/CD pipeline to test changes, refer to Deploying to Amazon Managed Workflows for Apache Airflow.

To deploy your changes, complete the following steps:

  1. Upload your requirements.txt file and plugins.zip file to your Amazon MWAA environment's S3 bucket.

For instructions on specifying a requirements.txt version, refer to Specifying the requirements.txt version on the Amazon MWAA console. For instructions on specifying a plugins.zip file, refer to Installing custom plugins on your environment.
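If you script the deployment, a hedged sketch with the AWS CLI could look like the following; the bucket and environment names are assumptions for illustration:

# Upload both artifacts to the environment's S3 bucket (bucket name is hypothetical)
aws s3 cp requirements.txt s3://my-mwaa-bucket/requirements.txt
aws s3 cp plugins.zip s3://my-mwaa-bucket/plugins.zip

# Point the environment at the updated files (environment name is hypothetical)
aws mwaa update-environment \
    --name my-dev-environment \
    --requirements-s3-path requirements.txt \
    --plugins-s3-path plugins.zip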

The Amazon MWAA environment will update and install the packages in your plugins.zip file.

After the update is complete, verify the provider package installation in the Apache Airflow UI.

  2. Access the Apache Airflow UI in Amazon MWAA.
  3. From the Apache Airflow menu bar, choose Admin, then Providers.

The list of providers, and their versions, is shown in a table. In this example, the page reflects the installation of apache-airflow-providers-dbt-cloud version 3.5.1 and apache-airflow-providers-snowflake version 5.2.1. This list contains only the provider packages installed, not all supporting Python packages. Provider packages that are part of the base Apache Airflow installation will also appear in the list. The following image is an example of the package list; note the apache-airflow-providers-dbt-cloud and apache-airflow-providers-snowflake packages and their versions.

Airflow UI with installed packages

To verify all package installations, view the results in Amazon CloudWatch Logs. Amazon MWAA creates a log stream for the requirements installation, and the stream contains the pip install output. For instructions, refer to Viewing logs for your requirements.txt.
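If you prefer the CLI to the console, a sketch like the following can locate the requirements log streams; the log group name assumes worker logging is enabled for an environment named my-dev-environment and is an assumption here:

# List log streams for the requirements installation (log group name is an assumption)
aws logs describe-log-streams \
    --log-group-name airflow-my-dev-environment-Worker \
    --log-stream-name-prefix requirements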

A successful installation results in the following message:

Successfully installed apache-airflow-providers-dbt-cloud-3.5.1 apache-airflow-providers-snowflake-5.2.1 pyOpenSSL-23.3.0 snowflake-connector-python-3.6.0 snowflake-sqlalchemy-1.5.1 sortedcontainers-2.4.0

If you encounter any installation errors, you should determine the package conflict, update the requirements file, run the local runner test, re-package the plugins, and deploy the updated files.

Clean up

If you created an Amazon MWAA environment specifically for this post, delete the environment and S3 objects to avoid incurring additional charges.

Conclusion

In this post, we discussed several best practices for managing Python dependencies in Amazon MWAA and how to use the MWAA local runner to implement these practices. These best practices reduce DAG and pip installation errors in your Amazon MWAA environment. For more details and code examples on Amazon MWAA, visit the Amazon MWAA User Guide and the Amazon MWAA examples GitHub repo.

Apache, Apache Airflow, and Airflow are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.


About the Author


Mike Ellis is a Technical Account Manager at AWS and an Amazon MWAA specialist. In addition to assisting customers with Amazon MWAA, he contributes to the Apache Airflow open source project.
