Migrate workloads from AWS Data Pipeline


AWS Data Pipeline helps customers automate the movement and transformation of data. With Data Pipeline, customers can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks. Launched in 2012, Data Pipeline predates several popular Amazon Web Services (AWS) offerings for orchestrating data pipelines, such as AWS Glue, AWS Step Functions, and Amazon Managed Workflows for Apache Airflow (Amazon MWAA).

Data Pipeline has been a foundational service for getting customers off the ground with their extract, transform, and load (ETL) and infrastructure provisioning use cases. Some customers want a deeper level of control and specificity than is possible with Data Pipeline. With recent advancements in the data industry, customers are looking for a more feature-rich platform to modernize their data pipelines and get them ready for data and machine learning (ML) innovation. This post explains how to migrate from Data Pipeline to alternative AWS services that serve the growing needs of data practitioners. The option you choose depends on your current workload on Data Pipeline. You can migrate typical Data Pipeline use cases to AWS Glue, Step Functions, or Amazon MWAA.

Note that you will need to modify the configurations and code in the examples provided in this post based on your requirements. Before starting any production workloads after migration, you need to test your new workflows to ensure no disruption to production systems.

Migrating workloads to AWS Glue

AWS Glue is a serverless data integration service that helps analytics users discover, prepare, move, and integrate data from multiple sources. It includes tooling for authoring and running jobs and for orchestrating workflows. With AWS Glue, you can discover and connect to hundreds of different data sources and manage your data in a centralized data catalog. You can visually create, run, and monitor ETL pipelines that load data into your data lakes. You can also immediately search and query cataloged data using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.

We recommend migrating your Data Pipeline workload to AWS Glue when:

  • You're looking for a serverless data integration service that supports various data sources, authoring interfaces including visual editors and notebooks, and advanced data management capabilities such as data quality and sensitive data detection.
  • Your workload can be migrated to AWS Glue workflows, jobs (in Python or Apache Spark), and crawlers (for example, your current pipeline is built on top of Apache Spark).
  • You need a single platform that can handle all aspects of your data pipeline, including ingestion, processing, transfer, integrity testing, and quality checks.
  • Your current pipeline was created from a pre-defined template on the AWS Management Console for Data Pipeline, such as exporting a DynamoDB table to Amazon S3 or importing DynamoDB backup data from S3, and you're looking for the same template.
  • Your workload doesn't depend on a specific Hadoop ecosystem application such as Apache Hive.
  • Your workload doesn't require orchestrating on-premises servers, user-managed Amazon Elastic Compute Cloud (Amazon EC2) instances, or a user-managed Amazon EMR cluster.

Example: Migrate EmrActivity on EmrCluster to export DynamoDB tables to S3

One of the most common workloads on Data Pipeline is backing up Amazon DynamoDB tables to Amazon Simple Storage Service (Amazon S3). Data Pipeline has a pre-defined template named Export DynamoDB table to S3 to export DynamoDB table data to a given S3 bucket.

The template uses EmrActivity (named TableBackupActivity), which runs on EmrCluster (named EmrClusterForBackup) and backs up data from DynamoDBDataNode to S3DataNode.

You can migrate these pipelines to AWS Glue because it natively supports reading from DynamoDB.

To define an AWS Glue job for the preceding use case:

  1. Open the AWS Glue console.
  2. Choose ETL jobs.
  3. Choose Visual ETL.
  4. For Sources, select Amazon DynamoDB.
  5. On the node Data source - DynamoDB, for DynamoDB source, select Choose the DynamoDB table directly, then select your source DynamoDB table from the menu.
  6. For Connection options, enter dynamodb.s3.bucket and dynamodb.s3.prefix.
  7. Choose + (plus) to add a new node.
  8. For Targets, select Amazon S3.
  9. On the node Data target - S3 bucket, for Format, select your preferred format, for example, Parquet.
  10. For S3 Target Location, enter your destination S3 path.
  11. On the Job details tab, select an IAM role. If you don't have an IAM role, follow Configuring IAM permissions for AWS Glue.
  12. Choose Save and Run.

Your AWS Glue job has been successfully created and started.

You might notice that there is no property to manage the read I/O rate. That's because the default DynamoDB reader used in Glue Studio doesn't scan the source DynamoDB table. Instead, it uses DynamoDB export.
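If you would rather manage this backup job as code, the following is a minimal hand-written sketch of an equivalent AWS Glue ETL script (not the exact script that Glue Studio generates). The table ARN, bucket name, and prefixes are placeholders.

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the source table with the DynamoDB export connector (no table scan,
    # so no read capacity is consumed). Table ARN, bucket, and prefix are placeholders.
    source = glue_context.create_dynamic_frame.from_options(
        connection_type="dynamodb",
        connection_options={
            "dynamodb.export": "ddb",
            "dynamodb.tableArn": "arn:aws:dynamodb:us-east-1:111122223333:table/source-table",
            "dynamodb.s3.bucket": "amzn-s3-demo-bucket",
            "dynamodb.s3.prefix": "temporary/ddbexport/",
            "dynamodb.unnestDDBJson": True,
        },
    )

    # Write the backup to the destination S3 path in Parquet format.
    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={"path": "s3://amzn-s3-demo-bucket/backup/source-table/"},
        format="parquet",
    )

    job.commit()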

Example: Migrate EmrActivity on EmrCluster to import DynamoDB data from S3

Another common workload on Data Pipeline is restoring DynamoDB tables from backup data in Amazon S3. Data Pipeline has a pre-defined template named Import DynamoDB backup data from S3 to import DynamoDB table data from a given S3 bucket.

The template uses EmrActivity (named TableLoadActivity), which runs on EmrCluster (named EmrClusterForLoad) and loads data from S3DataNode to DynamoDBDataNode.

You can migrate these pipelines to AWS Glue because it natively supports writing to DynamoDB.

The prerequisites are to create a destination DynamoDB table and catalog it in the AWS Glue Data Catalog using a Glue crawler, the Glue console, or the API.

  1. Open the AWS Glue console.
  2. Choose ETL jobs.
  3. Choose Visual ETL.
  4. For Sources, select Amazon S3.
  5. On the node Data source - S3 bucket, for S3 URL, enter your S3 path.
  6. Choose + (plus) to add a new node.
  7. For Targets, select AWS Glue Data Catalog.
  8. On the node Data target - Data Catalog, for Database, select your destination database in the Data Catalog.
  9. For Table, select your destination table in the Data Catalog.
  10. On the Job details tab, select an IAM role. If you don't have an IAM role, follow Configuring IAM permissions for AWS Glue.
  11. Choose Save and Run.

Your AWS Glue job has been successfully created and started.
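As an alternative to the Data Catalog target shown above, you can also write directly to DynamoDB from a Glue ETL script. The following is a minimal hand-written sketch; the S3 path, table name, and write percent are placeholders.

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the backup data from S3. Path and format are placeholders.
    source = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://amzn-s3-demo-bucket/backup/source-table/"]},
        format="parquet",
    )

    # Write to the destination DynamoDB table. The write percent throttles
    # consumed write capacity; both values are placeholders.
    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="dynamodb",
        connection_options={
            "dynamodb.output.tableName": "destination-table",
            "dynamodb.throughput.write.percent": "1.0",
        },
    )

    job.commit()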

Migrating workloads to Step Functions

AWS Step Functions is a serverless orchestration service that lets you build workflows for your business-critical applications. With Step Functions, you use a visual editor to build workflows and integrate directly with over 11,000 actions for more than 250 AWS services, including AWS Lambda, Amazon EMR, DynamoDB, and more. You can use Step Functions to orchestrate data processing pipelines, handle errors, and work with the throttling limits of the underlying AWS services. You can create workflows that process and publish machine learning models, orchestrate microservices, and control AWS services, such as AWS Glue, to create ETL workflows. You can also create long-running, automated workflows for applications that require human interaction.

We recommend migrating your Data Pipeline workload to Step Functions when:

  • You're looking for a serverless, highly available workflow orchestration service.
  • You're looking for a cost-effective solution that charges at single-task granularity.
  • Your workloads orchestrate tasks for multiple AWS services, such as Amazon EMR, AWS Lambda, AWS Glue, or DynamoDB.
  • You're looking for a low-code solution that comes with a drag-and-drop visual designer for workflow creation and doesn't require learning new programming concepts.
  • You're looking for a service that provides integrations with more than 250 AWS services covering over 11,000 actions out of the box, as well as integrations with custom non-AWS services and actions.
  • Both Data Pipeline and Step Functions use JSON format to define workflows. This lets you store your workflows in source control, manage versions, control access, and automate with continuous integration and delivery (CI/CD). Step Functions uses a syntax called Amazon States Language, which is fully based on JSON and allows a seamless transition between the textual and visual representations of the workflow.
  • Your workload requires orchestrating on-premises servers, user-managed EC2 instances, or a user-managed EMR cluster.

With Step Functions, you can choose the same version of Amazon EMR that you're currently using in Data Pipeline.

For migrating activities on Data Pipeline managed resources, you can use AWS SDK service integrations in Step Functions to automate resource provisioning and clean-up. For migrating activities on on-premises servers, user-managed EC2 instances, or a user-managed EMR cluster, you can install an SSM agent on the instance and initiate the command through AWS Systems Manager Run Command from Step Functions. You can also initiate the state machine from a schedule defined in Amazon EventBridge.
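One way to replace a recurring Data Pipeline schedule is with EventBridge Scheduler. The following minimal boto3 sketch assumes placeholder ARNs for the state machine and for an IAM role that EventBridge Scheduler can assume to call states:StartExecution on it.

    import boto3

    scheduler = boto3.client("scheduler")

    # Placeholder ARNs: the state machine created during migration and an IAM role
    # that EventBridge Scheduler can assume to start executions of it.
    STATE_MACHINE_ARN = "arn:aws:states:us-east-1:111122223333:stateMachine:EmrJobWorkflow"
    SCHEDULER_ROLE_ARN = "arn:aws:iam::111122223333:role/SchedulerInvokeStepFunctions"

    scheduler.create_schedule(
        Name="daily-emr-job",
        ScheduleExpression="rate(1 day)",      # equivalent of a daily Data Pipeline schedule
        FlexibleTimeWindow={"Mode": "OFF"},
        Target={
            "Arn": STATE_MACHINE_ARN,
            "RoleArn": SCHEDULER_ROLE_ARN,
        },
    )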

Example: Migrate HadoopActivity on EmrCluster

To migrate HadoopActivity on EmrCluster from Data Pipeline to Step Functions:

  1. Open the AWS Step Functions console.
  2. Choose State machines.
  3. Choose Create state machine.
  4. In the Choose a template wizard, search for emr, select Manage an EMR job, and choose Select.
  5. For Choose how to use this template, select Build on it.
  6. Choose Use template.
  7. For the Create an EMR cluster state, configure the API Parameters (EMR release label, EMR capacity, IAM role, and so on) based on the existing EmrCluster node configuration in Data Pipeline.
  8. For the Run first step state, configure the API Parameters (JAR file and arguments) based on the existing HadoopActivity node configuration in Data Pipeline.
  9. If you have further activities configured on the existing HadoopActivity, repeat step 8.
  10. Choose Create.

Your state machine has been successfully configured. Learn more in Manage an Amazon EMR Job.
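Because Step Functions workflows are JSON, you can also keep the definition in source control and register it with the AWS SDK instead of the console. The following is a minimal boto3 sketch of the same create cluster, run step, terminate cluster pattern; the cluster settings, JAR path, arguments, and role ARNs are placeholders that you would copy from your EmrCluster and HadoopActivity definitions.

    import json
    import boto3

    # Sketch of the "Manage an EMR job" pattern: create a cluster, run one step,
    # then terminate the cluster. All names, paths, and ARNs are placeholders.
    definition = {
        "StartAt": "Create an EMR cluster",
        "States": {
            "Create an EMR cluster": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
                "Parameters": {
                    "Name": "my-demo-cluster",
                    "ReleaseLabel": "emr-6.1.0",
                    "Applications": [{"Name": "Hadoop"}],
                    "Instances": {
                        "InstanceFleets": [
                            {"InstanceFleetType": "MASTER", "TargetOnDemandCapacity": 1,
                             "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]},
                            {"InstanceFleetType": "CORE", "TargetOnDemandCapacity": 2,
                             "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]},
                        ],
                        "KeepJobFlowAliveWhenNoSteps": True,
                    },
                    "ServiceRole": "EMR_DefaultRole",
                    "JobFlowRole": "EMR_EC2_DefaultRole",
                },
                "ResultPath": "$.Cluster",
                "Next": "Run first step",
            },
            "Run first step": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId.$": "$.Cluster.ClusterId",
                    "Step": {
                        "Name": "My first EMR step",
                        "ActionOnFailure": "CONTINUE",
                        "HadoopJarStep": {
                            "Jar": "s3://amzn-s3-demo-bucket/jars/my-hadoop-job.jar",
                            "Args": ["arg1", "arg2"],
                        },
                    },
                },
                "ResultPath": "$.FirstStep",
                "Next": "Terminate the cluster",
            },
            "Terminate the cluster": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster.sync",
                "Parameters": {"ClusterId.$": "$.Cluster.ClusterId"},
                "End": True,
            },
        },
    }

    sfn = boto3.client("stepfunctions")
    sfn.create_state_machine(
        name="EmrJobWorkflow",                                          # placeholder name
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::111122223333:role/StepFunctionsEmrRole",  # placeholder role
    )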

Migrating workloads to Amazon MWAA

Amazon MWAA is a managed orchestration service for Apache Airflow that lets you use the Apache Airflow platform to set up and operate end-to-end data pipelines in the cloud at scale. Apache Airflow is an open source tool used to programmatically author, schedule, and monitor sequences of processes and tasks, referred to as workflows. Apache Airflow brings in concepts like executors, pools, and SLAs that provide advanced data orchestration capabilities. With Amazon MWAA, you can use Airflow and the Python programming language to create workflows without having to manage the underlying infrastructure for scalability, availability, and security. Amazon MWAA automatically scales its workflow execution capacity to meet your needs and is integrated with AWS security services to help provide fast and secure access to your data.

We recommend migrating your Data Pipeline workloads to Amazon MWAA when:

  • You're looking for a managed, highly available service to orchestrate workflows written in Python.
  • You want to transition to a fully managed, widely adopted open source technology, Apache Airflow, for maximum portability.
  • You require a single platform that can handle all aspects of your data pipeline, including ingestion, processing, transfer, integrity testing, and quality checks.
  • You're looking for a service designed for data pipeline orchestration with features such as a rich UI for observability, restarts for failed workflows, backfills, retries for tasks, and lineage support with OpenLineage.
  • You're looking for a service that comes with more than 1,000 pre-built operators and sensors, covering AWS as well as non-AWS services.
  • Your workload requires orchestrating on-premises servers, user-managed EC2 instances, or a user-managed EMR cluster.

Amazon MWAA workflows are defined as directed acyclic graphs (DAGs) using Python, so you can also treat them as source code. Airflow's extensible Python framework enables you to build workflows connecting with virtually any technology. It comes with a rich user interface for viewing and monitoring workflows and can be easily integrated with version control systems to automate the CI/CD process. With Amazon MWAA, you can choose the same version of Amazon EMR that you're currently using in Data Pipeline.

Example: Migrate HadoopActivity on EmrCluster

Complete the following steps if you don't have an existing MWAA environment:

  1. Create an AWS CloudFormation template on your computer by copying the template from the quick start guide into a local text file.
  2. On the CloudFormation console, choose Stacks in the navigation pane.
  3. Choose Create stack with the option With new resources (standard).
  4. Choose Upload a template file and select the local template file.
  5. Choose Next.
  6. Complete the setup steps, entering a name for the environment, and leave the rest of the parameters as default.
  7. On the last step, acknowledge that resources will be created and choose Submit.

The creation can take 20 to 30 minutes, until the status of the stack changes to CREATE_COMPLETE. The resource that takes the most time is the Airflow environment. While it's being created, you can continue with the following steps, until you're required to open the Airflow UI.

An Airflow workflow is based on a DAG, which is defined by a Python file that programmatically specifies the different tasks involved and their interdependencies. Complete the following steps to create the DAG:

  1. Create a local file named emr_dag.py using a text editor with the following snippet, and configure the EMR-related parameters based on the existing Data Pipeline definition:
    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import (
        EmrCreateJobFlowOperator,
        EmrAddStepsOperator,
    )
    from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor
    from airflow.utils.dates import days_ago
    from datetime import timedelta
    import os
    DAG_ID = os.path.basename(__file__).replace(".py", "")
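    # EMR step definition; replace the JAR and arguments with those from your HadoopActivity.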
    SPARK_STEPS = [
        {
            'Name': 'calculate_pi',
            'ActionOnFailure': 'CONTINUE',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['spark-example', 'SparkPi', '10'],
            },
        }
    ]
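    # EMR cluster configuration; mirror the release label, instance types, and roles from your EmrCluster node.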
    JOB_FLOW_OVERRIDES = {
        'Name': 'my-demo-cluster',
        'ReleaseLabel': 'emr-6.1.0',
        'Applications': [
            {
                'Name': 'Spark'
            },
        ],
        'Instances': {
            'InstanceGroups': [
                {
                    'Name': "Master nodes",
                    'Market': 'ON_DEMAND',
                    'InstanceRole': 'MASTER',
                    'InstanceType': 'm5.xlarge',
                    'InstanceCount': 1,
                },
                {
                    'Name': "Slave nodes",
                    'Market': 'ON_DEMAND',
                    'InstanceRole': 'CORE',
                    'InstanceType': 'm5.xlarge',
                    'InstanceCount': 2,
                }
            ],
            'KeepJobFlowAliveWhenNoSteps': False,
            'TerminationProtected': False,
        },
        'VisibleToAllUsers': True,
        'JobFlowRole': 'EMR_EC2_DefaultRole',
        'ServiceRole': 'EMR_DefaultRole'
    }
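    # The DAG creates the cluster, submits the step, and waits for the step to complete.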
    with DAG(
        dag_id=DAG_ID,
        start_date=days_ago(1),
        schedule_interval="@once",
        dagrun_timeout=timedelta(hours=2),
        catchup=False,
        tags=['emr'],
    ) as dag:
        cluster_creator = EmrCreateJobFlowOperator(
            task_id='create_job_flow',
            job_flow_overrides=JOB_FLOW_OVERRIDES,
            aws_conn_id='aws_default',
        )
        step_adder = EmrAddStepsOperator(
            task_id='add_steps',
            job_flow_id=cluster_creator.output,
            aws_conn_id='aws_default',
            steps=SPARK_STEPS,
        )
        step_checker = EmrStepSensor(
            task_id='watch_step',
            job_flow_id=cluster_creator.output,
            step_id="{{ task_instance.xcom_pull(task_ids='add_steps')[0] }}",
            aws_conn_id='aws_default',
        )
        cluster_creator >> step_adder >> step_checker

Defining the schedule in Amazon MWAA is as simple as updating the schedule_interval parameter of the DAG. For example, to run the DAG daily, set schedule_interval="@daily".
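For example, the DAG definition from the previous step would change only in its schedule_interval argument:

    # Same DAG as before, now scheduled to run once a day.
    with DAG(
        dag_id=DAG_ID,
        start_date=days_ago(1),
        schedule_interval="@daily",
        dagrun_timeout=timedelta(hours=2),
        catchup=False,
        tags=['emr'],
    ) as dag:
        ...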

Now you deploy the workflow that invokes the Amazon EMR step you just created:

  1. On the Amazon S3 console, locate the bucket created by the CloudFormation template, which will have a name starting with the name of the stack followed by -environmentbucket- (for example, myairflowstack-environmentbucket-ap1qks3nvvr4).
  2. Inside that bucket, create a folder called dags, and in that folder, upload the DAG file emr_dag.py that you created in the previous section.
  3. On the Amazon MWAA console, navigate to the environment you deployed with the CloudFormation stack.

If the status is not yet Available, wait until it reaches that state. It shouldn't take longer than 30 minutes after you deployed the CloudFormation stack.

  4. Choose the environment link in the table to see the environment details.

It's configured to pick up DAGs from the bucket and folder you used in the previous steps. Airflow will monitor that folder for changes.
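If you prefer to automate the DAG upload (for example, from a CI/CD pipeline), a minimal boto3 sketch looks like the following; the bucket name is the example from the earlier step and should be replaced with your own.

    import boto3

    s3 = boto3.client("s3")

    # Upload the DAG into the dags/ prefix of the environment bucket (placeholder name).
    s3.upload_file(
        Filename="emr_dag.py",
        Bucket="myairflowstack-environmentbucket-ap1qks3nvvr4",
        Key="dags/emr_dag.py",
    )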

  5. Choose Open Airflow UI to open a new tab with the Airflow UI, using the integrated IAM security to sign you in.

If there are issues with the DAG file you created, Airflow will display an error at the top of the page indicating the lines affected. In that case, review the steps and upload the file again. After a few seconds, Airflow will parse the updated file and update or remove the error banner.

Clean up

After you migrate your existing Data Pipeline workload and verify that the migration was successful, delete your pipelines in Data Pipeline to stop further runs and billing.

Conclusion

In this blog post, we outlined several alternative AWS services for migrating your existing Data Pipeline workloads: AWS Glue to run and orchestrate Apache Spark applications, AWS Step Functions to orchestrate workflows involving various other AWS services, and Amazon MWAA to manage workflow orchestration using Apache Airflow. By migrating, you will be able to run your workloads with a broader range of data integration functionality. If you have additional questions, post in the comments or check out the migration examples in our documentation.


About the authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team and AWS Data Pipeline team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Vaibhav Porwal is a Senior Software Development Engineer on the AWS Glue and AWS Data Pipeline team. He works on solving problems in the orchestration space by building low-cost, repeatable, scalable workflow systems that enable customers to create their ETL pipelines seamlessly.

Sriram Ramarathnam is a Software Development Manager on the AWS Glue and AWS Data Pipeline team. His team works on solving challenging distributed systems problems for data integration across AWS serverless and serverful compute offerings.

Matt Su is a Senior Product Manager on the AWS Glue team and AWS Data Pipeline team. He enjoys helping customers discover insights and make better decisions using their data with AWS Analytics services. In his spare time, he enjoys snowboarding and gardening.

