Migrate data from an on-premises Hadoop environment to Amazon S3 using S3DistCp with AWS Direct Connect

This post demonstrates how to migrate nearly any amount of data from an on-premises Apache Hadoop environment to Amazon Simple Storage Service (Amazon S3) by using S3DistCp on Amazon EMR with AWS Direct Connect.

To copy data to a target EMR cluster, traditional Hadoop DistCp must be run on the source cluster to move data from one cluster to another. This invokes a MapReduce job on the source cluster and can consume a lot of cluster resources (depending on the data volume). To avoid this problem and minimize the load on the source cluster, you can use S3DistCp with Direct Connect to migrate terabytes of data from an on-premises Hadoop environment to Amazon S3. This process runs the job on the target EMR cluster, minimizing the burden on the source cluster.

This post provides instructions for using S3DistCp to migrate data to the AWS Cloud. Apache DistCp is an open source tool that you can use to copy large amounts of data. S3DistCp is similar to DistCp, but optimized to work with AWS, particularly Amazon S3. Compared to Hadoop DistCp, S3DistCp is more scalable, offers higher throughput, and copies large numbers of objects efficiently in parallel across S3 buckets and across AWS accounts.

Solution overview

The architecture for this solution consists of the following components:

  • Source technology stack:
    • A Hadoop cluster with connectivity to the target EMR cluster over Direct Connect
  • Target technology stack:
    • An EMR cluster that runs S3DistCp
    • An S3 bucket as the migration target

The following architecture diagram shows how you can use S3DistCp from the target EMR cluster to migrate massive volumes of data from an on-premises Hadoop environment over a private network connection, such as Direct Connect, to Amazon S3.

This migration approach uses the following tools to perform the migration:

  • S3DistCp – S3DistCp is similar to DistCp, but optimized to work with AWS, particularly Amazon S3. The command for S3DistCp in Amazon EMR version 4.0 and later is s3-dist-cp, which you add as a step in a cluster or run on the command line (see the example after this list). With S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into Hadoop Distributed File System (HDFS), where it can be processed by subsequent steps in your EMR cluster. You can also use S3DistCp to copy data between S3 buckets or from HDFS to Amazon S3. S3DistCp is more scalable and efficient for parallel copying of large numbers of objects across buckets and across AWS accounts.
  • Amazon S3 – Amazon S3 is an object storage service. You can use Amazon S3 to store and retrieve any amount of data at any time, from anywhere on the web.
  • Amazon VPC – Amazon VPC provisions a logically isolated section of the AWS Cloud where you can launch AWS resources in a virtual network that you've defined. This virtual network closely resembles a traditional network that you'd operate in your own data center, with the benefits of using the scalable infrastructure of AWS.
  • AWS Identity and Access Management (IAM) – IAM is a web service for securely controlling access to AWS services. With IAM, you can centrally manage users, security credentials such as access keys, and permissions that control which AWS resources users and applications can access.
  • Direct Connect – Direct Connect links your internal network to a Direct Connect location over a standard Ethernet fiber-optic cable. One end of the cable is connected to your router, the other to a Direct Connect router. With this connection, you can create virtual interfaces directly to public AWS services (for example, to Amazon S3) or to Amazon VPC, bypassing internet service providers in your network path. A Direct Connect location provides access to AWS in the AWS Region with which it's associated. You can use a single connection in a public Region or in AWS GovCloud (US) to access public AWS services in all other public Regions.
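As a concrete example of the step-based approach, the following AWS CLI sketch submits S3DistCp to a running EMR cluster through command-runner.jar. The cluster ID is a placeholder, and the source and destination paths are the same illustrative paths used later in this post.

# Submit S3DistCp as an EMR step (cluster ID, paths, and bucket name are placeholders)
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=CUSTOM_JAR,Name=S3DistCpCopy,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[s3-dist-cp,--src,hdfs://hadoopcluster01.test.amazon.local/user/hive/warehouse/test.db/test_table01,--dest,s3://<BUCKET_NAME>/user/hive/warehouse/test.db/test_table01]'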

In the following sections, we discuss the steps to perform the data migration using S3DistCp.

Prerequisites

Before you begin, you should have the following prerequisites:

  • An EMR cluster with a custom IAM role for Amazon EMR attached, as referenced later in this post
  • Network connectivity from the EMR cluster to the source Hadoop cluster, such as over Direct Connect
  • A target S3 bucket for the migrated data

Get the active NameNode from the source Hadoop cluster

Sign in to any of the nodes on the source cluster and run the following commands in bash to get the active NameNode on the cluster.

On newer versions of Hadoop, run the following command to get the service status, which lists the active NameNode on the cluster:

[hadoop@hadoopcluster01 ~]$ hdfs haadmin -getAllServiceState
hadoopcluster01.test.amazon.local:8020                active
hadoopcluster02.test.amazon.local:8020                standby
hadoopcluster03.test.amazon.local:8020                standby

On older versions of Hadoop, run the following function in bash to get the active NameNode on the cluster:

[hadoop@hadoopcluster01 ~]$ getActiveNameNode(){
    # Look up the HA nameservice and the NameNode IDs configured for it
    nameservice=$(hdfs getconf -confKey dfs.nameservices);
    ns=$(hdfs getconf -confKey dfs.ha.namenodes.${nameservice});
    IFS=',' read -ra ADDR <<< "$ns"
    activeNode=""
    # Check the HA state of each NameNode ID and remember the active one
    for n in "${ADDR[@]}"; do
        state=$(hdfs haadmin -getServiceState $n)
        if [ "$state" = "active" ]; then
            echo "$state ==>$n"
            activeNode=$n
        fi
    done
    # Resolve and print the RPC address (FQDN:port) of the active NameNode
    activeNodeFQDN=$(hdfs getconf -confKey dfs.namenode.rpc-address.${nameservice}.${activeNode})
    echo $activeNodeFQDN;
}

[hadoop@hadoopcluster01 ~]$ getActiveNameNode
active ==>namenode863
hadoopcluster01.test.amazon.local:8020

Validate connectivity from the EMR cluster to the source Hadoop cluster

As mentioned in the prerequisites, you should have an EMR cluster with a custom IAM role for Amazon EMR attached. Run the following command to validate the connectivity from the target EMR cluster to the source Hadoop cluster:

[hadoop@emrcluster01 ~]$ telnet hadoopcluster01.test.amazon.local 8020
Trying 192.168.0.1...
Connected to hadoopcluster01.test.amazon.local.
Escape character is '^]'.
^]

Alternatively, you can run the following command:

[hadoop@emrcluster01 ~]$ curl -v telnet://hadoopcluster01.test.amazon.local:8020
*   Trying 192.168.0.1:8020...
* Connected to hadoopcluster01.test.amazon.local (192.168.0.1) port 8020 (#0)

Validate that the source HDFS path exists

Check whether the source HDFS path is valid. If the following command returns 0, indicating that the path is valid, you can proceed to the next step:

[hadoop@emrcluster01 ~]$ hdfs dfs -test -d hdfs://hadoopcluster01.test.amazon.local/user/hive/warehouse/test.db/test_table01
[hadoop@emrcluster01 ~]$ echo $?
0

Transfer data using S3DistCp

To transfer the source HDFS folder to the target S3 bucket, use the following command:

s3-dist-cp --src hdfs://hadoopcluster01.test.amazon.local/user/hive/warehouse/test.db/test_table01 --dest s3://<BUCKET_NAME>/user/hive/warehouse/test.db/test_table01

To transfer large files in multipart chunks, use the following command to set the chunk size (in MiB):

s3-dist-cp --src hdfs://hadoopcluster01.test.amazon.local/user/hive/warehouse/test.db/test_table01 --dest s3://<BUCKET_NAME>/user/hive/warehouse/test.db/test_table01 --multipartUploadChunkSize=1024

This invokes a MapReduce job on the target EMR cluster. Depending on the volume of the data and the bandwidth speed, the job can take a few minutes to a few hours to complete.
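If you only need to copy a subset of the files under the source path, S3DistCp also accepts a regular expression filter through its --srcPattern option. The pattern below is purely illustrative, assuming the table's data files follow the usual part-* naming:

# Copy only the files whose full path matches the regex (illustrative pattern)
s3-dist-cp --src hdfs://hadoopcluster01.test.amazon.local/user/hive/warehouse/test.db/test_table01 --dest s3://<BUCKET_NAME>/user/hive/warehouse/test.db/test_table01 --srcPattern=".*part-.*"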

To get the list of running YARN applications on the cluster, run the following command:
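# List YARN applications currently in the RUNNING state on the target EMR cluster
yarn application -list -appStates RUNNING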

Validate the migrated data

After the preceding MapReduce job completes successfully, use the following steps to validate the data that was copied over:

source_size=$(hdfs dfs -du -s hdfs://hadoopcluster01.test.amazon.local/user/hive/warehouse/test.db/test_table01 | awk -F' ' '{print $1}')
target_size=$(aws s3 ls --summarize --recursive s3://<BUCKET_NAME>/user/hive/warehouse/test.db/test_table01 | grep "Total Size:" | awk -F' ' '{print $3}')

printf "Source HDFS folder size in bytes: $source_size\n"
printf "Target S3 folder size in bytes: $target_size\n"

If the source and target sizes aren't equal, perform the cleanup step in the next section and repeat the preceding S3DistCp step.
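To make that check explicit, you can compare the two byte counts in bash, building on the source_size and target_size variables captured above; this is a minimal sketch:

# Compare the byte counts captured above; a mismatch means the copy should be cleaned up and re-run
if [ "$source_size" -eq "$target_size" ]; then
    echo "Validation passed: source and target sizes match"
else
    echo "Validation failed: source=$source_size bytes, target=$target_size bytes"
fi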

Clean up partially copied or errored out partitions and files

S3DistCp doesn't clean up partially copied files and partitions if it fails while copying. Clean up partially copied or errored out partitions and files before you reinitiate the S3DistCp process. To clean up objects on Amazon S3, use the following AWS CLI command to perform the delete operation:

aws s3 rm s3://<BUCKET_NAME>/path/to/the/object --recursive
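If you want to preview exactly which objects would be removed before actually deleting them, the AWS CLI supports a dry-run mode:

# Preview the objects that the delete would remove, without deleting anything
aws s3 rm s3://<BUCKET_NAME>/path/to/the/object --recursive --dryrun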

Best practices

To avoid copy errors when using S3DistCp to copy a single file (instead of a directory) from Amazon S3 to HDFS, use Amazon EMR 5.33.0 or later, or Amazon EMR 6.3.0 or later.

Limitations

The following are limitations of this approach:

  • If S3DistCp is unable to copy some or all of the specified files, the cluster step fails and returns a non-zero error code. If this occurs, S3DistCp doesn't clean up partially copied files.
  • S3DistCp doesn't support concatenation for Parquet files. Use PySpark instead. For more information, see Concatenating Parquet files in Amazon EMR.
  • VPC limitations apply to Direct Connect for Amazon S3. For more information, see AWS Direct Connect quotas.

Conclusion

In this post, we demonstrated the power of S3DistCp to migrate massive volumes of data from a source Hadoop cluster to a target S3 bucket or to HDFS on an EMR cluster. With S3DistCp, you can migrate terabytes of data without affecting the compute resources on the source cluster, compared with Hadoop DistCp.

For more information about using S3DistCp, see the following resources:


About the Author

Vicky Wilson Jacob is a Senior Data Architect with AWS Professional Services Analytics Practice. Vicky specializes in Big Data, Data Engineering, Machine Learning, Data Science, and Generative AI. He is passionate about technology and solving customer challenges. At AWS, he works with companies, helping customers implement big data, machine learning, analytics, and generative AI solutions on the cloud. Outside of work, he enjoys spending time with family, singing, and playing guitar.
