Perform reindexing in Amazon OpenSearch Serverless using Amazon OpenSearch Ingestion


Amazon OpenSearch Serverless is a serverless deployment option for Amazon OpenSearch Service that makes it straightforward to run search and analytics workloads without managing infrastructure. Customers using OpenSearch Serverless often need to copy documents between two indexes within the same collection or across different collections. This primarily arises from two scenarios:

  • Reindexing – You periodically need to update or modify index mappings because of evolving data needs or schema changes
  • Disaster recovery – Although OpenSearch Serverless data is inherently durable, you may want to copy data across AWS Regions for added redundancy and resiliency

Amazon OpenSearch Ingestion recently launched a feature that supports OpenSearch as a source. OpenSearch Ingestion, a fully managed, serverless data collector, facilitates real-time ingestion of log, metric, and trace data into OpenSearch Service domains and OpenSearch Serverless collections. You can use this feature to address both scenarios by reading the data from an OpenSearch Serverless collection. This capability allows you to effortlessly copy data between indexes, making data management tasks more streamlined and eliminating the need for custom code.

In this post, we outline the steps to copy data between two indexes in the same OpenSearch Serverless collection using the new OpenSearch source feature of OpenSearch Ingestion. This is particularly helpful for reindexing operations where you want to change your data schema. OpenSearch Serverless and OpenSearch Ingestion are both serverless services that enable you to seamlessly handle your data workflows, providing optimal performance and scalability.

Solution overview

The following diagram shows the flow of copying documents from the source index to the destination index using an OpenSearch Ingestion pipeline.

Implementing the solution consists of the following steps:

  1. Create an AWS Identity and Access Management (IAM) role to use as an OpenSearch Ingestion pipeline role.
  2. Update the data access policy attached to the OpenSearch Serverless collection.
  3. Create an OpenSearch Ingestion pipeline that simply copies data from one index to another, or optionally create an index template using the OpenSearch Ingestion pipeline to define explicit mappings, and then copy the data from the source index to the destination index with the defined mapping applied.

Prerequisites

To get started, you must have an active OpenSearch Serverless collection with an index that you want to reindex (copy). Refer to Creating collections to learn more about creating a collection.

When the collection is ready, note the following details:

  • The endpoint of the OpenSearch Serverless collection
  • The name of the index from which the documents need to be copied
  • If the collection is defined as a VPC collection, note down the name of the network policy attached to the collection

You use these details in the ingestion pipeline configuration.
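If you prefer to look up the collection endpoint programmatically rather than copying it from the console, the boto3 `opensearchserverless` client exposes a `batch_get_collection` API. The following is a minimal sketch; the parsing helper is pure, so it is shown against an illustrative response (the collection name and endpoint below are placeholders, not real resources):

```python
# Sketch: extract a collection's endpoint from a BatchGetCollection response.
# The helper is pure so it can be exercised against a sample response; the
# actual boto3 call is shown commented out.

def collection_endpoint(batch_get_collection_response, name):
    """Return the collectionEndpoint for the named collection, or None."""
    for detail in batch_get_collection_response.get("collectionDetails", []):
        if detail.get("name") == name:
            return detail.get("collectionEndpoint")
    return None

# With AWS credentials configured, the response would come from:
# import boto3
# client = boto3.client("opensearchserverless")
# response = client.batch_get_collection(names=["my-collection"])

# Illustrative response shape (placeholder values):
sample_response = {
    "collectionDetails": [{
        "name": "my-collection",
        "collectionEndpoint": "https://abc1234567890.us-east-1.aoss.amazonaws.com",
    }]
}

print(collection_endpoint(sample_response, "my-collection"))
```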

Create an IAM role to use as a pipeline role

An OpenSearch Ingestion pipeline needs certain permissions to pull data from the source and write to its sink. For this walkthrough, the source and sink are the same, but if the source and sink collections are different, modify the policy accordingly.

Complete the following steps:

  1. Create an IAM policy (opensearch-ingestion-pipeline-policy) that provides permission to read and send data to the OpenSearch Serverless collection. The following is a sample policy with least privileges (modify {account-id}, {region}, {collection-id}, and {collection-name} accordingly):
    {
        "Version": "2012-10-17",
        "Statement": [{
                "Action": [
                    "aoss:BatchGetCollection",
                    "aoss:APIAccessAll"
                ],
                "Effect": "Allow",
                "Resource": "arn:aws:aoss:{region}:{account-id}:collection/{collection-id}"
            },
            {
                "Action": [
                    "aoss:CreateSecurityPolicy",
                    "aoss:GetSecurityPolicy",
                    "aoss:UpdateSecurityPolicy"
                ],
                "Effect": "Allow",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        "aoss:collection": "{collection-name}"
                    }
                }
            }
        ]
    }
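If you script the policy creation, the placeholder substitution can be done with simple string replacement (not `str.format`, which would trip over the JSON braces). The following sketch renders the policy above and validates that the result is well-formed JSON; the account ID, Region, and collection identifiers are illustrative:

```python
import json

# Sketch: render the least-privilege pipeline policy by substituting the
# {placeholder} tokens, then parse it to confirm it is valid JSON.

POLICY_TEMPLATE = """{
    "Version": "2012-10-17",
    "Statement": [{
            "Action": ["aoss:BatchGetCollection", "aoss:APIAccessAll"],
            "Effect": "Allow",
            "Resource": "arn:aws:aoss:{region}:{account-id}:collection/{collection-id}"
        },
        {
            "Action": [
                "aoss:CreateSecurityPolicy",
                "aoss:GetSecurityPolicy",
                "aoss:UpdateSecurityPolicy"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Condition": {"StringEquals": {"aoss:collection": "{collection-name}"}}
        }
    ]
}"""

def render_policy(account_id, region, collection_id, collection_name):
    """Replace the {placeholder} tokens and return a parsed policy document."""
    rendered = (POLICY_TEMPLATE
                .replace("{account-id}", account_id)
                .replace("{region}", region)
                .replace("{collection-id}", collection_id)
                .replace("{collection-name}", collection_name))
    return json.loads(rendered)  # raises ValueError if the result is malformed

policy = render_policy("111122223333", "us-east-1", "abc1234567890", "my-collection")
print(policy["Statement"][0]["Resource"])
```

The parsed document can then be passed to `iam.create_policy` as `PolicyDocument=json.dumps(policy)`.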

  2. Create an IAM role (opensearch-ingestion-pipeline-role) that the OpenSearch Ingestion pipeline will assume. While creating the role, use the policy you created (opensearch-ingestion-pipeline-policy). The role should have the following trust relationship (modify {account-id} and {region} accordingly):
    {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Service": "osis-pipelines.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "{account-id}"
                },
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:osis:{region}:{account-id}:pipeline/*"
                }
            }
        }]
    }

  3. Record the ARN of the newly created IAM role (arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role).
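The role creation can also be scripted. The following sketch builds the trust policy shown above as a Python dict; the boto3 `iam.create_role` call is shown commented out, and the account ID and Region are placeholders:

```python
# Sketch: build the trust policy for the pipeline role programmatically.

def build_trust_policy(account_id, region):
    """Trust policy allowing OpenSearch Ingestion pipelines in this account."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "osis-pipelines.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"aws:SourceAccount": account_id},
                "ArnLike": {
                    "aws:SourceArn": f"arn:aws:osis:{region}:{account_id}:pipeline/*"
                },
            },
        }],
    }

trust_policy = build_trust_policy("111122223333", "us-east-1")
print(trust_policy["Statement"][0]["Condition"]["ArnLike"]["aws:SourceArn"])

# With credentials configured, the role itself would be created with:
# import json, boto3
# iam = boto3.client("iam")
# iam.create_role(
#     RoleName="opensearch-ingestion-pipeline-role",
#     AssumeRolePolicyDocument=json.dumps(trust_policy),
# )
```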

Update the data access policy attached to the OpenSearch Serverless collection

After you create the IAM role, you must update the data access policy attached to the OpenSearch Serverless collection. Data access policies control access to the OpenSearch operations that OpenSearch Serverless supports, such as PUT <index> or GET _cat/indices. To perform the update, complete the following steps:

  1. On the OpenSearch Service console, under Serverless in the navigation pane, choose Collections.
  2. From the list of collections, choose your OpenSearch Serverless collection.
  3. On the Overview tab, in the Data access section, choose the associated policy.
  4. Choose Edit.
  5. Edit the policy in the JSON editor to add the following JSON rule block to the existing JSON (modify {account-id} and {collection-name} accordingly):
    {
        "Rules": [{
            "Resource": [
                "index/{collection-name}/*"
            ],
            "Permission": [
                "aoss:CreateIndex",
                "aoss:UpdateIndex",
                "aoss:DescribeIndex",
                "aoss:ReadDocument",
                "aoss:WriteDocument"
            ],
            "ResourceType": "index"
        }],
        "Principal": [
            "arn:aws:iam::{account-id}:role/opensearch-ingestion-pipeline-role"
        ],
        "Description": "Provide access to OpenSearch Ingestion Pipeline Role"
    }

You can also use the Visual Editor method to choose Add another rule and add the preceding permissions for arn:aws:iam::{account-id}:role/opensearch-ingestion-pipeline-role.

  6. Choose Save.

You have now successfully allowed the OpenSearch Ingestion role to perform OpenSearch operations against the OpenSearch Serverless collection.
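A data access policy is a JSON array of rule blocks, so the edit above amounts to appending one more block to that array. The following sketch shows the merge; the starting policy, account ID, and collection name are illustrative:

```python
import json

# Sketch: append the pipeline-role rule block to an existing data access
# policy document (a JSON array of rule blocks).

PIPELINE_RULE_BLOCK = {
    "Rules": [{
        "Resource": ["index/{collection-name}/*"],
        "Permission": [
            "aoss:CreateIndex",
            "aoss:UpdateIndex",
            "aoss:DescribeIndex",
            "aoss:ReadDocument",
            "aoss:WriteDocument",
        ],
        "ResourceType": "index",
    }],
    "Principal": ["arn:aws:iam::{account-id}:role/opensearch-ingestion-pipeline-role"],
    "Description": "Provide access to OpenSearch Ingestion Pipeline Role",
}

def add_pipeline_rule(policy_json, account_id, collection_name):
    """Return the parsed policy with the pipeline rule block appended."""
    block = json.loads(
        json.dumps(PIPELINE_RULE_BLOCK)
        .replace("{account-id}", account_id)
        .replace("{collection-name}", collection_name)
    )
    policy = json.loads(policy_json)
    policy.append(block)
    return policy

existing_policy = json.dumps([])  # an empty policy, for illustration
updated = add_pipeline_rule(existing_policy, "111122223333", "my-collection")
print(updated[-1]["Rules"][0]["Resource"][0])
```

In practice the current policy would be fetched and the merged result applied with the collection's access policy APIs (for example, `get_access_policy` and `update_access_policy` on the boto3 `opensearchserverless` client).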

Create and configure the OpenSearch Ingestion pipeline to copy the data from one index to another

Complete the following steps:

  1. On the OpenSearch Service console, choose Pipelines under Ingestion in the navigation pane.
  2. Choose Create a pipeline.
  3. In Choose Blueprint, select OpenSearchDataMigrationPipeline.
  4. For Pipeline name, enter a name (for example, sample-ingestion-pipeline).
  5. For Pipeline capacity, you can define the minimum and maximum capacity to scale the resources. For this walkthrough, you can use the default value of 2 Ingestion OCUs for Min capacity and 4 Ingestion OCUs for Max capacity. You can also choose different values, because OpenSearch Ingestion automatically scales your pipeline capacity according to your estimated workload, based on the minimum and maximum Ingestion OpenSearch Compute Units (Ingestion OCUs) that you specify.
  6. Update the following information for the source:
    1. Uncomment hosts and specify the endpoint of the existing OpenSearch Serverless collection that you noted as part of the prerequisites.
    2. Uncomment include and index_name_regex, and specify the name of the index that will act as the source (in this demo, we use logs-2024.03.01).
    3. Uncomment region under aws and specify the AWS Region where your OpenSearch Serverless collection is (for example, us-east-1).
    4. Uncomment sts_role_arn under aws and specify the role that has permission to read data from the OpenSearch Serverless collection (for example, arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role). This is the same role that was added to the data access policy of the collection.
    5. Update the serverless flag to true.
    6. If the OpenSearch Serverless collection has VPC access, uncomment serverless_options and network_policy_name and specify the name of the network policy used for the collection.
    7. Uncomment scheduling, interval, index_read_count, and start_time and modify these parameters accordingly.
      Using these parameters makes sure the OpenSearch Ingestion pipeline processes the indexes multiple times (to pick up new documents).
      Note – If the collection specified in the sink is of the Time series or Vector search type, you can keep the scheduling, interval, index_read_count, and start_time parameters commented.
  7. Update the following information for the sink:
    1. Uncomment hosts and specify the endpoint of the existing OpenSearch Serverless collection.
    2. Uncomment sts_role_arn under aws and specify the role that has permission to write data into the OpenSearch Serverless collection (for example, arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role). This is the same role that was added to the data access policy of the collection.
    3. Update the serverless flag to true.
    4. If the OpenSearch Serverless collection has VPC access, uncomment serverless_options and network_policy_name and specify the name of the network policy used for the collection.
    5. Update the value for index and provide the index name to which you want to transfer the documents (for example, new-logs-2024.03.01).
    6. For document_id, you can get the ID from the document metadata in the source and use the same in the target.
      However, it is important to note that custom document IDs are only supported for the Search type of collection. If your collection is of the Time series or Vector search type, you should comment out the document_id line.
    7. (Optional) The values for the bucket, region, and sts_role_arn keys within the dlq section can be modified to capture any failed requests in an S3 bucket.
      Note – If you configure a DLQ, you need to grant additional permissions to opensearch-ingestion-pipeline-role. Refer to Writing to a dead-letter queue for the required changes.
      For this walkthrough, you will not set up a DLQ, so you can remove the entire dlq block.
  8. Choose Validate pipeline to validate the pipeline configuration.
  9. For Network settings, choose your preferred setting:
    1. Choose VPC access and select your VPC, subnet, and security group to set up the access privately. Choose this option if the OpenSearch Serverless collection has VPC access. AWS recommends using a VPC endpoint for all production workloads.
    2. Choose Public to use public access. For this walkthrough, we select Public because the collection is also accessible from a public network.
  10. For Log Publishing Option, you can either create a new Amazon CloudWatch group or use an existing CloudWatch group to write the ingestion logs. This provides access to information about errors and warnings raised during the operation, which can help during troubleshooting. For this walkthrough, choose Create new group.
  11. Choose Next, and verify the details you specified for your pipeline settings.
  12. Choose Create pipeline.

It will take a couple of minutes to create the ingestion pipeline. After the pipeline is created, you will see the documents in the destination index specified in the sink (for example, new-logs-2024.03.01). After all the documents are copied, you can validate the number of documents by using the count API.
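A simple way to perform that validation is to issue GET <index>/_count against both indexes and compare the results. The sketch below works on the response dicts; the counts shown are illustrative, and in practice the responses would come from signed HTTP requests to the collection endpoint (for example, via the opensearch-py client's `count()` method):

```python
# Sketch: validate the copy by comparing _count responses from the source
# and destination indexes.

def counts_match(source_count_response, dest_count_response):
    """True when both indexes report the same document count."""
    return source_count_response["count"] == dest_count_response["count"]

# Illustrative responses mirroring the shape returned by GET <index>/_count:
source = {"count": 1000, "_shards": {"total": 1, "successful": 1, "failed": 0}}
dest = {"count": 1000, "_shards": {"total": 1, "successful": 1, "failed": 0}}
print(counts_match(source, dest))
```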

When the process is complete, you have the option to stop or delete the pipeline. If you choose to keep the pipeline running, it will continue to copy new documents from the source index according to the defined schedule, if specified.

In this walkthrough, the endpoint defined in the hosts parameter under the source and sink of the pipeline configuration belonged to the same collection, which was of the Search type. If the collections are different, you must modify the permissions for the IAM role (opensearch-ingestion-pipeline-role) to allow access to both collections. Additionally, make sure you update the data access policy for both collections to grant access to the OpenSearch Ingestion pipeline.

Create an index template using the OpenSearch Ingestion pipeline to define mapping

In OpenSearch, you can define how documents and their fields are stored and indexed by creating a mapping. The mapping specifies the list of fields for a document. Every field in the document has a field type, which defines the type of data the field contains. OpenSearch Service dynamically maps data types in each incoming document if an explicit mapping is not defined. However, you can use the template_type parameter with the index-template value and template_content with the JSON content of the index template in the pipeline configuration to define explicit mapping rules. You also need to define the index_type parameter with the value custom.

The following code shows an example of the sink portion of the pipeline and the usage of index_type, template_type, and template_content:

sink:
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: [ "<<OpenSearch-Serverless-Collection-Endpoint>>" ]
        aws:
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role"
          # Provide the region of the domain.
          region: "us-east-1"
          # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection
          serverless: true
          # serverless_options:
            # Specify a name here to create or update network policy for the serverless collection
            # network_policy_name: "network-policy-name"
        # This will make it so each document in the source cluster will be written to the same index in the destination cluster
        index: "new-logs-2024.03.01"
        index_type: custom
        template_type: index-template
        template_content: >
          {
            "template" : {
              "mappings" : {
                "properties" : {
                  "Data" : {
                    "type" : "text"
                  },
                  "EncodedColors" : {
                    "type" : "binary"
                  },
                  "Type" : {
                    "type" : "keyword"
                  },
                  "LargeDouble" : {
                    "type" : "double"
                  }
                }
              }
            }
          }
        # This will make it so each document in the source cluster will be written with the same document_id in the destination cluster
        document_id: "${getMetadata(\"opensearch-document_id\")}"
        # Enable the 'distribution_version' setting if the AWS OpenSearch Service domain is of version Elasticsearch 6.x
        # distribution_version: "es6"
        # Enable and switch the 'enable_request_compression' flag if the default compression setting is changed in the domain. See https://docs.aws.amazon.com/opensearch-service/latest/developerguide/gzip.html
        # enable_request_compression: true/false
        # Enable the S3 DLQ to capture any failed requests in an S3 bucket
        # dlq:
          # s3:
            # Provide an S3 bucket
            # bucket: "<<your-dlq-bucket-name>>"
            # Provide a key path prefix for the failed requests
            # key_path_prefix: "<<logs/dlq>>"
            # Provide the region of the bucket.
            # region: "<<us-east-1>>"
            # Provide a Role ARN with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com
            # sts_role_arn: "<<arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role>>"

Alternatively, you can create the index with the mapping in the collection first, before you start the pipeline.
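If you take that route, the same explicit mapping is supplied as the body of the index-creation request instead of as template_content. The following sketch builds that body; the field names mirror the example above, and the `client.indices.create` call is an assumed opensearch-py usage shown for illustration:

```python
# Sketch: build the PUT <index> request body carrying the explicit mappings,
# so the destination index exists with the desired schema before the
# pipeline starts writing to it.

def index_creation_body():
    """Request body with the explicit mappings for the destination index."""
    return {
        "mappings": {
            "properties": {
                "Data": {"type": "text"},
                "EncodedColors": {"type": "binary"},
                "Type": {"type": "keyword"},
                "LargeDouble": {"type": "double"},
            }
        }
    }

body = index_creation_body()
print(body["mappings"]["properties"]["Type"]["type"])

# With an opensearch-py client configured for the collection endpoint, the
# index could then be created with something like:
# client.indices.create(index="new-logs-2024.03.01", body=body)
```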

If you want to create a template using an OpenSearch Ingestion pipeline, you need to provide the aoss:UpdateCollectionItems and aoss:DescribeCollectionItems permissions for the collection in the data access policy for the pipeline role (opensearch-ingestion-pipeline-role). The updated JSON block for the rule would look like the following:

{
    "Rules": [
      {
        "Resource": [
          "collection/{collection-name}"
        ],
        "Permission": [
          "aoss:UpdateCollectionItems",
          "aoss:DescribeCollectionItems"
        ],
        "ResourceType": "collection"
      },
      {
        "Resource": [
          "index/{collection-name}/*"
        ],
        "Permission": [
          "aoss:CreateIndex",
          "aoss:UpdateIndex",
          "aoss:DescribeIndex",
          "aoss:ReadDocument",
          "aoss:WriteDocument"
        ],
        "ResourceType": "index"
      }
    ],
    "Principal": [
      "arn:aws:iam::{account-id}:role/opensearch-ingestion-pipeline-role"
    ],
    "Description": "Provide access to OpenSearch Ingestion Pipeline Role"
  }

Conclusion

In this post, we showed how to use an OpenSearch Ingestion pipeline to copy data from one index to another in an OpenSearch Serverless collection. OpenSearch Ingestion also allows you to perform transformations on your data using various processors. AWS offers various resources for you to quickly start building pipelines using OpenSearch Ingestion. You can use various built-in pipeline integrations to quickly ingest data from Amazon DynamoDB, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Security Lake, Fluent Bit, and many more. You can use the OpenSearch Ingestion blueprints to build data pipelines with minimal configuration changes.


About the Authors

Utkarsh Agarwal is a Cloud Support Engineer on the Support Engineering team at Amazon Web Services. He specializes in Amazon OpenSearch Service. He provides guidance and technical assistance to customers, enabling them to build scalable, highly available, and secure solutions in the AWS Cloud. In his free time, he enjoys watching movies, TV series, and of course, cricket. Lately, he has also been attempting to master the art of cooking; the taste buds are excited, but the kitchen might disagree.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.
