Elasticsearch cluster migration with zero downtime

Over the last few months we’ve been migrating our services to Amazon Virtual Private Cloud (VPC). There is no rocket science behind it, but guaranteeing zero downtime gets tricky in a few cases.

Today we want to talk about our Elasticsearch clusters. If downtime weren’t an issue, migrating them would have been straightforward: you could simply reindex all your documents into the new cluster and point all your apps to it. This wasn’t our scenario: we operate in almost 50 countries, so there is no right time to do so. Furthermore, the volume of data to reindex would require a huge time window that we cannot afford.

One of the strategies to reindex data within the same cluster is using aliases: you just need to point the alias to the new index once reindexing is done. This approach is not valid for our scenario for two reasons:

  • We are migrating to a different cluster in a different network

  • We cannot stop indexing documents during the migration. This means that while you are reindexing, other documents are still being indexed on the original cluster.

Since we didn’t want to lose any document and downtime was not an option, we came up with our final approach: versioning our documents during migration. Let’s get into details!

These are the steps we followed:

1. Versioning documents

There is only one consumer in our stack responsible for indexing documents into Elasticsearch. By adding a field indicating the migration version, let’s say, migration_version = 1, we would know which documents were indexed since we started the migration.

{
    "date": "2017-01-25 19:30:00",
    "id": 111273,
    "city": "Perth",
    "name": "Bruce Springsteen Perth",
    "artist": "Bruce Springsteen",
    "country": "Australia",
    "venue": "Perth Arena",
    "location": {
        "lat": "-31.948434",
        "lon": "115.85211"
    },
    "migration_version": 1
}
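Conceptually, the only change to the indexer is a small stamping step before each document is sent off. A minimal sketch in Python (the helper name and structure are ours, not the actual indexer code):

```python
# Hypothetical helper: stamp every outgoing document with the current
# migration version before it is sent to Elasticsearch.
MIGRATION_VERSION = 1

def stamp_migration_version(doc, version=MIGRATION_VERSION):
    """Return a copy of the document with the migration_version field set."""
    stamped = dict(doc)
    stamped["migration_version"] = version
    return stamped

event = {"id": 111273, "name": "Bruce Springsteen Perth", "city": "Perth"}
print(stamp_migration_version(event)["migration_version"])  # → 1
```

Returning a copy keeps the original document untouched, so replaying unstamped documents later stays possible.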

2. Migrating unversioned documents

Migrate all documents without a migration_version using elasticdump. This tool works by streaming an input to an output, where each can be either an Elasticsearch URL or a file. For reliability, we took a two-step approach:

  • Input: original Elasticsearch cluster, output: gzipped file.
elasticdump \
    --input=http://original-cluster:9200 \
    --input-index=catalog/events \
    --output=$ \
    --searchBody='{ "filter": { "bool": { "must_not": [ { "exists": { "field": "migration_version" } } ] } } }' \
    | gzip > data/events.json.gz
  • Input: the dump file (gunzipped first), output: new Elasticsearch cluster.
gunzip data/events.json.gz
elasticdump \
        --input=data/events.json \
        --output=http://new-cluster:9200 \
        --output-index=catalog/events
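Before loading the dump into the new cluster, a quick sanity check can confirm the filter did its job. A minimal sketch, assuming elasticdump’s one-JSON-object-per-line output format (the tiny file built here stands in for the real dump):

```python
import gzip
import json

def count_versioned(path):
    """Count documents in a gzipped dump that already carry a migration_version."""
    versioned = total = 0
    with gzip.open(path, "rt") as fh:
        for line in fh:
            total += 1
            doc = json.loads(line)
            if "migration_version" in doc.get("_source", {}):
                versioned += 1
    return versioned, total

# Build a tiny sample dump in the assumed one-object-per-line format.
sample = [{"_id": "1", "_source": {"city": "Perth"}},
          {"_id": "2", "_source": {"city": "Madrid"}}]
with gzip.open("events.json.gz", "wt") as fh:
    for doc in sample:
        fh.write(json.dumps(doc) + "\n")

print(count_versioned("events.json.gz"))  # → (0, 2)
```

If the first number is anything but zero, the searchBody filter let versioned documents slip through and the export should be repeated.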

3. Indexing in both clusters

Since our indexer can be configured to index documents into multiple clusters, we turned that on to start indexing documents in both clusters, this time incrementing the migration version: migration_version = 2.
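The fan-out itself is simple: one write per configured cluster, each carrying the bumped version. An illustrative sketch (the endpoints, function names, and URL shape are assumptions, not the real indexer):

```python
# Sketch of dual indexing: the indexer builds one index request per
# configured cluster, stamping the bumped migration version on each.
MIGRATION_VERSION = 2
CLUSTERS = ["http://original-cluster:9200", "http://new-cluster:9200"]

def build_index_requests(doc):
    """One (url, body) pair per cluster for the catalog/events index."""
    body = dict(doc, migration_version=MIGRATION_VERSION)
    return [("%s/catalog/events/%s" % (cluster, doc["id"]), body)
            for cluster in CLUSTERS]

for url, body in build_index_requests({"id": 111273, "city": "Perth"}):
    print(url, body["migration_version"])
```

Writing the same document to both clusters is what later makes the switch-over a configuration change rather than a synchronized deployment.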

4. Migrating v1 documents

At this point, both clusters hold all documents without a migration_version as well as all v2 documents, so it’s time to migrate the remaining ones: v1. This step is similar to step 2, with the slight difference that the query now exports v1 documents:

elasticdump \
    --input=http://original-cluster:9200 \
    --input-index=catalog/events \
    --output=$ \
    --searchBody='{ "filter": { "term": { "migration_version": 1 } } }' \
    | gzip > data/events_v1.json.gz

That's it? Are we missing anything? What happens if a v1 document changes during this process? We want to make sure we do not overwrite existing documents in our new cluster with stale v1 documents. This means using the create operation for bulk indexing, not index.

create will fail if a document with the same index, type, and ID already exists, whereas index will add or replace the document as necessary.
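The difference is visible in the bulk payload itself: each document is preceded by an action line, and that action is what decides overwrite behavior. A sketch of building such a payload with create (index name, type, and ID field are taken from the examples above):

```python
import json

# Sketch of a bulk payload using the create operation. Lines alternate
# between an action and its document; because the action is "create"
# (not "index"), Elasticsearch rejects IDs already present in the index
# instead of overwriting them.
def bulk_create_payload(docs, index="catalog", doc_type="events"):
    lines = []
    for doc in docs:
        lines.append(json.dumps({"create": {"_index": index,
                                            "_type": doc_type,
                                            "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

print(bulk_create_payload([{"id": 111273, "migration_version": 1}]))
```

With this payload, any v1 document whose ID was already written by the dual indexer (as v2) simply fails to insert, which is exactly what we want.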

5. Switch over

All our apps need to point to the new Elasticsearch endpoint, but this step can be done gradually since no synchronization is required (remember, our indexer keeps indexing into both clusters). This was a huge advantage for us: we didn’t have to coordinate deployments between teams, we just provided a reasonable time window to do the migration.
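Because reads go through configuration, each team can flip its own apps whenever convenient. An illustrative sketch of such a switch via an environment variable (the variable name and defaults are assumptions, not our actual setup):

```python
import os

# Illustrative: each app resolves its Elasticsearch endpoint from
# configuration, so switching clusters is a config change, not a deploy.
def elasticsearch_url():
    return os.environ.get("ELASTICSEARCH_URL", "http://original-cluster:9200")

print(elasticsearch_url())                                  # old default
os.environ["ELASTICSEARCH_URL"] = "http://new-cluster:9200"
print(elasticsearch_url())                                  # after the switch
```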

6. Clean up

Done with the migration? Then it’s time to stop indexing into the original cluster, shut down the old AWS instances, and grab a 🍺.