Back Up and Restore an Elasticache Cluster to the Same Instance

Posted on Friday 29th October 2021


 

Intro

Recently I encountered a problem where I had to restore a backup of an Elasticache cluster to an existing running Elasticache instance. This came from a requirement to scale down a Test cluster whilst being able to scale it back up again as and when required.

As anybody who's restored an Elasticache backup will know, this normally involves creating a new cluster into which the backup is restored. Whilst this is usually fine, it can cause issues if you have Route53 records pointing at your cluster endpoint. In our case, updating those records afterwards was problematic.
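For context, the conventional restore path looks something like the sketch below: the snapshot is restored into a brand-new cluster, which comes back with a brand-new endpoint (the cluster ID, node type and snapshot name here are placeholders).

aws elasticache create-cache-cluster \
    --cache-cluster-id my-restored-cluster \
    --cache-node-type cache.r4.2xlarge \
    --engine redis \
    --snapshot-name my-backup
# The restored cluster gets a fresh endpoint, so any Route53 records
# pointing at the old cluster would need updating afterwards.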

 

Aim

Be able to scale the cluster up and down as and when required. The cluster needs to be restored with the same data it had when it was scaled down. This allows us to drastically reduce costs whilst still being able to bring the cluster back up for load testing at any given time.

 

Logic

Approaching this problem, we have a 28GB Elasticache instance which we need to reduce to a cache.t3.micro. We then need to be able to scale this instance back up to a cache.r4.2xlarge.
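Before touching anything, it's worth confirming what we're starting from. A quick check of the current node type, assuming a single cluster in the account (the same assumption the scripts below make):

aws elasticache describe-cache-clusters \
    --query 'CacheClusters[0].[CacheClusterId,CacheNodeType,EngineVersion]' \
    --output table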

 

The Scale Down Script

The first part of this task was to write a script to scale down Elasticache. This script needed to do the following things:

  • Perform a Backup (Take a backup of the existing Elasticache instance)
  • Wait for the backup to be created
  • Empty the instance (Clear data by rebooting node. This allows us to scale to the smallest instance)
  • Modify the node (change the node type down to a cache.t3.micro)

This script assumes you have a single Redis Elasticache cluster with a single node. It also assumes that jq is installed on the system.
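Since the whole script leans on those assumptions, a small guard at the top can fail fast if they don't hold. A minimal sketch (not part of the original script):

# Fail fast if jq or the AWS CLI is missing - both are used throughout
command -v jq >/dev/null 2>&1 || { echo "jq is required but not installed" >&2; exit 1; }
command -v aws >/dev/null 2>&1 || { echo "aws CLI is required but not installed" >&2; exit 1; }

# Warn if more than one cache cluster exists, as the script always uses CacheClusters[0]
CLUSTER_COUNT=$(aws elasticache describe-cache-clusters | jq '.CacheClusters | length')
[ "$CLUSTER_COUNT" -eq 1 ] || echo "Warning: found $CLUSTER_COUNT clusters, using the first one" >&2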

set -e

###############
##Perform Backup
###############
CACHE_CLUSTER_ID=$(aws elasticache describe-cache-clusters | jq -r '.CacheClusters[0].CacheClusterId');
REPLICATION_GROUP_ID=$(aws elasticache describe-cache-clusters | jq -r '.CacheClusters[0].ReplicationGroupId');
printf "Found cache cluster id: %s \n" "$CACHE_CLUSTER_ID";

TODAYS_DATE=$(date '+%d-%m-%Y-%H-%M');
BACKUP_NAME="manual-backup-of-redis-taken-at-${TODAYS_DATE}"
printf "Creating backup for cache %s with name %s \n" "$CACHE_CLUSTER_ID" "$BACKUP_NAME"
aws elasticache create-snapshot --cache-cluster-id "$CACHE_CLUSTER_ID" --snapshot-name "$BACKUP_NAME"

###############
##Wait for Backup to be created
###############
NEW_BACKUP=$(aws elasticache describe-snapshots | jq --arg BACKUP_NAME "$BACKUP_NAME" -r '.Snapshots[] | select(.SnapshotName == $BACKUP_NAME)')
STATUS=$(echo "$NEW_BACKUP" | jq -r .SnapshotStatus);

while [ "$STATUS" != "available" ]; do
    BACKUP=$(aws elasticache describe-snapshots | jq --arg BACKUP_NAME "$BACKUP_NAME" -r '.Snapshots[] | select(.SnapshotName == $BACKUP_NAME)');
    STATUS=$(echo "$BACKUP" | jq -r .SnapshotStatus);
    printf "Backup is still being created, current status is %s \n" "$STATUS"
    sleep 5
done
printf "Backup has finished creating \n"

###############
##Exporting backup to S3
###############
printf "Exporting backup to S3 in the background\n"
aws elasticache copy-snapshot --source-snapshot-name "$BACKUP_NAME" --target-snapshot-name "$BACKUP_NAME" --target-bucket my-s3-bucket

##############
#Reboot the node to empty the data
##############
NODE_ID=$(aws elasticache describe-cache-clusters --show-cache-node-info | jq -r '.CacheClusters[0].CacheNodes[0].CacheNodeId')
printf "Rebooting the node with ID %s to clear the data \n" "$NODE_ID"
aws elasticache reboot-cache-cluster --cache-cluster-id "$CACHE_CLUSTER_ID" --cache-node-ids-to-reboot "$NODE_ID"

###############
##Wait for Node to be rebooted
###############
NODE_STATUS=$(aws elasticache describe-cache-clusters | jq -r '.CacheClusters[0].CacheClusterStatus');
while [ "$NODE_STATUS" != "available" ]; do
    NODE_STATUS=$(aws elasticache describe-cache-clusters | jq -r '.CacheClusters[0].CacheClusterStatus');
    printf "Node is currently rebooting, current status is %s \n" "$NODE_STATUS"
    sleep 5
done

printf "Node has finished rebooting, reducing node size \n"
aws elasticache modify-replication-group --apply-immediately --replication-group-id "$REPLICATION_GROUP_ID" --cache-node-type cache.t3.micro

###############
##Wait for Node to be modified
###############
NODE_STATUS=$(aws elasticache describe-cache-clusters | jq -r '.CacheClusters[0].CacheClusterStatus');
while [ "$NODE_STATUS" != "available" ]; do
    NODE_STATUS=$(aws elasticache describe-cache-clusters | jq -r '.CacheClusters[0].CacheClusterStatus');
    printf "Node is currently being modified, current status is %s \n" "$NODE_STATUS"
    sleep 5
done
printf "Node has finished modifying \n"

printf "Cluster has now been scaled down \n"

So let's take a look at what's happening in this script.

First, we look up the CacheClusterId and ReplicationGroupId of the cluster. We will be using both of these later.

Next, we take the actual backup with create-snapshot. To allow us to take multiple backups in a day, we append the current date and time to the name of the backup.

We then need to wait for the backup we've just started to finish. To do this, we fetch the status of the snapshot and poll until it becomes 'available' before continuing.
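One optional hardening step, not in the original script: bound the polling loop so a stuck snapshot doesn't leave the script waiting forever. A sketch of the same loop with a timeout (the 30-minute limit is an arbitrary choice):

# Give up if the snapshot hasn't become available after ~30 minutes (360 x 5s)
MAX_ATTEMPTS=360
ATTEMPT=0
while [ "$STATUS" != "available" ]; do
    if [ "$ATTEMPT" -ge "$MAX_ATTEMPTS" ]; then
        echo "Timed out waiting for snapshot $BACKUP_NAME" >&2
        exit 1
    fi
    STATUS=$(aws elasticache describe-snapshots --snapshot-name "$BACKUP_NAME" | jq -r '.Snapshots[0].SnapshotStatus')
    printf "Backup is still being created, current status is %s \n" "$STATUS"
    ATTEMPT=$((ATTEMPT + 1))
    sleep 5
done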

Now that the backup has finished creating, we export it to S3 with copy-snapshot. The reason we do this is so the scale-up script can download it when repopulating the data.
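One thing to be aware of: copy-snapshot only starts the export, and it assumes the bucket has already been granted the permissions ElastiCache needs to write to it. The .rdb file lands in the bucket a little later, so if you want to be certain it's there before the cluster is touched, you could poll S3 for an object matching the snapshot name (a prefix match, since the exact key ElastiCache writes may carry a suffix). A hedged sketch:

# Wait for the exported backup to appear in the bucket before continuing
until aws s3api list-objects-v2 --bucket my-s3-bucket --prefix "$BACKUP_NAME" \
        --query 'Contents[].Key' --output text | grep -q "$BACKUP_NAME"; do
    printf "Waiting for %s to appear in S3 \n" "$BACKUP_NAME"
    sleep 10
done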

Before we can scale the cluster down, we need to ensure the instance we're scaling down to can accommodate the data. As we don't care about the data once it has been backed up, we can simply reboot the node, which empties the cache and allows us to scale to the smallest instance. To do this we look up the node ID, reboot it with reboot-cache-cluster, and then poll until the cluster status returns to 'available'.
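If a full reboot feels heavy-handed, an alternative (not what the script above does) is to flush the keys directly before resizing, using the same placeholder endpoint the scale-up script talks to:

# Alternative to rebooting: empty the cache by flushing all keys
redis-cli -h my-cluster.0001.euw1.cache.amazonaws.com -p 6379 FLUSHALL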

Now that we have an empty cache cluster, we can scale it down to the desired size with modify-replication-group. As usual, we wait for this operation to finish by polling the cluster status.

 

The Scale Up Script

Now that we have a script to scale a cluster down, we need to be able to scale it back up and restore the latest copy of the data. This script needs to do the following:

  • Download the latest backup from S3
  • Modify the cache cluster from a cache.t3.micro to a cache.r4.2xlarge
  • Write the data from the backup into the newly scaled cluster
set -e
CURRENT_DIR="${0%/*}"

################
###Download Backup
################
printf "Getting latest backup\n";
REPLICATION_GROUP_ID=$(aws elasticache describe-cache-clusters | jq -r '.CacheClusters[0].ReplicationGroupId');
LATEST_BACKUP=$(aws s3api list-objects-v2 --bucket "my-s3-bucket" --query 'reverse(sort_by(Contents, &LastModified))[:1].Key' --output=text)
printf "Latest backup version is: %s \n" "$LATEST_BACKUP"

printf "Copying file %s from S3 to %s" "$LATEST_BACKUP" "$CURRENT_DIR"
aws s3 cp s3://my-s3-bucket/"$LATEST_BACKUP" "$CURRENT_DIR"

###############
##Modifying Node
###############
printf "Backup has finished downloading, increasing node size \n"
aws elasticache modify-replication-group --apply-immediately --replication-group-id "$REPLICATION_GROUP_ID" --cache-node-type cache.r4.2xlarge

NODE_STATUS=$(aws elasticache describe-cache-clusters | jq -r '.CacheClusters[0].CacheClusterStatus');
while [ "$NODE_STATUS" != "available" ]; do
    NODE_STATUS=$(aws elasticache describe-cache-clusters | jq -r '.CacheClusters[0].CacheClusterStatus');
    printf "Node is currently being modified, current status is %s \n" "$NODE_STATUS"
    sleep 5
done

###############
##Pipe backup to Redis
###############
printf "Writing data to Redis. This will take a few minutes."
rdb -c protocol "$CURRENT_DIR"/"$LATEST_BACKUP" | redis-cli -h my-cluster.0001.euw1.cache.amazonaws.com -p 6379 --pipe
printf "Finished writing data to Redis"

Let's break this script down. As before, we get the Elasticache cluster's ReplicationGroupId, but this time we also use the S3 API to find the latest available backup in the bucket. If you've just run the scale-down script, this will be the backup that's just been created.
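One caveat, based on an assumption about the bucket contents: if the bucket holds anything other than these backups, the most recently modified object might not be a backup at all. Adding a --prefix filter scoped to the names the scale-down script uses avoids that:

LATEST_BACKUP=$(aws s3api list-objects-v2 --bucket "my-s3-bucket" \
    --prefix "manual-backup-of-redis-taken-at-" \
    --query 'reverse(sort_by(Contents, &LastModified))[:1].Key' --output=text)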

We then use aws s3 cp to download that backup into the same directory the script lives in.

Before we can apply the backup, we need to increase the size of our Elasticache node. We do this with modify-replication-group, again polling until the CacheClusterStatus becomes 'available'.

Now that the Elasticache cluster is back at the desired size, we can begin inserting the data. Before doing that, let's take a look at the two tools that are helping us:

  • redis-cli - This allows us to easily interact with the Redis cluster
  • redis-rdb-tools - This tool allows us to turn our backup into Redis protocol (install sketch below)
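redis-rdb-tools is a Python package that provides the rdb command used in the script above; a hedged install and sanity-check sketch (package names as published on PyPI, and the backup filename is a placeholder):

# Install redis-rdb-tools plus the optional lz4 dependency
pip install rdbtools python-lz4

# Sanity check: dump the downloaded backup as JSON to eyeball a few keys
rdb --command json ./manual-backup-of-redis-taken-at-29-10-2021-09-00.rdb | head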

To insert data into Redis in the fastest way possible, we use the Redis protocol. More information about the protocol and mass insertion in Redis can be found at https://redis.io/topics/mass-insert. We convert the backup into protocol form with rdb and stream it into redis-cli using the --pipe option; this all happens in the final line of the script.
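Once the pipe finishes, a quick sanity check is to count the keys in the resized cluster and compare that against what you expect (same placeholder endpoint as in the script):

# Confirm the restore populated the cluster
redis-cli -h my-cluster.0001.euw1.cache.amazonaws.com -p 6379 DBSIZE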

 

Conclusion

The two scripts above allow us to easily scale down our cluster when it's not needed and restore it to the same state it was in when we left it. One of the main motivations for doing this was financial: a running cache.r4.2xlarge instance costs $0.91 per hour, which works out to $21.84 per day. Keeping it simple and assuming 30 days in a month, that gives a monthly cost of $655.20. Doing the same calculation for a cache.t3.micro gives a monthly running cost of $12.24. That's a saving of roughly 98%.