RUNBOOK - StarRocks Disaster Recovery

Goal

Step-by-step procedure to recover a StarRocks cluster from an S3 snapshot via AWS Backup.

Phase 1: Destroy the Existing StarRocks Cluster

Step 1: Destroy the current StarRocks cluster

Destroy the current StarRocks cluster to ensure a clean state before re-provisioning from the snapshot. Run the destroy from the starrocks directory:

terragrunt destroy

Warning

Destroying the StarRocks cluster will also destroy its dependent resources, including the secret-store components (kubernetes-secret-external-secrets-operator-starrocks and er-demo-eks-secret-store-starrocks). These must be re-applied before provisioning the new cluster.

Phase 2: Clean Up the S3 Bucket

Step 2: Delete all existing objects from the S3 bucket

Warning

AWS Backup restores the latest version of objects but will NOT overwrite objects with the same name already present in the destination bucket.

Run the following command:

aws s3 rm s3://<bucket-name> --recursive

Replace <bucket-name> with the actual bucket name (e.g., er-demo-s3-starrocks-data).

Note

This command does not truly empty the bucket. In a versioned S3 bucket, aws s3 rm creates delete markers on each object rather than permanently removing all object versions.
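
To see what the delete actually left behind, you can count the current objects and peek at the delete markers. This is a minimal sketch, assuming AWS CLI v2 with credentials for the bucket. The function names `count_current_objects` and `show_delete_markers` are hypothetical and not part of this runbook.

```shell
# Hypothetical helpers (bash); not part of the original runbook.

# Print the number of current (non-deleted) objects in the bucket.
# Parses the "Total Objects: N" line printed by `aws s3 ls --summarize`.
count_current_objects() {
  local bucket="$1"
  aws s3 ls "s3://${bucket}" --recursive --summarize \
    | awk '/Total Objects:/ {print $3}'
}

# Print up to N keys that now carry a delete marker (default: 5).
show_delete_markers() {
  local bucket="$1" limit="${2:-5}"
  aws s3api list-object-versions --bucket "${bucket}" \
    --max-items "${limit}" --query 'DeleteMarkers[].Key' --output text
}
```

Example: `count_current_objects er-demo-s3-starrocks-data` should print 0 after Step 2, while `show_delete_markers er-demo-s3-starrocks-data` still lists keys.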

Phase 3: Restore S3 Recovery Point

Step 3: Open the AWS Backup console

Navigate to AWS Backup > Protected resources > S3 and locate the backup vault containing the bucket backups.

Step 4: Select the recovery point

Choose the most recent recovery point taken before the disaster event, so that the restored data predates the failure.
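
The available recovery points can also be listed from the command line, which makes the timestamps easier to compare. This is a sketch only, assuming AWS CLI v2 and that the bucket is the protected resource. The function name `list_recovery_points` is hypothetical.

```shell
# Hypothetical helper (bash): list recovery points for an S3 bucket,
# oldest first, as "creation date, status, recovery point ARN".
list_recovery_points() {
  local bucket="$1"
  aws backup list-recovery-points-by-resource \
    --resource-arn "arn:aws:s3:::${bucket}" \
    --query 'sort_by(RecoveryPoints, &CreationDate)[].[CreationDate, Status, RecoveryPointArn]' \
    --output text
}
```

Example: `list_recovery_points er-demo-s3-starrocks-data`. The last line printed is the newest recovery point.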

Step 5: Initiate the restore

Click the Restore button. On the Restore S3 backup page, configure the following settings:

Restore type: select “Restore entire bucket”
Versions to restore: select “Latest version only”
Restore destination: select “Restore to source bucket” (the same bucket that was cleaned in Phase 2)

Scroll down to configure encryption and IAM role:

Restored object encryption: select “Use original encryption keys (default)”
Restore role: select “Choose an IAM role” and pick the appropriate role (e.g., er-demo-backup-vault-starrocks-role)

Click Restore backup to start the restore job.

Step 6: Wait for restore completion

Monitor the restore job in the AWS Backup console. Do NOT proceed to Phase 5 (Terraform apply) until the status shows Completed.
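
As an alternative to watching the console, the wait can be scripted. This is a minimal sketch, assuming AWS CLI v2 and that you copied the restore job ID from the console. The function name `wait_for_restore_job` and the 60-second default interval are illustrative choices.

```shell
# Hypothetical helper (bash): poll a restore job until it reaches a
# terminal state. Returns 0 only when the status is COMPLETED.
wait_for_restore_job() {
  local job_id="$1" interval="${2:-60}" status
  while true; do
    # Stop with exit code 2 if the CLI call itself fails.
    status="$(aws backup describe-restore-job \
      --restore-job-id "${job_id}" --query 'Status' --output text)" || return 2
    echo "restore job ${job_id}: ${status}"
    case "${status}" in
      COMPLETED) return 0 ;;
      ABORTED|FAILED) return 1 ;;
    esac
    sleep "${interval}"
  done
}
```

Example: `wait_for_restore_job <restore-job-id> && echo "safe to continue with Phase 5"`.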

Tip

You can proceed with Phase 4 (Terragrunt dependencies) in parallel while the S3 restore is running. This reduces the total recovery time.

Phase 4: Apply Terragrunt Dependencies

Note

This phase can run in parallel with Phase 3 (S3 restore). Start it as soon as Phase 1 completes.

Step 7: Apply the required Terragrunt dependencies

The StarRocks module has the following dependencies defined in its terragrunt.hcl:

dependencies {
  paths = [
    "../../s3/er-demo-s3-starrocks-data",
    "../../secrets-manager/er-demo-starrocks-credentials",
    "../starrocks-operator",
    "../secret-store/kubernetes-secret-external-secrets-operator-starrocks",
    "../secret-store/er-demo-eks-secret-store-starrocks"
  ]
}

Before provisioning the new StarRocks cluster, all dependencies must be in place. The S3 bucket and Secrets Manager resources are typically unaffected by the cluster destroy, but the following two secret-store modules are destroyed together with the StarRocks cluster and must be re-applied:

secret-store/kubernetes-secret-external-secrets-operator-starrocks
secret-store/er-demo-eks-secret-store-starrocks

First create the starrocks namespace:

kubectl create namespace starrocks

From the starrocks directory, navigate to each dependency directory and apply:

cd ../secret-store/kubernetes-secret-external-secrets-operator-starrocks
terragrunt apply
 
cd ../er-demo-eks-secret-store-starrocks
terragrunt apply

Verify that all other dependencies (S3, Secrets Manager, starrocks-operator) are still present. Re-apply them if needed.
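
The presence checks can be sketched as a small script. It assumes kubectl and AWS CLI v2 are configured for the target cluster and account. The function names are hypothetical, and the secret ID and the operator CRD name `starrocksclusters.starrocks.com` are assumptions to confirm in your environment.

```shell
# Hypothetical helpers (bash); resource names passed in are assumptions.

# Run a command quietly and report whether the named dependency exists.
check() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "OK      ${name}"
  else
    echo "MISSING ${name}"
    return 1
  fi
}

# Usage: verify_dependencies <bucket-name> <secret-id>
verify_dependencies() {
  local bucket="$1" secret_id="$2" failed=0
  check "namespace starrocks" kubectl get namespace starrocks || failed=1
  check "S3 bucket ${bucket}" aws s3api head-bucket --bucket "${bucket}" || failed=1
  check "secret ${secret_id}" \
    aws secretsmanager describe-secret --secret-id "${secret_id}" || failed=1
  # CRD name assumed from the StarRocks operator's starrocks.com API group.
  check "StarRocks operator CRD" \
    kubectl get crd starrocksclusters.starrocks.com || failed=1
  return "${failed}"
}
```

Example: `verify_dependencies er-demo-s3-starrocks-data <secret-id>`. Any line starting with MISSING points at a module to re-apply.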

Phase 5: Update Terraform Configuration and Apply

Step 8: Identify the cluster snapshot path

Browse the restored S3 bucket to find the latest automated cluster snapshot. Navigate in the S3 console to the snapshot/<id>/meta/image/ path and copy the S3 URI of the snapshot folder.

The full path follows this format:

s3://<bucket>/snapshot/<id>/meta/image/automated_cluster_snapshot_<timestamp>
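
If several snapshots are present, the newest one can be picked from a listing instead of by eye. This sketch assumes that a larger `<timestamp>` suffix means a newer snapshot. The function name `latest_snapshot_name` is hypothetical.

```shell
# Hypothetical helper (bash): read a listing on stdin (for example the
# output of `aws s3 ls`) and print the snapshot name with the highest
# numeric timestamp suffix.
latest_snapshot_name() {
  grep -Eo 'automated_cluster_snapshot_[0-9]+' \
    | sort -t_ -k4,4n \
    | tail -n 1
}
```

Example: `aws s3 ls "s3://<bucket>/snapshot/<id>/meta/image/" | latest_snapshot_name`.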

Step 9: Update the Terraform variables for disaster recovery

In the StarRocks Terragrunt module, set or update the following variables:

Variable	Value
`enable_disaster_recovery`	`true`
`disaster_recovery_generation`	`"1"` (no need to increment — cluster is recreated from scratch)
`cluster_snapshot_path`	Full S3 path to the snapshot identified in Step 8
`storage_volume_name`	Name of the storage volume (e.g., `s3_snapshot`)
`storage_volume_location`	S3 path to the data directory (e.g., `s3://er-demo-s3-starrocks-data/snapshot`)

Example configuration:

enable_disaster_recovery     = true
disaster_recovery_generation = "1"
cluster_snapshot_path        = "s3://er-demo-s3-starrocks-data/snapshot/b6f29097-518b-41b3-b156-054d849e497f/meta/image/automated_cluster_snapshot_1770808327856"
storage_volume_name          = "s3_snapshot"
storage_volume_location      = "s3://er-demo-s3-starrocks-data/snapshot"
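
A quick format check on `cluster_snapshot_path` can catch a truncated path or a stray trailing slash before the apply. This is a sketch that only tests the value against the format shown in Step 8. It does not confirm that the snapshot exists. The function name `is_snapshot_path` is hypothetical.

```shell
# Hypothetical helper (bash): succeed only if the argument matches
# s3://<bucket>/snapshot/<id>/meta/image/automated_cluster_snapshot_<timestamp>
is_snapshot_path() {
  printf '%s\n' "$1" | grep -Eq \
    '^s3://[^/]+/snapshot/[^/]+/meta/image/automated_cluster_snapshot_[0-9]+$'
}
```

Example: `is_snapshot_path "<cluster_snapshot_path value>" || echo "unexpected snapshot path format"`.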

Step 10: Ensure S3 restore is complete

Before applying, verify that the AWS Backup restore job from Phase 3 has completed successfully. Check status in the AWS Backup console under Jobs > Restore jobs.

Step 11: Run Terragrunt import and then apply

The starrocks namespace was created manually in Step 7, so import it into the Terraform state before applying:

cd <starrocks-module-directory>
terragrunt import kubernetes_namespace_v1.starrocks starrocks
terragrunt apply

The new cluster will boot from the specified snapshot and restore all metadata: catalogs, databases, tables, users, privileges, loading tasks, etc.
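
If Step 11 has to be repeated, the import fails once the namespace is already in the Terraform state. This sketch skips the import in that case. It assumes the resource address is exactly `kubernetes_namespace_v1.starrocks` as used above, and the function name `import_namespace_if_needed` is hypothetical.

```shell
# Hypothetical helper (bash): import the starrocks namespace only when it
# is not yet tracked in the Terraform state. Run from the module directory.
import_namespace_if_needed() {
  local addr="kubernetes_namespace_v1.starrocks"
  if terragrunt state list 2>/dev/null | grep -Fxq "${addr}"; then
    echo "already in state: ${addr}"
  else
    terragrunt import "${addr}" starrocks
  fi
}
```

Example: `import_namespace_if_needed && terragrunt apply`.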

Step 12: Verify the cluster

Once provisioning completes, connect to the new StarRocks cluster and run the following checks:

-- Check databases exist
SHOW DATABASES;
 
-- Check tables in key databases
USE <database_name>;
SHOW TABLES;
 
-- Verify row counts on critical tables
SELECT COUNT(*) FROM <critical_table>;
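
The row-count check can be looped over several tables with the MySQL client. This sketch assumes the FE is reachable on the default query port 9030 and that credentials come from the client configuration (for example `~/.my.cnf`). The function name `count_rows` and the root user are illustrative.

```shell
# Hypothetical helper (bash): print "<table> <row count>" for each table.
# Usage: count_rows <fe-host> <db.table> [<db.table> ...]
count_rows() {
  local host="$1" table count
  shift
  for table in "$@"; do
    # -N drops the column header, -B prints plain tab-separated output.
    count="$(mysql -h "${host}" -P 9030 -u root -N -B \
      -e "SELECT COUNT(*) FROM ${table};")" || return 1
    printf '%s %s\n' "${table}" "${count}"
  done
}
```

Example: `count_rows <fe-host> <database_name>.<critical_table>`. Compare the output with the counts recorded before the disaster, if available.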

Phase 6: Update DNS records
