When to Use This Runbook

Scope: any Terragrunt unit in this repo whose state is stored in S3 (bucket versioning enabled) with a DynamoDB lock table.

Symptom this covers: tofu plan suddenly wants to create resources that already exist in the provider; the state file “shrank” or is empty, while the real objects (Zitadel org, AWS resources, etc.) are untouched.

Typical root cause: resources were removed from state by an accidental state rm, a botched state mv, or a partial/failed write — not a real destroy. A tell-tale sign is that outputs survive in the state while resources are gone (state rm does not clear outputs; destroy would).

Blast radius: none for live infrastructure. This procedure only touches the state file. We never call the provider to mutate objects.

Golden rule

Before any manual state operation, snapshot the current state:

terragrunt state pull > /tmp/state-backup-$(date +%Y%m%dT%H%M%S).json

Prerequisites

  • Valid AWS credentials for the account holding the state bucket — verify with aws sts get-caller-identity.
  • Read/write access to the state S3 bucket and the DynamoDB lock table.
  • terragrunt, tofu, aws, python3 on PATH.
  • Run everything from the affected unit’s directory.

Auth boundary

State subcommands (list / pull / push) need only AWS creds (S3 + DynamoDB). They do not need provider auth (e.g. the Zitadel access token). Only plan (Step 6) talks to the provider. An ExpiredToken / STS 403 in the output means AWS creds expired — refresh them and retry.

# Set these once; the rest of the runbook reuses them.
export UNIT_DIR="aws/111122223333/eu-central-1/zitadel/example-tenant-dev-zitadel-backoffice"
cd "$(git rev-parse --show-toplevel)/$UNIT_DIR"

Step 1 — Diagnose the State

terragrunt state list   # how many resources are tracked
 
terragrunt state pull | python3 -c 'import sys,json; d=json.load(sys.stdin); \
print("serial:", d["serial"]); print("lineage:", d["lineage"]); \
print("resources:", len(d["resources"])); print("outputs:", list(d.get("outputs",{}).keys()))'

Interpret the result:

| What you see | Diagnosis |
| --- | --- |
| state list empty, exit 0, but outputs still present | Resources stripped from state (state rm / failed write). Provider objects are likely alive. → continue this runbook. |
| plan shows N to add for objects that exist | Same situation, confirmed. |
| State unreadable / corrupt JSON / state list errors | File is damaged. → same version rollback (Step 3 onward). |
| outputs also empty and no objects in the provider | This may have been a real destroy. STOP: confirm intent before restoring; rollback may be wrong. |

Tip

Note the current serial and lineage here. serial is needed in Step 5; lineage must match the rollback candidate in Step 4.
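
The state-only rows of the table above can be approximated in code. This is a minimal sketch, not part of the repo's tooling: the function name diagnose_state and its return labels are hypothetical, and it looks only at the pulled state JSON, so it cannot confirm what the plan shows or what exists in the provider.

```python
import json


def diagnose_state(state: dict) -> str:
    """Classify a pulled state document by its resources and outputs."""
    resources = state.get("resources", [])
    outputs = state.get("outputs", {})
    if resources:
        return "resources-present"
    if outputs:
        # Outputs survive a `state rm`; a real destroy would clear them.
        return "resources-stripped"
    # Nothing left at all: could be a real destroy, so stop and confirm intent.
    return "possible-destroy"


# Illustrative input only; in practice feed it `terragrunt state pull` output.
example = json.loads('{"serial": 13, "lineage": "abc", "resources": [], "outputs": {"org_id": {"value": "1"}}}')
print(diagnose_state(example))
```

A result of "resources-stripped" corresponds to the first table row; "possible-destroy" corresponds to the last row and still needs a human check against the provider.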


Step 2 — Locate the S3 Backend

The backend key is computed in root.hcl, so the most reliable source is the generated backend file in the Terragrunt cache:

find .terragrunt-cache -name "tg-backend.tf" -exec cat {} \;

Example values for this repo:

export BUCKET="terragrunt-example-dev-account-state"
export KEY="$UNIT_DIR/terraform.tfstate"
export REGION="eu-west-1"
export LOCK_TABLE="terraform-lock"
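
If you prefer not to copy the values by hand, the backend settings can be pulled out of the generated file with a small parser. This is a sketch under assumptions: the helper name parse_s3_backend is hypothetical, the sample text is illustrative rather than the real generated file, and the regex only picks up simple string assignments of the form name = "value" (it is not an HCL parser).

```python
import re


def parse_s3_backend(tf_text: str) -> dict:
    """Collect simple `name = "value"` string settings from a backend block."""
    pattern = re.compile(r'^[ \t]*(\w+)[ \t]*=[ \t]*"([^"]*)"[ \t]*$', re.MULTILINE)
    return dict(pattern.findall(tf_text))


# Illustrative sample; the real file comes from .terragrunt-cache.
sample = '''
terraform {
  backend "s3" {
    bucket         = "terragrunt-example-dev-account-state"
    dynamodb_table = "terraform-lock"
    encrypt        = true
    key            = "path/to/unit/terraform.tfstate"
    region         = "eu-west-1"
  }
}
'''

settings = parse_s3_backend(sample)
for name in ("bucket", "key", "region", "dynamodb_table"):
    print(name, "=", settings.get(name))
```

Non-string settings such as encrypt = true are skipped on purpose; only the four values exported above are needed for the rest of the runbook.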

Step 3 — Check Versioning and Find a Healthy Version

aws s3api get-bucket-versioning --bucket "$BUCKET" --region "$REGION"   # expect: Status: Enabled
 
aws s3api list-object-versions --bucket "$BUCKET" --prefix "$KEY" --region "$REGION" \
  --query 'reverse(sort_by(Versions,&LastModified))[].{LastModified:LastModified,Size:Size,VersionId:VersionId,IsLatest:IsLatest}' \
  --output table

No versioning, no rollback

If Status is not Enabled, there are no prior versions to restore. Skip to Appendix A — When Rollback Is Not Possible (import).

Choosing the candidate: pick the newest version written just before the breakage. The fastest signal is Size: a healthy state is noticeably larger than the broken one (in the reference incident: 14096 B healthy vs 3388 B broken).

export PREV_VID="<VersionId of the healthy version>"
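
The size heuristic can be applied to the JSON form of the listing (list-object-versions with --output json). This is a sketch only: the function name pick_candidate and the min_ratio threshold of 2.0 are assumptions, the version IDs and timestamps below are made up, and the code assumes LastModified strings share one format so they sort correctly as text. Treat the result as a suggestion and still validate it in Step 4.

```python
from typing import Optional


def pick_candidate(versions: list, min_ratio: float = 2.0) -> Optional[dict]:
    """Return the newest non-latest version clearly larger than the latest one."""
    ordered = sorted(versions, key=lambda v: v["LastModified"], reverse=True)
    latest = next((v for v in ordered if v.get("IsLatest")), None)
    if latest is None:
        return None
    for version in ordered:
        if version is latest:
            continue
        # "Noticeably larger" is modelled as a size ratio; the threshold is a guess.
        if version["Size"] >= latest["Size"] * min_ratio:
            return version
    return None


# Sizes from the reference incident; IDs and dates are placeholders.
listing = [
    {"VersionId": "v-broken", "LastModified": "2024-01-02T00:00:00+00:00", "Size": 3388, "IsLatest": True},
    {"VersionId": "v-healthy", "LastModified": "2024-01-01T00:00:00+00:00", "Size": 14096, "IsLatest": False},
]
print(pick_candidate(listing)["VersionId"])
```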

Step 4 — Download and Validate the Candidate

aws s3api get-object --bucket "$BUCKET" --key "$KEY" --region "$REGION" \
  --version-id "$PREV_VID" /tmp/state_prev.json >/dev/null && echo "downloaded"
 
python3 - <<'PY'
import json
prev = json.load(open("/tmp/state_prev.json"))
print("PREV serial   :", prev["serial"])
print("PREV lineage  :", prev["lineage"])
print("PREV resources:", len(prev["resources"]))
for r in prev["resources"]:
    print("  ", f'{r["type"]}.{r["name"]}', f'x{len(r["instances"])}')
PY

Pre-restore checklist — all must hold:

  • Candidate lineage equals the current state’s lineage (from Step 1). Otherwise OpenTofu rejects the push — these are different state ancestries.
  • Resource/instance count matches what you expect (equals the N to add from the plan).
  • (Optional) Key resource IDs match the stale outputs.
  • For providers with write-only secrets (e.g. Zitadel machine_key, personal_access_token): those secrets are present in the candidate. Check with:
python3 -c 'import json; d=json.load(open("/tmp/state_prev.json")); \
r={(x["type"],x["name"]):x for x in d["resources"]}; \
mk=r[("zitadel_machine_key","this")]["instances"][0]["attributes"]; \
print("machine_key key_details present:", bool(mk.get("key_details")))'

Why this matters for secrets

Providers return write-only secrets (private keys, tokens) only at creation time. A version rollback brings them back intact; an import cannot. This is the single biggest reason to prefer rollback over import.
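
The checklist can also be run as one function so nothing is skipped under pressure. This is a hedged sketch: check_candidate and its parameters are hypothetical names, and the attribute pairs passed in required_attrs (such as key_details on zitadel_machine_key) are whatever you decide must be present for your providers.

```python
def check_candidate(current: dict, candidate: dict, expected_instances: int,
                    required_attrs=()) -> list:
    """Return a list of checklist failures; an empty list means all checks held."""
    problems = []
    if candidate.get("lineage") != current.get("lineage"):
        problems.append("lineage differs from the current state")
    resources = candidate.get("resources", [])
    found = sum(len(r.get("instances", [])) for r in resources)
    if found != expected_instances:
        problems.append(f"expected {expected_instances} instances, found {found}")
    # Each pair is (resource type, attribute that must be non-empty).
    for rtype, attr in required_attrs:
        for r in resources:
            if r.get("type") != rtype:
                continue
            for inst in r.get("instances", []):
                if not inst.get("attributes", {}).get(attr):
                    problems.append(f"{rtype}.{r.get('name')} is missing {attr}")
    return problems
```

A typical call would pass the Step 1 state as current, the downloaded /tmp/state_prev.json as candidate, the N from the plan as expected_instances, and [("zitadel_machine_key", "key_details")] as required_attrs. Note that a required type with no resources at all is caught only indirectly, through the instance count.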


Step 5 — Restore the State

Use state push with a serial bumped above the remote value:

# Read the live remote serial and write a candidate with serial = remote + 1
REMOTE_SERIAL=$(terragrunt state pull | python3 -c 'import sys,json; print(json.load(sys.stdin)["serial"])')
python3 -c "import json; d=json.load(open('/tmp/state_prev.json')); \
d['serial']=$REMOTE_SERIAL+1; json.dump(d, open('/tmp/state_restore.json','w')); \
print('remote serial:', $REMOTE_SERIAL, '-> restore serial:', d['serial'])"
 
terragrunt state push /tmp/state_restore.json

Why this exact approach:

  • state push, not aws s3 cp: push atomically takes the lock, writes to S3, and updates the MD5 digest in DynamoDB. A raw cp leaves a stale digest, so the next tofu run complains the S3 content does not match what it expected.
  • serial > remote: push refuses to write a state whose serial is ≤ the remote one (a guard against clobbering a newer state). Bumping above it succeeds without -force.
  • If you must keep the old serial, use terragrunt state push -force /tmp/state_restore.json deliberately.
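
The serial bump can be combined with the lineage check so a mismatched candidate never reaches state push. This is a minimal sketch with a hypothetical function name, build_restore_state; it works on already-loaded JSON documents and leaves writing the file and running the push to you.

```python
import copy


def build_restore_state(candidate: dict, remote: dict) -> dict:
    """Return a copy of the candidate with serial set to remote serial + 1."""
    if candidate["lineage"] != remote["lineage"]:
        raise ValueError("lineage mismatch: candidate belongs to a different state ancestry")
    restored = copy.deepcopy(candidate)  # leave the downloaded candidate untouched
    restored["serial"] = remote["serial"] + 1
    return restored


# Serials from the reference incident; the lineage value is a placeholder.
healthy = {"serial": 12, "lineage": "same-lineage", "resources": []}
broken = {"serial": 13, "lineage": "same-lineage", "resources": []}
print(build_restore_state(healthy, broken)["serial"])  # prints 14
```

Serialise the result with json.dump to /tmp/state_restore.json and push it as shown above.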

Step 6 — Verify

terragrunt state list   # resources are back
terragrunt plan         # EXPECT: "No changes. Your infrastructure matches the configuration."

plan performs a refresh — it actually reaches the provider for every id. No changes proves both that the objects exist and that the restored state matches them. Done.

Warning

If plan shows replace/recreate for resources with write-only secrets, the state is fine but the live object may have changed — investigate per-resource. Do not run a blind apply.
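
For a scripted verification, the plan can be run with -detailed-exitcode, where exit code 0 means no changes, 1 means an error, and 2 means changes are present. The wrapper below is a sketch: the names plan_verdict and run_plan are hypothetical, and it assumes terragrunt passes the flag and the exit code through unchanged for a single unit.

```python
import subprocess

# Meaning of the plan exit codes when -detailed-exitcode is set.
VERDICTS = {0: "no-changes", 1: "error", 2: "changes-present"}


def plan_verdict(exit_code: int) -> str:
    """Translate a plan exit code into a short verdict."""
    return VERDICTS.get(exit_code, "unknown")


def run_plan(cwd: str = ".") -> str:
    """Run the plan in `cwd`; requires terragrunt on PATH and valid credentials."""
    result = subprocess.run(["terragrunt", "plan", "-detailed-exitcode"], cwd=cwd)
    return plan_verdict(result.returncode)
```

Only "no-changes" counts as a successful restore; "changes-present" still needs the per-resource review described in the warning above.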


Step 7 — Roll Back the Rollback

Nothing is lost — the “broken” version is still in versioning:

aws s3api list-object-versions --bucket "$BUCKET" --prefix "$KEY" --region "$REGION" \
  --query 'reverse(sort_by(Versions,&LastModified))[:5].{LastModified:LastModified,Size:Size,VersionId:VersionId}' \
  --output table
# Download the version you want, bump its serial, and `terragrunt state push` again.

Troubleshooting

| Error | Cause / fix |
| --- | --- |
| Error acquiring the state lock (DynamoDB) | Stale lock from an aborted run. Inspect: aws dynamodb get-item --table-name "$LOCK_TABLE" --region "$REGION" --key '{"LockID":{"S":"'"$BUCKET/$KEY"'"}}'. If it is yours: terragrunt force-unlock <LOCK_ID>. Never break someone else’s active lock. |
| Failed to persist state: lineage mismatch | Candidate is from a different ancestry. Make sure you pulled a version of the same object/key. |
| serial … is greater than or equal on push | Bump the file’s serial above remote (Step 5) or use -force. |
| ExpiredToken / STS 403 | AWS creds expired; refresh and retry. |
| S3 versioning Suspended / disabled | No prior versions → Appendix A — When Rollback Is Not Possible (import). |
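
To decide whether a lock is yours, it helps to decode the get-item response from the first row. The sketch below assumes the lock record carries an Info attribute holding a JSON string with fields such as ID, Operation and Who; that layout is an assumption about the S3 backend's lock record, so check it against the actual item before relying on it. The function name parse_lock_item and the sample values are hypothetical.

```python
import json


def parse_lock_item(response: dict) -> dict:
    """Decode lock metadata from an `aws dynamodb get-item` JSON response."""
    item = response.get("Item")
    if not item or "Info" not in item:
        return {}  # no lock held, or a record without lock metadata
    return json.loads(item["Info"]["S"])


# Made-up response for illustration.
sample_response = {
    "Item": {
        "LockID": {"S": "example-bucket/path/terraform.tfstate"},
        "Info": {"S": json.dumps({"ID": "1234-abcd", "Operation": "OperationTypePlan", "Who": "alice@laptop"})},
    }
}
info = parse_lock_item(sample_response)
print(info.get("ID"), info.get("Who"))
```

The ID field is the value to pass as <LOCK_ID>, and Who indicates who took the lock.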

Prevention

  • Keep S3 versioning enabled (already on) and add a lifecycle rule on noncurrent versions that retains them for at least 90 days, preferably 180.
  • Consider MFA Delete or a bucket policy denying DeleteObjectVersion on state prefixes.
  • Always snapshot before risky ops: terragrunt state pull > /tmp/state-backup-...json.
  • Treat state rm, state mv, import, destroy with care — always with an explicit address list, never a blind wildcard.
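  • After an incident, find the source via CloudTrail so it does not recur. lookup-events searches management events only; S3 object-level writes (PutObject, DeleteObject) are data events and appear only in the logs of a trail that records S3 data events for the state bucket, so check those logs too if such a trail exists: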
aws cloudtrail lookup-events --region "$REGION" \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue="$KEY" \
  --max-results 10
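
The noncurrent-version retention rule from the first bullet can be generated as a lifecycle configuration document. This is a sketch under assumptions: the helper name noncurrent_retention_rule and the rule ID are hypothetical, and 180 days is one point in the range suggested above.

```python
import json


def noncurrent_retention_rule(days: int, prefix: str = "") -> dict:
    """Build a lifecycle configuration that expires noncurrent versions after `days`."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": f"expire-noncurrent-after-{days}d",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix applies to the whole bucket
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
            }
        ]
    }


print(json.dumps(noncurrent_retention_rule(180), indent=2))
```

The printed JSON can be saved to a file and applied with aws s3api put-bucket-lifecycle-configuration (--lifecycle-configuration file://...). That call replaces the bucket's existing lifecycle configuration, so merge any existing rules into the document first.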

Appendix A — When Rollback Is Not Possible (import)

If versioning is off or there is no healthy version, fall back to per-resource tofu import.

Write-only secrets are NOT restored by import

For Zitadel and any provider with write-only attributes, import does not restore machine_key.key_details or personal_access_token tokens — the provider only ever returns them at creation. After import those fields are empty and any consumer of the secret (External Secrets / K8s) breaks. In that case the secrets must be re-issued and consumers updated. This is exactly why version rollback is preferred whenever it is available.

# Example zitadel-provider import-ID formats — confirm against your provider version's docs:
terragrunt import 'zitadel_org.this' '<org_id>'
terragrunt import 'zitadel_project.this' '<org_id>:<project_id>'
terragrunt import 'zitadel_machine_user.this["example_tenant"]' '<org_id>:<user_id>'
# ...repeat for every address shown by `tofu plan`
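
With many addresses, generating the import commands from a mapping avoids quoting mistakes, particularly for addresses that contain brackets and double quotes. This is a sketch: import_commands is a hypothetical helper, the IDs are placeholders, and the ID formats still have to be confirmed against your provider version as noted above.

```python
import shlex


def import_commands(id_by_address: dict) -> list:
    """Render one shell-quoted `terragrunt import` command per address."""
    return [
        f"terragrunt import {shlex.quote(address)} {shlex.quote(resource_id)}"
        for address, resource_id in id_by_address.items()
    ]


# Placeholder IDs for illustration.
commands = import_commands({
    "zitadel_org.this": "111",
    'zitadel_machine_user.this["example_tenant"]': "111:222",
})
for command in commands:
    print(command)
```

Review the printed commands before running them; the helper only formats text and never calls terragrunt itself.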

Incident Reference

First real use of this runbook, for context:

  • Unit: example-tenant-dev-zitadel-backoffice (Zitadel org/project/users).
  • Symptom: plan wanted to create all 13 instances; state list empty (exit 0) but outputs survived → accidental state rm, not a destroy.
  • Backend: bucket terragrunt-example-dev-account-state, region eu-west-1, versioning Enabled.
  • Versions: broken latest serial 13, 3388 B; healthy prior serial 12, 14096 B, same lineage — restored with serial 14.
  • Outcome: after state push, plan reported No changes. Secrets (machine key + 2 PATs) preserved; no import needed.