This runbook applies to Terraform / OpenTofu state stored on AWS: an S3 bucket (versioning enabled) with a DynamoDB lock table. The commands are AWS-specific; the rollback concept itself generalizes to other backends.
When to Use This Runbook
Scope: any Terragrunt unit in this repo backed by s3 (bucket versioning enabled) with a DynamoDB lock table.
Symptom this covers: tofu plan suddenly wants to create resources that already exist in the provider; the state file “shrank” or is empty, while the real objects (Zitadel org, AWS resources, etc.) are untouched.
Typical root cause: resources were removed from state by an accidental state rm, a botched state mv, or a partial/failed write — not a real destroy. A tell-tale sign is that outputs survive in the state while resources are gone (state rm does not clear outputs; destroy would).
Blast radius: none for live infrastructure. This procedure only touches the state file. We never call the provider to mutate objects.
Golden rule
Before any manual state operation, snapshot the current state:
terragrunt state pull > /tmp/state-backup-$(date +%Y%m%dT%H%M%S).json
Prerequisites
Valid AWS credentials for the account holding the state bucket — verify with aws sts get-caller-identity.
Read/write access to the state S3 bucket and the DynamoDB lock table.
terragrunt, tofu, aws, python3 on PATH.
Run everything from the affected unit’s directory.
Auth boundary
State subcommands (list / pull / push) need only AWS creds (S3 + DynamoDB). They do not need provider auth (e.g. the Zitadel access token). Only plan (Step 6) talks to the provider.
An ExpiredToken / STS 403 in the output means AWS creds expired — refresh them and retry.
# Set these once; the rest of the runbook reuses them.
export UNIT_DIR="aws/111122223333/eu-central-1/zitadel/example-tenant-dev-zitadel-backoffice"
cd "$(git rev-parse --show-toplevel)/$UNIT_DIR"
Step 1 — Diagnose the State
terragrunt state list # how many resources are tracked
terragrunt state pull | python3 -c 'import sys,json; d=json.load(sys.stdin); \
print("serial:", d["serial"]); print("lineage:", d["lineage"]); \
print("resources:", len(d["resources"])); print("outputs:", list(d.get("outputs",{}).keys()))'
Interpret the result:
| What you see | Diagnosis |
| --- | --- |
| state list empty, exit 0, but outputs still present | Resources stripped from state (state rm / failed write). Provider objects are likely alive. → continue this runbook. |
| plan shows N to add for objects that exist | Same situation, confirmed. |
| State unreadable / corrupt JSON / state list errors | File is damaged. → same version rollback (Step 3 onward). |
| outputs also empty and no objects in the provider | This may have been a real destroy. STOP: confirm intent before restoring; rollback may be wrong. |
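The table can be sketched as a small decision function. This is an illustrative sketch, not part of the runbook tooling: the function name, the verdict strings, and the `provider_objects_exist` flag (which you would have to determine yourself, for example in the provider console) are assumptions.

```python
import json


def diagnose_state(raw: str, provider_objects_exist: bool = True) -> str:
    """Map the text from `terragrunt state pull` to a diagnosis from the table."""
    try:
        state = json.loads(raw)
    except json.JSONDecodeError:
        return "corrupt"  # unreadable state: go to version rollback
    if not isinstance(state, dict):
        return "corrupt"

    resources = state.get("resources", [])
    outputs = state.get("outputs", {})

    if resources:
        return "healthy-looking"  # resources are tracked; this runbook may not apply
    if not outputs and not provider_objects_exist:
        return "possible-destroy"  # STOP and confirm intent before restoring
    return "stripped"  # resources gone but objects alive: continue the runbook
```

The "plan shows N to add" row is not modelled here, because it needs provider access rather than the state file alone.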
Tip
Note the current serial and lineage here. serial is needed in Step 5; lineage must match the rollback candidate in Step 4.
Step 2 — Locate the S3 Backend
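The backend key is computed in root.hcl, so the most reliable source is the generated backend file in the Terragrunt cache. Later steps expect the bucket, key, region, and lock table from that file in $BUCKET, $KEY, $REGION, and $LOCK_TABLE.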
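A minimal sketch of reading those settings out of the generated file, assuming it is a plain `backend "s3"` block with quoted string values. The file name and location under `.terragrunt-cache` depend on your `root.hcl`, so the function takes the file text rather than a path; `parse_backend` is a hypothetical helper name.

```python
import re


def parse_backend(hcl_text: str) -> dict:
    """Extract the quoted S3 backend settings this runbook needs."""
    wanted = {"bucket", "key", "region", "dynamodb_table"}
    found = {}
    # Matches lines of the form: name = "value"
    pattern = r'^\s*(\w+)\s*=\s*"([^"]*)"'
    for name, value in re.findall(pattern, hcl_text, flags=re.MULTILINE):
        if name in wanted:
            found[name] = value
    return found
```

The four values map to `BUCKET`, `KEY`, `REGION`, and `LOCK_TABLE` respectively. Settings that are not quoted strings (such as `encrypt = true`) are ignored by design.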
Choosing the candidate: the newest version just before the breakage. The fastest signal is Size — a healthy state is noticeably larger than the broken one (in the reference incident: 14096 B healthy vs 3388 B broken).
export PREV_VID="<VersionId of the healthy version>"
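The size heuristic can be sketched as follows. This assumes `versions` is the `Versions` array from `aws s3api list-object-versions --output json`, with `LastModified` as ISO 8601 strings in a single format (so string order equals time order). The function name and the `min_healthy_size` threshold are illustrative; pick the threshold from what you know about the unit.

```python
from __future__ import annotations


def pick_candidate(versions: list[dict], min_healthy_size: int) -> dict | None:
    """Return the newest non-current version of at least min_healthy_size bytes."""
    ordered = sorted(versions, key=lambda v: v["LastModified"], reverse=True)
    for version in ordered:
        if version.get("IsLatest"):
            continue  # the current (broken) version is never a candidate
        if version["Size"] >= min_healthy_size:
            return version
    return None  # nothing looks healthy: consider Appendix A
```

Size is only a first filter. The checks that follow still decide whether the candidate is safe to push.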
Candidate lineage equals the current state’s lineage (from Step 1). Otherwise OpenTofu rejects the push, because the two states have different ancestries.
Resource/instance count matches what you expect (equals the N to add from the plan).
(Optional) Key resource IDs match the stale outputs.
For providers with write-only secrets (e.g. Zitadel machine_key, personal_access_token), those secrets are present in the candidate:
python3 -c 'import json; d=json.load(open("/tmp/state_prev.json")); \
r={(x["type"],x["name"]):x for x in d["resources"]}; \
mk=r[("zitadel_machine_key","this")]["instances"][0]["attributes"]; \
print("machine_key key_details present:", bool(mk.get("key_details")))'
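The lineage and count checks above can be combined into one pass. This is a sketch under two assumptions: both states are already loaded as dicts (for example from `/tmp/state_prev.json` and a fresh `state pull`), and only resources with `mode` equal to `managed` count toward the plan's "N to add", since data sources are not created. Function names are hypothetical.

```python
def count_managed_instances(state: dict) -> int:
    """Count instances of managed resources, skipping data sources."""
    return sum(
        len(resource.get("instances", []))
        for resource in state.get("resources", [])
        if resource.get("mode", "managed") == "managed"
    )


def check_candidate(current: dict, candidate: dict, expected_instances: int) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if candidate.get("lineage") != current.get("lineage"):
        problems.append("lineage differs from the current state")
    found = count_managed_instances(candidate)
    if found != expected_instances:
        problems.append(f"expected {expected_instances} instances, found {found}")
    return problems
```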
Why this matters for secrets
Providers return write-only secrets (private keys, tokens) only at creation time. A version rollback brings them back intact; an import cannot. This is the single biggest reason to prefer rollback over import.
Step 5 — Restore the State
Use state push with a serial bumped above the remote value:
# Read the live remote serial and write a candidate with serial = remote + 1
REMOTE_SERIAL=$(terragrunt state pull | python3 -c 'import sys,json; print(json.load(sys.stdin)["serial"])')
python3 -c "import json; d=json.load(open('/tmp/state_prev.json')); \
d['serial']=$REMOTE_SERIAL+1; json.dump(d, open('/tmp/state_restore.json','w')); \
print('remote serial:', $REMOTE_SERIAL, '-> restore serial:', d['serial'])"
terragrunt state push /tmp/state_restore.json
Why this exact approach:
state push, not aws s3 cp — push atomically takes the lock, writes to S3, and updates the MD5 digest in DynamoDB. A raw cp leaves a stale digest, so the next tofu run complains the S3 content does not match what it expected.
serial > remote — push refuses to write a state whose serial is ≤ the remote one (a guard against clobbering a newer state). Bumping above it succeeds without -force.
If you must keep the old serial, use terragrunt state push -force /tmp/state_restore.json deliberately.
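To see why a raw copy leaves a stale digest, the comparison can be sketched as below. It assumes the digest is the hex-encoded MD5 of the exact state object bytes, kept in the lock table as a `Digest` attribute on an item whose LockID is the state path with a `-md5` suffix. Confirm both details against your backend version before relying on them; `digest_matches` is a hypothetical helper.

```python
import hashlib


def digest_matches(state_bytes: bytes, stored_digest_hex: str) -> bool:
    """Compare the MD5 of the S3 object bytes with the stored digest."""
    actual = hashlib.md5(state_bytes).hexdigest()
    return actual == stored_digest_hex.strip().lower()
```

After `aws s3 cp`, the object bytes change but the stored digest does not, so this comparison fails until something rewrites the digest. `state push` does both writes together.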
Step 6 — Verify
terragrunt state list # resources are back
terragrunt plan # EXPECT: "No changes. Your infrastructure matches the configuration."
plan performs a refresh — it actually reaches the provider for every id. No changes proves both that the objects exist and that the restored state matches them. Done.
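For scripted verification, the plan result can be read from the exit code instead of the text. The sketch below relies on the `-detailed-exitcode` flag of `plan` (0 = no changes, 1 = error, 2 = changes present) and assumes Terragrunt passes that exit code through for a single unit. The verdict strings and function names are illustrative.

```python
import subprocess

# Exit codes documented for `plan -detailed-exitcode`.
PLAN_VERDICTS = {0: "no-changes", 1: "error", 2: "changes-pending"}


def interpret_plan_exit(code: int) -> str:
    """Translate a plan exit code into a verdict string."""
    return PLAN_VERDICTS.get(code, "unexpected")


def run_plan(cmd=("terragrunt", "plan", "-detailed-exitcode")) -> str:
    """Run the plan in the current directory and return the verdict."""
    return interpret_plan_exit(subprocess.run(list(cmd)).returncode)
```

Only `no-changes` counts as a successful restore. On `changes-pending`, read the plan before doing anything else.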
Warning
If plan shows replace/recreate for resources with write-only secrets, the state is fine but the live object may have changed — investigate per-resource. Do not run a blind apply.
Step 7 — Roll Back the Rollback
Nothing is lost — the “broken” version is still in versioning:
aws s3api list-object-versions --bucket "$BUCKET" --prefix "$KEY" --region "$REGION" \
  --query 'reverse(sort_by(Versions,&LastModified))[:5].{LastModified:LastModified,Size:Size,VersionId:VersionId}' \
  --output table
# Download the version you want, bump its serial, and `terragrunt state push` again.
Troubleshooting
| Error | Cause / fix |
| --- | --- |
| Error acquiring the state lock (DynamoDB) | Stale lock from an aborted run. Inspect: aws dynamodb get-item --table-name "$LOCK_TABLE" --region "$REGION" --key '{"LockID":{"S":"'"$BUCKET/$KEY"'"}}'. If it is yours: terragrunt force-unlock <LOCK_ID>. Never break someone else’s active lock. |
| Failed to persist state: lineage mismatch | Candidate is from a different ancestry. Make sure you pulled a version of the same object/key. |
| serial … is greater than or equal on push | Bump the file’s serial above remote (Step 5) or use -force. |
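To decide whether a lock is yours, the get-item output can be summarized as below. This sketch assumes the lock item carries an `Info` string attribute holding JSON with `ID`, `Who`, `Operation`, and `Created` fields; verify that against a real item before relying on it. `summarize_lock` is a hypothetical helper.

```python
from __future__ import annotations

import json


def summarize_lock(get_item_output: str) -> dict | None:
    """Return who holds the lock, or None when no lock item is present."""
    if not get_item_output.strip():
        return None  # the CLI printed nothing: no item for this key
    item = json.loads(get_item_output).get("Item")
    if not item or "Info" not in item:
        return None
    info = json.loads(item["Info"]["S"])
    return {field: info.get(field) for field in ("ID", "Who", "Operation", "Created")}
```

The `ID` value is what `terragrunt force-unlock` expects. Check `Who` and `Created` first to confirm the lock is stale and yours.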
Appendix A — When Rollback Is Not Possible (import)
If versioning is off or there is no healthy version, fall back to per-resource tofu import.
Write-only secrets are NOT restored by import
For Zitadel and any provider with write-only attributes, import does not restore machine_key.key_details or personal_access_token tokens — the provider only ever returns them at creation. After import those fields are empty and any consumer of the secret (External Secrets / K8s) breaks. In that case the secrets must be re-issued and consumers updated. This is exactly why version rollback is preferred whenever it is available.
# Example zitadel-provider import-ID formats; confirm against your provider version's docs:
terragrunt import 'zitadel_org.this' '<org_id>'
terragrunt import 'zitadel_project.this' '<org_id>:<project_id>'
terragrunt import 'zitadel_machine_user.this["example_tenant"]' '<org_id>:<user_id>'
# ...repeat for every address shown by `tofu plan`
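To get the full list of addresses that need an import, the JSON form of a saved plan can be filtered. This sketch assumes the plan was saved and rendered with `tofu show -json`, whose output lists `resource_changes` entries with an `address` and `change.actions`. The import IDs still have to be looked up per resource; `addresses_to_import` is a hypothetical helper.

```python
def addresses_to_import(plan: dict) -> list:
    """List the addresses the plan wants to create from scratch."""
    return [
        change["address"]
        for change in plan.get("resource_changes", [])
        if change.get("change", {}).get("actions") == ["create"]
    ]
```

Replacements (delete followed by create) are excluded on purpose, because those resources are already in state.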