Root cause: Docker Swarm sets $HOSTNAME to a new random container ID on every
task restart, so node_id, api_addr, and cluster_addr changed with each restart.
Vault could not recognize its own Raft data, so the cluster never re-formed
after a restart.
Fixes:
- docker-stack-vault.yml: add hostname: "vault-{{.Task.Slot}}.iklim.co" so each
replica gets a stable, slot-based hostname covered by the *.iklim.co wildcard cert.
Replace the STABLE_ID/NODE_ID_PLACEHOLDER logic with a single HOSTNAME_PLACEHOLDER
sed substitution.
Replace the single unseal attempt with a retry loop (90 attempts, 2 s apart) so
peer nodes unseal as soon as they join Raft, without external intervention.
- vault-bootstrap.sh: add ADIM 6b (step 6b): after the rolling restart, wait for
the Raft leader to unseal, wait for all peers to join Raft
(vault operator raft list-peers), then attempt an explicit per-peer unseal over
the overlay network (best-effort).
The ADIM 4 early exit now sends N requests to the shared alias; all must return
Sealed: false before the cluster is declared healthy.
ADIM 7 polls for up to 4 minutes via check_cluster_unsealed (9 shared-alias
requests) and retries the peer unseal on each iteration.
- deploy-prod.yml: the health check now sends 9 requests to the shared alias; all
must return Sealed: false (the old single-node check was masking partially sealed
clusters).
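
For reviewers, the "all N requests must be unsealed" check described above can be
sketched roughly as follows. This is an illustration, not the code in this commit:
only the name check_cluster_unsealed comes from the change itself, while
probe_unsealed, wait_for_cluster, and the 120 x 2 s polling defaults are
assumptions made for the example. It relies on `vault status` exiting 0 when the
node is unsealed, 2 when sealed, and 1 on error, and assumes VAULT_ADDR points at
the shared alias.

```shell
#!/bin/sh
# Hypothetical sketch: helper names and defaults are assumptions, not the
# actual contents of vault-bootstrap.sh or deploy-prod.yml.

# Probe whichever replica the shared alias routes this request to.
# `vault status` exits 0 if unsealed, 2 if sealed, 1 on error.
# Assumes VAULT_ADDR is set to the shared alias.
probe_unsealed() {
    vault status >/dev/null 2>&1
}

# Return 0 only if every one of the $1 probes (default 9) reports unsealed.
# A single sealed or unreachable answer fails the whole check.
check_cluster_unsealed() {
    requests="${1:-9}"
    i=1
    while [ "$i" -le "$requests" ]; do
        if ! probe_unsealed; then
            echo "probe $i/$requests: sealed or unreachable" >&2
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}

# Poll check_cluster_unsealed until it passes or attempts run out.
# $1 = max attempts (assumed default 120), $2 = delay in seconds (assumed 2).
wait_for_cluster() {
    attempts="${1:-120}"
    delay="${2:-2}"
    n=1
    while [ "$n" -le "$attempts" ]; do
        check_cluster_unsealed 9 2>/dev/null && return 0
        sleep "$delay"
        n=$((n + 1))
    done
    return 1
}
```

Repeating the probe makes it likely, though not certain, that each replica behind
the alias is reached at least once, which is why a single request could report a
partially sealed cluster as healthy.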