Root cause: Docker Swarm assigns a new random container ID as $HOSTNAME on every
task restart, making node_id, api_addr, and cluster_addr change with each restart.
Vault could not recognize its own Raft data → cluster never reformed after restart.
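
To illustrate the dependency, here is a minimal sketch, assuming a config template that contains the literal token HOSTNAME_PLACEHOLDER. The template contents, the cluster port 8201, and the function names render_vault_config and vault_template are hypothetical and not taken from the repository. Every identity field is derived from the hostname, so a random container ID produces a different node_id on each restart, while a slot-based hostname produces the same one.

```shell
#!/usr/bin/env bash
# Hypothetical helper: substitute a hostname into a Vault config template
# read from stdin. Not the repository's actual entrypoint logic.
render_vault_config() {
  # $1 = hostname that replaces every HOSTNAME_PLACEHOLDER token
  sed "s/HOSTNAME_PLACEHOLDER/$1/g"
}

# Assumed template for illustration only; the real config is not shown here.
vault_template() {
  cat <<'EOF'
api_addr     = "https://HOSTNAME_PLACEHOLDER:8200"
cluster_addr = "https://HOSTNAME_PLACEHOLDER:8201"
storage "raft" {
  node_id = "HOSTNAME_PLACEHOLDER"
}
EOF
}

# A slot-based hostname renders identically on every restart of slot 1.
vault_template | render_vault_config "vault-1.iklim.co"
```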
Fixes:
- docker-stack-vault.yml:
  - Add hostname: "vault-{{.Task.Slot}}.iklim.co" so each replica gets a
    stable, slot-based hostname covered by the *.iklim.co wildcard cert.
  - Replace STABLE_ID/NODE_ID_PLACEHOLDER logic with a single
    HOSTNAME_PLACEHOLDER sed.
  - Replace the single unseal attempt with a retry loop (90×2s) so peer
    nodes unseal as soon as they join Raft, without external intervention.
- vault-bootstrap.sh:
  - Add ADIM 6b (step 6b): after the rolling restart, wait for the Raft
    leader to unseal, wait for all peers to join Raft (vault operator raft
    list-peers), then attempt an explicit per-peer unseal via the overlay
    network (best-effort).
  - ADIM 4 early-exit now fires N requests to the shared alias; all must
    return Sealed: false before declaring the cluster healthy.
  - ADIM 7 polls for up to 4 minutes via check_cluster_unsealed (9
    shared-alias requests) and retries peer unseal on each iteration.
- deploy-prod.yml: health check now fires 9 requests to the shared alias;
  all must return Sealed: false (single-node check was masking
  partially-sealed clusters).
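
The retry loop described above can be sketched as a generic helper. This is a minimal sketch under stated assumptions, not the code in docker-stack-vault.yml: the function name retry and its argument order are hypothetical.

```shell
#!/usr/bin/env bash
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or MAX_ATTEMPTS is reached, sleeping
# DELAY_SECONDS between attempts. Returns 0 on success, 1 otherwise.
retry() {
  local max=$1 delay=$2 attempt=1
  shift 2
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$attempt" -lt "$max" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

A 90×2s unseal loop would then look like `retry 90 2 vault operator unseal "$(cat /run/secrets/vault_unseal_key)"`, assuming the unseal key is mounted as a Swarm secret at that path (an assumption; the actual key source is not shown in this commit).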
57 lines
1.8 KiB
YAML
name: Deploy Vault Stack to Production

on:
  push:
    branches:
      - prod-env

concurrency:
  group: vault-prod-deploy
  cancel-in-progress: false

jobs:
  deploy:
    runs-on: prod-runner
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Connect Runner to Overlay Network
        run: docker network connect iklimco-net $(hostname) || true

      - name: Ensure vault_unseal_key placeholder exists
        run: |
          docker secret ls --format '{{.Name}}' | grep -q '^vault_unseal_key' || \
            echo "bootstrap" | docker secret create vault_unseal_key -

      - name: Deploy Vault Stack
        run: |
          docker stack deploy \
            --with-registry-auth \
            -c docker-stack-vault.yml \
            iklimco

      - name: Run Bootstrap
        env:
          SKIP_DEPLOY: "true"
        run: bash vault-bootstrap.sh

      - name: Verify Vault Cluster Health
        run: |
          # Fire 9 requests to the shared alias (load-balanced across all 3 nodes).
          # Every request must return Sealed: false; one healthy node is not enough.
          SEALED_COUNT=0
          for i in $(seq 1 9); do
            SEALED=$(docker run --rm --network iklimco-net hashicorp/vault:2.0.1 \
              sh -c "VAULT_ADDR=https://vault.iklim.co:8200 VAULT_SKIP_VERIFY=true vault status 2>/dev/null" \
              | awk '/^Sealed/{print $2}' || echo "true")
            # The pipeline's exit status is awk's, so an unreachable node yields
            # an empty string rather than "true". Count anything other than an
            # explicit "false" as a failed check.
            [ "$SEALED" = "false" ] || SEALED_COUNT=$((SEALED_COUNT+1))
            sleep 1
          done
          if [ "$SEALED_COUNT" -eq 0 ]; then
            echo "Vault cluster is fully unsealed and healthy (9/9 checks passed)"
          else
            echo "ERROR: $SEALED_COUNT/9 checks returned sealed or unreachable"
            exit 1
          fi
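
The "every request must report unsealed" check could also be factored into a reusable function. The following is a minimal sketch, assuming the status command prints the default `vault status` table with a `Sealed` row. The name all_unsealed is hypothetical, and the check_cluster_unsealed function in vault-bootstrap.sh is not shown here and may differ. This version counts any response other than an explicit false, including empty output from an unreachable node, as a failure.

```shell
#!/usr/bin/env bash
# all_unsealed N STATUS_COMMAND [ARGS...]
# Runs STATUS_COMMAND N times and succeeds only if every run prints a
# line starting with "Sealed" whose second field is "false".
all_unsealed() {
  local n=$1 failed=0 i sealed
  shift
  for i in $(seq 1 "$n"); do
    # Errors from the status command are discarded; an empty result
    # simply fails the comparison below.
    sealed=$("$@" 2>/dev/null | awk '/^Sealed/{print $2}')
    [ "$sealed" = "false" ] || failed=$((failed + 1))
  done
  if [ "$failed" -ne 0 ]; then
    echo "ERROR: $failed/$n checks returned sealed or unreachable" >&2
    return 1
  fi
  return 0
}
```

In the workflow this might be called as `all_unsealed 9 vault status` with VAULT_ADDR pointing at the shared alias, so that the load balancer spreads the requests across the nodes.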