07 — Vault: 3-Node Raft Cluster (Prod)
Context
Vault starts directly as a 3-node Raft cluster in prod. Test used a single Vault instance (file storage, 1 replica on the manager node); prod skips that single-instance phase and goes straight to Raft HA.
Vault service configuration
- Replicas: 3 (one per service node)
- Storage: Raft integrated storage
- Placement: node.labels.type == service (all 3 app nodes)
- Cert distribution: no SSH needed. All nodes mount StorageBox, cert-reloader writes to SWAG_CERT_DIR=/mnt/storagebox/ssl, and Vault reads from that path on every node
Prerequisites
- All 3 service nodes are running and labeled type=service
- /mnt/storagebox/ssl/ directory is mounted and accessible on all 3 app nodes
- Vault data directory /opt/iklimco/vault/data/ exists on all 3 nodes (host path volumes)
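The prerequisites above can be checked on each service node before the first deploy. The sketch below is a hypothetical helper (vault_preflight is not part of the existing tooling); it uses the paths and cert file names from the YAML in this doc, and VAULT_DATA_DIR is an assumed override variable. The writable check only covers the user running the script, not the uid the Vault container runs as.

```shell
# Hypothetical preflight check; run on each of the 3 service nodes.
# SWAG_CERT_DIR comes from this doc; VAULT_DATA_DIR is an assumed override.
vault_preflight() {
  local cert_dir="${SWAG_CERT_DIR:-/mnt/storagebox/ssl}"
  local data_dir="${VAULT_DATA_DIR:-/opt/iklimco/vault/data}"
  local ok=0 f
  # StorageBox cert dir must be mounted and readable
  if [ ! -d "$cert_dir" ] || [ ! -r "$cert_dir" ]; then
    echo "missing or unreadable cert dir: $cert_dir" >&2; ok=1
  fi
  # The two files referenced by tls_cert_file / tls_key_file must exist
  for f in STAR.iklim.co.full.crt STAR.iklim.co_key.pem; do
    [ -r "$cert_dir/$f" ] || { echo "missing $cert_dir/$f" >&2; ok=1; }
  done
  # Raft data lives in a per-node host path volume
  if [ ! -d "$data_dir" ] || [ ! -w "$data_dir" ]; then
    echo "missing or unwritable data dir: $data_dir" >&2; ok=1
  fi
  return "$ok"
}
```

The node label can be checked from a manager, for example with docker node inspect --format '{{ .Spec.Labels.type }}' iklim-app-01, which should print service.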
Vault service YAML (docker-stack-infra.prod.yml overlay)
vault:
  # ... (image, secrets, healthcheck unchanged from base)
  environment:
    VAULT_LOCAL_CONFIG: >-
      {"api_addr":"https://vault.iklim.co:8200",
      "cluster_addr":"https://{{ .Node.Hostname }}:8201",
      "storage":{"raft":{"path":"/vault/file","node_id":"{{ .Node.Hostname }}"}},
      "listener":[{"tcp":{"address":"0.0.0.0:8200",
      "tls_cert_file":"/vault/certs/STAR.iklim.co.full.crt",
      "tls_key_file":"/vault/certs/STAR.iklim.co_key.pem"}}],
      "default_lease_ttl":"168h","max_lease_ttl":"720h","ui":true}
  volumes:
    - /opt/iklimco/vault/data:/vault/file    # host path per node
    - /mnt/storagebox/ssl:/vault/certs:ro    # StorageBox, shared across all nodes; no SSH distribution needed
  deploy:
    mode: replicated
    replicas: 3
    placement:
      max_replicas_per_node: 1    # one replica per node so Raft data dirs never collide (compose format 3.8+)
      constraints:
        - node.labels.type == service
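{{ .Node.Hostname }} is Docker Swarm's Go template for the node hostname. It gives each Vault instance a unique node_id as long as each node runs only one replica. It is also the host part of cluster_addr, so every node's hostname must resolve from the other Vault containers.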
Raft initialization procedure (first deploy)
Step 1 — Deploy the stack
docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco
All 3 Vault containers start sealed and uninitialized. The node where vault operator init is run (iklim-app-01 in Step 2) becomes the Raft leader; the other two join it in Step 4.
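Because replicas: 3 alone does not guarantee one task per node, it is worth confirming placement right after the deploy. A minimal sketch, assuming it is fed node names from docker service ps on a manager; check_vault_placement is a hypothetical helper name.

```shell
# Hypothetical check: 3 running Vault tasks on 3 distinct nodes.
# Reads one node name per line on stdin.
check_vault_placement() {
  local nodes total distinct
  nodes=$(grep -v '^[[:space:]]*$')                        # drop blank lines
  total=$(printf '%s\n' "$nodes" | grep -c .)              # running tasks
  distinct=$(printf '%s\n' "$nodes" | sort -u | grep -c .) # unique nodes
  if [ "$total" -ne 3 ] || [ "$distinct" -ne 3 ]; then
    echo "expected 3 tasks on 3 distinct nodes, got $total on $distinct" >&2
    return 1
  fi
}
```

Usage on a manager: docker service ps iklimco_vault --filter desired-state=running --format '{{.Node}}' | check_vault_placement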
Step 2 — Initialize Vault on the leader (iklim-app-01)
VAULT_CTR=$(docker ps -q -f name=iklimco_vault)
docker exec -it "$VAULT_CTR" vault operator init
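Save the unseal keys and root token securely. By default vault operator init produces 5 key shares with a threshold of 3, so a single stored key can only unseal Vault on its own if init was run with -key-shares=1 -key-threshold=1. Store the unseal key as a Docker secret (docker secret create must run on a Swarm manager):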
echo -n "<unseal-key>" | docker secret create vault_unseal_key -
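The echo form above leaves the key in shell history. A sketch of the same step that reads the key from a prompt instead; create_unseal_secret is a hypothetical helper, and it assumes a POSIX shell with stty available when run interactively.

```shell
# Hypothetical helper: create vault_unseal_key without the key in shell history.
create_unseal_secret() {
  local key
  if [ -t 0 ]; then
    printf 'Unseal key: ' >&2
    stty -echo              # hide typed characters
    IFS= read -r key
    stty echo
    printf '\n' >&2
  else
    IFS= read -r key        # non-interactive: read the key from stdin
  fi
  # printf '%s' avoids a trailing newline in the secret value
  printf '%s' "$key" | docker secret create vault_unseal_key -
}
```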
Step 3 — Unseal the leader
docker exec -it "$VAULT_CTR" vault operator unseal
The healthcheck auto-unseals on subsequent restarts via the vault_unseal_key secret.
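The base healthcheck is not reproduced here, so the following is only a sketch of how such an auto-unseal check could work, not the actual script. It relies on vault status exit codes (0 unsealed, 1 error, 2 sealed) and assumes the secret is mounted at Swarm's default /run/secrets/vault_unseal_key, that VAULT_ADDR is set inside the container, and that one key share is enough (see Step 2). vault_healthcheck and VAULT_UNSEAL_KEY_FILE are hypothetical names. Passing the key as an argument makes it visible to other processes in the container.

```shell
# Hypothetical healthcheck sketch: unseal from a Docker secret when sealed.
vault_healthcheck() {
  local key_file="${VAULT_UNSEAL_KEY_FILE:-/run/secrets/vault_unseal_key}"
  local rc=0
  vault status >/dev/null 2>&1 || rc=$?
  case "$rc" in
    0) return 0 ;;                        # unsealed: healthy
    2)                                    # sealed: try the stored key
      [ -r "$key_file" ] || return 1
      vault operator unseal "$(cat "$key_file")" >/dev/null 2>&1 || return 1
      vault status >/dev/null 2>&1        # healthy only if now unsealed
      ;;
    *) return 1 ;;                        # API unreachable or other error
  esac
}
```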
Step 4 — Join remaining nodes to the Raft cluster
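On iklim-app-02 and iklim-app-03, run the join against that host's local Vault container (docker exec only reaches containers on the node where it runs):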
docker exec -it <vault-on-iklim-app-02> vault operator raft join \
https://vault.iklim.co:8200
docker exec -it <vault-on-iklim-app-03> vault operator raft join \
https://vault.iklim.co:8200
Unseal each node after joining:
docker exec -it <vault-on-iklim-app-02> vault operator unseal
docker exec -it <vault-on-iklim-app-03> vault operator unseal
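The join can also be wrapped so it resolves the local container itself and refuses to run if it finds zero or several. A minimal sketch; raft_join_local is a hypothetical name, and unsealing stays the interactive step shown above.

```shell
# Hypothetical helper: run on iklim-app-02, then on iklim-app-03.
raft_join_local() {
  local leader="${1:-https://vault.iklim.co:8200}"
  local ctrs count
  ctrs=$(docker ps -q -f name=iklimco_vault)    # local containers only
  count=$(printf '%s\n' "$ctrs" | grep -c .)
  if [ "$count" -ne 1 ]; then
    echo "expected exactly 1 local Vault container, found $count" >&2
    return 1
  fi
  docker exec "$ctrs" vault operator raft join "$leader"
}
```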
Step 5 — Verify cluster
docker exec "$VAULT_CTR" vault operator raft list-peers
Expected: 3 peers, one leader and two followers.
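To make this check scriptable, the list-peers table (columns Node, Address, State, Voter, then a dashed separator line) can be parsed with awk. A sketch; check_raft_peers is a hypothetical name.

```shell
# Hypothetical check: expects 3 peers, 1 leader, 2 followers on stdin.
check_raft_peers() {
  awk 'NR > 2 && NF >= 3 {                 # skip header and dashed line
         peers++
         if ($3 == "leader") leaders++
         else if ($3 == "follower") followers++
       }
       END {
         printf "peers=%d leaders=%d followers=%d\n", peers, leaders, followers
         ok = (peers == 3 && leaders == 1 && followers == 2)
         exit (ok ? 0 : 1)
       }'
}
```

Usage: docker exec "$VAULT_CTR" vault operator raft list-peers | check_raft_peers. list-peers needs an authenticated token, for example the root token passed as VAULT_TOKEN.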
cert-reloader — no additional changes needed for Raft
cert-reloader writes the cert to SWAG_CERT_DIR=/mnt/storagebox/ssl.
Since StorageBox is mounted on all app nodes, every Vault instance already sees the same path.
The cert renewal flow works unchanged with Raft:
cert changed → copy to /mnt/storagebox/ssl/ → docker service update --force iklimco_vault
Vault (3 replicas) restart → each auto-unseals via healthcheck
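cert-reloader's own implementation is not shown in this doc; the sketch below only illustrates the copy-then-restart flow above. deploy_vault_cert and its source directory argument are hypothetical. It copies only files that differ, writes through a temporary name then renames, and triggers the rolling restart only when something changed.

```shell
# Hypothetical sketch of the renewal flow; $1 is a staging dir with new certs.
deploy_vault_cert() {
  local src="$1" dest="${SWAG_CERT_DIR:-/mnt/storagebox/ssl}" changed=0 f
  for f in STAR.iklim.co.full.crt STAR.iklim.co_key.pem; do
    [ -r "$src/$f" ] || { echo "missing $src/$f" >&2; return 1; }
    if ! cmp -s "$src/$f" "$dest/$f"; then        # differs or absent
      cp "$src/$f" "$dest/$f.tmp" && mv "$dest/$f.tmp" "$dest/$f" || return 1
      changed=1
    fi
  done
  if [ "$changed" -eq 1 ]; then
    docker service update --force iklimco_vault   # rolling restart, then auto-unseal
  fi
}
```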