Environment_Infrastructure/roadmap/prod-env/01-swarm-init-multinode.md
2026-05-09 16:26:06 +03:00


01 — Docker Swarm Init (Prod — Multi-Node)

Context

  • Repo: iklim.co root
  • Environment: prod
  • Topology:
    • 3 × service nodes — all act as Swarm managers AND app workers (Raft quorum: 1 can fail)
    • 3 × DB nodes — NOT part of Docker Swarm (separate DB cluster, out of scope)
  • All 6 nodes are in the same private network.
  • Pipeline trigger: push to prod-env branch → Gitea runner on prod-runner (first service node).
  • Swarm has 3 nodes total; all are manager-eligible and carry workloads (no dedicated worker-only nodes).

Node labeling plan

| Node      | Role                      | Swarm role       | Labels         |
|-----------|---------------------------|------------------|----------------|
| service-1 | API services, SWAG, Vault | Manager + Worker | `type=service` |
| service-2 | API services replicas     | Manager + Worker | `type=service` |
| service-3 | API services replicas     | Manager + Worker | `type=service` |

DB nodes (db-1/2/3) are not part of Docker Swarm. They run as a separate cluster and are provisioned independently. No Swarm join or label step applies to them.

Step 1 — Init Swarm on service-1 (the prod-runner node)

MANAGER_IP=$(hostname -I | awk '{print $1}')
# -x matches the whole line, so state "inactive" is not mistaken for "active"
if ! docker info --format '{{.Swarm.LocalNodeState}}' | grep -qx "active"; then
  docker swarm init --advertise-addr "$MANAGER_IP"
  echo "✅ Swarm initialized on $MANAGER_IP"
else
  echo "Swarm already active"
fi

Step 2 — Get manager join token

docker swarm join-token manager  # for service-2, service-3

Save this token — needed on service-2 and service-3.
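For non-interactive use (e.g. an Ansible task or pipeline step), the token can be captured with the `-q` flag of `docker swarm join-token`, which prints only the token itself. The helper function below is an illustrative sketch, not part of the repo scripts:

```shell
# Capture the manager join token programmatically (run on service-1).
# get_manager_token is a hypothetical helper for scripting.
get_manager_token() {
  docker swarm join-token -q manager
}

# Example: MANAGER_TOKEN=$(get_manager_token)
```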

Step 3 — Join service-2 and service-3 as managers

SSH into service-2 and service-3, then run:

docker swarm join --token <MANAGER_TOKEN> <service-1-ip>:2377
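Since the Ansible swarm role may run repeatedly, the join can be guarded the same way as the init in Step 1. A sketch (`join_if_needed` is a hypothetical helper, not from the repo):

```shell
# Join the swarm as a manager only if this node is not already in one.
join_if_needed() {
  local token="$1" manager_addr="$2"
  local state
  state=$(docker info --format '{{.Swarm.LocalNodeState}}' 2>/dev/null)
  if [ "$state" = "active" ]; then
    echo "Swarm already active on this node"
  else
    docker swarm join --token "$token" "$manager_addr"
  fi
}

# Example: join_if_needed "$MANAGER_TOKEN" <service-1-ip>:2377
```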

Step 4 — Label all Swarm nodes

On service-1, after service-2 and service-3 have joined:

for node in service-1 service-2 service-3; do
  docker node update --label-add type=service "$node"
done

Replace service-1, etc. with actual node hostnames shown in docker node ls. DB nodes are not in Swarm — no join or label step for them.
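If the hostnames differ, they can be read from `docker node ls` instead of being hard-coded. A sketch (the function name is illustrative):

```shell
# Label every node currently in the swarm, whatever its hostname.
label_all_nodes() {
  local node
  for node in $(docker node ls --format '{{.Hostname}}'); do
    docker node update --label-add type=service "$node"
  done
}
```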

Step 5 — Verify

docker node ls

Expected: 3 nodes, with MANAGER STATUS = Leader on one node and Reachable on the other two. All 3 nodes stay at AVAILABILITY = Active (not drained) so they also carry workloads.

docker node inspect service-1 --format '{{.Spec.Labels}}'

Expected: map[type:service].
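The quorum expectation can also be checked in a script. This sketch counts nodes whose manager status is Leader or Reachable; on this topology it should print 3 (so the cluster survives one manager failure):

```shell
# Count swarm nodes reporting a healthy manager status.
count_managers() {
  docker node ls --format '{{.ManagerStatus}}' | grep -c -E '^(Leader|Reachable)$'
}

# Expected on this topology: 3
```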

Step 6 — Confirm init/swarm-init.sh multi-node awareness

The script is idempotent (skips init if already active). Verify:

grep -n "swarm init\|swarm join" init/swarm-init.sh

The prod pipeline runs on service-1 only. service-2/3 are joined via Ansible (swarm role), not via the Gitea pipeline.

Placement constraints used in docker-stack-infra.yml

| Constraint                     | Resolves to                     |
|--------------------------------|---------------------------------|
| `node.role == manager`         | service-1, service-2, service-3 |
| `node.labels.type == service`  | service-1, service-2, service-3 |

SWAG, Vault, cert-reloader: pinned to node.role == manager. Microservices: no constraint (distributed across all 3 service nodes by Swarm scheduler).

node.labels.type == db constraint is not used — DB nodes are not in Swarm. PostgreSQL and MongoDB run outside Swarm as a separately managed cluster.
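In docker-stack-infra.yml these constraints appear under `deploy.placement.constraints`. The fragment below is illustrative only — service names and image tags are assumptions, not copied from the repo:

```yaml
services:
  swag:
    image: lscr.io/linuxserver/swag:latest   # illustrative image tag
    deploy:
      placement:
        constraints:
          - node.role == manager             # resolves to all 3 service nodes here
  api:
    image: registry.example.com/api:latest   # illustrative
    deploy:
      replicas: 3
      # no placement constraint: the Swarm scheduler spreads replicas
      # across the 3 service nodes
```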