Environment_Infrastructure/roadmap/prod-env/01-swarm-init-multinode.md
Murat ÖZDEMİR 67f4c10c93 docs(roadmap): update various roadmap docs to align with latest infrastructure setup
- Synchronized swarm initialization, pipeline update, and certificate reloader instructions with the new monolithic stack logic and Ansible roles.
2026-06-15 16:48:04 +03:00


01 — Docker Swarm Init (Prod — Multi-Node)

Context

  • Repo: iklim.co root
  • Environment: prod
  • Topology:
    • 3 × app nodes (iklim-app-01/02/03): all act as Swarm managers and as app workers with the type=service label (Raft quorum is 2 of 3, so 1 manager can fail)
    • 3 × DB nodes (iklim-db-01/02/03) — join Swarm as workers with role=db label; DB services are placed exclusively on them
  • Sizing: app nodes are cpx42, DB nodes are cpx32; see ../../hetzner-sizing-report.md
  • All 6 nodes are in the same private network.
  • Pipeline trigger: push to prod-env branch → Gitea runner on prod-runner (first app node).
  • App Swarm managers: all 3 app nodes are manager-eligible and carry app workloads under the type=service label; there are no dedicated worker-only app nodes.
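
The "1 can fail" figure follows from the Raft majority rule: with N managers, floor(N/2) + 1 of them must stay reachable. A minimal sketch of that arithmetic (the function names quorum and fault_tolerance are illustrative, not part of the repo):

```shell
# Majority (quorum) needed for N Swarm managers: floor(N/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of managers that can be lost while a majority is still reachable.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

echo "managers=3 quorum=$(quorum 3) tolerated_failures=$(fault_tolerance 3)"
```

For the 3 managers in this topology it prints managers=3 quorum=2 tolerated_failures=1. A fourth manager would raise the quorum to 3 without raising the tolerated failures, which is why odd manager counts are the usual choice.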

Node labeling plan

| Node | Role | Swarm role | Labels |
|---|---|---|---|
| iklim-app-01 | API services, SWAG, Vault | Manager + Worker | type=service |
| iklim-app-02 | API services replicas | Manager + Worker | type=service |
| iklim-app-03 | API services replicas | Manager + Worker | type=service |
| iklim-db-01 | MongoDB replica + PostgreSQL (Patroni), etcd | Worker | role=db, db-index=01 |
| iklim-db-02 | MongoDB replica + PostgreSQL (Patroni), etcd | Worker | role=db, db-index=02 |
| iklim-db-03 | MongoDB replica + PostgreSQL (Patroni), etcd | Worker | role=db, db-index=03 |

Label scheme rationale

App nodes carry type=service, DB nodes carry role=db. The two different label keys are not an inconsistency — they operate on different semantic planes:

  • type=service — "this node carries service workload"; determines which node group microservices and infrastructure services (APISIX, Vault, RabbitMQ, Redis, SWAG, etc.) are scheduled on.
  • role=db — "this node is a database node"; pins PostgreSQL (Patroni) and MongoDB exclusively to DB nodes.

Docker Swarm's built-in node.role property (manager / worker) does not conflict with the custom node.labels.role label — the placement constraint syntax distinguishes them explicitly:

node.role == manager           ← Swarm built-in (manager/worker distinction)
node.labels.type == service    ← custom label (app node workload target)
node.labels.role == db         ← custom label (DB node workload target)

This scheme is applied consistently across the current prod stack (docker-stack-infra_db-prod.yml), the separate Vault stack (docker-stack-vault.yml), and microservice stack definitions. The test environment uses the same type=service label on its service node, so both environments share the same constraint syntax.

node.role == worker is intentionally not used anywhere. DB nodes are Swarm workers, but targeting them via node.role == worker would also match any future worker-only app nodes. The explicit node.labels.role == db label provides precise, unambiguous targeting regardless of Swarm role.
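
To preview which nodes a label-based constraint would match, the node list can be filtered by custom label. This is a sketch that assumes a Docker version supporting the node.label filter of docker node ls; the helper name nodes_with_label is hypothetical:

```shell
# List hostnames of Swarm nodes that carry a given custom label.
# Run on a manager node. Usage: nodes_with_label role=db
nodes_with_label() {
  docker node ls --filter "node.label=$1" --format '{{.Hostname}}'
}
```

Calling nodes_with_label type=service should list the three app nodes and nodes_with_label role=db the three DB nodes, once the labels from the steps below have been applied.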

Automation Note

IMPORTANT: The Swarm initialization, join token handling, and node labeling steps listed below are no longer performed manually. They are run automatically by Environment_Infrastructure/ansible/prod/prod-bootstrap.yml and the shared swarm role. The manual bash commands here are kept only for reference, information, and troubleshooting.

Labeling happens in two stages:

  • The shared swarm role adds the type=service label to app nodes and the role=db label to DB nodes.
  • The prod playbook adds the db-index=01/02/03 labels to the DB nodes, running from iklim-app-01.

Step 1 — Init Swarm on iklim-app-01 (the prod-runner node)

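# Assumes the first address reported by hostname -I is the private-network IP.
MANAGER_IP=$(hostname -I | awk '{print $1}')
# Compare exactly: a plain grep for "active" would also match "inactive".
if [ "$(docker info --format '{{.Swarm.LocalNodeState}}')" != "active" ]; then
  docker swarm init --advertise-addr "$MANAGER_IP"
  echo "Swarm initialized on $MANAGER_IP"
else
  echo "Swarm already active"
fi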
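
The init snippet takes the first address printed by hostname -I, which on a host with several interfaces is not guaranteed to be the private-network one. A hedged alternative is to select the address by prefix. The helper name pick_ip_by_prefix is hypothetical, and the 10.20.10. prefix is an assumption inferred from the 10.20.10.11 address used in the join steps:

```shell
# Print the first address from the argument list that starts with the prefix.
# $1 = prefix (e.g. 10.20.10.), remaining arguments = candidate addresses.
# Returns non-zero when no address matches.
pick_ip_by_prefix() {
  prefix=$1
  shift
  for ip in "$@"; do
    case $ip in
      "$prefix"*) echo "$ip"; return 0 ;;
    esac
  done
  return 1
}

# Usage on the node (word splitting of the hostname -I output is intentional):
# MANAGER_IP=$(pick_ip_by_prefix 10.20.10. $(hostname -I))
```

Failing with a non-zero status when nothing matches lets a calling script stop before docker swarm init is run with an empty advertise address.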

Step 2 — Get manager join token

docker swarm join-token manager  # for iklim-app-02, iklim-app-03

Save this token — needed on iklim-app-02 and iklim-app-03.
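
When the token is consumed by a script rather than copied by hand, the -q flag prints only the token without the surrounding instructions. A small sketch (the wrapper name join_token is illustrative):

```shell
# Print only the join token for the given role (manager or worker).
# Must be run on an existing manager node.
join_token() {
  docker swarm join-token -q "$1"
}

# Usage: MANAGER_TOKEN=$(join_token manager)
```

The same wrapper covers the worker token needed for the DB nodes: join_token worker.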

Step 3 — Join iklim-app-02 and iklim-app-03 as managers

SSH into iklim-app-02 and iklim-app-03 and run the following on each:

docker swarm join --token <MANAGER_TOKEN> 10.20.10.11:2377
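
Running docker swarm join on a node that is already a member returns an error, so a re-runnable version can check the local state first. This is a sketch under the same exact-match check as the init step; the function name join_swarm_if_needed is hypothetical:

```shell
# Join this node to the Swarm unless it is already a member.
# $1 = join token, $2 = manager address as host:port.
join_swarm_if_needed() {
  state=$(docker info --format '{{.Swarm.LocalNodeState}}')
  if [ "$state" = "active" ]; then
    echo "Already part of a Swarm"
  else
    docker swarm join --token "$1" "$2"
  fi
}

# Usage: join_swarm_if_needed "<MANAGER_TOKEN>" 10.20.10.11:2377
```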

Step 4 — Label app nodes

On iklim-app-01, after iklim-app-02 and iklim-app-03 have joined:

for node in iklim-app-01 iklim-app-02 iklim-app-03; do
  docker node update --label-add type=service "$node"
done

Step 5 — Join DB nodes as Swarm workers

Get the worker join token on iklim-app-01:

docker swarm join-token worker

SSH into each DB node and join:

docker swarm join --token <WORKER_TOKEN> 10.20.10.11:2377

Then label them on iklim-app-01:

docker node update --label-add role=db iklim-db-01
docker node update --label-add role=db iklim-db-02
docker node update --label-add role=db iklim-db-03

docker node update --label-add db-index=01 iklim-db-01
docker node update --label-add db-index=02 iklim-db-02
docker node update --label-add db-index=03 iklim-db-03

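DB nodes are Swarm workers only and never become managers. The node.labels.role == db label identifies them as a group; in the current prod stack (docker-stack-infra_db-prod.yml), individual DB services are pinned per node by hostname constraints, as listed under Placement Constraints Used in Current Prod Stacks. See 08-prod-db-cluster-setup.md for DB stack deployment.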
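
The six labeling commands follow one pattern: the db-index value is the numeric suffix of the hostname. Assuming that naming convention holds, they can be collapsed into a loop. The helper names db_index_of and label_db_nodes are illustrative:

```shell
# Derive the db-index value from a hostname, e.g. iklim-db-02 -> 02.
db_index_of() {
  echo "${1##*-}"
}

# Apply both DB labels with a single update per node (run on a manager).
label_db_nodes() {
  for node in "$@"; do
    docker node update \
      --label-add role=db \
      --label-add "db-index=$(db_index_of "$node")" \
      "$node"
  done
}

# Usage: label_db_nodes iklim-db-01 iklim-db-02 iklim-db-03
```

Adding a label that already exists with the same value does not change the node, so the loop can be re-run safely.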

Step 6 — Verify

docker node ls

Expected: 6 nodes, all with STATUS = Ready. The 3 app nodes show MANAGER STATUS = Leader (exactly one) or Reachable; the 3 DB nodes show an empty MANAGER STATUS.
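
The manager/worker count can also be checked without reading the table by eye. The sketch below relies on workers having an empty MANAGER STATUS column; the function name summarize_nodes is hypothetical:

```shell
# Count managers and workers from lines of "<hostname> <manager status>".
# Workers have an empty manager status, so their lines hold only the hostname.
summarize_nodes() {
  awk 'NF >= 2 { m++ } NF == 1 { w++ } END { printf "managers=%d workers=%d\n", m, w }'
}

# Usage on a manager:
# docker node ls --format '{{.Hostname}} {{.ManagerStatus}}' | summarize_nodes
```

For the topology in this document the expected result is managers=3 workers=3. Note that a manager reported as Unreachable still counts as a manager here, so this checks membership, not health.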

docker node inspect iklim-app-01 --format '{{.Spec.Labels}}'
docker node inspect iklim-db-01 --format '{{.Spec.Labels}}'

Expected: map[type:service] for app nodes, and map[db-index:01 role:db] for iklim-db-01 (likewise 02 and 03 for the other DB nodes).

Step 7 — Confirm init/swarm-init.sh multi-node awareness

The script is idempotent (skips init if already active). Verify:

grep -n "swarm init\|swarm join" init/swarm-init.sh

The prod pipeline runs on iklim-app-01 only. iklim-app-02/03 are joined via Ansible (swarm role), not via the Gitea pipeline.

Placement Constraints Used in Current Prod Stacks

| Constraint | Resolves to | Services |
|---|---|---|
| node.hostname == iklim-app-01 | iklim-app-01 only | SWAG, cert-reloader |
| node.labels.type == service | iklim-app-01, iklim-app-02, iklim-app-03 | Vault, Redis, RabbitMQ, APISIX, Prometheus, Grafana, SWAG support services |
| node.hostname == iklim-db-01/02/03 | specific DB node | Patroni, MongoDB, and etcd services pinned per node in docker-stack-infra_db-prod.yml |
| node.labels.role == db | iklim-db-01, iklim-db-02, iklim-db-03 | Generic DB node identity; retained for operations and compatibility |

SWAG and cert-reloader are pinned to iklim-app-01 (the Floating IP node) because SWAG must match the public entry point. Vault is deployed by docker-stack-vault.yml across service nodes and reads certificates from /opt/iklimco/ssl. Microservices are distributed by the Swarm scheduler across app nodes. DB services are defined in docker-stack-infra_db-prod.yml and pinned to DB nodes by hostname constraints.
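
To confirm which constraints a deployed service actually carries, its spec can be inspected on a manager. This is a sketch only: the helper name service_constraints is hypothetical, the service name must be replaced with a real one from docker service ls, and the template is expected to fail for a service that defines no placement section at all:

```shell
# Print the placement constraints recorded in a service's spec as JSON.
# $1 = service name as shown by docker service ls.
service_constraints() {
  docker service inspect \
    --format '{{json .Spec.TaskTemplate.Placement.Constraints}}' "$1"
}

# Usage: service_constraints <stack>_<service>
```

Comparing the output with the constraint listed for that service in the table is a quick way to catch a stack file that has drifted from this document.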

Historical / Superseded by Setup

Older notes that referred to docker-stack-infra.yml, docker-stack-infra.prod.yml, or docker-stack-db.prod.yml as the active prod deployment model are superseded by ../../setup/08-prod-db-cluster-setup.md and ../../setup/09-prod-runner-ha-and-swarm.md.