01 — Docker Swarm Init (Prod — Multi-Node)
Context
- Repo: iklim.co root
- Environment: prod
- Topology:
  - 3 × app nodes (iklim-app-01/02/03): all act as Swarm managers AND app workers with the type=service label (Raft quorum: 1 can fail)
  - 3 × DB nodes (iklim-db-01/02/03): join Swarm as workers with the role=db label; DB services are placed exclusively on them
- Sizing: app nodes are cpx42, DB nodes are cpx32; see ../../hetzner-sizing-report.md
- All 6 nodes are in the same private network.
- Pipeline trigger: push to prod-env branch → Gitea runner on prod-runner (first app node).
- App Swarm managers: all 3 app nodes are manager-eligible and carry app workloads with the type=service label (no dedicated worker-only app nodes).
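The "Raft quorum: 1 can fail" figure follows from the usual Raft majority rule. A minimal sketch of that arithmetic, with the helper names `quorum` and `fault_tolerance` chosen here for illustration only:

```shell
# Raft majority needed for a Swarm with N managers: floor(N/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of managers that can be lost while a majority still remains.
fault_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

for n in 1 3 5; do
  echo "managers=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

With 3 managers the majority is 2, so one app node can be down without the Swarm losing its control plane.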
Node labeling plan
| Node | Role | Swarm role | Labels |
|---|---|---|---|
| iklim-app-01 | API services, SWAG, Vault | Manager + Worker | type=service |
| iklim-app-02 | API services replicas | Manager + Worker | type=service |
| iklim-app-03 | API services replicas | Manager + Worker | type=service |
| iklim-db-01 | MongoDB replica + PostgreSQL (Patroni), etcd | Worker | role=db, db-index=01 |
| iklim-db-02 | MongoDB replica + PostgreSQL (Patroni), etcd | Worker | role=db, db-index=02 |
| iklim-db-03 | MongoDB replica + PostgreSQL (Patroni), etcd | Worker | role=db, db-index=03 |
Label scheme rationale
App nodes carry type=service, DB nodes carry role=db. The two different label keys are not an inconsistency — they operate on different semantic planes:
- type=service: "this node carries service workload"; determines which node group microservices and infrastructure services (APISIX, Vault, RabbitMQ, Redis, SWAG, etc.) are scheduled on.
- role=db: "this node is a database node"; pins PostgreSQL (Patroni), MongoDB, and their proxy services exclusively to DB nodes.
Docker Swarm's built-in node.role property (manager / worker) does not conflict with the custom node.labels.role label — the placement constraint syntax distinguishes them explicitly:
node.role == manager ← Swarm built-in (manager/worker distinction)
node.labels.type == service ← custom label (app node workload target)
node.labels.role == db ← custom label (DB node workload target)
This scheme is applied consistently across docker-stack-infra.yml and all 10 microservice docker-stack-service.yml files. The test environment uses the same type=service label on its single node, so both environments share the same constraint syntax.
node.role == worker is intentionally not used anywhere. DB nodes are Swarm workers, but targeting them via node.role == worker would also match any future worker-only app nodes. The explicit node.labels.role == db label provides precise, unambiguous targeting regardless of Swarm role.
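One way to see the two planes side by side is to query nodes by custom label and by built-in role separately. The sketch below assumes it runs on a manager node; the function names are illustrative and not part of the repo.

```shell
# List hostnames of nodes that carry a given custom label (key, value).
nodes_with_label() {
  docker node ls --filter "node.label=$1=$2" --format '{{.Hostname}}'
}

# List hostnames by the built-in Swarm role (manager or worker).
nodes_with_role() {
  docker node ls --filter "role=$1" --format '{{.Hostname}}'
}

# Usage (on a manager node):
#   nodes_with_label type service   # app nodes
#   nodes_with_label role db        # DB nodes
#   nodes_with_role worker          # built-in role, independent of the labels
```

Once all nodes are labeled, the `role db` query and the `worker` query should return the same three hosts in this topology; they would diverge as soon as a worker-only app node is added, which is the case the explicit label protects against.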
Step 1 — Init Swarm on iklim-app-01 (the prod-runner node)
MANAGER_IP=$(hostname -I | awk '{print $1}')
# -x matches the whole line, so "inactive" is not mistaken for "active"
if ! docker info --format '{{.Swarm.LocalNodeState}}' | grep -qx "active"; then
  docker swarm init --advertise-addr "$MANAGER_IP"
  echo "Swarm initialized on $MANAGER_IP"
else
  echo "Swarm already active"
fi
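On a multi-homed host the first address printed by `hostname -I` is not guaranteed to be the private one, since its man page warns against relying on the output order. A hedged sketch of selecting the advertise address by subnet prefix instead; the function name and the `10.20.` prefix are assumptions (the prefix is inferred from the manager address used in the join commands):

```shell
# Print the first address from the argument list that starts with the given prefix.
# Returns non-zero if nothing matches.
pick_ip_with_prefix() {
  prefix=$1
  shift
  for ip in "$@"; do
    case "$ip" in
      "$prefix"*) echo "$ip"; return 0 ;;
    esac
  done
  return 1
}

# Usage on a node (adjust the prefix to the actual private subnet):
#   MANAGER_IP=$(pick_ip_with_prefix "10.20." $(hostname -I)) || exit 1
#   docker swarm init --advertise-addr "$MANAGER_IP"
```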
Step 2 — Get manager join token
docker swarm join-token manager # for iklim-app-02, iklim-app-03
Save this token — needed on iklim-app-02 and iklim-app-03.
Step 3 — Join iklim-app-02 and iklim-app-03 as managers
SSH into iklim-app-02 and iklim-app-03, run:
docker swarm join --token <MANAGER_TOKEN> 10.20.10.11:2377
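To avoid copying the token by hand, the join command can be assembled on the manager. This is a sketch only; `print_join_command` is a hypothetical helper, and the output contains the secret token, so it should not be written to shared logs.

```shell
# Print the join command for the given role ("manager" or "worker").
# Must run on an existing manager; -q makes join-token print only the token.
print_join_command() {
  role=$1
  manager_addr=$2
  token=$(docker swarm join-token -q "$role") || return 1
  echo "docker swarm join --token $token $manager_addr"
}

# Usage on iklim-app-01:
#   print_join_command manager 10.20.10.11:2377
```

The same helper covers the worker token needed for the DB nodes by passing `worker` as the first argument.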
Step 4 — Label app nodes
On iklim-app-01, after iklim-app-02 and iklim-app-03 have joined:
for node in iklim-app-01 iklim-app-02 iklim-app-03; do
docker node update --label-add type=service "$node"
done
Step 5 — Join DB nodes as Swarm workers
Get the worker join token on iklim-app-01:
docker swarm join-token worker
SSH into each DB node and join:
docker swarm join --token <WORKER_TOKEN> 10.20.10.11:2377
Then label them on iklim-app-01:
docker node update --label-add role=db --label-add db-index=01 iklim-db-01
docker node update --label-add role=db --label-add db-index=02 iklim-db-02
docker node update --label-add role=db --label-add db-index=03 iklim-db-03
DB nodes are Swarm workers only; they never become managers. DB services are pinned to them via the node.labels.role == db placement constraint. See 08-prod-db-cluster-kurulum.md for DB stack deployment.
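Because the db-index value mirrors the numeric suffix of the hostname, the three labeling commands can also be expressed as a loop. A minimal sketch, assuming hostnames keep the `<name>-<index>` pattern; both function names are illustrative:

```shell
# Derive the db-index label from the hostname suffix, e.g. iklim-db-02 -> 02.
db_index() {
  echo "${1##*-}"
}

# Apply role=db and db-index=<suffix> to each node name passed as an argument.
label_db_nodes() {
  for node in "$@"; do
    docker node update \
      --label-add role=db \
      --label-add "db-index=$(db_index "$node")" \
      "$node" || return 1
  done
}

# Usage on iklim-app-01:
#   label_db_nodes iklim-db-01 iklim-db-02 iklim-db-03
```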
Step 6 — Verify
docker node ls
Expected: 6 nodes, all with STATUS = Ready: 3 managers with MANAGER STATUS = Leader or Reachable (exactly one Leader), and 3 workers with an empty MANAGER STATUS column.
docker node inspect iklim-app-01 --format '{{.Spec.Labels}}'
docker node inspect iklim-db-01 --format '{{.Spec.Labels}}'
Expected: map[type:service] for app nodes, and map[db-index:01 role:db] (with 02 and 03 on the other DB nodes) for DB nodes.
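The manual check can be turned into a scripted assertion. The following is a sketch under the assumption that it runs on a manager; `count_nodes` and `check_topology` are hypothetical names, and only the node counts are checked, not the labels.

```shell
# Count nodes by built-in Swarm role using docker node ls filters.
count_nodes() {
  docker node ls --filter "role=$1" --format '{{.Hostname}}' | grep -c .
}

# Fail unless the cluster has the expected number of managers and workers.
check_topology() {
  managers=$(count_nodes manager)
  workers=$(count_nodes worker)
  if [ "$managers" -eq "$1" ] && [ "$workers" -eq "$2" ]; then
    echo "OK: $managers managers, $workers workers"
  else
    echo "MISMATCH: $managers managers, $workers workers" >&2
    return 1
  fi
}

# Usage on iklim-app-01:
#   check_topology 3 3
```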
Step 7 — Confirm init/swarm-init.sh multi-node awareness
The script is idempotent (skips init if already active). Verify:
grep -n "swarm init\|swarm join" init/swarm-init.sh
The prod pipeline runs on iklim-app-01 only. iklim-app-02/03 are joined via Ansible (swarm role),
not via the Gitea pipeline.
Placement constraints used in docker-stack-infra.yml
| Constraint | Resolves to | Services |
|---|---|---|
| node.hostname == iklim-app-01 | iklim-app-01 only | SWAG, cert-reloader |
| node.labels.type == service | iklim-app-01, iklim-app-02, iklim-app-03 | Vault, Redis, RabbitMQ, APISIX, Prometheus, Grafana, etcd (idle in prod; APISIX uses Patroni etcd) |
| node.labels.role == db | iklim-db-01, iklim-db-02, iklim-db-03 | PostgreSQL (Patroni), MongoDB, etcd (via docker-stack-db.prod.yml) |
SWAG and cert-reloader are pinned to iklim-app-01 (the Floating IP node) because SWAG does not support clustering and must match the public entry point. Vault floats across all service nodes; its TLS cert is read from StorageBox (/mnt/storagebox/ssl) so it is available on whichever node Vault is scheduled on. Microservices carry no placement constraint and are distributed by the Swarm scheduler across all app nodes. DB services are pinned to DB nodes via separate DB stacks.
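After a deploy, the effect of these constraints can be spot-checked by asking Swarm where a service's tasks are running. This is a sketch; the function names are illustrative, and the service name in the usage line is a placeholder because the actual stack prefix is not stated here.

```shell
# Print the distinct nodes currently running tasks of a service.
service_nodes() {
  docker service ps --filter desired-state=running --format '{{.Node}}' "$1" | sort -u
}

# Succeed only if every running task of the service is on the expected node.
service_pinned_to() {
  [ "$(service_nodes "$1")" = "$2" ]
}

# Usage on a manager (replace <swag-service> with the deployed service name):
#   service_pinned_to <swag-service> iklim-app-01 && echo "pinned as expected"
```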