Environment_Infrastructure/roadmap/prod-env/01-swarm-init-multinode.md
Murat ÖZDEMİR 8875af8e8a docs: fix roadmap and setup reference direction
Remove setup runbook references from prod roadmap docs so roadmap remains design intent only. Keep setup-to-roadmap links, but normalize them to explicit relative paths.
2026-06-15 19:57:21 +03:00

# 01 — Docker Swarm Init (Prod — Multi-Node)
## Context
- **Repo:** `iklim.co` root
- **Environment:** prod
- **Topology:**
  - 3 × app nodes (`iklim-app-01/02/03`): all act as **Swarm managers AND app workers** with the `type=service` label (Raft quorum is 2 of 3, so one manager can fail)
  - 3 × DB nodes (`iklim-db-01/02/03`): join the Swarm as **workers** with the `role=db` label; DB services are placed exclusively on them
- **Sizing:** app nodes are `cpx42`, DB nodes are `cpx32`; see `../../hetzner-sizing-report.md`
- All 6 nodes are in the same private network.
- Pipeline trigger: push to `prod-env` branch → Gitea runner on `prod-runner` (first app node).
- App Swarm managers: all three app nodes are manager-eligible and also carry app workloads under the `type=service` label; there are no dedicated worker-only app nodes.
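
The one-manager failure tolerance noted in the topology follows from Raft majority arithmetic: with N managers, quorum needs floor(N/2) + 1 of them, so floor((N - 1)/2) can be lost. A minimal sketch of that arithmetic; the function names are illustrative and not part of the repo:

```bash
# Majority needed for Raft quorum with N managers: floor(N/2) + 1
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Managers that can be lost while quorum still holds: floor((N - 1) / 2)
fault_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

for n in 1 3 5; do
  echo "managers=$n quorum=$(quorum_size "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

For the three app managers here, this gives a quorum of 2 and one tolerated failure.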
## Node labeling plan
| Node | Role | Swarm role | Labels |
|------|------|------------|--------|
| `iklim-app-01` | API services, SWAG, Vault | Manager + Worker | `type=service` |
| `iklim-app-02` | API services replicas | Manager + Worker | `type=service` |
| `iklim-app-03` | API services replicas | Manager + Worker | `type=service` |
| `iklim-db-01` | MongoDB replica + PostgreSQL (Patroni), etcd | Worker | `role=db`, `db-index=01` |
| `iklim-db-02` | MongoDB replica + PostgreSQL (Patroni), etcd | Worker | `role=db`, `db-index=02` |
| `iklim-db-03` | MongoDB replica + PostgreSQL (Patroni), etcd | Worker | `role=db`, `db-index=03` |
### Label scheme rationale
App nodes carry `type=service`, DB nodes carry `role=db`. The two different label keys are not an inconsistency — they operate on different semantic planes:
- **`type=service`** — "this node carries service workload"; determines which node group microservices and infrastructure services (APISIX, Vault, RabbitMQ, Redis, SWAG, etc.) are scheduled on.
- **`role=db`** — "this node is a database node"; pins PostgreSQL (Patroni) and MongoDB exclusively to DB nodes.
Docker Swarm's **built-in** `node.role` property (`manager` / `worker`) does **not** conflict with the custom `node.labels.role` label — the placement constraint syntax distinguishes them explicitly:
```
node.role == manager ← Swarm built-in (manager/worker distinction)
node.labels.type == service ← custom label (app node workload target)
node.labels.role == db ← custom label (DB node workload target)
```
This scheme is applied consistently across the current prod stack (`docker-stack-infra_db-prod.yml`), the separate Vault stack (`docker-stack-vault.yml`), and microservice stack definitions. The test environment uses the same `type=service` label on its service node, so both environments share the same constraint syntax.
`node.role == worker` is intentionally not used anywhere. DB nodes are Swarm workers, but targeting them via `node.role == worker` would also match any future worker-only app nodes. The explicit `node.labels.role == db` label provides precise, unambiguous targeting regardless of Swarm role.
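
To check which nodes a label-based constraint would resolve to, the nodes can be listed by custom label on a manager. This is a small sketch using the `node.label` filter of `docker node ls`; the helper name `nodes_with_label` is hypothetical:

```bash
# List hostnames of nodes that carry a given custom label (key=value).
# Must be run on a Swarm manager.
nodes_with_label() {
  docker node ls --filter "node.label=$1" --format '{{.Hostname}}'
}

# Example (on a manager):
#   nodes_with_label role=db
#   nodes_with_label type=service
```

If the labeling plan above is in place, `role=db` should list the three DB nodes and `type=service` the three app nodes.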
## Automation Note
**IMPORTANT:** The Swarm initialization, join-token handling, and node labeling steps listed below are no longer performed manually. They are run automatically by `Environment_Infrastructure/ansible/prod/prod-bootstrap.yml` and the shared `swarm` role. The manual bash commands here are kept only for reference, information, and troubleshooting.
Labeling happens in two stages:
- The shared `swarm` role adds the `type=service` label to app nodes and the `role=db` label to DB nodes.
- The prod playbook adds the `db-index=01/02/03` label to the DB nodes, running from `iklim-app-01`.
## Step 1 — Init Swarm on iklim-app-01 (the prod-runner node)
```bash
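# Takes the first address reported by the host; confirm it is the private-network IP
MANAGER_IP=$(hostname -I | awk '{print $1}')
# Compare exactly: a substring match on "active" would also match "inactive"
if [ "$(docker info --format '{{.Swarm.LocalNodeState}}')" != "active" ]; then
  docker swarm init --advertise-addr "$MANAGER_IP"
  echo "Swarm initialized on $MANAGER_IP"
else
  echo "Swarm already active"
fi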
```
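
On a host with several interfaces, the first address from `hostname -I` is not guaranteed to be the private one. One possible way to select the advertise address by prefix is sketched below. The helper `pick_private_ip` is hypothetical, and the `10.20.10.` prefix is an assumption taken from the join address used in this document:

```bash
# Print the first address from the argument list that starts with the
# given prefix; return non-zero if none matches.
pick_private_ip() {
  prefix="$1"; shift
  for ip in "$@"; do
    case "$ip" in
      "$prefix"*) echo "$ip"; return 0 ;;
    esac
  done
  return 1
}

# Usage on a node (word splitting of the hostname -I output is intentional):
#   MANAGER_IP=$(pick_private_ip 10.20.10. $(hostname -I))
```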
## Step 2 — Get manager join token
```bash
docker swarm join-token manager # for iklim-app-02, iklim-app-03
```
Save this token; it is needed on iklim-app-02 and iklim-app-03.
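
When scripting, the `-q` flag of `docker swarm join-token` prints only the token rather than the full join command, which makes it easier to capture. A minimal sketch; the wrapper name `get_join_token` is illustrative:

```bash
# Print only the join token for the given role (manager or worker).
get_join_token() {
  docker swarm join-token -q "$1"
}

# Example (on iklim-app-01):
#   MANAGER_TOKEN=$(get_join_token manager)
#   WORKER_TOKEN=$(get_join_token worker)
```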
## Step 3 — Join iklim-app-02 and iklim-app-03 as managers
SSH into iklim-app-02 and iklim-app-03, and run on each:
```bash
docker swarm join --token <MANAGER_TOKEN> 10.20.10.11:2377
```
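
The join can be made safe to re-run with the same state check used in Step 1. The following is a sketch only, not the logic of the Ansible `swarm` role; `join_swarm_if_needed` is a hypothetical helper:

```bash
# Join the Swarm only if this node is not already an active member.
# Arguments: join token, manager address (host:port).
join_swarm_if_needed() {
  state=$(docker info --format '{{.Swarm.LocalNodeState}}')
  if [ "$state" = "active" ]; then
    echo "already joined"
  else
    docker swarm join --token "$1" "$2"
  fi
}

# Example:
#   join_swarm_if_needed "$MANAGER_TOKEN" 10.20.10.11:2377
```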
## Step 4 — Label app nodes
On iklim-app-01, after iklim-app-02 and iklim-app-03 have joined:
```bash
for node in iklim-app-01 iklim-app-02 iklim-app-03; do
  docker node update --label-add type=service "$node"
done
```
## Step 5 — Join DB nodes as Swarm workers
Get the worker join token on iklim-app-01:
```bash
docker swarm join-token worker
```
SSH into each DB node and join:
```bash
docker swarm join --token <WORKER_TOKEN> 10.20.10.11:2377
```
Then label them on iklim-app-01:
```bash
docker node update --label-add role=db iklim-db-01
docker node update --label-add role=db iklim-db-02
docker node update --label-add role=db iklim-db-03
docker node update --label-add db-index=01 iklim-db-01
docker node update --label-add db-index=02 iklim-db-02
docker node update --label-add db-index=03 iklim-db-03
```
> DB nodes are Swarm **workers** only — they never become managers.
> DB services are pinned to them via `node.labels.role == db` placement constraint.
> DB services are deployed by the current root production stack `docker-stack-infra_db-prod.yml`.
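
The six labeling commands can also be written as a loop that derives `db-index` from the hostname suffix. This is a sketch that assumes the `iklim-db-NN` naming shown in this document; both helper names are hypothetical:

```bash
# Derive the db-index value from a hostname, e.g. iklim-db-02 -> 02.
db_index_for() {
  echo "${1##*-}"
}

# Apply both DB labels with one update per node.
label_db_nodes() {
  for node in "$@"; do
    docker node update \
      --label-add role=db \
      --label-add "db-index=$(db_index_for "$node")" \
      "$node"
  done
}

# Example (on iklim-app-01):
#   label_db_nodes iklim-db-01 iklim-db-02 iklim-db-03
```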
## Step 6 — Verify
```bash
docker node ls
```
Expected: 6 nodes. The 3 app nodes show `MANAGER STATUS` = `Leader` or `Reachable`; the 3 DB nodes show an empty `MANAGER STATUS`. All 6 report `STATUS` = `Ready`.
```bash
docker node inspect iklim-app-01 --format '{{.Spec.Labels}}'
docker node inspect iklim-db-01 --format '{{.Spec.Labels}}'
```
Expected: `map[type:service]` for app nodes, `map[db-index:01 role:db]` (with the matching index) for DB nodes.
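
The node-count check can be automated by parsing `docker node ls` output. A minimal sketch, assuming the 3 managers plus 3 workers topology of this document; `verify_topology` is a hypothetical helper and is not part of `init/swarm-init.sh`:

```bash
# Count managers and workers and compare with the expected topology
# (defaults: 3 managers, 3 workers). Returns non-zero on a mismatch.
verify_topology() {
  want_managers="${1:-3}"
  want_workers="${2:-3}"
  docker node ls --format '{{.Hostname}}|{{.Status}}|{{.ManagerStatus}}' |
    awk -F'|' -v m="$want_managers" -v w="$want_workers" '
      $2 != "Ready"                       { bad++ }
      $3 == "Leader" || $3 == "Reachable" { managers++; next }
      $3 == ""                            { workers++; next }
      { bad++ }   # any other manager status, e.g. Unreachable
      END {
        printf "managers=%d workers=%d not_ok=%d\n", managers, workers, bad
        exit !(managers == m && workers == w && bad == 0)
      }'
}

# Example (on a manager):
#   verify_topology && echo "topology OK"
```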
## Step 7 — Confirm `init/swarm-init.sh` multi-node awareness
The script is idempotent (skips init if already active). Verify:
```bash
grep -n "swarm init\|swarm join" init/swarm-init.sh
```
The prod pipeline runs on iklim-app-01 only. iklim-app-02/03 are joined via Ansible (`swarm` role), not via the Gitea pipeline.
## Placement Constraints Used in Current Prod Stacks
| Constraint | Resolves to | Services |
|------------|-------------|----------|
| `node.hostname == iklim-app-01` | iklim-app-01 only | SWAG, cert-reloader |
| `node.labels.type == service` | iklim-app-01, iklim-app-02, iklim-app-03 | Vault, Redis, RabbitMQ, APISIX, Prometheus, Grafana, SWAG support services |
| `node.hostname == iklim-db-01/02/03` | specific DB node | Patroni, MongoDB, and etcd services pinned per node in `docker-stack-infra_db-prod.yml` |
| `node.labels.role == db` | iklim-db-01, iklim-db-02, iklim-db-03 | Generic DB node identity; retained for operations and compatibility |
SWAG and cert-reloader are pinned to `iklim-app-01` (the Floating IP node) because SWAG must match the public entry point. Vault is deployed by `docker-stack-vault.yml` across service nodes and reads certificates from `/opt/iklimco/ssl`. Microservices are distributed by the Swarm scheduler across app nodes. DB services are defined in `docker-stack-infra_db-prod.yml` and pinned to DB nodes by hostname constraints.
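
To confirm that a constraint had the intended effect, the nodes where a service's tasks actually run can be listed. A small sketch; `task_nodes` is a hypothetical helper, and the service name passed to it depends on the stack name used at deploy time, which this document does not specify:

```bash
# Show which node each running task of a service landed on.
task_nodes() {
  docker service ps "$1" --filter desired-state=running \
    --format '{{.Name}} {{.Node}}'
}

# Example (on a manager), with <service> replaced by a real service name:
#   task_nodes <service>
```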
## Historical / Superseded by Setup
Older notes that referred to `docker-stack-infra.yml`, `docker-stack-infra.prod.yml`, or `docker-stack-db.prod.yml` as the active prod deployment model are superseded by the current root production stack and workflow model.