# 01 — Docker Swarm Init (Prod — Multi-Node)
## Context

- **Repo:** `iklim.co` root
- **Environment:** prod
- **Topology:**
  - 3 × app nodes (`iklim-app-01/02/03`): all act as **Swarm managers AND app workers** and carry the `type=service` label. With 3 managers the Raft quorum is 2, so 1 manager can fail.
  - 3 × DB nodes (`iklim-db-01/02/03`): join the Swarm as **workers** with the `role=db` label; DB services are placed exclusively on them.
- **Sizing:** app nodes are `cpx42`, DB nodes are `cpx32`; see `../../hetzner-sizing-report.md`.
- **Network:** all 6 nodes are in the same private network.
- **Pipeline trigger:** a push to the `prod-env` branch starts the Gitea runner on `prod-runner` (the first app node).
- **App Swarm managers:** all 3 app nodes are manager-eligible and carry app workloads with the `type=service` label; there are no dedicated worker-only app nodes.
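
The "1 can fail" figure for three managers follows from Raft majority arithmetic. A minimal sketch of that calculation is shown here; the helper names `raft_quorum` and `raft_fault_tolerance` are hypothetical and exist only for illustration.

```bash
# Majority quorum for a Raft group of N managers: floor(N/2) + 1.
raft_quorum() {
  local managers=$1
  echo $(( managers / 2 + 1 ))
}

# Number of managers that can be lost while a majority still remains.
raft_fault_tolerance() {
  local managers=$1
  echo $(( managers - $(raft_quorum "$managers") ))
}

for n in 1 3 5; do
  echo "managers=$n quorum=$(raft_quorum "$n") tolerated_failures=$(raft_fault_tolerance "$n")"
done
```

For the 3-manager layout in this document the loop reports a quorum of 2 and one tolerated failure.
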

## Node labeling plan

| Node | Role | Swarm role | Labels |
|------|------|------------|--------|
| `iklim-app-01` | API services, SWAG, Vault | Manager + Worker | `type=service` |
| `iklim-app-02` | API services replicas | Manager + Worker | `type=service` |
| `iklim-app-03` | API services replicas | Manager + Worker | `type=service` |
| `iklim-db-01` | MongoDB replica + PostgreSQL (Patroni), etcd | Worker | `role=db`, `db-index=01` |
| `iklim-db-02` | MongoDB replica + PostgreSQL (Patroni), etcd | Worker | `role=db`, `db-index=02` |
| `iklim-db-03` | MongoDB replica + PostgreSQL (Patroni), etcd | Worker | `role=db`, `db-index=03` |

### Label scheme rationale

App nodes carry `type=service` and DB nodes carry `role=db`. The two different label keys are not an inconsistency; they describe different things:

- **`type=service`** means "this node carries service workload". It determines which node group microservices and infrastructure services (APISIX, Vault, RabbitMQ, Redis, SWAG, etc.) are scheduled on.
- **`role=db`** means "this node is a database node". It pins PostgreSQL (Patroni) and MongoDB exclusively to DB nodes.

Docker Swarm's **built-in** `node.role` property (`manager` / `worker`) does **not** conflict with the custom `node.labels.role` label, because the placement constraint syntax distinguishes them explicitly:

```
node.role == manager          ← Swarm built-in (manager/worker distinction)
node.labels.type == service   ← custom label (app node workload target)
node.labels.role == db        ← custom label (DB node workload target)
```

This scheme is applied consistently across the current prod stack (`docker-stack-infra_db-prod.yml`), the separate Vault stack (`docker-stack-vault.yml`), and the microservice stack definitions. The test environment uses the same `type=service` label on its service node, so both environments share the same constraint syntax.

`node.role == worker` is intentionally not used anywhere. DB nodes are Swarm workers, but targeting them via `node.role == worker` would also match any future worker-only app nodes. The explicit `node.labels.role == db` label targets DB nodes precisely and unambiguously, regardless of Swarm role.
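
To illustrate how these constraint expressions are passed to Swarm, here is a minimal sketch. The helper names `constraint_for` and `print_service_create`, the group names `app` / `db` / `manager`, and the service name and image are all hypothetical; only the constraint strings come from this document. The sketch prints the command instead of running it.

```bash
# Hypothetical helper: map a node group name to the placement constraint used in this document.
constraint_for() {
  case "$1" in
    app)     echo "node.labels.type == service" ;;
    db)      echo "node.labels.role == db" ;;
    manager) echo "node.role == manager" ;;
    *)       echo "unknown node group: $1" >&2; return 1 ;;
  esac
}

# Print (not run) a service-create command for a group; name and image are placeholders.
print_service_create() {
  local name=$1 group=$2 image=$3 constraint
  constraint=$(constraint_for "$group") || return 1
  echo "docker service create --name $name --constraint '$constraint' $image"
}

print_service_create example-db db example/image:latest
```

To see which nodes a label-based constraint would match, `docker node ls --filter node.label=role=db` run on a manager should list only the DB nodes.
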

## Automation Note

**IMPORTANT:** All of the Swarm initialization, join-token, and node labeling steps listed below are no longer performed manually. They are run automatically by `Environment_Infrastructure/ansible/prod/prod-bootstrap.yml` and the shared `swarm` role. The manual bash commands here are kept only for reference, information, and troubleshooting.

Labeling happens in two stages:

- The shared `swarm` role adds the `type=service` label to app nodes and the `role=db` label to DB nodes.
- The prod playbook, running from `iklim-app-01`, adds the `db-index=01/02/03` label to the DB nodes.
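
The end result of the two labeling stages can be sketched in shell. This is not the Ansible implementation; `labels_for_node` is a hypothetical helper that derives the labels from the hostname patterns used in this document, and the loop only prints the commands as a dry run.

```bash
# Hypothetical helper: print the labels a node should end up with, one per line.
labels_for_node() {
  local host=$1
  case "$host" in
    iklim-app-*)
      echo "type=service" ;;
    iklim-db-*)
      echo "role=db"
      # Stage 2: the index is the numeric suffix of the hostname (iklim-db-02 -> 02).
      echo "db-index=${host##*-}" ;;
    *)
      echo "unrecognized host: $host" >&2
      return 1 ;;
  esac
}

# Dry run: print the docker commands instead of executing them.
for host in iklim-app-01 iklim-db-02; do
  for label in $(labels_for_node "$host"); do
    echo "docker node update --label-add $label $host"
  done
done
```
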
## Step 1 — Init Swarm on iklim-app-01 (the prod-runner node)

```bash
# First address reported by the host; confirm it is the private-network IP.
MANAGER_IP=$(hostname -I | awk '{print $1}')

# Compare the whole state string: a substring match on "active" would also match "inactive".
if [ "$(docker info --format '{{.Swarm.LocalNodeState}}')" != "active" ]; then
  docker swarm init --advertise-addr "$MANAGER_IP"
  echo "Swarm initialized on $MANAGER_IP"
else
  echo "Swarm already active"
fi
```
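
`hostname -I` lists every address on the host, and the first one is not guaranteed to be on the private network. A minimal sketch of selecting the address by prefix follows. The helper `pick_ip_with_prefix` is hypothetical, and the `10.20.10.` prefix is an assumption inferred from the join address `10.20.10.11` used later in this document.

```bash
# Hypothetical helper: return the first address that starts with the given prefix.
# Usage: pick_ip_with_prefix <prefix> <addr> [<addr> ...]
pick_ip_with_prefix() {
  local prefix=$1 addr
  shift
  for addr in "$@"; do
    case "$addr" in
      "$prefix"*) echo "$addr"; return 0 ;;
    esac
  done
  echo "no address with prefix $prefix" >&2
  return 1
}

# Example with fixed input; on a real node the list would come from: $(hostname -I)
pick_ip_with_prefix "10.20.10." 203.0.113.7 10.20.10.11
```

On a node this could be used as `MANAGER_IP=$(pick_ip_with_prefix "10.20.10." $(hostname -I))`, leaving the rest of the init logic unchanged.
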
## Step 2 — Get manager join token

```bash
docker swarm join-token manager   # for iklim-app-02, iklim-app-03
```

Save this token; it is needed on `iklim-app-02` and `iklim-app-03`.
## Step 3 — Join iklim-app-02 and iklim-app-03 as managers

SSH into `iklim-app-02` and `iklim-app-03` and run:

```bash
docker swarm join --token <MANAGER_TOKEN> 10.20.10.11:2377
```
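
Running the join a second time on a node that is already a member returns an error, so a guarded variant can be useful when troubleshooting. The following is a minimal sketch; `join_swarm_if_needed` is a hypothetical helper, not part of the `swarm` role or `init/swarm-init.sh`.

```bash
# Join the Swarm only if this node is not already a member.
# Arguments: <join token> <manager address, e.g. 10.20.10.11:2377>
join_swarm_if_needed() {
  local token=$1 manager_addr=$2 state
  state=$(docker info --format '{{.Swarm.LocalNodeState}}')
  if [ "$state" = "active" ]; then
    echo "already a Swarm member, skipping join"
    return 0
  fi
  docker swarm join --token "$token" "$manager_addr"
}

# Example call on a joining node (commented out so the snippet has no side effects):
# join_swarm_if_needed "<MANAGER_TOKEN>" 10.20.10.11:2377
```

On `iklim-app-01`, `docker swarm join-token -q manager` prints only the token, which makes it easier to pass into a call like this.
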
## Step 4 — Label app nodes

On `iklim-app-01`, after `iklim-app-02` and `iklim-app-03` have joined:

```bash
for node in iklim-app-01 iklim-app-02 iklim-app-03; do
  docker node update --label-add type=service "$node"
done
```
## Step 5 — Join DB nodes as Swarm workers

Get the worker join token on `iklim-app-01`:

```bash
docker swarm join-token worker
```

SSH into each DB node and join:

```bash
docker swarm join --token <WORKER_TOKEN> 10.20.10.11:2377
```

Then label them on `iklim-app-01`:

```bash
docker node update --label-add role=db iklim-db-01
docker node update --label-add role=db iklim-db-02
docker node update --label-add role=db iklim-db-03

docker node update --label-add db-index=01 iklim-db-01
docker node update --label-add db-index=02 iklim-db-02
docker node update --label-add db-index=03 iklim-db-03
```

> DB nodes are Swarm **workers** only; they never become managers.
> DB services are pinned to them via the `node.labels.role == db` placement constraint.
> DB services are deployed by the current root production stack `docker-stack-infra_db-prod.yml`.
## Step 6 — Verify

```bash
docker node ls
```

Expected: 6 nodes, all with `STATUS` = `Ready`. The 3 app nodes show `MANAGER STATUS` = `Leader` or `Reachable`; the 3 DB nodes are workers, so their `MANAGER STATUS` column is empty.

```bash
docker node inspect iklim-app-01 --format '{{.Spec.Labels}}'
docker node inspect iklim-db-01 --format '{{.Spec.Labels}}'
```

Expected: `map[type:service]` for app nodes and `map[db-index:01 role:db]` for `iklim-db-01` (with `02` and `03` for the other DB nodes).
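
The node-count expectation can also be checked mechanically. The sketch below assumes the node list is piped in as `hostname|status|manager-status` lines and that workers report an empty manager status; `check_swarm_nodes` is a hypothetical helper.

```bash
# Hypothetical check: read "hostname|status|manager-status" lines on stdin and
# confirm the expected 3 managers + 3 workers, all Ready.
check_swarm_nodes() {
  local managers=0 workers=0 not_ready=0 host status mgr
  while IFS='|' read -r host status mgr; do
    [ -n "$host" ] || continue
    [ "$status" = "Ready" ] || not_ready=$((not_ready + 1))
    case "$mgr" in
      Leader|Reachable) managers=$((managers + 1)) ;;
      "")               workers=$((workers + 1)) ;;
    esac
  done
  echo "managers=$managers workers=$workers not_ready=$not_ready"
  [ "$managers" -eq 3 ] && [ "$workers" -eq 3 ] && [ "$not_ready" -eq 0 ]
}

# On a manager, the input could be produced with:
#   docker node ls --format '{{.Hostname}}|{{.Status}}|{{.ManagerStatus}}' | check_swarm_nodes
```

The function prints the counts it found and returns a non-zero status when they differ from the 3 + 3 layout, so it can gate a later step in a script.
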
## Step 7 — Confirm `init/swarm-init.sh` multi-node awareness

The script is idempotent: it skips init if the Swarm is already active. Verify:

```bash
grep -n "swarm init\|swarm join" init/swarm-init.sh
```

The prod pipeline runs on `iklim-app-01` only. `iklim-app-02` and `iklim-app-03` are joined via Ansible (the `swarm` role), not via the Gitea pipeline.

## Placement Constraints Used in Current Prod Stacks

| Constraint | Resolves to | Services |
|------------|-------------|----------|
| `node.hostname == iklim-app-01` | iklim-app-01 only | SWAG, cert-reloader |
| `node.labels.type == service` | iklim-app-01, iklim-app-02, iklim-app-03 | Vault, Redis, RabbitMQ, APISIX, Prometheus, Grafana, SWAG support services |
| `node.hostname == iklim-db-01/02/03` | specific DB node | Patroni, MongoDB, and etcd services pinned per node in `docker-stack-infra_db-prod.yml` |
| `node.labels.role == db` | iklim-db-01, iklim-db-02, iklim-db-03 | Generic DB node identity; retained for operations and compatibility |

SWAG and cert-reloader are pinned to `iklim-app-01` (the Floating IP node) because SWAG must match the public entry point. Vault is deployed by `docker-stack-vault.yml` across service nodes and reads certificates from `/opt/iklimco/ssl`. Microservices are distributed by the Swarm scheduler across app nodes. DB services are defined in `docker-stack-infra_db-prod.yml` and pinned to DB nodes by hostname constraints.
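
When troubleshooting, the constraints that a deployed service actually carries can be read back from Swarm. The following is a minimal sketch; the helper names and the example service name are hypothetical, and the check is a literal string match, so spacing must match how the constraint was written in the stack file.

```bash
# Print the placement constraints of a deployed service as JSON.
service_constraints() {
  docker service inspect --format '{{json .Spec.TaskTemplate.Placement.Constraints}}' "$1"
}

# Succeed only if the service carries the expected constraint string.
service_has_constraint() {
  local service=$1 expected=$2
  service_constraints "$service" | grep -qF "\"$expected\""
}

# Example (service name is a placeholder, not taken from the stack files):
# service_has_constraint example_swag "node.hostname == iklim-app-01" && echo "pinned"
```
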

## Historical / Superseded by Setup

Older notes that referred to `docker-stack-infra.yml`, `docker-stack-infra.prod.yml`, or `docker-stack-db.prod.yml` as the active prod deployment model are superseded by the current root production stack and workflow model.