# 01 — Docker Swarm Init (Prod — Multi-Node)

## Context

- **Repo:** `iklim.co` root
- **Environment:** prod
- **Topology:**
  - 3 × app nodes (`iklim-app-01/02/03`): all act as **Swarm managers AND app workers** with the `type=service` label (Raft quorum: one manager can fail)
  - 3 × DB nodes (`iklim-db-01/02/03`): join Swarm as **workers** with the `role=db` label; DB services are placed exclusively on them
- **Sizing:** app nodes are `cpx42`, DB nodes are `cpx32`; see `../../hetzner-sizing-report.md`
- All 6 nodes are in the same private network.
- Pipeline trigger: push to `prod-env` branch → Gitea runner on `prod-runner` (first app node).
- App Swarm managers: all 3 app nodes are manager-eligible and carry app workloads with the `type=service` label (there are no dedicated worker-only app nodes).

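
The topology notes that a three-manager Raft quorum survives the loss of one manager. The arithmetic behind that can be sketched as follows; the function names are illustrative and not part of the repo:

```bash
# Raft needs a strict majority of managers to agree: quorum = floor(N/2) + 1.
raft_quorum() {
  local managers=$1
  echo $(( managers / 2 + 1 ))
}

# Managers that may fail while a majority still exists: floor((N-1)/2).
raft_fault_tolerance() {
  local managers=$1
  echo $(( (managers - 1) / 2 ))
}

for n in 1 3 5; do
  echo "managers=$n quorum=$(raft_quorum "$n") tolerated_failures=$(raft_fault_tolerance "$n")"
done
```

For the three app nodes this gives a quorum of 2 and one tolerated failure. Losing two managers at once would leave the cluster unable to accept management changes until quorum is restored.
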
## Node labeling plan

| Node | Role | Swarm role | Labels |
|------|------|------------|--------|
| `iklim-app-01` | API services, SWAG, Vault | Manager + Worker | `type=service` |
| `iklim-app-02` | API services replicas | Manager + Worker | `type=service` |
| `iklim-app-03` | API services replicas | Manager + Worker | `type=service` |
| `iklim-db-01` | MongoDB replica + PostgreSQL (Patroni), etcd | Worker | `role=db`, `db-index=01` |
| `iklim-db-02` | MongoDB replica + PostgreSQL (Patroni), etcd | Worker | `role=db`, `db-index=02` |
| `iklim-db-03` | MongoDB replica + PostgreSQL (Patroni), etcd | Worker | `role=db`, `db-index=03` |

### Label scheme rationale

App nodes carry `type=service`, DB nodes carry `role=db`. The two different label keys are not an inconsistency; they describe different things:

- **`type=service`**: "this node carries service workload"; determines which node group microservices and infrastructure services (APISIX, Vault, RabbitMQ, Redis, SWAG, etc.) are scheduled on.
- **`role=db`**: "this node is a database node"; pins PostgreSQL (Patroni), MongoDB, and their proxy services exclusively to DB nodes.

Docker Swarm's **built-in** `node.role` property (`manager` / `worker`) does **not** conflict with the custom `node.labels.role` label, because the placement constraint syntax distinguishes them explicitly:

```
node.role == manager          ← Swarm built-in (manager/worker distinction)
node.labels.type == service   ← custom label (app node workload target)
node.labels.role == db        ← custom label (DB node workload target)
```

This scheme is applied consistently across `docker-stack-infra.yml` and all 10 microservice `docker-stack-service.yml` files. The test environment uses the same `type=service` label on its single node, so both environments share the same constraint syntax.

`node.role == worker` is intentionally not used anywhere. DB nodes are Swarm workers, but targeting them via `node.role == worker` would also match any future worker-only app nodes. The explicit `node.labels.role == db` label provides precise, unambiguous targeting regardless of Swarm role.

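
The difference between the built-in role and the custom labels can also be observed on a live cluster by querying each separately. A minimal sketch, assuming it runs on a manager node; the helper names `nodes_with_label` and `nodes_with_role` are illustrative and not part of the repo:

```bash
# List hostnames of nodes carrying a custom Swarm node label, given as key=value.
nodes_with_label() {
  docker node ls --filter "node.label=$1" --format '{{.Hostname}}'
}

# List hostnames by the built-in Swarm role: manager or worker.
nodes_with_role() {
  docker node ls --filter "role=$1" --format '{{.Hostname}}'
}
```

With the topology in this guide, `nodes_with_label role=db` and `nodes_with_role worker` should both list the three DB nodes. They would diverge as soon as a worker-only app node joins, which is the case the label-based constraints guard against.
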
## Step 1 — Init Swarm on iklim-app-01 (the prod-runner node)

```bash
# Assumes the first address reported by `hostname -I` is the private-network IP.
MANAGER_IP=$(hostname -I | awk '{print $1}')
# Compare the state exactly: a plain grep for "active" would also match "inactive".
if [ "$(docker info --format '{{.Swarm.LocalNodeState}}')" != "active" ]; then
  docker swarm init --advertise-addr "$MANAGER_IP"
  echo "Swarm initialized on $MANAGER_IP"
else
  echo "Swarm already active"
fi
```

## Step 2 — Get manager join token

```bash
docker swarm join-token manager   # for iklim-app-02, iklim-app-03
```

Save this token; it is needed on iklim-app-02 and iklim-app-03.

## Step 3 — Join iklim-app-02 and iklim-app-03 as managers

SSH into iklim-app-02 and iklim-app-03 and run:

```bash
docker swarm join --token <MANAGER_TOKEN> 10.20.10.11:2377
```

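
When the join is scripted rather than typed by hand, the token can be fetched and the full command assembled in one place. A sketch, assuming it runs on iklim-app-01; `print_join_command` and the `MANAGER_ADDR` variable are hypothetical names, and the default address is the one used in this guide:

```bash
# Print the join command for the given role: "manager" or "worker".
print_join_command() {
  local role=$1
  # MANAGER_ADDR is an illustrative override; it defaults to this guide's address.
  local addr=${MANAGER_ADDR:-10.20.10.11:2377}
  local token
  # -q prints only the token instead of the full help text.
  token=$(docker swarm join-token -q "$role") || return 1
  echo "docker swarm join --token $token $addr"
}
```

`print_join_command manager` prints the line to run on iklim-app-02 and iklim-app-03. The token grants cluster membership, so the output should be treated as a secret and kept out of pipeline logs.
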
## Step 4 — Label app nodes

On iklim-app-01, after iklim-app-02 and iklim-app-03 have joined:

```bash
for node in iklim-app-01 iklim-app-02 iklim-app-03; do
  docker node update --label-add type=service "$node"
done
```

## Step 5 — Join DB nodes as Swarm workers

Get the worker join token on iklim-app-01:

```bash
docker swarm join-token worker
```

SSH into each DB node and join:

```bash
docker swarm join --token <WORKER_TOKEN> 10.20.10.11:2377
```

Then label them on iklim-app-01:

```bash
docker node update --label-add role=db --label-add db-index=01 iklim-db-01
docker node update --label-add role=db --label-add db-index=02 iklim-db-02
docker node update --label-add role=db --label-add db-index=03 iklim-db-03
```

> DB nodes are Swarm **workers** only — they never become managers.
> DB services are pinned to them via `node.labels.role == db` placement constraint.
> See `08-prod-db-cluster-kurulum.md` for DB stack deployment.

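
The three label commands in this step follow one pattern, so they could be collapsed into a loop. A sketch, assuming the `db-index` value always equals the hostname suffix after the last hyphen; `label_db_nodes` is a hypothetical helper, not something in the repo:

```bash
# Label each given DB node with role=db and a db-index taken from its hostname.
label_db_nodes() {
  local node
  for node in "$@"; do
    # ${node##*-} strips everything up to the last hyphen: iklim-db-02 -> 02
    docker node update --label-add role=db --label-add "db-index=${node##*-}" "$node"
  done
}
```

Run on iklim-app-01 as `label_db_nodes iklim-db-01 iklim-db-02 iklim-db-03`. `--label-add` overwrites an existing value for the same key, so re-running it is harmless.
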
## Step 6 — Verify

```bash
docker node ls
```

Expected: 6 nodes, all with `STATUS` = `Ready`. The 3 managers show `MANAGER STATUS` = `Leader` or `Reachable`; the 3 workers show an empty `MANAGER STATUS`.

```bash
docker node inspect iklim-app-01 --format '{{.Spec.Labels}}'
docker node inspect iklim-db-01 --format '{{.Spec.Labels}}'
```

Expected: `map[type:service]` for app nodes, `map[db-index:01 role:db]` (and likewise `02`, `03`) for DB nodes.

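
The node-count part of this check can be automated, for example as a gate in a pipeline step. A minimal sketch, assuming it runs on a manager; `check_swarm_shape` is an illustrative name, and it counts membership only, not whether each node is `Ready`:

```bash
# Succeeds only if the Swarm has the expected number of managers and workers.
check_swarm_shape() {
  local want_managers=$1 want_workers=$2
  local managers workers
  # grep -c . counts the non-empty lines, i.e. one per hostname.
  managers=$(docker node ls --filter role=manager --format '{{.Hostname}}' | grep -c .)
  workers=$(docker node ls --filter role=worker --format '{{.Hostname}}' | grep -c .)
  echo "managers=$managers workers=$workers"
  [ "$managers" -eq "$want_managers" ] && [ "$workers" -eq "$want_workers" ]
}
```

For this topology the call would be `check_swarm_shape 3 3`, which returns a non-zero exit status if a node has not joined yet.
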
## Step 7 — Confirm `init/swarm-init.sh` multi-node awareness

The script is idempotent (skips init if already active). Verify:

```bash
grep -n "swarm init\|swarm join" init/swarm-init.sh
```

The prod pipeline runs on iklim-app-01 only. iklim-app-02/03 are joined via Ansible (`swarm` role), not via the Gitea pipeline.

## Placement constraints used in `docker-stack-infra.yml`

| Constraint | Resolves to | Services |
|------------|-------------|----------|
| `node.hostname == iklim-app-01` | iklim-app-01 only | SWAG, cert-reloader |
| `node.labels.type == service` | iklim-app-01, iklim-app-02, iklim-app-03 | Vault, Redis, RabbitMQ, APISIX, Prometheus, Grafana, etcd (idle in prod; APISIX uses Patroni etcd) |
| `node.labels.role == db` | iklim-db-01, iklim-db-02, iklim-db-03 | PostgreSQL, MongoDB, pg-proxy, mongo-proxy |

SWAG and cert-reloader are pinned to `iklim-app-01` (the Floating IP node) because SWAG does not support clustering and must match the public entry point. Vault floats across all service nodes; its TLS cert is read from StorageBox (`/mnt/storagebox/ssl`) so it is available on whichever node Vault is scheduled on. Microservices carry no placement constraint and are distributed by the Swarm scheduler across all app nodes. DB services are pinned to DB nodes via separate DB stacks.

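
After a stack is deployed, a constraint can be checked against where the tasks actually landed. A sketch, assuming it runs on a manager; `service_on_labeled_nodes` is a hypothetical helper, and the service name passed to it is whatever `docker service ls` reports for the deployed stack:

```bash
# Succeeds if every running task of a service sits on a node with the given label.
service_on_labeled_nodes() {
  local service=$1 label=$2
  local allowed node
  # Hostnames that carry the custom node label, given as key=value.
  allowed=$(docker node ls --filter "node.label=$label" --format '{{.Hostname}}')
  for node in $(docker service ps "$service" --filter desired-state=running --format '{{.Node}}'); do
    # -x matches the whole line, -F treats the hostname as a fixed string.
    if ! printf '%s\n' "$allowed" | grep -qxF "$node"; then
      echo "task of $service on unexpected node: $node" >&2
      return 1
    fi
  done
  return 0
}
```

For example, `service_on_labeled_nodes <service-name> role=db` should succeed for every service in the last table row and fail if a task ever lands on an app node.
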