Integrate DB nodes into Swarm and refine prod service deployment

- Database nodes now join the Docker Swarm as workers with `role=db` labels, allowing Swarm to manage their dedicated services.
- The `docker-stack-infra.yml` has been updated for production to focus solely on application-level infrastructure components.
- Dedicated database services (PostgreSQL, MongoDB, Patroni-etcd) are now explicitly deployed in separate Swarm stacks on `iklim-db-XX` nodes.
- Standardizes node naming conventions (`iklim-app-XX`, `iklim-db-XX`) across the production roadmap documentation.
- Clarifies that the `etcd` service within `docker-stack-infra.yml` is exclusively for APISIX configuration, distinct from Patroni's etcd cluster.
Murat ÖZDEMİR 2026-05-11 14:53:21 +03:00
parent 720c79d460
commit 76f87aa2f9
8 changed files with 130 additions and 108 deletions

View File

@ -4,79 +4,103 @@
- **Repo:** `iklim.co` root
- **Environment:** prod
- **Topology:**
- 3 × service nodes — all act as **Swarm managers AND app workers** (Raft quorum: 1 can fail)
- 3 × DB nodes — **NOT part of Docker Swarm** (separate DB cluster, out of scope)
- 3 × app nodes (`iklim-app-01/02/03`) — all act as **Swarm managers AND app workers** (Raft quorum: 1 can fail)
- 3 × DB nodes (`iklim-db-01/02/03`) — join Swarm as **workers** with `role=db` label; DB services are placed exclusively on them
- **Sizing:** app nodes are `cpx42`, DB nodes are `cpx32`; see `../../hetzner-sizing-report.md`
- All 6 nodes are in the same private network.
- Pipeline trigger: push to `prod-env` branch → Gitea runner on `prod-runner` (first service node).
- Swarm has 3 nodes total; all are manager-eligible and carry workloads (no dedicated worker-only nodes).
- Pipeline trigger: push to `prod-env` branch → Gitea runner on `prod-runner` (first app node).
- App Swarm managers: all 3 app nodes are manager-eligible and carry app workloads (no dedicated worker-only app nodes).
## Node labeling plan
| Node | Role | Swarm role | Labels |
|------|------|------------|--------|
| service-1 | API services, SWAG, Vault | Manager + Worker | `type=service` |
| service-2 | API services replicas | Manager + Worker | `type=service` |
| service-3 | API services replicas | Manager + Worker | `type=service` |
| `iklim-app-01` | API services, SWAG, Vault | Manager + Worker | `type=service` |
| `iklim-app-02` | API services replicas | Manager + Worker | `type=service` |
| `iklim-app-03` | API services replicas | Manager + Worker | `type=service` |
| `iklim-db-01` | PostgreSQL (Patroni), etcd | Worker | `role=db` |
| `iklim-db-02` | PostgreSQL (Patroni), etcd | Worker | `role=db` |
| `iklim-db-03` | MongoDB replica + PostgreSQL (Patroni), etcd | Worker | `role=db` |
> DB nodes (`db-1/2/3`) are **not part of Docker Swarm**. They run as a separate cluster
> and are provisioned independently. No Swarm join or label step applies to them.
## Step 1 — Init Swarm on service-1 (the prod-runner node)
## Step 1 — Init Swarm on iklim-app-01 (the prod-runner node)
```bash
MANAGER_IP=$(hostname -I | awk '{print $1}')
if ! docker info --format '{{.Swarm.LocalNodeState}}' | grep -q "active"; then
  docker swarm init --advertise-addr "$MANAGER_IP"
  echo "Swarm initialized on $MANAGER_IP"
else
  echo "Swarm already active"
fi
```
## Step 2 — Get manager join token
```bash
docker swarm join-token manager # for service-2, service-3
docker swarm join-token manager # for iklim-app-02, iklim-app-03
```
Save this token — needed on service-2 and service-3.
Save this token — needed on iklim-app-02 and iklim-app-03.
## Step 3 — Join service-2 and service-3 as managers
## Step 3 — Join iklim-app-02 and iklim-app-03 as managers
SSH into service-2 and service-3, run:
SSH into iklim-app-02 and iklim-app-03, run:
```bash
docker swarm join --token <MANAGER_TOKEN> <service-1-ip>:2377
docker swarm join --token <MANAGER_TOKEN> 10.10.10.11:2377
```
## Step 4 — Label all Swarm nodes
## Step 4 — Label app nodes
On service-1, after service-2 and service-3 have joined:
On iklim-app-01, after iklim-app-02 and iklim-app-03 have joined:
```bash
for node in service-1 service-2 service-3; do
for node in iklim-app-01 iklim-app-02 iklim-app-03; do
docker node update --label-add type=service "$node"
done
```
> Replace `service-1`, etc. with actual node hostnames shown in `docker node ls`.
> DB nodes are not in Swarm — no join or label step for them.
## Step 5 — Join DB nodes as Swarm workers
## Step 5 — Verify
Get the worker join token on iklim-app-01:
```bash
docker swarm join-token worker
```
SSH into each DB node and join:
```bash
docker swarm join --token <WORKER_TOKEN> 10.10.10.11:2377
```
Then label them on iklim-app-01:
```bash
for node in iklim-db-01 iklim-db-02 iklim-db-03; do
docker node update --label-add role=db "$node"
done
```
> DB nodes are Swarm **workers** only — they never become managers.
> DB services are pinned to them via `node.labels.role == db` placement constraint.
> See `08-prod-db-cluster-kurulum.md` for DB stack deployment.
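For orientation, a minimal sketch of how one of those DB stacks pins a service to these workers (service name, image, and other fields are placeholders; the real definitions live in `08-prod-db-cluster-kurulum.md`):
```yaml
version: "3.8"
services:
  patroni-01:                        # hypothetical service name
    image: example/patroni:latest    # placeholder image
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.role == db   # only iklim-db-01/02/03 carry this label
```
Swarm will never schedule such a service onto the app nodes, because only the DB workers carry the `role=db` label.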
## Step 6 — Verify
```bash
docker node ls
```
Expected: 3 nodes, all with `MANAGER STATUS` = `Leader` or `Reachable`.
All 3 nodes remain in `AVAILABILITY=Active` (not drained) so they also carry workloads.
Expected: 6 nodes — 3 with `MANAGER STATUS` = `Leader` or `Reachable`, 3 workers with `Ready`.
```bash
docker node inspect service-1 --format '{{.Spec.Labels}}'
docker node inspect iklim-app-01 --format '{{.Spec.Labels}}'
docker node inspect iklim-db-01 --format '{{.Spec.Labels}}'
```
Expected: `map[type:service]`.
Expected: `map[type:service]` for app nodes, `map[role:db]` for DB nodes.
## Step 6 — Confirm `init/swarm-init.sh` multi-node awareness
## Step 7 — Confirm `init/swarm-init.sh` multi-node awareness
The script is idempotent (skips init if already active). Verify:
@ -84,18 +108,17 @@ The script is idempotent (skips init if already active). Verify:
```bash
grep -n "swarm init\|swarm join" init/swarm-init.sh
```
The prod pipeline runs on service-1 only. service-2/3 are joined via Ansible (`swarm` role),
The prod pipeline runs on iklim-app-01 only. iklim-app-02/03 are joined via Ansible (`swarm` role),
not via the Gitea pipeline.
## Placement constraints used in `docker-stack-infra.yml`
| Constraint | Resolves to |
|------------|-------------|
| `node.role == manager` | service-1, service-2, service-3 |
| `node.labels.type == service` | service-1, service-2, service-3 |
| `node.role == manager` | iklim-app-01, iklim-app-02, iklim-app-03 |
| `node.labels.type == service` | iklim-app-01, iklim-app-02, iklim-app-03 |
| `node.labels.role == db` | iklim-db-01, iklim-db-02, iklim-db-03 |
SWAG, Vault, cert-reloader: pinned to `node.role == manager`.
Microservices: no constraint (distributed across all 3 service nodes by Swarm scheduler).
> `node.labels.type == db` constraint is **not used** — DB nodes are not in Swarm.
> PostgreSQL and MongoDB run outside Swarm as a separately managed cluster.
Microservices: no constraint (distributed across all app nodes by Swarm scheduler).
DB services (Patroni, etcd, MongoDB): pinned to `node.labels.role == db` in separate DB stacks.
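To spot-check that these constraints resolve as intended after deployment, something like the following can be used (a sketch; `iklimco_swag` follows the `iklimco_*` service naming used elsewhere in this roadmap):
```bash
# Show which node each task of a manager-pinned service landed on
docker service ps iklimco_swag --format 'table {{.Name}}\t{{.Node}}\t{{.CurrentState}}'

# Dump every node's labels to confirm the type=service / role=db split
for n in $(docker node ls --format '{{.Hostname}}'); do
  echo "$n: $(docker node inspect "$n" --format '{{.Spec.Labels}}')"
done
```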

View File

@ -32,7 +32,7 @@ No additional action needed in the repo.
## Step 3 — (Handled by pipeline) Write credentials file on prod host
The deploy pipeline (see `08-deploy-pipeline-update.md`) runs on service-1:
The deploy pipeline (see `08-deploy-pipeline-update.md`) runs on iklim-app-01:
```bash
mkdir -p /opt/iklimco/swag/dns-conf
```
@ -42,16 +42,16 @@ chmod 600 /opt/iklimco/swag/dns-conf/godaddy.ini
## Step 4 — GoDaddy A records for prod subdomains
In GoDaddy DNS panel for `iklim.co`, add/update A records pointing to service-1's public IP:
In GoDaddy DNS panel for `iklim.co`, add/update A records pointing to iklim-app-01's public IP:
| Record | Value |
|--------|-------|
| `api` | `<service-1-public-ip>` |
| `apigw` | `<service-1-public-ip>` |
| `rabbitmq` | `<service-1-public-ip>` |
| `grafana` | `<service-1-public-ip>` |
| `api` | `<iklim-app-01-public-ip>` |
| `apigw` | `<iklim-app-01-public-ip>` |
| `rabbitmq` | `<iklim-app-01-public-ip>` |
| `grafana` | `<iklim-app-01-public-ip>` |
> Swarm's routing mesh means any node IP would work, but service-1 is the designated
> Swarm's routing mesh means any node IP would work, but iklim-app-01 is the designated
> entry point (runs SWAG). Using a single IP keeps DNS simple.
>
> For HA: add a load balancer or use Hetzner's floating IP in front of the 3 service nodes.
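A quick way to confirm the records have propagated (an illustrative check; any DNS client works):
```bash
# Each lookup should return iklim-app-01's public IP once the A records are in place
for sub in api apigw rabbitmq grafana; do
  echo -n "${sub}.iklim.co -> "
  dig +short "${sub}.iklim.co"
done
```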

View File

@ -2,41 +2,22 @@
## Context
- **File:** `docker-stack-infra.yml` (repo root — shared between test and prod)
- All changes from `test-env-setup/03-infra-stack-changes.md` apply here identically.
- All changes from `test-env/03-infra-stack-changes.md` apply here identically.
- **Additional prod-specific changes:**
- PostgreSQL and MongoDB placement constraints point to `type=db` nodes.
- Microservices have no constraint (distributed across service nodes by Swarm).
- Microservices have no constraint (distributed across app nodes by Swarm).
- Replica counts for stateless services are increased.
- **Note:** PostgreSQL and MongoDB are **not** in `docker-stack-infra.yml` for prod. They run on
dedicated DB nodes in separate stacks (`iklim-db` and `iklim-patroni`). See `08-prod-db-cluster-kurulum.md`.
## Step 1 — Apply all test-env changes first
Follow every step in `test-env-setup/03-infra-stack-changes.md`:
Follow every step in `test-env/03-infra-stack-changes.md`:
- Add `swag` service
- Add `cert-reloader` service
- Remove published ports for vault, apisix, rabbitmq, prometheus, grafana, apisix-dashboard
- Add `swag-vl` volume
## Step 2 — Update PostgreSQL placement constraint
Change `postgres` service placement to use the `type=db` label:
```yaml
# CHANGE in postgres service:
placement:
constraints:
- node.labels.type == db
```
## Step 3 — Update MongoDB placement constraint
```yaml
# CHANGE in mongo service:
placement:
constraints:
- node.labels.type == db
```
## Step 4 — Pin Vault to manager node (initial prod — single instance)
## Step 2 — Pin Vault to manager node (initial prod — single instance)
Vault starts as a single instance pinned to the manager node.
Raft cluster migration is handled separately in `07-vault-raft-plan.md`.
@ -48,7 +29,7 @@ Raft cluster migration is handled separately in `07-vault-raft-plan.md`.
```yaml
      - node.role == manager
```
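Put together, the Phase 1 vault deploy block would look roughly like this (a sketch consistent with Phase 1 of `07-vault-raft-plan.md`; surrounding service keys are omitted):
```yaml
# vault service, initial prod (single instance, pinned to a manager)
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == manager
```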
## Step 5 — Increase APISIX replicas for prod
## Step 3 — Increase APISIX replicas for prod
```yaml
# CHANGE in apisix service deploy block:
```
@ -59,40 +40,46 @@ Raft cluster migration is handled separately in `07-vault-raft-plan.md`.
APISIX is stateless (config in etcd) — multiple replicas are safe.
Swarm load-balances SWAG's requests across APISIX replicas via VIP.
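For reference, a sketch of the resulting deploy block (the replica count of 2 matches the placement summary at the end of this file; indentation assumes the usual compose layout):
```yaml
# apisix service (sketch of the edited deploy block)
    deploy:
      replicas: 2
      # no placement constraint: Swarm spreads the replicas across the app nodes
```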
## Step 6 — etcd: 3-node cluster for prod
## Step 4 — etcd: single instance in docker-stack-infra.yml (APISIX config store only)
For prod, etcd should run as a 3-node cluster (minimum for Raft quorum).
The current single-instance etcd definition needs to be replaced with a 3-node
StatefulSet-style setup using separate service definitions or a dedicated
`docker-stack-etcd.yml`.
The `etcd` service in `docker-stack-infra.yml` is used exclusively by APISIX as its configuration
store. It runs as a single instance on a manager node and is separate from the etcd cluster used by
Patroni for PostgreSQL HA.
> **Scope note:** etcd clustering for prod is complex and out of scope for initial launch.
> Deploy with single etcd for initial prod launch. Add etcd clustering as a follow-up task.
> Track in: `Technical Debt/TODO.md`
```yaml
# etcd placement stays as:
placement:
constraints:
- node.role == manager
```
## Step 7 — Verify the complete file
> The 3-node etcd cluster for Patroni/PostgreSQL HA is deployed separately via `08-prod-db-cluster-kurulum.md`
> on the dedicated DB nodes. These are two independent etcd deployments with different purposes.
## Step 5 — Verify the complete file
After all edits, validate the YAML:
```bash
docker stack config -c docker-stack-infra.yml > /dev/null && echo "YAML valid"
```
No output errors = valid.
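Once the file validates, deploying (or re-deploying) the stack is a single command; the stack name `iklimco` is an assumption inferred from the `iklimco_*` service names used in this roadmap:
```bash
docker stack deploy -c docker-stack-infra.yml iklimco
```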
## Placement summary for prod
## Placement summary for prod (docker-stack-infra.yml only)
| Service | Placement |
|---------|-----------|
| swag | `node.role == manager` |
| cert-reloader | `node.role == manager` |
| vault | `node.role == manager` |
| apisix (2 replicas) | no constraint (any node) |
| apisix (2 replicas) | no constraint (distributed across app nodes) |
| apisix-dashboard | no constraint |
| postgres | `node.labels.type == db` |
| mongo | `node.labels.type == db` |
| redis | `node.role == manager` |
| rabbitmq | `node.role == manager` |
| etcd | `node.role == manager` |
| etcd (APISIX store) | `node.role == manager` |
| prometheus | `node.role == manager` |
| grafana | `node.role == manager` |
> PostgreSQL and MongoDB are deployed in separate DB stacks on `iklim-db-*` nodes.
> See `08-prod-db-cluster-kurulum.md` for those stacks.

View File

@ -48,7 +48,7 @@ will contain `server_name api.iklim.co;` — correct for prod.
## Verification
After deploy, on service-1:
After deploy, on iklim-app-01:
```bash
cat /opt/iklimco/swag/proxy-confs/api.conf | grep server_name
```

View File

@ -20,20 +20,20 @@ No cross-node distribution needed.
## Future behavior (3-node Vault Raft — see step 07)
When Vault runs on service-1, service-2, service-3:
When Vault runs on iklim-app-01, iklim-app-02, iklim-app-03:
```
cert-reloader detects cert change
→ copies cert to /opt/iklimco/ssl/ on service-1 (local)
→ SSH copy to service-2:/opt/iklimco/ssl/
→ SSH copy to service-3:/opt/iklimco/ssl/
→ copies cert to /opt/iklimco/ssl/ on iklim-app-01 (local)
→ SSH copy to iklim-app-02:/opt/iklimco/ssl/
→ SSH copy to iklim-app-03:/opt/iklimco/ssl/
→ docker service update --force iklimco_vault (restarts all 3 replicas)
```
This requires:
- An SSH key that cert-reloader can use to reach service-2 and service-3
- An SSH key that cert-reloader can use to reach iklim-app-02 and iklim-app-03
- That key mounted as a Docker secret into cert-reloader
- Known_hosts for service-2 and service-3 pre-configured
- Known_hosts for iklim-app-02 and iklim-app-03 pre-configured
Script update for this phase is tracked in `07-vault-raft-plan.md`.
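A rough sketch of that one-time prep (paths are placeholders; the secret name matches `cert_reloader_ssh_key` referenced in `07-vault-raft-plan.md`):
```bash
# Generate a dedicated key for cert-reloader and store it as a Swarm secret
ssh-keygen -t ed25519 -N "" -f /opt/iklimco/secrets/cert_reloader_ssh_key
docker secret create cert_reloader_ssh_key /opt/iklimco/secrets/cert_reloader_ssh_key

# Pre-seed known_hosts for the target nodes
ssh-keyscan iklim-app-02 iklim-app-03 > /opt/iklimco/secrets/known_hosts
# (append cert_reloader_ssh_key.pub to authorized_keys on iklim-app-02 and iklim-app-03)
```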

View File

@ -1,7 +1,7 @@
# 07 — Vault: Initial Single Instance + Raft Cluster Migration Plan (Prod)
## Context
Vault starts as a single instance on the manager node (service-1) for the initial prod launch.
Vault starts as a single instance on the manager node (iklim-app-01) for the initial prod launch.
This matches the current `docker-stack-infra.yml` configuration (file storage, single replica).
Raft HA cluster is planned for a later phase.
@ -9,8 +9,8 @@ Raft HA cluster is planned for a later phase.
## Phase 1 — Initial prod launch (current)
- **Replicas:** 1
- **Storage:** file (`/vault/file`) on service-1
- **Placement:** `node.role == manager` (service-1)
- **Storage:** file (`/vault/file`) on iklim-app-01
- **Placement:** `node.role == manager` (iklim-app-01)
- **Cert:** from `/opt/iklimco/ssl/` (populated by cert-reloader from SWAG volume)
- **TLS:** `VAULT_LOCAL_CONFIG` unchanged — `api_addr: https://vault.iklim.co:8200`
@ -22,14 +22,14 @@ No changes to `docker-stack-infra.yml` vault service for Phase 1.
- **Replicas:** 3 (one per service node)
- **Storage:** Raft integrated (replaces file storage)
- **Placement:** `node.labels.type == service` (all 3 service nodes)
- **Cert distribution:** cert-reloader SSH-copies renewed cert to service-2, service-3
- **Cert distribution:** cert-reloader SSH-copies renewed cert to iklim-app-02, iklim-app-03
### Prerequisites before migration
- [ ] All 3 service nodes are running and labeled `type=service`
- [ ] Vault data backed up from Phase 1 (file storage backend: archive the `/vault/file` data; `vault operator raft snapshot save` only applies after the Raft migration)
- [ ] SSH key created for cert-reloader to reach service-2 and service-3
- [ ] SSH key created for cert-reloader to reach iklim-app-02 and iklim-app-03
- [ ] SSH key stored as Docker secret `cert_reloader_ssh_key`
- [ ] `/opt/iklimco/ssl/` directory exists on service-2 and service-3
- [ ] `/opt/iklimco/ssl/` directory exists on iklim-app-02 and iklim-app-03
- [ ] Vault data directory `/opt/iklimco/vault/data/` exists on all 3 nodes (host path volumes)
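The directory prerequisites in the checklist above can be prepared with something like this (a sketch; it assumes root SSH access from iklim-app-01 to the other app nodes):
```bash
# Create the cert and Vault data directories on all three app nodes
mkdir -p /opt/iklimco/ssl /opt/iklimco/vault/data   # locally on iklim-app-01
for host in iklim-app-02 iklim-app-03; do
  ssh "$host" "mkdir -p /opt/iklimco/ssl /opt/iklimco/vault/data"
done
```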
### Vault service update for Raft
@ -65,7 +65,7 @@ vault:
Only the leader needs to be bootstrapped; others join via `vault operator raft join`:
```bash
# On the primary Vault (service-1 container):
# On the primary Vault (iklim-app-01 container):
VAULT_CTR=$(docker ps -q -f name=iklimco_vault)
# Unseal if needed
docker exec -it "$VAULT_CTR" vault operator unseal
docker exec "$VAULT_CTR" vault operator raft list-peers
```
On service-2 and service-3 containers:
On iklim-app-02 and iklim-app-03 containers:
```bash
docker exec -it <vault-on-service-2> vault operator raft join \
docker exec -it <vault-on-iklim-app-02> vault operator raft join \
https://vault.iklim.co:8200
```
### cert-reloader update for Raft
Update the cert-reloader command in `docker-stack-infra.yml` to SSH-copy the cert
to service-2 and service-3 after renewal:
to iklim-app-02 and iklim-app-03 after renewal:
```bash
# After copying to local /opt/iklimco/ssl/:
ssh -i /run/secrets/cert_reloader_ssh_key service-2 \
ssh -i /run/secrets/cert_reloader_ssh_key iklim-app-02 \
"cp /dev/stdin /opt/iklimco/ssl/STAR.iklim.co.full.crt" < /opt/iklimco/ssl/STAR.iklim.co.full.crt
# (repeat for service-3 and privkey)
# (repeat for iklim-app-03 and privkey)
docker service update --force iklimco_vault
```

View File

@ -8,7 +8,7 @@ Run after a successful prod pipeline deployment.
```bash
docker node ls
```
Expected: 3 managers (`Leader` + 2 `Reachable`), 3 workers (`Ready`).
Expected: 3 managers (`Leader` + 2 `Reachable`) for `iklim-app-01/02/03`, 3 workers (`Ready`) for `iklim-db-01/02/03`.
```bash
docker service ls --filter label=project=co.iklim
```
@ -57,7 +57,7 @@ curl -si https://rabbitmq.iklim.co # HTTP 200 RabbitMQ Management
```bash
# From outside — must fail
curl -sk --connect-timeout 5 https://<service-1-public-ip>:8200/v1/sys/health
curl -sk --connect-timeout 5 https://<iklim-app-01-public-ip>:8200/v1/sys/health
# Expected: connection refused or timeout
```
@ -86,10 +86,21 @@ Only `iklimco_swag` should show `*:80->80/tcp, *:443->443/tcp`.
## 8 — DB nodes running correct services
```bash
docker service ps iklimco_postgres
docker service ps iklimco_mongo
# Patroni (PostgreSQL HA) stack
docker stack services iklim-patroni
docker service ps iklim-patroni_patroni-01
docker service ps iklim-patroni_patroni-02
docker service ps iklim-patroni_patroni-03
# etcd cluster (for Patroni)
docker stack services iklim-db-etcd
# MongoDB replica set
docker stack services iklim-db
docker service ps iklim-db_mongodb
```
Tasks should show node names matching `db-1`, `db-2`, or `db-3`.
All tasks should show node names matching `iklim-db-01`, `iklim-db-02`, or `iklim-db-03`, confirming the `node.labels.role == db` placement constraint.
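An alternative view of the same check (format string is illustrative): list everything Swarm has scheduled onto a given DB node:
```bash
docker node ps iklim-db-01 --format 'table {{.Name}}\t{{.Image}}\t{{.CurrentState}}'
```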
## 9 — APISIX replicas

View File

@ -4,6 +4,7 @@
- **Repo:** `iklim.co` root
- **Environment:** test
- **Server:** single node — same machine is both Swarm manager and worker
- **Sizing:** Terraform test app node is `cpx42`; see `../../hetzner-sizing-report.md`
- Pipeline trigger: push to `test-env` branch → Gitea runner executes directly on the test server
- `init/swarm-init.sh` already exists in the repo and is called by the pipeline