From 76f87aa2f977dbe550455e522f32bd2f63ba8742 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Murat=20=C3=96ZDEM=C4=B0R?=
Date: Mon, 11 May 2026 14:53:21 +0300
Subject: [PATCH] Integrate DB nodes into Swarm and refine prod service deployment

- Database nodes now join the Docker Swarm as workers with `role=db` labels, allowing Swarm to manage their dedicated services.
- The `docker-stack-infra.yml` has been updated for production to focus solely on application-level infrastructure components.
- Dedicated database services (PostgreSQL, MongoDB, Patroni-etcd) are now explicitly deployed in separate Swarm stacks on `iklim-db-XX` nodes.
- Standardizes node naming conventions (`iklim-app-XX`, `iklim-db-XX`) across the production roadmap documentation.
- Clarifies that the `etcd` service within `docker-stack-infra.yml` is exclusively for APISIX configuration, distinct from Patroni's etcd cluster.
---
 roadmap/prod-env/01-swarm-init-multinode.md | 95 +++++++++++++--------
 roadmap/prod-env/02-godaddy-credentials.md  | 14 +--
 roadmap/prod-env/03-infra-stack-changes.md  | 69 ++++++---------
 roadmap/prod-env/04-swag-nginx-configs.md   |  2 +-
 roadmap/prod-env/06-cert-reloader.md        | 12 +--
 roadmap/prod-env/07-vault-raft-plan.md      | 24 +++---
 roadmap/prod-env/09-verify.md               | 21 +++--
 roadmap/test-env/01-swarm-init.md           |  1 +
 8 files changed, 130 insertions(+), 108 deletions(-)

diff --git a/roadmap/prod-env/01-swarm-init-multinode.md b/roadmap/prod-env/01-swarm-init-multinode.md
index f8c88e0..65c28f7 100644
--- a/roadmap/prod-env/01-swarm-init-multinode.md
+++ b/roadmap/prod-env/01-swarm-init-multinode.md
@@ -4,79 +4,103 @@
- **Repo:** `iklim.co` root
- **Environment:** prod
- **Topology:**
-  - 3 × service nodes — all act as **Swarm managers AND app workers** (Raft quorum: 1 can fail)
-  - 3 × DB nodes — **NOT part of Docker Swarm** (separate DB cluster, out of scope)
+  - 3 × app nodes (`iklim-app-01/02/03`) — all act as **Swarm managers AND app workers** (Raft quorum: 1 can fail)
+  - 3 × DB nodes (`iklim-db-01/02/03`) — join Swarm as **workers** with `role=db` label; DB services are placed exclusively on them
+- **Sizing:** app nodes are `cpx42`, DB nodes are `cpx32`; see `../../hetzner-sizing-report.md`
- All 6 nodes are in the same private network.
-- Pipeline trigger: push to `prod-env` branch → Gitea runner on `prod-runner` (first service node).
-- Swarm has 3 nodes total; all are manager-eligible and carry workloads (no dedicated worker-only nodes).
+- Pipeline trigger: push to `prod-env` branch → Gitea runner on `prod-runner` (first app node).
+- App Swarm managers: all 3 app nodes are manager-eligible and carry app workloads (no dedicated worker-only app nodes).

## Node labeling plan

| Node | Role | Swarm role | Labels |
|------|------|------------|--------|
-| service-1 | API services, SWAG, Vault | Manager + Worker | `type=service` |
-| service-2 | API services replicas | Manager + Worker | `type=service` |
-| service-3 | API services replicas | Manager + Worker | `type=service` |
+| `iklim-app-01` | API services, SWAG, Vault | Manager + Worker | `type=service` |
+| `iklim-app-02` | API services replicas | Manager + Worker | `type=service` |
+| `iklim-app-03` | API services replicas | Manager + Worker | `type=service` |
+| `iklim-db-01` | PostgreSQL (Patroni), etcd | Worker | `role=db` |
+| `iklim-db-02` | PostgreSQL (Patroni), etcd | Worker | `role=db` |
+| `iklim-db-03` | MongoDB replica + PostgreSQL (Patroni), etcd | Worker | `role=db` |

-> DB nodes (`db-1/2/3`) are **not part of Docker Swarm**. They run as a separate cluster
-> and are provisioned independently. No Swarm join or label step applies to them.
-
-## Step 1 — Init Swarm on service-1 (the prod-runner node)
+## Step 1 — Init Swarm on iklim-app-01 (the prod-runner node)

```bash
MANAGER_IP=$(hostname -I | awk '{print $1}')
if ! docker info --format '{{.Swarm.LocalNodeState}}' | grep -q "active"; then
  docker swarm init --advertise-addr "$MANAGER_IP"
-  echo "✅ Swarm initialized on $MANAGER_IP"
+  echo "Swarm initialized on $MANAGER_IP"
else
-  echo "ℹ️ Swarm already active"
+  echo "Swarm already active"
fi
```

## Step 2 — Get manager join token

```bash
-docker swarm join-token manager   # for service-2, service-3
+docker swarm join-token manager   # for iklim-app-02, iklim-app-03
```

-Save this token — needed on service-2 and service-3.
+Save this token — needed on iklim-app-02 and iklim-app-03.

-## Step 3 — Join service-2 and service-3 as managers
+## Step 3 — Join iklim-app-02 and iklim-app-03 as managers

-SSH into service-2 and service-3, run:
+SSH into iklim-app-02 and iklim-app-03, run:

```bash
-docker swarm join --token <manager-join-token> <manager-ip>:2377
+docker swarm join --token <manager-join-token> 10.10.10.11:2377
```

-## Step 4 — Label all Swarm nodes
+## Step 4 — Label app nodes

-On service-1, after service-2 and service-3 have joined:
+On iklim-app-01, after iklim-app-02 and iklim-app-03 have joined:

```bash
-for node in service-1 service-2 service-3; do
+for node in iklim-app-01 iklim-app-02 iklim-app-03; do
  docker node update --label-add type=service "$node"
done
```

-> Replace `service-1`, etc. with actual node hostnames shown in `docker node ls`.
-> DB nodes are not in Swarm — no join or label step for them.
+## Step 5 — Join DB nodes as Swarm workers

-## Step 5 — Verify
+Get the worker join token on iklim-app-01:
+
+```bash
+docker swarm join-token worker
+```
+
+SSH into each DB node and join:
+
+```bash
+docker swarm join --token <worker-join-token> 10.10.10.11:2377
+```
+
+Then label them on iklim-app-01:
+
+```bash
+for node in iklim-db-01 iklim-db-02 iklim-db-03; do
+  docker node update --label-add role=db "$node"
+done
+```
+
+> DB nodes are Swarm **workers** only — they never become managers.
+> DB services are pinned to them via the `node.labels.role == db` placement constraint.
+> See `08-prod-db-cluster-kurulum.md` for DB stack deployment.
+
+## Step 6 — Verify

```bash
docker node ls
```

-Expected: 3 nodes, all with `MANAGER STATUS` = `Leader` or `Reachable`.
-All 3 nodes remain in `AVAILABILITY=Active` (not drained) so they also carry workloads.
+Expected: 6 nodes — 3 with `MANAGER STATUS` = `Leader` or `Reachable`, 3 workers with `STATUS` = `Ready` and an empty `MANAGER STATUS`.

```bash
-docker node inspect service-1 --format '{{.Spec.Labels}}'
+docker node inspect iklim-app-01 --format '{{.Spec.Labels}}'
+docker node inspect iklim-db-01 --format '{{.Spec.Labels}}'
```

-Expected: `map[type:service]`.
+Expected: `map[type:service]` for app nodes, `map[role:db]` for DB nodes.

-## Step 6 — Confirm `init/swarm-init.sh` multi-node awareness
+## Step 7 — Confirm `init/swarm-init.sh` multi-node awareness

The script is idempotent (skips init if already active). Verify:

```bash
grep -n "swarm init\|swarm join" init/swarm-init.sh
```

-The prod pipeline runs on service-1 only. service-2/3 are joined via Ansible (`swarm` role),
+The prod pipeline runs on iklim-app-01 only. iklim-app-02/03 are joined via Ansible (`swarm` role),
not via the Gitea pipeline.
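
As a quick sanity check of the labeling plan above, a small read-only loop like the following (a sketch for this roadmap, not part of `init/swarm-init.sh` or the Ansible role) prints each node's hostname, Swarm role, and labels in one pass:

```bash
# Sketch: list every Swarm node with its role and labels (run on iklim-app-01)
for node in $(docker node ls --format '{{.Hostname}}'); do
  docker node inspect "$node" \
    --format '{{.Description.Hostname}}  role={{.Spec.Role}}  labels={{.Spec.Labels}}'
done
```

For the app nodes this should print `role=manager` and `map[type:service]`; for the DB nodes, `role=worker` and `map[role:db]`.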
## Placement constraints used in `docker-stack-infra.yml`

| Constraint | Resolves to |
|------------|-------------|
-| `node.role == manager` | service-1, service-2, service-3 |
-| `node.labels.type == service` | service-1, service-2, service-3 |
+| `node.role == manager` | iklim-app-01, iklim-app-02, iklim-app-03 |
+| `node.labels.type == service` | iklim-app-01, iklim-app-02, iklim-app-03 |
+| `node.labels.role == db` | iklim-db-01, iklim-db-02, iklim-db-03 |

SWAG, Vault, cert-reloader: pinned to `node.role == manager`.
-Microservices: no constraint (distributed across all 3 service nodes by Swarm scheduler).
-
-> `node.labels.type == db` constraint is **not used** — DB nodes are not in Swarm.
-> PostgreSQL and MongoDB run outside Swarm as a separately managed cluster.
+Microservices: no constraint (distributed across all app nodes by Swarm scheduler).
+DB services (Patroni, etcd, MongoDB): pinned to `node.labels.role == db` in separate DB stacks.
diff --git a/roadmap/prod-env/02-godaddy-credentials.md b/roadmap/prod-env/02-godaddy-credentials.md
index 48584e5..24b3b5e 100644
--- a/roadmap/prod-env/02-godaddy-credentials.md
+++ b/roadmap/prod-env/02-godaddy-credentials.md
@@ -32,7 +32,7 @@ No additional action needed in the repo.

## Step 3 — (Handled by pipeline) Write credentials file on prod host

-The deploy pipeline (see `08-deploy-pipeline-update.md`) runs on service-1:
+The deploy pipeline (see `08-deploy-pipeline-update.md`) runs on iklim-app-01:

```bash
mkdir -p /opt/iklimco/swag/dns-conf
@@ -42,16 +42,16 @@
chmod 600 /opt/iklimco/swag/dns-conf/godaddy.ini
```

## Step 4 — GoDaddy A records for prod subdomains

-In GoDaddy DNS panel for `iklim.co`, add/update A records pointing to service-1's public IP:
+In GoDaddy DNS panel for `iklim.co`, add/update A records pointing to iklim-app-01's public IP:

| Record | Value |
|--------|-------|
-| `api` | `<service-1 public IP>` |
-| `apigw` | `<service-1 public IP>` |
-| `rabbitmq` | `<service-1 public IP>` |
-| `grafana` | `<service-1 public IP>` |
+| `api` | `<iklim-app-01 public IP>` |
+| `apigw` | `<iklim-app-01 public IP>` |
+| `rabbitmq` | `<iklim-app-01 public IP>` |
+| `grafana` | `<iklim-app-01 public IP>` |

-> Swarm's routing mesh means any node IP would work, but service-1 is the designated
+> Swarm's routing mesh means any node IP would work, but iklim-app-01 is the designated
> entry point (runs SWAG). Using a single IP keeps DNS simple.
>
> For HA: add a load balancer or use Hetzner's floating IP in front of the 3 service nodes.
diff --git a/roadmap/prod-env/03-infra-stack-changes.md b/roadmap/prod-env/03-infra-stack-changes.md
index 45e3a4d..093682a 100644
--- a/roadmap/prod-env/03-infra-stack-changes.md
+++ b/roadmap/prod-env/03-infra-stack-changes.md
@@ -2,41 +2,22 @@
## Context
- **File:** `docker-stack-infra.yml` (repo root — shared between test and prod)
-- All changes from `test-env-setup/03-infra-stack-changes.md` apply here identically.
+- All changes from `test-env/03-infra-stack-changes.md` apply here identically.
- **Additional prod-specific changes:**
-  - PostgreSQL and MongoDB placement constraints point to `type=db` nodes.
-  - Microservices have no constraint (distributed across service nodes by Swarm).
+  - Microservices have no constraint (distributed across app nodes by Swarm).
  - Replica counts for stateless services are increased.
+- **Note:** PostgreSQL and MongoDB are **not** in `docker-stack-infra.yml` for prod. They run on
+  dedicated DB nodes in separate stacks (`iklim-db` and `iklim-patroni`). See `08-prod-db-cluster-kurulum.md`.
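
If it helps to double-check that note before editing, a quick grep (illustrative only — it assumes the service keys are named `postgres`/`mongo` and indented with two spaces under `services:`) can confirm no database service definitions remain in the prod infra stack:

```bash
# Illustrative check: expect no postgres/mongo service definitions in docker-stack-infra.yml
grep -nE '^  (postgres|mongo):' docker-stack-infra.yml || echo "no postgres/mongo services defined"
```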
## Step 1 — Apply all test-env changes first -Follow every step in `test-env-setup/03-infra-stack-changes.md`: +Follow every step in `test-env/03-infra-stack-changes.md`: - Add `swag` service - Add `cert-reloader` service - Remove published ports for vault, apisix, rabbitmq, prometheus, grafana, apisix-dashboard - Add `swag-vl` volume -## Step 2 — Update PostgreSQL placement constraint - -Change `postgres` service placement to use the `type=db` label: - -```yaml -# CHANGE in postgres service: - placement: - constraints: - - node.labels.type == db -``` - -## Step 3 — Update MongoDB placement constraint - -```yaml -# CHANGE in mongo service: - placement: - constraints: - - node.labels.type == db -``` - -## Step 4 — Pin Vault to manager node (initial prod — single instance) +## Step 2 — Pin Vault to manager node (initial prod — single instance) Vault starts as a single instance pinned to the manager node. Raft cluster migration is handled separately in `07-vault-raft-plan.md`. @@ -48,7 +29,7 @@ Raft cluster migration is handled separately in `07-vault-raft-plan.md`. - node.role == manager ``` -## Step 5 — Increase APISIX replicas for prod +## Step 3 — Increase APISIX replicas for prod ```yaml # CHANGE in apisix service deploy block: @@ -59,40 +40,46 @@ Raft cluster migration is handled separately in `07-vault-raft-plan.md`. APISIX is stateless (config in etcd) — multiple replicas are safe. Swarm load-balances SWAG's requests across APISIX replicas via VIP. -## Step 6 — etcd: 3-node cluster for prod +## Step 4 — etcd: single instance in docker-stack-infra.yml (APISIX config store only) -For prod, etcd should run as a 3-node cluster (minimum for Raft quorum). -The current single-instance etcd definition needs to be replaced with a 3-node -StatefulSet-style setup using separate service definitions or a dedicated -`docker-stack-etcd.yml`. +The `etcd` service in `docker-stack-infra.yml` is used exclusively by APISIX as its configuration +store. It runs as a single instance on a manager node and is separate from the etcd cluster used by +Patroni for PostgreSQL HA. -> **Scope note:** etcd clustering for prod is complex and out of scope for initial launch. -> Deploy with single etcd for initial prod launch. Add etcd clustering as a follow-up task. -> Track in: `Technical Debt/TODO.md` +```yaml +# etcd placement stays as: + placement: + constraints: + - node.role == manager +``` -## Step 7 — Verify the complete file +> The 3-node etcd cluster for Patroni/PostgreSQL HA is deployed separately via `08-prod-db-cluster-kurulum.md` +> on the dedicated DB nodes. These are two independent etcd deployments with different purposes. + +## Step 5 — Verify the complete file After all edits, validate the YAML: ```bash -docker stack config -c docker-stack-infra.yml > /dev/null && echo "✅ YAML valid" +docker stack config -c docker-stack-infra.yml > /dev/null && echo "YAML valid" ``` No output errors = valid. 
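
Beyond YAML validity, it can be worth confirming that the port-removal edits from the test-env roadmap actually took effect. A small sketch (assuming those edits were applied and that the rendered config uses the long port syntax with a `published:` key) is to render the merged config and list every remaining published port — only SWAG's 80/443 should appear:

```bash
# Sketch: after the edits, the only published ports left should be SWAG's 80 and 443
docker stack config -c docker-stack-infra.yml | grep -n "published:"
```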
-## Placement summary for prod +## Placement summary for prod (docker-stack-infra.yml only) | Service | Placement | |---------|-----------| | swag | `node.role == manager` | | cert-reloader | `node.role == manager` | | vault | `node.role == manager` | -| apisix (2 replicas) | no constraint (any node) | +| apisix (2 replicas) | no constraint (distributed across app nodes) | | apisix-dashboard | no constraint | -| postgres | `node.labels.type == db` | -| mongo | `node.labels.type == db` | | redis | `node.role == manager` | | rabbitmq | `node.role == manager` | -| etcd | `node.role == manager` | +| etcd (APISIX store) | `node.role == manager` | | prometheus | `node.role == manager` | | grafana | `node.role == manager` | + +> PostgreSQL and MongoDB are deployed in separate DB stacks on `iklim-db-*` nodes. +> See `08-prod-db-cluster-kurulum.md` for those stacks. diff --git a/roadmap/prod-env/04-swag-nginx-configs.md b/roadmap/prod-env/04-swag-nginx-configs.md index 94abed5..d59ed65 100644 --- a/roadmap/prod-env/04-swag-nginx-configs.md +++ b/roadmap/prod-env/04-swag-nginx-configs.md @@ -48,7 +48,7 @@ will contain `server_name api.iklim.co;` — correct for prod. ## Verification -After deploy, on service-1: +After deploy, on iklim-app-01: ```bash cat /opt/iklimco/swag/proxy-confs/api.conf | grep server_name ``` diff --git a/roadmap/prod-env/06-cert-reloader.md b/roadmap/prod-env/06-cert-reloader.md index 8b4b59c..c49443a 100644 --- a/roadmap/prod-env/06-cert-reloader.md +++ b/roadmap/prod-env/06-cert-reloader.md @@ -20,20 +20,20 @@ No cross-node distribution needed. ## Future behavior (3-node Vault Raft — see step 07) -When Vault runs on service-1, service-2, service-3: +When Vault runs on iklim-app-01, iklim-app-02, iklim-app-03: ``` cert-reloader detects cert change -→ copies cert to /opt/iklimco/ssl/ on service-1 (local) -→ SSH copy to service-2:/opt/iklimco/ssl/ -→ SSH copy to service-3:/opt/iklimco/ssl/ +→ copies cert to /opt/iklimco/ssl/ on iklim-app-01 (local) +→ SSH copy to iklim-app-02:/opt/iklimco/ssl/ +→ SSH copy to iklim-app-03:/opt/iklimco/ssl/ → docker service update --force iklimco_vault (restarts all 3 replicas) ``` This requires: -- An SSH key that cert-reloader can use to reach service-2 and service-3 +- An SSH key that cert-reloader can use to reach iklim-app-02 and iklim-app-03 - That key mounted as a Docker secret into cert-reloader -- Known_hosts for service-2 and service-3 pre-configured +- Known_hosts for iklim-app-02 and iklim-app-03 pre-configured Script update for this phase is tracked in `07-vault-raft-plan.md`. diff --git a/roadmap/prod-env/07-vault-raft-plan.md b/roadmap/prod-env/07-vault-raft-plan.md index 68c407c..61db360 100644 --- a/roadmap/prod-env/07-vault-raft-plan.md +++ b/roadmap/prod-env/07-vault-raft-plan.md @@ -1,7 +1,7 @@ # 07 — Vault: Initial Single Instance + Raft Cluster Migration Plan (Prod) ## Context -Vault starts as a single instance on the manager node (service-1) for the initial prod launch. +Vault starts as a single instance on the manager node (iklim-app-01) for the initial prod launch. This matches the current `docker-stack-infra.yml` configuration (file storage, single replica). Raft HA cluster is planned for a later phase. @@ -9,8 +9,8 @@ Raft HA cluster is planned for a later phase. 
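
Before walking through the phases, it can help to capture the current single-instance state. A minimal check (reusing the `iklimco_vault` name filter used later in this plan) is:

```bash
# Minimal Phase 1 health check for the single Vault instance on iklim-app-01
VAULT_CTR=$(docker ps -q -f name=iklimco_vault)
docker exec "$VAULT_CTR" vault status   # shows seal state, storage type, and HA mode
```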
## Phase 1 — Initial prod launch (current)

- **Replicas:** 1
-- **Storage:** file (`/vault/file`) on service-1
-- **Placement:** `node.role == manager` (service-1)
+- **Storage:** file (`/vault/file`) on iklim-app-01
+- **Placement:** `node.role == manager` (iklim-app-01)
- **Cert:** from `/opt/iklimco/ssl/` (populated by cert-reloader from SWAG volume)
- **TLS:** `VAULT_LOCAL_CONFIG` unchanged — `api_addr: https://vault.iklim.co:8200`

No changes to `docker-stack-infra.yml` vault service for Phase 1.

@@ -22,14 +22,14 @@
- **Replicas:** 3 (one per service node)
- **Storage:** Raft integrated (replaces file storage)
- **Placement:** `node.labels.type == service` (all 3 service nodes)
-- **Cert distribution:** cert-reloader SSH-copies renewed cert to service-2, service-3
+- **Cert distribution:** cert-reloader SSH-copies renewed cert to iklim-app-02, iklim-app-03

### Prerequisites before migration
- [ ] All 3 service nodes are running and labeled `type=service`
- [ ] Vault data backed up from Phase 1 (snapshot via `vault operator raft snapshot save`)
-- [ ] SSH key created for cert-reloader to reach service-2 and service-3
+- [ ] SSH key created for cert-reloader to reach iklim-app-02 and iklim-app-03
- [ ] SSH key stored as Docker secret `cert_reloader_ssh_key`
-- [ ] `/opt/iklimco/ssl/` directory exists on service-2 and service-3
+- [ ] `/opt/iklimco/ssl/` directory exists on iklim-app-02 and iklim-app-03
- [ ] Vault data directory `/opt/iklimco/vault/data/` exists on all 3 nodes (host path volumes)

### Vault service update for Raft
@@ -65,7 +65,7 @@ vault:
Only the leader needs to be bootstrapped; others join via `vault operator raft join`:

```bash
-# On the primary Vault (service-1 container):
+# On the primary Vault (iklim-app-01 container):
VAULT_CTR=$(docker ps -q -f name=iklimco_vault)

# Unseal if needed
docker exec -it "$VAULT_CTR" vault operator unseal

docker exec "$VAULT_CTR" vault operator raft list-peers
```

-On service-2 and service-3 containers:
+On iklim-app-02 and iklim-app-03 containers:

```bash
-docker exec -it <vault container on service-2/3> vault operator raft join \
+docker exec -it <vault container on iklim-app-02/03> vault operator raft join \
  https://vault.iklim.co:8200
```

### cert-reloader update for Raft

Update the cert-reloader command in `docker-stack-infra.yml` to SSH-copy the cert
-to service-2 and service-3 after renewal:
+to iklim-app-02 and iklim-app-03 after renewal:

```bash
# After copying to local /opt/iklimco/ssl/:
-ssh -i /run/secrets/cert_reloader_ssh_key service-2 \
+ssh -i /run/secrets/cert_reloader_ssh_key iklim-app-02 \
  "cp /dev/stdin /opt/iklimco/ssl/STAR.iklim.co.full.crt" < /opt/iklimco/ssl/STAR.iklim.co.full.crt
-# (repeat for service-3 and privkey)
+# (repeat for iklim-app-03 and privkey)
docker service update --force iklimco_vault
```
diff --git a/roadmap/prod-env/09-verify.md b/roadmap/prod-env/09-verify.md
index 0e20ea2..87ee39a 100644
--- a/roadmap/prod-env/09-verify.md
+++ b/roadmap/prod-env/09-verify.md
@@ -8,7 +8,7 @@ Run after a successful prod pipeline deployment.
```bash
docker node ls
```
-Expected: 3 managers (`Leader` + 2 `Reachable`), 3 workers (`Ready`).
+Expected: 3 managers (`Leader` + 2 `Reachable`) for `iklim-app-01/02/03`, 3 workers (`Ready`) for `iklim-db-01/02/03`.
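
For a scripted version of this check (a sketch — counting by the `iklim-db` hostname prefix assumes the nodes follow the naming above):

```bash
# Sketch: count managers and DB workers instead of reading the table by eye
docker node ls --format '{{.Hostname}} {{.ManagerStatus}}'
docker node ls --format '{{.ManagerStatus}}' | grep -cE 'Leader|Reachable'   # expect 3
docker node ls --format '{{.Hostname}}' | grep -c '^iklim-db'                # expect 3
```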
```bash
docker service ls --filter label=project=co.iklim
@@ -57,7 +57,7 @@ curl -si https://rabbitmq.iklim.co   # HTTP 200 RabbitMQ Management

```bash
# From outside — must fail
-curl -sk --connect-timeout 5 https://<service-node public IP>:8200/v1/sys/health
+curl -sk --connect-timeout 5 https://<app-node public IP>:8200/v1/sys/health
# Expected: connection refused or timeout
```

@@ -86,10 +86,21 @@ Only `iklimco_swag` should show `*:80->80/tcp, *:443->443/tcp`.

## 8 — DB nodes running correct services

```bash
-docker service ps iklimco_postgres
-docker service ps iklimco_mongo
+# Patroni (PostgreSQL HA) stack
+docker stack services iklim-patroni
+docker service ps iklim-patroni_patroni-01
+docker service ps iklim-patroni_patroni-02
+docker service ps iklim-patroni_patroni-03
+
+# etcd cluster (for Patroni)
+docker stack services iklim-db-etcd
+
+# MongoDB replica set
+docker stack services iklim-db
+docker service ps iklim-db_mongodb
```
-Tasks should show node names matching `db-1`, `db-2`, or `db-3`.
+
+All tasks should show node names matching `iklim-db-01`, `iklim-db-02`, or `iklim-db-03`, pinned there by the `node.labels.role == db` placement constraint.

## 9 — APISIX replicas
diff --git a/roadmap/test-env/01-swarm-init.md b/roadmap/test-env/01-swarm-init.md
index 9418120..bc218ce 100644
--- a/roadmap/test-env/01-swarm-init.md
+++ b/roadmap/test-env/01-swarm-init.md
@@ -4,6 +4,7 @@
- **Repo:** `iklim.co` root
- **Environment:** test
- **Server:** single node — same machine is both Swarm manager and worker
+- **Sizing:** Terraform test app node is `cpx42`; see `../../hetzner-sizing-report.md`
- Pipeline trigger: push to `test-env` branch → Gitea runner executes directly on the test server
- `init/swarm-init.sh` already exists in the repo and is called by the pipeline
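
For completeness, an illustrative one-liner to confirm the single test node ended up as both an active worker and the manager after `init/swarm-init.sh` has run:

```bash
# Illustrative check for the single-node test Swarm
docker node ls --format '{{.Hostname}} {{.Status}} {{.Availability}} {{.ManagerStatus}}'
# expected: one line showing Ready, Active, and Leader
```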