diff --git a/roadmap/prod-env/01-swarm-init-multinode.md b/roadmap/prod-env/01-swarm-init-multinode.md index e1407a4..3f00360 100644 --- a/roadmap/prod-env/01-swarm-init-multinode.md +++ b/roadmap/prod-env/01-swarm-init-multinode.md @@ -37,12 +37,17 @@ node.labels.type == service ← custom label (app node workload target) node.labels.role == db ← custom label (DB node workload target) ``` -This scheme is applied consistently across `docker-stack-infra.yml` and all 10 microservice `docker-stack-service.yml` files. The test environment uses the same `type=service` label on its single node, so both environments share the same constraint syntax. +This scheme is applied consistently across the current prod stack (`docker-stack-infra_db-prod.yml`), the separate Vault stack (`docker-stack-vault.yml`), and microservice stack definitions. The test environment uses the same `type=service` label on its service node, so both environments share the same constraint syntax. `node.role == worker` is intentionally not used anywhere. DB nodes are Swarm workers, but targeting them via `node.role == worker` would also match any future worker-only app nodes. The explicit `node.labels.role == db` label provides precise, unambiguous targeting regardless of Swarm role. ## Automation Note -**IMPORTANT:** All Swarm initialization, join token, and node tagging (labeling) steps listed below are no longer performed manually. They are run **fully automatically** by `Environment_Infrastructure/ansible/prod/prod-bootstrap.yml` and the shared `swarm` role. The manual bash commands here are kept only for reference, information, and problem-solving (troubleshooting) purposes. +**IMPORTANT:** All Swarm initialization, join token, and node labeling steps listed below are no longer performed manually. They are run automatically by `Environment_Infrastructure/ansible/prod/prod-bootstrap.yml` and the shared `swarm` role. 
The manual bash commands here are kept only for reference, information, and troubleshooting purposes. + +Labeling happens in two stages: + +- The shared `swarm` role adds the `type=service` label to app nodes and the `role=db` label to DB nodes. +- The prod playbook adds the `db-index=01/02/03` label to DB nodes from `iklim-app-01`. ## Step 1 — Init Swarm on iklim-app-01 (the prod-runner node) @@ -98,9 +103,13 @@ docker swarm join --token 10.20.10.11:2377 Then label them on iklim-app-01: ```bash -docker node update --label-add role=db --label-add db-index=01 iklim-db-01 -docker node update --label-add role=db --label-add db-index=02 iklim-db-02 -docker node update --label-add role=db --label-add db-index=03 iklim-db-03 +docker node update --label-add role=db iklim-db-01 +docker node update --label-add role=db iklim-db-02 +docker node update --label-add role=db iklim-db-03 + +docker node update --label-add db-index=01 iklim-db-01 +docker node update --label-add db-index=02 iklim-db-02 +docker node update --label-add db-index=03 iklim-db-03 ``` > DB nodes are Swarm **workers** only — they never become managers. @@ -130,15 +139,19 @@ The script is idempotent (skips init if already active). Verify: grep -n "swarm init\|swarm join" init/swarm-init.sh ``` -The prod pipeline runs on iklim-app-01 only. iklim-app-02/03 are joined via Ansible (`swarm` role), -not via the Gitea pipeline. +The prod pipeline runs on iklim-app-01 only. iklim-app-02/03 are joined via Ansible (`swarm` role), not via the Gitea pipeline. 
-## Placement constraints used in `docker-stack-infra.yml` +## Placement Constraints Used in Current Prod Stacks | Constraint | Resolves to | Services | |------------|-------------|----------| | `node.hostname == iklim-app-01` | iklim-app-01 only | SWAG, cert-reloader | -| `node.labels.type == service` | iklim-app-01, iklim-app-02, iklim-app-03 | Vault, Redis, RabbitMQ, APISIX, Prometheus, Grafana, etcd (idle in prod — APISIX uses Patroni etcd) | -| `node.labels.role == db` | iklim-db-01, iklim-db-02, iklim-db-03 | PostgreSQL (Patroni), MongoDB, etcd (via `docker-stack-db.prod.yml`) | +| `node.labels.type == service` | iklim-app-01, iklim-app-02, iklim-app-03 | Vault, Redis, RabbitMQ, APISIX, Prometheus, Grafana, SWAG support services | +| `node.hostname == iklim-db-01/02/03` | specific DB node | Patroni, MongoDB, and etcd services pinned per node in `docker-stack-infra_db-prod.yml` | +| `node.labels.role == db` | iklim-db-01, iklim-db-02, iklim-db-03 | Generic DB node identity; retained for operations and compatibility | -SWAG and cert-reloader are pinned to `iklim-app-01` (the Floating IP node) because SWAG does not support clustering and must match the public entry point. Vault floats across all service nodes; its TLS cert is read from StorageBox (`/mnt/storagebox/ssl`) so it is available on whichever node Vault is scheduled on. Microservices carry no placement constraint and are distributed by the Swarm scheduler across all app nodes. DB services are pinned to DB nodes via separate DB stacks. +SWAG and cert-reloader are pinned to `iklim-app-01` (the Floating IP node) because SWAG must match the public entry point. Vault is deployed by `docker-stack-vault.yml` across service nodes and reads certificates from `/opt/iklimco/ssl`. Microservices are distributed by the Swarm scheduler across app nodes. DB services are defined in `docker-stack-infra_db-prod.yml` and pinned to DB nodes by hostname constraints. 
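
To check which nodes a label constraint from the table above would resolve to, the constraint can be translated into a `docker node ls` filter. The sketch below is illustrative only: `constraint_to_node_filter` is a hypothetical helper that is not part of the repo, and it only prints the command so it can be reviewed before being run on a manager node.

```shell
#!/bin/sh
# Hypothetical helper (not in the repo): turn a label-based placement
# constraint into the docker command that lists the matching nodes.
constraint_to_node_filter() {
  # $1 is a constraint such as "node.labels.type == service"
  constraint="$1"
  key="${constraint%% ==*}"      # left-hand side, e.g. node.labels.type
  value="${constraint##*== }"    # right-hand side, e.g. service
  key="${key#node.labels.}"      # strip the prefix, leaving the label name
  # Only print the command; running it requires a Swarm manager node.
  printf 'docker node ls --filter node.label=%s=%s\n' "$key" "$value"
}

constraint_to_node_filter "node.labels.type == service"
constraint_to_node_filter "node.labels.role == db"
```

Hostname constraints such as `node.hostname == iklim-app-01` need no lookup, since they name the target node directly.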
+ +## Historical / Superseded by Setup + +Older notes that referred to `docker-stack-infra.yml`, `docker-stack-infra.prod.yml`, or `docker-stack-db.prod.yml` as the active prod deployment model are superseded by `../../setup/08-prod-db-cluster-setup.md` and `../../setup/09-prod-runner-ha-and-swarm.md`. diff --git a/roadmap/prod-env/02-godaddy-credentials.md b/roadmap/prod-env/02-godaddy-credentials.md index 97c674b..205ff58 100644 --- a/roadmap/prod-env/02-godaddy-credentials.md +++ b/roadmap/prod-env/02-godaddy-credentials.md @@ -1,7 +1,7 @@ # 02 — GoDaddy DNS Credentials for SWAG (Prod) ## Context -Identical to test-env-setup/02, except the storagebox path is `prod/` instead of `test/`. +Same credential model as `../test-env/02-godaddy-credentials.md`, except the StorageBox path is `prod/` instead of `test/`. ## ⚠️ Security — Rotate credentials before use @@ -30,24 +30,22 @@ GODADDY_SECRET= ## Step 2 — Repo template file -Same file as test: `template/swag/dns-conf/godaddy.ini.tpl` (already created in test step 02). -No additional action needed in the repo. +Same file as test: `template/swag/dns-conf/godaddy.ini.tpl` (already created in test step 02). No additional action needed in the repo. -## Step 3 — (Handled by pipeline) Write credentials file on prod host +## Step 3 — (Handled by pipeline) Write credentials file on prod StorageBox path The deploy pipeline (see `08-deploy-pipeline-update.md`) runs on iklim-app-01: ```bash set -a; . 
./.env; set +a -mkdir -p "$SWAG_CONFIG_DIR/dns-conf" -envsubst < template/swag/dns-conf/godaddy.ini.tpl > "$SWAG_CONFIG_DIR/dns-conf/godaddy.ini" -chmod 600 "$SWAG_CONFIG_DIR/dns-conf/godaddy.ini" +mkdir -p "$SWAG_DNS_CONFIG_DIR" +envsubst < template/swag/dns-conf/godaddy.ini.tpl > "$SWAG_DNS_CONFIG_DIR/godaddy.ini" +chmod 600 "$SWAG_DNS_CONFIG_DIR/godaddy.ini" ``` ## Step 4 — GoDaddy A records for prod subdomains (handled by pipeline) -The deploy pipeline's **Update DNS Records** step automatically manages A records via GoDaddy API. -It reads the Floating IP from the Gitea variable `vars.PROD_FLOATING_IP` — set this once in Gitea project settings. +The deploy pipeline's **Update DNS Records** step automatically manages A records via GoDaddy API. It reads the Floating IP from the Gitea variable `vars.PROD_FLOATING_IP` — set this once in Gitea project settings. To get the Floating IP: `terraform output prod_floating_ip` @@ -64,6 +62,5 @@ Logic: for each record, pipeline queries the current value via GoDaddy API. If a > If failover is needed, the Floating IP can be reassigned to another app node; DNS does not change. ## Notes -- Test and prod SWAG instances both obtain `*.iklim.co` independently from Let's Encrypt. - There is no conflict — they use the same domain, different servers. +- Test and prod SWAG instances both obtain `*.iklim.co` independently from Let's Encrypt. There is no conflict — they use the same domain, different servers. - `DNSPROPAGATION=90` handles GoDaddy's typical 30-90s propagation delay. diff --git a/roadmap/prod-env/04-swag-nginx-configs.md b/roadmap/prod-env/04-swag-nginx-configs.md index fc35738..bf6614f 100644 --- a/roadmap/prod-env/04-swag-nginx-configs.md +++ b/roadmap/prod-env/04-swag-nginx-configs.md @@ -1,10 +1,12 @@ # 04 — SWAG Nginx Proxy Configs (Prod) ## Context -Same template files as test (`template/swag/site-confs/*.conf.tpl`), different env vars. -The pipeline processes templates with prod-specific subdomain values. 
-## Required env vars (in `.env` on storagebox `prod/secrets/iklim.co/.env.prod`) +Production uses the same SWAG template files as test, with production subdomain values and StorageBox-backed output directories. The current setup source is `../../setup/09-prod-runner-ha-and-swarm.md`. + +## Required Environment Variables + +The production env file is `prod/secrets/iklim.co/.env` on StorageBox. ```bash API_SUBDOMAIN=api.iklim.co @@ -13,65 +15,47 @@ RABBITMQ_SUBDOMAIN=rabbitmq.iklim.co GRAFANA_SUBDOMAIN=grafana.iklim.co RESTRICTED_IPS="78.187.87.109/32,95.70.151.248/32" -# SWAG storage paths — StorageBox is mounted on all app nodes, shared filesystem -# cert-reloader writes here; Vault reads from this path on every node — no SSH distribution needed SWAG_CERT_DIR=/mnt/storagebox/ssl -# SWAG config dirs on StorageBox — all three survive node failover without pipeline re-run -SWAG_CONFIG_DIR=/mnt/storagebox/swag/config +SWAG_DNS_CONFIG_DIR=/mnt/storagebox/swag/dns-conf SWAG_SITE_CONFS_DIR=/mnt/storagebox/swag/site-confs +SWAG_PROXY_CONFS_DIR=/mnt/storagebox/swag/proxy-confs ``` -## Template files (already created in test step 04) +## Template Files +The shared templates live under root `template/swag/`: + +- `template/swag/dns-conf/godaddy.ini.tpl` - `template/swag/site-confs/default.conf` - `template/swag/site-confs/api.conf.tpl` - `template/swag/site-confs/apigw.conf.tpl` - `template/swag/site-confs/rabbitmq.conf.tpl` - `template/swag/site-confs/grafana.conf.tpl` -No new files to create — the same templates work for both environments. +## Deploy Behavior -## Deploy step (handled by pipeline — see `08-deploy-pipeline-update.md`) +The production workflow renders: -```bash -set -a; . ./.env; set +a -export RESTRICTED_IPS_BLOCK="$(echo "$RESTRICTED_IPS" | tr ',' '\n' | sed 's|.*| allow &;|')" +- GoDaddy DNS credentials into `$SWAG_DNS_CONFIG_DIR/godaddy.ini`. +- SWAG site configs into `$SWAG_SITE_CONFS_DIR`. 
+- Optional proxy configs into `$SWAG_PROXY_CONFS_DIR` when templates exist. -mkdir -p "$SWAG_SITE_CONFS_DIR" - -SWAG_VARS='${API_SUBDOMAIN}${APIGW_SUBDOMAIN}${GRAFANA_SUBDOMAIN}${RABBITMQ_SUBDOMAIN}${RESTRICTED_IPS_BLOCK}' -for tpl in template/swag/site-confs/*.conf.tpl; do - out="$SWAG_SITE_CONFS_DIR/$(basename "${tpl%.tpl}")" - envsubst "$SWAG_VARS" < "$tpl" | sudo tee "$out" > /dev/null - echo "✅ $out" -done - -sudo cp template/swag/site-confs/default.conf "$SWAG_SITE_CONFS_DIR/default.conf" -``` - -With `API_SUBDOMAIN=api.iklim.co`, the output file `$SWAG_SITE_CONFS_DIR/api.conf` -(`/mnt/storagebox/swag/site-confs/api.conf`) will contain `server_name api.iklim.co;` — correct for prod. +Because StorageBox is mounted on the service nodes, files rendered by the runner are visible to SWAG regardless of which service node runs the container. ## Verification -After deploy, on iklim-app-01: ```bash cat /mnt/storagebox/swag/site-confs/api.conf | grep server_name -``` -Expected: `server_name api.iklim.co;` - -```bash -docker exec $(docker ps -q -f name=iklimco_swag) nginx -t -``` -Expected: `syntax is ok` - -```bash +docker exec $(docker ps -q -f name=iklimco_swag | head -1) nginx -t curl -si https://api.iklim.co/health ``` -Expected: APISIX response with valid `*.iklim.co` cert. -## Notes -- `Prometheus` is intentionally NOT exposed via SWAG. Access it via Grafana - (internal connection: `http://prometheus:9090`) or SSH tunnel. -- If additional restricted-access subdomains are needed in the future, create a new - `template/swag/site-confs/.conf.tpl` following the same pattern. +Expected: + +- `server_name api.iklim.co;` +- Nginx config syntax is valid. +- Public API returns an APISIX response with a valid `*.iklim.co` certificate. + +## Historical / Superseded by Setup + +The previous `SWAG_CONFIG_DIR=/mnt/storagebox/swag/config` and `.env.prod` references are superseded. 
Use the split `SWAG_DNS_CONFIG_DIR`, `SWAG_SITE_CONFS_DIR`, and `SWAG_PROXY_CONFS_DIR` variables from the current setup. diff --git a/roadmap/prod-env/05-apisix-remove-ssl.md b/roadmap/prod-env/05-apisix-remove-ssl.md index f2eedb6..4e91a63 100644 --- a/roadmap/prod-env/05-apisix-remove-ssl.md +++ b/roadmap/prod-env/05-apisix-remove-ssl.md @@ -1,48 +1,45 @@ # 05 — APISIX: Remove SSL / Configure Trusted Proxy (Prod) ## Context -Identical to `test-env-setup/05-apisix-remove-ssl.md`. -The same `init/apisix-core/init.sh` and custom APISIX image are used for both environments. -Changes made for test already apply to prod. +The same `init/apisix-core/init.sh` and custom APISIX image are used for test and prod. TLS terminates at SWAG; APISIX receives plain HTTP over the `iklimco-net` overlay network. ## Checklist -- [ ] `ssls/1` PUT block removed from `init/apisix-core/init.sh` -- [ ] `dev` SSL block removed or confirmed non-impactful for prod -- [ ] Custom APISIX image (`custom-apisix:3.12.0`) `template/apisix-core/config.yaml.template` contains - `real_ip_header`, `real_ip_recursive`, and `set_real_ip_from` (`10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`) -- [ ] New image built and pushed to Harbor if config.yaml.template was changed: - ```bash - bash ops/push-harbor-custom-images.sh - ``` +- `ssls/1` PUT block is removed from `init/apisix-core/init.sh`. +- The dev-only SSL block is removed or confirmed to be non-impactful for prod. +- The custom APISIX image includes trusted proxy settings in `template/apisix-core/config.yaml.template`: `real_ip_header`, `real_ip_recursive`, and `set_real_ip_from` for private ranges. +- The custom image is pushed to Harbor when the APISIX config template changes. -## Prod-specific note +## Current Prod Model -APISIX runs with `replicas: 3` in prod — this value is defined in the `docker-stack-infra.prod.yml` overlay (not in the base `docker-stack-infra.yml`). 
All replicas read the same configuration from Patroni etcd (`/apisix` prefix) — a single `init` run is sufficient. +APISIX runs with 3 replicas in `docker-stack-infra_db-prod.yml`. All replicas read configuration from the shared DB-node etcd cluster with the `/apisix` prefix, so the pipeline runs `init/apisix-core/init.sh` once. + +Production deployment uses: ```bash -# Prod deploy: -docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco +docker stack deploy --with-registry-auth -c docker-stack-infra_db-prod.yml iklimco ``` -`init/apisix-core/init.sh` is run once by the pipeline and writes the etcd state that all APISIX instances read. +## SWAG to APISIX Load Distribution -## SWAG → APISIX load distribution +SWAG connects to APISIX through the service name: -SWAG connects to APISIX via `proxy_pass http://apisix:9080;` — using the service name directly. -No additional upstream or load balancer configuration is needed on the SWAG side. +```nginx +proxy_pass http://apisix:9080; +``` -**How it works:** Docker Swarm resolves the `apisix` service name to a VIP (Virtual IP). -Swarm's internal IPVS load balancer automatically distributes incoming connections across the 3 replicas -in round-robin. SWAG is unaware of this mechanism; it happens transparently at the overlay network layer. +Docker Swarm resolves `apisix` to the service VIP and distributes requests across APISIX replicas. SWAG does not need a separate upstream list for APISIX. ## Verification ```bash -# From a whitelisted IP, make a request and check real IP in APISIX logs docker exec $(docker ps -q -f name=iklimco_apisix | head -1) \ tail -5 /usr/local/apisix/logs/access.log ``` Client IP should appear in the log, not SWAG's internal overlay IP. + +## Historical / Superseded by Setup + +The old prod overlay command `docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco` is superseded by `docker-stack-infra_db-prod.yml`. 
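
When reading the access log in the verification step above, one quick way to spot a misconfigured trusted proxy is to flag entries whose client address is still a private (overlay) address. This is a minimal sketch, assuming the private ranges named in the earlier checklist (`10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`). `is_private_ip` is a hypothetical helper, not something shipped in the repo, and the sample addresses are made up.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the address falls inside one of the
# private ranges that APISIX is configured to trust as proxies.
is_private_ip() {
  case "$1" in
    10.*|192.168.*) return 0 ;;                              # 10.0.0.0/8, 192.168.0.0/16
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;       # 172.16.0.0/12
    *) return 1 ;;
  esac
}

# Sample addresses for illustration only.
for ip in 10.0.1.7 172.20.0.4 203.0.113.10; do
  if is_private_ip "$ip"; then
    echo "$ip is private: the proxy address may have leaked into the log"
  else
    echo "$ip is public: looks like a real client address"
  fi
done
```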
diff --git a/roadmap/prod-env/06-cert-reloader.md b/roadmap/prod-env/06-cert-reloader.md index 2ad2db5..d1c008a 100644 --- a/roadmap/prod-env/06-cert-reloader.md +++ b/roadmap/prod-env/06-cert-reloader.md @@ -1,61 +1,54 @@ -# 06 — cert-reloader Sidecar Service (Prod) +# 06 — Certificate Renewal and Vault Reload Flow (Prod) ## Context -Service definition is identical to test (see `test-env-setup/06-cert-reloader.md`). -In prod, Vault runs as a 3-node Raft cluster; cert distribution is handled via the StorageBox shared mount — no SSH required. -## Prod flow (3-node Vault Raft) +The production certificate flow is implemented in the current infra stack and setup runbooks. See `../../setup/09-prod-runner-ha-and-swarm.md`. -``` -SWAG renews cert → writes to SWAG_CONFIG_DIR (/mnt/storagebox/swag/config) -cert-reloader detects MD5 change - → copies to /mnt/storagebox/ssl/ (shared across all app nodes) - → docker service update --force iklimco_vault -Vault (3 replicas) restarts - → each instance has /mnt/storagebox/ssl/ mounted → reads the new cert - → healthcheck checks sealed status every 30 seconds - → if sealed: reads vault_unseal_key Docker secret and auto-unseals +## Current Flow + +```text +SWAG renews the certificate inside its persistent config volume +cert-reloader detects the MD5 change + -> copies STAR.iklim.co.full.crt and STAR.iklim.co_key.pem to /mnt/storagebox/ssl +cert-distributor syncs those files to /opt/iklimco/ssl on service nodes + -> forces iklimco_vault to restart +Vault reads /opt/iklimco/ssl through /vault/certs +Vault entrypoint retry-unseal loop reads vault_unseal_key and unseals each replica ``` -No SSH distribution, additional secrets, or cert-reloader script changes are needed. +No SSH certificate distribution is required in prod. 
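
The change-detection step in the flow above can be sketched as follows. This is not the repo's actual cert-reloader script, which is not shown here. `cert_changed` and the state-file approach are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch: report a change when the certificate's MD5 differs
# from the checksum recorded on the previous run.
cert_changed() {
  cert_file="$1"
  state_file="$2"
  new_sum="$(md5sum "$cert_file" | cut -d' ' -f1)"
  old_sum="$(cat "$state_file" 2>/dev/null || true)"   # empty on first run
  if [ "$new_sum" != "$old_sum" ]; then
    printf '%s\n' "$new_sum" > "$state_file"           # remember the new checksum
    return 0
  fi
  return 1
}

# Demo with a throwaway file instead of a real certificate.
demo_dir="$(mktemp -d)"
printf 'cert-v1\n' > "$demo_dir/test.crt"
cert_changed "$demo_dir/test.crt" "$demo_dir/test.md5" && echo "changed: copy and distribute"
cert_changed "$demo_dir/test.crt" "$demo_dir/test.md5" || echo "unchanged: nothing to do"
rm -rf "$demo_dir"
```

Storing the checksum rather than comparing timestamps means a renewal that rewrites identical bytes does not trigger a Vault restart.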
-## Auto-unseal mechanism +## Vault Unseal Model -The Vault healthcheck is already implemented in `docker-stack-infra.yml`: +Vault auto-unseal is not implemented as the old Docker healthcheck snippet in the prod roadmap anymore. The current `docker-stack-vault.yml` and Vault entrypoint logic handle retry-unseal with the `vault_unseal_key` Docker secret. -```yaml -healthcheck: - test: - - "CMD" - - "sh" - - "-c" - - >- - vault status -format=json 2>/dev/null | grep -q '"sealed":false' || - vault operator unseal $$(cat /run/secrets/vault_unseal_key 2>/dev/null) - interval: 30s - timeout: 10s - start_period: 15s - retries: 5 -``` - -Each Vault container runs its own healthcheck independently — all 3 replicas unseal separately. -The cert renewal → restart → auto-unseal chain requires no manual intervention. +The `vault_unseal_key` secret is created/rotated by `init/vault/vault-bootstrap.sh` during bootstrap. ## Verification ```bash docker service ps iklimco_cert-reloader +docker service ps iklimco_cert-distributor docker service logs iklimco_cert-reloader --tail 20 +docker service ps iklimco_vault ``` -Expected: `[cert-reloader] started`, no error lines. +Expected: + +- `cert-reloader` is running. +- `cert-distributor` is running. +- Vault service restarts cleanly after certificate renewal. +- Vault remains unsealed. + +Confirm Vault sees the current certificate: -Confirm Vault cert is current after SWAG renewal: ```bash -# Check cert expiry on Vault's TLS endpoint from inside the overlay -docker exec $(docker ps -q -f name=iklimco_vault) \ - sh -c 'echo | openssl s_client -connect vault.iklim.co:8200 2>/dev/null \ - | openssl x509 -noout -dates' +docker exec $(docker ps -q -f name=iklimco_vault | head -1) \ + sh -c 'echo | openssl s_client -connect vault.iklim.co:8200 2>/dev/null | openssl x509 -noout -dates' ``` -`notAfter` should match the cert in `/mnt/storagebox/ssl/STAR.iklim.co.full.crt`. 
+`notAfter` should match the certificate distributed through `/opt/iklimco/ssl`. + +## Historical / Superseded by Setup + +The earlier plan that said “service definition is identical to test” and relied on a Vault healthcheck command is superseded. Prod now has a separate Vault stack, cert-distributor, and retry-unseal behavior. diff --git a/roadmap/prod-env/08-deploy-pipeline-update.md b/roadmap/prod-env/08-deploy-pipeline-update.md index 366ebf8..91cc40a 100644 --- a/roadmap/prod-env/08-deploy-pipeline-update.md +++ b/roadmap/prod-env/08-deploy-pipeline-update.md @@ -1,315 +1,96 @@ -# 08 — Deploy Pipeline Update (Prod) +# 08 — Production Deploy Pipeline Model ## Context -- **File:** `.gitea/workflows/deploy-prod.yml` -- Same changes as test pipeline (`test-env-setup/07-deploy-pipeline-update.md`), - adapted for prod paths and prod runner. -- **Prod-specific differences from test:** - - `SPRING_PROFILES_ACTIVE=prod` (not `test`) in Run APISIX Init - - DB hostnames: `postgresql`, `mongodb` (Swarm overlay DNS — same as test) - - Storagebox paths via env vars (`SWAG_CERT_DIR`, `SWAG_CONFIG_DIR`, etc.) instead of local host paths - - Extra steps: Update DNS Records (GoDaddy API), Wait for etcd -## Step 1 — Remove manual cert scp lines from `Initialize Workspace` + +The production deploy pipeline is no longer a pending set of step additions. The current source of truth is the root `.gitea/workflows/deploy-prod.yml`, with the operational explanation in `../../setup/09-prod-runner-ha-and-swarm.md` and root `prod_env-ci_dc-pipeline.md`. 
-```yaml -# DELETE from "Initialize Servers" step: - scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebox.de:prod/app/iklim.co/ssl/STAR.iklim.co.full.crt ./STAR.iklim.co.full.crt - scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebox.de:prod/app/iklim.co/ssl/STAR.iklim.co_key.pem ./STAR.iklim.co_key.pem -``` +## Current Pipeline Order -Also remove from `Prepare Init Files`: -```yaml -# DELETE or make conditional: - sudo cp STAR.iklim.co.full.crt STAR.iklim.co_key.pem /opt/iklimco/ssl/ -``` +The current root production workflow runs in this order: -## Step 2 — Add `Update DNS Records` step +| # | Step | Note | +| --- | --- | --- | +| 1 | Checkout Branch | | +| 2 | Prepare Folders | | +| 3 | Set up SSH Key and Add to known_hosts | | +| 4 | Update Apt Repository and Install Required Tools | `gettext tree jq`; `jq` is required for the GoDaddy DNS API | +| 5 | Fetch Prod Env From Storagebox | Fetch `.env` and `.env.secrets.shared` | +| 6 | Fetch Service Secret Files | Fetch `.env.secrets.` and `.env.secrets.swag` | +| 7 | Prepare Database Init Files | Render PostgreSQL/MongoDB init templates | +| 8 | Docker Login to Harbor | | +| 9 | Prepare SWAG Directories | Render `dns-conf` and `site-confs`; reload node-local SWAG if present | +| 10 | Bootstrap Vault TLS Placeholder | Creates a temporary cert only if missing | +| 11 | Create Infrastructure Docker Secrets | Creates `rabbitmq_erlang_cookie` if missing | +| 12 | Deploy Swarm Stacks | Deploys `docker-stack-infra_db-prod.yml` | +| 13 | Connect Runner to Overlay Network | Connects the job container to `iklimco-net` | +| 14 | Initialize Production Infrastructure | Runs `init-infra-prod.sh`; this triggers Vault bootstrap and RabbitMQ setup | +| 15 | Wait for Infrastructure Services | Waits for `iklimco_vault` and `iklimco_rabbitmq` | +| 16 | Provision Vault AppRole IDs and Docker Secrets | Downloads service `vault-files`, runs `init/provision-all-services.sh` | +| 17 
| Upload Updated Secrets to Storagebox | Uploads `.env.secrets.*` and `.env` | +| 18 | Wait for etcd | Waits for etcd health | +| 19 | Run APISIX Init | `SPRING_PROFILES_ACTIVE=prod` | +| 20 | Bootstrap SWAG Certificate | Waits for SWAG and cert-reloader output in `SWAG_CERT_DIR` | +| 21 | Initialize MongoDB Replica Set | Runs `rs.initiate()` or missing-member `rs.add()` | +| 22 | Run Database Init Scripts | Patroni primary + MongoDB replica set; SQL and JS init | +| 23 | Update DNS Records | GoDaddy API; `api`, `apigw`, `rabbitmq`, and `grafana` A records | +| 24 | Review Environment | | -Insert **after** `Docker Login to Harbor` and **before** `Prepare SWAG Directories`. +All production deploy workflows must share `concurrency.group: prod-deploy` so infra and microservice deploys cannot overlap. -```yaml - - name: Update DNS Records - run: | - set -a; . ./.env; . ./.env.secrets.swag; set +a - FLOATING_IP="${{ vars.PROD_FLOATING_IP }}" - DOMAIN="iklim.co" +## Current Environment Files - for record in api apigw rabbitmq grafana; do - CURRENT=$(curl -s \ - -H "Authorization: sso-key ${GODADDY_KEY}:${GODADDY_SECRET}" \ - "https://api.godaddy.com/v1/domains/${DOMAIN}/records/A/${record}" \ - 2>/dev/null | jq -r '.[0].data // empty' 2>/dev/null || true) +The production StorageBox env file is `prod/secrets/iklim.co/.env`. The old `.env.prod` name is superseded. - if [ "$CURRENT" = "$FLOATING_IP" ]; then - echo "✅ ${record}.${DOMAIN} → ${FLOATING_IP} (exists, skipping)" - else - curl -sf -X PUT \ - -H "Authorization: sso-key ${GODADDY_KEY}:${GODADDY_SECRET}" \ - -H "Content-Type: application/json" \ - "https://api.godaddy.com/v1/domains/${DOMAIN}/records/A/${record}" \ - -d "[{\"data\":\"${FLOATING_IP}\",\"ttl\":600}]" - echo "✅ ${record}.${DOMAIN} → ${FLOATING_IP} (added/updated)" - fi - done - working-directory: /workspace/iklim.co -``` - -> `GODADDY_KEY` and `GODADDY_SECRET` are read from `.env.secrets.swag`. 
-> `PROD_FLOATING_IP` must be defined as a Gitea project variable (`terraform output prod_floating_ip`). -> `jq` is required — it must have been added to the `Update Apt Repository` step: `apt-get install -y gettext tree jq`. -> Runs on every deploy; existing and correct records are skipped (idempotent). - -## Step 3 — Add `Prepare SWAG Directories` step - -Insert **before** `Bootstrap Vault TLS Placeholder`: - -```yaml - - name: Prepare SWAG Directories - run: | - set -a; . ./.env; . ./.env.secrets.swag; set +a - - mkdir -p "$SWAG_CONFIG_DIR/dns-conf" "$SWAG_SITE_CONFS_DIR" - - envsubst < template/swag/dns-conf/godaddy.ini.tpl | docker run --rm -i \ - -v "${SWAG_CONFIG_DIR}/dns-conf:/output" \ - alpine sh -c "cat > /output/godaddy.ini && chmod 600 /output/godaddy.ini" - echo "✅ godaddy.ini written" - - export RESTRICTED_IPS_BLOCK="$(echo "$RESTRICTED_IPS" | tr ',' '\n' | sed 's|.*| allow &;|')" - - SWAG_VARS='${API_SUBDOMAIN}${APIGW_SUBDOMAIN}${GRAFANA_SUBDOMAIN}${RABBITMQ_SUBDOMAIN}${RESTRICTED_IPS_BLOCK}' - for tpl in template/swag/site-confs/*.conf.tpl; do - fname=$(basename "${tpl%.tpl}") - envsubst "$SWAG_VARS" < "$tpl" | docker run --rm -i \ - -v "${SWAG_SITE_CONFS_DIR}:/output" \ - alpine sh -c "cat > /output/${fname}" - echo "✅ ${fname}" - done - - cat template/swag/site-confs/default.conf | docker run --rm -i \ - -v "${SWAG_SITE_CONFS_DIR}:/output" \ - alpine sh -c "cat > /output/default.conf" - - echo "✅ SWAG directories ready" - - SWAG_CTR=$(docker ps -q -f name=iklimco_swag 2>/dev/null | head -1) - if [ -n "$SWAG_CTR" ]; then - docker exec "$SWAG_CTR" nginx -t && docker exec "$SWAG_CTR" nginx -s reload - echo "✅ SWAG nginx reloaded" - fi - working-directory: /workspace/iklim.co -``` - -> `.env` is sourced first so `API_SUBDOMAIN=api.iklim.co` (prod values) are used. -> Ensure these vars are in `prod/secrets/iklim.co/.env.prod` on storagebox. 
- -## Step 4 — Add `Wait for etcd` step - -Insert **after** `Deploy Swarm Stack` and **before** `Run APISIX Init`. -APISIX reads its entire configuration from etcd; init script will fail silently if etcd is not ready. - -```yaml - - name: Wait for etcd - run: | - echo "⏳ Waiting for Patroni etcd..." - for i in $(seq 1 30); do - if docker run --rm --network iklimco-net alpine \ - sh -c "wget -qO- http://etcd:2379/health 2>/dev/null | grep -q '\"health\":\"true\"'"; then - echo "✅ Patroni etcd ready" - break - fi - [ "$i" -eq 30 ] && echo "❌ Patroni etcd did not become ready in time" && exit 1 - echo " attempt $i/30 — waiting 5s..." - sleep 5 - done -``` - -> **Note:** In prod, APISIX uses the 3-node Patroni etcd cluster on DB nodes (`etcd/02/03:2379`) via the `/apisix` prefix — resolved through `iklimco-net` overlay DNS aliases defined in `docker-stack-db.prod.yml`. The standalone `etcd` service from the base stack is disabled (`replicas: 0` in the prod overlay) and removed from the service list by a post-deploy step. This step waits for Patroni etcd (`etcd:2379`) to be healthy before running the APISIX init script. - -## Step 5 — Add `Run APISIX Init` step - -Insert **after** `Wait for etcd` and **before** `Bootstrap SWAG Certificate`. - -```yaml - - name: Run APISIX Init - run: | - set -a; . ./.env; . ./.env.secrets.shared; set +a - echo "⏳ Waiting for Swarm APISIX..." - until curl -sf -o /dev/null \ - -H "X-API-KEY: ${APISIX_ADMIN_KEY}" \ - "http://apisix:9180/apisix/admin/upstreams" 2>/dev/null; do - sleep 5 - done - export SPRING_PROFILES_ACTIVE=prod - /bin/bash init/apisix-core/init.sh - echo "✅ APISIX routes configured" - working-directory: /workspace/iklim.co -``` - -> **Prod-specific:** `SPRING_PROFILES_ACTIVE=prod` — test pipeline uses `test`. -> `APISIX_ADMIN_KEY` is sourced from `.env.secrets.shared`. -> The init script is idempotent (PUT semantics); safe to re-run on subsequent deploys. 
-> With `replicas: 3` in prod, all APISIX instances read the same etcd state — no per-replica init needed. - -## Step 6 — Add `Bootstrap SWAG Certificate` step - -Insert **after** `Run APISIX Init`: - -```yaml - - name: Bootstrap SWAG Certificate - run: | - set -a; . ./.env; set +a - echo "Waiting for SWAG container to start..." - SWAG_CTR="" - for i in $(seq 1 24); do - SWAG_CTR=$(docker ps -q -f name=iklimco_swag 2>/dev/null | head -1) - [ -n "$SWAG_CTR" ] && break - sleep 10 - done - - if [ -z "$SWAG_CTR" ]; then - echo "❌ SWAG container did not start" - exit 1 - fi - - CERT_PATH="/config/etc/letsencrypt/live/iklim.co/fullchain.pem" - echo "Waiting for cert (up to 10 min)..." - for i in $(seq 1 20); do - if docker exec "$SWAG_CTR" test -f "$CERT_PATH" 2>/dev/null; then - echo "✅ Cert obtained" - break - fi - echo " attempt $i/20 — waiting 30s..." - sleep 30 - done - - if ! docker exec "$SWAG_CTR" test -f "$CERT_PATH" 2>/dev/null; then - echo "❌ SWAG did not obtain cert. Logs:" - docker service logs iklimco_swag --tail 50 - exit 1 - fi - - docker exec "$SWAG_CTR" cat "$CERT_PATH" | \ - docker run --rm -i -v "${SWAG_CERT_DIR}:/output" alpine \ - sh -c "cat > /output/STAR.iklim.co.full.crt && chmod 644 /output/STAR.iklim.co.full.crt" - docker exec "$SWAG_CTR" cat "/config/etc/letsencrypt/live/iklim.co/privkey.pem" | \ - docker run --rm -i -v "${SWAG_CERT_DIR}:/output" alpine \ - sh -c "cat > /output/STAR.iklim.co_key.pem && chmod 644 /output/STAR.iklim.co_key.pem" - echo "✅ Cert bootstrapped to ${SWAG_CERT_DIR}/" - working-directory: /workspace/iklim.co -``` - -## Step 7 — Add `Run Database Init Scripts` step - -Insert **after** `Bootstrap SWAG Certificate` and **before** `Review Environment`. - -```yaml - - name: Run Database Init Scripts - run: | - set -a; . ./.env; . ./.env.secrets.shared; set +a - - echo "⏳ Waiting for PostgreSQL..." 
- until docker run --rm --network iklimco-net \ - -e PGPASSWORD="${DATABASE_POSTGRES_ROOT_PASSWD}" \ - postgis/postgis:18-3.6 \ - pg_isready -h postgresql -U "${DATABASE_POSTGRES_ROOT_USER}" -q 2>/dev/null; do - sleep 5 - done - for sql_file in $(ls ./init/postgresql/*.sql 2>/dev/null | sort); do - echo "▶ $(basename "$sql_file")" - docker run --rm -i --network iklimco-net \ - -e PGPASSWORD="${DATABASE_POSTGRES_ROOT_PASSWD}" \ - postgis/postgis:18-3.6 \ - psql -h postgresql -U "${DATABASE_POSTGRES_ROOT_USER}" < "$sql_file" - done - - echo "⏳ Waiting for MongoDB..." - until docker run --rm --network iklimco-net mongo:8.3.2 \ - mongosh "mongodb://${DATABASE_MONGODB_ROOT_USER}:${DATABASE_MONGODB_ROOT_PASSWD}@mongodb/admin" \ - --eval "db.runCommand({ping:1})" --quiet 2>/dev/null; do - sleep 5 - done - for js_file in $(ls ./init/mongodb/*.js 2>/dev/null | sort); do - echo "▶ $(basename "$js_file")" - docker run --rm -i --network iklimco-net mongo:8.3.2 \ - mongosh "mongodb://${DATABASE_MONGODB_ROOT_USER}:${DATABASE_MONGODB_ROOT_PASSWD}@mongodb/admin" \ - --quiet < "$js_file" - done - echo "✅ Database init scripts completed" - working-directory: /workspace/iklim.co -``` - -> **Prod-specific:** DB hostnames are `postgresql` and `mongodb` (Swarm VIP service names). -> Test pipeline uses `postgresql` / `mongodb` (unqualified aliases within the same stack). -> SQL and JS files are generated by `Prepare Init Files` step via `init_postgresql` / `init_mongodb` functions in `common-functions-prod.sh`. -> Step is idempotent — scripts use `CREATE IF NOT EXISTS` / `createCollection` semantics. - -## Step 8 — Microservice prod deploy overlay - -Each microservice has its own `docker-stack-service.prod.yml` overlay file. This file contains prod-specific `replicas: 3` and `max_replicas_per_node: 1` settings. 
-
-In microservice deploy pipelines (`deploy-prod.yml`), the `docker stack deploy` command should be:
+Current SWAG-related variables include:

 ```bash
-docker stack deploy \
-  -c BE-/docker-stack-service.yml \
-  -c BE-/docker-stack-service.prod.yml \
-  iklimco
+SWAG_CERT_DIR=/mnt/storagebox/ssl
+SWAG_DNS_CONFIG_DIR=/mnt/storagebox/swag/dns-conf
+SWAG_SITE_CONFS_DIR=/mnt/storagebox/swag/site-confs
+SWAG_PROXY_CONFS_DIR=/mnt/storagebox/swag/proxy-confs
 ```

-For example, for `BE-Authentication`:
+## Current Stack Deployment
+
+The pipeline deploys the current production infra/DB stack:

 ```bash
-docker stack deploy \
-  -c BE-Authentication/docker-stack-service.yml \
-  -c BE-Authentication/docker-stack-service.prod.yml \
-  iklimco
+docker stack deploy --with-registry-auth -c docker-stack-infra_db-prod.yml iklimco
 ```

-> When a new microservice is added, `BE-/docker-stack-service.prod.yml` must be created and the pipeline must include this overlay.
-
-## Step 9 — Ensure subdomain env vars are in prod `.env`
-
-Add to `prod/secrets/iklim.co/.env.prod` on storagebox:
+Vault is not part of that stack. Vault is deployed and bootstrapped by `init/vault/vault-bootstrap.sh` through `init-infra-prod.sh` using:

 ```bash
-API_SUBDOMAIN=api.iklim.co
-APIGW_SUBDOMAIN=apigw.iklim.co
-RABBITMQ_SUBDOMAIN=rabbitmq.iklim.co
-GRAFANA_SUBDOMAIN=grafana.iklim.co
+docker stack deploy --with-registry-auth -c docker-stack-vault.yml iklimco
 ```

-## Step 10 — Final step order for prod pipeline
+## Database Initialization

-To prevent concurrent deploys, a Gitea Actions `concurrency` block is added per pipeline:
+MongoDB replica set initialization is a dedicated workflow step. It runs `rs.initiate()` when the replica set is uninitialized and `rs.add()` when members from `DATABASE_MONGODB_HOST` are missing.

-```yaml
-concurrency:
-  group: prod-deploy
-  cancel-in-progress: false
-```
+Database init scripts run after Patroni primary and MongoDB replica set readiness.
PostgreSQL uses the multi-host Patroni connection with `target_session_attrs=read-write`; MongoDB uses the replica set host list from `DATABASE_MONGODB_HOST`.

-With `cancel-in-progress: false`, a new run waits in the queue until the previous one finishes; Gitea UI shows it as "queued" and does not return an error.
+## Microservice Deploy Model

-1. Checkout Branch
-2. Prepare Folders
-3. Set up SSH Key and Add to known_hosts
-4. Update Apt Repository and Install Required Tools (`gettext tree jq`)
-5. Fetch Service Secret Files
-6. Initialize Workspace ← cert scp lines removed
-7. Upload Updated Secrets to Storagebox
-8. Provision Vault AppRole IDs and Docker Secrets
-9. Upload Updated Env to Storagebox
-10. Prepare Init Files ← cert copy lines removed
-11. Initialize Docker Swarm
-12. Docker Login to Harbor
-13. **Update DNS Records** ← NEW (GoDaddy API, idempotent)
-14. **Prepare SWAG Directories** ← NEW (`$SWAG_CONFIG_DIR/dns-conf`; renders nginx conf templates)
-15. Bootstrap Vault TLS Placeholder
-16. Deploy Swarm Stack
-17. **Wait for etcd** ← NEW (Patroni etcd `etcd:2379` overlay DNS)
-18. **Run APISIX Init** ← NEW (`SPRING_PROFILES_ACTIVE=prod`)
-19. **Bootstrap SWAG Certificate** ← NEW
-20. **Run Database Init Scripts** ← NEW (`postgresql`, `mongodb`)
-21. Review Environment
+Prod microservice workflows do not use a separate `docker-stack-service.prod.yml` overlay anymore.
+
+The current model is:
+
+- read `deploy/prod.env`;
+- promote the tested Harbor digest to the stable prod tag;
+- call `swarm_service_update` with `deploy/docker-stack-service.yml`;
+- use `docker service update` with `--update-order start-first` and rollback behavior for existing services.
+
+## Historical / Superseded by Setup
+
+The following earlier plan items are superseded:
+
+- Removing cert `scp` lines from an `Initialize Workspace` step as a live action; those lines are already gone.
+- Creating prod deploy steps around `docker-stack-infra.yml` + `docker-stack-infra.prod.yml`.
+- Waiting for a legacy `etcd:2379` service from a base stack.
+- Using `docker-stack-db.prod.yml` as the DB stack reference.
+- Writing SWAG DNS files through `SWAG_CONFIG_DIR/dns-conf`.
+- Storing prod env in `prod/secrets/iklim.co/.env.prod`.
+- Deploying microservices with `docker-stack-service.yml` plus `docker-stack-service.prod.yml`.
+
+Keep this file as a roadmap summary. For exact commands, use the root workflow and `../../setup/09-prod-runner-ha-and-swarm.md`.
diff --git a/roadmap/prod-env/09-verify.md b/roadmap/prod-env/09-verify.md
index 8e116c4..a144dc5 100644
--- a/roadmap/prod-env/09-verify.md
+++ b/roadmap/prod-env/09-verify.md
@@ -1,147 +1,158 @@
 # 09 — Verification Checklist (Prod)

 ## Context
-Run after a successful prod pipeline deployment.
-## 1 — Swarm cluster health
+
+Run these checks after a successful production pipeline deployment. The current setup source is `../../setup/09-prod-runner-ha-and-swarm.md`.
+
+## 1 — Swarm Cluster Health

 ```bash
 docker node ls
 ```
-Expected: 3 managers (`Leader` + 2 `Reachable`) for `iklim-app-01/02/03`, 3 workers (`Ready`) for `iklim-db-01/02/03`.
+
+Expected: 3 managers (`Leader` + 2 `Reachable`) for `iklim-app-01/02/03`, and 3 workers (`Ready`) for `iklim-db-01/02/03`.
+
+```bash
+docker node inspect iklim-app-01 --format '{{.Spec.Labels}}'
+docker node inspect iklim-db-01 --format '{{.Spec.Labels}}'
+```
+
+Expected: app nodes have `type=service`; DB nodes have `role=db` and `db-index=01/02/03`.
+
+## 2 — Infra, DB, and Vault Services

 ```bash
 docker service ls --filter label=project=co.iklim
+docker service ps iklimco_vault
+docker service ps iklimco_rabbitmq
+docker service ps iklimco_apisix
 ```
-All services show `REPLICAS X/X` (target met).

-## 2 — Precipitation image directory exists
+Expected: all current services show their desired replica counts.
+
+Vault is deployed by `docker-stack-vault.yml`; the main infra and DB services are deployed by `docker-stack-infra_db-prod.yml`.
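
The "desired replica counts" expectation can also be checked mechanically by comparing both sides of the `REPLICAS` column. The following is a minimal sketch, not part of the roadmap itself: the helper name `unmet_replicas` is hypothetical, and it assumes input lines shaped like `name current/desired`.

```bash
# Hypothetical helper (not from the repo): reads "name current/desired" lines
# on stdin and prints every service whose running task count differs from the
# desired count. Extra columns such as "(max 1 per node)" are ignored.
unmet_replicas() {
  local name replicas _rest
  while read -r name replicas _rest; do
    [ -z "$name" ] && continue
    if [ "${replicas%/*}" != "${replicas#*/}" ]; then
      printf '%s %s\n' "$name" "$replicas"
    fi
  done
}
```

A possible invocation is `docker service ls --filter label=project=co.iklim --format '{{.Name}} {{.Replicas}}' | unmet_replicas`; empty output would mean every listed service has met its target.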
+
+## 3 — DB Node Placement

 ```bash
-ls -ld /mnt/storagebox/precipitation/images
+docker service ps iklimco_patroni-01
+docker service ps iklimco_patroni-02
+docker service ps iklimco_patroni-03
+docker service ps iklimco_mongodb-01
+docker service ps iklimco_mongodb-02
+docker service ps iklimco_mongodb-03
+docker service ps iklimco_etcd-01
+docker service ps iklimco_etcd-02
+docker service ps iklimco_etcd-03
 ```

-Expected: directory exists. This must be created before `iklimco_precipitation-service` is deployed.
+Expected: tasks run on their matching `iklim-db-0X` hostnames according to the stack placement constraints.
+
+## 4 — Service-Node Infrastructure Placement

 ```bash
-docker volume inspect iklimco_image-data
+docker service ps iklimco_redis
+docker service ps iklimco_redis-sentinel
+docker service ps iklimco_rabbitmq
+docker service ps iklimco_swag
+docker service ps iklimco_cert-reloader
+docker service ps iklimco_cert-distributor
 ```

-Expected: `Options.device` is `/mnt/storagebox/precipitation/images`.
+Expected: Redis, Sentinel, RabbitMQ, SWAG, and cert services run on app/service nodes, not DB nodes.

-## 3 — SWAG cert is valid
+## 5 — SWAG Certificate Is Valid

 ```bash
-docker exec $(docker ps -q -f name=iklimco_swag) certbot certificates
+docker exec $(docker ps -q -f name=iklimco_swag | head -1) certbot certificates
 ```
-Expected: `*.iklim.co`, `VALID: XX days` (Let's Encrypt, not the old manual cert).
+
+Expected: certificate for `*.iklim.co`, valid and issued by Let's Encrypt.

 TLS check from outside:
+
 ```bash
 echo | openssl s_client -connect api.iklim.co:443 -servername api.iklim.co 2>/dev/null \
   | openssl x509 -noout -subject -dates
 ```
-Expected: `CN=*.iklim.co`, `notAfter` > 2026-07-15 (cert is Let's Encrypt, not expiring old one).

-## 4 — Public API
+Expected: `CN=*.iklim.co` and a current `notAfter` date.
+
+## 6 — Public API and Restricted Subdomains

 ```bash
 curl -si https://api.iklim.co/health
 ```
-HTTP 2xx, no TLS errors.
-## 5 — IP restriction working
+Expected: HTTP 2xx or an APISIX response, with no TLS error.

 From a non-whitelisted IP:
+
 ```bash
 curl -si https://grafana.iklim.co
 curl -si https://apigw.iklim.co
 curl -si https://rabbitmq.iklim.co
 ```
-All expected: HTTP 403.

-From whitelisted IP (78.187.87.109 or 95.70.151.248):
+Expected: HTTP 403.
+
+From a whitelisted IP:
+
 ```bash
-curl -si https://grafana.iklim.co # HTTP 200 Grafana
-curl -si https://apigw.iklim.co # HTTP 200 APISIX Dashboard
-curl -si https://rabbitmq.iklim.co # HTTP 200 RabbitMQ Management
+curl -si https://grafana.iklim.co
+curl -si https://apigw.iklim.co
+curl -si https://rabbitmq.iklim.co
 ```

-## 6 — Vault not reachable externally
+Expected: HTTP 200 or the expected login/management page.
+
+## 7 — Vault Is Not Publicly Reachable
+
+From outside:

 ```bash
-# From outside — must fail
 curl -sk --connect-timeout 5 https://:8200/v1/sys/health
-# Expected: connection refused or timeout
 ```

+Expected: connection refused or timeout.
+
+From inside overlay:
+
 ```bash
-# From inside overlay — must succeed
 docker exec $(docker ps -q -f name=iklimco_apisix | head -1) \
   curl -sk https://vault.iklim.co:8200/v1/sys/health
-# Expected: {"sealed":false,...}
 ```

-## 7 — cert-reloader watching
+Expected: JSON response with `"sealed":false`.
+
+## 8 — Certificate Reload Chain

 ```bash
-docker service logs iklimco_cert-reloader --tail 5
+docker service logs iklimco_cert-reloader --tail 10
+docker service ps iklimco_cert-distributor
+docker exec $(docker ps -q -f name=iklimco_vault | head -1) ls /vault/certs/
 ```
-Expected: `[cert-reloader] started`, no errors.

-## 8 — No unexpected published ports
+Expected: cert-reloader has no errors, cert-distributor is running, and Vault sees `STAR.iklim.co.full.crt` plus `STAR.iklim.co_key.pem`.
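
The last expectation (both certificate files visible to Vault) can be turned into a pass/fail check. This is a rough sketch under assumptions: the helper name `has_vault_certs` is hypothetical, and it expects a plain one-name-per-line listing on stdin, such as the output of the `ls /vault/certs/` command above.

```bash
# Hypothetical helper (not from the repo): succeeds only if both certificate
# file names from the checklist appear as whole lines in the listing on stdin.
has_vault_certs() {
  local listing
  listing="$(cat)"
  printf '%s\n' "$listing" | grep -qxF 'STAR.iklim.co.full.crt' &&
    printf '%s\n' "$listing" | grep -qxF 'STAR.iklim.co_key.pem'
}
```

Piping the `docker exec ... ls /vault/certs/` output into `has_vault_certs` would then give an exit status that a script can test directly.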
+
+## 9 — No Unexpected Published Ports

 ```bash
-docker service ls --format "{{.Name}}\t{{.Ports}}" \
-  --filter label=project=co.iklim
-```
-Only `iklimco_swag` should show `*:80->80/tcp, *:443->443/tcp`.
-
-## 9 — DB nodes running correct services
-
-```bash
-# Patroni (PostgreSQL HA) stack
-docker stack services iklim-patroni
-docker service ps iklim-patroni_patroni-01
-docker service ps iklim-patroni_patroni-02
-docker service ps iklim-patroni_patroni-03
-
-# etcd cluster (for Patroni)
-docker stack services iklim-etcd
-
-# MongoDB replica set
-docker stack services iklimco
-docker service ps iklimco_mongodb-01
-docker service ps iklimco_mongodb-02
-docker service ps iklimco_mongodb-03
+docker service ls --format "{{.Name}}\t{{.Ports}}" --filter label=project=co.iklim
 ```

-All tasks should show node names matching `iklim-db-01`, `iklim-db-02`, or `iklim-db-03` with placement constraint `role=db`.
+Expected: only services intentionally published by the stack expose ports. Redis and RabbitMQ must not appear as DB-node host-mode services.

-## 10 — APISIX replicas
+## 10 — Microservice Health

-```bash
-docker service ps iklimco_apisix
-```
-Expected: 3 tasks, all `Running`, on different nodes.
+After microservices are deployed by their separate production workflows:

-## 11 — fail2ban active
-
-```bash
-docker exec $(docker ps -q -f name=iklimco_swag) fail2ban-client status
-```
-Expected: multiple jails listed.
-
-## 12 — Microservice health (post-deploy)
-
-After microservices are deployed (separate pipeline), verify via the public API:
 ```bash
 curl -si https://api.iklim.co/v1/weather/current?lat=39&lon=35
 ```
-Expected: valid JSON weather response.

-## ⚠️ Old cert expiry reminder
-The manually managed `*.iklim.co` cert expires **2026-07-15**.
-SWAG's Let's Encrypt cert auto-renews every ~60 days.
-After first SWAG cert is confirmed valid, the manual cert in storagebox can be archived
-and is no longer used.
+Expected: valid JSON response.
+
+## Historical / Superseded by Setup
+
+Older verification snippets that used `iklim-patroni`, `iklim-etcd`, or separate DB stack names are superseded. Current prod DB services are part of the `iklimco` stack deployed from `docker-stack-infra_db-prod.yml`.
diff --git a/roadmap/test-env/03-infra-stack-changes.md b/roadmap/test-env/03-infra-stack-changes.md
index f71a47e..63cad85 100644
--- a/roadmap/test-env/03-infra-stack-changes.md
+++ b/roadmap/test-env/03-infra-stack-changes.md
@@ -2,9 +2,7 @@

 ## Context
 - **File:** `docker-stack-infra.yml` (repo root)
-- **Goal:** Add SWAG as TLS-terminating reverse proxy; remove all published ports from internal
-  services (they become reachable only via SWAG through the `iklimco-net` overlay network);
-  remove Vault's external port entirely.
+- **Goal:** Add SWAG as TLS-terminating reverse proxy; remove all published ports from internal services (they become reachable only via SWAG through the `iklimco-net` overlay network); remove Vault's external port entirely.

 ## Changes Summary

@@ -46,7 +44,7 @@ Add after the `apisix-dashboard` service block:
       - DNSPROPAGATION=90
     volumes:
       - ${SWAG_CONFIG_DIR:-swag-vl}:/config
-      - ${SWAG_DNS_CONF_DIR:-/opt/iklimco/swag/dns-conf}:/config/dns-conf
+      - ${SWAG_DNS_CONFIG_DIR:-/opt/iklimco/swag/dns-conf}:/config/dns-conf
       - ${SWAG_SITE_CONFS_DIR:-/opt/iklimco/swag/site-confs}:/config/nginx/site-confs
     ports:
       - target: 80
@@ -130,8 +128,7 @@ Find the `vault` service `ports:` block and **delete it entirely**:
         mode: host
 ```

-Vault remains reachable within `iklimco-net` via the overlay alias `vault.iklim.co:8200`.
-The `VAULT_LOCAL_CONFIG` `api_addr` and `networks.default.aliases` entries stay unchanged.
+Vault remains reachable within `iklimco-net` via the overlay alias `vault.iklim.co:8200`. The `VAULT_LOCAL_CONFIG` `api_addr` and `networks.default.aliases` entries stay unchanged.
 ## Step 4 — Remove `apisix` published ports

@@ -154,8 +151,7 @@ Find the `apisix` service `ports:` block and **delete it entirely**:
         mode: host
 ```

-APISIX admin API (9180) access: use `docker exec` or SSH tunnel.
-APISIX is reachable from SWAG via `http://apisix:9080` on the overlay network.
+APISIX admin API (9180) access: use `docker exec` or SSH tunnel. APISIX is reachable from SWAG via `http://apisix:9080` on the overlay network.

 ## Step 5 — Remove `apisix-dashboard` published port

diff --git a/roadmap/test-env/06-cert-reloader.md b/roadmap/test-env/06-cert-reloader.md
index 67299f2..b015aac 100644
--- a/roadmap/test-env/06-cert-reloader.md
+++ b/roadmap/test-env/06-cert-reloader.md
@@ -1,10 +1,8 @@
 # 06 — cert-reloader Sidecar Service (Test)

 ## Context
-- **Purpose:** Watches SWAG's certificate volume for changes; copies renewed certs to
-  `/opt/iklimco/ssl/` on the host; forces Vault to reload its TLS cert.
-- **Replaces:** `ops/vault-reload-after-swag-renewal.sh` (which was designed for manual use).
-  The sidecar automates this after every SWAG renewal.
+- **Purpose:** Watches SWAG's certificate volume for changes; copies renewed certs to `/opt/iklimco/ssl/` on the host; forces Vault to reload its TLS cert.
+- **Replaces:** `ops/vault-reload-after-swag-renewal.sh` (which was designed for manual use). The sidecar automates this after every SWAG renewal.
 - **Runs on:** manager node (same node as SWAG and Vault, ensuring volume + socket access).

 ## How it works
@@ -22,16 +20,13 @@ Vault restarts

 ## Step 1 — Service definition (already in `03-infra-stack-changes.md`)

-The `cert-reloader` service is added to `docker-stack-infra.yml` as documented in step 03.
-No separate action needed here beyond that file change.
+The `cert-reloader` service is added to `docker-stack-infra.yml` as documented in step 03. No separate action needed here beyond that file change.
 ## Step 2 — Ensure `/opt/iklimco/ssl/` exists on the host

-The `Prepare Init Files` step in the pipeline already creates this directory and copies
-the initial cert. The cert-reloader handles subsequent renewals.
+The `Prepare Init Files` step in the pipeline already creates this directory and copies the initial cert. The cert-reloader handles subsequent renewals.

-On first deploy, the bootstrap cert (copied during pipeline init) is used until SWAG
-obtains its first Let's Encrypt cert (see `07-deploy-pipeline-update.md`).
+On first deploy, the bootstrap cert (copied during pipeline init) is used until SWAG obtains its first Let's Encrypt cert (see `07-deploy-pipeline-update.md`).

 ## Step 3 — Verify cert-reloader is running

@@ -65,15 +60,9 @@ fi
 ```

 ## Notes
-- Docker socket (`/var/run/docker.sock`) is mounted into cert-reloader — this is intentional
-  and necessary. The service is pinned to manager and is minimal (`docker:27-cli` image).
-- cert-reloader checks every 3600s (1 hour). Let's Encrypt certs renew every ~60 days;
-  the 1-hour check window is more than sufficient.
-- If Vault restarts (due to cert reload), it may need to be **unsealed** automatically.
-  Vault's healthcheck in `docker-stack-infra.yml` already handles auto-unseal via the
-  `vault_unseal_key` Docker secret. Verify this works after a cert reload.
+- Docker socket (`/var/run/docker.sock`) is mounted into cert-reloader — this is intentional and necessary. The service is pinned to manager and is minimal (`docker:27-cli` image).
+- cert-reloader checks every 3600s (1 hour). Let's Encrypt certs renew every ~60 days; the 1-hour check window is more than sufficient.
+- If Vault restarts (due to cert reload), it may need to be **unsealed** automatically. Vault's healthcheck in `docker-stack-infra.yml` already handles auto-unseal via the `vault_unseal_key` Docker secret. Verify this works after a cert reload.
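
The "verify this works after a cert reload" note can be approximated by inspecting Vault's health response for the sealed flag. The sketch below is an illustration only: the helper name `vault_is_unsealed` is hypothetical, and it assumes the health JSON arrives on stdin.

```bash
# Hypothetical helper (not from the repo): succeeds only if the Vault health
# JSON on stdin reports "sealed":false. Whitespace is stripped first so that
# both compact and pretty-printed JSON match.
vault_is_unsealed() {
  tr -d '[:space:]' | grep -q '"sealed":false'
}
```

One way to feed it, assuming the test environment exposes the same overlay alias and an `iklimco_apisix` container to exec into, is `docker exec $(docker ps -q -f name=iklimco_apisix | head -1) curl -sk https://vault.iklim.co:8200/v1/sys/health | vault_is_unsealed`.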
 ## Future — Multi-node Vault (prod)
-When Vault runs as a 3-node Raft cluster on different physical machines,
-cert-reloader must also SSH-copy the cert to the other nodes' `/opt/iklimco/ssl/`.
-This is handled in `prod-env-setup/06-cert-reloader.md`.
+Production no longer requires SSH-copy based certificate distribution. The current prod model uses StorageBox plus `cert-distributor` to sync certificates to `/opt/iklimco/ssl` on service nodes. See `../prod-env/06-cert-reloader.md`.
diff --git a/roadmap/test-env/07-deploy-pipeline-update.md b/roadmap/test-env/07-deploy-pipeline-update.md
index 19e0301..3f1fc99 100644
--- a/roadmap/test-env/07-deploy-pipeline-update.md
+++ b/roadmap/test-env/07-deploy-pipeline-update.md
@@ -19,8 +19,7 @@
           scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebox.de:test/app/iklim.co/ssl/STAR.iklim.co_key.pem ./STAR.iklim.co_key.pem
 ```

-Also remove any references to `STAR.iklim.co.full.crt` and `STAR.iklim.co_key.pem` in
-the `Prepare Init Files` step's `sudo cp` commands:
+Also remove any references to `STAR.iklim.co.full.crt` and `STAR.iklim.co_key.pem` in the `Prepare Init Files` step's `sudo cp` commands:

 ```yaml
           # DELETE or make conditional:
@@ -78,8 +77,7 @@ Insert this step **before** `Deploy Swarm Stack`:

 ## Step 3 — Add `Bootstrap SWAG Certificate` step

-Insert this step **after** `Deploy Swarm Stack` and **before** any step that depends on
-Vault being accessible (e.g., `Provision Vault AppRole IDs`):
+Insert this step **after** `Deploy Swarm Stack` and **before** any step that depends on Vault being accessible (e.g., `Provision Vault AppRole IDs`):

 ```yaml
       - name: Bootstrap SWAG Certificate
@@ -163,11 +161,6 @@ Final step order in the pipeline:
 > move step 16 before step 8. Adjust based on observed behavior.

 ## Notes
-- `.env` must contain the subdomain env vars added in step 04. Add them to storagebox
-  `test/secrets/iklim.co/.env` before the first deploy.
-- `RESTRICTED_IP_1` and `RESTRICTED_IP_2` are hardcoded in the pipeline step above.
-  Move to `.env` if they change often.
-- Precipitation service expects its image-data bind mount at
-  `/mnt/storagebox/precipitation/images`. This directory is provisioned by the
-  test Ansible bootstrap through `storagebox_managed_directories`; do not rely on
-  the deploy pipeline to create it.
+- `.env` must contain the subdomain env vars added in step 04. Add them to storagebox `test/secrets/iklim.co/.env` before the first deploy.
+- `RESTRICTED_IPS` should be kept as a comma-separated CIDR list in `.env`, then rendered into nginx `allow` directives by the pipeline.
+- Precipitation service expects its image-data bind mount at `/mnt/storagebox/precipitation/images`. This directory is provisioned by the test Ansible bootstrap through `storagebox_managed_directories`; do not rely on the deploy pipeline to create it.
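
The `RESTRICTED_IPS` note describes a rendering step without showing it. The following is one possible sketch, not the pipeline's actual implementation: the function name `render_allow_directives` and the closing `deny all;` line are assumptions for illustration, and the CIDR values in the usage line are documentation-range placeholders.

```bash
# Hypothetical helper (not from the repo): turns a comma-separated CIDR list
# into nginx "allow" lines, followed by an assumed catch-all "deny all;".
render_allow_directives() {
  local list="$1" entry
  local IFS=','
  for entry in $list; do
    # Trim leading and trailing whitespace around each entry.
    entry="${entry#"${entry%%[![:space:]]*}"}"
    entry="${entry%"${entry##*[![:space:]]}"}"
    [ -n "$entry" ] && printf 'allow %s;\n' "$entry"
  done
  printf 'deny all;\n'
}
```

For example, `render_allow_directives "192.0.2.0/24, 198.51.100.7/32"` would emit two `allow` lines and the final `deny all;`, which a pipeline step could write into an nginx include file.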