Environment_Infrastructure/setup/09-prod-runner-ha-and-swarm.md
Murat ÖZDEMİR 67dc2986dd docs(infra): restructure and update infrastructure setup documentation
2026-06-15 16:42:18 +03:00

# 09 - Prod Runner HA and Swarm Deploy Model
The purpose of this phase is to set up Gitea Actions runners in prod so they run in HA mode and define the prerequisites for distributing services across 3 nodes on Swarm.
## Runner Count
A single runner is functionally enough, but it is not HA. Because the prod target is HA, `act_runner` will be installed as a systemd service on all 3 Swarm manager nodes:
| Host | Runner |
| --- | --- |
| `iklim-app-01` | `act_runner` systemd |
| `iklim-app-02` | `act_runner` systemd |
| `iklim-app-03` | `act_runner` systemd |
In this model, if any manager/runner is lost, the other runners can pick up pipeline jobs.
## Runner Installation Model
The runner will not run as a Docker container. It runs as a systemd service on the app nodes. Job containers start on Docker `bridge`; deploy workflows connect the job container to `iklimco-net` after the stack creates that network.
Installed components:
- `gitea-runner` system user
- `/usr/local/bin/act_runner`
- `/etc/gitea-act-runner/config.yaml`
- `/var/lib/gitea-runner`
- `gitea-act-runner.service`
If runner jobs use Docker CLI for deploy, the `gitea-runner` user needs access to the Docker daemon. Docker group membership is considered close to root-level permission; only trusted repos/jobs should use these runner labels.
## Runner Label Policy
Shared labels on all prod runners:
```text
prod-runner:docker://catthehacker/ubuntu:act-22.04
ubuntu-24.04
```
Node-specific labels (hostname of each app node):
```text
iklim-app-01
iklim-app-02
iklim-app-03
```
If existing prod workflows use `runs-on: prod-runner`, any of the 3 runners can pick up the job. If pinning to a specific node is required, use a node-specific label.
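As a rough sketch, the label list for each runner could be assembled as shown below. The helper name `runner_labels` is hypothetical; the label values are the shared and node-specific labels listed above.

```shell
#!/bin/sh
# Hypothetical helper: build the comma-separated label list for one prod runner.
# The shared labels come from the policy above; the last label is the hostname.
runner_labels() {
    host="$1"
    shared="prod-runner:docker://catthehacker/ubuntu:act-22.04,ubuntu-24.04"
    printf '%s,%s\n' "$shared" "$host"
}

# Print the label list for each app node.
for node in iklim-app-01 iklim-app-02 iklim-app-03; do
    runner_labels "$node"
done
```

The resulting string could be passed to `act_runner register --labels` or written into the runner's `config.yaml`; which of the two this environment uses is not stated here.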
## Deploy Race Risk
When there is more than one runner, multiple deploy jobs can run at the same time. This is good for HA, but it can create race risk on shared resources.
Risk areas:
- Concurrent `docker stack deploy` on the same stack
- Concurrent `docker service update` for the same service
- Concurrent updates to the same `.env` or manifest file on StorageBox
- Root infrastructure pipeline and microservice deploy pipeline running at the same time
Required measures:
- Prod root infrastructure deploy should run manually or with approval.
- Prod deploy for the same service must not be triggered more than once at the same time.
- All prod deploy workflows are queued with the Gitea Actions `concurrency: group: prod-deploy` block; concurrent execution is prevented by Gitea.
## Prerequisites — StorageBox Secrets
Before the deploy pipeline runs, the following files must exist on StorageBox. These files are not created automatically; they are created manually during the initial setup.
### SWAG / GoDaddy Credentials
```
prod/secrets/iklim.co/.env.secrets.swag
```
```bash
GODADDY_KEY=<api-key>
GODADDY_SECRET=<api-secret>
```
Create a **Production** API key at https://developer.godaddy.com/keys. If an existing key is known to have been shared in any chat, Slack, or email, revoke it and create a new one before use.
> `.env.secrets.swag` contains only SWAG/GoDaddy credentials.
> `.env.secrets.shared` contains AppRole IDs, DB passwords, and other runtime secrets — do not mix these two files.
### Gitea `PROD_FLOATING_IP` Variable
For DNS automation, `PROD_FLOATING_IP` must be defined as a Gitea project variable. See the "Gitea Variable: PROD_FLOATING_IP" step in `06-prod-terraform-iac.md`.
### Docker Secrets
Before the infra stack is deployed, `rabbitmq_erlang_cookie` must exist as a Docker secret. The current prod workflow creates it in the `Create Infrastructure Docker Secrets` step if it is missing.
```bash
# RabbitMQ Erlang cluster cookie; must be the same on all RabbitMQ nodes.
# The workflow does this automatically if the secret is missing:
openssl rand -hex 32 | docker secret create rabbitmq_erlang_cookie -
```
> The `vault_unseal_key` secret is managed by `init/vault/vault-bootstrap.sh`. The bootstrap script creates a placeholder on first deploy, deploys `docker-stack-vault.yml`, initializes/unseals Vault, and rotates the secret to the real unseal key.
Verify secrets:
```bash
docker secret ls
# rabbitmq_erlang_cookie row must appear
```
### SWAG Nginx Configuration Templates
Before the deploy pipeline runs, the following template files must exist in the repo:
- `template/swag/site-confs/default.conf`
- `template/swag/site-confs/api.conf.tpl`
- `template/swag/site-confs/apigw.conf.tpl`
- `template/swag/site-confs/rabbitmq.conf.tpl`
- `template/swag/site-confs/grafana.conf.tpl`
These files are created in the test environment (`test-env/04-swag-nginx-configs.md`); they are not created separately for prod. Template files are shared by both environments; prod-specific values are injected with environment variables during deploy.
Verify that the `prod/secrets/iklim.co/.env` file on StorageBox contains the following variables:
```bash
API_SUBDOMAIN=api.iklim.co
APIGW_SUBDOMAIN=apigw.iklim.co
RABBITMQ_SUBDOMAIN=rabbitmq.iklim.co
GRAFANA_SUBDOMAIN=grafana.iklim.co
RESTRICTED_IPS="78.187.87.109/32,95.70.151.248/32"
SWAG_CERT_DIR=/mnt/storagebox/ssl
SWAG_DNS_CONFIG_DIR=/mnt/storagebox/swag/dns-conf
SWAG_SITE_CONFS_DIR=/mnt/storagebox/swag/site-confs
SWAG_PROXY_CONFS_DIR=/mnt/storagebox/swag/proxy-confs
```
The pipeline sources these variables and renders the template files into the `$SWAG_SITE_CONFS_DIR` (`/mnt/storagebox/swag/site-confs`) directory. Because StorageBox is mounted on all app nodes as shared storage, SWAG containers on other nodes read the same files even when the configuration is rendered by a single runner.
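Before the first deploy, the presence of the variables listed above could be checked with a small helper like the following. This is a sketch only: `missing_env_vars` is a hypothetical name, and the check looks only for a non-empty assignment, not for a correct value.

```shell
#!/bin/sh
# Hypothetical helper: print every required variable that has no assignment
# in the given env file. A bare NAME= with nothing after it counts as missing.
# Prints nothing when all variables are present.
missing_env_vars() {
    env_file="$1"
    shift
    for name in "$@"; do
        grep -Eq "^${name}=.+" "$env_file" 2>/dev/null || printf '%s\n' "$name"
    done
}

# Example call (the path is an assumption about where the fetched file lives):
# missing_env_vars ./.env API_SUBDOMAIN APIGW_SUBDOMAIN RABBITMQ_SUBDOMAIN \
#     GRAFANA_SUBDOMAIN RESTRICTED_IPS SWAG_CERT_DIR SWAG_DNS_CONFIG_DIR \
#     SWAG_SITE_CONFS_DIR SWAG_PROXY_CONFS_DIR
```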
### APISIX Configuration
The following prerequisites must be satisfied before deploy.
#### init.sh SSL Block
The `ssls/1` PUT block and the `dev` SSL block inside `init/apisix-core/init.sh` must be removed. This change is made in the test environment (`test-env/05-apisix-remove-ssl.md`); the same `init.sh` file is also used in prod, so no separate change is required for prod.
#### Custom APISIX Image
The prod stack uses the `registry.tarla.io/iklimco/custom-apisix:3.12.0` image. This image's `config.yaml` must contain real IP header configuration for the overlay CIDR:
```yaml
nginx_config:
  http:
    real_ip_header: "X-Real-IP"
    real_ip_recursive: "on"
    set_real_ip_from:
      - "10.0.0.0/8"
      - "172.16.0.0/12"
      - "192.168.0.0/16"
```
These three CIDR ranges cover all typical Docker Swarm overlay subnet allocations. APISIX reads the real client IP from SWAG's `X-Real-IP` header instead of the overlay container IP.
If the image requires a rebuild because `config.yaml` changed, run from the project root:
```bash
bash ops/push-harbor-custom-images.sh
```
During deploy, the pipeline runs `init/apisix-core/init.sh` once. It writes the APISIX configuration to Patroni etcd under the `/apisix` prefix; the 3 replicas in prod all read this shared etcd state, so no separate init per replica is required. Details: `roadmap/prod-env/05-apisix-remove-ssl.md`.
## Deploy Serialization with Gitea Concurrency
Because 3 runners run in prod, more than one deploy job can be triggered at the same time. Instead of a StorageBox-based `mkdir/rmdir` lock mechanism, the Gitea Actions `concurrency` feature is used.
Add the following block to the pipeline file (`deploy-prod.yml`):
```yaml
concurrency:
  group: prod-deploy
  cancel-in-progress: false
```
With `cancel-in-progress: false`, a new run in the same group is queued by Gitea until the previous one finishes. It appears as "queued" in the UI and is not shown as an error. There is no stale lock risk: even if the runner crashes or the job is canceled, Gitea handles state management.
All prod deploy workflows, including infra and microservices, must use the same `group: prod-deploy` value so infra deploy and microservice deploy cannot overlap.
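Because the serialization only holds when every prod deploy workflow names the same group, a quick check along these lines may be useful. It is a sketch under assumptions: `workflows_missing_group` is a hypothetical helper, and it expects a directory that contains only prod deploy workflow files.

```shell
#!/bin/sh
# Hypothetical check: list workflow files in a directory that do not
# declare the shared "prod-deploy" concurrency group.
workflows_missing_group() {
    dir="$1"
    for wf in "$dir"/*.yml; do
        # Skip the unexpanded glob when the directory has no .yml files.
        [ -f "$wf" ] || continue
        grep -Eq '^[[:space:]]*group:[[:space:]]*prod-deploy[[:space:]]*$' "$wf" \
            || printf '%s\n' "$wf"
    done
}

# Example call; empty output means every file declares the group:
# workflows_missing_group .gitea/workflows
```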
## Deploy Pipeline
The table below lists the full step order of the prod deploy pipeline defined in `.gitea/workflows/deploy-prod.yml`. Steps marked with `*` are prod-specific and do not exist in the test pipeline.
| # | Step | Note |
| --- | --- | --- |
| 1 | Checkout Branch | |
| 2 | Prepare Folders | |
| 3 | Set up SSH Key and Add to known_hosts | |
| 4 | Update Apt Repository and Install Required Tools | `gettext`, `tree`, `jq`; `jq` is required for the GoDaddy DNS API |
| 5 | Fetch Prod Env From Storagebox | Fetch `.env` and `.env.secrets.shared` |
| 6 | Fetch Service Secret Files | Fetch `.env.secrets.<svc>` and `.env.secrets.swag` |
| 7 | Prepare Database Init Files | Render PostgreSQL/MongoDB init templates |
| 8 | Docker Login to Harbor | |
| 9 | Prepare SWAG Directories | Render `dns-conf` and `site-confs`; reload node-local SWAG if present |
| 10 | Bootstrap Vault TLS Placeholder | Creates temporary cert only if missing |
| 11 | Create Infrastructure Docker Secrets | Creates `rabbitmq_erlang_cookie` if missing |
| 12 | Deploy Swarm Stacks | `docker-stack-infra_db-prod.yml` |
| 13 | Connect Runner to Overlay Network | Connects job container to `iklimco-net` |
| 14 | Initialize Production Infrastructure | Runs `init-infra-prod.sh`; this triggers Vault bootstrap and RabbitMQ setup |
| 15 | Wait for Infrastructure Services | Waits for `iklimco_vault` and `iklimco_rabbitmq` |
| 16 | Provision Vault AppRole IDs and Docker Secrets | Downloads service `vault-files`, runs `init/provision-all-services.sh` |
| 17 | Upload Updated Secrets to Storagebox | Uploads `.env.secrets.*` and `.env` |
| 18 | Wait for etcd | Waits for etcd health |
| 19 | Run APISIX Init | `SPRING_PROFILES_ACTIVE=prod` |
| 20 | Bootstrap SWAG Certificate | Waits for SWAG and cert-reloader output in `SWAG_CERT_DIR` |
| 21 | Initialize MongoDB Replica Set | `rs.initiate()` or missing-member `rs.add()` |
| 22 | Run Database Init Scripts | Patroni primary + MongoDB replica set; SQL+JS init |
| 23 | Update DNS Records | GoDaddy API; `api/apigw/rabbitmq/grafana` A records |
| 24 | Review Environment | |
### Stack Placement Boundary
`docker-stack-infra_db-prod.yml` is intentionally a mixed infrastructure stack. The DB/cluster services in that file are placed on DB nodes and expose host-mode cluster ports:
- Patroni/PostgreSQL, MongoDB, and etcd run on `iklim-db-*` workers.
The service-node infrastructure in the same file remains overlay-only unless a reverse proxy or explicit published port is defined by the stack:
- Redis, Redis Sentinel, and RabbitMQ run on `node.labels.type == service` app/service nodes.
- Redis and RabbitMQ must not be treated as DB-node host-mode services.
### Historical Note: Removed Cert Scp Lines
Older workflow versions copied certificate files manually in an `Initialize Workspace` step. That step no longer exists in the current root prod workflow. The removed lines are kept here only as a historical reference:
```yaml
# REMOVED — manual cert copy with scp is no longer required:
scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebox.de:prod/app/iklim.co/ssl/STAR.iklim.co.full.crt ./STAR.iklim.co.full.crt
scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebox.de:prod/app/iklim.co/ssl/STAR.iklim.co_key.pem ./STAR.iklim.co_key.pem
```
This line was also removed from the old `Prepare Init Files` step:
```yaml
# REMOVED:
sudo cp STAR.iklim.co.full.crt STAR.iklim.co_key.pem /opt/iklimco/ssl/
```
The certificate is now obtained by SWAG from Let's Encrypt and written to the `SWAG_CERT_DIR` (`/mnt/storagebox/ssl/`) directory in the `Bootstrap SWAG Certificate` step. Later renewals are handled automatically by cert-reloader.
### Bootstrap SWAG Certificate (Step 20)
On the first deploy, SWAG obtains the Let's Encrypt certificate with the GoDaddy DNS-01 challenge. The current step waits for the Swarm `iklimco_swag` service to be running, then waits for `cert-reloader` to write `STAR.iklim.co.full.crt` to `SWAG_CERT_DIR`.
```yaml
- name: Bootstrap SWAG Certificate
  run: |
    set -a; . ./.env; set +a
    echo "Waiting for SWAG service..."
    docker service ps iklimco_swag --filter 'desired-state=running'
    echo "Waiting for cert-reloader output in ${SWAG_CERT_DIR}..."
    docker run --rm -v "${SWAG_CERT_DIR}:/ssl:ro" alpine \
      test -f /ssl/STAR.iklim.co.full.crt
  working-directory: /workspace/iklim.co
```
After this step, certificate files exist inside `SWAG_CERT_DIR` (`/mnt/storagebox/ssl/`). `cert-distributor` syncs these files to node-local `/opt/iklimco/ssl`, where Vault reads them. Later renewals are handled automatically by SWAG, cert-reloader, and cert-distributor.
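The step shown above is abbreviated and does not include the polling itself. A minimal sketch of what such a wait could look like follows; `wait_for_file` and its default timeout and interval are assumptions, not the workflow's actual implementation.

```shell
#!/bin/sh
# Hypothetical helper: poll until a file exists or the timeout (in seconds)
# expires. Returns 0 when the file is present, 1 on timeout.
wait_for_file() {
    path="$1"
    timeout="${2:-600}"
    interval="${3:-10}"
    waited=0
    while [ ! -f "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Timed out after ${waited}s waiting for $path" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Example call on a node where StorageBox is mounted:
# wait_for_file "${SWAG_CERT_DIR}/STAR.iklim.co.full.crt" 900 15
```

A bounded wait of this kind makes the job fail clearly if the DNS-01 challenge never completes, rather than hanging.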
### Run Database Init Scripts (Step 22)
PostgreSQL and MongoDB init scripts run after Patroni primary and MongoDB replica set readiness:
```yaml
- name: Run Database Init Scripts
  run: |
    set -a; . ./.env; . ./.env.secrets.shared; set +a
    PG_URI="postgresql://${DATABASE_POSTGRES_ROOT_USER}@${DATABASE_POSTGRES_HOST}/postgres?connect_timeout=5&target_session_attrs=read-write"
    MONGO_URI="mongodb://${DATABASE_MONGODB_ROOT_USER}:${DATABASE_MONGODB_ROOT_PASSWD}@${DATABASE_MONGODB_HOST}/admin?${DATABASE_MONGODB_PARAMS}"
    # Apply PostgreSQL init files in sorted order against the Patroni primary.
    for sql_file in $(ls ./init/postgresql/*.sql 2>/dev/null | sort); do
      echo "▶ $(basename "$sql_file")"
      docker run --rm -i --network iklimco-net \
        -e PGPASSWORD="${DATABASE_POSTGRES_ROOT_PASSWD}" \
        postgis/postgis:18-3.6 \
        psql "$PG_URI" < "$sql_file"
    done
    # Apply MongoDB init files in sorted order. The single-quoted script is
    # expanded inside the container, so the URI is passed in with -e.
    for js_file in $(ls ./init/mongodb/*.js 2>/dev/null | sort); do
      echo "▶ $(basename "$js_file")"
      docker run --rm -i --network iklimco-net \
        -e MONGO_INIT_URI="$MONGO_URI" "${IMAGE_MONGODB}" \
        sh -c 'cat > /tmp/init.js && mongosh "$MONGO_INIT_URI" --quiet --file /tmp/init.js' \
        < "$js_file"
    done
    echo "✅ Database init scripts completed"
  working-directory: /workspace/iklim.co
```
- `DATABASE_POSTGRES_HOST`: multi-host Patroni target; the workflow uses `target_session_attrs=read-write` to reach the primary
- `DATABASE_MONGODB_HOST`: MongoDB replica set host list
- SQL files `./init/postgresql/*.sql` and JS files `./init/mongodb/*.js` are created in the `Prepare Database Init Files` step (step 7) by the `init_postgresql`/`init_mongodb` functions in `common-functions-prod.sh`
- Idempotent: `CREATE IF NOT EXISTS` / `createCollection` semantics; runs safely again on later deploys
## Swarm Service Distribution
In prod, all 3 app nodes are manager + app worker, so services can be distributed across 3 nodes.
### Microservices
Prod microservice workflows do not rebuild application images. They read `deploy/prod.env`, promote the tested Harbor digest to a stable prod tag, and call `swarm_service_update` with `deploy/docker-stack-service.yml`.
For first deploy, `swarm_service_update` exports `SERVICE_IMAGE` and runs:
```bash
docker stack deploy --with-registry-auth -c deploy/docker-stack-service.yml iklimco
```
For existing services it performs `docker service update` with `--update-order start-first` and `--update-failure-action rollback`.
### Infra Services
The current prod infra stack is `docker-stack-infra_db-prod.yml`. Vault is not inside this stack; it is deployed separately by `vault-bootstrap.sh` using `docker-stack-vault.yml`.
#### cert-reloader and Vault Auto-Unseal
The `cert-reloader` sidecar service runs as `replicas: 1` inside the infra stack. It detects the Let's Encrypt certificate renewed by SWAG and distributes it to Vault. Because prod uses the shared StorageBox mount, SSH-based distribution is not required.
Certificate renewal flow:
```
SWAG renews the certificate -> stores it inside the SWAG named volume
cert-reloader detects the MD5 change
-> copies it to /mnt/storagebox/ssl/ directory (StorageBox)
cert-distributor syncs it to /opt/iklimco/ssl on service nodes
-> runs docker service update --force iklimco_vault
Vault (3 replicas) restarts
-> each instance reads the new certificate from /opt/iklimco/ssl
-> entrypoint retry-unseal loop reads from the vault_unseal_key Docker secret and unseals
```
The 3 Vault replicas run their own retry-unseal loop independently. The certificate renewal -> distribution -> restart -> unseal chain requires no manual intervention after bootstrap.
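The change-detection part of this flow can be sketched as follows. This is not the actual cert-reloader code: the helper name `cert_changed` and the state-file approach are assumptions, used only to illustrate comparing MD5 hashes between checks.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 when the file's MD5 differs from the hash
# recorded in the state file (and record the new hash), exit 1 when it is
# unchanged, exit 2 when the certificate file does not exist.
cert_changed() {
    cert="$1"
    state="$2"
    [ -f "$cert" ] || return 2
    new_hash=$(md5sum "$cert" | cut -d ' ' -f 1)
    old_hash=$(cat "$state" 2>/dev/null || true)
    if [ "$new_hash" = "$old_hash" ]; then
        return 1
    fi
    printf '%s\n' "$new_hash" > "$state"
    return 0
}

# Example loop body (paths and the copy step are assumptions):
# if cert_changed /config/fullchain.pem /tmp/cert.md5; then
#     cp /config/fullchain.pem /mnt/storagebox/ssl/STAR.iklim.co.full.crt
# fi
```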
#### Vault Raft Configuration
Vault is defined as 3 replicas with Raft storage in `docker-stack-vault.yml`:
```yaml
vault:
  volumes:
    - vault-data-vl:/vault/file
    - vault-logs-vl:/vault/logs
    - /opt/iklimco/ssl:/vault/certs:ro
  deploy:
    mode: replicated
    replicas: 3
    placement:
      max_replicas_per_node: 1
      constraints:
        - node.labels.type == service
```
The Vault stack uses `vault-template-v2.json`, `vault_unseal_key`, and the `iklimco-net` external network. Bootstrap and unseal are handled by `init/vault/vault-bootstrap.sh`.
## Vault Raft Cluster Initial Setup
Vault Raft cluster setup is no longer a manual post-deploy procedure. It is handled by `init/vault/vault-bootstrap.sh`, called through `init-infra-prod.sh` by the root prod workflow.
### Step 1 — Stack Deploy
The bootstrap script deploys:
```bash
docker stack deploy --with-registry-auth -c docker-stack-vault.yml iklimco
```
### Step 2 — Vault Initialize (iklim-app-01)
The script runs `vault operator init -key-shares=1 -key-threshold=1` if Vault is not initialized. It stores bootstrap output under `/tmp/vault-bootstrap/main-vault-init.txt` during the run. On first deploy, the `vault_unseal_key` secret starts out as a placeholder, created like this:
```bash
echo "bootstrap" | docker secret create vault_unseal_key -
```
Then it rotates `vault_unseal_key` to the real unseal key and unseals the leader and peers.
### Step 3 — Unseal the Leader
No manual unseal command is required in the normal path.
### Step 4 — Join the Other Nodes to the Raft Cluster
Peer join and peer unseal are handled by `vault-bootstrap.sh`.
### Step 5 — Verify the Cluster
```bash
docker exec "$VAULT_CTR" vault operator raft list-peers
```
Expected: 3 peers — one `leader`, two `follower`.
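The expected peer count can also be checked mechanically. The sketch below assumes the default table output of `vault operator raft list-peers`, where the third column is the peer state; `check_raft_peers` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical check: read `vault operator raft list-peers` output on stdin
# and succeed only when it lists exactly 1 leader and 2 followers.
# Assumes the default table format with the state in the third column.
check_raft_peers() {
    awk '
        $3 == "leader"   { leaders++ }
        $3 == "follower" { followers++ }
        END {
            printf "leaders=%d followers=%d\n", leaders, followers
            exit !(leaders == 1 && followers == 2)
        }
    '
}

# Example call:
# docker exec "$VAULT_CTR" vault operator raft list-peers | check_raft_peers
```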
## Gateway and Public Traffic
Public internet enters only through SWAG on `80/tcp` and `443/tcp`. SWAG is pinned to `iklim-app-01`, where the Floating IP is located. APISIX admin ports (`9180`) and other service ports are not opened publicly; SWAG forwards all public traffic to APISIX as a reverse proxy.
### Subdomain Routing
| Subdomain | Target Service | Restriction |
| --- | --- | --- |
| `api.iklim.co` | APISIX `:9080` | Public |
| `apigw.iklim.co` | APISIX Dashboard `:9000` | IP restricted with `RESTRICTED_IPS` |
| `rabbitmq.iklim.co` | RabbitMQ Management `:15672` | IP restricted with `RESTRICTED_IPS` |
| `grafana.iklim.co` | Grafana `:3000` | IP restricted with `RESTRICTED_IPS` |
IP restriction is done with the `RESTRICTED_IPS_BLOCK` nginx allow block derived from the `RESTRICTED_IPS` variable; it is applied in SWAG nginx configuration, not in the Hetzner firewall.
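How `RESTRICTED_IPS_BLOCK` is derived is not shown in this document. One plausible rendering, given purely as an illustration, turns each CIDR into an nginx `allow` directive and ends with `deny all;`. The function name and the trailing `deny all;` are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: turn a comma-separated CIDR list into nginx allow
# directives, followed by a final "deny all;" (assumed, not confirmed).
restricted_ips_block() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r cidr; do
        if [ -n "$cidr" ]; then
            printf 'allow %s;\n' "$cidr"
        fi
    done
    printf 'deny all;\n'
}

# Render the block for the RESTRICTED_IPS value used in the prod .env file.
restricted_ips_block "78.187.87.109/32,95.70.151.248/32"
```

With this shape, a request from any address outside the list falls through to `deny all;` and nginx answers with HTTP 403.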
### SWAG -> APISIX Load Distribution
SWAG connects to APISIX through the Docker Swarm service name with `proxy_pass http://apisix:9080;`. Swarm resolves the `apisix` service name to a VIP (Virtual IP); the IPVS load balancer distributes incoming connections round-robin across the 3 replicas in prod. No additional upstream or load balancer configuration is required on the SWAG side; load distribution happens transparently at the overlay network layer.
`Prometheus` is intentionally not exposed externally through SWAG. Access uses Grafana, whose internal connection is `http://prometheus:9090`, or an SSH tunnel.
Details: `Environment_Infrastructure/roadmap/prod-env/04-swag-nginx-configs.md`.
## Post-Deploy Verification
After a successful prod pipeline deploy, run the following checks.
### Swarm Health
```bash
docker node ls
```
Expected: 3 managers (`Leader` + 2 `Reachable`) — `iklim-app-01/02/03`; 3 workers (`Ready`) — `iklim-db-01/02/03`.
```bash
docker service ls --filter label=project=co.iklim
```
All services must show `REPLICAS X/X`, meaning the running replica count matches the target.
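For a scripted version of this check, the replica column can be parsed as sketched below. The helper `unconverged_services` is hypothetical and expects one `<name> <running>/<desired>` pair per line.

```shell
#!/bin/sh
# Hypothetical check: read lines of "<service> <running>/<desired>" on stdin
# and print each service whose running count differs from the desired count.
# Exits non-zero when at least one service has not converged.
unconverged_services() {
    awk '
        {
            split($2, r, "/")
            if (r[1] != r[2]) { print $1; bad = 1 }
        }
        END { exit bad + 0 }
    '
}

# Example call:
# docker service ls --filter label=project=co.iklim \
#     --format '{{.Name}} {{.Replicas}}' | unconverged_services
```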
### Precipitation Image Directory
```bash
ls -ld /mnt/storagebox/precipitation/images
```
The directory must exist; it is required before `iklimco_precipitation-service` is deployed.
```bash
docker volume inspect iklimco_image-data
```
Expected: `Options.device` -> `/mnt/storagebox/precipitation/images`.
### SWAG Certificate
```bash
docker exec $(docker ps -q -f name=iklimco_swag) certbot certificates
```
Expected: `*.iklim.co`, `VALID: XX days` (Let's Encrypt — not the old manual cert).
TLS check from outside:
```bash
echo | openssl s_client -connect api.iklim.co:443 -servername api.iklim.co 2>/dev/null \
| openssl x509 -noout -subject -dates
```
Expected: `CN=*.iklim.co`, `notAfter > 2026-07-15`.
> Warning: The old manual `*.iklim.co` certificate expires on **2026-07-15**. After SWAG's Let's Encrypt certificate is verified for the first time, the old cert on StorageBox can be archived and is no longer used.
### Public API Access
```bash
curl -si https://api.iklim.co/health
```
It must return HTTP 2xx; there must be no TLS error.
### IP Restriction
From a disallowed IP:
```bash
curl -si https://grafana.iklim.co
curl -si https://apigw.iklim.co
curl -si https://rabbitmq.iklim.co
```
All must return HTTP 403.
From an allowed IP (78.187.87.109 or 95.70.151.248):
```bash
curl -si https://grafana.iklim.co # HTTP 200 Grafana
curl -si https://apigw.iklim.co # HTTP 200 APISIX Dashboard
curl -si https://rabbitmq.iklim.co # HTTP 200 RabbitMQ Management
```
### Vault Access Control
Must not be reachable from outside:
```bash
# Expected: connection refused or timeout
curl -sk --connect-timeout 5 https://<iklim-app-01-public-ip>:8200/v1/sys/health
```
Must be reachable from inside the overlay:
```bash
# Expected: {"sealed":false,...}
docker exec $(docker ps -q -f name=iklimco_apisix | head -1) \
curl -sk https://vault.iklim.co:8200/v1/sys/health
```
### No Unexpected Ports
```bash
docker service ls --format "{{.Name}}\t{{.Ports}}" \
--filter label=project=co.iklim
```
Only `iklimco_swag` -> `*:80->80/tcp, *:443->443/tcp` should publish ports; other services must not publish ports.
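The same rule can be expressed as a small filter over the command output above. This is a sketch; `unexpected_published_ports` is a hypothetical helper that reads the tab-separated name and ports columns.

```shell
#!/bin/sh
# Hypothetical check: read tab-separated "<name>\t<ports>" lines on stdin and
# print every service other than iklimco_swag that publishes a port.
# Exits non-zero when any such service is found.
unexpected_published_ports() {
    awk -F '\t' '
        $1 != "iklimco_swag" && $2 != "" { print $1 ": " $2; bad = 1 }
        END { exit bad + 0 }
    '
}

# Example call:
# docker service ls --format "{{.Name}}\t{{.Ports}}" \
#     --filter label=project=co.iklim | unexpected_published_ports
```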
### APISIX Replica Distribution
```bash
docker service ps iklimco_apisix
```
Expected: 3 tasks, all `Running`, on different nodes.
### fail2ban (SWAG Container)
```bash
docker exec $(docker ps -q -f name=iklimco_swag) fail2ban-client status
```
Expected: a list with more than one jail.
### Microservice Health (After Microservices Are Deployed)
After microservices are deployed with a separate pipeline:
```bash
curl -si "https://api.iklim.co/v1/weather/current?lat=39&lon=35"
```
Expected: valid JSON weather response.
## Acceptance Criteria
- 3 prod runners appear online in the Gitea UI.
- Every runner has the `prod-runner` label.
- Any runner can run a simple Docker command.
- `docker node ls` shows 3 managers.
- When one runner/node is shut down, another runner can pick up a new job.
- All prod deploy workflows (`concurrency: group: prod-deploy`) are queued by Gitea; concurrent execution is prevented.
- Public ingress is limited to only `22`, `80`, and `443`.
- `prod/secrets/iklim.co/.env.secrets.swag` exists on StorageBox and contains valid GoDaddy credentials.
- `PROD_FLOATING_IP` project variable is defined in Gitea.
- `rabbitmq_erlang_cookie` appears in `docker secret ls`.
- The `ssl`, `swag/config`, `swag/site-confs`, `grafana/data`, and `precipitation/images` directories exist on StorageBox; see `07-prod-ansible-bootstrap.md` — StorageBox Directory Structure.
- The `template/swag/site-confs/default.conf`, `api.conf.tpl`, `apigw.conf.tpl`, `rabbitmq.conf.tpl`, and `grafana.conf.tpl` template files exist in the repo.
- StorageBox `prod/secrets/iklim.co/.env` has correct values for `API_SUBDOMAIN`, `APIGW_SUBDOMAIN`, `RABBITMQ_SUBDOMAIN`, `GRAFANA_SUBDOMAIN`, `RESTRICTED_IPS`, `SWAG_CERT_DIR`, `SWAG_DNS_CONFIG_DIR`, `SWAG_SITE_CONFS_DIR`, and `SWAG_PROXY_CONFS_DIR`.
- After the first deploy, `docker exec $(docker ps -q -f name=iklimco_swag) nginx -t` succeeds and returns `syntax is ok`.
- The output of `cat /mnt/storagebox/swag/site-confs/api.conf | grep server_name` contains `server_name api.iklim.co;`.
- The `ssls/1` PUT block does not exist inside `init/apisix-core/init.sh`.
- The `registry.tarla.io/iklimco/custom-apisix:3.12.0` image exists in Harbor and its `config.yaml` contains `real_ip_header`, `real_ip_recursive`, and `set_real_ip_from` (covering `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`) configuration.
- After the first deploy, real client IP appears in APISIX access logs, not the SWAG overlay IP: `docker exec $(docker ps -q -f name=iklimco_apisix | head -1) tail -5 /usr/local/apisix/logs/access.log`
- `docker service ps iklimco_cert-reloader` shows that the service is running.
- `docker service ls` contains the current prod infra services from `docker-stack-infra_db-prod.yml` and the separate `iklimco_vault` service from `docker-stack-vault.yml`; deprecated base-stack services such as `iklimco_postgresql`, `iklimco_mongodb`, `iklimco_pg-proxy`, and `iklimco_mongo-proxy` are not present.
- The output of `docker service logs iklimco_cert-reloader --tail 20` contains `[cert-reloader] started` and has no error lines.
- The `notAfter` date of the Vault TLS endpoint certificate matches `/mnt/storagebox/ssl/STAR.iklim.co.full.crt`: `docker exec $(docker ps -q -f name=iklimco_vault | head -1) sh -c 'echo | openssl s_client -connect vault.iklim.co:8200 2>/dev/null | openssl x509 -noout -dates'`
- `vault operator raft list-peers` returns 3 peers: 1 leader, 2 followers.
- The `vault_unseal_key` Docker secret exists and appears in `docker secret ls`.
- All 3 Vault containers are unsealed: `docker exec $(docker ps -q -f name=iklimco_vault | head -1) vault status | grep Sealed` -> `Sealed false`.
- The first deploy pipeline successfully completes all current root workflow steps; the `Review Environment` step succeeds.
- After the `Bootstrap SWAG Certificate` step, `ls /mnt/storagebox/ssl/` -> `STAR.iklim.co.full.crt` and `STAR.iklim.co_key.pem` exist.
- The `Run Database Init Scripts` step completes without error; PostgreSQL and MongoDB are healthy and init scripts are applied.
- In the output of `docker service ls --filter label=project=co.iklim`, all infra services show `X/X`.
- `docker volume inspect iklimco_image-data` -> `Options.device=/mnt/storagebox/precipitation/images`.
- `docker exec $(docker ps -q -f name=iklimco_swag) certbot certificates` -> `*.iklim.co` Let's Encrypt certificate is valid; it is not the old manual cert.
- `echo | openssl s_client -connect api.iklim.co:443 2>/dev/null | openssl x509 -noout -subject -dates` -> `CN=*.iklim.co`, `notAfter > 2026-07-15`.
- `curl -si https://api.iklim.co/health` -> HTTP 2xx; no TLS error.
- `https://grafana.iklim.co`, `https://apigw.iklim.co`, `https://rabbitmq.iklim.co` — returns HTTP 403 from a disallowed IP and HTTP 200 from an allowed IP.
- `curl --connect-timeout 5 https://<public-ip>:8200` -> connection refused or timeout; Vault is not reachable from outside.
- `docker exec $(docker ps -q -f name=iklimco_apisix | head -1) curl -sk https://vault.iklim.co:8200/v1/sys/health` -> `{"sealed":false,...}`; reachable from inside the overlay.
- `docker service ls --format "{{.Name}}\t{{.Ports}}" --filter label=project=co.iklim` -> only `iklimco_swag` publishes ports.
- `docker service ps iklimco_apisix` -> 3 tasks, `Running`, on different nodes.
- `docker exec $(docker ps -q -f name=iklimco_swag) fail2ban-client status` -> more than one jail appears.