# 09 - Prod Runner HA and Swarm Deploy Model

The purpose of this phase is to set up Gitea Actions runners in prod so they run in HA mode and to define the prerequisites for distributing services across 3 nodes on Swarm.

## Runner Count

A single runner is functionally enough, but it is not HA. Because the prod target is HA, `act_runner` will be installed as a systemd service on all 3 Swarm manager nodes:

| Host | Runner |
| --- | --- |
| `iklim-app-01` | `act_runner` systemd |
| `iklim-app-02` | `act_runner` systemd |
| `iklim-app-03` | `act_runner` systemd |

In this model, if any manager/runner is lost, the other runners can pick up pipeline jobs.

## Runner Installation Model

The runner does not run as a Docker container, so no Docker socket is mounted.

Installation:

- `gitea-runner` system user
- `/usr/local/bin/act_runner`
- `/etc/gitea-act-runner/config.yaml`
- `/var/lib/gitea-runner`
- `gitea-act-runner.service`

If runner jobs use the Docker CLI for deploy, the `gitea-runner` user needs access to the Docker daemon. Docker group membership is effectively root-level access, so only trusted repos and jobs should use these runner labels.

## Runner Label Policy

Shared labels on all prod runners:

```text
prod-runner
docker
swarm-manager
ubuntu-24.04
```

Node-specific labels:

```text
iklim-app-01
iklim-app-02
iklim-app-03
```

If existing prod workflows use `runs-on: prod-runner`, any of the 3 runners can pick up the job. If pinning to a specific node is required, use a node-specific label.

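As a rough illustration of how the shared and node-specific labels combine, the sketch below builds the full label list for one host. The function name `runner_labels` is hypothetical and not part of the repo; how the resulting string is fed into the runner registration or `config.yaml` is an assumption left to the installation step.

```bash
#!/bin/sh
# Hypothetical helper: build the comma-separated label list for one prod runner.
# The four shared labels come from the policy above; the host name is appended
# as the node-specific label.
runner_labels() {
  host="$1"
  printf '%s\n' "prod-runner,docker,swarm-manager,ubuntu-24.04,${host}"
}

# Print the label list for each of the three app nodes.
for h in iklim-app-01 iklim-app-02 iklim-app-03; do
  echo "$h -> $(runner_labels "$h")"
done
```

A workflow that targets `prod-runner` matches all three lists, while one that targets `iklim-app-02` matches only the second.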
## Deploy Race Risk

When there is more than one runner, multiple deploy jobs can run at the same time. This is good for HA, but it creates a race risk on shared resources.

Risk areas:

- Concurrent `docker stack deploy` on the same stack
- Concurrent `docker service update` for the same service
- Concurrent updates to the same `.env` or manifest file on StorageBox
- Root infrastructure pipeline and microservice deploy pipeline running at the same time

Required measures:

- Prod root infrastructure deploy should run manually or with approval.
- Prod deploy for the same service must not be triggered more than once at the same time.
- All prod deploy workflows are queued with the Gitea Actions `concurrency: group: prod-deploy` block; concurrent execution is prevented by Gitea.

## Prerequisites — StorageBox Secrets

Before the deploy pipeline runs, the following files must exist on StorageBox. These files are not created automatically; they are created manually during the initial setup.

### SWAG / GoDaddy Credentials

```
prod/secrets/iklim.co/.env.secrets.swag
```

```bash
GODADDY_KEY=<api-key>
GODADDY_SECRET=<api-secret>
```

Create a **Production** GoDaddy API key at https://developer.godaddy.com/keys. If an existing key is known to have been shared in any chat, Slack, or email, revoke it and create a new one before use.

> `.env.secrets.swag` contains only SWAG/GoDaddy credentials.
> `.env.secrets.shared` contains AppRole IDs, DB passwords, and other runtime secrets. Do not mix these two files.

### Gitea `PROD_FLOATING_IP` Variable

For DNS automation, `PROD_FLOATING_IP` must be defined as a Gitea project variable. See the "Gitea Variable: PROD_FLOATING_IP" step in `06-prod-terraform-iaac.md`.

### Docker Secrets

Before the infra stack is deployed, the following Docker secrets must be created on `iklim-app-01`. These secrets are referenced by `docker-stack-infra.prod.yml`; if they do not exist, stack deploy fails.

```bash
# Redis password, used by Redis master, replica, and sentinel:
openssl rand -hex 32 | docker secret create redis_password -

# RabbitMQ Erlang cluster cookie; must be the same on all RabbitMQ nodes:
openssl rand -hex 32 | docker secret create rabbitmq_erlang_cookie -
```

> The `vault_unseal_key` secret is created after Vault is started for the first time; see `roadmap/prod-env/07-vault-raft-plan.md` Step 3. It is not required for the first infra stack deploy; it is only needed once the Vault healthcheck attempts an unseal.
>
> This secret is also used during Vault restarts triggered by cert-reloader: when `cert-reloader` detects a certificate change, it runs `docker service update --force iklimco_vault`. While the Vault containers restart, they read the `vault_unseal_key` Docker secret and unseal automatically. If the secret is missing, Vault remains sealed after every certificate renewal.

Verify secrets:

```bash
docker secret ls
# redis_password and rabbitmq_erlang_cookie rows must appear
```

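Docker secrets cannot be overwritten, so running `docker secret create` again for an existing name fails. A minimal sketch of an idempotent wrapper is shown below; `ensure_secret` is a hypothetical helper name, and the sketch assumes it runs on a Swarm manager with `openssl` available.

```bash
#!/bin/sh
# Hypothetical helper: create a random Docker secret only if it does not exist yet.
# `docker secret inspect` exits non-zero when the secret is unknown.
ensure_secret() {
  name="$1"
  if docker secret inspect "$name" >/dev/null 2>&1; then
    echo "secret $name already exists, skipping"
  else
    openssl rand -hex 32 | docker secret create "$name" - >/dev/null
    echo "secret $name created"
  fi
}

# Usage on a Swarm manager (not executed here):
#   ensure_secret redis_password
#   ensure_secret rabbitmq_erlang_cookie
```

Skipping existing secrets matters for `rabbitmq_erlang_cookie` in particular, because regenerating the cookie would leave the RabbitMQ nodes with mismatched values.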
### SWAG Nginx Configuration Templates

Before the deploy pipeline runs, the following template files must exist in the repo:

- `swag/site-confs/default.conf`
- `swag/site-confs/api.conf.tpl`
- `swag/site-confs/apigw.conf.tpl`
- `swag/site-confs/rabbitmq.conf.tpl`
- `swag/site-confs/grafana.conf.tpl`

These files are created in the test environment (`test-env/04-swag-nginx-configs.md`); they are not created separately for prod. The template files are shared by both environments, and prod-specific values are injected through environment variables during deploy.

Verify that the `prod/secrets/iklim.co/.env.prod` file on StorageBox contains the following variables:

```bash
API_SUBDOMAIN=api.iklim.co
APIGW_SUBDOMAIN=apigw.iklim.co
RABBITMQ_SUBDOMAIN=rabbitmq.iklim.co
GRAFANA_SUBDOMAIN=grafana.iklim.co
RESTRICTED_IPS="78.187.87.109/32,95.70.151.248/32"
SWAG_CERT_DIR=/mnt/storagebox/ssl
SWAG_CONFIG_DIR=/mnt/storagebox/swag/config
SWAG_SITE_CONFS_DIR=/mnt/storagebox/swag/site-confs
```

The pipeline sources these variables and renders the template files into the `$SWAG_SITE_CONFS_DIR` (`/mnt/storagebox/swag/site-confs`) directory. Because StorageBox is mounted on all app nodes, configuration rendered on a single runner is visible to SWAG containers on every node. Detail: `roadmap/prod-env/04-swag-nginx-configs.md`.

### APISIX Configuration

The following prerequisites must be satisfied before deploy.

#### init.sh SSL Block

The `ssls/1` PUT block and the `dev` SSL block inside `init/apisix-core/init.sh` must be removed. This change is made in the test environment (`test-env/05-apisix-remove-ssl.md`); the same `init.sh` file is also used in prod, so no separate change is required for prod.

#### Custom APISIX Image

The prod stack uses the `registry.tarla.io/iklimco/custom-apisix:3.12.0` image. This image's `config.yaml` must contain real IP header configuration for the overlay CIDR:

```yaml
nginx_config:
  http:
    real_ip_header: "X-Real-IP"
    set_real_ip_from: "10.0.0.0/8"
```

`set_real_ip_from: 10.0.0.0/8` covers all container addresses in the Swarm overlay network. APISIX therefore trusts the `X-Real-IP` header sent by SWAG and writes the real client IP to its access logs instead of SWAG's internal overlay IP.

If the image requires a rebuild because `config.yaml` changed:

```bash
docker build -t registry.tarla.io/iklimco/custom-apisix:3.12.0 .
docker push registry.tarla.io/iklimco/custom-apisix:3.12.0
```

During deploy, `init/apisix-core/init.sh` is run once by the pipeline. It writes the APISIX configuration to Patroni etcd under the `/apisix` prefix. All 3 prod replicas read this shared etcd state, so no separate init per replica is required. Detail: `roadmap/prod-env/05-apisix-remove-ssl.md`.

## Deploy Serialization with Gitea Concurrency

Because 3 runners run in prod, more than one deploy job can be triggered at the same time. Instead of a StorageBox-based `mkdir/rmdir` lock mechanism, the Gitea Actions `concurrency` feature is used.

Add the following block to the pipeline file (`deploy-prod.yml`):

```yaml
concurrency:
  group: prod-deploy
  cancel-in-progress: false
```

With `cancel-in-progress: false`, a new run in the same group is queued by Gitea until the previous one finishes. It appears as "queued" in the UI and is not shown as an error. There is no stale lock risk: even if the runner crashes or the job is canceled, Gitea handles state management.

All prod deploy workflows, including infra and microservices, must use the same `group: prod-deploy` value so infra deploy and microservice deploy cannot overlap.

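Since the serialization only works if every prod deploy workflow declares the same group, a small lint can catch a workflow that forgot it. The sketch below is one possible check; `check_concurrency_group` is a hypothetical name, and it assumes the directory passed in contains only prod deploy workflows with a `.yml` extension.

```bash
#!/bin/sh
# Hypothetical lint: list workflow files that do not declare the shared group.
# Returns 1 if at least one file is missing "group: prod-deploy", 0 otherwise.
check_concurrency_group() {
  dir="$1"
  missing=0
  for wf in "$dir"/*.yml; do
    [ -e "$wf" ] || continue   # no matching files: the glob stays literal
    if ! grep -Eq '^[[:space:]]*group:[[:space:]]*prod-deploy[[:space:]]*$' "$wf"; then
      echo "missing concurrency group: $wf"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (not executed here; the path is an assumption):
#   check_concurrency_group path/to/prod-workflows
```

The check is purely textual: it does not verify that the `group:` line sits under a `concurrency:` key, so it is a guard against omissions rather than a YAML validator.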
## Deploy Pipeline

The full step order of the prod deploy pipeline (`.gitea/workflows/deploy-prod.yml`) is listed below. Steps marked with `*` are prod-specific and do not exist in the test pipeline.

| # | Step | Note |
| --- | --- | --- |
| 1 | Checkout Branch | |
| 2 | Prepare Folders | |
| 3 | Set up SSH Key and Add to known_hosts | |
| 4 | Update Apt Repository and Install Required Tools | `gettext tree jq`; `jq` is required for the GoDaddy DNS API |
| 5 | Fetch Service Secret Files | Fetch `.env.secrets.*` from StorageBox |
| 6 | Initialize Workspace | Fetch `.env` and `.env.secrets.shared` from StorageBox; run `init-base.sh` |
| 7 | Upload Updated Secrets to Storagebox | |
| 8 | Provision Vault AppRole IDs and Docker Secrets | |
| 9 | Upload Updated Env to Storagebox | |
| 10 | Prepare Init Files | Cert copy lines removed |
| 11 | Initialize Docker Swarm | |
| 12 | Docker Login to Harbor | |
| 13 | **Update DNS Records** * | GoDaddy API; `api/apigw/rabbitmq/grafana` A records; idempotent |
| 14 | **Prepare SWAG Directories** * | `$SWAG_CONFIG_DIR/dns-conf`; renders nginx conf templates; reloads running SWAG |
| 15 | Bootstrap Vault TLS Placeholder | |
| 16 | Deploy Swarm Stack | base + prod overlay together |
| 17 | **Wait for etcd** * | Waits until Patroni etcd (`iklim-db-01:2379`) is healthy |
| 18 | **Run APISIX Init** * | `SPRING_PROFILES_ACTIVE=prod`; idempotent; writes to etcd |
| 19 | **Bootstrap SWAG Certificate** * | Waits for SWAG to obtain the cert; copies it to `SWAG_CERT_DIR` |
| 20 | **Run Database Init Scripts** * | `postgresql`/`mongodb` Swarm VIP; SQL+JS init; idempotent |
| 21 | Review Environment | |

### Removal of Cert Scp Lines

Lines removed from the `Initialize Workspace` step:

```yaml
# REMOVED — manual cert copy with scp is no longer required:
scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebox.de:prod/app/iklim.co/ssl/STAR.iklim.co.full.crt ./STAR.iklim.co.full.crt
scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebox.de:prod/app/iklim.co/ssl/STAR.iklim.co_key.pem ./STAR.iklim.co_key.pem
```

Line also removed from the `Prepare Init Files` step:

```yaml
# REMOVED:
sudo cp STAR.iklim.co.full.crt STAR.iklim.co_key.pem /opt/iklimco/ssl/
```

The certificate is now obtained by SWAG from Let's Encrypt and written to the `SWAG_CERT_DIR` (`/mnt/storagebox/ssl/`) directory in the `Bootstrap SWAG Certificate` step. Later renewals are handled automatically by cert-reloader.

### Bootstrap SWAG Certificate (Step 19)

On the first deploy, SWAG obtains the Let's Encrypt certificate with the GoDaddy DNS-01 challenge. This step waits for SWAG to obtain the certificate, for up to 10 minutes, and then copies it to the `SWAG_CERT_DIR` directory:

```yaml
- name: Bootstrap SWAG Certificate
  run: |
    set -a; . ./.env; set +a
    echo "Waiting for SWAG container to start..."
    SWAG_CTR=""
    for i in $(seq 1 24); do
      SWAG_CTR=$(docker ps -q -f name=iklimco_swag 2>/dev/null | head -1)
      [ -n "$SWAG_CTR" ] && break
      sleep 10
    done

    if [ -z "$SWAG_CTR" ]; then
      echo "❌ SWAG container did not start"
      exit 1
    fi

    CERT_PATH="/config/etc/letsencrypt/live/iklim.co/fullchain.pem"
    echo "Waiting for cert (up to 10 min)..."
    for i in $(seq 1 20); do
      if docker exec "$SWAG_CTR" test -f "$CERT_PATH" 2>/dev/null; then
        echo "✅ Cert obtained"
        break
      fi
      echo "  attempt $i/20 — waiting 30s..."
      sleep 30
    done

    if ! docker exec "$SWAG_CTR" test -f "$CERT_PATH" 2>/dev/null; then
      echo "❌ SWAG did not obtain cert. Logs:"
      docker service logs iklimco_swag --tail 50
      exit 1
    fi

    docker exec "$SWAG_CTR" cat "$CERT_PATH" | \
      docker run --rm -i -v "${SWAG_CERT_DIR}:/output" alpine \
      sh -c "cat > /output/STAR.iklim.co.full.crt && chmod 644 /output/STAR.iklim.co.full.crt"
    docker exec "$SWAG_CTR" cat "/config/etc/letsencrypt/live/iklim.co/privkey.pem" | \
      docker run --rm -i -v "${SWAG_CERT_DIR}:/output" alpine \
      sh -c "cat > /output/STAR.iklim.co_key.pem && chmod 644 /output/STAR.iklim.co_key.pem"
    echo "✅ Cert bootstrapped to ${SWAG_CERT_DIR}/"
  working-directory: /workspace/iklim.co
```

After this step, certificate files exist inside `SWAG_CERT_DIR` (`/mnt/storagebox/ssl/`); Vault TLS reads these files. Later renewals are handled automatically by cert-reloader. When the pipeline runs again, this step only waits for the SWAG container to be ready; certificate issuance is managed by SWAG/cert-reloader within Let's Encrypt's 90-day cycle.

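The step above uses the same wait pattern twice: poll a condition a fixed number of times with a fixed delay, then fail. A minimal sketch of that pattern as a reusable function follows; `wait_until` is a hypothetical helper and is not part of the existing pipeline.

```bash
#!/bin/sh
# Hypothetical helper: retry a command up to N times with a fixed delay in seconds.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage inside the step (not executed here):
#   wait_until 20 30 docker exec "$SWAG_CTR" test -f "$CERT_PATH" || exit 1
```

Unlike the inline loops, this variant does not sleep after the final failed attempt, so 20 attempts at 30 seconds wait slightly less than the 10 minutes stated above.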
### Run Database Init Scripts (Step 20)

PostgreSQL and MongoDB init scripts run through Swarm overlay DNS service names (`postgresql`, `mongodb`):

```yaml
- name: Run Database Init Scripts
  run: |
    set -a; . ./.env; . ./.env.secrets.shared; set +a

    echo "⏳ Waiting for PostgreSQL..."
    until docker run --rm --network iklimco-net \
      -e PGPASSWORD="${DATABASE_POSTGRES_ROOT_PASSWD}" \
      postgis/postgis:17-3.5 \
      pg_isready -h postgresql -U "${DATABASE_POSTGRES_ROOT_USER}" -q 2>/dev/null; do
      sleep 5
    done
    # Globs expand in sorted order; skip the loop body if no file matches.
    for sql_file in ./init/postgresql/*.sql; do
      [ -e "$sql_file" ] || continue
      echo "▶ $(basename "$sql_file")"
      docker run --rm -i --network iklimco-net \
        -e PGPASSWORD="${DATABASE_POSTGRES_ROOT_PASSWD}" \
        postgis/postgis:17-3.5 \
        psql -h postgresql -U "${DATABASE_POSTGRES_ROOT_USER}" < "$sql_file"
    done

    echo "⏳ Waiting for MongoDB..."
    until docker run --rm --network iklimco-net mongo:8 \
      mongosh "mongodb://${DATABASE_MONGODB_ROOT_USER}:${DATABASE_MONGODB_ROOT_PASSWD}@mongodb/admin" \
      --eval "db.runCommand({ping:1})" --quiet 2>/dev/null; do
      sleep 5
    done
    for js_file in ./init/mongodb/*.js; do
      [ -e "$js_file" ] || continue
      echo "▶ $(basename "$js_file")"
      docker run --rm -i --network iklimco-net mongo:8 \
        mongosh "mongodb://${DATABASE_MONGODB_ROOT_USER}:${DATABASE_MONGODB_ROOT_PASSWD}@mongodb/admin" \
        --quiet < "$js_file"
    done
    echo "✅ Database init scripts completed"
  working-directory: /workspace/iklim.co
```

- `postgresql` and `mongodb`: Swarm VIP service names, resolved on the `iklimco-net` overlay; routing to the Patroni primary happens at the VIP level
- SQL files `./init/postgresql/*.sql` and JS files `./init/mongodb/*.js` are created in the `Prepare Init Files` step by the `init_postgresql`/`init_mongodb` functions in `common-functions.sh`
- Idempotent: `CREATE IF NOT EXISTS` / `createCollection` semantics; runs safely again on later deploys

## Swarm Service Distribution

In prod, all 3 app nodes are manager + app worker, so services can be distributed across 3 nodes.

### Microservices

Each microservice has two stack files:

| File | Content | Environment |
| --- | --- | --- |
| `BE-<Service>/docker-stack-service.yml` | Base definitions, `replicas: 1` | Test + Prod |
| `BE-<Service>/docker-stack-service.prod.yml` | `replicas: 3`, `max_replicas_per_node: 1` | Prod only |

Prod deploy command:

```bash
docker stack deploy \
  -c BE-<Service>/docker-stack-service.yml \
  -c BE-<Service>/docker-stack-service.prod.yml \
  iklimco
```

`max_replicas_per_node: 1` is mandatory. Without it, Swarm may place more than one replica on the same node, for example when fewer nodes are available than the replica count. With it, a replica that cannot get its own node stays pending until a node becomes available.

### Infra Services

`docker-stack-infra.yml` (base) and `docker-stack-infra.prod.yml` (overlay) are deployed together. The overlay overrides services such as Vault, APISIX, RabbitMQ, and Redis Sentinel with `replicas: 3` and `max_replicas_per_node: 1`. Detail: `Environment_Infrastructure/roadmap/prod-env/03-infra-stack-changes.md`.

#### cert-reloader and Vault Auto-Unseal

The `cert-reloader` sidecar service runs as `replicas: 1` inside the infra stack. It detects the Let's Encrypt certificate renewed by SWAG and distributes it to Vault. Because prod uses the shared StorageBox mount, SSH-based distribution is not required.

Certificate renewal flow:

```
SWAG renews the certificate -> writes it to SWAG_CONFIG_DIR (/mnt/storagebox/swag/config)
cert-reloader detects the MD5 change
  -> copies it to /mnt/storagebox/ssl/ directory (common mount on all app nodes)
  -> runs docker service update --force iklimco_vault
Vault (3 replicas) restarts
  -> each instance reads the new certificate from the /mnt/storagebox/ssl/ mount
  -> healthcheck checks sealed status every 30 seconds
  -> if sealed: reads from the vault_unseal_key Docker secret and automatically unseals
```

The auto-unseal mechanism is provided by the Vault healthcheck inside `docker-stack-infra.yml`:

```yaml
healthcheck:
  test:
    - "CMD"
    - "sh"
    - "-c"
    - >-
      vault status -format=json 2>/dev/null | grep -q '"sealed":false' ||
      vault operator unseal $$(cat /run/secrets/vault_unseal_key 2>/dev/null)
  interval: 30s
  timeout: 10s
  start_period: 15s
  retries: 5
```

The 3 replicas run their own healthchecks independently, and each one unseals itself. The certificate renewal -> restart -> auto-unseal chain requires no manual intervention. Detail: `roadmap/prod-env/06-cert-reloader.md`.

#### Vault Raft Configuration

Vault is defined as 3 replicas with Raft storage in the `docker-stack-infra.prod.yml` overlay:

```yaml
vault:
  environment:
    VAULT_LOCAL_CONFIG: >-
      {"api_addr":"https://vault.iklim.co:8200",
      "cluster_addr":"https://{{ .Node.Hostname }}:8201",
      "storage":{"raft":{"path":"/vault/file","node_id":"{{ .Node.Hostname }}"}},
      "listener":[{"tcp":{"address":"0.0.0.0:8200",
      "tls_cert_file":"/vault/certs/STAR.iklim.co.full.crt",
      "tls_key_file":"/vault/certs/STAR.iklim.co_key.pem"}}],
      "default_lease_ttl":"168h","max_lease_ttl":"720h","ui":true}
  volumes:
    - /opt/iklimco/vault/data:/vault/file    # separate host path on each node; created with Ansible
    - ${SWAG_CERT_DIR}:/vault/certs:ro       # StorageBox shared; all nodes see the same path
  deploy:
    mode: replicated
    replicas: 3
    placement:
      max_replicas_per_node: 1
      constraints:
        - node.labels.type == service
```

`{{ .Node.Hostname }}` is a Docker Swarm Go template; it gives each Vault instance a unique `node_id` and `cluster_addr`. `/opt/iklimco/vault/data` is a host path, not a shared volume, so it must be created separately on each app node during Ansible bootstrap. See "Node Directory Role" in `07-prod-ansible-bootstrap.md`. Detail: `roadmap/prod-env/07-vault-raft-plan.md`.

## Vault Raft Cluster Initial Setup

After the infra stack is deployed for the first time, the Vault Raft cluster is initialized manually once. These steps are not repeated on every deploy; they are applied only during initial setup.

### Step 1 — Stack Deploy

```bash
docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco
```

3 Vault containers start. The first initialized node becomes the leader.

### Step 2 — Vault Initialize (iklim-app-01)

```bash
VAULT_CTR=$(docker ps -q -f name=iklimco_vault)
docker exec -it "$VAULT_CTR" vault operator init
```

Store the unseal keys and root token from the output securely. Save the unseal key as a Docker secret:

```bash
echo -n "<unseal-key>" | docker secret create vault_unseal_key -
```

> After this step, the `vault_unseal_key` secret exists. During later certificate renewals, cert-reloader restarts Vault; the healthcheck reads this secret and automatically unseals, so no manual intervention is required.

### Step 3 — Unseal the Leader

```bash
docker exec -it "$VAULT_CTR" vault operator unseal
```

### Step 4 — Join the Other Nodes to the Raft Cluster

The Vault containers on `iklim-app-02` and `iklim-app-03` join the cluster:

```bash
docker exec -it <vault-on-iklim-app-02> vault operator raft join \
  https://vault.iklim.co:8200

docker exec -it <vault-on-iklim-app-03> vault operator raft join \
  https://vault.iklim.co:8200
```

Each node is also unsealed after it joins:

```bash
docker exec -it <vault-on-iklim-app-02> vault operator unseal
docker exec -it <vault-on-iklim-app-03> vault operator unseal
```

### Step 5 — Verify the Cluster

```bash
docker exec "$VAULT_CTR" vault operator raft list-peers
```

Expected: 3 peers, one `leader` and two `follower`.

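The expected peer layout can also be checked mechanically. The sketch below reads the `list-peers` output on stdin and counts states; `check_raft_peers` is a hypothetical helper, and it assumes the default table output in which the third column holds the peer state.

```bash
#!/bin/sh
# Hypothetical check: read `vault operator raft list-peers` output on stdin and
# succeed only if it lists exactly 1 leader and 2 followers.
# Assumes the state is the third whitespace-separated column.
check_raft_peers() {
  awk '
    BEGIN { leaders = 0; followers = 0 }
    $3 == "leader"   { leaders++ }
    $3 == "follower" { followers++ }
    END { exit !(leaders == 1 && followers == 2) }
  '
}

# Usage on iklim-app-01 (not executed here):
#   docker exec "$VAULT_CTR" vault operator raft list-peers | check_raft_peers
```

Header and separator rows are ignored automatically because their third column is neither `leader` nor `follower`.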
## Gateway and Public Traffic

Public internet enters only through SWAG on `80/tcp` and `443/tcp`. SWAG is pinned to `iklim-app-01`, where the Floating IP is located. APISIX admin ports (`9180`) and other service ports are not opened publicly; SWAG forwards all public traffic to APISIX as a reverse proxy.

### Subdomain Routing

| Subdomain | Target Service | Restriction |
| --- | --- | --- |
| `api.iklim.co` | APISIX `:9080` | Public |
| `apigw.iklim.co` | APISIX Dashboard `:9000` | IP restricted with `RESTRICTED_IPS` |
| `rabbitmq.iklim.co` | RabbitMQ Management `:15672` | IP restricted with `RESTRICTED_IPS` |
| `grafana.iklim.co` | Grafana `:3000` | IP restricted with `RESTRICTED_IPS` |

IP restriction is done with the `RESTRICTED_IPS_BLOCK` nginx allow block derived from the `RESTRICTED_IPS` variable; it is applied in SWAG nginx configuration, not in the Hetzner firewall.

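To make the derivation concrete, the sketch below turns a comma-separated `RESTRICTED_IPS` value into nginx `allow` lines. The function name `restricted_ips_block` and the trailing `deny all;` line are assumptions for illustration; the actual rendering lives in the deploy scripts and may differ.

```bash
#!/bin/sh
# Hypothetical rendering of RESTRICTED_IPS into nginx allow/deny directives.
# Input: comma-separated CIDR list. Output: one "allow" line per entry,
# followed by "deny all;" (the deny line is an assumption of this sketch).
restricted_ips_block() {
  old_ifs=$IFS
  IFS=,
  for cidr in $1; do
    printf 'allow %s;\n' "$cidr"
  done
  IFS=$old_ifs
  printf 'deny all;\n'
}

restricted_ips_block "78.187.87.109/32,95.70.151.248/32"
```

With nginx's `allow`/`deny` semantics, a client that matches none of the `allow` lines falls through to `deny all;` and receives HTTP 403, which is the behavior the IP restriction checks later in this document expect.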
### SWAG -> APISIX Load Distribution

SWAG connects to APISIX through the Docker Swarm service name with `proxy_pass http://apisix:9080;`. Swarm resolves the `apisix` service name to a VIP (Virtual IP); the IPVS load balancer distributes incoming connections round-robin across the 3 replicas in prod. No additional upstream or load balancer configuration is required on the SWAG side; load distribution happens transparently at the overlay network layer.

`Prometheus` is intentionally not exposed externally through SWAG. Access uses Grafana, whose internal connection is `http://prometheus:9090`, or an SSH tunnel.

Detail: `Environment_Infrastructure/roadmap/prod-env/04-swag-nginx-configs.md`.

## Post-Deploy Verification

After a successful prod pipeline deploy, run the following checks.

### Swarm Health

```bash
docker node ls
```

Expected: 3 managers (1 `Leader` + 2 `Reachable`) on `iklim-app-01/02/03` and 3 workers (`Ready`) on `iklim-db-01/02/03`.

```bash
docker service ls --filter label=project=co.iklim
```

All services must show `REPLICAS X/X`, meaning the running count matches the target.

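Reading the `REPLICAS` column by eye gets tedious as the service count grows. The sketch below flags any service whose running count differs from its target; `check_replicas` is a hypothetical helper that expects `<name> <running>/<desired>` lines on stdin.

```bash
#!/bin/sh
# Hypothetical check: read "<name> <running>/<desired>" lines on stdin,
# print services whose running count differs from the desired count,
# and return 1 if any such service exists.
check_replicas() {
  awk '
    BEGIN { bad = 0 }
    {
      split($2, r, "/")
      if (r[1] != r[2]) {
        print "not converged: " $1 " (" $2 ")"
        bad = 1
      }
    }
    END { exit bad }
  '
}

# Usage on a manager node (not executed here):
#   docker service ls --filter label=project=co.iklim \
#     --format '{{.Name}} {{.Replicas}}' | check_replicas
```

Only the second field is inspected, so any extra text that Docker appends after the replica count is ignored.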
### Precipitation Image Directory

```bash
ls -ld /mnt/storagebox/precipitation/images
```

The directory must exist; it is required before `iklimco_precipitation-service` is deployed.

```bash
docker volume inspect iklimco_image-data
```

Expected: `Options.device` -> `/mnt/storagebox/precipitation/images`.

### SWAG Certificate

```bash
docker exec $(docker ps -q -f name=iklimco_swag) certbot certificates
```

Expected: `*.iklim.co`, `VALID: XX days` (Let's Encrypt, not the old manual cert).

TLS check from outside:

```bash
echo | openssl s_client -connect api.iklim.co:443 -servername api.iklim.co 2>/dev/null \
  | openssl x509 -noout -subject -dates
```

Expected: `CN=*.iklim.co`, `notAfter > 2026-07-15`.

> Warning: The old manual `*.iklim.co` certificate expires on **2026-07-15**. Once SWAG's Let's Encrypt certificate has been verified for the first time, the old cert on StorageBox is no longer used and can be archived.

### Public API Access

```bash
curl -si https://api.iklim.co/health
```

It must return HTTP 2xx; there must be no TLS error.

### IP Restriction

From a disallowed IP:

```bash
curl -si https://grafana.iklim.co
curl -si https://apigw.iklim.co
curl -si https://rabbitmq.iklim.co
```

All must return HTTP 403.

From an allowed IP (78.187.87.109 or 95.70.151.248):

```bash
curl -si https://grafana.iklim.co    # HTTP 200 Grafana
curl -si https://apigw.iklim.co      # HTTP 200 APISIX Dashboard
curl -si https://rabbitmq.iklim.co   # HTTP 200 RabbitMQ Management
```

### Vault Access Control

Must not be reachable from outside:

```bash
# Expected: connection refused or timeout
curl -sk --connect-timeout 5 https://<iklim-app-01-public-ip>:8200/v1/sys/health
```

Must be reachable from inside the overlay:

```bash
# Expected: {"sealed":false,...}
docker exec $(docker ps -q -f name=iklimco_apisix | head -1) \
  curl -sk https://vault.iklim.co:8200/v1/sys/health
```

### No Unexpected Ports

```bash
docker service ls --format "{{.Name}}\t{{.Ports}}" \
  --filter label=project=co.iklim
```

Only `iklimco_swag` -> `*:80->80/tcp, *:443->443/tcp` should publish ports; other services must not publish ports.

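The same rule can be expressed as a filter over the command output. In the sketch below, `check_published_ports` is a hypothetical helper that reads tab-separated `<name><TAB><ports>` lines, as produced by the format string above, and reports every service other than `iklimco_swag` with a non-empty ports column.

```bash
#!/bin/sh
# Hypothetical check: read "<name><TAB><ports>" lines on stdin and report any
# service other than iklimco_swag that publishes a port.
# Returns 1 if an unexpected published port is found, 0 otherwise.
check_published_ports() {
  awk -F '\t' '
    BEGIN { bad = 0 }
    $2 != "" && $1 != "iklimco_swag" {
      print "unexpected published port: " $1 " " $2
      bad = 1
    }
    END { exit bad }
  '
}

# Usage on a manager node (not executed here):
#   docker service ls --format "{{.Name}}\t{{.Ports}}" \
#     --filter label=project=co.iklim | check_published_ports
```

The check does not validate which ports `iklimco_swag` itself publishes; that part still needs a manual look at the output.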
### APISIX Replica Distribution

```bash
docker service ps iklimco_apisix
```

Expected: 3 tasks, all `Running`, on different nodes.

### fail2ban (SWAG Container)

```bash
docker exec $(docker ps -q -f name=iklimco_swag) fail2ban-client status
```

Expected: a list with more than one jail.

### Microservice Health (After Microservices Are Deployed)

After microservices are deployed with a separate pipeline:

```bash
curl -si "https://api.iklim.co/v1/weather/current?lat=39&lon=35"
```

Expected: valid JSON weather response.

## Acceptance Criteria

- 3 prod runners appear online in the Gitea UI.
- Every runner has the `prod-runner` label.
- Any runner can run a simple Docker command.
- `docker node ls` shows 3 managers.
- When one runner/node is shut down, another runner can pick up a new job.
- All prod deploy workflows (`concurrency: group: prod-deploy`) are queued by Gitea; concurrent execution is prevented.
- Public ingress is limited to only `22`, `80`, and `443`.
- `prod/secrets/iklim.co/.env.secrets.swag` exists on StorageBox and contains valid GoDaddy credentials.
- `PROD_FLOATING_IP` project variable is defined in Gitea.
- `redis_password` and `rabbitmq_erlang_cookie` appear in `docker secret ls`.
- The `ssl`, `swag/config`, `swag/site-confs`, `grafana/data`, `prometheus/data`, and `precipitation/images` directories exist on StorageBox; see "StorageBox Directory Structure" in `07-prod-ansible-bootstrap.md`.
- The `swag/site-confs/default.conf`, `api.conf.tpl`, `apigw.conf.tpl`, `rabbitmq.conf.tpl`, and `grafana.conf.tpl` template files exist in the repo.
- StorageBox `prod/secrets/iklim.co/.env.prod` has correct values for `API_SUBDOMAIN`, `APIGW_SUBDOMAIN`, `RABBITMQ_SUBDOMAIN`, `GRAFANA_SUBDOMAIN`, `RESTRICTED_IPS`, `SWAG_CERT_DIR`, `SWAG_CONFIG_DIR`, and `SWAG_SITE_CONFS_DIR`.
- After the first deploy, `docker exec $(docker ps -q -f name=iklimco_swag) nginx -t` succeeds and returns `syntax is ok`.
- The output of `cat /mnt/storagebox/swag/site-confs/api.conf | grep server_name` contains `server_name api.iklim.co;`.
- The `ssls/1` PUT block does not exist inside `init/apisix-core/init.sh`.
- The `registry.tarla.io/iklimco/custom-apisix:3.12.0` image exists in Harbor and its `config.yaml` contains the `set_real_ip_from: 10.0.0.0/8` configuration.
- After the first deploy, the real client IP appears in APISIX access logs, not the SWAG overlay IP: `docker exec $(docker ps -q -f name=iklimco_apisix | head -1) tail -5 /usr/local/apisix/logs/access.log`
- `docker service ps iklimco_cert-reloader` shows that the service is running.
- The output of `docker service logs iklimco_cert-reloader --tail 20` contains `[cert-reloader] started` and has no error lines.
- The `notAfter` date of the Vault TLS endpoint certificate matches `/mnt/storagebox/ssl/STAR.iklim.co.full.crt`: `docker exec $(docker ps -q -f name=iklimco_vault | head -1) sh -c 'echo | openssl s_client -connect vault.iklim.co:8200 2>/dev/null | openssl x509 -noout -dates'`
- `vault operator raft list-peers` returns 3 peers: 1 leader, 2 followers.
- The `vault_unseal_key` Docker secret exists and appears in `docker secret ls`.
- 3 Vault containers are not sealed: `docker exec $(docker ps -q -f name=iklimco_vault | head -1) vault status | grep Sealed` -> `Sealed false`.
- The first deploy pipeline successfully completes all 21 steps; the `Review Environment` step succeeds.
- After the `Bootstrap SWAG Certificate` step, `ls /mnt/storagebox/ssl/` -> `STAR.iklim.co.full.crt` and `STAR.iklim.co_key.pem` exist.
- The `Run Database Init Scripts` step completes without error; PostgreSQL and MongoDB are healthy and init scripts are applied.
- In the output of `docker service ls --filter label=project=co.iklim`, all infra services show `X/X`.
- `docker volume inspect iklimco_image-data` -> `Options.device=/mnt/storagebox/precipitation/images`.
- `docker exec $(docker ps -q -f name=iklimco_swag) certbot certificates` -> `*.iklim.co` Let's Encrypt certificate is valid; it is not the old manual cert.
- `echo | openssl s_client -connect api.iklim.co:443 2>/dev/null | openssl x509 -noout -subject -dates` -> `CN=*.iklim.co`, `notAfter > 2026-07-15`.
- `curl -si https://api.iklim.co/health` -> HTTP 2xx; no TLS error.
- `https://grafana.iklim.co`, `https://apigw.iklim.co`, and `https://rabbitmq.iklim.co` return HTTP 403 from a disallowed IP and HTTP 200 from an allowed IP.
- `curl --connect-timeout 5 https://<public-ip>:8200` -> connection refused or timeout; Vault is not reachable from outside.
- `docker exec $(docker ps -q -f name=iklimco_apisix | head -1) curl -sk https://vault.iklim.co:8200/v1/sys/health` -> `{"sealed":false,...}`; reachable from inside the overlay.
- `docker service ls --format "{{.Name}}\t{{.Ports}}" --filter label=project=co.iklim` -> only `iklimco_swag` publishes ports.
- `docker service ps iklimco_apisix` -> 3 tasks, `Running`, on different nodes.
- `docker exec $(docker ps -q -f name=iklimco_swag) fail2ban-client status` -> more than one jail appears.