docs: update production roadmap for HA Vault and shared storage

- Refactor production setup documentation to reflect a 3-node Vault Raft cluster starting from launch.
- Update all paths to use StorageBox mounts for shared state (SWAG config, TLS certs, Monitoring data).
- Switch Nginx configuration convention from proxy-confs to site-confs to align with SWAG's auto-include behavior.
- Standardize TLS private key extensions to .pem.
- Update node failover and recovery facts to include monitoring services.
- Align deployment pipeline instructions with the latest environment variable-driven approach.
Murat ÖZDEMİR 2026-05-16 16:18:21 +03:00
parent f4b7f49968
commit 5ddba7eba4
17 changed files with 743 additions and 231 deletions


@ -77,31 +77,34 @@ Prod ortamında birden fazla manager node (en az 3) çalıştırılır. Tek mana
---
# Prod — SWAG Failover
# Prod — Monitoring & SWAG Failover
SWAG is not cluster-native; it always runs as a single instance and is pinned to `iklim-app-01` (the Floating IP node). When `iklim-app-01` goes down, SWAG and cert-reloader stop as well; DNS and HTTPS access is lost. Swarm quorum continues with 2 managers; microservices and Vault are moved to other nodes.
SWAG, cert-reloader, Prometheus, and Grafana are not cluster-native (replicated); each always runs as a single instance and is pinned by default to `iklim-app-01` (the Floating IP node). When `iklim-app-01` goes down, these services stop; DNS/HTTPS access and monitoring are lost. Swarm quorum continues with 2 managers; microservices and Vault are moved to other nodes.
Because the SWAG configuration (`/config`, including the letsencrypt certificates) is kept on StorageBox (`SWAG_CONFIG_DIR=/mnt/storagebox/prod/swag/config`), a manual failover can be done quickly.
The data and configuration of all these services are kept on StorageBox:
- **SWAG:** `/mnt/storagebox/swag/config`
- **SSL:** `/mnt/storagebox/ssl`
- **Prometheus:** `/mnt/storagebox/prometheus/data`
- **Grafana:** `/mnt/storagebox/grafana/data`
## Prod Scenario: `iklim-app-01` Is Down
### 1. Move SWAG to Another Node
### 1. Move the Services to Another Node
SWAG and cert-reloader must be moved together. Prometheus and Grafana can be moved independently or at the same time.
```bash
# On iklim-app-02 or iklim-app-03 (an active manager):
docker service update \
--constraint-add "node.hostname == iklim-app-02" \
--constraint-rm "node.hostname == iklim-app-01" \
iklimco_swag
docker service update \
--constraint-add "node.hostname == iklim-app-02" \
--constraint-rm "node.hostname == iklim-app-01" \
iklimco_cert-reloader
# Move SWAG & cert-reloader
docker service update --constraint-add "node.hostname == iklim-app-02" --constraint-rm "node.hostname == iklim-app-01" iklimco_swag
docker service update --constraint-add "node.hostname == iklim-app-02" --constraint-rm "node.hostname == iklim-app-01" iklimco_cert-reloader
# Move Prometheus & Grafana
docker service update --constraint-add "node.hostname == iklim-app-02" --constraint-rm "node.hostname == iklim-app-01" iklimco_prometheus
docker service update --constraint-add "node.hostname == iklim-app-02" --constraint-rm "node.hostname == iklim-app-01" iklimco_grafana
```
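The four `docker service update` calls above differ only in the service name, so they can be generated from a list. The following is a minimal sketch rather than a script from the repo: `move_services` and the `DRY_RUN` switch are illustrative names, and the `iklimco` stack prefix is taken from the commands above. With `DRY_RUN=1` (the default here) it only prints what it would run.

```bash
# Hypothetical helper: re-pin the single-instance services from one node to another.
# DRY_RUN=1 prints the commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

move_services() {
    from="$1"; to="$2"
    for svc in iklimco_swag iklimco_cert-reloader iklimco_prometheus iklimco_grafana; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "docker service update --constraint-add \"node.hostname == $to\" --constraint-rm \"node.hostname == $from\" $svc"
        else
            docker service update \
                --constraint-add "node.hostname == $to" \
                --constraint-rm "node.hostname == $from" \
                "$svc"
        fi
    done
}

move_services iklim-app-01 iklim-app-02
```

Setting `DRY_RUN=0` on an active manager executes the updates; swapping the two arguments gives the move back to `iklim-app-01`.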
SWAG finds the existing letsencrypt certificates on StorageBox; it does not request new ones. cert-reloader starts on the new node and writes to `/mnt/storagebox/prod/ssl`.
### 2. Move the Floating IP to the New Node
**Via CLI:**
@ -118,31 +121,24 @@ hcloud floating-ip assign <floating-ip-id> <iklim-app-02-server-id>
4. Open the **⋮** (three dots) menu to the right of the `iklim-prod-app-fip` row → **Reassign**.
5. Select **`iklim-app-02`** from the list that opens → click the **Reassign** button.
Because the DNS A record already points to the Floating IP, no additional DNS change is needed.
### 3. Verify
```bash
docker service ps iklimco_swag
docker service ps iklimco_cert-reloader
docker service ls | grep -E 'swag|cert-reloader|prometheus|grafana'
curl -si https://api.iklim.co/health
```
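`docker service ls` reports replicas in `running/desired` form, so the check can be scripted instead of read by eye. A small sketch, assuming the input is produced with `--format '{{.Name}} {{.Replicas}}'`; the function name `not_converged` is made up for this example.

```bash
# Reads "NAME RUNNING/DESIRED" lines on stdin and prints the services
# whose running count differs from the desired count.
not_converged() {
    while read -r name replicas _rest; do
        running="${replicas%%/*}"
        desired="${replicas##*/}"
        [ "$running" = "$desired" ] || echo "$name ($replicas)"
    done
}

# Live usage on a manager node:
# docker service ls --format '{{.Name}} {{.Replicas}}' \
#   | grep -E 'swag|cert-reloader|prometheus|grafana' | not_converged
```

Empty output means all four services have their task running on the new node.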
### When `iklim-app-01` Comes Back
After the node rejoins the Swarm, move the services back to `iklim-app-01` and reassign the Floating IP:
After the node rejoins the Swarm, you can move all services back to `iklim-app-01` and reassign the Floating IP.
```bash
docker service update \
--constraint-add "node.hostname == iklim-app-01" \
--constraint-rm "node.hostname == iklim-app-02" \
iklimco_swag
docker service update \
--constraint-add "node.hostname == iklim-app-01" \
--constraint-rm "node.hostname == iklim-app-02" \
iklimco_cert-reloader
# Move the services back
for svc in iklimco_swag iklimco_cert-reloader iklimco_prometheus iklimco_grafana; do
  docker service update --constraint-add "node.hostname == iklim-app-01" --constraint-rm "node.hostname == iklim-app-02" "$svc"
done
# Reassign the Floating IP to iklim-app-01
hcloud floating-ip assign <floating-ip-id> <iklim-app-01-server-id>
```
@ -153,4 +149,5 @@ hcloud floating-ip assign <floating-ip-id> <iklim-app-01-server-id>
| Swarm quorum | Automatic: 2 managers are sufficient |
| Vault, microservices | Automatic: scheduled onto another node via the `node.labels.type == service` constraint |
| SWAG, cert-reloader | Manual: `docker service update --constraint-*` + Floating IP reassignment |
| TLS certificates | On StorageBox; the failover node has immediate access, no re-issuance needed |
| Prometheus, Grafana | Manual: `docker service update --constraint-*` |
| Data & config | On StorageBox; the failover node has immediate access, no data loss |


@ -135,7 +135,7 @@ not via the Gitea pipeline.
| Constraint | Resolves to | Services |
|------------|-------------|----------|
| `node.hostname == iklim-app-01` | iklim-app-01 only | SWAG, cert-reloader |
| `node.labels.type == service` | iklim-app-01, iklim-app-02, iklim-app-03 | Vault, Redis, RabbitMQ, APISIX, Prometheus, Grafana, etcd |
| `node.labels.type == service` | iklim-app-01, iklim-app-02, iklim-app-03 | Vault, Redis, RabbitMQ, APISIX, Prometheus, Grafana, etcd (idle in prod — APISIX uses Patroni etcd) |
| `node.labels.role == db` | iklim-db-01, iklim-db-02, iklim-db-03 | PostgreSQL, MongoDB, pg-proxy, mongo-proxy |
SWAG and cert-reloader are pinned to `iklim-app-01` (the Floating IP node) because SWAG does not support clustering and must match the public entry point. Vault floats across all service nodes; its TLS cert is read from StorageBox (`/mnt/storagebox/prod/ssl`) so it is available on whichever node Vault is scheduled on. Microservices carry no placement constraint and are distributed by the Swarm scheduler across all app nodes. DB services are pinned to DB nodes via separate DB stacks.
SWAG and cert-reloader are pinned to `iklim-app-01` (the Floating IP node) because SWAG does not support clustering and must match the public entry point. Vault floats across all service nodes; its TLS cert is read from StorageBox (`/mnt/storagebox/ssl`) so it is available on whichever node Vault is scheduled on. Microservices carry no placement constraint and are distributed by the Swarm scheduler across all app nodes. DB services are pinned to DB nodes via separate DB stacks.


@ -35,15 +35,16 @@ No additional action needed in the repo.
The deploy pipeline (see `08-deploy-pipeline-update.md`) runs on iklim-app-01:
```bash
mkdir -p /opt/iklimco/swag/dns-conf
envsubst < swag/dns-conf/godaddy.ini.tpl > /opt/iklimco/swag/dns-conf/godaddy.ini
chmod 600 /opt/iklimco/swag/dns-conf/godaddy.ini
set -a; . ./.env; set +a
mkdir -p "$SWAG_DNS_CONF_DIR"
envsubst < swag/dns-conf/godaddy.ini.tpl > "$SWAG_DNS_CONF_DIR/godaddy.ini"
chmod 600 "$SWAG_DNS_CONF_DIR/godaddy.ini"
```
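`envsubst` replaces unset variables with an empty string without complaining, so a missing value would silently produce a broken `godaddy.ini`. One possible guard, sketched under the assumption that the template references `GODADDY_KEY` and `GODADDY_SECRET` (hypothetical names; check the actual `.tpl` file):

```bash
# Fail early if any variable the template needs is unset or empty.
# Prints each missing name to stderr and returns non-zero if any is missing.
require_vars() {
    missing=0
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage before rendering (variable names other than SWAG_DNS_CONF_DIR are assumptions):
# require_vars SWAG_DNS_CONF_DIR GODADDY_KEY GODADDY_SECRET || exit 1
```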
## Step 4 — GoDaddy A records for prod subdomains
In GoDaddy DNS panel for `iklim.co`, add/update A records pointing to the **Floating IP** (`iklim-prod-app-fip`).
Floating IP değerini almak için: `terraform output prod_floating_ip`
To get the Floating IP value: `terraform output prod_floating_ip`
| Record | Value |
|--------|-------|
@ -52,8 +53,8 @@ Floating IP değerini almak için: `terraform output prod_floating_ip`
| `rabbitmq` | `<iklim-prod-app-fip>` |
| `grafana` | `<iklim-prod-app-fip>` |
> Floating IP `iklim-app-01`'e atanmıştır (`06-prod-terraform-iaac.md``floating_ip.tf`).
> Failover gerekirse Floating IP başka bir app node'una taşınabilir; DNS değişmez.
> The Floating IP is assigned to `iklim-app-01` (see `floating_ip.tf` in `06-prod-terraform-iaac.md`).
> If failover is needed, the Floating IP can be reassigned to another app node; DNS does not change.
## Notes
- Test and prod SWAG instances both obtain `*.iklim.co` independently from Let's Encrypt.


@ -1,13 +1,49 @@
# 03 — docker-stack-infra.yml Changes (Prod)
## Context
- **File:** `docker-stack-infra.yml` (repo root — shared between test and prod)
- All changes from `test-env/03-infra-stack-changes.md` apply here identically.
- **Additional prod-specific changes:**
- Microservices have no constraint (distributed across app nodes by Swarm).
- Replica counts for stateless services are increased.
- **Note:** PostgreSQL and MongoDB are **not** in `docker-stack-infra.yml` for prod. They run on
dedicated DB nodes in separate stacks (`iklim-db` and `iklim-patroni`). See `08-prod-db-cluster-kurulum.md`.
### File strategy — overlay approach
Prod-specific service changes are **not written directly** into `docker-stack-infra.yml`; they are kept in a separate overlay file:
| File | Usage |
|------|-------|
| `docker-stack-infra.yml` | Base — works as-is for test |
| `docker-stack-infra.prod.yml` | Prod overlay — additional services and overrides |
```bash
# Test deploy:
docker stack deploy -c docker-stack-infra.yml iklimco
# Prod deploy (Swarm merges both files):
docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco
```
Docker Swarm merge rule: if the same service name appears in both files, the overlay wins (deploy, environment, etc.); services only present in the overlay are added.
### Prod-specific changes summary
- APISIX: 1 → 3 replicas (overlay override)
- Redis: single-instance → Sentinel cluster — 1 master + 2 replicas + 3 sentinels (overlay adds new services)
- RabbitMQ: 1 → 3-node Erlang cluster (overlay override + env)
- Vault: 1 → 3-node Raft cluster (overlay override) — see `07-vault-raft-plan.md`
- No separate APISIX etcd: Patroni etcd is shared (`/apisix` prefix)
- `init/apisix-core/init.sh`: when `PROFILE=prod`, rate limit `policy:local` → `policy:redis`
### swag-vl volume — not used in prod, not defined in overlay
Test-env Step 9 adds the `swag-vl` named volume to the base file. In prod, SWAG mounts the StorageBox via the `${SWAG_CONFIG_DIR}` env var, so no service uses this volume. There is no need to remove it in the overlay: Swarm does not create volumes that no service references, so the definition is harmless.
No `swag-vl` definition is made in `docker-stack-infra.prod.yml`.
### Monitoring Persistence (StorageBox)
Prometheus and Grafana run as single instances. To ensure monitoring data and dashboards survive a node failover (moving from `iklim-app-01` to another node), their data is stored on the shared StorageBox:
- **Prometheus:** `/mnt/storagebox/prometheus/data`
- **Grafana:** `/mnt/storagebox/grafana/data`
These paths are mounted via env vars (`PROMETHEUS_DATA_DIR`, `GRAFANA_DATA_DIR`) with named-volume fallbacks for test. See Step 8 for implementation details.
**Note:** PostgreSQL and MongoDB are not in `docker-stack-infra.yml`. They run in separate stacks on DB nodes (`iklim-db` and `iklim-patroni`). See `08-prod-db-cluster-kurulum.md`.
## Step 1 — Apply all test-env changes first
@ -17,69 +53,515 @@ Follow every step in `test-env/03-infra-stack-changes.md`:
- Remove published ports for vault, apisix, rabbitmq, prometheus, grafana, apisix-dashboard
- Add `swag-vl` volume
## Step 2 — Pin Vault to manager node (initial prod — single instance)
## Step 2 — Vault: 3-node Raft cluster (prod)
Vault starts as a single instance pinned to the manager node.
Raft cluster migration is handled separately in `07-vault-raft-plan.md`.
Vault starts directly with 3 replicas; the Phase 1 single-instance stage is skipped in prod.
See `07-vault-raft-plan.md` Phase 2 for detailed setup steps.
```yaml
# Vault placement stays as:
placement:
constraints:
- node.role == manager
```
## Step 3 — Increase APISIX replicas for prod
```yaml
# CHANGE in apisix service deploy block:
vault:
deploy:
mode: replicated
replicas: 2 # was 1
```
APISIX is stateless (config in etcd) — multiple replicas are safe.
Swarm load-balances SWAG's requests across APISIX replicas via VIP.
## Step 4 — etcd: single instance in docker-stack-infra.yml (APISIX config store only)
The `etcd` service in `docker-stack-infra.yml` is used exclusively by APISIX as its configuration
store. It runs as a single instance on a manager node and is separate from the etcd cluster used by
Patroni for PostgreSQL HA.
```yaml
# etcd placement stays as:
replicas: 3
placement:
constraints:
- node.role == manager
- node.labels.type == service
```
> The 3-node etcd cluster for Patroni/PostgreSQL HA is deployed separately via `08-prod-db-cluster-kurulum.md`
> on the dedicated DB nodes. These are two independent etcd deployments with different purposes.
## Step 3 — APISIX: 3 replicas + init.sh rate limit update (prod overlay)
## Step 5 — Verify the complete file
Add to `docker-stack-infra.prod.yml`:
After all edits, validate the YAML:
```yaml
# docker-stack-infra.prod.yml
services:
apisix:
deploy:
mode: replicated
replicas: 3
placement:
constraints:
- node.labels.type == service
apisix-dashboard:
deploy:
mode: replicated
replicas: 3
placement:
constraints:
- node.labels.type == service
```
APISIX and apisix-dashboard are stateless (config lives in Patroni etcd) — 3 replicas is safe.
Swarm distributes SWAG requests to APISIX replicas via VIP (IPVS round-robin).
### init.sh — rate limit policy:redis (prod)
With `policy:local`, each APISIX instance counts independently → the global limit effectively becomes 3× with 3 replicas.
Switch to `policy:redis` for `PROFILE=prod`.
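The effect is simple arithmetic. As a rough illustration (an upper bound that assumes requests are spread evenly across replicas; `effective_limit` is a made-up helper, and 300 per 60 s is the count used by the global rule in `init.sh`):

```bash
# With policy:local every replica keeps its own counter, so the cluster-wide
# ceiling is count x replicas. With a shared Redis counter it stays at count.
effective_limit() {
    count="$1"; replicas="$2"; policy="$3"
    if [ "$policy" = "local" ]; then
        echo $((count * replicas))
    else
        echo "$count"
    fi
}

effective_limit 300 3 local   # prints 900
effective_limit 300 3 redis   # prints 300
```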
Update the global rate limit block in `init/apisix-core/init.sh`:
```bash
docker stack config -c docker-stack-infra.yml > /dev/null && echo "YAML valid"
if [[ "$PROFILE" != "dev" ]]; then
if [[ "$PROFILE" == "prod" ]]; then
RATE_POLICY="redis"
RATE_REDIS=',"redis_host":"redis-master","redis_port":6379,"redis_password":"'"$REDIS_PASSWORD"'"'
else
RATE_POLICY="local"
RATE_REDIS=""
fi
call_api "global rate limit" -X PUT "$APISIX_ADMIN_URL/global_rules/1" \
-H "X-API-KEY: $API_KEY" -H "Content-Type: application/json" \
-d '{"plugins":{"limit-count":{"count":300,"time_window":60,"key_type":"var","key":"remote_addr","rejected_code":429,"policy":"'"$RATE_POLICY"'"'"$RATE_REDIS"'}}}'
fi
```
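The nested quoting in the `-d` argument is easy to get wrong. As an illustration only, the payload can be assembled in a small function and inspected before it is sent; `build_rate_limit_payload` is a hypothetical helper, and the field values are copied from the block above.

```bash
# Build the limit-count payload. Every JSON double quote sits inside a
# single-quoted segment, so no backslash escaping is needed.
build_rate_limit_payload() {
    policy="$1"; redis_password="$2"
    extra=""
    if [ "$policy" = "redis" ]; then
        extra=',"redis_host":"redis-master","redis_port":6379,"redis_password":"'"$redis_password"'"'
    fi
    printf '%s\n' '{"plugins":{"limit-count":{"count":300,"time_window":60,"key_type":"var","key":"remote_addr","rejected_code":429,"policy":"'"$policy"'"'"$extra"'}}}'
}

build_rate_limit_payload redis 's3cret'
```

A password containing `"` or `\` would still break the JSON; in that case a JSON-aware tool is the safer choice.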
No output errors = valid.
> APISIX's `limit-count` plugin does not natively support Redis Sentinel; `policy:redis` works with a single endpoint.
> The `redis-master` service name stays constant within Swarm — during Sentinel failover (~10-30 s) rate limiting may be
> temporarily inconsistent; this brief disruption is acceptable. Microservices use Spring Data Redis Sentinel natively.
## Placement summary for prod (docker-stack-infra.yml only)
## Step 4 — etcd: Separate APISIX etcd removed — Patroni etcd shared
| Service | Placement |
|---------|-----------|
| swag | `node.role == manager` |
| cert-reloader | `node.role == manager` |
| vault | `node.role == manager` |
| apisix (2 replicas) | no constraint (distributed across app nodes) |
| apisix-dashboard | no constraint |
| redis | `node.role == manager` |
| rabbitmq | `node.role == manager` |
| etcd (APISIX store) | `node.role == manager` |
| prometheus | `node.role == manager` |
| grafana | `node.role == manager` |
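The standalone `etcd` service in `docker-stack-infra.yml` is **not used in prod**. It cannot be removed through the overlay, so it stays in the base stack and sits idle (see the note at the end of this step).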
APISIX uses the 3-node Patroni etcd cluster running on DB nodes, via the `/apisix` prefix.
> PostgreSQL and MongoDB are deployed in separate DB stacks on `iklim-db-*` nodes.
> See `08-prod-db-cluster-kurulum.md` for those stacks.
### Why consolidated?
- A standalone single-instance etcd was a SPOF for APISIX.
- Patroni etcd is already 3-node HA — APISIX gets a more reliable config store.
- etcd supports prefix-based namespacing; Patroni uses `/service/`, APISIX uses `/apisix/` — no collision.
### APISIX etcd connection configuration
Update the etcd endpoints in the APISIX service in `docker-stack-infra.yml` to point to DB nodes:
```yaml
apisix:
environment:
APISIX_STAND_ALONE: "false"
# via apisix/conf/config.yaml or environment:
# etcd:
# host:
# - "http://iklim-db-01:2379"
# - "http://iklim-db-02:2379"
# - "http://iklim-db-03:2379"
# prefix: "/apisix"
```
The preferred method is mounting `config.yaml` via a Docker config or volume:
```yaml
# config/apisix/config.yaml
etcd:
host:
- "http://iklim-db-01:2379"
- "http://iklim-db-02:2379"
- "http://iklim-db-03:2379"
prefix: "/apisix"
timeout: 30
```
### Firewall requirement
etcd access from app nodes to DB nodes must be open:
```bash
# Each app node → each db node, port 2379
# If inside Hetzner private network it may be open by default;
# verify there are no ufw/firewalld rules blocking it:
nc -zv iklim-db-01 2379
```
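The check above covers only `iklim-db-01`. A sketch that probes all three DB nodes, assuming `nc` is installed on the app node and the hostnames resolve over the private network; `probe_port` and `check_etcd_ports` are illustrative names.

```bash
# probe_port HOST PORT: succeeds if a TCP connection can be opened.
# nc -z only tests the connection; -w 3 sets a 3-second timeout.
probe_port() {
    nc -z -w 3 "$1" "$2" 2>/dev/null
}

# Probe the etcd client port on every DB node; run this from each app node.
check_etcd_ports() {
    failed=0
    for host in iklim-db-01 iklim-db-02 iklim-db-03; do
        if probe_port "$host" 2379; then
            echo "ok   $host:2379"
        else
            echo "FAIL $host:2379"
            failed=1
        fi
    done
    return "$failed"
}
```

Calling `check_etcd_ports` returns non-zero if any node is unreachable, which makes it usable as a pre-deploy gate.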
> **Note:** Docker Compose overlay files can only add/override services, not remove them. The standalone `etcd` service remains in the base stack and runs as an idle container in prod — APISIX connects to Patroni etcd instead (via config.yaml in the prod overlay). This is harmless; etcd uses negligible resources with no active clients.
## Step 5 — Redis: Sentinel cluster (prod overlay)
Redis runs as a single instance in test. In prod, Sentinel provides HA.
Bitnami images are used — all configuration is done via env vars, no separate `.conf` file needed.
### Prerequisites
```bash
# Create Docker secret for Redis password:
openssl rand -hex 32 | docker secret create redis_password -
```
### Topology
```
iklim-app-01: redis-master (1 replica, pinned to app-01)
iklim-app-02: redis-replica (1 replica, pinned to app-02)
iklim-app-03: redis-replica (1 replica, pinned to app-03)
iklim-app-01: redis-sentinel ┐
iklim-app-02: redis-sentinel ├─ 3 replicas, spread across all app nodes
iklim-app-03: redis-sentinel ┘
```
### docker-stack-infra.prod.yml — Redis services
The existing `redis` service is overridden in the prod overlay as **master**; `redis-replica` and `redis-sentinel` are added as new services. The service name (`redis`) remains unchanged so the APISIX connection config does not need updating.
```yaml
# docker-stack-infra.prod.yml
services:
redis: # override base single-instance redis → master
image: bitnamisecure/redis:latest
environment:
ALLOW_EMPTY_PASSWORD: "no"
REDIS_PASSWORD: ${REDIS_PASSWORD}
REDIS_REPLICATION_MODE: master
deploy:
mode: replicated
replicas: 1
placement:
constraints:
- node.hostname == iklim-app-01
restart_policy:
condition: any
delay: 5s
labels:
project: co.iklim
redis-replica:
image: bitnamisecure/redis:latest
environment:
ALLOW_EMPTY_PASSWORD: "no"
REDIS_REPLICATION_MODE: slave
REDIS_MASTER_HOST: redis
REDIS_MASTER_PORT_NUMBER: "6379"
REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD}
REDIS_PASSWORD: ${REDIS_PASSWORD}
deploy:
mode: replicated
replicas: 2
placement:
constraints:
- node.labels.type == service
preferences:
- spread: node.hostname
restart_policy:
condition: any
delay: 5s
labels:
project: co.iklim
redis-sentinel:
image: bitnamisecure/redis-sentinel:latest
environment:
REDIS_SENTINEL_MASTER_NAME: mymaster
REDIS_MASTER_HOST: redis
REDIS_MASTER_PORT_NUMBER: "6379"
REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD}
REDIS_SENTINEL_QUORUM: "2"
REDIS_SENTINEL_DOWN_AFTER_MILLISECONDS: "5000"
REDIS_SENTINEL_FAILOVER_TIMEOUT: "10000"
deploy:
mode: replicated
replicas: 3
placement:
constraints:
- node.labels.type == service
preferences:
- spread: node.hostname
restart_policy:
condition: any
delay: 5s
labels:
project: co.iklim
```
### Microservice connection (Spring Data Redis)
Microservices must use a Sentinel-aware connection:
```yaml
# application-prod.yml
spring:
data:
redis:
sentinel:
master: mymaster
nodes:
- redis-sentinel:26379
password: ${REDIS_PASSWORD}
```
### Verification
```bash
# Query master identity:
docker exec $(docker ps -q -f name=iklimco_redis-sentinel | head -1) \
redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster
```
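When `redis-cli` output is piped, the reply to `SENTINEL get-master-addr-by-name` arrives as two lines, the address followed by the port. A small, hypothetical helper that turns this into a single `address:port` string for logs or scripts:

```bash
# Reads the two-line reply (address, then port) from stdin and prints
# "address:port". Returns non-zero if either line is missing.
format_master_addr() {
    read -r addr
    read -r port
    [ -n "$addr" ] && [ -n "$port" ] || return 1
    echo "$addr:$port"
}

# Live usage:
# docker exec "$(docker ps -q -f name=iklimco_redis-sentinel | head -1)" \
#   redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster | format_master_addr
```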
## Step 6 — RabbitMQ: 3-node Erlang cluster (prod overlay)
RabbitMQ runs as a 3-node cluster with one instance per app node.
### Prerequisites
```bash
# Create Docker secret for Erlang cookie (must be identical on all nodes):
openssl rand -hex 32 | docker secret create rabbitmq_erlang_cookie -
```
### docker-stack-infra.prod.yml — RabbitMQ override
```yaml
# docker-stack-infra.prod.yml (add alongside redis services)
services:
rabbitmq:
image: rabbitmq:3-management
hostname: "rabbitmq-{{.Node.Hostname}}"
environment:
RABBITMQ_ERLANG_COOKIE_FILE: /run/secrets/rabbitmq_erlang_cookie
RABBITMQ_USE_LONGNAME: "true"
RABBITMQ_NODENAME: "rabbit@rabbitmq-{{.Node.Hostname}}"
secrets:
- rabbitmq_erlang_cookie
deploy:
mode: replicated
replicas: 3
placement:
constraints:
- node.labels.type == service
update_config:
parallelism: 1
order: stop-first
labels:
project: co.iklim
secrets:
rabbitmq_erlang_cookie:
external: true
```
### Cluster join procedure (first setup)
RabbitMQ nodes do not form a cluster automatically; manual join is required after first start:
```bash
# Find the RabbitMQ container on iklim-app-02:
CTR=$(docker ps -q -f name=iklimco_rabbitmq)
# Stop, join, start:
docker exec "$CTR" rabbitmqctl stop_app
docker exec "$CTR" rabbitmqctl join_cluster rabbit@rabbitmq-iklim-app-01
docker exec "$CTR" rabbitmqctl start_app
# Repeat for iklim-app-03
```
```bash
# Verify cluster status (from any node):
docker exec "$CTR" rabbitmqctl cluster_status
```
> **HA policy:** After the cluster is formed, set quorum queues as the default:
> ```bash
> docker exec "$CTR" rabbitmqctl set_policy ha-all ".*" \
> '{"queue-type":"quorum"}' --apply-to queues
> ```
## Step 7 — Create `docker-stack-infra.prod.yml`
Create this file in the repo root alongside `docker-stack-infra.yml`. It combines all prod-specific overrides from Steps 2-6:
```yaml
# docker-stack-infra.prod.yml
# Prod overlay — deploy with:
# docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco
services:
vault:
environment:
VAULT_LOCAL_CONFIG: >-
{"api_addr":"https://vault.iklim.co:8200",
"cluster_addr":"https://{{ .Node.Hostname }}:8201",
"storage":{"raft":{"path":"/vault/file","node_id":"{{ .Node.Hostname }}"}},
"listener":[{"tcp":{"address":"0.0.0.0:8200",
"tls_cert_file":"/vault/certs/STAR.iklim.co.full.crt",
"tls_key_file":"/vault/certs/STAR.iklim.co_key.pem"}}],
"default_lease_ttl":"168h","max_lease_ttl":"720h","ui":true}
volumes:
- /opt/iklimco/vault/data:/vault/file
- /mnt/storagebox/ssl:/vault/certs:ro
deploy:
mode: replicated
replicas: 3
placement:
constraints:
- node.labels.type == service
apisix:
deploy:
mode: replicated
replicas: 3
placement:
constraints:
- node.labels.type == service
apisix-dashboard:
deploy:
mode: replicated
replicas: 3
placement:
constraints:
- node.labels.type == service
redis:
image: bitnamisecure/redis:latest
environment:
ALLOW_EMPTY_PASSWORD: "no"
REDIS_PASSWORD: ${REDIS_PASSWORD}
REDIS_REPLICATION_MODE: master
deploy:
mode: replicated
replicas: 1
placement:
constraints:
- node.hostname == iklim-app-01
restart_policy:
condition: any
delay: 5s
labels:
project: co.iklim
redis-replica:
image: bitnamisecure/redis:latest
environment:
ALLOW_EMPTY_PASSWORD: "no"
REDIS_REPLICATION_MODE: slave
REDIS_MASTER_HOST: redis
REDIS_MASTER_PORT_NUMBER: "6379"
REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD}
REDIS_PASSWORD: ${REDIS_PASSWORD}
deploy:
mode: replicated
replicas: 2
placement:
constraints:
- node.labels.type == service
preferences:
- spread: node.hostname
restart_policy:
condition: any
delay: 5s
labels:
project: co.iklim
redis-sentinel:
image: bitnamisecure/redis-sentinel:latest
environment:
REDIS_SENTINEL_MASTER_NAME: mymaster
REDIS_MASTER_HOST: redis
REDIS_MASTER_PORT_NUMBER: "6379"
REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD}
REDIS_SENTINEL_QUORUM: "2"
REDIS_SENTINEL_DOWN_AFTER_MILLISECONDS: "5000"
REDIS_SENTINEL_FAILOVER_TIMEOUT: "10000"
deploy:
mode: replicated
replicas: 3
placement:
constraints:
- node.labels.type == service
preferences:
- spread: node.hostname
restart_policy:
condition: any
delay: 5s
labels:
project: co.iklim
rabbitmq:
image: rabbitmq:3-management
hostname: "rabbitmq-{{.Node.Hostname}}"
environment:
RABBITMQ_ERLANG_COOKIE_FILE: /run/secrets/rabbitmq_erlang_cookie
RABBITMQ_USE_LONGNAME: "true"
RABBITMQ_NODENAME: "rabbit@rabbitmq-{{.Node.Hostname}}"
secrets:
- rabbitmq_erlang_cookie
deploy:
mode: replicated
replicas: 3
placement:
constraints:
- node.labels.type == service
update_config:
parallelism: 1
order: stop-first
labels:
project: co.iklim
secrets:
rabbitmq_erlang_cookie:
external: true
```
## Step 8 — Monitoring Data Persistence (StorageBox)
Prometheus and Grafana run as single instances. Without persistent storage, data is lost on node failover. This step mounts their data directories from the StorageBox shared filesystem.
**Changes already applied to `docker-stack-infra.yml`:**
```yaml
prometheus:
volumes:
- ${PROMETHEUS_DATA_DIR:-prometheus-vl}:/prometheus
grafana:
volumes:
- ${GRAFANA_DATA_DIR:-grafana-vl}:/var/lib/grafana
```
Test uses the named Docker volume fallbacks (`prometheus-vl`, `grafana-vl`) — no test env change needed.
**Add to `prod/secrets/iklim.co/.env.prod` on storagebox** (already in `env-prod/.env`):
```bash
PROMETHEUS_DATA_DIR=/mnt/storagebox/prometheus/data
GRAFANA_DATA_DIR=/mnt/storagebox/grafana/data
```
**Create directories on StorageBox before first prod deploy:**
```bash
mkdir -p /mnt/storagebox/prometheus/data /mnt/storagebox/grafana/data
```
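Since a bind mount fails when its source path is missing, it can help to confirm the directories before deploying. A minimal sketch; `check_data_dirs` is an illustrative name, and the check covers only existence and writability for the current user, not container UID ownership.

```bash
# Return non-zero if any given directory is missing or not writable.
check_data_dirs() {
    rc=0
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            echo "ok      $dir"
        else
            echo "missing $dir"
            rc=1
        fi
    done
    return "$rc"
}

# Usage on an app node before the first prod deploy:
# check_data_dirs /mnt/storagebox/prometheus/data /mnt/storagebox/grafana/data
```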
> Grafana writes its SQLite database and dashboard JSON to `/var/lib/grafana`.
> Prometheus writes its TSDB to `/prometheus`. Both directories must exist before the stack starts.
## Step 9 — Verify
```bash
# Base file must be valid on its own (test deploy):
docker stack config -c docker-stack-infra.yml > /dev/null && echo "base OK"
# Prod merge must be valid:
docker stack config -c docker-stack-infra.yml -c docker-stack-infra.prod.yml > /dev/null && echo "prod merge OK"
```
## Placement and Replica Summary — prod
| Service | File | Replicas | Placement | HA Note |
|---------|------|----------|-----------|---------|
| swag | base | 1 | `node.hostname == iklim-app-01` | No clustering support; Floating IP pinned to node |
| cert-reloader | base | 1 | `node.hostname == iklim-app-01` | Cron-style task; duplicate would be problematic |
| vault | prod overlay | 3 | `node.labels.type == service` | Raft cluster — see `07-vault-raft-plan.md` |
| apisix | prod overlay | 3 | `node.labels.type == service` | Stateless; config in Patroni etcd; rate limit policy:redis |
| apisix-dashboard | prod overlay | 3 | `node.labels.type == service` | Stateless; reads from etcd |
| redis (master) | prod overlay | 1 | `node.hostname == iklim-app-01` | Sentinel cluster master |
| redis-replica | prod overlay | 2 | `node.labels.type == service` | Sentinel replica; spread:hostname |
| redis-sentinel | prod overlay | 3 | `node.labels.type == service` | Quorum=2; failover automatic |
| rabbitmq | prod overlay | 3 | `node.labels.type == service` | Erlang cluster; quorum queues |
| etcd | base | 1 | `node.labels.type == service` | Idle in prod — APISIX uses Patroni etcd; standalone service remains in base stack |
| prometheus | base | 1 | `node.labels.type == service` | No native HA; Thanos is overkill at this scale |
| grafana | base | 1 | `node.labels.type == service` | Not critical |
> PostgreSQL and MongoDB run in separate DB stacks on `iklim-db-*` nodes. See `08-prod-db-cluster-kurulum.md`.
> etcd: 3-node cluster on DB nodes — APISIX shares it via `/apisix` prefix.


@ -1,7 +1,7 @@
# 04 — SWAG Nginx Proxy Configs (Prod)
## Context
Same template files as test (`swag/proxy-confs/*.conf.tpl`), different env vars.
Same template files as test (`swag/site-confs/*.conf.tpl`), different env vars.
The pipeline processes templates with prod-specific subdomain values.
## Required env vars (in `.env` on storagebox `prod/secrets/iklim.co/.env.prod`)
@ -14,20 +14,22 @@ GRAFANA_SUBDOMAIN=grafana.iklim.co
RESTRICTED_IP_1=78.187.87.109
RESTRICTED_IP_2=95.70.151.248
# SWAG storage paths — StorageBox so certs are accessible from any app node
# cert-reloader writes here; Vault reads from here on any manager node
SWAG_CERT_DIR=/mnt/storagebox/prod/ssl
# SWAG full config dir (includes letsencrypt state) — enables clean node failover
SWAG_CONFIG_DIR=/mnt/storagebox/prod/swag/config
# SWAG storage paths — StorageBox is mounted on all app nodes, shared filesystem
# cert-reloader writes here; Vault reads from this path on every node — no SSH distribution needed
SWAG_CERT_DIR=/mnt/storagebox/ssl
# SWAG config dirs on StorageBox — all three survive node failover without pipeline re-run
SWAG_CONFIG_DIR=/mnt/storagebox/swag/config
SWAG_DNS_CONF_DIR=/mnt/storagebox/swag/dns-conf
SWAG_SITE_CONFS_DIR=/mnt/storagebox/swag/site-confs
```
## Template files (already created in test step 04)
- `swag/site-confs/default.conf`
- `swag/proxy-confs/api.conf.tpl`
- `swag/proxy-confs/apigw.conf.tpl`
- `swag/proxy-confs/rabbitmq.conf.tpl`
- `swag/proxy-confs/grafana.conf.tpl`
- `swag/site-confs/api.conf.tpl`
- `swag/site-confs/apigw.conf.tpl`
- `swag/site-confs/rabbitmq.conf.tpl`
- `swag/site-confs/grafana.conf.tpl`
No new files to create — the same templates work for both environments.
@ -38,25 +40,25 @@ set -a; . ./.env; set +a
export RESTRICTED_IP_1="78.187.87.109"
export RESTRICTED_IP_2="95.70.151.248"
sudo mkdir -p /opt/iklimco/swag/proxy-confs /opt/iklimco/swag/site-confs
mkdir -p "$SWAG_DNS_CONF_DIR" "$SWAG_SITE_CONFS_DIR"
for tpl in swag/proxy-confs/*.conf.tpl; do
out="/opt/iklimco/swag/proxy-confs/$(basename "${tpl%.tpl}")"
for tpl in swag/site-confs/*.conf.tpl; do
out="$SWAG_SITE_CONFS_DIR/$(basename "${tpl%.tpl}")"
envsubst < "$tpl" | sudo tee "$out" > /dev/null
echo "✅ $out"
done
sudo cp swag/site-confs/default.conf /opt/iklimco/swag/site-confs/default.conf
sudo cp swag/site-confs/default.conf "$SWAG_SITE_CONFS_DIR/default.conf"
```
With `API_SUBDOMAIN=api.iklim.co`, the output file `/opt/iklimco/swag/proxy-confs/api.conf`
will contain `server_name api.iklim.co;` — correct for prod.
With `API_SUBDOMAIN=api.iklim.co`, the output file `$SWAG_SITE_CONFS_DIR/api.conf`
(`/mnt/storagebox/swag/site-confs/api.conf`) will contain `server_name api.iklim.co;` — correct for prod.
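To make the substitution concrete, here is a toy rendering of a single template line. This is a sketch under assumptions: the template line is an invented excerpt (the real `api.conf.tpl` is not shown here), and `sed` stands in for `envsubst` so the example runs without gettext installed.

```bash
# Assumed excerpt of swag/site-confs/api.conf.tpl.
template='server_name ${API_SUBDOMAIN};'
API_SUBDOMAIN=api.iklim.co

# sed stands in for envsubst: replace the literal ${API_SUBDOMAIN} placeholder.
render() {
    printf '%s\n' "$1" | sed "s|\${API_SUBDOMAIN}|$API_SUBDOMAIN|g"
}

render "$template"   # prints: server_name api.iklim.co;
```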
## Verification
After deploy, on iklim-app-01:
```bash
cat /opt/iklimco/swag/proxy-confs/api.conf | grep server_name
cat /mnt/storagebox/swag/site-confs/api.conf | grep server_name
```
Expected: `server_name api.iklim.co;`
@ -74,4 +76,4 @@ Expected: APISIX response with valid `*.iklim.co` cert.
- `Prometheus` is intentionally NOT exposed via SWAG. Access it via Grafana
(internal connection: `http://prometheus:9090`) or SSH tunnel.
- If additional restricted-access subdomains are needed in the future, create a new
`swag/proxy-confs/<name>.conf.tpl` following the same pattern.
`swag/site-confs/<name>.conf.tpl` following the same pattern.


@ -20,11 +20,23 @@ Changes made for test already apply to prod.
## Prod-specific note
APISIX runs with `replicas: 2` in prod. Both replicas receive the same configuration
from etcd — no additional steps needed beyond the single init run.
APISIX runs with `replicas: 3` in prod — this value is defined in the `docker-stack-infra.prod.yml` overlay (not in the base `docker-stack-infra.yml`). All replicas read the same configuration from Patroni etcd (`/apisix` prefix) — a single `init` run is sufficient.
The `init/apisix-core/init.sh` is called once (from the pipeline) and configures the
shared etcd state that all APISIX instances read from.
```bash
# Prod deploy:
docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco
```
`init/apisix-core/init.sh` is run once by the pipeline and writes the etcd state that all APISIX instances read.
## SWAG → APISIX load distribution
SWAG connects to APISIX via `proxy_pass http://apisix:9080;` — using the service name directly.
No additional upstream or load balancer configuration is needed on the SWAG side.
**How it works:** Docker Swarm resolves the `apisix` service name to a VIP (Virtual IP).
Swarm's internal IPVS load balancer automatically distributes incoming connections across the 3 replicas
in round-robin. SWAG is unaware of this mechanism; it happens transparently at the overlay network layer.
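To confirm that `apisix` is reached through a Swarm VIP rather than DNS round-robin, the service endpoint can be inspected. This is a minimal sketch; `service_has_vip` is a hypothetical helper name, and it assumes the service is attached to at least one overlay network so that `Endpoint.VirtualIPs` is populated.

```bash
# Hypothetical helper: succeeds if the Swarm service has at least one virtual IP.
# Returns 1 when no VIP is listed and 2 when the inspect call itself fails.
service_has_vip() {
  local vips
  vips=$(docker service inspect --format '{{json .Endpoint.VirtualIPs}}' "$1") || return 2
  case "$vips" in
    ''|null|'[]') return 1 ;;
    *) return 0 ;;
  esac
}

# Usage on a manager node:
#   service_has_vip iklimco_apisix && echo "VIP mode" || echo "no VIP found"
```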
## Verification

@ -1,41 +1,45 @@
# 06 — cert-reloader Sidecar Service (Prod)
## Context
Same service definition as test (see `test-env-setup/06-cert-reloader.md`).
Prod-specific consideration: Vault is single-instance on the manager node (same as SWAG),
so the cert copy to `/opt/iklimco/ssl/` works without cross-node distribution.
Service definition is identical to test (see `test-env-setup/06-cert-reloader.md`).
In prod, Vault runs as a 3-node Raft cluster; cert distribution is handled via the StorageBox shared mount — no SSH required.
When Vault is expanded to a 3-node Raft cluster (see `07-vault-raft-plan.md`), the
cert-reloader must be updated to distribute the cert to the other Vault nodes.
## Current behavior (single-Vault prod)
## Prod flow (3-node Vault Raft)
```
SWAG (manager) renews cert → swag-vl
cert-reloader (manager) detects change → copies to /opt/iklimco/ssl/ → reloads Vault
Vault (manager) reads /opt/iklimco/ssl/ → serves new cert
SWAG renews cert → writes to SWAG_CONFIG_DIR (/mnt/storagebox/swag/config)
cert-reloader detects MD5 change
→ copies to /mnt/storagebox/ssl/ (shared across all app nodes)
→ docker service update --force iklimco_vault
Vault (3 replicas) restarts
→ each instance has /mnt/storagebox/ssl/ mounted → reads the new cert
→ healthcheck checks sealed status every 30 seconds
→ if sealed: reads vault_unseal_key Docker secret and auto-unseals
```
No cross-node distribution needed.
No SSH distribution, additional secrets, or cert-reloader script changes are needed.
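The "detects MD5 change" step in the flow above can be sketched as a small function. `cert_changed` is a hypothetical name; the real cert-reloader performs the comparison inline in its loop.

```bash
# Hypothetical helper mirroring the cert-reloader check: returns 0 when the
# MD5 of FILE differs from LAST_HASH (or LAST_HASH is empty), 1 otherwise.
# A missing file counts as "not changed" so the caller simply waits.
cert_changed() {
  local file=$1 last_hash=$2 current
  [ -f "$file" ] || return 1
  current=$(md5sum "$file" | cut -d' ' -f1)
  [ "$current" != "$last_hash" ]
}

# Usage inside a polling loop:
#   if cert_changed "$CERT_DIR/fullchain.pem" "$LAST_HASH"; then
#     ... copy to /mnt/storagebox/ssl/ and force-update iklimco_vault ...
#   fi
```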
## Future behavior (3-node Vault Raft — see step 07)
## Auto-unseal mechanism
When Vault runs on iklim-app-01, iklim-app-02, iklim-app-03:
The Vault healthcheck is already implemented in `docker-stack-infra.yml`:
```
cert-reloader detects cert change
→ copies cert to /opt/iklimco/ssl/ on iklim-app-01 (local)
→ SSH copy to iklim-app-02:/opt/iklimco/ssl/
→ SSH copy to iklim-app-03:/opt/iklimco/ssl/
→ docker service update --force iklimco_vault (restarts all 3 replicas)
```yaml
healthcheck:
  test:
    - "CMD"
    - "sh"
    - "-c"
    - >-
      vault status -format=json 2>/dev/null | grep -q '"sealed":false' ||
      vault operator unseal $$(cat /run/secrets/vault_unseal_key 2>/dev/null)
  interval: 30s
  timeout: 10s
  start_period: 15s
  retries: 5
```
This requires:
- An SSH key that cert-reloader can use to reach iklim-app-02 and iklim-app-03
- That key mounted as a Docker secret into cert-reloader
- Known_hosts for iklim-app-02 and iklim-app-03 pre-configured
Script update for this phase is tracked in `07-vault-raft-plan.md`.
Each Vault container runs its own healthcheck independently — all 3 replicas unseal separately.
The cert renewal → restart → auto-unseal chain requires no manual intervention.
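When checking the seal state by hand, a whitespace-tolerant match is safer than an exact string, because the JSON output may be pretty-printed (`"sealed": false` with a space). The sketch below assumes that possibility; `status_is_unsealed` is a hypothetical helper name. As an alternative, `vault status` also signals the state through its exit code (0 unsealed, 2 sealed, 1 error).

```bash
# Hypothetical helper: reads `vault status -format=json` output on stdin and
# succeeds only if it reports "sealed": false, with or without whitespace.
status_is_unsealed() {
  grep -Eq '"sealed"[[:space:]]*:[[:space:]]*false'
}

# Usage inside a Vault container:
#   vault status -format=json 2>/dev/null | status_is_unsealed \
#     || echo "sealed or unreachable"
```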
## Verification
@ -54,4 +58,4 @@ docker exec $(docker ps -q -f name=iklimco_vault) \
| openssl x509 -noout -dates'
```
`notAfter` should match the cert in `/opt/iklimco/ssl/STAR.iklim.co.full.crt`.
`notAfter` should match the cert in `/mnt/storagebox/ssl/STAR.iklim.co.full.crt`.

@ -1,42 +1,28 @@
# 07 — Vault: Initial Single Instance + Raft Cluster Migration Plan (Prod)
# 07 — Vault: 3-Node Raft Cluster (Prod)
## Context
Vault starts as a single instance on the manager node (iklim-app-01) for the initial prod launch.
This matches the current `docker-stack-infra.yml` configuration (file storage, single replica).
Vault starts directly as a 3-node Raft cluster in prod. The single-instance phase used in test is skipped.
Raft HA cluster is planned for a later phase.
Test used a single Vault instance (file storage, 1 replica on the manager node). Prod goes straight to Raft HA.
## Phase 1 — Initial prod launch (current)
## Vault service configuration
- **Replicas:** 1
- **Storage:** file (`/vault/file`) on iklim-app-01
- **Placement:** `node.role == manager` (iklim-app-01)
- **Cert:** from `/opt/iklimco/ssl/` (populated by cert-reloader from SWAG volume)
- **TLS:** `VAULT_LOCAL_CONFIG` unchanged — `api_addr: https://vault.iklim.co:8200`
No changes to `docker-stack-infra.yml` vault service for Phase 1.
## Phase 2 — Vault Raft Cluster (future)
### What changes
- **Replicas:** 3 (one per service node)
- **Storage:** Raft integrated (replaces file storage)
- **Placement:** `node.labels.type == service` (all 3 service nodes)
- **Cert distribution:** cert-reloader SSH-copies renewed cert to iklim-app-02, iklim-app-03
- **Storage:** Raft integrated storage
- **Placement:** `node.labels.type == service` (all 3 app nodes)
- **Cert distribution:** No SSH needed — all nodes mount StorageBox, cert-reloader writes to `SWAG_CERT_DIR=/mnt/storagebox/ssl`, Vault reads from that path on every node
### Prerequisites
### Prerequisites before migration
- [ ] All 3 service nodes are running and labeled `type=service`
- [ ] Vault data backed up from Phase 1 (snapshot via `vault operator raft snapshot save`)
- [ ] SSH key created for cert-reloader to reach iklim-app-02 and iklim-app-03
- [ ] SSH key stored as Docker secret `cert_reloader_ssh_key`
- [ ] `/opt/iklimco/ssl/` directory exists on iklim-app-02 and iklim-app-03
- [ ] `/mnt/storagebox/ssl/` directory is mounted and accessible on all 3 app nodes
- [ ] Vault data directory `/opt/iklimco/vault/data/` exists on all 3 nodes (host path volumes)
### Vault service update for Raft
### Vault service YAML (docker-stack-infra.prod.yml overlay)
```yaml
vault:
# ... (image, secrets, healthcheck unchanged)
# ... (image, secrets, healthcheck unchanged from base)
environment:
VAULT_LOCAL_CONFIG: >-
{"api_addr":"https://vault.iklim.co:8200",
@ -44,11 +30,11 @@ vault:
"storage":{"raft":{"path":"/vault/file","node_id":"{{ .Node.Hostname }}"}},
"listener":[{"tcp":{"address":"0.0.0.0:8200",
"tls_cert_file":"/vault/certs/STAR.iklim.co.full.crt",
"tls_key_file":"/vault/certs/STAR.iklim.co_key.txt"}}],
"tls_key_file":"/vault/certs/STAR.iklim.co_key.pem"}}],
"default_lease_ttl":"168h","max_lease_ttl":"720h","ui":true}
volumes:
- /opt/iklimco/vault/data:/vault/file # host path per node
- /opt/iklimco/ssl:/vault/certs:ro
- /mnt/storagebox/ssl:/vault/certs:ro # StorageBox — shared across all nodes, no SSH distribution needed
deploy:
mode: replicated
replicas: 3
@ -60,44 +46,73 @@ vault:
> `{{ .Node.Hostname }}` is Docker Swarm's Go template for the node hostname —
> gives each Vault instance a unique `node_id`.
### Raft join procedure (after deploying 3-replica Vault)
## Raft initialization procedure (first deploy)
Only the leader needs to be bootstrapped; others join via `vault operator raft join`:
### Step 1 — Deploy the stack
```bash
# On the primary Vault (iklim-app-01 container):
VAULT_CTR=$(docker ps -q -f name=iklimco_vault)
# Unseal if needed
docker exec -it "$VAULT_CTR" vault operator unseal
# Check Raft peers
docker exec "$VAULT_CTR" vault operator raft list-peers
docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco
```
All 3 Vault containers start. Only the first one to initialize becomes the leader.
### Step 2 — Initialize Vault on the leader (iklim-app-01)
```bash
VAULT_CTR=$(docker ps -q -f name=iklimco_vault)
docker exec -it "$VAULT_CTR" vault operator init
```
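Save the unseal keys and root token securely. The healthcheck supplies a single key from the `vault_unseal_key` secret, so auto-unseal only works when one key share meets the threshold (for example, `vault operator init -key-shares=1 -key-threshold=1`); with the default of 5 shares and a threshold of 3, one key cannot unseal Vault. Store the unseal key as a Docker secret: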
```bash
echo -n "<unseal-key>" | docker secret create vault_unseal_key -
```
### Step 3 — Unseal the leader
```bash
docker exec -it "$VAULT_CTR" vault operator unseal
```
The healthcheck auto-unseals on subsequent restarts via the `vault_unseal_key` secret.
### Step 4 — Join remaining nodes to the Raft cluster
On the Vault containers running on iklim-app-02 and iklim-app-03:
```bash
docker exec -it <vault-on-iklim-app-02> vault operator raft join \
https://vault.iklim.co:8200
docker exec -it <vault-on-iklim-app-03> vault operator raft join \
https://vault.iklim.co:8200
```
### cert-reloader update for Raft
Update the cert-reloader command in `docker-stack-infra.yml` to SSH-copy the cert
to iklim-app-02 and iklim-app-03 after renewal:
Unseal each node after joining:
```bash
# After copying to local /opt/iklimco/ssl/:
ssh -i /run/secrets/cert_reloader_ssh_key iklim-app-02 \
"cp /dev/stdin /opt/iklimco/ssl/STAR.iklim.co.full.crt" < /opt/iklimco/ssl/STAR.iklim.co.full.crt
# (repeat for iklim-app-03 and privkey)
docker service update --force iklimco_vault
docker exec -it <vault-on-iklim-app-02> vault operator unseal
docker exec -it <vault-on-iklim-app-03> vault operator unseal
```
Add Docker secret to cert-reloader:
```yaml
secrets:
- cert_reloader_ssh_key
### Step 5 — Verify cluster
```bash
docker exec "$VAULT_CTR" vault operator raft list-peers
```
Expected: 3 peers, one `leader`, two `follower`.
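The expected result can also be checked by a script. This is a sketch that assumes the usual table layout of `vault operator raft list-peers` (columns `Node`, `Address`, `State`, `Voter`, with the state in column 3); `raft_cluster_healthy` is a hypothetical helper name.

```bash
# Hypothetical helper: reads `vault operator raft list-peers` output on stdin
# and succeeds only with exactly 1 leader and 2 followers.
# Assumes the peer state is the third whitespace-separated column.
raft_cluster_healthy() {
  awk '
    $3 == "leader"   { leaders++ }
    $3 == "follower" { followers++ }
    END { exit !(leaders == 1 && followers == 2) }
  '
}

# Usage:
#   docker exec "$VAULT_CTR" vault operator raft list-peers | raft_cluster_healthy \
#     && echo "Raft cluster OK"
```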
## cert-reloader — no additional changes needed for Raft
cert-reloader writes the cert to `SWAG_CERT_DIR=/mnt/storagebox/ssl`.
Since StorageBox is mounted on all app nodes, every Vault instance already sees the same path.
The cert renewal flow works unchanged with Raft:
```
cert changed → copy to /mnt/storagebox/ssl/ → docker service update --force iklimco_vault
Vault (3 replicas) restart → each auto-unseals via healthcheck
```
## Reference

@ -14,13 +14,13 @@
```yaml
# DELETE from "Initialize Servers" step:
scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebox.de:prod/app/iklim.co/ssl/STAR.iklim.co.full.crt ./STAR.iklim.co.full.crt
scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebox.de:prod/app/iklim.co/ssl/STAR.iklim.co_key.txt ./STAR.iklim.co_key.txt
scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebox.de:prod/app/iklim.co/ssl/STAR.iklim.co_key.pem ./STAR.iklim.co_key.pem
```
Also remove from `Prepare Init Files`:
```yaml
# DELETE or make conditional:
sudo cp STAR.iklim.co.full.crt STAR.iklim.co_key.txt /opt/iklimco/ssl/
sudo cp STAR.iklim.co.full.crt STAR.iklim.co_key.pem /opt/iklimco/ssl/
```
## Step 2 — Add `Prepare SWAG Directories` step
@ -32,27 +32,26 @@ Insert **before** `Bootstrap Vault TLS Placeholder`:
run: |
set -a; . ./.env; . ./.env.secrets.swag; set +a
docker run --rm -v /opt/iklimco/swag:/output alpine \
mkdir -p /output/dns-conf /output/proxy-confs /output/site-confs
mkdir -p "$SWAG_CONFIG_DIR" "$SWAG_DNS_CONF_DIR" "$SWAG_SITE_CONFS_DIR"
envsubst < swag/dns-conf/godaddy.ini.tpl | docker run --rm -i \
-v /opt/iklimco/swag/dns-conf:/output \
-v "${SWAG_DNS_CONF_DIR}:/output" \
alpine sh -c "cat > /output/godaddy.ini && chmod 600 /output/godaddy.ini"
echo "✅ godaddy.ini written"
export RESTRICTED_IPS_BLOCK="$(echo "$RESTRICTED_IPS" | tr ',' '\n' | sed 's|.*| allow &;|')"
SWAG_VARS='${API_SUBDOMAIN}${APIGW_SUBDOMAIN}${GRAFANA_SUBDOMAIN}${RABBITMQ_SUBDOMAIN}${RESTRICTED_IPS_BLOCK}'
for tpl in swag/proxy-confs/*.conf.tpl; do
for tpl in swag/site-confs/*.conf.tpl; do
fname=$(basename "${tpl%.tpl}")
envsubst "$SWAG_VARS" < "$tpl" | docker run --rm -i \
-v /opt/iklimco/swag/site-confs:/output \
-v "${SWAG_SITE_CONFS_DIR}:/output" \
alpine sh -c "cat > /output/${fname}"
echo "✅ ${fname}"
done
cat swag/site-confs/default.conf | docker run --rm -i \
-v /opt/iklimco/swag/site-confs:/output \
-v "${SWAG_SITE_CONFS_DIR}:/output" \
alpine sh -c "cat > /output/default.conf"
echo "✅ SWAG directories ready"
@ -89,6 +88,8 @@ APISIX reads its entire configuration from etcd; init script will fail silently
done
```
> **Note:** In prod, the standalone `etcd` service from `docker-stack-infra.yml` still runs (Docker Compose overlay files cannot remove services). APISIX currently uses this etcd; the Patroni etcd migration happens via `docker-stack-infra.prod.yml`. The `http://etcd:2379/health` check targets this standalone service and is correct for the current setup.
## Step 4 — Add `Run APISIX Init` step
Insert **after** `Wait for etcd` and **before** `Bootstrap SWAG Certificate`.
@ -112,7 +113,7 @@ Insert **after** `Wait for etcd` and **before** `Bootstrap SWAG Certificate`.
> **Prod-specific:** `SPRING_PROFILES_ACTIVE=prod` — test pipeline uses `test`.
> `APISIX_ADMIN_KEY` is sourced from `.env.secrets.shared`.
> The init script is idempotent (PUT semantics); safe to re-run on subsequent deploys.
> With `replicas: 2` in prod, both APISIX instances read the same etcd state — no per-replica init needed.
> With `replicas: 3` in prod, all APISIX instances read the same etcd state — no per-replica init needed.
## Step 5 — Add `Bootstrap SWAG Certificate` step
@ -121,6 +122,7 @@ Insert **after** `Run APISIX Init`:
```yaml
- name: Bootstrap SWAG Certificate
run: |
set -a; . ./.env; set +a
echo "Waiting for SWAG container to start..."
SWAG_CTR=""
for i in $(seq 1 24); do
@ -152,12 +154,12 @@ Insert **after** `Run APISIX Init`:
fi
docker exec "$SWAG_CTR" cat "$CERT_PATH" | \
docker run --rm -i -v /opt/iklimco/ssl:/output alpine \
docker run --rm -i -v "${SWAG_CERT_DIR}:/output" alpine \
sh -c "cat > /output/STAR.iklim.co.full.crt && chmod 644 /output/STAR.iklim.co.full.crt"
docker exec "$SWAG_CTR" cat "/config/etc/letsencrypt/live/iklim.co/privkey.pem" | \
docker run --rm -i -v /opt/iklimco/ssl:/output alpine \
sh -c "cat > /output/STAR.iklim.co_key.txt && chmod 644 /output/STAR.iklim.co_key.txt"
echo "✅ Cert bootstrapped to /opt/iklimco/ssl/"
docker run --rm -i -v "${SWAG_CERT_DIR}:/output" alpine \
sh -c "cat > /output/STAR.iklim.co_key.pem && chmod 644 /output/STAR.iklim.co_key.pem"
echo "✅ Cert bootstrapped to ${SWAG_CERT_DIR}/"
working-directory: /workspace/iklim.co
```
@ -201,7 +203,7 @@ Insert **after** `Bootstrap SWAG Certificate` and **before** `Review Environment
working-directory: /workspace/iklim.co
```
> **Prod-specific:** DB hostnames are `iklimco_postgresql` ve `iklimco_mongodb` (Swarm VIP service names).
> **Prod-specific:** DB hostnames are `iklimco_postgresql` and `iklimco_mongodb` (Swarm VIP service names).
> Test pipeline uses `postgresql` / `mongodb` (unqualified aliases within the same stack).
> SQL and JS files are generated by `Prepare Init Files` step via `init_postgresql` / `init_mongodb` functions in `common-functions.sh`.
> Step is idempotent — scripts use `CREATE IF NOT EXISTS` / `createCollection` semantics.

@ -109,7 +109,7 @@ All tasks should show node names matching `iklim-db-01`, `iklim-db-02`, or `ikli
```bash
docker service ps iklimco_apisix
```
Expected: 2 tasks, both `Running`, on different nodes.
Expected: 3 tasks, all `Running`, on different nodes.
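The same expectation can be asserted from the formatted task list. This is a sketch; `tasks_spread_ok` is a hypothetical helper name, and it assumes each input line starts with the node name followed by the current state, as produced by the `--format` string in the usage comment.

```bash
# Hypothetical helper: reads "<node> <current state...>" lines on stdin and
# succeeds only if EXPECTED tasks are Running, each on a different node.
tasks_spread_ok() {
  local expected=$1
  awk -v want="$expected" '
    $2 == "Running" {
      running++
      if (!($1 in seen)) { seen[$1] = 1; nodes++ }
    }
    END { exit !(running == want && nodes == want) }
  '
}

# Usage on a manager node:
#   docker service ps --filter desired-state=running \
#     --format '{{.Node}} {{.CurrentState}}' iklimco_apisix | tasks_spread_ok 3
```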
## 10 — fail2ban active

@ -47,7 +47,6 @@ Add after the `apisix-dashboard` service block:
volumes:
- swag-vl:/config
- /opt/iklimco/swag/dns-conf:/config/dns-conf:ro
- /opt/iklimco/swag/proxy-confs:/config/nginx/proxy-confs:ro
- /opt/iklimco/swag/site-confs:/config/nginx/site-confs:ro
ports:
- target: 80
@ -90,18 +89,18 @@ Add after the `swag` service block:
LAST_HASH=""
echo "[cert-reloader] started"
while true; do
  sleep 3600
  if [ -f "$$CERT_DIR/fullchain.pem" ]; then
    CURR=$$(md5sum "$$CERT_DIR/fullchain.pem" | cut -d' ' -f1)
    if [ "$$CURR" != "$$LAST_HASH" ]; then
      echo "[cert-reloader] cert changed — copying and reloading Vault"
      cp "$$CERT_DIR/fullchain.pem" "$$HOST_DIR/STAR.iklim.co.full.crt"
      cp "$$CERT_DIR/privkey.pem" "$$HOST_DIR/STAR.iklim.co_key.txt"
      cp "$$CERT_DIR/privkey.pem" "$$HOST_DIR/STAR.iklim.co_key.pem"
      docker service update --force iklimco_vault
      LAST_HASH="$$CURR"
      echo "[cert-reloader] done"
    fi
  fi
  sleep 3600
done
deploy:
mode: replicated

@ -1,9 +1,7 @@
# 04 — SWAG Nginx Proxy Configs (Test)
## Context
SWAG reads nginx configs from bind-mounted directories:
- `/config/nginx/proxy-confs/` ← `swag/proxy-confs/` in repo, deployed to `/opt/iklimco/swag/proxy-confs/`
- `/config/nginx/site-confs/` ← `swag/site-confs/` in repo, deployed to `/opt/iklimco/swag/site-confs/`
SWAG nginx auto-includes only `site-confs/*.conf`. All proxy config templates live in `swag/site-confs/` in the repo and are rendered to `/opt/iklimco/swag/site-confs/` on the host at deploy time.
Templates use `${VAR}` placeholders processed with `envsubst` at deploy time.
@ -40,7 +38,7 @@ server {
}
```
### `swag/proxy-confs/api.conf.tpl`
### `swag/site-confs/api.conf.tpl`
Public API gateway — no IP restriction.
```nginx
@ -65,7 +63,7 @@ server {
}
```
### `swag/proxy-confs/apigw.conf.tpl`
### `swag/site-confs/apigw.conf.tpl`
APISIX Dashboard — IP restricted.
```nginx
@ -94,7 +92,7 @@ server {
}
```
### `swag/proxy-confs/rabbitmq.conf.tpl`
### `swag/site-confs/rabbitmq.conf.tpl`
RabbitMQ Management UI — IP restricted.
```nginx
@ -123,7 +121,7 @@ server {
}
```
### `swag/proxy-confs/grafana.conf.tpl`
### `swag/site-confs/grafana.conf.tpl`
Grafana — IP restricted.
```nginx
@ -156,14 +154,14 @@ server {
```bash
# Process templates and write to host
mkdir -p /opt/iklimco/swag/proxy-confs /opt/iklimco/swag/site-confs
mkdir -p /opt/iklimco/swag/site-confs
set -a; . ./.env; set +a
export RESTRICTED_IP_1="78.187.87.109"
export RESTRICTED_IP_2="95.70.151.248"
for tpl in swag/proxy-confs/*.conf.tpl; do
  out="/opt/iklimco/swag/proxy-confs/$(basename "${tpl%.tpl}")"
for tpl in swag/site-confs/*.conf.tpl; do
  out="/opt/iklimco/swag/site-confs/$(basename "${tpl%.tpl}")"
  envsubst < "$tpl" > "$out"
  echo "✅ $out"
done

@ -13,10 +13,10 @@ Locate and **delete** this entire block:
```bash
# DELETE THIS BLOCK:
if [[ "$PROFILE" == "test" || "$PROFILE" == "prod" ]]; then
if [[ -f "STAR.iklim.co.full.crt" && -f "STAR.iklim.co_key.txt" ]]; then
if [[ -f "STAR.iklim.co.full.crt" && -f "STAR.iklim.co_key.pem" ]]; then
call_api "ssl iklim.co" -X PUT "$APISIX_ADMIN_URL/ssls/1" \
-H "X-API-KEY: $API_KEY" -H "Content-Type: application/json" \
-d '{"cert":"'"$(cat STAR.iklim.co.full.crt)"'","key":"'"$(cat STAR.iklim.co_key.txt)"'","snis":["*.iklim.co"]}'
-d '{"cert":"'"$(cat STAR.iklim.co.full.crt)"'","key":"'"$(cat STAR.iklim.co_key.pem)"'","snis":["*.iklim.co"]}'
else
echo "iklim.co ssl certificates not found!"
fi

@ -56,7 +56,7 @@ CERT="$SWAG_VOL/etc/letsencrypt/live/iklim.co/fullchain.pem"
if [ -f "$CERT" ]; then
cp "$CERT" /opt/iklimco/ssl/STAR.iklim.co.full.crt
KEYF="$SWAG_VOL/etc/letsencrypt/live/iklim.co/privkey.pem"
cp "$KEYF" /opt/iklimco/ssl/STAR.iklim.co_key.txt
cp "$KEYF" /opt/iklimco/ssl/STAR.iklim.co_key.pem
docker service update --force iklimco_vault
echo "✅ Manual reload triggered"
else

View File

@ -4,7 +4,7 @@
- **File:** `.gitea/workflows/deploy-test.yml`
- Changes:
1. Remove manual `scp STAR.iklim.co.full.crt` steps (SWAG now owns cert lifecycle).
2. Add SWAG host directories preparation (dns-conf, nginx proxy-confs).
2. Add SWAG host directories preparation (dns-conf, nginx site-confs).
3. Add cert bootstrap step: on first deploy, wait for SWAG to obtain cert, then copy
to `/opt/iklimco/ssl/` so Vault can start.
4. Ensure `GODADDY_KEY` and `GODADDY_SECRET` are available from `.env.secrets.swag`.
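The "wait for SWAG to obtain cert" change in item 3 is a bounded retry loop. A generic sketch of that pattern follows; `wait_until` is a hypothetical helper name, and the attempt count and delay in the usage comment are assumptions, not values confirmed by the pipeline.

```bash
# Hypothetical helper: runs CMD up to ATTEMPTS times, sleeping DELAY seconds
# after each failed try; returns 0 on the first success, 1 if all attempts fail.
wait_until() {
  local attempts=$1 delay=$2 i
  shift 2
  for i in $(seq 1 "$attempts"); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# Usage (key path as used by the bootstrap step; 24 x 10s is an assumed budget):
#   wait_until 24 10 docker exec "$SWAG_CTR" \
#     test -f /config/etc/letsencrypt/live/iklim.co/privkey.pem
```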
@ -16,15 +16,15 @@
```yaml
# DELETE these two lines from the "Initialize Servers" step:
scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebox.de:test/app/iklim.co/ssl/STAR.iklim.co.full.crt ./STAR.iklim.co.full.crt
scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebox.de:test/app/iklim.co/ssl/STAR.iklim.co_key.txt ./STAR.iklim.co_key.txt
scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebox.de:test/app/iklim.co/ssl/STAR.iklim.co_key.pem ./STAR.iklim.co_key.pem
```
Also remove any references to `STAR.iklim.co.full.crt` and `STAR.iklim.co_key.txt` in
Also remove any references to `STAR.iklim.co.full.crt` and `STAR.iklim.co_key.pem` in
the `Prepare Init Files` step's `sudo cp` commands:
```yaml
# DELETE or make conditional:
sudo cp STAR.iklim.co.full.crt STAR.iklim.co_key.txt /opt/iklimco/ssl/ 2>/dev/null || true
sudo cp STAR.iklim.co.full.crt STAR.iklim.co_key.pem /opt/iklimco/ssl/ 2>/dev/null || true
```
## Step 2 — Add `Prepare SWAG Directories` step
@ -42,14 +42,14 @@ Insert this step **before** `Deploy Swarm Stack`:
sudo chmod 600 /opt/iklimco/swag/dns-conf/godaddy.ini
echo "✅ godaddy.ini written"
# Nginx proxy conf files
sudo mkdir -p /opt/iklimco/swag/proxy-confs /opt/iklimco/swag/site-confs
# Nginx site conf files
sudo mkdir -p /opt/iklimco/swag/site-confs
export RESTRICTED_IP_1="78.187.87.109"
export RESTRICTED_IP_2="95.70.151.248"
for tpl in swag/proxy-confs/*.conf.tpl; do
out="/opt/iklimco/swag/proxy-confs/$(basename "${tpl%.tpl}")"
for tpl in swag/site-confs/*.conf.tpl; do
out="/opt/iklimco/swag/site-confs/$(basename "${tpl%.tpl}")"
envsubst < "$tpl" | sudo tee "$out" > /dev/null
echo "✅ $out"
done
@ -105,7 +105,7 @@ Vault being accessible (e.g., `Provision Vault AppRole IDs`):
docker exec "$SWAG_CTR" cat "$CERT_PATH" | \
sudo tee /opt/iklimco/ssl/STAR.iklim.co.full.crt > /dev/null
docker exec "$SWAG_CTR" cat "/config/etc/letsencrypt/live/iklim.co/privkey.pem" | \
sudo tee /opt/iklimco/ssl/STAR.iklim.co_key.txt > /dev/null
sudo tee /opt/iklimco/ssl/STAR.iklim.co_key.pem > /dev/null
echo "✅ Cert bootstrapped to /opt/iklimco/ssl/"
working-directory: /workspace/iklim.co
```

@ -119,7 +119,7 @@ Expected: `[cert-reloader] started` — no errors.
VAULT_CTR=$(docker ps -q -f name=iklimco_vault)
docker exec "$VAULT_CTR" ls /vault/certs/
```
Expected: `STAR.iklim.co.full.crt` and `STAR.iklim.co_key.txt`.
Expected: `STAR.iklim.co.full.crt` and `STAR.iklim.co_key.pem`.
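The same check can be run against a directory on the host. This is a sketch; `vault_certs_present` is a hypothetical helper name, and the file names are the ones expected above.

```bash
# Hypothetical helper: succeeds only if both files that Vault's TLS listener
# expects are present and non-empty in DIR.
vault_certs_present() {
  local dir=$1 f
  for f in STAR.iklim.co.full.crt STAR.iklim.co_key.pem; do
    [ -s "$dir/$f" ] || { echo "missing or empty: $dir/$f" >&2; return 1; }
  done
}

# Usage on the test host:
#   vault_certs_present /opt/iklimco/ssl && echo "certs present"
```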
## 10 — fail2ban is active (SWAG)

@ -21,7 +21,7 @@ shows which of the Terraform/Ansible setup stages handles each item
| `act_runner` systemd installation | **Ansible `05-test-runner-ve-deploy-onkosullari.md`** → `act_runner` role (`test-app-post-stack.yml`) |
| Uploading GoDaddy credentials to the StorageBox | **Stays manual** → secret management, outside Terraform/Ansible |
| `docker-stack-infra.yml` port removal + adding SWAG/cert-reloader | **Pipeline `deploy-test.yml`** + **repo change** → `roadmap/test-env/03` |
| SWAG nginx proxy confs (`swag/proxy-confs/*.conf.tpl`) | **Delivered in the repo** → `roadmap/test-env/04` |
| SWAG nginx proxy confs (`swag/site-confs/*.conf.tpl`) | **Delivered in the repo** → `roadmap/test-env/04` |
| Removing the APISIX SSL cert upload block (`init/apisix-core/init.sh`) | **Repo change** → `roadmap/test-env/05` |
| cert-reloader sidecar service | **Added to `docker-stack-infra.yml`** → `roadmap/test-env/06` |
| Pipeline update: Prepare SWAG Dirs + Bootstrap SWAG Cert + Run DB Init | **`deploy-test.yml`** → `roadmap/test-env/07` |
@ -49,7 +49,7 @@ shows which of the Terraform/Ansible setup stages handles each item
| 3× `act_runner` systemd (HA runner) | **Ansible `09-prod-runner-ha-ve-swarm.md`** → `act_runner` role |
| Uploading GoDaddy credentials to the StorageBox | **Stays manual** → secret management, outside Terraform/Ansible |
| `docker-stack-infra.yml` port removal + adding SWAG/cert-reloader | **Repo change** → `roadmap/prod-env/03` |
| SWAG nginx proxy confs (`swag/proxy-confs/*.conf.tpl`) | **Delivered in the repo** → `roadmap/prod-env/04` |
| SWAG nginx proxy confs (`swag/site-confs/*.conf.tpl`) | **Delivered in the repo** → `roadmap/prod-env/04` |
| Removing the APISIX SSL cert upload block (`init/apisix-core/init.sh`) | **Repo change** → `roadmap/prod-env/05` |
| cert-reloader sidecar service | **Added to `docker-stack-infra.yml`** → `roadmap/prod-env/06` |
| Vault Raft Cluster migration plan | **Manual / Later phase** → `roadmap/prod-env/07` |