# 03 — docker-stack-infra.yml Changes (Prod)

## Context

### File strategy — overlay approach

Prod-specific service changes are **not written directly** into `docker-stack-infra.yml`; they are kept in a separate overlay file:

| File | Usage |
|------|-------|
| `docker-stack-infra.yml` | Base: works as-is for test |
| `docker-stack-infra.prod.yml` | Prod overlay: additional services and overrides |

```bash
# Test deploy:
docker stack deploy -c docker-stack-infra.yml iklimco

# Prod deploy (Swarm merges both files):
docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco
```

Docker Swarm merge rule: if the same service name appears in both files, the overlay wins (deploy, environment, etc.); services present only in the overlay are added.

### Prod-specific changes summary

- APISIX: 1 → 3 replicas (overlay override)
- Redis: single-instance → Sentinel cluster with 1 master + 2 replicas + 3 sentinels (overlay adds new services)
- RabbitMQ: 1 → 3-node Erlang cluster (overlay override + env)
- Vault: 1 → 3-node Raft cluster (overlay override); see `07-vault-raft-plan.md`
- No separate APISIX etcd: Patroni etcd is shared (`/apisix` prefix)
- `init/apisix-core/init.sh`: when `PROFILE=prod`, rate limit `policy:local` → `policy:redis`

### swag-vl volume — not used in prod, not defined in overlay

Test-env Step 9 adds the `swag-vl` named volume to the base file. In prod, SWAG mounts the StorageBox via the `${SWAG_CONFIG_DIR}` env var, so no service uses this volume. There is no need to remove it in the overlay: Swarm does not create volumes that no service references, so the definition is harmless. `docker-stack-infra.prod.yml` does not define `swag-vl`.

### Monitoring Persistence

Prometheus and Grafana run as single instances, but their storage profiles are different:

- **Prometheus:** keep TSDB on a local Docker volume (`prometheus-vl`). Prometheus local storage should not run on StorageBox/DAVFS because of filesystem semantics and WAL/compaction I/O.
- **Grafana:** keep `/var/lib/grafana` on StorageBox (`/mnt/storagebox/grafana/data`) so dashboards, plugins, and the SQLite database are available if the single active instance is manually moved to another node.

Grafana uses the `GRAFANA_DATA_DIR` env var with a named-volume fallback for test. Prometheus continues to use the named Docker volume. See Step 9 for implementation details.

**Note:** In prod, PostgreSQL and MongoDB do not run from `docker-stack-infra.yml`; the base definitions are disabled with `replicas: 0` in the overlay (see Step 8). They run in separate stacks on DB nodes (`iklim-db` and `iklim-patroni`). See `08-prod-db-cluster-kurulum.md`.

## Step 1 — Apply all test-env changes first

Follow every step in `test-env/03-infra-stack-changes.md`:

- Add `swag` service
- Add `cert-reloader` service
- Remove published ports for vault, apisix, rabbitmq, prometheus, grafana, apisix-dashboard
- Add `swag-vl` volume

## Step 2 — Vault: 3-node Raft cluster (prod)

Vault starts directly with 3 replicas; the Phase 1 single-instance stage is skipped in prod. See `07-vault-raft-plan.md` Phase 2 for detailed setup steps.

```yaml
vault:
  deploy:
    mode: replicated
    replicas: 3
    placement:
      max_replicas_per_node: 1
      constraints:
        - node.labels.type == service
```

## Step 3 — APISIX: 3 replicas + init.sh rate limit update (prod overlay)

Add to `docker-stack-infra.prod.yml`:

```yaml
# docker-stack-infra.prod.yml
services:
  apisix:
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service

  apisix-dashboard:
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
```

APISIX and apisix-dashboard are stateless (config lives in Patroni etcd), so 3 replicas is safe. Swarm distributes SWAG requests to APISIX replicas via VIP (IPVS round-robin).

### init.sh — rate limit policy:redis (prod)

With `policy:local`, each APISIX instance counts independently, so with 3 replicas the global limit effectively becomes 3×. Switch to `policy:redis` for `PROFILE=prod`.
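To make the 3× effect concrete, the sketch below computes the worst-case number of requests a single client can get through per time window. `effective_limit` is a hypothetical helper for illustration only; it is not part of `init.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of init.sh: worst-case requests per window
# that one client can get through, depending on the limit-count policy.
effective_limit() {
  local count=$1 replicas=$2 policy=$3
  if [[ "$policy" == "local" ]]; then
    # Each replica keeps its own counter, so the per-replica limits add up.
    echo $(( count * replicas ))
  else
    # policy=redis: all replicas share a single counter.
    echo "$count"
  fi
}

effective_limit 60 3 local   # prints 180
effective_limit 60 3 redis   # prints 60
```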
Keep the following APISIX plugin limits in `init/apisix-core/init.sh` for `test/prod` unless stated otherwise:

| Scope | Plugin | Target limit |
|-------|--------|--------------|
| WebSocket `/ws` | `limit-conn` | `conn: 5` per `remote_addr` |
| Auth routes `/v1/auth/*`, `/v1/users/*` | `limit-count` | `count: 12`, `time_window: 60` per `remote_addr` |
| Global rule | `limit-count` | `count: 60`, `time_window: 60` per `remote_addr` |

Update the rate limit and connection limit blocks in `init/apisix-core/init.sh`.

**1. Define threshold constants at the script header:**

```bash
GLOBAL_LIMIT_COUNT=60
GLOBAL_LIMIT_WINDOW=60
AUTH_LIMIT_COUNT=12
AUTH_LIMIT_WINDOW=60
WS_LIMIT_CONN=5
```

**2. Update WebSocket route plugins (test/prod):**

```bash
if [[ "$PROFILE" != "dev" ]]; then
  WS_PLUGINS=',"plugins":{"limit-conn":{"conn":'"$WS_LIMIT_CONN"',"burst":2,"default_conn_delay":0.1,"key":"remote_addr","key_type":"var","rejected_code":429}}'
else
  WS_PLUGINS=""
fi
```

**3. Update Auth route plugins (test/prod):**

```bash
if [[ "$PROFILE" != "dev" ]]; then
  AUTH_LIMIT=',"plugins":{"limit-count":{"count":'"$AUTH_LIMIT_COUNT"',"time_window":'"$AUTH_LIMIT_WINDOW"',"key_type":"var","key":"remote_addr","rejected_code":429,"policy":"local"}}'
else
  AUTH_LIMIT=""
fi
```

**4. Update Global rate limit rule (test/prod):**

```bash
if [[ "$PROFILE" != "dev" ]]; then
  # prod: shared counter in Redis; test: per-instance counter
  if [[ "$PROFILE" == "prod" ]]; then
    RATE_POLICY="redis"
    RATE_REDIS=',"redis_host":"redis","redis_port":6379,"redis_password":"'"$REDIS_PASSWORD"'"'
  else
    RATE_POLICY="local"
    RATE_REDIS=""
  fi
  call_api "global rate limit" -X PUT "$APISIX_ADMIN_URL/global_rules/1" \
    -H "X-API-KEY: $API_KEY" -H "Content-Type: application/json" \
    -d '{"plugins":{"limit-count":{"count":'"$GLOBAL_LIMIT_COUNT"',"time_window":'"$GLOBAL_LIMIT_WINDOW"',"key_type":"var","key":"remote_addr","rejected_code":429,"policy":"'"$RATE_POLICY"'","allow_degradation":true'"$RATE_REDIS"'}}}'
fi
```

> APISIX's `limit-count` plugin does not natively support Redis Sentinel; `policy:redis` works with a single endpoint.
> The `redis` service name stays constant within Swarm overlay DNS. `allow_degradation: true` ensures that if Redis is
> temporarily unreachable (e.g. Sentinel failover of ~10-30 s, or master rescheduling), APISIX passes requests through
> instead of returning errors: rate limiting is briefly suspended but API access is unaffected.
> Microservices use Spring Data Redis Sentinel natively and are unaffected by master changes.
> Docker Swarm has no inter-service anti-affinity; the `redis` master placement relies on Swarm's spread strategy
> to avoid co-locating with a replica. This is a known limitation, accepted in favour of operational simplicity.

## Step 4 — etcd: Separate APISIX etcd removed — Patroni etcd shared

The standalone `etcd` service in `docker-stack-infra.yml` is **not used in prod and must be disabled** by setting `replicas: 0` in the prod overlay. APISIX uses the 3-node Patroni etcd cluster running on DB nodes, via the `/apisix` prefix.

### Why consolidated?

- A standalone single-instance etcd was a SPOF for APISIX.
- Patroni etcd is already 3-node HA, so APISIX gets a more reliable config store.
- etcd supports prefix-based namespacing; Patroni uses `/service/`, APISIX uses `/apisix/`, so there is no collision.

### APISIX etcd connection configuration

Update the etcd endpoints in the APISIX service in `docker-stack-infra.yml` to point to DB nodes:

```yaml
apisix:
  environment:
    APISIX_STAND_ALONE: "false"
  # via apisix/conf/config.yaml or environment
  # (APISIX 3.x reads etcd settings from deployment.etcd):
  # deployment:
  #   etcd:
  #     host:
  #       - "http://etcd-01:2379"
  #       - "http://etcd-02:2379"
  #       - "http://etcd-03:2379"
  #     prefix: "/apisix"
```

The preferred method is mounting `config.yaml` via a Docker config or volume. etcd endpoints use **overlay DNS aliases** defined in `docker-stack-db.prod.yml` (`etcd-01`, `etcd-02`, `etcd-03`), which are reachable from app nodes via the `iklimco-net` overlay:

```yaml
# config/apisix/config.yaml
# APISIX 3.x nests the etcd block under `deployment`.
deployment:
  etcd:
    host:
      - "http://etcd-01:2379"
      - "http://etcd-02:2379"
      - "http://etcd-03:2379"
    prefix: "/apisix"
    timeout: 30
```

### Disable standalone etcd in prod overlay

Docker Swarm overlay files cannot delete services from the base stack, but `replicas: 0` keeps the service defined without running any container:

```yaml
# docker-stack-infra.prod.yml
services:
  etcd:
    deploy:
      replicas: 0
```

### Firewall requirement

etcd access from app nodes to DB nodes must be open (port 2379, app subnet → DB subnet). Verify from an app node:

```bash
docker run --rm --network iklimco-net alpine \
  sh -c "wget -qO- http://etcd-01:2379/health"
```

## Step 5 — Redis: Sentinel cluster (prod overlay)

Redis runs as a single instance in test. In prod, Sentinel provides HA.

![[redis-sentinel-vs-cluster.png]]

Bitnami images are used: all configuration is done via env vars, and no separate `.conf` file is needed.
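As background for the Sentinel sizing used in this step (3 sentinels, quorum 2): the quorum only decides when the master is flagged as down, while the failover itself must still be authorized by a majority of all sentinels. The sketch below spells out that arithmetic; both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Majority of sentinels that must agree before a failover can start.
sentinel_majority() {
  echo $(( $1 / 2 + 1 ))
}

# Number of sentinels that may be lost while failover stays possible.
sentinel_tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

sentinel_majority 3             # prints 2
sentinel_tolerated_failures 3   # prints 1
```

With three sentinels spread over three app nodes, losing one node therefore still leaves enough sentinels to promote a replica.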
### Prerequisites

```bash
# Create Docker secret for Redis password:
openssl rand -hex 32 | docker secret create redis_password -
```

### Topology

```
any app node:   redis          (1 replica, spread by Swarm, not pinned)
2 app nodes:    redis-replica  (2 replicas, max 1/node, spread across app nodes)
all app nodes:  redis-sentinel (3 replicas, max 1/node, spread across all app nodes)
```

### docker-stack-infra.prod.yml — Redis services

The existing `redis` service is overridden in the prod overlay as **master**; `redis-replica` and `redis-sentinel` are added as new services. The service name (`redis`) remains unchanged so the APISIX connection config does not need updating.

```yaml
# docker-stack-infra.prod.yml
services:
  redis:
    # override base single-instance redis → master
    image: bitnamisecure/redis:latest
    environment:
      # Quoted: a bare `no` is parsed as a YAML boolean, which is not a valid env value
      ALLOW_EMPTY_PASSWORD: "no"
      REDIS_PASSWORD: ${REDIS_PASSWORD}
      REDIS_REPLICATION_MODE: master
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.labels.type == service
      restart_policy:
        condition: any
        delay: 5s
      labels:
        project: co.iklim

  redis-replica:
    image: bitnamisecure/redis:latest
    environment:
      ALLOW_EMPTY_PASSWORD: "no"
      REDIS_REPLICATION_MODE: slave
      REDIS_MASTER_HOST: redis
      REDIS_MASTER_PORT_NUMBER: "6379"
      REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD}
      REDIS_PASSWORD: ${REDIS_PASSWORD}
    deploy:
      mode: replicated
      replicas: 2
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
        preferences:
          - spread: node.hostname
      restart_policy:
        condition: any
        delay: 5s
      labels:
        project: co.iklim

  redis-sentinel:
    image: bitnamisecure/redis-sentinel:latest
    environment:
      REDIS_SENTINEL_MASTER_NAME: prod-master
      REDIS_MASTER_HOST: redis
      REDIS_MASTER_PORT_NUMBER: "6379"
      REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD}
      REDIS_SENTINEL_QUORUM: "2"
      REDIS_SENTINEL_DOWN_AFTER_MILLISECONDS: "5000"
      REDIS_SENTINEL_FAILOVER_TIMEOUT: "10000"
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
        preferences:
          - spread: node.hostname
      restart_policy:
        condition: any
        delay: 5s
      labels:
        project: co.iklim
```

### Microservice connection (Spring Data Redis)

Microservices must use a Sentinel-aware connection:

```yaml
# application-prod.yml
spring:
  data:
    redis:
      sentinel:
        master: prod-master
        nodes:
          - redis-sentinel:26379
      password: ${REDIS_PASSWORD}
```

### Verification

```bash
# Query master identity:
docker exec $(docker ps -q -f name=iklimco_redis-sentinel | head -1) \
  redis-cli -p 26379 SENTINEL get-master-addr-by-name prod-master
```

## Step 6 — RabbitMQ: 3-node Erlang cluster (prod overlay)

RabbitMQ runs as a 3-node cluster with one instance per app node.

### Prerequisites

```bash
# Create Docker secret for Erlang cookie (must be identical on all nodes):
openssl rand -hex 32 | docker secret create rabbitmq_erlang_cookie -
```

### docker-stack-infra.prod.yml — RabbitMQ override

```yaml
# docker-stack-infra.prod.yml (add alongside redis services)
services:
  rabbitmq:
    image: rabbitmq:3-management
    hostname: "rabbitmq-{{.Node.Hostname}}"
    environment:
      RABBITMQ_ERLANG_COOKIE_FILE: /run/secrets/rabbitmq_erlang_cookie
      RABBITMQ_USE_LONGNAME: "true"
      RABBITMQ_NODENAME: "rabbit@rabbitmq-{{.Node.Hostname}}"
    secrets:
      - rabbitmq_erlang_cookie
    networks:
      iklimco-net:
        aliases:
          - "rabbitmq-{{.Node.Hostname}}"
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
      update_config:
        parallelism: 1
        order: stop-first
      labels:
        project: co.iklim

secrets:
  rabbitmq_erlang_cookie:
    external: true

networks:
  iklimco-net:
    external: true
```

### Cluster join procedure (first setup)

RabbitMQ nodes do not form a cluster automatically; a manual join is required after the first start:

```bash
# Run on iklim-app-02: find the local RabbitMQ container
CTR=$(docker ps -q -f name=iklimco_rabbitmq)

# Stop, join, start:
docker exec "$CTR" rabbitmqctl stop_app
docker exec "$CTR" rabbitmqctl join_cluster rabbit@rabbitmq-iklim-app-01
docker exec "$CTR" rabbitmqctl start_app

# Repeat for iklim-app-03
```

```bash
# Verify cluster status (from any node):
docker exec "$CTR" rabbitmqctl cluster_status
```

> **HA policy:** After the cluster is formed, set quorum queues as the default:
> ```bash
> docker exec "$CTR" rabbitmqctl set_policy ha-all ".*" \
>   '{"queue-type":"quorum"}' --apply-to queues
> ```

## Step 7 — RabbitMQ WebSocket Sticky Sessions (Consistent Hash)

RabbitMQ Web STOMP (over WebSocket) requires a persistent connection. In a 3-node RabbitMQ cluster, if APISIX targets the default Swarm VIP for the `rabbitmq` upstream, sessions are not pinned to one node, which can cause unnecessary inter-node traffic or dropped connections.

To avoid this, we implement **Consistent Hashing (chash)** at the APISIX layer based on the client's IP address (`remote_addr`).

### 1. Update APISIX Upstream Configuration (init.sh)

Update the `rabbitmq` upstream definition in `init/apisix-core/init.sh` to target specific cluster nodes instead of the generic service name, enabling the `chash` algorithm for prod.

```bash
# Update upstream rabbitmq block in init.sh
if [[ "$PROFILE" == "prod" ]]; then
  # Direct node DNS names to bypass Swarm VIP and allow chash to work effectively
  RABBITMQ_NODES='{"rabbitmq-iklim-app-01:15674":1, "rabbitmq-iklim-app-02:15674":1, "rabbitmq-iklim-app-03:15674":1}'
  LB_TYPE="chash"
  HASH_KEY="remote_addr"
else
  RABBITMQ_NODES='{"rabbitmq:15674":1}'
  LB_TYPE="roundrobin"
  HASH_KEY=""
fi

# $HC is not set in this snippet; it is assumed to be defined elsewhere in init.sh
call_api "upstream rabbitmq" -X PUT "$APISIX_ADMIN_URL/upstreams/rabbitmq-upstream" \
  -H "X-API-KEY: $API_KEY" -H "Content-Type: application/json" \
  -d '{
    "name": "rabbitmq-upstream",
    "type": "'"$LB_TYPE"'",
    "key": "'"$HASH_KEY"'",
    "nodes": '"$RABBITMQ_NODES"',
    "timeout": {"connect": 10, "send": 3600, "read": 3600},
    "scheme": "http",
    '"$HC"'
  }'
```

### 2. Enable Real IP Detection in APISIX

Consistent hashing by `remote_addr` requires APISIX to see the actual client IP, not the internal IP of the SWAG (Nginx) proxy.
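Why the real client IP matters can be shown with a simplified stand-in for `chash`. The sketch below uses a plain modulo hash over a CRC (via `cksum`), not the consistent-hash ring that APISIX implements, and `pick_node` is a hypothetical helper. The point carries over: the same key always maps to the same node, so if every request arrives with the proxy's address, every WebSocket session lands on one RabbitMQ node.

```bash
#!/usr/bin/env bash
NODES=(rabbitmq-iklim-app-01 rabbitmq-iklim-app-02 rabbitmq-iklim-app-03)

# Map a key (here: a client IP) to a node. Deterministic: same key, same node.
pick_node() {
  local crc
  crc=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo "${NODES[crc % ${#NODES[@]}]}"
}

# Distinct client IPs (documentation range) can be spread over the nodes:
for ip in 203.0.113.7 203.0.113.8 203.0.113.9; do
  echo "$ip -> $(pick_node "$ip")"
done

# Without real-IP detection every request carries the proxy address
# (10.0.0.5 is an assumed example), so all sessions map to the same node:
for _ in 1 2 3; do
  echo "10.0.0.5 -> $(pick_node 10.0.0.5)"
done
```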
In the `config.yaml` inside the custom APISIX image (`custom-apisix:3.12.0`):

```yaml
nginx_config:
  http:
    real_ip_header: "X-Real-IP"
    # APISIX's config key is `real_ip_from` (a list); it is rendered as nginx's set_real_ip_from
    real_ip_from:
      - "10.0.0.0/8"
```

> **DNS Note:** For `chash` to work with node-specific names, the RabbitMQ service must have network aliases configured for each node (e.g., `rabbitmq-{{.Node.Hostname}}`) as shown in Step 6.

## Step 8 — Create `docker-stack-infra.prod.yml`

Create this file in the repo root alongside `docker-stack-infra.yml`. It combines all prod-specific overrides from Steps 2–6 (including disabling the standalone `etcd` from Step 4):

```yaml
# docker-stack-infra.prod.yml
# Prod overlay; deploy with:
#   docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco

services:
  vault:
    environment:
      VAULT_LOCAL_CONFIG: >-
        {"api_addr":"https://vault.iklim.co:8200",
        "cluster_addr":"https://{{ .Node.Hostname }}:8201",
        "storage":{"raft":{"path":"/vault/file","node_id":"{{ .Node.Hostname }}"}},
        "listener":[{"tcp":{"address":"0.0.0.0:8200",
        "tls_cert_file":"/vault/certs/STAR.iklim.co.full.crt",
        "tls_key_file":"/vault/certs/STAR.iklim.co_key.pem"}}],
        "default_lease_ttl":"168h","max_lease_ttl":"720h","ui":true}
    volumes:
      - /opt/iklimco/vault/data:/vault/file
      - ${SWAG_CERT_DIR}:/vault/certs:ro
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service

  apisix:
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service

  apisix-dashboard:
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service

  redis:
    image: bitnamisecure/redis:latest
    environment:
      # Quoted: a bare `no` is parsed as a YAML boolean, which is not a valid env value
      ALLOW_EMPTY_PASSWORD: "no"
      REDIS_PASSWORD: ${REDIS_PASSWORD}
      REDIS_REPLICATION_MODE: master
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.labels.type == service
      restart_policy:
        condition: any
        delay: 5s
      labels:
        project: co.iklim

  redis-replica:
    image: bitnamisecure/redis:latest
    environment:
      ALLOW_EMPTY_PASSWORD: "no"
      REDIS_REPLICATION_MODE: slave
      REDIS_MASTER_HOST: redis
      REDIS_MASTER_PORT_NUMBER: "6379"
      REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD}
      REDIS_PASSWORD: ${REDIS_PASSWORD}
    deploy:
      mode: replicated
      replicas: 2
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
        preferences:
          - spread: node.hostname
      restart_policy:
        condition: any
        delay: 5s
      labels:
        project: co.iklim

  redis-sentinel:
    image: bitnamisecure/redis-sentinel:latest
    environment:
      REDIS_SENTINEL_MASTER_NAME: prod-master
      REDIS_MASTER_HOST: redis
      REDIS_MASTER_PORT_NUMBER: "6379"
      REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD}
      REDIS_SENTINEL_QUORUM: "2"
      REDIS_SENTINEL_DOWN_AFTER_MILLISECONDS: "5000"
      REDIS_SENTINEL_FAILOVER_TIMEOUT: "10000"
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
        preferences:
          - spread: node.hostname
      restart_policy:
        condition: any
        delay: 5s
      labels:
        project: co.iklim

  rabbitmq:
    image: rabbitmq:3-management
    hostname: "rabbitmq-{{.Node.Hostname}}"
    environment:
      RABBITMQ_ERLANG_COOKIE_FILE: /run/secrets/rabbitmq_erlang_cookie
      RABBITMQ_USE_LONGNAME: "true"
      RABBITMQ_NODENAME: "rabbit@rabbitmq-{{.Node.Hostname}}"
    secrets:
      - rabbitmq_erlang_cookie
    networks:
      iklimco-net:
        aliases:
          - "rabbitmq-{{.Node.Hostname}}"
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
      update_config:
        parallelism: 1
        order: stop-first
      labels:
        project: co.iklim

  # ── Disabled in prod ─────────────────────────────────────────────────────────
  etcd:
    deploy:
      replicas: 0

  postgresql:
    deploy:
      replicas: 0

  mongodb:
    deploy:
      replicas: 0

  pg-proxy:
    deploy:
      replicas: 0

  mongo-proxy:
    deploy:
      replicas: 0

secrets:
  rabbitmq_erlang_cookie:
    external: true

networks:
  iklimco-net:
    external: true
```

## Step 9 — Monitoring Data Persistence

Prometheus and Grafana run as single instances. Grafana data is placed on the StorageBox shared filesystem for manual failover. Prometheus TSDB stays on a local Docker volume because DAVFS/StorageBox is not suitable for Prometheus WAL and compaction I/O.

**Changes already applied to `docker-stack-infra.yml`:**

```yaml
prometheus:
  volumes:
    - prometheus-vl:/prometheus

grafana:
  volumes:
    - ${GRAFANA_DATA_DIR:-grafana-vl}:/var/lib/grafana
```

Test uses the named Docker volume fallback (`grafana-vl`) for Grafana, and Prometheus always uses the named Docker volume (`prometheus-vl`), so no test env change is needed.

**Add to `prod/secrets/iklim.co/.env.prod` on storagebox** (already in `env-prod/.env`):

```bash
GRAFANA_DATA_DIR=/mnt/storagebox/grafana/data
```

> `/mnt/storagebox/grafana/data` is created automatically by the Ansible `storagebox` role during bootstrap via the `storagebox_managed_directories` variable. No manual step required.
> Grafana writes its SQLite database and dashboard JSON to `/var/lib/grafana`.
> Prometheus writes its TSDB to `/prometheus` on the local `prometheus-vl` Docker volume; it is not shared between nodes.

## Step 10 — Verify

```bash
# Base file must be valid on its own (test deploy):
docker stack config -c docker-stack-infra.yml > /dev/null && echo "base OK"

# Prod merge must be valid:
docker stack config -c docker-stack-infra.yml -c docker-stack-infra.prod.yml > /dev/null && echo "prod merge OK"
```

## Step 11 — Database Proxies and Developer Access

In the production environment, the `pg-proxy` and `mongo-proxy` services (socat-based) defined in the base `docker-stack-infra.yml` are **deprecated and will not be used**.

### Rationale

- **Leader Tracking:** Simple L4 proxies (socat) cannot track the Patroni Leader or MongoDB Primary. They point to a single service VIP, so connections may land on a read-only replica during failover.
- **HA Connection Strings:** Modern DB drivers (JDBC, libpq, MongoClient) support multi-host connection strings, which provide native failover and load balancing without an intermediate proxy.
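A minimal sketch of what such multi-host connection strings look like. The host names `iklim-db-01` to `iklim-db-03`, the database name `appdb`, and the replica set name `rs0` are placeholders, not values taken from this setup. `targetServerType=primary` is the PgJDBC parameter that restricts connections to the current primary (older driver versions spell it `master`); `target_session_attrs=read-write` is the libpq counterpart.

```bash
#!/usr/bin/env bash
DB_HOSTS=(iklim-db-01 iklim-db-02 iklim-db-03)   # placeholder host names

# Join "host:port" pairs with commas: join_hosts PORT HOST...
join_hosts() {
  local port=$1; shift
  local out="" h
  for h in "$@"; do
    out+="${out:+,}${h}:${port}"
  done
  echo "$out"
}

PG_HOSTS=$(join_hosts 5432 "${DB_HOSTS[@]}")
MONGO_HOSTS=$(join_hosts 27017 "${DB_HOSTS[@]}")

# JDBC (PgJDBC), libpq URI, and MongoDB URI forms:
echo "jdbc:postgresql://${PG_HOSTS}/appdb?targetServerType=primary"
echo "postgresql://${PG_HOSTS}/appdb?target_session_attrs=read-write"
echo "mongodb://${MONGO_HOSTS}/appdb?replicaSet=rs0"
```

In each case the driver, not a proxy, decides which listed host is currently writable.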
### Developer Access Strategy

- **Direct Subnet Access:** Developers connect via WireGuard directly to the DB subnet (`10.20.20.0/24`).
- **No Translation:** Instead of mapping ports like `15432`, the standard ports (`5432`, `27017`) are used across all cluster nodes.

## Placement and Replica Summary — prod

| Service | File | Replicas | Placement | HA Note |
|---------|------|----------|-----------|---------|
| swag | base | 1 | `node.hostname == iklim-app-01` | No clustering support; Floating IP pinned to node |
| cert-reloader | base | 1 | `node.hostname == iklim-app-01` | Cron-style task; duplicate would be problematic |
| vault | prod overlay | 3 | `node.labels.type == service`; max 1/node | Raft cluster; see `07-vault-raft-plan.md` |
| apisix | prod overlay | 3 | `node.labels.type == service`; max 1/node | Stateless; config in Patroni etcd; rate limit policy:redis |
| apisix-dashboard | prod overlay | 3 | `node.labels.type == service`; max 1/node | Stateless; reads from etcd |
| redis (master) | prod overlay | 1 | `node.labels.type == service`; Swarm spread | Sentinel cluster master; not pinned, reschedules on node failure |
| redis-replica | prod overlay | 2 | `node.labels.type == service`; max 1/node | Sentinel replica; spread:hostname |
| redis-sentinel | prod overlay | 3 | `node.labels.type == service`; max 1/node | Quorum=2; failover automatic |
| rabbitmq | prod overlay | 3 | `node.labels.type == service`; max 1/node | Erlang cluster; quorum queues |
| etcd | prod overlay | 0 | — | Disabled (`replicas: 0`); APISIX uses Patroni etcd on DB nodes |
| postgresql | prod overlay | 0 | — | Disabled (`replicas: 0`); Patroni HA runs as `iklim-db` stack on DB nodes; port 5432 conflict |
| mongodb | prod overlay | 0 | — | Disabled (`replicas: 0`); MongoDB replica set runs as `iklim-db` stack on DB nodes; port 27017 conflict |
| pg-proxy | prod overlay | 0 | — | Deprecated; microservices use multi-host JDBC with native Patroni failover |
| mongo-proxy | prod overlay | 0 | — | Deprecated; microservices use multi-host MongoClient with native replica set failover |
| prometheus | base | 1 | `node.labels.type == service` | No native HA; Thanos is overkill at this scale |
| grafana | base | 1 | `node.labels.type == service` | Not critical |

> PostgreSQL and MongoDB run in separate DB stacks on `iklim-db-*` nodes. See `08-prod-db-cluster-kurulum.md`.
> etcd: 3-node cluster on DB nodes; APISIX shares it via the `/apisix` prefix.
> Disabled services (`replicas: 0`) are removed from `docker service ls` by a post-deploy step in `deploy-prod.yml`.
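
The post-deploy cleanup mentioned above is not shown in this document; the following is only a sketch of how such a step could pick out the disabled services. It assumes the stack name `iklimco` and relies on `docker service ls --format '{{.Name}} {{.Replicas}}'` printing `0/0` for services scaled to zero. `zero_replica_services` is a hypothetical helper, and the sample input stands in for real `docker` output.

```bash
#!/usr/bin/env bash
# Reads "NAME REPLICAS" lines on stdin and prints the names of services
# in the given stack that are scaled to zero.
zero_replica_services() {
  local stack=$1 name replicas
  while read -r name replicas; do
    if [[ "$name" == "${stack}_"* && "$replicas" == "0/0" ]]; then
      echo "$name"
    fi
  done
}

# Sample input in the format produced by:
#   docker service ls --format '{{.Name}} {{.Replicas}}'
zero_replica_services iklimco <<'EOF'
iklimco_apisix 3/3
iklimco_etcd 0/0
iklimco_pg-proxy 0/0
other_etcd 0/0
EOF
```

On a manager node the same filter could be fed from the real listing and piped into `xargs -r docker service rm`; note that the next `docker stack deploy` recreates the removed services with zero replicas, so the step has to run after every deploy.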