From 4c3b7faad6aefc0ee85870f71dbeaa1d6c3a0018 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Murat=20=C3=96ZDEM=C4=B0R?=
Date: Sun, 17 May 2026 18:54:44 +0300
Subject: [PATCH] docs(roadmap): update production environment roadmap and setup guides

- Documented infrastructure changes for Redis Sentinel and RabbitMQ clustering.
- Updated setup guides for Terraform, Ansible, and Swarm node recovery.
- Clarified APISIX rate limit policy and degradation settings.
---
 roadmap/prod-env/01-swarm-init-multinode.md   |  4 +-
 roadmap/prod-env/03-infra-stack-changes.md    | 57 +++++++++-------
 roadmap/prod-env/08-deploy-pipeline-update.md | 28 +++++++-
 setup/06-prod-terraform-iaac.md               | 10 +++
 setup/07-prod-ansible-bootstrap.md            | 21 ++++++
 setup/08-prod-db-cluster-kurulum.md           |  2 +
 setup/09-prod-runner-ha-ve-swarm.md           | 65 ++++++++++++++-----
 7 files changed, 145 insertions(+), 42 deletions(-)

diff --git a/roadmap/prod-env/01-swarm-init-multinode.md b/roadmap/prod-env/01-swarm-init-multinode.md
index 24496bf..00550f1 100644
--- a/roadmap/prod-env/01-swarm-init-multinode.md
+++ b/roadmap/prod-env/01-swarm-init-multinode.md
@@ -18,8 +18,8 @@
 | `iklim-app-01` | API services, SWAG, Vault | Manager + Worker | `type=service` |
 | `iklim-app-02` | API services replicas | Manager + Worker | `type=service` |
 | `iklim-app-03` | API services replicas | Manager + Worker | `type=service` |
-| `iklim-db-01` | PostgreSQL (Patroni), etcd | Worker | `role=db` |
-| `iklim-db-02` | PostgreSQL (Patroni), etcd | Worker | `role=db` |
+| `iklim-db-01` | MongoDB replica + PostgreSQL (Patroni), etcd | Worker | `role=db` |
+| `iklim-db-02` | MongoDB replica + PostgreSQL (Patroni), etcd | Worker | `role=db` |
 | `iklim-db-03` | MongoDB replica + PostgreSQL (Patroni), etcd | Worker | `role=db` |
 
 ### Label scheme rationale
diff --git a/roadmap/prod-env/03-infra-stack-changes.md b/roadmap/prod-env/03-infra-stack-changes.md
index 11f9b89..ef979f6 100644
--- a/roadmap/prod-env/03-infra-stack-changes.md
+++ b/roadmap/prod-env/03-infra-stack-changes.md
@@ -64,6 +64,7 @@ vault:
     mode: replicated
     replicas: 3
     placement:
+      max_replicas_per_node: 1
       constraints:
         - node.labels.type == service
 ```
@@ -80,6 +81,7 @@ services:
       mode: replicated
       replicas: 3
       placement:
+        max_replicas_per_node: 1
         constraints:
           - node.labels.type == service
 
@@ -88,6 +90,7 @@ services:
       mode: replicated
       replicas: 3
       placement:
+        max_replicas_per_node: 1
         constraints:
           - node.labels.type == service
 ```
@@ -106,7 +109,7 @@ Update the global rate limit block in `init/apisix-core/init.sh`:
 if [[ "$PROFILE" != "dev" ]]; then
   if [[ "$PROFILE" == "prod" ]]; then
     RATE_POLICY="redis"
-    RATE_REDIS=',\"redis_host\":\"redis-master\",\"redis_port\":6379,\"redis_password\":\"'\"$REDIS_PASSWORD\"'\"'
+    RATE_REDIS=',\"redis_host\":\"redis\",\"redis_port\":6379,\"redis_password\":\"'\"$REDIS_PASSWORD\"'\"'
   else
     RATE_POLICY="local"
     RATE_REDIS=""
@@ -114,13 +117,17 @@ if [[ "$PROFILE" != "dev" ]]; then
 
   call_api "global rate limit" -X PUT "$APISIX_ADMIN_URL/global_rules/1" \
     -H "X-API-KEY: $API_KEY" -H "Content-Type: application/json" \
-    -d '{"plugins":{"limit-count":{"count":300,"time_window":60,"key_type":"var","key":"remote_addr","rejected_code":429,"policy":"'"$RATE_POLICY"'"'"$RATE_REDIS"'}}}'
+    -d '{"plugins":{"limit-count":{"count":300,"time_window":60,"key_type":"var","key":"remote_addr","rejected_code":429,"policy":"'"$RATE_POLICY"'","allow_degradation":true'"$RATE_REDIS"'}}}'
 fi
 ```
 
 > APISIX's `limit-count` plugin does not natively support Redis Sentinel; `policy:redis` works with a single endpoint.
-> The `redis-master` service name stays constant within Swarm — during Sentinel failover (~10-30 s) rate limiting may be
-> temporarily inconsistent; this brief disruption is acceptable. Microservices use Spring Data Redis Sentinel natively.
+> The `redis` service name stays constant within Swarm overlay DNS. `allow_degradation: true` ensures that if Redis is
+> temporarily unreachable (e.g. Sentinel failover ~10-30 s, or master rescheduling), APISIX passes requests through
+> instead of returning errors — rate limiting is briefly suspended but API access is unaffected.
+> Microservices use Spring Data Redis Sentinel natively and are unaffected by master changes.
+> Docker Swarm has no inter-service anti-affinity; the `redis` master placement relies on Swarm's spread strategy
+> to avoid co-locating with a replica. This is a known limitation — accepted in favour of operational simplicity.
 
 ## Step 4 — etcd: Separate APISIX etcd removed — Patroni etcd shared
 
@@ -190,12 +197,9 @@ openssl rand -hex 32 | docker secret create redis_password -
 ### Topology
 
 ```
-iklim-app-01: redis-master (1 replica, pinned to app-01)
-iklim-app-02: redis-replica (1 replica, pinned to app-02)
-iklim-app-03: redis-replica (1 replica, pinned to app-03)
-iklim-app-01: redis-sentinel ┐
-iklim-app-02: redis-sentinel ├─ 3 replicas, spread across all app nodes
-iklim-app-03: redis-sentinel ┘
+any app node: redis (1 replica, spread by Swarm — not pinned)
+2 app nodes: redis-replica (2 replicas, max 1/node, spread across app nodes)
+all app nodes: redis-sentinel (3 replicas, max 1/node, spread across all app nodes)
 ```
 
 ### docker-stack-infra.prod.yml — Redis services
@@ -216,7 +220,7 @@ services:
       replicas: 1
       placement:
         constraints:
-          - node.hostname == iklim-app-01
+          - node.labels.type == service
       restart_policy:
         condition: any
         delay: 5s
@@ -236,6 +240,7 @@ services:
       mode: replicated
       replicas: 2
       placement:
+        max_replicas_per_node: 1
         constraints:
           - node.labels.type == service
         preferences:
@@ -249,7 +254,7 @@ services:
   redis-sentinel:
     image: bitnamisecure/redis-sentinel:latest
     environment:
-      REDIS_SENTINEL_MASTER_NAME: mymaster
+      REDIS_SENTINEL_MASTER_NAME: prod-master
       REDIS_MASTER_HOST: redis
       REDIS_MASTER_PORT_NUMBER: "6379"
       REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD}
@@ -260,6 +265,7 @@ services:
       mode: replicated
       replicas: 3
       placement:
+        max_replicas_per_node: 1
         constraints:
           - node.labels.type == service
         preferences:
@@ -281,7 +287,7 @@ spring:
   data:
     redis:
       sentinel:
-        master: mymaster
+        master: prod-master
         nodes:
           - redis-sentinel:26379
       password: ${REDIS_PASSWORD}
@@ -292,7 +298,7 @@ spring:
 ```bash
 # Query master identity:
 docker exec $(docker ps -q -f name=iklimco_redis-sentinel | head -1) \
-  redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster
+  redis-cli -p 26379 SENTINEL get-master-addr-by-name prod-master
 ```
 
 ## Step 6 — RabbitMQ: 3-node Erlang cluster (prod overlay)
@@ -324,6 +330,7 @@ services:
       mode: replicated
       replicas: 3
       placement:
+        max_replicas_per_node: 1
         constraints:
           - node.labels.type == service
       update_config:
@@ -392,6 +399,7 @@ services:
       mode: replicated
       replicas: 3
       placement:
+        max_replicas_per_node: 1
         constraints:
           - node.labels.type == service
 
@@ -400,6 +408,7 @@ services:
       mode: replicated
       replicas: 3
       placement:
+        max_replicas_per_node: 1
         constraints:
           - node.labels.type == service
 
@@ -408,6 +417,7 @@ services:
       mode: replicated
       replicas: 3
       placement:
+        max_replicas_per_node: 1
         constraints:
           - node.labels.type == service
 
@@ -442,6 +452,7 @@ services:
       mode: replicated
       replicas: 2
       placement:
+        max_replicas_per_node: 1
         constraints:
           - node.labels.type == service
         preferences:
@@ -455,7 +466,7 @@ services:
   redis-sentinel:
     image: bitnamisecure/redis-sentinel:latest
     environment:
-      REDIS_SENTINEL_MASTER_NAME: mymaster
+      REDIS_SENTINEL_MASTER_NAME: prod-master
       REDIS_MASTER_HOST: redis
       REDIS_MASTER_PORT_NUMBER: "6379"
       REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD}
@@ -466,6 +477,7 @@ services:
       mode: replicated
       replicas: 3
       placement:
+        max_replicas_per_node: 1
         constraints:
           - node.labels.type == service
         preferences:
@@ -489,6 +501,7 @@ services:
       mode: replicated
       replicas: 3
       placement:
+        max_replicas_per_node: 1
         constraints:
           - node.labels.type == service
       update_config:
@@ -552,13 +565,13 @@ docker stack config -c docker-stack-infra.yml -c docker-stack-infra.prod.yml > /
 |---------|------|----------|-----------|---------|
 | swag | base | 1 | `node.hostname == iklim-app-01` | No clustering support; Floating IP pinned to node |
 | cert-reloader | base | 1 | `node.hostname == iklim-app-01` | Cron-style task; duplicate would be problematic |
-| vault | prod overlay | 3 | `node.labels.type == service` | Raft cluster — see `07-vault-raft-plan.md` |
-| apisix | prod overlay | 3 | `node.labels.type == service` | Stateless; config in Patroni etcd; rate limit policy:redis |
-| apisix-dashboard | prod overlay | 3 | `node.labels.type == service` | Stateless; reads from etcd |
-| redis (master) | prod overlay | 1 | `node.hostname == iklim-app-01` | Sentinel cluster master |
-| redis-replica | prod overlay | 2 | `node.labels.type == service` | Sentinel replica; spread:hostname |
-| redis-sentinel | prod overlay | 3 | `node.labels.type == service` | Quorum=2; failover automatic |
-| rabbitmq | prod overlay | 3 | `node.labels.type == service` | Erlang cluster; quorum queues |
+| vault | prod overlay | 3 | `node.labels.type == service`; max 1/node | Raft cluster — see `07-vault-raft-plan.md` |
+| apisix | prod overlay | 3 | `node.labels.type == service`; max 1/node | Stateless; config in Patroni etcd; rate limit policy:redis |
+| apisix-dashboard | prod overlay | 3 | `node.labels.type == service`; max 1/node | Stateless; reads from etcd |
+| redis (master) | prod overlay | 1 | `node.labels.type == service`; Swarm spread | Sentinel cluster master; not pinned — reschedules on node failure |
+| redis-replica | prod overlay | 2 | `node.labels.type == service`; max 1/node | Sentinel replica; spread:hostname |
+| redis-sentinel | prod overlay | 3 | `node.labels.type == service`; max 1/node | Quorum=2; failover automatic |
+| rabbitmq | prod overlay | 3 | `node.labels.type == service`; max 1/node | Erlang cluster; quorum queues |
 | etcd | base | 1 | `node.labels.type == service` | Idle in prod — APISIX uses Patroni etcd; standalone service remains in base stack |
 | prometheus | base | 1 | `node.labels.type == service` | No native HA; Thanos is overkill at this scale |
 | grafana | base | 1 | `node.labels.type == service` | Not critical |
diff --git a/roadmap/prod-env/08-deploy-pipeline-update.md b/roadmap/prod-env/08-deploy-pipeline-update.md
index 026a230..264887d 100644
--- a/roadmap/prod-env/08-deploy-pipeline-update.md
+++ b/roadmap/prod-env/08-deploy-pipeline-update.md
@@ -245,7 +245,31 @@ Insert **after** `Bootstrap SWAG Certificate` and **before** `Review Environment
 > SQL and JS files are generated by `Prepare Init Files` step via `init_postgresql` / `init_mongodb` functions in `common-functions.sh`.
 > Step is idempotent — scripts use `CREATE IF NOT EXISTS` / `createCollection` semantics.
 
-## Step 8 — Ensure subdomain env vars are in prod `.env`
+## Step 8 — Microservice prod deploy overlay
+
+Each microservice has its own `docker-stack-service.prod.yml` overlay file. This file contains the prod-specific `replicas: 3` and `max_replicas_per_node: 1` settings.
+
+In the microservice deploy pipelines (`deploy-prod.yml`), the `docker stack deploy` command should look like this:
+
+```bash
+docker stack deploy \
+  -c BE-/docker-stack-service.yml \
+  -c BE-/docker-stack-service.prod.yml \
+  iklimco
+```
+
+For example, for `BE-Authentication`:
+
+```bash
+docker stack deploy \
+  -c BE-Authentication/docker-stack-service.yml \
+  -c BE-Authentication/docker-stack-service.prod.yml \
+  iklimco
+```
+
+> When a new microservice is added, creating the `BE-/docker-stack-service.prod.yml` file and including this overlay in the pipeline is mandatory.
+
+## Step 9 — Ensure subdomain env vars are in prod `.env`
 
 Add to `prod/secrets/iklim.co/.env.prod` on storagebox:
 
@@ -256,7 +280,7 @@ RABBITMQ_SUBDOMAIN=rabbitmq.iklim.co
 GRAFANA_SUBDOMAIN=grafana.iklim.co
 ```
 
-## Step 8 — Final step order for prod pipeline
+## Step 10 — Final step order for prod pipeline
 
 1. Acquire Deploy Lock
 2. Checkout Branch
diff --git a/setup/06-prod-terraform-iaac.md b/setup/06-prod-terraform-iaac.md
index d4c2179..9b2a7fe 100644
--- a/setup/06-prod-terraform-iaac.md
+++ b/setup/06-prod-terraform-iaac.md
@@ -302,6 +302,16 @@ terraform output -raw ansible_inventory_yaml > \
   ../../ansible/inventory/generated/prod.yml
 ```
 
+### Gitea Variable: PROD_FLOATING_IP
+
+The deploy pipeline needs this variable to manage DNS records automatically. It is set once, after `terraform apply`:
+
+```bash
+terraform output prod_floating_ip
+```
+
+Add the resulting IP address in Gitea → project settings → **Variables** under the name `PROD_FLOATING_IP`. The pipeline reads it via `vars.PROD_FLOATING_IP` and updates the GoDaddy A records idempotently.
+
 ### Resize (Changing the Server Type)
 
 In `terraform.tfvars`, change the value of `server_type_swarm` or `server_type_db`:
diff --git a/setup/07-prod-ansible-bootstrap.md b/setup/07-prod-ansible-bootstrap.md
index bde093a..25ff8f3 100644
--- a/setup/07-prod-ansible-bootstrap.md
+++ b/setup/07-prod-ansible-bootstrap.md
@@ -262,6 +262,26 @@ scp -P 23 STORAGEBOX_USER@STORAGEBOX_USER.your-storagebox.de:prod/secrets/iklim.
 chmod 600 /opt/iklimco/stacks/.env
 ```
 
+## Swarm Setup Verification
+
+After bootstrap, check the Swarm state with the following commands:
+
+```bash
+# 6 nodes: 3 managers (Leader/Reachable), 3 workers (Ready)
+docker node ls
+
+# App node label
+docker node inspect iklim-app-01 --format '{{.Spec.Labels}}'
+# Expected: map[type:service]
+
+# DB node label
+docker node inspect iklim-db-01 --format '{{.Spec.Labels}}'
+# Expected: map[db-index:01 role:db]
+
+# swarm-init.sh idempotency: it does not attempt init again on an already active Swarm
+grep -n "swarm init\|swarm join" init/swarm-init.sh
+```
+
 ## Acceptance Criteria
 
 - `ansible all -m ping` succeeds.
@@ -270,6 +290,7 @@ chmod 600 /opt/iklimco/stacks/.env
 - Manager quorum is established (3 managers, 1 loss tolerated).
 - The `iklimco-net` overlay network exists.
 - Node labels (`type=service`, `role=db`, `db-index=01/02/03`) are verified with inspect.
+- `swarm-init.sh` does not attempt init again on an active Swarm (idempotent).
 - `/mnt/storagebox` is mounted on every node.
 - The Gitea Act Runner service is running on every app node.
 - On the DB nodes, `/opt/iklimco/db/mongodb/config/mongod.conf` has been created and contains `replSetName: rs0`.
diff --git a/setup/08-prod-db-cluster-kurulum.md b/setup/08-prod-db-cluster-kurulum.md
index 9148da6..197eb71 100644
--- a/setup/08-prod-db-cluster-kurulum.md
+++ b/setup/08-prod-db-cluster-kurulum.md
@@ -528,6 +528,8 @@ bootstrap:
         wal_keep_size: 512
         max_wal_senders: 5
         max_replication_slots: 5
+        shared_preload_libraries: 'pg_stat_statements'
+        pg_stat_statements.track: 'all'
 
   initdb:
     - encoding: UTF8
diff --git a/setup/09-prod-runner-ha-ve-swarm.md b/setup/09-prod-runner-ha-ve-swarm.md
index a68a824..7b89273 100644
--- a/setup/09-prod-runner-ha-ve-swarm.md
+++ b/setup/09-prod-runner-ha-ve-swarm.md
@@ -66,6 +66,30 @@ Required precaution:
 - A prod deploy for the same service must not be triggered more than once at the same time.
 - Prod deploy workflows must use an automatic deploy lock on StorageBox.
 
+## Prerequisites: StorageBox Secrets
+
+The following files must exist on StorageBox before the deploy pipeline runs. They are not created automatically; they are created by hand during initial setup.
+
+### SWAG / GoDaddy Credentials
+
+```
+prod/secrets/iklim.co/.env.secrets.swag
+```
+
+```bash
+GODADDY_KEY=
+GODADDY_SECRET=
+```
+
+For the GoDaddy API key, go to https://developer.godaddy.com/keys and create a **Production** key. If an existing key is known to have been shared in any chat, Slack, or email, revoke it and create a new one before use.
+
+> `.env.secrets.swag` contains only the SWAG/GoDaddy credentials.
+> `.env.secrets.shared` contains the AppRole IDs, DB passwords, and other runtime secrets; do not mix up these two files.
+
+### Gitea PROD_FLOATING_IP Variable
+
+For DNS automation, `PROD_FLOATING_IP` must be defined as a Gitea project variable. See the "Gitea Variable: PROD_FLOATING_IP" step in `06-prod-terraform-iaac.md`.
+
 ## StorageBox Deploy Lock Model
 
 Because prod has 3 runners, the deploy lock is considered mandatory. Lock lokal dosya
@@ -123,28 +147,35 @@ Lock level:
 
 ## Swarm Service Distribution
 
-In prod, all 3 nodes are manager + app worker, so services can be distributed across the 3 nodes.
+In prod, all 3 app nodes are manager + app worker, so services can be distributed across the 3 nodes.
 
-For application services, the `docker-stack-service.yml` deploy settings can later be revised according to these principles:
+### Microservices
 
-- `replicas: 3` for stateless services
-- `placement` selects only app-capable nodes
-- `update_config` is set up for rolling updates
-- `restart_policy` stays enabled
-- Stateful services are not replicated on app workers; the stateful layer lives separately on the DB nodes
+Each microservice has two stack files:
 
-In the current state of the repo, microservice stack files are deployed per service. This document defines the runner and Swarm prerequisites for the prod HA target; each microservice's replica count should be handled as a separate application deploy refactor.
+| File | Contents | Environment |
+| --- | --- | --- |
+| `BE-/docker-stack-service.yml` | Base definitions, `replicas: 1` | Test + Prod |
+| `BE-/docker-stack-service.prod.yml` | `replicas: 3`, `max_replicas_per_node: 1` | Prod only |
+
+Prod deploy command:
+
+```bash
+docker stack deploy \
+  -c BE-/docker-stack-service.yml \
+  -c BE-/docker-stack-service.prod.yml \
+  iklimco
+```
+
+`max_replicas_per_node: 1` is mandatory; without it, Swarm places more than one replica on the same node when the node count drops below the replica count.
+
+### Infra Services
+
+`docker-stack-infra.yml` (base) and `docker-stack-infra.prod.yml` (overlay) are deployed together. The overlay overrides services such as Vault, APISIX, RabbitMQ, and Redis Sentinel with `replicas: 3` and `max_replicas_per_node: 1`. Details: `Environment_Infrastructure/roadmap/prod-env/03-infra-stack-changes.md`.
 
 ## Gateway and Public Traffic
 
-The public internet must enter the gateway layer only via `80/tcp` and `443/tcp`.
-
-In the current stack files, APISIX may be publishing `8080/8443`. Because the public firewall in the prod target architecture only opens `80/443`, one of two options must be chosen:
-
-1. The APISIX/SWAG host publish ports are made compatible with `80/443`.
-2. A Hetzner Load Balancer or reverse proxy accepts `80/443` and forwards to the Swarm gateway ports over the private network.
-
-This decision is separate from the Terraform/Ansible bootstrap; it requires an application infrastructure manifest revision.
+The public internet enters only via `80/tcp` and `443/tcp` through SWAG. SWAG is pinned to `iklim-app-01` (the Floating IP is on this node). The APISIX admin port (`9180`) and other service ports are not exposed publicly; SWAG acts as a reverse proxy and forwards all public traffic to APISIX. Details: `Environment_Infrastructure/roadmap/prod-env/04-swag-nginx-configs.md`.
 
 ## Acceptance Criteria
 
@@ -156,3 +187,5 @@ This decision is separate from the Terraform/Ansible bootstrap; it requires an application infrastructure manifest revi
 - Prod workflows use the global lock `prod/locks/prod-deploy.lock` on StorageBox.
 - The lock is not manual; the workflow manages it automatically with `mkdir/rmdir`.
 - Public ingress is limited to `22`, `80`, and `443`.
+- `prod/secrets/iklim.co/.env.secrets.swag` exists on StorageBox and contains valid GoDaddy credentials.
+- The `PROD_FLOATING_IP` project variable is defined in Gitea.
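
Reviewer note: the acceptance criteria above say the deploy lock is managed by the workflow with `mkdir/rmdir`. The patch does not show that logic, so the following is only a minimal sketch of the pattern. The function names `acquire_lock` and `release_lock` are hypothetical, and the default `LOCK_DIR` assumes the StorageBox mount at `/mnt/storagebox` combined with the `prod/locks/prod-deploy.lock` path; the real workflow may reach StorageBox differently. It also assumes `mkdir` is atomic on that mount, which holds on local POSIX filesystems but is not confirmed here for the StorageBox.

```bash
#!/usr/bin/env bash
# Sketch of a mkdir/rmdir deploy lock (assumptions are noted inline).

# Assumption: the StorageBox is mounted at /mnt/storagebox and the lock lives
# under prod/locks/; override LOCK_DIR to point somewhere else.
LOCK_DIR="${LOCK_DIR:-/mnt/storagebox/prod/locks/prod-deploy.lock}"

# Try to take the lock. mkdir fails if the directory already exists,
# so at most one caller succeeds. Returns 0 on success, non-zero otherwise.
acquire_lock() {
  mkdir "$LOCK_DIR" 2>/dev/null
}

# Release the lock. rmdir removes only an empty directory; a missing lock
# is treated as already released.
release_lock() {
  rmdir "$LOCK_DIR" 2>/dev/null || true
}
```

A workflow step could call `acquire_lock || exit 1` before deploying and register `trap release_lock EXIT` right after, so the lock is released even when the deploy fails.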