diff --git a/README.md b/README.md index dac3471..89230d5 100644 --- a/README.md +++ b/README.md @@ -1,64 +1,111 @@ -# 🌍 iklim.co Altyapı ve Sunucu Yönetimi +# iklim.co Altyapı ve Sunucu Yönetimi -Bu depo, `iklim.co` projesinin **test** ve **production** ortamlarını kurmak, yönetmek ve modernize etmek için gerekli olan Infrastructure-as-Code (IaC) varlıklarını, teknik rehberleri ve operasyonel standartları barındırır. +Bu depo, `iklim.co` test ve production ortamlarını provision etmek, yapılandırmak, işletmek ve modernize etmek için kullanılan Infrastructure-as-Code varlıklarını, kurulum runbook'larını, operasyonel facts dokümanlarını ve planlama notlarını içerir. -Altyapı yönetimi; Hetzner Cloud üzerinde Terraform ile kaynak provisioning, Ansible ile işletim sistemi yapılandırması ve Docker Swarm üzerinde mikroservis mimarisinin kurgulanması süreçlerini kapsar. +Altyapı yönetimi Hetzner Cloud üzerinde Terraform ile kaynak provisioning, Ansible ile işletim sistemi ve Swarm bootstrap, Docker Swarm üzerinde altyapı ve uygulama servislerinin deploy edilmesi süreçlerini kapsar. ---- +## Depo Yapısı -## 📂 Depo Yapısı ve Temel Bölümler +### Terraform (`terraform/`) -Bu depodaki dökümantasyon ve kod varlıkları beş ana kategoriye ayrılmıştır: +Terraform, uzak test ve production ortamları için Hetzner Cloud kaynaklarını tanımlar: -### 1. 🛣️ Roadmap (`roadmap/`) -Ortamların (test ve prod) sıfırdan kurulması veya mevcut yapının güncellenmesi için gerekli olan **iş gereksinimlerini, teknik hedefleri ve adım adım uygulama planlarını** içerir. -- Altyapıda yapılacak büyük değişikliklerin (örn: Redis Sentinel geçişi, APISIX konfigürasyonu, RabbitMQ Quorum Queues) stratejik dökümantasyonudur. -- [roadmap/test-env/](./roadmap/test-env/) - Test ortamı gereksinimleri ve planları. -- [roadmap/prod-env/](./roadmap/prod-env/) - Üretim ortamı HA (High Availability) ve güvenilirlik planları. 
+- `terraform/hetzner/test`: test sunucuları, network, firewall, Floating IP, placement ve outputs. +- `terraform/hetzner/prod`: production app/service node'ları, DB node'ları, private networking, firewall'lar, placement group'lar, Floating IP ve outputs. -### 2. 🛠️ Setup (`setup/`) -Altyapının fiziksel olarak ayağa kaldırılması için kullanılan **uygulama dökümanlarıdır**. Bu bölüm şunları yönetmek için kullanılır: -- **Terraform:** Bulut kaynaklarının (Server, Network, Firewall) üretilmesi. -- **Ansible:** İşletim sistemi hazırlığı, güvenlik sertleştirme (hardening), Docker/Swarm kurulumu. -- **CI/CD:** Deployment workflow'larının (Gitea Actions) ve stack manifest'lerinin oluşturulması/güncellenmesi. -- Örn: [setup/06-prod-terraform-iaac.md](./setup/06-prod-terraform-iaac.md), [setup/07-prod-ansible-bootstrap.md](./setup/07-prod-ansible-bootstrap.md) +Dev ortamı lokal ve Docker Compose tabanlıdır; bu Terraform stack'leri tarafından provision edilmez. -### 3. 🗺️ Setup vs Roadmap Matrisi (`setup-vs-roadmap-map.md`) -İşterler doğrultusunda hazırlanan **Roadmap** dökümanları ile bu isterleri teknik olarak hayata geçiren **Setup** dökümanları arasındaki ilişkiyi açıklar. -- Hangi roadmap adımının hangi setup dökümanı ile uygulandığını gösteren bir eşleşme matrisidir. -- [setup-vs-roadmap-map.md](./setup-vs-roadmap-map.md) dökümanından detaylara ulaşılabilir. +### Ansible (`ansible/`) -### 4. 📊 Hetzner Sizing Report (`hetzner-sizing-report.md`) -İklim altyapı servisleri (API Gateway, Microservices, Databases, Broker) için seçilen **Hetzner sunucu tiplerini, CPU/RAM kapasitelerini ve maliyet/performans analizlerini** anlatır. -- Ortam kurulumundan önce kapasite planlaması için temel referans noktasıdır. -- [hetzner-sizing-report.md](./hetzner-sizing-report.md) dökümanını inceleyin. +Ansible, Terraform provisioning sonrası uzak host'ları hazırlar: -### 5. 
💡 Facts (`facts/`) -Ortam kurulumları tamamlandıktan sonra ortaya çıkan, **sistemin o anki gerçek durumunu (source of truth) ve bilinmesi gereken kritik teknik detayları** barındıran dökümanlardır. -- "Sistem şu an nasıl çalışıyor?" sorusunun cevabıdır. -- [facts/firewall.md](./facts/firewall.md): Aktif firewall kuralları ve port matrisi. -- [facts/swarm-node-recovery-swag-failover.md](./facts/swarm-node-recovery-swag-failover.md): Node düşmesi durumunda manuel müdahale ve recovery prosedürleri. +- `ansible/test`: test bootstrap playbook'ları, inventory ve ortama özel değişkenler. +- `ansible/prod`: production bootstrap playbook'ları, inventory, değişkenler ve prod'a özel rol override'ları. +- `ansible/roles`: `base`, `hardening`, `docker`, `swarm`, `node_dirs`, `storagebox`, `storagebox_ssh_key`, `act_runner` ve ortak `db_stack` gibi paylaşılan roller. ---- +Production, `ansible/prod/ansible.cfg` içinde `roles_path = roles:../roles` kullanır. Bu nedenle `ansible/prod/roles/db_stack` gibi prod-local roller mevcut olduğunda paylaşılan rollerden önce çalışır. -## 🧱 Kurulum Akışı (Kanonik Sıra) +### Setup Runbook'ları (`setup/`) -Bir ortamı sıfırdan kurarken veya majör bir güncelleme yaparken şu sırayı takip edin: +Setup dokümanları, ortamları ayağa kaldırmak veya büyük altyapı değişikliklerini uygulamak için kullanılan kanonik uygulama runbook'larıdır. Güncel dosyalar: -1. **Analiz:** [hetzner-sizing-report.md](./hetzner-sizing-report.md) ile kaynak ihtiyacını belirleyin. -2. **Planlama:** `roadmap/` altındaki ilgili ortam dökümanlarını inceleyerek yapılacak değişiklikleri anlayın. -3. **Hizalama:** [setup-vs-roadmap-map.md](./setup-vs-roadmap-map.md) ile hangi setup dökümanlarını kullanacağınızı netleştirin. -4. **Uygulama:** `setup/` dökümanlarını (00'dan 09'a kadar) sırasıyla takip ederek Terraform ve Ansible süreçlerini işletin. -5. **Doğrulama:** Kurulum sonrası sistemin çalışma prensipleri için `facts/` dökümanlarını referans alın. 
+- [setup/00-general-roadmap.md](./setup/00-general-roadmap.md) +- [setup/01-private-network-port-matrix.md](./setup/01-private-network-port-matrix.md) +- [setup/02-test-terraform-iac.md](./setup/02-test-terraform-iac.md) +- [setup/03-test-ansible-bootstrap.md](./setup/03-test-ansible-bootstrap.md) +- [setup/04-test-db-docker-setup.md](./setup/04-test-db-docker-setup.md) +- [setup/05-test-runner-and-deploy-prerequisites.md](./setup/05-test-runner-and-deploy-prerequisites.md) +- [setup/06-prod-terraform-iac.md](./setup/06-prod-terraform-iac.md) +- [setup/07-prod-ansible-bootstrap.md](./setup/07-prod-ansible-bootstrap.md) +- [setup/08-prod-db-cluster-setup.md](./setup/08-prod-db-cluster-setup.md) +- [setup/09-prod-runner-ha-and-swarm.md](./setup/09-prod-runner-ha-and-swarm.md) ---- +Bu dokümanlar Terraform, Ansible, Swarm label'ları, StorageBox path'leri, runner ön koşulları, DB servisleri ve production Swarm deploy modelinin birlikte nasıl çalıştığını açıklar. -## ✅ Ön Koşullar ve Araçlar +### Roadmap (`roadmap/`) -- **Terraform >= 1.6**: Altyapı provisioning. -- **Ansible**: Konfigürasyon yönetimi. -- **Hetzner Cloud API Token**: Ortam bazlı yetkilendirme. -- **SSH Key**: Sunucu erişimi için sisteme tanımlı anahtar çifti. +Roadmap dokümanları test ve production değişiklikleri için gereksinimleri, tasarım hedeflerini ve migration planlarını açıklar: ---- -*iklim.co Infrastructure Team - 2026* +- [roadmap/test-env/](./roadmap/test-env/) +- [roadmap/prod-env/](./roadmap/prod-env/) + +Roadmap dokümanlarını amaç ve tasarım bağlamı için kullanın. Güncel uygulama akışı için setup runbook'larını kullanın. + +### Setup vs Roadmap Map + +[setup-vs-roadmap-map.md](./setup-vs-roadmap-map.md), roadmap maddelerini bu maddeleri hayata geçiren setup dokümanları ve implementation alanları ile eşler. + +### Facts (`facts/`) + +Facts dokümanları güncel durum detaylarını ve operasyonel geçmişi korur: + +- [facts/firewall.md](./facts/firewall.md): aktif firewall ve port bilgileri. 
+- [facts/node-recovery-failover.md](./facts/node-recovery-failover.md): node recovery ve failover prosedürleri. +- [facts/prod-kurulum-gecmisi.md](./facts/prod-kurulum-gecmisi.md): production kurulum geçmişi ve güncel production notları. + +Facts dokümanlarını “sistem şu an nasıl çalışıyor?” sorusu, tarihsel bağlam ve setup sonrası doğrulama için kullanın. + +### Hetzner Sizing Report + +[hetzner-sizing-report.md](./hetzner-sizing-report.md), altyapı servisleri, veritabanları, broker'lar ve uygulama workload'ları için sunucu sizing, CPU/RAM seçimleri ve maliyet/performans değerlendirmelerini açıklar. + +### Confluence Export (`confluence-wiki/`) + +`confluence-wiki/`, altyapı notlarının repository dışına yayınlanması veya mirror edilmesi gerektiğinde kullanılan wiki odaklı/export edilmiş dokümantasyon materyallerini içerir. + +## Güncel Production Modeli + +Production şu anda ayrık altyapı modeli kullanır: + +- Ana infra ve DB stack: root `docker-stack-infra_db-prod.yml`. +- Vault stack: root `docker-stack-vault.yml`. +- Vault bootstrap: root `init/vault/vault-bootstrap.sh`; production deploy akışında `init-infra-prod.sh` üzerinden çağrılır. +- Production pipeline source of truth: root `.gitea/workflows/deploy-prod.yml` ve root `prod_env-ci_dc-pipeline.md`. + +`docker-stack-infra_db-prod.yml` bilinçli olarak karma bir stack'tir: + +- Patroni/PostgreSQL, MongoDB ve etcd gibi DB/cluster servisleri `iklim-db-*` node'larında çalışır ve gerektiği yerde host-mode cluster portları kullanır. +- Redis, Redis Sentinel ve RabbitMQ gibi service-node altyapı servisleri `node.labels.type == service` app/service node'larında çalışır ve stack veya reverse proxy tarafından açıkça expose edilmedikçe Docker overlay network üzerinde kalır. + +## Kanonik Kurulum Akışı + +Yeni bir ortam veya büyük bir altyapı güncellemesi için: + +1. [hetzner-sizing-report.md](./hetzner-sizing-report.md) dosyasını inceleyin. +2. Tasarım amacını anlamak için ilgili `roadmap/` dokümanlarını inceleyin. 
+3. Her roadmap maddesinin hangi setup runbook'u ile uygulandığını görmek için [setup-vs-roadmap-map.md](./setup-vs-roadmap-map.md) dosyasını kontrol edin. +4. Hedef ortam için numaralı `setup/` runbook'larını sırayla takip edin. +5. Güncel davranışı, recovery prosedürlerini, firewall durumunu ve production geçmişini doğrulamak için `facts/` dokümanlarını kullanın. + +## Gerekli Araçlar + +- Terraform `>= 1.6` +- Ansible +- Hedef ortam için Hetzner Cloud API token +- Sunucu erişimi için yetkili SSH key pair + +## Notlar + +- Dev ortamı lokal ve Docker Compose tabanlıdır; uzak Terraform/Ansible otomasyonu test ve production ortamlarını hedefler. +- Test daha küçük bir uzak ortamdır ve single-node DB/App varsayımlarına dayanır. +- Production üç app/service node ve üç DB node içeren high-availability uzak ortamdır. diff --git a/ansible/prod/roles/db_stack/templates/patroni.yml.j2 b/ansible/prod/roles/db_stack/templates/patroni.yml.j2 index 8a057d1..2995868 100644 --- a/ansible/prod/roles/db_stack/templates/patroni.yml.j2 +++ b/ansible/prod/roles/db_stack/templates/patroni.yml.j2 @@ -15,7 +15,7 @@ etcd3: - etcd-02:2379 - etcd-03:2379 username: root - password: "{{ vault_etcd_root_password }}" + password: "${ETCD_ROOT_PASSWORD}" bootstrap: dcs: diff --git a/facts/swarm-node-recovery-swag-failover.md b/facts/node-recovery-failover.md similarity index 54% rename from facts/swarm-node-recovery-swag-failover.md rename to facts/node-recovery-failover.md index f93e292..e742be3 100644 --- a/facts/swarm-node-recovery-swag-failover.md +++ b/facts/node-recovery-failover.md @@ -1,4 +1,4 @@ -# Docker Swarm — Node Recovery +# Test — Docker Swarm Node Recovery Test ortamında tek manager (`iklim-app-01`) ve tek worker (`iklim-db-01`) bulunur. Hangi node'un yeniden kurulduğuna göre recovery süreci farklılaşır. @@ -32,17 +32,19 @@ DB verileri `iklim-db-01`'deki named volume'larda korunur, kayıp yaşanmaz. Yeni `iklim-db-01` Swarm'dan habersiz başlar (`inactive`). 
Manager (`iklim-app-01`) eski dead node kaydını tutar. +> ⚠️ **Veri kaybı:** `iklim-db-01` yeniden kurulduğu için tüm named volume'lar silinmiştir. 3. adım öncesinde backup'tan restore yapılması zorunludur. + ### Çözüm ```bash -# 1. Ansible bootstrap — yeni node otomatik join olur -cd ansible/test -ansible-playbook -i inventory/generated/test.yml test-bootstrap.yml --ask-vault-pass - -# 2. iklim-app-01 üzerinde — eski dead node kaydını temizle +# 1. iklim-app-01 üzerinde — eski dead node kaydını temizle (bootstrap'tan ÖNCE yapılmalı) docker node ls # eski node ID'yi bul docker node rm +# 2. Ansible bootstrap — yeni node otomatik join olur +cd ansible/test +ansible-playbook -i inventory/generated/test.yml test-bootstrap.yml --ask-vault-pass + # 3. DB stack'i yeniden deploy et (backup'tan restore sonrası) ansible-playbook -i inventory/generated/test.yml test-db-post-stack.yml --ask-vault-pass ``` @@ -68,7 +70,7 @@ ansible-playbook -i inventory/generated/test.yml test-db-post-stack.yml --ask-va | Senaryo | Manuel Adım | Ansible Yeterli mi? | |---|---|---| | Manager (`iklim-app-01`) ölür | `docker swarm leave --force` (worker'da) | Sonrasında evet | -| Worker (`iklim-db-01`) ölür | `docker node rm ` (manager'da) | Büyük ölçüde evet | +| Worker (`iklim-db-01`) ölür | `docker node rm ` (manager'da, bootstrap'tan önce) | Hayır — backup restore gerekir | | Her ikisi ölür | Yok | Evet | ## Neden Prod'da Bu Sorun Yok @@ -81,6 +83,8 @@ Prod ortamında birden fazla manager node (en az 3) çalıştırılır. Tek mana SWAG, cert-reloader, Prometheus ve Grafana cluster-native (replicated) değildir; her zaman tek instance çalışırlar ve varsayılan olarak `iklim-app-01`'e (Floating IP node) sabitlenmişlerdir. `iklim-app-01` çöktüğünde bu servisler durur; DNS/HTTPS erişimi ve izleme (monitoring) kesilir. Swarm quorum 2 manager ile devam eder; mikroservisler ve Vault başka node'lara taşınır. 
+`cert-distributor` bu kuralın dışındadır: `mode: global` ile `node.labels.type == service` olan tüm node'larda çalışır; StorageBox'tan sertifikayı node-lokal `/opt/iklimco/ssl`'e kopyalar (Vault FUSE mount kısıtlaması nedeniyle). `iklim-app-01` düştüğünde diğer node'lardaki `cert-distributor` instance'ları çalışmaya devam eder — failover gerektirmez. + Tüm bu servislerin verileri ve konfigürasyonları StorageBox'ta tutulur: - **SWAG:** `/mnt/storagebox/swag/config` - **SSL:** `/mnt/storagebox/ssl` @@ -91,12 +95,12 @@ Tüm bu servislerin verileri ve konfigürasyonları StorageBox'ta tutulur: ### 1. Servisleri Başka Node'a Taşı -SWAG ve cert-reloader birlikte taşınmalıdır. Prometheus ve Grafana da bağımsız olarak veya aynı anda taşınabilir. +SWAG ve cert-reloader birlikte taşınmalıdır. Prometheus ve Grafana da bağımsız olarak veya aynı anda taşınabilir. `cert-distributor` global mode'da çalıştığından taşıma gerekmez. ```bash # iklim-app-02 veya iklim-app-03 üzerinde (aktif manager): -# SWAG & Cert-Reloader taşıma +# SWAG & Cert-Reloader taşıma (replicas=1 olduğundan taşıma sırasında kısa kesinti yaşanır) docker service update --constraint-add "node.hostname == iklim-app-02" --constraint-rm "node.hostname == iklim-app-01" iklimco_swag docker service update --constraint-add "node.hostname == iklim-app-02" --constraint-rm "node.hostname == iklim-app-01" iklimco_cert-reloader @@ -121,8 +125,12 @@ hcloud floating-ip assign 4. `iklim-prod-app-fip` satırının sağındaki **⋮** (üç nokta) menüsünü aç → **Reassign**. 5. Açılan listeden **`iklim-app-02`**'yi seç → **Reassign** butonuna tıkla. +> **Not:** Floating IP Hetzner panelinde yeniden atandıktan sonra `iklim-app-02`'nin network interface'inde de aktif olması gerekir. Ansible bootstrap bu konfigürasyonu yapıyorsa otomatiktir; emin olmak için `ip addr show` ile Floating IP'nin bind edildiğini doğrula. + ### 3. 
Doğrula +SWAG başlama ve sertifika kontrolü birkaç saniye sürebilir; servis `Running` görünse de ilk `curl` başarısız dönebilir. Birkaç saniye bekleyip tekrar dene. + ```bash docker service ls | grep -E 'swag|cert-reloader|prometheus|grafana' curl -si https://api.iklim.co/health @@ -133,6 +141,9 @@ curl -si https://api.iklim.co/health Node Swarm'a yeniden katıldıktan sonra tüm servisleri tekrar `iklim-app-01`'e taşıyıp Floating IP'yi geri aktarabilirsiniz. ```bash +# Önce node'un Swarm'a gerçekten katıldığını doğrula (STATUS = Ready olmalı) +docker node ls + # Servisleri geri taşı for svc in iklimco_swag iklimco_cert-reloader iklimco_prometheus iklimco_grafana; do docker service update --constraint-add "node.hostname == iklim-app-01" --constraint-rm "node.hostname == iklim-app-02" $svc @@ -149,5 +160,62 @@ hcloud floating-ip assign | Swarm quorum | Otomatik — 2 manager yeterli | | Vault, mikroservisler | Otomatik — `node.labels.type == service` constraint ile başka node'a schedule edilir | | SWAG, cert-reloader | Manuel — `docker service update --constraint-*` + Floating IP taşıma | +| cert-distributor | Otomatik — `mode: global`, tüm servis node'larında zaten çalışır | | Prometheus, Grafana | Manuel — `docker service update --constraint-*` | | Veriler & Konfig | StorageBox'ta; failover node hemen erişir, veri kaybı yaşanmaz | + +--- + +# Prod — DB Node Recovery + +Her DB node'u (`iklim-db-01`, `iklim-db-02`, `iklim-db-03`) aynı servis üçlüsünü barındırır: + +| Node | Servisler | +|------|-----------| +| `iklim-db-01` | `etcd-01`, `patroni-01`, `mongodb-01` | +| `iklim-db-02` | `etcd-02`, `patroni-02`, `mongodb-02` | +| `iklim-db-03` | `etcd-03`, `patroni-03`, `mongodb-03` | + +## Senaryo A: Node Geçici Olarak Çöker (Volume'lar Korunur) + +etcd, Patroni ve MongoDB'nin tamamı 3 üyeli HA cluster'lardır; quorum için 2 node yeterlidir. 
+ +| Servis | Etki | Otomatik İyileşme | +|--------|------|-------------------| +| etcd | 2/3 node ile quorum devam eder | Node geri dönünce cluster'a otomatik katılır | +| Patroni | Replica düşerse primary devam eder; primary düşerse etcd üzerinden yeni primary seçilir | Node geri dönünce replica olarak otomatik katılır | +| MongoDB | 2/3 node ile quorum devam eder; gerekirse yeni primary seçilir | Node geri dönünce primary'den initial sync ile güncellenir | + +**Manuel adım gerekmez.** Docker Swarm `restart_policy: on-failure` servisleri otomatik başlatır. + +## Senaryo B: Node Yeniden Kurulur (Volume'lar Silinir) + +etcd named volume'ları node-lokal olduğundan node yeniden kurulunca kaybolur. Patroni ve MongoDB kendi kendine iyileşir; etcd manuel müdahale gerektirir. + +```bash +# Aktif bir etcd container'ından — eski üyeyi cluster'dan çıkar +docker exec -it $(docker ps -q -f name=iklimco_etcd-01) \ + etcdctl member list --endpoints=http://etcd-01:2379,http://etcd-02:2379,http://etcd-03:2379 +# Çıktıdan yeniden kurulan node'un 'sini al: +docker exec -it $(docker ps -q -f name=iklimco_etcd-01) \ + etcdctl member remove --endpoints=http://etcd-01:2379,http://etcd-02:2379,http://etcd-03:2379 + +# Servisleri yeniden başlat (etcd boş volume ile existing cluster'a katılır; +# Patroni primary'den pg_basebackup ile otomatik clone alır; +# MongoDB hostname değişmediyse primary'den otomatik initial sync yapar) +docker service update --force iklimco_etcd-0N +docker service update --force iklimco_patroni-0N +docker service update --force iklimco_mongodb-0N +``` + +> **MongoDB hostname değişirse:** Replica set konfigürasyonu eski hostname'i tutar. `mongosh` ile `rs.remove(":27017")` ardından `rs.add(":27017")` çalıştır. + +> **etcd `ETCD_INITIAL_CLUSTER_STATE`:** Stack dosyasında `new` olarak tanımlıdır (ilk kurulum için). 
Yeniden kurulum senaryosunda Swarm servisi `--force` ile güncellenince etcd boş volume ile başlar ve mevcut cluster'a `existing` modunda katılmaya çalışır. Bitnami etcd image'ı bunu otomatik algılar; sorun yaşanırsa stack dosyasında ilgili node'un `ETCD_INITIAL_CLUSTER_STATE` değerini geçici olarak `existing` yapıp redeploy et, ardından geri al. + +## Özet + +| Servis | Geçici çöküş | Yeniden kurulum | +|--------|-------------|-----------------| +| etcd | Otomatik | Manuel: `member remove` → `service update --force` | +| Patroni | Otomatik | Otomatik: boş dir'den primary'yi clone alır | +| MongoDB | Otomatik | Otomatik (aynı hostname); hostname değişirse `rs.remove` + `rs.add` | diff --git a/facts/prod-kurulum-gecmisi.md b/facts/prod-kurulum-gecmisi.md index 3b40953..64c5d87 100644 --- a/facts/prod-kurulum-gecmisi.md +++ b/facts/prod-kurulum-gecmisi.md @@ -2,6 +2,11 @@ Prod kurulum adımları ve mevcut yapı. +Bu dosya kurulum geçmişini korur. Güncel prod deploy akışı için ana kaynak +repo kökündeki `prod_env-ci_dc-pipeline.md` dosyasıdır. Aşağıdaki manuel deploy +adımları, ilk kurulum ve sorun giderme geçmişi olarak tutulur; normal prod deploy +artık root `.gitea/workflows/deploy-prod.yml` üzerinden yürür. + ## Terraform ### Hetzner Cloud Yapılandırması @@ -166,7 +171,27 @@ ansible-playbook prod-bootstrap.yml \ --vault-password-file=../.vault_pass ``` -## DB Stack Deploy +## Güncel Production Deploy Kaynakları + +| Alan | Güncel kaynak | +| --- | --- | +| Root prod workflow | `.gitea/workflows/deploy-prod.yml` | +| Detaylı CI/CD dokümanı | `prod_env-ci_dc-pipeline.md` | +| Ana infra stack | `docker-stack-infra_db-prod.yml` | +| Vault HA stack | `docker-stack-vault.yml` | +| Vault bootstrap script | `init/vault/vault-bootstrap.sh` | +| Prod env ve secret dosyaları | `prod/secrets/iklim.co/.env`, `.env.secrets.*` | + +Güncel yapıda `.deleted` suffix'li eski stack dosyaları yoktur ve prod akışında +dikkate alınmaz. 
Ana infra stack `docker-stack-infra_db-prod.yml` dosyasıdır. +Vault stack'i bu dosyanın içinde değildir; `vault-bootstrap.sh` tarafından +`docker-stack-vault.yml` ile deploy edilir. + +## Tarihsel Manuel DB Stack Deploy (2026-05-21) + +Bu bölüm ilk prod DB/infra kurulum geçmişini korumak için bırakılmıştır. Güncel +normal akışta bu adımlar elle çalıştırılmaz; root prod workflow ana stack deploy, +Vault bootstrap, MongoDB replica set init ve DB init scriptlerini yönetir. ### Custom Image Build @@ -174,6 +199,9 @@ ansible-playbook prod-bootstrap.yml \ ### Stack Deploy +Tarihsel not: Bu komut bloğundaki `docker-stack-db-prod.yml` artık güncel stack +dosyası değildir. Güncel ana stack `docker-stack-infra_db-prod.yml` dosyasıdır. + ```bash # Lokal → app-01 scp ./docker-stack-* root@178.104.210.41:/home/iklim/ @@ -198,6 +226,10 @@ history -c && history -w ### MongoDB Replica Set Init +Tarihsel not: İlk kurulumda `rs.initiate` elle verilmişti. Güncel root prod +workflow içinde `Initialize MongoDB Replica Set` adımı replica set yoksa +`rs.initiate()`, eksik üye varsa primary üzerinden `rs.add()` çalıştırır. 
+ ```bash ssh root@ @@ -242,26 +274,66 @@ history -c && history -w curl -s http://10.20.20.11:8008/cluster | python3 -m json.tool ``` -## Mevcut Durum (2026-05-21) +## Tarihsel Durum (2026-05-21) -| Adım | Durum | +| Adım | Durum | +| ------------------------------------------------------- | ---------- | +| Terraform — 6 sunucu, ağ, firewall, floating IP | ✅ | +| Ansible base + hardening + docker + node_dirs | ✅ | +| Ansible storagebox + storagebox_ssh_key | ✅ | +| Ansible swarm (3 manager app + 3 worker db) | ✅ | +| Ansible db_labels | ✅ | +| Ansible db_stack (StorageBox DB dizinleri + config) | ✅ | +| Ansible act_runner (3 prod runner Gitea'da Idle) | ✅ | +| DB stack deploy (etcd + MongoDB + Patroni) | ✅ | +| MongoDB replica set init (rs0: 1 primary, 2 secondary) | ✅ | +| Patroni HA cluster (1 leader, 2 replica, lag=0) | ✅ | +| Ana infra stack deploy (docker-stack-infra_db-prod.yml) | ✅ | +| MongoDB rs.initiate (ilk deploy sonrası elle) | ✅ | +| Deploy pipeline ilk çalışma | ⏳ bekliyor | + +## Güncel Durum (2026-06-15) + +| Alan | Güncel durum | | --- | --- | -| Terraform — 6 sunucu, ağ, firewall, floating IP | ✅ | -| Ansible base + hardening + docker + node_dirs | ✅ | -| Ansible storagebox + storagebox_ssh_key | ✅ | -| Ansible swarm (3 manager app + 3 worker db) | ✅ | -| Ansible db_labels | ✅ | -| Ansible db_stack (StorageBox DB dizinleri + config) | ✅ | -| Ansible act_runner (3 prod runner Gitea'da Idle) | ✅ | -| DB stack deploy (etcd + MongoDB + Patroni) | ✅ | -| MongoDB replica set init (rs0: 1 primary, 2 secondary) | ✅ | -| Patroni HA cluster (1 leader, 2 replica, lag=0) | ✅ | -| Ana infra stack deploy (docker-stack-infra_db-prod.yml) | ⏳ bekliyor | -| MongoDB rs.initiate (ilk deploy sonrası elle) | ⏳ bekliyor | -| Deploy pipeline ilk çalışma | ⏳ bekliyor | +| Prod deploy kaynak dokümanı | `prod_env-ci_dc-pipeline.md` | +| Root prod workflow | `.gitea/workflows/deploy-prod.yml` | +| Ana infra stack | `docker-stack-infra_db-prod.yml` | +| Vault HA stack | 
`docker-stack-vault.yml` | +| Vault deploy yöntemi | `init/vault/vault-bootstrap.sh` tarafından bootstrap/deploy | +| Eski `.deleted` stack dosyaları | Silindi, güncel akışta yok | +| Prod env dosyası | StorageBox `prod/secrets/iklim.co/.env` -> workflow workspace `./.env` | +| Shared secrets | StorageBox `prod/secrets/iklim.co/.env.secrets.shared` | +| Service secrets | StorageBox `prod/secrets/iklim.co/.env.secrets.` | +| SWAG secrets | StorageBox `prod/secrets/iklim.co/.env.secrets.swag` | +| MongoDB replica set init | Workflow içinde otomatik/idempotent adım olarak yönetiliyor | +| PostgreSQL init | Patroni primary beklenerek `./init/postgresql/*.sql` ile çalışıyor | +| MongoDB init | Replica set hazırlandıktan sonra `./init/mongodb/*.js` ile çalışıyor | +| DNS update | Workflow GoDaddy API ile `api`, `apigw`, `rabbitmq`, `grafana` A kayıtlarını güncelliyor | + +Güncel prod workflow ana hatlarıyla şu sırayı izler: + +1. StorageBox'tan `.env`, `.env.secrets.shared`, service secret dosyaları ve `.env.secrets.swag` alınır. +2. PostgreSQL ve MongoDB init template'leri `./init/postgresql` ve `./init/mongodb` altına üretilir. +3. Harbor pull login yapılır. +4. SWAG DNS/site config dosyaları hazırlanır. +5. Vault için geçici TLS placeholder cert gerekirse oluşturulur. +6. `rabbitmq_erlang_cookie` Docker secret'ı oluşturulur veya mevcutsa korunur. +7. `docker-stack-infra_db-prod.yml` `iklimco` stack'ine deploy edilir. +8. Runner job container `iklimco-net` overlay network'üne bağlanır. +9. `init-infra-prod.sh` çalışır; bu script Vault bootstrap ve RabbitMQ prod hazırlığını yapar. +10. Vault AppRole ID/Secret ID değerleri ve Docker secrets üretilir. +11. Güncellenen `.env` ve `.env.secrets.*` dosyaları StorageBox'a yüklenir. +12. etcd, APISIX, SWAG certificate, MongoDB replica set, DB init scriptleri ve DNS kayıtları doğrulanır/güncellenir. 
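Step 6 of the workflow above creates the `rabbitmq_erlang_cookie` Docker secret, or keeps it when it already exists. A minimal sketch of that create-or-keep pattern follows. The actual commands in `.gitea/workflows/deploy-prod.yml` are not shown in this document, so the `ensure_secret` helper and the overridable `DOCKER` variable are hypothetical names used only for illustration.

```bash
# Hypothetical helper, not taken from the real workflow.
# DOCKER can be overridden (for example with a stub) to dry-run the logic.
DOCKER="${DOCKER:-docker}"

# ensure_secret NAME: keep an existing Docker secret, otherwise create it
# from the value supplied on stdin.
ensure_secret() {
  local name="$1"
  if "$DOCKER" secret inspect "$name" >/dev/null 2>&1; then
    # Secret already present: leave it untouched.
    echo "kept ${name}"
    return 0
  fi
  # "docker secret create NAME -" reads the secret value from stdin.
  "$DOCKER" secret create "$name" - >/dev/null || return 1
  echo "created ${name}"
}
```

On a Swarm manager this could be called as `openssl rand -hex 32 | ensure_secret rabbitmq_erlang_cookie`. How the real pipeline generates the cookie value is an assumption here. Keeping an existing value matters because all RabbitMQ cluster nodes must share the same Erlang cookie.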
## Önemli Mimari Notlar +### Ana Infra Stack ve Vault Ayrımı (2026-06-15) + +Güncel durumda ana infra stack `docker-stack-infra_db-prod.yml` dosyasıdır. Bu stack Redis master/replica/sentinel, RabbitMQ cluster, APISIX, APISIX Dashboard, Prometheus, Grafana, SWAG, cert-reloader, cert-distributor, etcd, Patroni ve MongoDB replica set servislerini içerir. + +Vault ana infra stack içinde değildir. Vault HA cluster `docker-stack-vault.yml` dosyasıyla, `init/vault/vault-bootstrap.sh` tarafından deploy edilir. Bootstrap akışı placeholder `vault_unseal_key` oluşturur, `iklimco_vault` servisini deploy eder, Vault init/unseal işlemini yapar ve Docker secret'ı gerçek unseal key ile rotate eder. + ### Tek Stack Yaklaşımı (2026-05-26) `docker-stack-infra-prod.yml` ve `docker-stack-db-prod.yml` tek dosyada birleştirildi: `docker-stack-infra_db-prod.yml`. Her iki dosya da aynı `iklimco` stack adına deploy edildiğinden servis isimleri değişmedi. @@ -270,7 +342,9 @@ curl -s http://10.20.20.11:8008/cluster | python3 -m json.tool **Network:** `iklimco-net` artık stack tarafından oluşturulur (MTU=1400, attachable). Ansible `swarm` rolündeki network oluşturma task'ı kaldırıldı. -**MongoDB rs.initiate:** İlk deploy sonrası `rs.initiate` elle verilmeli (DB Stack Deploy bölümüne bakınız). +**MongoDB rs.initiate:** Bu not ilk kurulum dönemine aittir. Güncel prod workflow +`Initialize MongoDB Replica Set` adımında `rs.initiate()` ve gerektiğinde `rs.add()` +işlemlerini yönetir. **Network silinirse:** Stack'i yeniden deploy et — `docker stack deploy -c docker-stack-infra_db-prod.yml iklimco` @@ -278,6 +352,11 @@ curl -s http://10.20.20.11:8008/cluster | python3 -m json.tool `retry_join.leader_api_addr` olarak `iklimco_vault` (Swarm servis adı) kullanılır. Stack-owned network sayesinde Docker DNS bu VIP'i kayıt eder. `leader_tls_server_name: vault.iklim.co` ile `*.iklim.co` sertifikası TLS doğrulamasını geçer. 
+Güncel Vault deploy akışında bu ayar `docker-stack-vault.yml` ve Vault template +dosyaları üzerinden kullanılır. Vault stack deploy'u root workflow'da doğrudan +değil, `init-infra-prod.sh` -> `init/vault/init-prod.sh` -> +`init/vault/vault-bootstrap.sh` zinciriyle yapılır. + ### Runner / iklimco-net (2026-05-26) Act runner config'de `container.network: "bridge"` kullanılır (önceki `iklimco-net`). Workflow'da "Connect Runner to Overlay Network" adımı "Deploy Swarm Stacks" sonrasına taşındı — böylece stack'in oluşturduğu `iklimco-net`'e runner job container bağlanabilir. diff --git a/roadmap/prod-env/01-swarm-init-multinode.md b/roadmap/prod-env/01-swarm-init-multinode.md index 2f9c268..e1407a4 100644 --- a/roadmap/prod-env/01-swarm-init-multinode.md +++ b/roadmap/prod-env/01-swarm-init-multinode.md @@ -41,6 +41,9 @@ This scheme is applied consistently across `docker-stack-infra.yml` and all 10 m `node.role == worker` is intentionally not used anywhere. DB nodes are Swarm workers, but targeting them via `node.role == worker` would also match any future worker-only app nodes. The explicit `node.labels.role == db` label provides precise, unambiguous targeting regardless of Swarm role. +## Otomasyon Notu +**ÖNEMLİ:** Aşağıda listelenen tüm Swarm ilklendirme, join token işlemleri ve node etiketleme (labeling) süreçleri artık manuel yapılmamaktadır. Bu işlemler `Environment_Infrastructure/ansible/prod/prod-bootstrap.yml` ve ortak `swarm` rolü tarafından **tamamen otomatik** olarak yürütülmektedir. Buradaki manuel bash komutları yalnızca referans, bilgi ve sorun giderme (troubleshooting) amaçlı tutulmaktadır. + ## Step 1 — Init Swarm on iklim-app-01 (the prod-runner node) ```bash @@ -102,7 +105,7 @@ docker node update --label-add role=db --label-add db-index=03 iklim-db-03 > DB nodes are Swarm **workers** only — they never become managers. > DB services are pinned to them via `node.labels.role == db` placement constraint. 
-> See `08-prod-db-cluster-kurulum.md` for DB stack deployment. +> See `08-prod-db-cluster-setup.md` for DB stack deployment. ## Step 6 — Verify diff --git a/roadmap/prod-env/02-godaddy-credentials.md b/roadmap/prod-env/02-godaddy-credentials.md index 4db58d6..97c674b 100644 --- a/roadmap/prod-env/02-godaddy-credentials.md +++ b/roadmap/prod-env/02-godaddy-credentials.md @@ -60,7 +60,7 @@ To get the Floating IP: `terraform output prod_floating_ip` Logic: for each record, pipeline queries the current value via GoDaddy API. If already correct, it skips. Otherwise it creates/updates the record. -> The Floating IP is assigned to `iklim-app-01` (`06-prod-terraform-iaac.md` — `floating_ip.tf`). +> The Floating IP is assigned to `iklim-app-01` (`06-prod-terraform-iac.md` — `floating_ip.tf`). > If failover is needed, the Floating IP can be reassigned to another app node; DNS does not change. ## Notes diff --git a/roadmap/prod-env/03-infra-stack-changes.md b/roadmap/prod-env/03-infra-stack-changes.md index 54abe21..1cc31d1 100644 --- a/roadmap/prod-env/03-infra-stack-changes.md +++ b/roadmap/prod-env/03-infra-stack-changes.md @@ -1,702 +1,75 @@ -# 03 — docker-stack-infra.yml Changes (Prod) +# 03 — Production Infrastructure and DB Stack Model ## Context -### File strategy — overlay approach +This document records the production infrastructure target that is now implemented by the current setup runbooks. The execution source is no longer the old base-plus-prod overlay model. 
-Prod-specific service changes are **not written directly** into `docker-stack-infra.yml`; they are kept in a separate overlay file: +Current references: -| File | Usage | -|------|-------| -| `docker-stack-infra.yml` | Base — works as-is for test | -| `docker-stack-infra.prod.yml` | Prod overlay — additional services and overrides | +- Setup source: `../../setup/08-prod-db-cluster-setup.md` and `../../setup/09-prod-runner-ha-and-swarm.md` +- Main infra and DB stack: root `docker-stack-infra_db-prod.yml` +- Vault stack: root `docker-stack-vault.yml` +- Vault bootstrap: root `init/vault/vault-bootstrap.sh`, called through `init-infra-prod.sh` -```bash -# Test deploy: -docker stack deploy -c docker-stack-infra.yml iklimco +## Current Stack Strategy -# Prod deploy (Swarm merges both files): -docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco -``` +Production uses a split stack model: -Docker Swarm merge rule: if the same service name appears in both files, the overlay wins (deploy, environment, etc.); services only present in the overlay are added. +- `docker-stack-infra_db-prod.yml`: APISIX, APISIX Dashboard, SWAG, cert services, Redis/Sentinel, RabbitMQ, Prometheus, Grafana, Patroni/PostgreSQL, MongoDB, and etcd. +- `docker-stack-vault.yml`: Vault Raft cluster only. -### Prod-specific changes summary -- APISIX: 1 → 3 replicas (overlay override) -- Redis: single-instance → Sentinel cluster — 1 master + 2 replicas + 3 sentinels (overlay adds new services) -- RabbitMQ: 1 → 3-node Erlang cluster (overlay override + env) -- Vault: 1 → 3-node Raft cluster (overlay override) — see `07-vault-raft-plan.md` -- No separate APISIX etcd: Patroni etcd is shared (`/apisix` prefix) -- `init/apisix-core/init.sh`: when `PROFILE=prod`, rate limit `policy:local` → `policy:redis` +The previous `docker-stack-infra.yml` + `docker-stack-infra.prod.yml` overlay strategy is superseded for production. 
Do not create or deploy `docker-stack-infra.prod.yml` for the current prod environment. -### swag-vl volume — not used in prod, not defined in overlay +## Placement Boundary -Test-env Step 9 adds the `swag-vl` named volume to the base file. In prod, SWAG mounts to the StorageBox via the `${SWAG_CONFIG_DIR}` env var, so this volume is unused by any service. No need to remove it in the overlay — Swarm does not create unused volume definitions, it remains harmless. +`docker-stack-infra_db-prod.yml` is intentionally a mixed stack. The placement model is the important boundary: -No `swag-vl` definition is made in `docker-stack-infra.prod.yml`. +- DB/cluster services run on `iklim-db-*`: Patroni/PostgreSQL, MongoDB, and etcd. +- App/service-node infrastructure runs on `iklim-app-*` with `node.labels.type == service`: Redis, Redis Sentinel, RabbitMQ, APISIX, APISIX Dashboard, SWAG, cert-reloader/cert-distributor, Prometheus, and Grafana. +- Redis and RabbitMQ are not DB-node host-mode services. They stay on the overlay network unless explicitly exposed by the stack or SWAG/APISIX. -### Monitoring Persistence +DB services that require direct cluster traffic publish host-mode ports where the current stack defines them. Redis and RabbitMQ must not be changed to host-mode just because they live in the same stack file. -Prometheus and Grafana run as single instances, but their storage profiles are different: -- **Prometheus:** keep TSDB on a local Docker volume (`prometheus-vl`). Prometheus local storage should not run on StorageBox/DAVFS because of filesystem semantics and WAL/compaction I/O. -- **Grafana:** keep `/var/lib/grafana` on StorageBox (`/mnt/storagebox/grafana/data`) so dashboards, plugins, and the SQLite database are available if the single active instance is manually moved to another node. +## Current Production Services -Grafana uses the `GRAFANA_DATA_DIR` env var with a named-volume fallback for test. Prometheus continues to use the named Docker volume. 
See Step 9 for implementation details. +| Area | Current model | +| --- | --- | +| APISIX | 3 replicas on service nodes; config stored in etcd with `/apisix` prefix | +| Redis | Sentinel model on service nodes; overlay-only | +| RabbitMQ | 3-node service-node cluster; management exposed through SWAG, restricted by IP | +| Vault | Separate 3-node Raft stack via `docker-stack-vault.yml` | +| PostgreSQL | 3-node Patroni cluster on DB nodes | +| MongoDB | 3-node replica set on DB nodes | +| etcd | 3-node cluster on DB nodes, shared by Patroni and APISIX | +| Prometheus | Single instance; local Docker volume | +| Grafana | Single instance; StorageBox-backed data path | -**Note:** PostgreSQL and MongoDB are not in `docker-stack-infra.yml`. See `08-prod-db-cluster-kurulum.md`. +## Monitoring Persistence -## Step 1 — Apply all test-env changes first +Prometheus TSDB remains on a local Docker volume because StorageBox/DAVFS is not suitable for Prometheus WAL and compaction I/O. -Follow every step in `test-env/03-infra-stack-changes.md`: -- Add `swag` service -- Add `cert-reloader` service -- Remove published ports for vault, apisix, rabbitmq, prometheus, grafana, apisix-dashboard -- Add `swag-vl` volume +Grafana uses `/mnt/storagebox/grafana/data` through `GRAFANA_DATA_DIR` so dashboards, plugins, and the SQLite database survive manual service movement between service nodes. -## Step 2 — Vault: 3-node Raft cluster (prod) +## APISIX and etcd -Vault starts directly with 3 replicas; the Phase 1 single-instance stage is skipped in prod. -See `07-vault-raft-plan.md` Phase 2 for detailed setup steps. +APISIX uses the DB-node etcd cluster through overlay DNS aliases such as `etcd-01`, `etcd-02`, and `etcd-03`. Patroni and APISIX use different etcd prefixes, so their data does not collide. 
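As a rough illustration of the prefix separation described above, the following shell sketch lists the etcd client endpoints for the three overlay aliases and checks that two key prefixes do not contain one another. It is not part of this change set: the helper names `etcd_endpoints` and `prefixes_overlap` are hypothetical, and the `http` scheme, port `2379`, and the `/service` prefix for Patroni are taken from the earlier roadmap notes, so they may not match the current stack files.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; not part of the repository.

# Print one etcd client endpoint per overlay DNS alias.
# Scheme and port are assumptions carried over from the older roadmap notes.
etcd_endpoints() {
  local alias
  for alias in etcd-01 etcd-02 etcd-03; do
    printf 'http://%s:2379\n' "$alias"
  done
}

# Return 0 when one key prefix contains the other, meaning two consumers
# could read or overwrite each other's keys. Return 1 otherwise.
prefixes_overlap() {
  # Normalize both prefixes so they end with exactly one trailing slash.
  local a="${1%/}/" b="${2%/}/"
  case "$a" in "$b"*) return 0 ;; esac
  case "$b" in "$a"*) return 0 ;; esac
  return 1
}

# Assumed prefixes: Patroni under /service, APISIX under /apisix.
if prefixes_overlap /service /apisix; then
  echo "collision"
else
  echo "disjoint"
fi
```

With `/service` and `/apisix` the check prints `disjoint`. A result of `collision` would mean the two consumers share a key subtree, which is the situation the separate prefixes are meant to avoid.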
-```yaml -vault: - deploy: - mode: replicated - replicas: 3 - placement: - max_replicas_per_node: 1 - constraints: - - node.labels.type == service -``` +The app subnet to DB subnet firewall rule for etcd client traffic is part of the current production firewall model. See `../../setup/06-prod-terraform-iac.md`. -## Step 3 — APISIX: 3 replicas + init.sh rate limit update (prod overlay) +## Redis and RabbitMQ -Add to `docker-stack-infra.prod.yml`: +Redis/Sentinel and RabbitMQ are service-node infrastructure. Their placement follows `node.labels.type == service`. -```yaml -# docker-stack-infra.prod.yml -services: - apisix: - deploy: - mode: replicated - replicas: 3 - placement: - max_replicas_per_node: 1 - constraints: - - node.labels.type == service +RabbitMQ-related private firewall rules belong to the app/service-node firewall model. Redis and Sentinel do not publish host-mode ports in the current prod stack and do not require Hetzner firewall openings. - apisix-dashboard: - deploy: - mode: replicated - replicas: 3 - placement: - max_replicas_per_node: 1 - constraints: - - node.labels.type == service -``` +## Historical / Superseded by Setup -APISIX and apisix-dashboard are stateless (config lives in Patroni etcd) — 3 replicas is safe. -Swarm distributes SWAG requests to APISIX replicas via VIP (IPVS round-robin). +The following earlier roadmap ideas are retained only as historical context: -### init.sh — rate limit policy:redis (prod) +- Creating `docker-stack-infra.prod.yml` as a prod overlay. +- Deploying prod with `docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco`. +- Keeping Vault inside the prod infra overlay with `/opt/iklimco/vault/data` host-path storage. +- Treating PostgreSQL/MongoDB as separate DB stacks such as `docker-stack-db.prod.yml`. +- Validating a prod merge with `docker stack config -c docker-stack-infra.yml -c docker-stack-infra.prod.yml`. 
-With `policy:local`, each APISIX instance counts independently → the global limit effectively becomes 3× with 3 replicas. -Switch to `policy:redis` for `PROFILE=prod`. - -Keep the following APISIX plugin limits in `init/apisix-core/init.sh` for `test/prod` unless stated otherwise: - -| Scope | Plugin | Target limit | -|-------|--------|--------------| -| WebSocket `/ws` | `limit-conn` | `conn: 5` per `remote_addr` | -| Auth routes `/v1/auth/*`, `/v1/users/*` | `limit-count` | `count: 12`, `time_window: 60` per `remote_addr` | -| Global rule | `limit-count` | `count: 60`, `time_window: 60` per `remote_addr` | - -Update the rate limit and connection limit blocks in `init/apisix-core/init.sh`. - -**1. Define threshold constants at the script header:** - -```bash -GLOBAL_LIMIT_COUNT=60 -GLOBAL_LIMIT_WINDOW=60 -AUTH_LIMIT_COUNT=12 -AUTH_LIMIT_WINDOW=60 -WS_LIMIT_CONN=5 -``` - -**2. Update WebSocket route plugins (test/prod):** - -```bash -if [[ "$PROFILE" != "dev" ]]; then - WS_PLUGINS=',"plugins":{"limit-conn":{"conn":'"$WS_LIMIT_CONN"',"burst":2,"default_conn_delay":0.1,"key":"remote_addr","key_type":"var","rejected_code":429}}' -else - WS_PLUGINS="" -fi -``` - -**3. Update Auth route plugins (test/prod):** - -```bash -if [[ "$PROFILE" != "dev" ]]; then - AUTH_LIMIT=',"plugins":{"limit-count":{"count":'"$AUTH_LIMIT_COUNT"',"time_window":'"$AUTH_LIMIT_WINDOW"',"key_type":"var","key":"remote_addr","rejected_code":429,"policy":"local"}}' -else - AUTH_LIMIT="" -fi -``` - -**4. 
Update Global rate limit rule (test/prod):** - -```bash -if [[ "$PROFILE" != "dev" ]]; then - if [[ "$PROFILE" == "prod" ]]; then - RATE_POLICY="redis" - RATE_REDIS=',"redis_host":"redis","redis_port":6379,"redis_password":"'"$REDIS_PASSWORD"'"' - else - RATE_POLICY="local" - RATE_REDIS="" - fi - - call_api "global rate limit" -X PUT "$APISIX_ADMIN_URL/global_rules/1" \ - -H "X-API-KEY: $API_KEY" -H "Content-Type: application/json" \ - -d '{"plugins":{"limit-count":{"count":'"$GLOBAL_LIMIT_COUNT"',"time_window":'"$GLOBAL_LIMIT_WINDOW"',"key_type":"var","key":"remote_addr","rejected_code":429,"policy":"'"$RATE_POLICY"'","allow_degradation":true'"$RATE_REDIS"'}}}' -fi -``` - -> APISIX's `limit-count` plugin does not natively support Redis Sentinel; `policy:redis` works with a single endpoint. -> The `redis` service name stays constant within Swarm overlay DNS. `allow_degradation: true` ensures that if Redis is -> temporarily unreachable (e.g. Sentinel failover ~10-30 s, or master rescheduling), APISIX passes requests through -> instead of returning errors — rate limiting is briefly suspended but API access is unaffected. -> Microservices use Spring Data Redis Sentinel natively and are unaffected by master changes. -> Docker Swarm has no inter-service anti-affinity; the `redis` master placement relies on Swarm's spread strategy -> to avoid co-locating with a replica. This is a known limitation — accepted in favour of operational simplicity. - -## Step 4 — etcd: Separate APISIX etcd removed — Patroni etcd shared - -The standalone `etcd` service in `docker-stack-infra.yml` is **not used in prod and must be disabled** by setting `replicas: 0` in the prod overlay. -APISIX uses the 3-node Patroni etcd cluster running on DB nodes, via the `/apisix` prefix. - -### Why consolidated? -- A standalone single-instance etcd was a SPOF for APISIX. -- Patroni etcd is already 3-node HA — APISIX gets a more reliable config store. 
-- etcd supports prefix-based namespacing; Patroni uses `/service/`, APISIX uses `/apisix/` — no collision. - -### APISIX etcd connection configuration - -Update the etcd endpoints in the APISIX service in `docker-stack-infra.yml` to point to DB nodes: - -```yaml -apisix: - environment: - APISIX_STAND_ALONE: "false" - # via apisix/conf/config.yaml or environment: - # etcd: - # host: - # - "http://etcd-01:2379" - # - "http://etcd-02:2379" - # - "http://etcd-03:2379" - # prefix: "/apisix" -``` - -The preferred method is mounting `config.yaml` via a Docker config or volume. etcd endpoints use **overlay DNS aliases** defined in `docker-stack-db.prod.yml` — `etcd-01`, `etcd-02`, `etcd-03` — which are reachable from app nodes via the `iklimco-net` overlay: - -```yaml -# config/apisix/config.yaml -etcd: - host: - - "http://etcd-01:2379" - - "http://etcd-02:2379" - - "http://etcd-03:2379" - prefix: "/apisix" - timeout: 30 -``` - -### Disable standalone etcd in prod overlay - -Docker Swarm overlay files cannot delete services from the base stack, but `replicas: 0` stops the container entirely: - -```yaml -# docker-stack-infra.prod.yml -services: - etcd: - deploy: - replicas: 0 -``` - -### Firewall requirement - -etcd access from app nodes to DB nodes must be open (port 2379, app subnet → DB subnet). Verify from an app node: - -```bash -docker run --rm --network iklimco-net alpine \ - sh -c "wget -qO- http://etcd-01:2379/health" -``` - -## Step 5 — Redis: Sentinel cluster (prod overlay) - -Redis runs as a single instance in test. In prod, Sentinel provides HA. -![[redis-sentinel-vs-cluster.png]] -Bitnami images are used — all configuration is done via env vars, no separate `.conf` file needed. 
- -### Prerequisites - -```bash -# Create Docker secret for Redis password: -openssl rand -hex 32 | docker secret create redis_password - -``` - -### Topology - -``` -any app node: redis (1 replica, spread by Swarm — not pinned) -2 app nodes: redis-replica (2 replicas, max 1/node, spread across app nodes) -all app nodes: redis-sentinel (3 replicas, max 1/node, spread across all app nodes) -``` - -### docker-stack-infra.prod.yml — Redis services - -The existing `redis` service is overridden in the prod overlay as **master**; `redis-replica` and `redis-sentinel` are added as new services. The service name (`redis`) remains unchanged so the APISIX connection config does not need updating. - -```yaml -# docker-stack-infra.prod.yml -services: - redis: # override base single-instance redis → master - image: bitnamisecure/redis:latest - environment: - ALLOW_EMPTY_PASSWORD: no - REDIS_PASSWORD: ${REDIS_PASSWORD} - REDIS_REPLICATION_MODE: master - deploy: - mode: replicated - replicas: 1 - placement: - constraints: - - node.labels.type == service - restart_policy: - condition: any - delay: 5s - labels: - project: co.iklim - - redis-replica: - image: bitnamisecure/redis:latest - environment: - ALLOW_EMPTY_PASSWORD: no - REDIS_REPLICATION_MODE: slave - REDIS_MASTER_HOST: redis - REDIS_MASTER_PORT_NUMBER: "6379" - REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD} - REDIS_PASSWORD: ${REDIS_PASSWORD} - deploy: - mode: replicated - replicas: 2 - placement: - max_replicas_per_node: 1 - constraints: - - node.labels.type == service - preferences: - - spread: node.hostname - restart_policy: - condition: any - delay: 5s - labels: - project: co.iklim - - redis-sentinel: - image: bitnamisecure/redis-sentinel:latest - environment: - REDIS_SENTINEL_MASTER_NAME: prod-master - REDIS_MASTER_HOST: redis - REDIS_MASTER_PORT_NUMBER: "6379" - REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD} - REDIS_SENTINEL_QUORUM: "2" - REDIS_SENTINEL_DOWN_AFTER_MILLISECONDS: "5000" - REDIS_SENTINEL_FAILOVER_TIMEOUT: "10000" - 
deploy: - mode: replicated - replicas: 3 - placement: - max_replicas_per_node: 1 - constraints: - - node.labels.type == service - preferences: - - spread: node.hostname - restart_policy: - condition: any - delay: 5s - labels: - project: co.iklim -``` - -### Microservice connection (Spring Data Redis) - -Microservices must use a Sentinel-aware connection: - -```yaml -# application-prod.yml -spring: - data: - redis: - sentinel: - master: prod-master - nodes: - - redis-sentinel:26379 - password: ${REDIS_PASSWORD} -``` - -### Verification - -```bash -# Query master identity: -docker exec $(docker ps -q -f name=iklimco_redis-sentinel | head -1) \ - redis-cli -p 26379 SENTINEL get-master-addr-by-name prod-master -``` - -## Step 6 — RabbitMQ: 3-node Erlang cluster (prod overlay) - -RabbitMQ runs as a 3-node cluster with one instance per app node. - -### Prerequisites - -```bash -# Create Docker secret for Erlang cookie (must be identical on all nodes): -openssl rand -hex 32 | docker secret create rabbitmq_erlang_cookie - -``` - -### docker-stack-infra.prod.yml — RabbitMQ override - -```yaml -# docker-stack-infra.prod.yml (add alongside redis services) -services: - rabbitmq: - image: rabbitmq:3-management - hostname: "rabbitmq-{{.Node.Hostname}}" - environment: - RABBITMQ_ERLANG_COOKIE_FILE: /run/secrets/rabbitmq_erlang_cookie - RABBITMQ_USE_LONGNAME: "true" - RABBITMQ_NODENAME: "rabbit@rabbitmq-{{.Node.Hostname}}" - secrets: - - rabbitmq_erlang_cookie - networks: - iklimco-net: - aliases: - - "rabbitmq-{{.Node.Hostname}}" - deploy: - mode: replicated - replicas: 3 - placement: - max_replicas_per_node: 1 - constraints: - - node.labels.type == service - update_config: - parallelism: 1 - order: stop-first - labels: - project: co.iklim - -secrets: - rabbitmq_erlang_cookie: - external: true - -networks: - iklimco-net: - external: true -``` - -### Cluster join procedure (first setup) - -RabbitMQ nodes do not form a cluster automatically; manual join is required after first 
start: - -```bash -# Find the RabbitMQ container on iklim-app-02: -CTR=$(docker ps -q -f name=iklimco_rabbitmq) - -# Stop, join, start: -docker exec "$CTR" rabbitmqctl stop_app -docker exec "$CTR" rabbitmqctl join_cluster rabbit@rabbitmq-iklim-app-01 -docker exec "$CTR" rabbitmqctl start_app - -# Repeat for iklim-app-03 -``` - -```bash -# Verify cluster status (from any node): -docker exec "$CTR" rabbitmqctl cluster_status -``` - -> **HA policy:** After the cluster is formed, set quorum queues as the default: -> ```bash -> docker exec "$CTR" rabbitmqctl set_policy ha-all ".*" \ -> '{"queue-type":"quorum"}' --apply-to queues -> ``` - -## Step 7 — RabbitMQ WebSocket Sticky Sessions (Consistent Hash) - -RabbitMQ Web STOMP (over WebSocket) requires a persistent connection. In a 3-node RabbitMQ cluster, if an APISIX instance uses the default Swarm VIP for the `rabbitmq` upstream, it may cause unnecessary inter-node traffic or connection drops if the session doesn't persist on the same node. - -To optimize this, we implement **Consistent Hashing (chash)** at the APISIX layer based on the client's IP address (`remote_addr`). - -### 1. Update APISIX Upstream Configuration (init.sh) - -Update the `rabbitmq` upstream definition in `init/apisix-core/init.sh` to target specific cluster nodes instead of the generic service name, enabling the `chash` algorithm for prod. 
- -```bash -# Update upstream rabbitmq block in init.sh -if [[ "$PROFILE" == "prod" ]]; then - # Direct node DNS names to bypass Swarm VIP and allow chash to work effectively - RABBITMQ_NODES='{"rabbitmq-iklim-app-01:15674":1, "rabbitmq-iklim-app-02:15674":1, "rabbitmq-iklim-app-03:15674":1}' - LB_TYPE="chash" - HASH_KEY="remote_addr" -else - RABBITMQ_NODES='{"rabbitmq:15674":1}' - LB_TYPE="roundrobin" - HASH_KEY="" -fi - -call_api "upstream rabbitmq" -X PUT "$APISIX_ADMIN_URL/upstreams/rabbitmq-upstream" \ - -H "X-API-KEY: $API_KEY" -H "Content-Type: application/json" \ - -d '{ - "name": "rabbitmq-upstream", - "type": "'"$LB_TYPE"'", - "key": "'"$HASH_KEY"'", - "nodes": '"$RABBITMQ_NODES"', - "timeout": {"connect": 10, "send": 3600, "read": 3600}, - "scheme": "http", - '"$HC"' - }' -``` - -### 2. Enable Real IP Detection in APISIX - -Consistent hashing by `remote_addr` requires APISIX to see the actual client IP, not the internal IP of the SWAG (Nginx) proxy. - -> **DNS Note:** For `chash` to work with node-specific names, the RabbitMQ service must have network aliases configured for each node (e.g., `rabbitmq-{{.Node.Hostname}}`) as shown in Step 6. - -In the `config.yaml` inside the custom APISIX image (`custom-apisix:3.12.0`): - -```yaml -nginx_config: - http: - real_ip_header: "X-Real-IP" - set_real_ip_from: "10.0.0.0/8" -``` - -## Step 8 — Create `docker-stack-infra.prod.yml` - -Create this file in the repo root alongside `docker-stack-infra.yml`. 
It combines all prod-specific overrides from Steps 2–6 (including disabling the standalone `etcd` from Step 4): - -```yaml -# docker-stack-infra.prod.yml -# Prod overlay — deploy with: -# docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco - -services: - - vault: - environment: - VAULT_LOCAL_CONFIG: >- - {"api_addr":"https://vault.iklim.co:8200", - "cluster_addr":"https://{{ .Node.Hostname }}:8201", - "storage":{"raft":{"path":"/vault/file","node_id":"{{ .Node.Hostname }}"}}, - "listener":[{"tcp":{"address":"0.0.0.0:8200", - "tls_cert_file":"/vault/certs/STAR.iklim.co.full.crt", - "tls_key_file":"/vault/certs/STAR.iklim.co_key.pem"}}], - "default_lease_ttl":"168h","max_lease_ttl":"720h","ui":true} - volumes: - - /opt/iklimco/vault/data:/vault/file - - ${SWAG_CERT_DIR}:/vault/certs:ro - deploy: - mode: replicated - replicas: 3 - placement: - max_replicas_per_node: 1 - constraints: - - node.labels.type == service - - apisix: - deploy: - mode: replicated - replicas: 3 - placement: - max_replicas_per_node: 1 - constraints: - - node.labels.type == service - - apisix-dashboard: - deploy: - mode: replicated - replicas: 3 - placement: - max_replicas_per_node: 1 - constraints: - - node.labels.type == service - - redis: - image: bitnamisecure/redis:latest - environment: - ALLOW_EMPTY_PASSWORD: no - REDIS_PASSWORD: ${REDIS_PASSWORD} - REDIS_REPLICATION_MODE: master - deploy: - mode: replicated - replicas: 1 - placement: - constraints: - - node.labels.type == service - restart_policy: - condition: any - delay: 5s - labels: - project: co.iklim - - redis-replica: - image: bitnamisecure/redis:latest - environment: - ALLOW_EMPTY_PASSWORD: no - REDIS_REPLICATION_MODE: slave - REDIS_MASTER_HOST: redis - REDIS_MASTER_PORT_NUMBER: "6379" - REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD} - REDIS_PASSWORD: ${REDIS_PASSWORD} - deploy: - mode: replicated - replicas: 2 - placement: - max_replicas_per_node: 1 - constraints: - - node.labels.type == service - 
preferences: - - spread: node.hostname - restart_policy: - condition: any - delay: 5s - labels: - project: co.iklim - - redis-sentinel: - image: bitnamisecure/redis-sentinel:latest - environment: - REDIS_SENTINEL_MASTER_NAME: prod-master - REDIS_MASTER_HOST: redis - REDIS_MASTER_PORT_NUMBER: "6379" - REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD} - REDIS_SENTINEL_QUORUM: "2" - REDIS_SENTINEL_DOWN_AFTER_MILLISECONDS: "5000" - REDIS_SENTINEL_FAILOVER_TIMEOUT: "10000" - deploy: - mode: replicated - replicas: 3 - placement: - max_replicas_per_node: 1 - constraints: - - node.labels.type == service - preferences: - - spread: node.hostname - restart_policy: - condition: any - delay: 5s - labels: - project: co.iklim - - rabbitmq: - image: rabbitmq:3-management - hostname: "rabbitmq-{{.Node.Hostname}}" - environment: - RABBITMQ_ERLANG_COOKIE_FILE: /run/secrets/rabbitmq_erlang_cookie - RABBITMQ_USE_LONGNAME: "true" - RABBITMQ_NODENAME: "rabbit@rabbitmq-{{.Node.Hostname}}" - secrets: - - rabbitmq_erlang_cookie - networks: - iklimco-net: - aliases: - - "rabbitmq-{{.Node.Hostname}}" - deploy: - mode: replicated - replicas: 3 - placement: - max_replicas_per_node: 1 - constraints: - - node.labels.type == service - update_config: - parallelism: 1 - order: stop-first - labels: - project: co.iklim - -secrets: - rabbitmq_erlang_cookie: - external: true - -networks: - iklimco-net: - external: true -``` - -## Step 9 — Monitoring Data Persistence - -Prometheus and Grafana run as single instances. Grafana data is placed on the StorageBox shared filesystem for manual failover. Prometheus TSDB stays on a local Docker volume because DAVFS/StorageBox is not suitable for Prometheus WAL and compaction I/O. 
- -**Changes already applied to `docker-stack-infra.yml`:** - -```yaml -prometheus: - volumes: - - prometheus-vl:/prometheus - -grafana: - volumes: - - ${GRAFANA_DATA_DIR:-grafana-vl}:/var/lib/grafana -``` - -Test uses the named Docker volume fallback (`grafana-vl`) for Grafana, and Prometheus always uses the named Docker volume (`prometheus-vl`) — no test env change needed. - -**Add to `prod/secrets/iklim.co/.env.prod` on storagebox** (already in `env-prod/.env`): - -```bash -GRAFANA_DATA_DIR=/mnt/storagebox/grafana/data -``` - -> `/mnt/storagebox/grafana/data` is created automatically by the Ansible `storagebox` role during bootstrap via the `storagebox_managed_directories` variable. No manual step required. - -> Grafana writes its SQLite database and dashboard JSON to `/var/lib/grafana`. -> Prometheus writes its TSDB to `/prometheus` on the local `prometheus-vl` Docker volume; it is not shared between nodes. - -## Step 10 — Verify - -```bash -# Base file must be valid on its own (test deploy): -docker stack config -c docker-stack-infra.yml > /dev/null && echo "base OK" - -# Prod merge must be valid: -docker stack config -c docker-stack-infra.yml -c docker-stack-infra.prod.yml > /dev/null && echo "prod merge OK" -``` - -## Step 11 — Database Proxies and Developer Access - -In the production environment, the `pg-proxy` and `mongo-proxy` services (socat-based) defined in the base `docker-stack-infra.yml` are **deprecated and will not be used**. - -### Rationale -- **Leader Tracking:** Simple L4 proxies (socat) cannot track the Patroni Leader or MongoDB Primary. They point to a single service VIP, which might lead to a Read-Only replica during failover. -- **HA Connection Strings:** Modern DB drivers (JDBC, libpq, MongoClient) support multi-host connection strings, which provide native failover and load balancing without an intermediate proxy. 
- -### Developer Access Strategy -- **Direct Subnet Access:** Developers connect via WireGuard directly to the DB subnet (`10.20.20.0/24`). -- **No Translation:** Instead of mapping ports like `15432`, the standard ports (`5432`, `27017`) are used across all cluster nodes. - -## Placement and Replica Summary — prod - -| Service | File | Replicas | Placement | HA Note | -| ---------------- | ------------ | -------- | ------------------------------------------- | ------------------------------------------------------------------------------------- | -| swag | base | 1 | `node.hostname == iklim-app-01` | No clustering support; Floating IP pinned to node | -| cert-reloader | base | 1 | `node.hostname == iklim-app-01` | Cron-style task; duplicate would be problematic | -| vault | prod overlay | 3 | `node.labels.type == service`; max 1/node | Raft cluster — see `07-vault-raft-plan.md` | -| apisix | prod overlay | 3 | `node.labels.type == service`; max 1/node | Stateless; config in Patroni etcd; rate limit policy:redis | -| apisix-dashboard | prod overlay | 3 | `node.labels.type == service`; max 1/node | Stateless; reads from etcd | -| redis (master) | prod overlay | 1 | `node.labels.type == service`; Swarm spread | Sentinel cluster master; not pinned — reschedules on node failure | -| redis-replica | prod overlay | 2 | `node.labels.type == service`; max 1/node | Sentinel replica; spread:hostname | -| redis-sentinel | prod overlay | 3 | `node.labels.type == service`; max 1/node | Quorum=2; failover automatic | -| rabbitmq | prod overlay | 3 | `node.labels.type == service`; max 1/node | Erlang cluster; quorum queues | -| prometheus | base | 1 | `node.labels.type == service` | No native HA; Thanos is overkill at this scale | -| grafana | base | 1 | `node.labels.type == service` | Not critical | - -> PostgreSQL and MongoDB run in separate DB stacks on `iklimco-*` nodes. See `08-prod-db-cluster-kurulum.md`. 
-> etcd: 3-node cluster on DB nodes — APISIX shares it via `/apisix` prefix. +For current execution, use the setup runbooks and root stack files listed in the Context section. diff --git a/roadmap/prod-env/07-vault-raft-plan.md b/roadmap/prod-env/07-vault-raft-plan.md index 90892b4..9cb8af2 100644 --- a/roadmap/prod-env/07-vault-raft-plan.md +++ b/roadmap/prod-env/07-vault-raft-plan.md @@ -1,121 +1,83 @@ -# 07 — Vault: 3-Node Raft Cluster (Prod) +# 07 — Vault Raft Stack and Bootstrap Automation (Prod) ## Context -Vault starts directly as a 3-node Raft cluster in prod. The single-instance phase used in test is skipped. -Test used a single Vault instance (file storage, 1 replica on the manager node). Prod goes straight to Raft HA. +Production Vault is a 3-node Raft cluster, but it is no longer initialized through a manual post-deploy runbook. -## Vault service configuration +Current references: -- **Replicas:** 3 (one per service node) -- **Storage:** Raft integrated storage -- **Placement:** `node.labels.type == service` (all 3 app nodes) -- **Cert distribution:** No SSH needed — all nodes mount StorageBox, cert-reloader writes to `SWAG_CERT_DIR=/mnt/storagebox/ssl`, Vault reads from that path on every node +- Setup source: `../../setup/09-prod-runner-ha-and-swarm.md` +- Stack file: root `docker-stack-vault.yml` +- Bootstrap script: root `init/vault/vault-bootstrap.sh` +- Template: root `init/vault/vault-template-v2.json` -### Prerequisites +## Current Model -- [ ] All 3 service nodes are running and labeled `type=service` -- [ ] `/mnt/storagebox/ssl/` directory is mounted and accessible on all 3 app nodes -- [ ] Vault data directory `/opt/iklimco/vault/data/` exists on all 3 nodes (host path volumes) +Vault is deployed separately from `docker-stack-infra_db-prod.yml`. -### Vault service YAML (docker-stack-infra.prod.yml overlay) +The Vault stack uses: -```yaml -vault: - # ... 
(image, secrets, healthcheck unchanged from base) - environment: - VAULT_LOCAL_CONFIG: >- - {"api_addr":"https://vault.iklim.co:8200", - "cluster_addr":"https://{{ .Node.Hostname }}:8201", - "storage":{"raft":{"path":"/vault/file","node_id":"{{ .Node.Hostname }}"}}, - "listener":[{"tcp":{"address":"0.0.0.0:8200", - "tls_cert_file":"/vault/certs/STAR.iklim.co.full.crt", - "tls_key_file":"/vault/certs/STAR.iklim.co_key.pem"}}], - "default_lease_ttl":"168h","max_lease_ttl":"720h","ui":true} - volumes: - - /opt/iklimco/vault/data:/vault/file # host path per node - - ${SWAG_CERT_DIR}:/vault/certs:ro # StorageBox — shared across all nodes, no SSH distribution needed - deploy: - mode: replicated - replicas: 3 - placement: - max_replicas_per_node: 1 - constraints: - - node.labels.type == service +- 3 replicas, one per service node when placement allows it. +- Docker volumes such as `vault-data-vl` and `vault-logs-vl`. +- `/opt/iklimco/ssl:/vault/certs:ro` for TLS certificates. +- `iklimco-net` as an external overlay network. +- `vault_unseal_key` as a Docker secret. + +The production workflow calls `init-infra-prod.sh`, which calls `init/vault/vault-bootstrap.sh`. The bootstrap script handles stack deploy, initialization, unseal key secret rotation, peer join, and peer unseal. + +## Certificate Flow + +Vault does not read TLS certificates directly from `/mnt/storagebox/ssl`. + +The current flow is: + +```text +SWAG renews certificate +cert-reloader copies renewed files to /mnt/storagebox/ssl +cert-distributor syncs certificate files to /opt/iklimco/ssl on service nodes +Vault reads /opt/iklimco/ssl through the /vault/certs mount ``` -> `{{ .Node.Hostname }}` is Docker Swarm's Go template for the node hostname — -> gives each Vault instance a unique `node_id`. +## Bootstrap Flow -## Raft initialization procedure (first deploy) +Normal production bootstrap is automated: -### Step 1 — Deploy the stack +1. 
Create or refresh the placeholder `vault_unseal_key` secret when needed. +2. Deploy `docker-stack-vault.yml`. +3. Initialize Vault with one key share and one threshold if it is not initialized. +4. Replace the placeholder `vault_unseal_key` secret with the real unseal key. +5. Unseal the leader. +6. Join peers to the Raft cluster. +7. Unseal peers. +8. Verify Raft peers and service health. + +These operations belong to `vault-bootstrap.sh`, not to a manual operator checklist. + +## Verification + +Use the current setup verification flow: ```bash -docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco +docker service ps iklimco_vault +docker exec $(docker ps -q -f name=iklimco_vault | head -1) vault status +docker exec $(docker ps -q -f name=iklimco_vault | head -1) vault operator raft list-peers ``` -All 3 Vault containers start. Only the first one to initialize becomes the leader. +Expected state: -### Step 2 — Initialize Vault on the leader (iklim-app-01) +- Vault service has 3 running tasks. +- `vault status` reports `Sealed false`. +- Raft list shows one leader and two followers. -```bash -VAULT_CTR=$(docker ps -q -f name=iklimco_vault) -docker exec -it "$VAULT_CTR" vault operator init -``` +## Historical / Superseded by Setup -Save the unseal keys and root token securely. Store the unseal key as a Docker secret: +The previous manual procedure is superseded: -```bash -echo -n "" | docker secret create vault_unseal_key - -``` +- Deploying Vault through `docker-stack-infra.yml` + `docker-stack-infra.prod.yml`. +- Creating `/opt/iklimco/vault/data` host-path directories on each app node. +- Running `vault operator init` manually. +- Manually copying/storing unseal keys. +- Manually running `vault operator raft join` on peers. +- Manually unsealing each peer after join. 
-### Step 3 — Unseal the leader - -```bash -docker exec -it "$VAULT_CTR" vault operator unseal -``` - -The healthcheck auto-unseals on subsequent restarts via the `vault_unseal_key` secret. - -### Step 4 — Join remaining nodes to the Raft cluster - -On iklim-app-02 and iklim-app-03 containers: - -```bash -docker exec -it vault operator raft join \ - https://vault.iklim.co:8200 - -docker exec -it vault operator raft join \ - https://vault.iklim.co:8200 -``` - -Unseal each node after joining: - -```bash -docker exec -it vault operator unseal -docker exec -it vault operator unseal -``` - -### Step 5 — Verify cluster - -```bash -docker exec "$VAULT_CTR" vault operator raft list-peers -``` - -Expected: 3 peers, one `leader`, two `follower`. - -## cert-reloader — no additional changes needed for Raft - -cert-reloader writes the cert to `SWAG_CERT_DIR=/mnt/storagebox/ssl`. -Since StorageBox is mounted on all app nodes, every Vault instance already sees the same path. - -The cert renewal flow works unchanged with Raft: -``` -cert changed → copy to /mnt/storagebox/ssl/ → docker service update --force iklimco_vault -Vault (3 replicas) restart → each auto-unseals via healthcheck -``` - -## Reference -- Vault Raft storage docs: https://developer.hashicorp.com/vault/docs/configuration/storage/raft -- Vault Swarm setup: https://manjit28.medium.com/setting-up-a-secure-and-highly-available-hashicorp-vault-cluster-for-secrets-and-certificates-0ce01a370582 +Keep those notes only as historical context. For current prod, use `docker-stack-vault.yml` and `init/vault/vault-bootstrap.sh`. diff --git a/setup-vs-roadmap-map.md b/setup-vs-roadmap-map.md index 524469b..2e45ba6 100644 --- a/setup-vs-roadmap-map.md +++ b/setup-vs-roadmap-map.md @@ -1,24 +1,23 @@ # Setup Stages — Roadmap Mapping Table -This table shows, for the roadmap steps in the `roadmap/test-env` and `roadmap/prod-env` folders, -which Terraform/Ansible setup stage handles each of them.
+Bu tablo, `roadmap/test-env` ve `roadmap/prod-env` klasörlerindeki yol haritası adımlarının Terraform/Ansible setup aşamalarından hangisinde ele alındığını gösterir. ## TEST ortamı | Roadmap adımı | Hangi aşamada ele alınmalı | | --- | --- | -| Hetzner firewall (sadece 22/80/443) | **Terraform `02-test-terraform-iaac.md`** — `firewall.tf` | -| Sunucu oluşturma (`iklim-app-01`, `iklim-db-01`) | **Terraform `02-test-terraform-iaac.md`** — `servers.tf` | -| Private network + placement group (`iklim-test-spread`) | **Terraform `02-test-terraform-iaac.md`** — `network.tf`, `placement.tf` | -| Floating IP (`iklim-test-app-fip`) | **Terraform `02-test-terraform-iaac.md`** — `floating_ip.tf` | +| Hetzner firewall (sadece 22/80/443) | **Terraform `02-test-terraform-iac.md`** — `firewall.tf` | +| Sunucu oluşturma (`iklim-app-01`, `iklim-db-01`) | **Terraform `02-test-terraform-iac.md`** — `servers.tf` | +| Private network + placement group (`iklim-test-spread`) | **Terraform `02-test-terraform-iac.md`** — `network.tf`, `placement.tf` | +| Floating IP (`iklim-test-app-fip`) | **Terraform `02-test-terraform-iac.md`** — `floating_ip.tf` | | Docker Engine kurulumu (app + db node) | **Ansible `03-test-ansible-bootstrap.md`** — `docker` role | | Security hardening (SSH, firewalld, fail2ban) | **Ansible `03-test-ansible-bootstrap.md`** — `hardening` role | | Docker Swarm init + `iklim-db-01` worker join | **Ansible `03-test-ansible-bootstrap.md`** — `swarm` role | | `type=service` ve `role=db` node label'ları | **Ansible `03-test-ansible-bootstrap.md`** — `swarm` role | | `/opt/iklimco/...` dizinleri | **Ansible `03-test-ansible-bootstrap.md`** — `node_dirs` role | | StorageBox DAVFS mount (`u469968-sub4`) | **Ansible `03-test-ansible-bootstrap.md`** — `storagebox` role | -| DB stack deploy (PostgreSQL + MongoDB on `iklim-db-01`) | **Manuel `04-test-db-docker-kurulum.md`** | -| `act_runner` systemd kurulumu | **Ansible `05-test-runner-ve-deploy-onkosullari.md`** — `act_runner` 
role (`test-app-post-stack.yml`) | +| DB stack deploy (PostgreSQL + MongoDB on `iklim-db-01`) | **Manuel `04-test-db-docker-setup.md`** | +| `act_runner` systemd kurulumu | **Ansible `05-test-runner-and-deploy-prerequisites.md`** — `act_runner` role (`test-app-post-stack.yml`) | | GoDaddy credentials storagebox'a yükleme | **Manuel kalır** — secret yönetimi, Terraform/Ansible dışı | | `docker-stack-infra.yml` port kaldırma + SWAG/cert-reloader ekleme | **Pipeline `deploy-test.yml`** + **repo değişikliği** — `roadmap/test-env/03` | | SWAG nginx proxy conf'ları (`template/swag/site-confs/*.conf.tpl`) | **Repo içinde teslim edildi** — `roadmap/test-env/04` | @@ -31,22 +30,22 @@ Terraform/Ansible setup aşamalarından hangisinde ele alındığını gösterir | Roadmap adımı | Hangi aşamada ele alınmalı | | --- | --- | -| 6 sunucu oluşturma (`iklim-app-01/02/03`, `iklim-db-01/02/03`) | **Terraform `06-prod-terraform-iaac.md`** — `servers.tf` | -| Private network + 2 placement group | **Terraform `06-prod-terraform-iaac.md`** — `network.tf`, `placement.tf` | -| Firewall (sadece 22/80/443 public; private port matrisi) | **Terraform `06-prod-terraform-iaac.md`** — `firewall.tf` | -| Floating IP (`iklim-prod-app-fip`, `iklim-app-01`'e atanır) | **Terraform `06-prod-terraform-iaac.md`** — `floating_ip.tf` | +| 6 sunucu oluşturma (`iklim-app-01/02/03`, `iklim-db-01/02/03`) | **Terraform `06-prod-terraform-iac.md`** — `servers.tf` | +| Private network + 2 placement group | **Terraform `06-prod-terraform-iac.md`** — `network.tf`, `placement.tf` | +| Firewall (sadece 22/80/443 public; private port matrisi) | **Terraform `06-prod-terraform-iac.md`** — `firewall.tf` | +| Floating IP (`iklim-prod-app-fip`, `iklim-app-01`'e atanır) | **Terraform `06-prod-terraform-iac.md`** — `floating_ip.tf` | | Docker Engine kurulumu (tüm node'lar — app ve db) | **Ansible `07-prod-ansible-bootstrap.md`** — `docker` role | | Security hardening (tüm node'lar) | **Ansible `07-prod-ansible-bootstrap.md`** — 
`hardening` role | | Swarm init (`iklim-app-01`) + manager join (`iklim-app-02/03`) | **Ansible `07-prod-ansible-bootstrap.md`** — `swarm` role | | `type=service` node label (3 app node) | **Ansible `07-prod-ansible-bootstrap.md`** — `swarm` role | | `/opt/iklimco/...` dizinleri + `/opt/iklimco/stacks` | **Ansible `07-prod-ansible-bootstrap.md`** — `node_dirs` role | | StorageBox DAVFS mount (`u469968-sub5`) | **Ansible `07-prod-ansible-bootstrap.md`** — `storagebox` role | -| DB node'larını Swarm'a worker olarak join et | **Manuel `08-prod-db-cluster-kurulum.md`** — Bölüm 2 | -| `role=db` node label (3 db node) | **Manuel `08-prod-db-cluster-kurulum.md`** — Bölüm 2 | -| etcd cluster deploy (Patroni için) | **Manuel `08-prod-db-cluster-kurulum.md`** — Bölüm 5.2 | -| MongoDB replica set deploy | **Manuel `08-prod-db-cluster-kurulum.md`** — Bölüm 4 | -| Patroni + PostgreSQL HA deploy | **Manuel `08-prod-db-cluster-kurulum.md`** — Bölüm 5.4 | -| 3× `act_runner` systemd (HA runner) | **Ansible `09-prod-runner-ha-ve-swarm.md`** — `act_runner` role | +| DB node'larını Swarm'a worker olarak join et | **Manuel `08-prod-db-cluster-setup.md`** — Bölüm 2 | +| `role=db` node label (3 db node) | **Manuel `08-prod-db-cluster-setup.md`** — Bölüm 2 | +| etcd cluster deploy (Patroni için) | **Manuel `08-prod-db-cluster-setup.md`** — Bölüm 5.2 | +| MongoDB replica set deploy | **Manuel `08-prod-db-cluster-setup.md`** — Bölüm 4 | +| Patroni + PostgreSQL HA deploy | **Manuel `08-prod-db-cluster-setup.md`** — Bölüm 5.4 | +| 3× `act_runner` systemd (HA runner) | **Ansible `09-prod-runner-ha-and-swarm.md`** — `act_runner` role | | GoDaddy credentials storagebox'a yükleme | **Manuel kalır** — secret yönetimi, Terraform/Ansible dışı | | `docker-stack-infra.yml` port kaldırma + SWAG/cert-reloader ekleme | **Repo değişikliği** — `roadmap/prod-env/03` | | SWAG nginx proxy conf'ları (`template/swag/site-confs/*.conf.tpl`) | **Repo içinde teslim edildi** — `roadmap/prod-env/04` | @@ -61,16 
+60,16 @@ Terraform/Ansible setup aşamalarından hangisinde ele alındığını gösterir ``` Environment_Infrastructure/ setup/ ← Terraform + Ansible aşama dokümanları - 00-genel-yol-haritasi.md - 01-private-network-port-matrisi.md - 02-test-terraform-iaac.md + 00-general-roadmap.md + 01-private-network-port-matrix.md + 02-test-terraform-iac.md 03-test-ansible-bootstrap.md - 04-test-db-docker-kurulum.md - 05-test-runner-ve-deploy-onkosullari.md - 06-prod-terraform-iaac.md + 04-test-db-docker-setup.md + 05-test-runner-and-deploy-prerequisites.md + 06-prod-terraform-iac.md 07-prod-ansible-bootstrap.md - 08-prod-db-cluster-kurulum.md - 09-prod-runner-ha-ve-swarm.md + 08-prod-db-cluster-setup.md + 09-prod-runner-ha-and-swarm.md roadmap/ test-env/ ← Test ortamı Roadmap adımları prod-env/ ← Prod Roadmap adımları diff --git a/setup/00-genel-yol-haritasi.md b/setup/00-general-roadmap.md similarity index 79% rename from setup/00-genel-yol-haritasi.md rename to setup/00-general-roadmap.md index 8d4f510..612b96a 100644 --- a/setup/00-genel-yol-haritasi.md +++ b/setup/00-general-roadmap.md @@ -43,9 +43,9 @@ Minimum topology for the test environment: | Node | Role | Note | | --- | --- | --- | | `iklim-app-01` | Swarm manager + app worker + Gitea runner | CI/CD test deploy runs through this node | -| `iklim-db-01` | DB node | DB infrastructure will be installed manually; it will not be installed by Gitea CI/CD | +| `iklim-db-01` | DB node / Swarm worker | DB host prerequisites are prepared by Ansible; DB services are deployed as Swarm services by the environment stack/pipeline | -The test DB setup is brought only up to machine and OS preparation with Terraform/Ansible. PostgreSQL/MongoDB cluster installation is outside this phase. +The test DB setup is brought up to OS, Docker, Swarm worker, config directory, and WireGuard preparation with Terraform/Ansible. PostgreSQL/MongoDB runtime services are not installed directly on the OS; they run as Docker Swarm services. 
### Prod @@ -56,23 +56,25 @@ HA topology for the prod environment: | `iklim-app-*` | 3 | Each one is a Swarm manager + app worker | | `iklim-db-*` | 3 | DB cluster nodes | -Prod DB infrastructure will be installed manually; it will not be installed by Gitea CI/CD. Terraform prepares the DB machines and network/firewall rules; Ansible installs OS hardening and base dependencies. +Prod DB host prerequisites are prepared by Terraform/Ansible. Runtime DB services are part of the current prod Swarm stack: etcd, Patroni/PostgreSQL, and MongoDB replica set are deployed by the prod root pipeline through `docker-stack-infra_db-prod.yml`. ## Public Port Policy -Ports open to the public internet are only: +Ports open to the public internet are normally only: - `22/tcp` SSH, only from admin IP/CIDR sources - `80/tcp` HTTP - `443/tcp` HTTPS +Test has one explicit exception: `51820/udp` is opened on the DB node for WireGuard VPN, authenticated cryptographically. Prod currently does not expose `51820/udp` in Terraform. + `8200/tcp` Vault will not be opened to the public internet. Vault must be reachable only from the private network or Docker overlay. -`docker-stack-infra.yml` has been aligned with this policy: only the SWAG service publishes ports 80/443; all other services such as Vault, APISIX, RabbitMQ, Prometheus, and Grafana are reachable only through the `iklimco-net` overlay. +Current prod stack behavior is aligned with this policy: `docker-stack-infra_db-prod.yml` publishes public traffic through SWAG on 80/443. Vault is deployed separately by `vault-bootstrap.sh` using `docker-stack-vault.yml`; it is not publicly exposed. ## Private Network Policy -The detailed matrix of ports that must be opened inside the private network is in `01-private-network-port-matrisi.md`. Agents must treat that file as the source when writing firewall or Ansible UFW rules. +The detailed matrix of ports that must be opened inside the private network is in `01-private-network-port-matrix.md`. 
Agents must treat that file as the source when writing Terraform Hetzner firewall rules and Ansible `firewalld` rules. ## Gitea Actions Runner Decision diff --git a/setup/01-private-network-port-matrisi.md b/setup/01-private-network-port-matrix.md similarity index 84% rename from setup/01-private-network-port-matrisi.md rename to setup/01-private-network-port-matrix.md index ab6204f..ade2d44 100644 --- a/setup/01-private-network-port-matrisi.md +++ b/setup/01-private-network-port-matrix.md @@ -1,8 +1,8 @@ -# 07 - Private Network Port Matrix +# 01 - Private Network Port Matrix -This file defines the ports that must be opened inside the Hetzner private network for test and prod environments. Ports open to the public internet will only be `22/tcp`, `80/tcp`, and `443/tcp`. Vault `8200/tcp` will not be opened publicly. +This file defines the ports that must be opened inside the Hetzner private network for test and prod environments. Public ingress is limited to `22/tcp`, `80/tcp`, and `443/tcp`, with one current test-only exception: `51820/udp` is public on the test DB node for WireGuard. Vault `8200/tcp` will not be opened publicly. -This matrix must be treated as the source for Terraform Hetzner firewall and Ansible UFW rules. +This matrix must be treated as the source for Terraform Hetzner firewall and Ansible `firewalld` rules. 
## Network Plan @@ -11,25 +11,25 @@ This matrix must be treated as the source for Terraform Hetzner firewall and Ans | Subnet | CIDR | Purpose | | --- | --- | --- | | App/Swarm | `10.10.10.0/24` | `iklim-app-01` | -| DB | `10.10.20.0/24` | `test-db-01` | +| DB | `10.10.20.0/24` | `iklim-db-01` | ### Prod | Subnet | CIDR | Purpose | | --- | --- | --- | | App/Swarm | `10.20.10.0/24` | `iklim-app-01/02/03` | -| DB | `10.20.20.0/24` | `prod-db-01/02/03` | +| DB | `10.20.20.0/24` | `iklim-db-01/02/03` | ## Public Ingress Standard -Public ingress for all environments: +Public ingress: | Port | Protocol | Source | Target | Requirement | | --- | --- | --- | --- | --- | | `22` | TCP | Admin IP/CIDR | All nodes | SSH management | | `80` | TCP | Internet | `iklim-app-01` (gateway) | HTTP / ACME redirect | | `443` | TCP | Internet | `iklim-app-01` (gateway) | HTTPS | -| `51820` | UDP | `0.0.0.0/0`, `::/0` | `iklim-db-01` (DB node) | WireGuard VPN — authentication with cryptographic key | +| `51820` | UDP | `0.0.0.0/0`, `::/0` | `iklim-db-01` in test only | WireGuard VPN — authentication with cryptographic key | Critical ports that will not be opened publicly: @@ -80,11 +80,11 @@ These ports will not be opened publicly. Access will be allowed only from requir | `9090` | TCP | Prometheus UI/API | Admin CIDR or private ops | Prometheus service/node | Public closed | | `3000` | TCP | Grafana UI | Admin CIDR or private ops | Grafana service/node | Public closed | -`docker-stack-infra.yml` has been updated so that only the SWAG service publishes ports 80/443 in host mode. All other services contain no published ports; access is provided only through the `iklimco-net` overlay. This table remains the source for private ingress decisions. +The current prod root stack is `docker-stack-infra_db-prod.yml`; Vault is deployed separately with `docker-stack-vault.yml` through `vault-bootstrap.sh`. Public traffic is expected to enter through SWAG on 80/443. 
Private service reachability is provided by the `iklimco-net` overlay and by the explicit host-mode DB/cluster ports listed below. ## DB Node Ports -Because DB infrastructure will be installed manually, the exact cluster technology is outside this document. Still, the default ports for firewall purposes are below. +DB runtime services are deployed as Docker Swarm services. Prod currently uses Patroni/PostgreSQL, etcd, and a MongoDB replica set in `docker-stack-infra_db-prod.yml`; the required firewall ports are below. ### PostgreSQL / PostGIS (Patroni + etcd) @@ -129,7 +129,7 @@ App subnet (swarm firewall) — traffic inside itself: | Source | Target | Ports | | --- | --- | --- | | `10.20.10.0/24` | `10.20.10.0/24` | `2377/tcp`, `7946/tcp`, `7946/udp`, `4789/udp` (Swarm) | -| `10.20.10.0/24` | `10.20.10.0/24` | `8200/tcp`, `6379/tcp`, `5672/tcp`, `61613/tcp`, `15674/tcp`, `2379/tcp` (application services) | +| `10.20.10.0/24` | `10.20.10.0/24` | `8200/tcp`, `5672/tcp`, `61613/tcp`, `15674/tcp` (application services) | | Admin CIDR or VPN | `10.20.10.0/24` | `15672/tcp`, `9180/tcp`, `9090/tcp`, `3000/tcp` | App -> DB traffic (there is no related rule in the swarm firewall; it is allowed in the db firewall): @@ -157,7 +157,7 @@ DB -> App traffic (allowed in the swarm firewall): - The public firewall does not open `8200/tcp`. - DB ports are not open publicly. -- Swarm ports are open only inside the private app/swarm subnet. +- Swarm ports are open only between Swarm app and DB subnets. - The App/Swarm subnet reaches the DB subnet only through required DB ports. - The DB subnet is not opened to the app subnet with broad permissions. - Admin UI ports are restricted through admin CIDR/VPN/private ops instead of public access. 
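As a rough illustration of the public-ingress rules in this matrix, the sketch below encodes the allowlist as a shell function. The name `is_public_port_allowed` and its `env`/`rule` arguments are hypothetical and do not exist in the repository; the Terraform firewall definitions remain the real point of enforcement.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository.
# Allowed public ingress per this matrix: 22/tcp, 80/tcp, 443/tcp,
# plus 51820/udp in the test environment only.
is_public_port_allowed() {
  local env="$1" rule="$2"
  case "$rule" in
    22/tcp|80/tcp|443/tcp)
      return 0 ;;
    51820/udp)
      # WireGuard is the explicit test-only exception.
      if [ "$env" = "test" ]; then return 0; fi
      return 1 ;;
    *)
      # Everything else (for example 8200/tcp for Vault) stays closed publicly.
      return 1 ;;
  esac
}

# Example: review a candidate rule list for prod.
for rule in 22/tcp 443/tcp 8200/tcp 51820/udp; do
  if is_public_port_allowed prod "$rule"; then
    echo "prod $rule: allowed"
  else
    echo "prod $rule: blocked"
  fi
done
```

A check like this could run before `terraform plan` as a quick sanity test, assuming the candidate rules are available as `port/protocol` strings.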
diff --git a/setup/02-test-terraform-iaac.md b/setup/02-test-terraform-iac.md similarity index 91% rename from setup/02-test-terraform-iaac.md rename to setup/02-test-terraform-iac.md index 4e6af16..822af8b 100644 --- a/setup/02-test-terraform-iaac.md +++ b/setup/02-test-terraform-iac.md @@ -11,8 +11,8 @@ Terraform creates the following in the test environment: - App/Swarm subnet: `10.10.10.0/24` - DB subnet: `10.10.20.0/24` - Firewall: - - Public ingress: only `22/tcp`, `80/tcp`, `443/tcp` - - Private ingress: test rules in `01-private-network-port-matrisi.md` + - Public ingress: `22/tcp`, `80/tcp`, `443/tcp`, plus test DB WireGuard `51820/udp` + - Private ingress: test rules in `01-private-network-port-matrix.md` - SSH key - Placement group: `iklim-test-spread` - Floating IP: stable IPv4 for the swarm entry point @@ -21,7 +21,7 @@ Terraform creates the following in the test environment: - `iklim-db-01` - Ansible inventory output -Terraform does not install DB software. The DB node is prepared only at the machine, network, and firewall level. +Terraform does not install DB software. The DB node is prepared at the machine, network, and firewall level; Ansible later prepares Docker, Swarm worker membership, DB config directories, and WireGuard. ## Recommended File Structure @@ -69,7 +69,7 @@ The server type decision is based on the current test environment metrics in `.. | Server | Private IP | Role | | --- | --- | --- | | `iklim-app-01` | `10.10.10.11` | Swarm manager + app worker + Gitea runner | -| `iklim-db-01` | `10.10.20.11` | DB node prepared for manual DB installation | +| `iklim-db-01` | `10.10.20.11` | DB node / Swarm worker for DB services | Private IPs must be statically defined inside Terraform. Ansible inventory and firewall rules remain deterministic. 
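Because the private IPs are static, they can be checked against the subnet plan before applying. The sketch below is a minimal illustration only: `ip_in_slash24` is a hypothetical helper, and it handles only `/24` networks by comparing the first three octets.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository.
# Returns 0 when the IPv4 address lies inside the given /24 network,
# 1 when it does not, and 2 when the CIDR is not a /24.
ip_in_slash24() {
  local ip="$1" cidr="$2"
  local net="${cidr%/*}"
  [ "${cidr##*/}" = "24" ] || return 2
  # For a /24, membership means the first three octets match.
  [ "${ip%.*}" = "${net%.*}" ]
}

# Static test-environment addresses from the table above.
ip_in_slash24 10.10.10.11 10.10.10.0/24 && echo "iklim-app-01 is inside the app subnet"
ip_in_slash24 10.10.20.11 10.10.20.0/24 && echo "iklim-db-01 is inside the DB subnet"
```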
@@ -91,7 +91,7 @@ Public ingress: | `80/tcp` | `0.0.0.0/0`, `::/0` | `iklim-app-01` | | `443/tcp` | `0.0.0.0/0`, `::/0` | `iklim-app-01` | -For public ingress, `8200/tcp`, `5432/tcp`, `27017/tcp`, `5672/tcp`, `15672/tcp`, `6379/tcp`, `2379/tcp`, `9000/tcp`, `9180/tcp`, `9090/tcp`, and `3000/tcp` will not be opened. +For public ingress, `8200/tcp`, `5432/tcp`, `27017/tcp`, `5672/tcp`, `15672/tcp`, `6379/tcp`, `2379/tcp`, `9000/tcp`, `9180/tcp`, `9090/tcp`, and `3000/tcp` will not be opened. `51820/udp` is the explicit test-only public exception for WireGuard. ### App (swarm) Firewall — Private Ingress @@ -133,9 +133,9 @@ Source from DB subnet, because `iklim-db-01` joins Swarm as a worker: | `7946/tcp,udp` | Docker Swarm node discovery | `10.10.10.0/24` (app subnet) | | `4789/udp` | Docker Swarm VXLAN overlay | `10.10.10.0/24` (app subnet) | -IP restriction is done in the SWAG nginx configuration, not in the Hetzner firewall. None of these ports are opened publicly from the `admin_allowed_cidrs` source. +IP restriction is done in the SWAG nginx configuration, not in the Hetzner firewall. None of these management ports are opened publicly from the `admin_allowed_cidrs` source. -For other private ingress rules, `01-private-network-port-matrisi.md` will be used as the source. +For other private ingress rules, `01-private-network-port-matrix.md` will be used as the source. ## Placement Group @@ -204,6 +204,6 @@ Each server gets `lifecycle { prevent_destroy = true }`. While this block exists - `terraform plan` works only with the test Hetzner Project token. - 2 servers are created after `terraform apply`. - The two servers can reach each other through the private network. -- Only `22`, `80`, and `443` are open at firewall level from the public internet. +- Only `22`, `80`, `443`, and test WireGuard `51820/udp` are open at firewall level from the public internet. - Vault `8200` remains closed from the public internet. - Terraform state is not committed to the repo. 
diff --git a/setup/03-test-ansible-bootstrap.md b/setup/03-test-ansible-bootstrap.md index 6c110d8..af2e1e6 100644 --- a/setup/03-test-ansible-bootstrap.md +++ b/setup/03-test-ansible-bootstrap.md @@ -97,7 +97,7 @@ ansible-playbook test-bootstrap.yml --tags "hardening" --ask-vault-pass | Host | Role | | --- | --- | | `iklim-app-01` | Swarm manager + app worker | -| `iklim-db-01` | OS-hardened DB node for manual DB installation | +| `iklim-db-01` | OS-hardened DB node / Swarm worker for DB services | ## Recommended File Structure @@ -281,7 +281,7 @@ Deploy prerequisites on `iklim-app-01`: /opt/iklimco/stacks ``` -Minimum for manual DB installation on the DB node: +Minimum DB-node host directories: ```text /opt/iklimco @@ -391,7 +391,7 @@ vault_iklim_password: "IKLIM_USER_PASSWORD" creates: "{{ storagebox_mount_point }}/.mounted_marker" ``` - A marker file can be written to the directory to confirm mount success: +A marker file can be written to the directory to confirm mount success: ```yaml - name: Write mount marker @@ -402,7 +402,7 @@ vault_iklim_password: "IKLIM_USER_PASSWORD" 6. **Create service bind mount directories** - In the test environment, the precipitation service's `image-data` volume is bind mounted on the host to `/mnt/storagebox/precipitation/images`. The directory is created by Ansible after StorageBox is mounted and left with `0755` permissions. +In the test environment, the precipitation service's `image-data` volume is bind mounted on the host to `/mnt/storagebox/precipitation/images`. The directory is created by Ansible after StorageBox is mounted and left with `0755` permissions. ```yaml - name: Create managed StorageBox directories @@ -447,13 +447,13 @@ An ed25519 SSH key pair is generated on the server and uploaded to the StorageBo 2. 
**Upload the public key to StorageBox** - This step is done manually and requires the password the first time: +This step is done manually and requires the password the first time: ```bash cat /root/.ssh/id_ed25519_storagebox.pub | ssh -p23 u469968-sub4@u469968-sub4.your-storagebox.de install-ssh-key ``` - Later access works passwordlessly: +Later access works passwordlessly: ```bash sftp -P23 u469968-sub4@u469968-sub4.your-storagebox.de @@ -461,14 +461,14 @@ An ed25519 SSH key pair is generated on the server and uploaded to the StorageBo 3. **Add private and public keys to Gitea** - Gitea -> Organization Settings -> Actions -> Secrets: +Gitea -> Organization Settings -> Actions -> Secrets: | Secret Name | Value | | --- | --- | | `STORAGEBOX_SSH_PRIV` | Contents of `/root/.ssh/id_ed25519_storagebox` | | `STORAGEBOX_SSH_PUB` | Contents of `/root/.ssh/id_ed25519_storagebox.pub` | - To get the key contents: +To get the key contents: ```bash cat /root/.ssh/id_ed25519_storagebox diff --git a/setup/04-test-db-docker-kurulum.md b/setup/04-test-db-docker-setup.md similarity index 74% rename from setup/04-test-db-docker-kurulum.md rename to setup/04-test-db-docker-setup.md index c0accf1..b5cdda7 100644 --- a/setup/04-test-db-docker-kurulum.md +++ b/setup/04-test-db-docker-setup.md @@ -1,6 +1,6 @@ -# 04 - Test DB Docker Installation (Swarm Worker) +# 04 - Test DB Docker Setup (Swarm Worker) -The purpose of this phase is to add the `iklim-db-01` node to Swarm as a worker and run PostgreSQL and MongoDB as Swarm services. +The purpose of this phase is to add the `iklim-db-01` node to Swarm as a worker and prepare the host for PostgreSQL and MongoDB Swarm services. ## Architecture Decision @@ -8,12 +8,12 @@ The roadmap states that DBs will be installed "manually". In the test environmen The installation has **two phases:** 1. **Preparation (Ansible):** The `test-db-post-stack.yml` playbook sets up DB directories, the `mongod.conf` configuration, and the WireGuard VPN service. 
-2. **Deploy (Gitea CI/CD):** The `deploy-test.yml` workflow deploys PostgreSQL and MongoDB services to Swarm through `docker-stack-infra.yml`. +2. **Deploy (Gitea CI/CD):** The test deploy workflow deploys PostgreSQL and MongoDB services as part of the environment stack. **Why?** 1. **Ease of management:** Version transitions and configuration management are much faster with Docker. 2. **Overlay Network:** Application services (`iklim-app-01`) can access DBs through the `iklimco-net` overlay network in an encrypted and isolated way. -3. **Data persistence:** Data is stored in Docker named volumes on `iklim-db-01`. StorageBox is used only for backups. +3. **Data persistence:** Runtime data is kept on the DB node. StorageBox is used for shared configuration, operational files, and backup-related paths, not as the primary DB data path. ## Prerequisites @@ -67,24 +67,21 @@ On `iklim-db-01`, through the `db_stack` and `wireguard` roles: - Places the `mongod.conf` file - Installs and configures the WireGuard VPN server (`51820/udp`) -> Deploying DB services (PostgreSQL, MongoDB) to Swarm is the responsibility of the Gitea CI/CD workflow (`deploy-test.yml`), not Ansible. This workflow deploys all services at once through `docker-stack-infra.yml`. +> Deploying DB services (PostgreSQL, MongoDB) to Swarm is the responsibility of the Gitea CI/CD workflow, not Ansible. The Ansible playbook prepares host directories, configuration, and WireGuard. ## 4. Volume and Data Structure -DB data is stored in Docker named volumes on `iklim-db-01`: +DB data is stored on `iklim-db-01` through the stack's configured volume or bind-mount layout. The Ansible `db_stack` role prepares MongoDB configuration at: -| Volume | Content | -|---|---| -| `iklim-db_postgresql_data` | PostgreSQL data files | -| `iklim-db_mongodb_data` | MongoDB data files | +```text +/opt/iklimco/db/mongodb/config/mongod.conf +``` -MongoDB logs are written to stdout and can be watched with `docker logs`. 
Configuration: `/opt/iklimco/db/mongodb/config/mongod.conf` - -> StorageBox is **not used** for DB data. It only has a role in the backup strategy. +MongoDB logs are written to stdout and can be watched with `docker logs`. ## 5. Acceptance Criteria - `iklim-db-01` appears as Ready and Active in the `docker node ls` command. - `docker stack services iklimco` shows both services with 1/1 replicas. - Access from the application node is available through the `iklim-db_postgresql` and `iklim-db_mongodb` DNS names. -- Data is preserved from named volumes after reboot; verify with `docker volume ls`. +- Data is preserved after reboot according to the stack's configured DB volume/bind-mount layout. diff --git a/setup/05-test-runner-ve-deploy-onkosullari.md b/setup/05-test-runner-and-deploy-prerequisites.md similarity index 84% rename from setup/05-test-runner-ve-deploy-onkosullari.md rename to setup/05-test-runner-and-deploy-prerequisites.md index 4184956..6adba4a 100644 --- a/setup/05-test-runner-ve-deploy-onkosullari.md +++ b/setup/05-test-runner-and-deploy-prerequisites.md @@ -8,7 +8,7 @@ A single runner is used in the test environment for cost and simplicity: | Host | Service Name | System User | Labels | | --- | --- | --- | --- | -| `iklim-app-01` | `gitea-act-runner` | `gitea-runner` | `ubuntu-latest`, `ubuntu-22.04`, `ubuntu-20.04`, `test-runner` | +| `iklim-app-01` | `gitea-act-runner` | `gitea-runner` | `ubuntu-latest`, `ubuntu-22.04`, `ubuntu-20.04`, `test-runner:docker://catthehacker/ubuntu:act-22.04` | ## 1. 
Runner User and Permissions @@ -56,14 +56,15 @@ Critical parts of the configuration: ```yaml runner: labels: - - "ubuntu-latest:docker://ubuntu:latest" - - "ubuntu-22.04:docker://ubuntu:22.04" - - "ubuntu-20.04:docker://ubuntu:20.04" - - "test-runner:docker://ubuntu:22.04" + - "ubuntu-latest" + - "ubuntu-22.04" + - "ubuntu-20.04" + - "test-runner:docker://catthehacker/ubuntu:act-22.04" container: - network: "iklimco-net" # Access to DB services through overlay - options: "-v /var/run/docker.sock:/var/run/docker.sock" # For Docker commands + network: "bridge" + options: "-v /mnt/storagebox:/mnt/storagebox" + docker_host: "unix:///var/run/docker.sock" ``` Status check: @@ -94,7 +95,7 @@ The following secrets must be defined at Gitea Organization level for pipelines ## 6. Custom Image Build and Harbor Push -`docker-stack-infra.yml` and microservice stacks use private images under `registry.tarla.io/iklimco/`. These images are built and pushed to the registry with the `ops/push-harbor-custom-images.sh` script. +Environment stack files and microservice stacks use private images under `registry.tarla.io/iklimco/`. These images are built and pushed to the registry with the `ops/push-harbor-custom-images.sh` script. APISIX config files (`build/apisix-core/config.yaml`, `build/apisix-dashboard/conf.yaml`) are generated from templates under `template/` with `envsubst`. `push-harbor-custom-images.sh` performs this generation internally; temporary files are cleaned automatically when the build finishes. @@ -114,6 +115,6 @@ bash ops/push-harbor-custom-images.sh 1. The runner labeled `test-runner` appears as **Idle** (green) on the Gitea Runners page. 2. A workflow using `runs-on: test-runner` is triggered successfully. -3. The job container can access the Docker daemon and the `iklimco-net` overlay network. +3. The job can access the Docker daemon through `docker_host`, and deploy workflows connect job containers to `iklimco-net` when overlay access is required. 4. 
The `8200/tcp` (Vault) port is closed to the public internet. 5. `registry.tarla.io/iklimco/custom-apisix`, `custom-apisix-dashboard`, and `custom-prometheus` images exist in Harbor and are pullable. diff --git a/setup/06-prod-terraform-iaac.md b/setup/06-prod-terraform-iac.md similarity index 95% rename from setup/06-prod-terraform-iaac.md rename to setup/06-prod-terraform-iac.md index 90e7dee..4a84570 100644 --- a/setup/06-prod-terraform-iaac.md +++ b/setup/06-prod-terraform-iac.md @@ -12,7 +12,7 @@ Terraform creates the following in the prod environment: - DB subnet: `10.20.20.0/24` - Firewall: - Public ingress: only `22/tcp`, `80/tcp`, `443/tcp` - - Private ingress: prod rules in `01-private-network-port-matrisi.md` + - Private ingress: prod rules in `01-private-network-port-matrix.md` - SSH key - Placement groups: - `iklim-prod-app-spread` @@ -145,6 +145,13 @@ The following ports will not be opened publicly in prod: ## Private Firewall +Firewall placement follows the Swarm placement model: + +- DB/cluster services on `iklim-db-*` nodes: Patroni/PostgreSQL, MongoDB, and etcd. +- App/service-node infrastructure on `iklim-app-*` nodes: Vault, RabbitMQ, APISIX, Prometheus, Grafana, SWAG, and the Redis/Sentinel services from `docker-stack-infra_db-prod.yml`. + +RabbitMQ ports are therefore documented under the app firewall. Redis and Redis Sentinel do not publish host-mode ports in the current prod stack; they stay on the Docker overlay network and do not need Hetzner firewall openings. + ### App (swarm) Firewall — Private Ingress Source from app subnet (`10.20.10.0/24`): @@ -340,7 +347,7 @@ Local state is used for now (`terraform.tfstate`). The state file is not committ - Swarm nodes are inside the `iklim-prod-app-spread` placement group. - DB nodes are inside the `iklim-prod-db-spread` placement group. - Public firewall allows only `22`, `80`, and `443` ingress. -- Private firewall is compatible with `01-private-network-port-matrisi.md`. 
+- Private firewall is compatible with `01-private-network-port-matrix.md`. - DB replication ports are accessible only from the DB subnet. - Floating IP is created and assigned to `iklim-app-01`. - Terraform state and secret tfvars are not committed. diff --git a/setup/07-prod-ansible-bootstrap.md b/setup/07-prod-ansible-bootstrap.md index 3895118..192591e 100644 --- a/setup/07-prod-ansible-bootstrap.md +++ b/setup/07-prod-ansible-bootstrap.md @@ -119,6 +119,8 @@ ansible/ vars.yml vault.yml prod-bootstrap.yml + roles/ + db_stack/ roles/ base/ hardening/ @@ -131,6 +133,8 @@ ansible/ db_stack/ ``` +`ansible/prod/ansible.cfg` sets `roles_path = roles:../roles`. Because of that ordering, `ansible/prod/roles/db_stack` is the production-specific role that is used by `prod-bootstrap.yml`; the shared `ansible/roles/db_stack` remains the common fallback/reference implementation. Production DB behavior that writes Patroni, MongoDB, and replica-set auth files to StorageBox belongs to the prod-local role. + ## Base Role Applied to all prod nodes: @@ -200,30 +204,35 @@ Prod Swarm will be set up with 3 managers: 1. `docker swarm init` on `iklim-app-01` (Advertise/data path addr: `10.20.10.11`) 2. `iklim-app-02` and `iklim-app-03` join as managers. 3. `iklim-db-01/02/03` join as workers. -4. Overlay network is created: `iklimco-net` +4. `iklimco-net` is not created by the Ansible swarm role. It is created and owned by the Swarm stack (`docker-stack-infra_db-prod.yml`) so Docker embedded DNS works for service VIPs and aliases. 5. Node labels: - `iklim-app-*` -> `type=service` - - `iklim-db-*` -> `role=db`, `db-index=01/02/03`, for Patroni node coordination + - `iklim-db-*` -> `role=db` + - `iklim-db-*` -> `db-index=01/02/03`, for Patroni node coordination 6. All nodes remain `AVAILABILITY=Active`. -The `db-index` labels are added through `iklim-app-01` in a separate play inside `prod-bootstrap.yml`, not by the swarm role. 
+Labeling is intentionally split across two automation layers: + +- The shared `swarm` role adds the generic environment labels: `type=service` on app nodes and `role=db` on DB nodes. +- The production playbook adds `db-index=01/02/03` through `iklim-app-01` in a separate play inside `prod-bootstrap.yml`. + +This split keeps the common Swarm role reusable while letting prod add the Patroni/MongoDB coordination labels it needs. ## Node Directory Role On all `iklim-app-*` nodes: ```text /opt/iklimco/ssl -/opt/iklimco/init -/opt/iklimco/stacks -/opt/iklimco/vault/data ``` -`/opt/iklimco/vault/data` is the host path volume of the Vault Raft node; it must be created separately on every app node. Swarm does not manage this directory as an overlay volume; if it is missing, the Vault container will not start. +Vault data is managed by the `docker-stack-vault.yml` stack through Docker volumes. The app nodes need the local SSL directory because `cert-distributor` syncs certificates from StorageBox into `/opt/iklimco/ssl` for Vault. On DB nodes: ```text /opt/iklimco/db /opt/iklimco/backup +/opt/iklimco/db/mongodb +/opt/iklimco/db/postgresql ``` ## StorageBox DAVFS Mount Role @@ -256,19 +265,22 @@ Applied to `iklim-app-*` nodes. Gitea Act Runner is installed on each app node a ## DB Stack Role -Applied to `iklim-db-*` nodes. On each DB node, it creates `/opt/iklimco/db` and `/opt/iklimco/backup` directories, as well as a local reference directory for MongoDB. The actual production configuration, including node-specific `mongod.conf`, replica set auth key, and Patroni configurations, is set up on StorageBox at `/mnt/storagebox/db/mongodb-0X/config/` and `/mnt/storagebox/db/postgresql-0X/config/` in the `08-prod-db-cluster-kurulum.md` step. etcd data is stored on local Docker named volumes (not StorageBox). +Applied to `iklim-db-*` nodes. On each DB node, it creates `/opt/iklimco/db`, `/opt/iklimco/backup`, `/opt/iklimco/db/mongodb`, and `/opt/iklimco/db/postgresql`. 
The production configuration, including node-specific `mongod.conf`, replica set auth key, and Patroni configurations, is deployed by the Ansible `db_stack` role to StorageBox at `/mnt/storagebox/db/mongodb-0X/config/` and `/mnt/storagebox/db/postgresql-0X/config/`. etcd data is stored on local Docker named volumes. ## DB Stack Env Variables -Password variables required by the DB cluster stack (`docker-stack-db.prod.yml`) — `DATABASE_POSTGRES_ROOT_PASSWD`, `DATABASE_POSTGRES_REPLICATOR_PASSWORD`, `DATABASE_MONGODB_ROOT_PASSWD` — are stored in `prod/secrets/iklim.co/.env.secrets.shared` on StorageBox, alongside the other shared secrets. No separate file is needed. +Password variables required by the prod infra stack (`docker-stack-infra_db-prod.yml`) — including `DATABASE_POSTGRES_ROOT_PASSWD`, `DATABASE_POSTGRES_REPLICATOR_PASSWORD`, `DATABASE_MONGODB_ROOT_PASSWD`, and `ETCD_ROOT_PASSWORD` — are stored in `prod/secrets/iklim.co/.env.secrets.shared` on StorageBox, alongside the other shared secrets. No separate file is needed. ## StorageBox Directory Structure The `storagebox` Ansible role **automatically** creates the following directories during bootstrap through `storagebox_managed_directories` (`group_vars/all/vars.yml`). No manual step is required: - `/mnt/storagebox/ssl` → `SWAG_CERT_DIR` -- `/mnt/storagebox/swag/config` → `SWAG_CONFIG_DIR` +- `/mnt/storagebox/swag` +- `/mnt/storagebox/swag/dns-conf` → `SWAG_DNS_CONFIG_DIR` - `/mnt/storagebox/swag/site-confs` → `SWAG_SITE_CONFS_DIR` +- `/mnt/storagebox/swag/proxy-confs` → `SWAG_PROXY_CONFS_DIR` +- `/mnt/storagebox/swag/certbot` - `/mnt/storagebox/grafana/data` → `GRAFANA_DATA_DIR` - `/mnt/storagebox/precipitation/images` @@ -300,12 +312,12 @@ grep -n "swarm init\|swarm join" init/swarm-init.sh - 3 Swarm manager nodes appear as Leader/Reachable in `docker node ls`. - 3 DB nodes appear as Workers in `docker node ls`. - Manager quorum is provided: 3 managers, 1 loss tolerated.
-- The `iklimco-net` overlay network exists. +- The `iklimco-net` overlay network is created by the Swarm stack after `docker-stack-infra_db-prod.yml` deploy. - Node labels (`type=service`, `role=db`, `db-index=01/02/03`) are verified with inspect. - `swarm-init.sh` does not attempt init again in an active Swarm; it is idempotent. - `/mnt/storagebox` is mounted on every node. -- The `/opt/iklimco/vault/data` directory exists on every app node. -- The `ssl`, `swag/config`, `swag/site-confs`, `grafana/data`, and `precipitation/images` directories exist on StorageBox. +- The `/opt/iklimco/ssl` directory exists on every app node. +- The `db`, `ssl`, `swag`, `swag/dns-conf`, `swag/site-confs`, `swag/proxy-confs`, `swag/certbot`, `grafana/data`, and `precipitation/images` directories exist on StorageBox. - The Gitea Act Runner service is running on every app node. -- `/opt/iklimco/db` and `/opt/iklimco/backup` directories exist on DB nodes. Node-specific `mongod.conf` and other DB configurations are created on StorageBox (`/mnt/storagebox/db/...`) in the `08-prod-db-cluster-kurulum.md` step. +- `/opt/iklimco/db` and `/opt/iklimco/backup` directories exist on DB nodes. Node-specific `mongod.conf` and other DB configurations are created on StorageBox (`/mnt/storagebox/db/...`) in the `08-prod-db-cluster-setup.md` step. - Public firewall allows only `22`, `80`, and `443` ingress. diff --git a/setup/08-prod-db-cluster-kurulum.md b/setup/08-prod-db-cluster-setup.md similarity index 73% rename from setup/08-prod-db-cluster-kurulum.md rename to setup/08-prod-db-cluster-setup.md index 7b584f0..4490d2d 100644 --- a/setup/08-prod-db-cluster-kurulum.md +++ b/setup/08-prod-db-cluster-setup.md @@ -27,7 +27,9 @@ iklim-db-03 (Swarm worker, 10.20.20.13) patroni-03 [Patroni + PostgreSQL — standby] ``` -DB containers discover each other through **overlay DNS aliases** (`mongodb-01`, `etcd-01`, `patroni-01`, etc.) on the shared `iklimco-net` overlay network. 
Each service publishes its port in `host` mode so replication traffic goes directly through the Hetzner private network while the overlay DNS resolves service names correctly. All containers are defined in the single `docker-stack-db.prod.yml` stack file at the repo root. +DB containers discover each other through **overlay DNS aliases** (`mongodb-01`, `etcd-01`, `patroni-01`, etc.) on the shared `iklimco-net` overlay network. Patroni/PostgreSQL, MongoDB, and etcd are the DB/cluster services covered by this document; they publish their cluster ports in `host` mode so replication traffic goes directly through the Hetzner private network while overlay DNS resolves service names correctly. + +The current prod DB services are defined in the root `docker-stack-infra_db-prod.yml` stack file. That stack also contains non-DB infrastructure services such as Redis, Redis Sentinel, and RabbitMQ. Those services are intentionally different: they run on `node.labels.type == service` app/service nodes, do not publish host-mode ports in this stack, and communicate through the `iklimco-net` overlay network only. Do not generalize the DB host-mode rule to Redis or RabbitMQ. ## 1. Firewall Update @@ -145,6 +147,10 @@ terraform apply ## 2. Add DB Nodes to Swarm +This is handled by `Environment_Infrastructure/ansible/prod/prod-bootstrap.yml` through the `swarm` role. The role initializes Swarm on `iklim-app-01`, joins `iklim-app-02/03` as managers, joins `iklim-db-01/02/03` as workers, and labels DB nodes. + +Manual equivalent, kept for troubleshooting only: + Get the join token **on one of the Swarm managers** (iklim-app-01): ```bash @@ -157,19 +163,35 @@ docker swarm join-token worker docker swarm join --token 10.20.10.11:2377 ``` -Label the nodes **on iklim-app-01**: +Label the nodes **on iklim-app-01**. In automation this is split into two phases: + +- the shared `swarm` role adds `role=db` to DB nodes; +- the prod-specific `prod-bootstrap.yml` play adds `db-index=01/02/03`.
+ +Manual equivalent: ```bash -docker node update --label-add role=db --label-add db-index=01 iklim-db-01 -docker node update --label-add role=db --label-add db-index=02 iklim-db-02 -docker node update --label-add role=db --label-add db-index=03 iklim-db-03 +docker node update --label-add role=db iklim-db-01 +docker node update --label-add role=db iklim-db-02 +docker node update --label-add role=db iklim-db-03 + +docker node update --label-add db-index=01 iklim-db-01 +docker node update --label-add db-index=02 iklim-db-02 +docker node update --label-add db-index=03 iklim-db-03 docker node ls ``` ## 3. StorageBox Directory Structure -DB data and logs are stored on **local Docker named volumes** (performance, WAL/compaction requirements). Only config files are placed on StorageBox. On each DB node, where `/mnt/storagebox` must already be mounted: +DB data is stored on local DB-node paths prepared by Ansible: + +```text +/opt/iklimco/db/mongodb +/opt/iklimco/db/postgresql +``` + +Configuration files are placed on StorageBox. On each DB node, where `/mnt/storagebox` must already be mounted: ```bash # On iklim-db-01: @@ -185,7 +207,7 @@ mkdir -p /mnt/storagebox/db/mongodb-03/config mkdir -p /mnt/storagebox/db/postgresql-03/config ``` -Config files (`mongod.conf`, `patroni.yml`) are deployed by the Ansible `db_stack` role into these directories. Named Docker volumes (`mongodb-01-data`, `etcd-01-data`, `postgresql-01-data`, etc.) are created automatically by the stack deploy. +Config files (`mongod.conf`, `patroni.yml`) and the MongoDB replica set key are deployed by the Ansible `db_stack` role into these directories. etcd uses Docker named volumes (`etcd-01-data`, `etcd-02-data`, `etcd-03-data`) from `docker-stack-infra_db-prod.yml`. ## 4. MongoDB Replica Set @@ -216,14 +238,18 @@ security: ### Replica Set Auth Key -The **same** key file must exist on all DB nodes: +The **same** key file must exist on all DB nodes. 
In the current production setup, this is automated by `ansible/prod/roles/db_stack/tasks/db_node.yml`: + +- `iklim-db-01` generates `/mnt/storagebox/db/mongodb-01/config/rs-auth.key` if it is missing. +- the same key content is copied to `/mnt/storagebox/db/mongodb-02/config/rs-auth.key` and `/mnt/storagebox/db/mongodb-03/config/rs-auth.key`; +- permissions are set to `0400`. + +Manual recovery equivalent, kept only for troubleshooting: ```bash -# Create on iklim-db-01: openssl rand -base64 756 > /mnt/storagebox/db/mongodb-01/config/rs-auth.key chmod 400 /mnt/storagebox/db/mongodb-01/config/rs-auth.key -# Copy the same content to the other nodes: cat /mnt/storagebox/db/mongodb-01/config/rs-auth.key \ > /mnt/storagebox/db/mongodb-02/config/rs-auth.key cat /mnt/storagebox/db/mongodb-01/config/rs-auth.key \ @@ -234,14 +260,16 @@ chmod 400 /mnt/storagebox/db/mongodb-0{2,3}/config/rs-auth.key ### Stack File — MongoDB -MongoDB services are defined in `docker-stack-db.prod.yml` (repo root). Each service uses a named Docker volume for data and log, and a StorageBox bind mount for config: +MongoDB services are defined in `docker-stack-infra_db-prod.yml` (repo root). Each service uses a local DB-node bind mount for data and a StorageBox bind mount for config: ```yaml mongodb-01: - image: mongo:8.3.2 + image: ${IMAGE_MONGODB} + environment: + MONGO_INITDB_ROOT_USERNAME: "${DATABASE_MONGODB_ROOT_USER}" + MONGO_INITDB_ROOT_PASSWORD: "${DATABASE_MONGODB_ROOT_PASSWD}" volumes: - - mongodb-01-data:/data/db - - mongodb-01-log:/data/log + - /opt/iklimco/db/mongodb:/data/db - /mnt/storagebox/db/mongodb-01/config:/data/configdb networks: iklimco-net: @@ -260,11 +288,18 @@ mongodb-01: - node.hostname == iklim-db-01 ``` -Volumes `mongodb-01-data`, `mongodb-01-log`, etc. are declared at the bottom of `docker-stack-db.prod.yml` and are created automatically on first deploy. 
+The same pattern is repeated for `mongodb-02` and `mongodb-03`, with node-specific StorageBox config paths and placement constraints. ### Replica Set Initialization -Run **once** after the stack is deployed: +Replica set initialization is handled by the root prod workflow step `Initialize MongoDB Replica Set`. The workflow: + +1. Connects to the first host from `DATABASE_MONGODB_HOST`. +2. Runs `rs.initiate()` if the replica set is uninitialized. +3. Checks current members if the replica set already exists. +4. Runs `rs.add()` through the primary if hosts from `DATABASE_MONGODB_HOST` are missing. + +Manual equivalent, kept for troubleshooting only: ```bash # On iklim-app-01 (for overlay network access): @@ -293,7 +328,7 @@ Patroni coordinates PostgreSQL primary/standby roles through etcd. If the primar ### 5.1 Custom Image (Patroni + PostGIS) -Patroni is installed on top of the `postgis/postgis:18-3.6` image. This image is pushed to Harbor and used in the stack. +Patroni is installed on top of the `postgis/postgis:18-3.6` image. This image is pushed to Harbor and used in `docker-stack-infra_db-prod.yml` via `${CUSTOM_IMAGE_REGISTRY}${IMAGE_PATRONI}`. `build/patroni-postgis/Dockerfile`: @@ -335,13 +370,13 @@ docker push registry.tarla.io/iklimco/custom-patroni-postgis:18-3.6 ### 5.2 etcd Cluster -etcd services are defined in `docker-stack-db.prod.yml`. Each service uses a named Docker volume for data and has an overlay DNS alias. Environment variables reference peer URLs by alias, not by hardcoded IP: +etcd services are defined in `docker-stack-infra_db-prod.yml`. Each service uses a named Docker volume for data and has an overlay DNS alias.
Environment variables reference peer URLs by alias, not by hardcoded IP: ```yaml etcd-01: - image: bitnami/etcd:3 + image: ${IMAGE_ETCD} environment: - ALLOW_NONE_AUTHENTICATION: "yes" + ALLOW_NONE_AUTHENTICATION: "no" ETCD_NAME: etcd-01 ETCD_INITIAL_ADVERTISE_PEER_URLS: http://etcd-01:2380 ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380 @@ -350,6 +385,7 @@ etcd-01: ETCD_INITIAL_CLUSTER: "etcd-01=http://etcd-01:2380,etcd-02=http://etcd-02:2380,etcd-03=http://etcd-03:2380" ETCD_INITIAL_CLUSTER_STATE: new ETCD_INITIAL_CLUSTER_TOKEN: iklimco-etcd-prod + ETCD_ROOT_PASSWORD: "${ETCD_ROOT_PASSWORD}" volumes: - etcd-01-data:/bitnami/etcd/data networks: @@ -366,7 +402,7 @@ etcd-01: **APISIX etcd usage:** In prod, APISIX shares this etcd cluster with the `/apisix` prefix. Patroni uses the `/service/` prefix and APISIX uses the `/apisix/` prefix — no collision. The overlay DNS names (`etcd-01:2379`, `etcd-02:2379`, `etcd-03:2379`) are reachable from app nodes via the `iklimco-net` overlay. Therefore, the app subnet → DB nodes port 2379 firewall rule is mandatory; it was added in Section 1. -**Important:** `ETCD_INITIAL_CLUSTER_STATE` must be `new` on the first deploy and `existing` on all later deploys. The deploy steps in Section 6 detect this automatically; no manual update is required. +**Important:** `ETCD_INITIAL_CLUSTER_STATE` is currently defined in `docker-stack-infra_db-prod.yml`. When changing etcd cluster membership, do not blindly expand `ETCD_INITIAL_CLUSTER` on a running cluster; add members through etcd membership operations first. ### 5.3 Patroni Configuration @@ -447,17 +483,19 @@ For Node 02 and 03, only `name`, `restapi.connect_address`, and `postgresql.conn ### 5.4 Stack File — Patroni -Patroni services are defined in `docker-stack-db.prod.yml`. Each service uses the custom image, a named Docker volume for data, a StorageBox bind mount for the config file, and overlay DNS aliases: +Patroni services are defined in `docker-stack-infra_db-prod.yml`. 
Each service uses the custom image, a local DB-node bind mount for data, a StorageBox bind mount for the config file, and overlay DNS aliases: ```yaml patroni-01: - image: registry.tarla.io/iklimco/custom-patroni-postgis:18-3.6 + image: ${CUSTOM_IMAGE_REGISTRY}${IMAGE_PATRONI} environment: - DATABASE_POSTGRES_ROOT_PASSWD: "${DATABASE_POSTGRES_ROOT_PASSWD}" - DATABASE_POSTGRES_REPLICATOR_PASSWORD: "${DATABASE_POSTGRES_REPLICATOR_PASSWORD}" + POSTGRES_USER: "${DATABASE_POSTGRES_ROOT_USER}" + POSTGRES_PASSWORD: "${DATABASE_POSTGRES_ROOT_PASSWD}" + REPLICATOR_PASSWORD: "${DATABASE_POSTGRES_REPLICATOR_PASSWORD}" + ETCD_ROOT_PASSWORD: "${ETCD_ROOT_PASSWORD}" TZ: "Europe/Istanbul" volumes: - - postgresql-01-data:/var/lib/postgresql/data + - /opt/iklimco/db/postgresql:/var/lib/postgresql/data - /mnt/storagebox/db/postgresql-01/config/patroni.yml:/etc/patroni/patroni.yml:ro networks: iklimco-net: @@ -480,7 +518,7 @@ patroni-01: - node.hostname == iklim-db-01 ``` -Volumes `postgresql-01-data`, `postgresql-02-data`, `postgresql-03-data` are declared at the bottom of `docker-stack-db.prod.yml` and created automatically on first deploy. +The same pattern is repeated for `patroni-02` and `patroni-03`, with node-specific StorageBox config paths and placement constraints. ### 5.5 Status Check @@ -508,11 +546,11 @@ docker exec -it $(docker ps -q -f name=iklimco_patroni-01 | head -1) \ ## 6. Deploy -All DB services (etcd, MongoDB, Patroni) are in the single `docker-stack-db.prod.yml` stack. Deploy from `iklim-app-01` in the repo working directory. +All DB services (etcd, MongoDB, Patroni) are in the current root prod stack `docker-stack-infra_db-prod.yml`. Normal deployment is done by `.gitea/workflows/deploy-prod.yml`, not by running a separate DB stack manually. ### .env File -DB stack password variables (`DATABASE_POSTGRES_ROOT_PASSWD`, `DATABASE_POSTGRES_REPLICATOR_PASSWORD`, `DATABASE_MONGODB_ROOT_PASSWD`) are stored in `prod/secrets/iklim.co/.env.secrets.shared` on StorageBox. 
Fetch it to `iklim-app-01` before deploy: +DB stack password variables (`DATABASE_POSTGRES_ROOT_PASSWD`, `DATABASE_POSTGRES_REPLICATOR_PASSWORD`, `DATABASE_MONGODB_ROOT_PASSWD`, `ETCD_ROOT_PASSWORD`) are stored in `prod/secrets/iklim.co/.env.secrets.shared` on StorageBox. The workflow fetches this file automatically. ```bash scp -P 23 STORAGEBOX_USER@STORAGEBOX_USER.your-storagebox.de:prod/secrets/iklim.co/.env.secrets.shared \ @@ -522,44 +560,18 @@ chmod 600 /tmp/.env.secrets.shared ### Deploy Steps +The root prod workflow deploys the stack with: + ```bash -# On iklim-app-01, in the repo working directory: -set -a; . /tmp/.env.secrets.shared; set +a - -# Automatic ETCD_INITIAL_CLUSTER_STATE detection: -DEPLOY_FILE="docker-stack-db.prod.yml" -if docker service ls --filter name=iklimco_etcd-01 -q 2>/dev/null | grep -q .; then - echo "ℹ️ etcd services mevcut, 'existing' ile deploy ediliyor..." - DEPLOY_FILE=$(mktemp /tmp/docker-stack-db.XXXXXX.yml) - sed "s/ETCD_INITIAL_CLUSTER_STATE: new/ETCD_INITIAL_CLUSTER_STATE: existing/g" \ - docker-stack-db.prod.yml > "$DEPLOY_FILE" -else - echo "ℹ️ İlk deploy, 'new' state kullanılıyor..." -fi - docker stack deploy \ --with-registry-auth \ - -c "$DEPLOY_FILE" \ + --resolve-image changed \ + -c docker-stack-infra_db-prod.yml \ iklimco - -[ "$DEPLOY_FILE" != "docker-stack-db.prod.yml" ] && rm -f "$DEPLOY_FILE" - -# Wait for etcd cluster to be ready: -echo "⏳ etcd bekleniyor..." -for i in $(seq 1 18); do - if docker run --rm --network iklimco-net alpine \ - sh -c "wget -qO- http://etcd-01:2379/health 2>/dev/null | grep -q '\"health\":\"true\"'"; then - echo "✅ etcd ready" - break - fi - [ "$i" -eq 18 ] && echo "❌ etcd timeout" && exit 1 - echo " attempt $i/18 — 10s bekleniyor..." - sleep 10 -done - -docker stack services iklimco ``` +After the stack deploy, the workflow waits for etcd, initializes APISIX, initializes the MongoDB replica set, and runs PostgreSQL/MongoDB init scripts. 
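The "waits for etcd" part of the post-deploy sequence can be sketched as a small retry helper. This is a minimal sketch under stated assumptions, not the workflow's actual code: `retry` and `etcd_healthy` are hypothetical names, and the probe reuses the `/health` check from the removed manual deploy script, assuming that endpoint is still reachable over `iklimco-net`.

```shell
#!/usr/bin/env bash

# retry <attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or the attempt budget is used up.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      echo "attempt ${i}/${attempts} failed; retrying in ${delay}s" >&2
      sleep "$delay"
    fi
  done
  return 1
}

# Hypothetical probe (assumption): asks etcd-01 for its health over the
# overlay network, as the removed manual deploy script did.
etcd_healthy() {
  docker run --rm --network iklimco-net alpine \
    sh -c 'wget -qO- http://etcd-01:2379/health 2>/dev/null | grep -q "\"health\":\"true\""'
}

# Example invocation (not executed here): 18 attempts, 10 seconds apart.
# retry 18 10 etcd_healthy || { echo "etcd timeout" >&2; exit 1; }
```

Keeping the retry loop separate from the probe lets the same helper wrap other readiness checks with different attempt counts and delays.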
+ ### DB Node Placement Check ```bash @@ -572,7 +584,7 @@ All tasks must run on the expected `iklim-db-*` nodes. ### MongoDB Replica Set Initialization -Run once after the stack is deployed: +Handled by the workflow. Manual form for troubleshooting: ```bash # From iklim-app-01 via overlay network: @@ -596,7 +608,7 @@ App containers connect to DB services through the `iklimco-net` overlay network ### MongoDB Replica Set Connection String -Variables in `env-prod/.env`: +Variables in StorageBox `prod/secrets/iklim.co/.env`: ```bash DATABASE_MONGODB_HOST=mongodb-01:27017,mongodb-02:27017,mongodb-03:27017 @@ -613,7 +625,7 @@ mongodb://:@mongodb-01:27017,mongodb-02:27017,mongodb-03:27017/< ### PostgreSQL — Patroni -Variables in `env-prod/.env`: +Variables in StorageBox `prod/secrets/iklim.co/.env`: ```bash DATABASE_POSTGRES_HOST=patroni-01:5432,patroni-02:5432,patroni-03:5432 @@ -647,8 +659,7 @@ curl -s http://patroni-01:8008/primary `pg-proxy` and `mongo-proxy` are **not used** in the prod cluster. For access from an office computer, target the DB subnet directly. ### WireGuard Setting -`AllowedIPs` must be updated in the `.conf` file on the office computer: -`AllowedIPs = 10.8.0.1/32, 10.20.20.0/24` +`AllowedIPs` must be updated in the `.conf` file on the office computer: `AllowedIPs = 10.8.0.1/32, 10.20.20.0/24` ### Connection Parameters (Multi-Host) Modern database tools (DBeaver, Compass, etc.) should use cluster-aware connections: @@ -660,7 +671,7 @@ Modern veritabanı araçları (DBeaver, Compass vb.)
küme farkındalıklı bağ ## Acceptance Criteria -- `docker stack services iklimco` — 9 services visible (etcd-01/02/03, mongodb-01/02/03, patroni-01/02/03), all `1/1` +- `docker stack services iklimco` — etcd-01/02/03, mongodb-01/02/03, patroni-01/02/03 are visible and all target replicas are healthy - `docker service ps iklimco_patroni-01/02/03` — each task runs on its expected `iklim-db-*` node - `docker service ps iklimco_mongodb-01/02/03` — each task runs on its expected `iklim-db-*` node - `docker service ps iklimco_etcd-01/02/03` — each task runs on its expected `iklim-db-*` node diff --git a/setup/09-prod-runner-ha-ve-swarm.md b/setup/09-prod-runner-ha-and-swarm.md similarity index 64% rename from setup/09-prod-runner-ha-ve-swarm.md rename to setup/09-prod-runner-ha-and-swarm.md index 71b17c3..3a0d8ab 100644 --- a/setup/09-prod-runner-ha-ve-swarm.md +++ b/setup/09-prod-runner-ha-and-swarm.md @@ -16,7 +16,7 @@ In this model, if any manager/runner is lost, the other runners can pick up pipe ## Runner Installation Model -The runner will not run as a Docker container. There is no Docker socket mount. +The runner will not run as a Docker container. It runs as a systemd service on the app nodes. Job containers start on Docker `bridge`; deploy workflows connect the job container to `iklimco-net` after the stack creates that network. Installation: @@ -33,7 +33,7 @@ If runner jobs use Docker CLI for deploy, the `gitea-runner` user needs access t Shared labels on all prod runners: ```text -prod-runner +prod-runner:docker://catthehacker/ubuntu:act-22.04 ubuntu-24.04 ``` @@ -86,20 +86,19 @@ For the GoDaddy API key: https://developer.godaddy.com/keys — create a **Produ ### Gitea `PROD_FLOATING_IP` Variable -For DNS automation, `PROD_FLOATING_IP` must be defined as a Gitea project variable. See the "Gitea Variable: PROD_FLOATING_IP" step in `06-prod-terraform-iaac.md`. +For DNS automation, `PROD_FLOATING_IP` must be defined as a Gitea project variable. 
See the "Gitea Variable: PROD_FLOATING_IP" step in `06-prod-terraform-iac.md`. ### Docker Secrets -Before the infra stack is deployed, the following Docker secrets must be created on `iklim-app-01`. These secrets are referenced by `docker-stack-infra.prod.yml`; if they do not exist, stack deploy fails. +Before the infra stack is deployed, `rabbitmq_erlang_cookie` must exist as a Docker secret. The current prod workflow creates it in the `Create Infrastructure Docker Secrets` step if it is missing. ```bash -# RabbitMQ Erlang cluster cookie; must be the same on all RabbitMQ nodes: +# RabbitMQ Erlang cluster cookie; must be the same on all RabbitMQ nodes. +# The workflow does this automatically if the secret is missing: openssl rand -hex 32 | docker secret create rabbitmq_erlang_cookie - ``` -> The `vault_unseal_key` secret is created after Vault is started for the first time; see `roadmap/prod-env/07-vault-raft-plan.md` Step 3. It is not required for the first infra stack deploy; it is waited for until the health check is triggered. -> -> This secret is also used during Vault restarts triggered by cert-reloader: when `cert-reloader` detects a certificate change, it runs `docker service update --force iklimco_vault`; while Vault containers restart, they read from the `vault_unseal_key` Docker secret and automatically unseal. If the secret is missing, Vault remains sealed after every certificate renewal. +> The `vault_unseal_key` secret is managed by `init/vault/vault-bootstrap.sh`. The bootstrap script creates a placeholder on first deploy, deploys `docker-stack-vault.yml`, initializes/unseals Vault, and rotates the secret to the real unseal key. Verify secrets: @@ -120,7 +119,7 @@ Before the deploy pipeline runs, the following template files must exist in the These files are created in the test environment (`test-env/04-swag-nginx-configs.md`); they are not created separately for prod. 
Template files are shared by both environments; prod-specific values are injected with environment variables during deploy. -Verify that the `prod/secrets/iklim.co/.env.prod` file on StorageBox contains the following variables: +Verify that the `prod/secrets/iklim.co/.env` file on StorageBox contains the following variables: ```bash API_SUBDOMAIN=api.iklim.co @@ -129,11 +128,12 @@ RABBITMQ_SUBDOMAIN=rabbitmq.iklim.co GRAFANA_SUBDOMAIN=grafana.iklim.co RESTRICTED_IPS="78.187.87.109/32,95.70.151.248/32" SWAG_CERT_DIR=/mnt/storagebox/ssl -SWAG_CONFIG_DIR=/mnt/storagebox/swag/config +SWAG_DNS_CONFIG_DIR=/mnt/storagebox/swag/dns-conf SWAG_SITE_CONFS_DIR=/mnt/storagebox/swag/site-confs +SWAG_PROXY_CONFS_DIR=/mnt/storagebox/swag/proxy-confs ``` -The pipeline sources these variables and renders the template files into the `$SWAG_SITE_CONFS_DIR` (`/mnt/storagebox/swag/site-confs`) directory. Because StorageBox is mounted commonly on all app nodes, even if the configuration is created on a single runner, SWAG containers on other nodes access the same files. Detail: `roadmap/prod-env/04-swag-nginx-configs.md`. +The pipeline sources these variables and renders the template files into the `$SWAG_SITE_CONFS_DIR` (`/mnt/storagebox/swag/site-confs`) directory. Because StorageBox is mounted commonly on all app nodes, even if the configuration is created on a single runner, SWAG containers on other nodes access the same files. 
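The template rendering described above can be sketched with `envsubst` from the `gettext` package that the workflow installs. This is a hedged sketch, not the pipeline's actual step: `render_template` and the example paths are hypothetical, and the variable list is an assumption drawn from the `.env` entries shown in this document.

```shell
#!/usr/bin/env bash

# render_template <template> <output>
# Substitutes only the listed variables, so nginx runtime variables such as
# $host or $remote_addr inside the template are left untouched.
# The variables must be exported, for example via `set -a; . ./.env; set +a`.
render_template() {
  local template="$1" output="$2"
  envsubst '${API_SUBDOMAIN} ${RABBITMQ_SUBDOMAIN} ${GRAFANA_SUBDOMAIN} ${RESTRICTED_IPS}' \
    < "$template" > "$output"
}

# Example invocation (hypothetical file names, not executed here):
# render_template ./templates/api.conf.template "${SWAG_SITE_CONFS_DIR}/api.conf"
```

Passing an explicit variable list to `envsubst` is the design choice worth noting: without it, every `$name` in the nginx template would be replaced, including nginx's own variables.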
### APISIX Configuration @@ -194,27 +194,41 @@ All prod deploy workflows, including infra and microservices, must use the same | 2 | Prepare Folders | | | 3 | Set up SSH Key and Add to known_hosts | | | 4 | Update Apt Repository and Install Required Tools | `gettext tree jq` — `jq` is required for the GoDaddy DNS API | -| 5 | Fetch Service Secret Files | Fetch `.env.secrets.*` from StorageBox | -| 6 | Initialize Workspace | Fetch `.env` and `.env.secrets.shared` from StorageBox; run `init-infra-dev.sh` | -| 7 | Upload Updated Secrets to Storagebox | | -| 8 | Provision Vault AppRole IDs and Docker Secrets | | -| 9 | Upload Updated Env to Storagebox | | -| 10 | Prepare Init Files | Cert copy lines removed | -| 11 | Initialize Docker Swarm | | -| 12 | Docker Login to Harbor | | -| 13 | **Update DNS Records** * | GoDaddy API; `api/apigw/rabbitmq/grafana` A records; idempotent | -| 14 | **Prepare SWAG Directories** * | `$SWAG_CONFIG_DIR/dns-conf`; renders nginx conf templates; reloads running SWAG | -| 15 | Bootstrap Vault TLS Placeholder | | -| 16 | Deploy Swarm Stack | base + prod overlay together | -| 17 | **Wait for etcd** * | Waits until Patroni etcd (`etcd-01:2379`) is healthy | -| 18 | **Run APISIX Init** * | `SPRING_PROFILES_ACTIVE=prod`; idempotent; writes to etcd | -| 19 | **Bootstrap SWAG Certificate** * | Waits for SWAG to obtain the cert; copies it to `SWAG_CERT_DIR` | -| 20 | **Run Database Init Scripts** * | `postgresql`/`mongodb` Swarm VIP; SQL+JS init; idempotent | -| 21 | Review Environment | | +| 5 | Fetch Prod Env From Storagebox | Fetch `.env` and `.env.secrets.shared` | +| 6 | Fetch Service Secret Files | Fetch `.env.secrets.` and `.env.secrets.swag` | +| 7 | Prepare Database Init Files | Render PostgreSQL/MongoDB init templates | +| 8 | Docker Login to Harbor | | +| 9 | Prepare SWAG Directories | Render `dns-conf` and `site-confs`; reload node-local SWAG if present | +| 10 | Bootstrap Vault TLS Placeholder | Creates temporary cert only if missing 
| +| 11 | Create Infrastructure Docker Secrets | Creates `rabbitmq_erlang_cookie` if missing | +| 12 | Deploy Swarm Stacks | `docker-stack-infra_db-prod.yml` | +| 13 | Connect Runner to Overlay Network | Connects job container to `iklimco-net` | +| 14 | Initialize Production Infrastructure | Runs `init-infra-prod.sh`; this triggers Vault bootstrap and RabbitMQ setup | +| 15 | Wait for Infrastructure Services | Waits for `iklimco_vault` and `iklimco_rabbitmq` | +| 16 | Provision Vault AppRole IDs and Docker Secrets | Downloads service `vault-files`, runs `init/provision-all-services.sh` | +| 17 | Upload Updated Secrets to Storagebox | Uploads `.env.secrets.*` and `.env` | +| 18 | Wait for etcd | Waits for etcd health | +| 19 | Run APISIX Init | `SPRING_PROFILES_ACTIVE=prod` | +| 20 | Bootstrap SWAG Certificate | Waits for SWAG and cert-reloader output in `SWAG_CERT_DIR` | +| 21 | Initialize MongoDB Replica Set | `rs.initiate()` or missing-member `rs.add()` | +| 22 | Run Database Init Scripts | Patroni primary + MongoDB replica set; SQL+JS init | +| 23 | Update DNS Records | GoDaddy API; `api/apigw/rabbitmq/grafana` A records | +| 24 | Review Environment | | -### Removal of Cert Scp Lines +### Stack Placement Boundary -Lines removed from the `Initialize Workspace` step: +`docker-stack-infra_db-prod.yml` is intentionally a mixed infrastructure stack. The DB/cluster services in that file are placed on DB nodes and expose host-mode cluster ports: + +- Patroni/PostgreSQL, MongoDB, and etcd run on `iklim-db-*` workers. + +The service-node infrastructure in the same file remains overlay-only unless a reverse proxy or explicit published port is defined by the stack: + +- Redis, Redis Sentinel, and RabbitMQ run on `node.labels.type == service` app/service nodes. +- Redis and RabbitMQ must not be treated as DB-node host-mode services. + +### Historical Note: Removed Cert Scp Lines + +Older workflow versions copied certificate files manually in an `Initialize Workspace` step. 
That step no longer exists in the current root prod workflow. The removed lines are kept here only as a historical reference: ```yaml # REMOVED — manual cert copy with scp is no longer required: @@ -222,7 +236,7 @@ scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebo scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebox.de:prod/app/iklim.co/ssl/STAR.iklim.co_key.pem ./STAR.iklim.co_key.pem ``` -Line also removed from the `Prepare Init Files` step: +This line was also removed from the old `Prepare Init Files` step: ```yaml # REMOVED: @@ -231,97 +245,55 @@ sudo cp STAR.iklim.co.full.crt STAR.iklim.co_key.pem /opt/iklimco/ssl/ The certificate is now obtained by SWAG from Let's Encrypt and written to the `SWAG_CERT_DIR` (`/mnt/storagebox/ssl/`) directory in the `Bootstrap SWAG Certificate` step. Later renewals are handled automatically by cert-reloader. -### Bootstrap SWAG Certificate (Step 19) +### Bootstrap SWAG Certificate (Step 20) -On the first deploy, SWAG obtains the Let's Encrypt certificate with the GoDaddy DNS-01 challenge. This step waits for SWAG to obtain the certificate, for up to 10 minutes, and then copies it to the `SWAG_CERT_DIR` directory: +On the first deploy, SWAG obtains the Let's Encrypt certificate with the GoDaddy DNS-01 challenge. The current step waits for the Swarm `iklimco_swag` service to be running, then waits for `cert-reloader` to write `STAR.iklim.co.full.crt` to `SWAG_CERT_DIR`. ```yaml - name: Bootstrap SWAG Certificate run: | set -a; . ./.env; set +a - echo "Waiting for SWAG container to start..." - SWAG_CTR="" - for i in $(seq 1 24); do - SWAG_CTR=$(docker ps -q -f name=iklimco_swag 2>/dev/null | head -1) - [ -n "$SWAG_CTR" ] && break - sleep 10 - done - - if [ -z "$SWAG_CTR" ]; then - echo "❌ SWAG container did not start" - exit 1 - fi - - CERT_PATH="/config/etc/letsencrypt/live/iklim.co/fullchain.pem" - echo "Waiting for cert (up to 10 min)..." 
- for i in $(seq 1 20); do - if docker exec "$SWAG_CTR" test -f "$CERT_PATH" 2>/dev/null; then - echo "✅ Cert obtained" - break - fi - echo " attempt $i/20 — waiting 30s..." - sleep 30 - done - - if ! docker exec "$SWAG_CTR" test -f "$CERT_PATH" 2>/dev/null; then - echo "❌ SWAG did not obtain cert. Logs:" - docker service logs iklimco_swag --tail 50 - exit 1 - fi - - docker exec "$SWAG_CTR" cat "$CERT_PATH" | \ - docker run --rm -i -v "${SWAG_CERT_DIR}:/output" alpine \ - sh -c "cat > /output/STAR.iklim.co.full.crt && chmod 644 /output/STAR.iklim.co.full.crt" - docker exec "$SWAG_CTR" cat "/config/etc/letsencrypt/live/iklim.co/privkey.pem" | \ - docker run --rm -i -v "${SWAG_CERT_DIR}:/output" alpine \ - sh -c "cat > /output/STAR.iklim.co_key.pem && chmod 644 /output/STAR.iklim.co_key.pem" - echo "✅ Cert bootstrapped to ${SWAG_CERT_DIR}/" + echo "Waiting for SWAG service..." + docker service ps iklimco_swag --filter 'desired-state=running' + echo "Waiting for cert-reloader output in ${SWAG_CERT_DIR}..." + docker run --rm -v "${SWAG_CERT_DIR}:/ssl:ro" alpine \ + test -f /ssl/STAR.iklim.co.full.crt working-directory: /workspace/iklim.co ``` -After this step, certificate files exist inside `SWAG_CERT_DIR` (`/mnt/storagebox/ssl/`); Vault TLS reads these files. Later renewals are handled automatically by cert-reloader. When the pipeline runs again, this step only waits for the SWAG container to be ready; certificate issuance is managed by SWAG/cert-reloader within Let's Encrypt's 90-day cycle. +After this step, certificate files exist inside `SWAG_CERT_DIR` (`/mnt/storagebox/ssl/`). `cert-distributor` syncs these files to node-local `/opt/iklimco/ssl`, where Vault reads them. Later renewals are handled automatically by SWAG, cert-reloader, and cert-distributor. 
-### Run Database Init Scripts (Step 20) +### Run Database Init Scripts (Step 22) -PostgreSQL and MongoDB init scripts run through Swarm overlay DNS service names (`postgresql`, `mongodb`): +PostgreSQL and MongoDB init scripts run after Patroni primary and MongoDB replica set readiness: ```yaml - name: Run Database Init Scripts run: | set -a; . ./.env; . ./.env.secrets.shared; set +a - echo "⏳ Waiting for PostgreSQL..." - until docker run --rm --network iklimco-net \ - -e PGPASSWORD="${DATABASE_POSTGRES_ROOT_PASSWD}" \ - postgis/postgis:18-3.6 \ - pg_isready -h postgresql -U "${DATABASE_POSTGRES_ROOT_USER}" -q 2>/dev/null; do - sleep 5 - done + PG_URI="postgresql://${DATABASE_POSTGRES_ROOT_USER}@${DATABASE_POSTGRES_HOST}/postgres?connect_timeout=5&target_session_attrs=read-write" + MONGO_URI="mongodb://${DATABASE_MONGODB_ROOT_USER}:${DATABASE_MONGODB_ROOT_PASSWD}@${DATABASE_MONGODB_HOST}/admin?${DATABASE_MONGODB_PARAMS}" for sql_file in $(ls ./init/postgresql/*.sql 2>/dev/null | sort); do echo "▶ $(basename "$sql_file")" docker run --rm -i --network iklimco-net \ -e PGPASSWORD="${DATABASE_POSTGRES_ROOT_PASSWD}" \ postgis/postgis:18-3.6 \ - psql -h postgresql -U "${DATABASE_POSTGRES_ROOT_USER}" < "$sql_file" + psql "$PG_URI" < "$sql_file" done - echo "⏳ Waiting for MongoDB..." 
- until docker run --rm --network iklimco-net mongo:8.3.2 \ - mongosh "mongodb://${DATABASE_MONGODB_ROOT_USER}:${DATABASE_MONGODB_ROOT_PASSWD}@mongodb/admin" \ - --eval "db.runCommand({ping:1})" --quiet 2>/dev/null; do - sleep 5 - done for js_file in $(ls ./init/mongodb/*.js 2>/dev/null | sort); do echo "▶ $(basename "$js_file")" - docker run --rm -i --network iklimco-net mongo:8.3.2 \ - mongosh "mongodb://${DATABASE_MONGODB_ROOT_USER}:${DATABASE_MONGODB_ROOT_PASSWD}@mongodb/admin" \ - --quiet < "$js_file" + docker run --rm -i --network iklimco-net \ + -e MONGO_INIT_URI="$MONGO_URI" "${IMAGE_MONGODB}" \ + sh -c 'cat > /tmp/init.js && mongosh "$MONGO_INIT_URI" --quiet --file /tmp/init.js' \ + < "$js_file" done echo "✅ Database init scripts completed" working-directory: /workspace/iklim.co ``` -- `postgresql` and `mongodb`: Swarm VIP service names, resolved on the `iklimco-net` overlay; Patroni primary automatic routing happens at VIP level +- `DATABASE_POSTGRES_HOST`: multi-host Patroni target; the workflow uses `target_session_attrs=read-write` to reach the primary +- `DATABASE_MONGODB_HOST`: MongoDB replica set host list - SQL files `./init/postgresql/*.sql` and JS files `./init/mongodb/*.js` are created in the `Prepare Init Files` step by the `init_postgresql`/`init_mongodb` functions in `common-functions-prod.sh` - Idempotent: `CREATE IF NOT EXISTS` / `createCollection` semantics; runs safely again on later deploys @@ -331,27 +303,19 @@ In prod, all 3 app nodes are manager + app worker, so services can be distribute ### Microservices -Each microservice has two stack files: +Prod microservice workflows do not rebuild application images. They read `deploy/prod.env`, promote the tested Harbor digest to a stable prod tag, and call `swarm_service_update` with `deploy/docker-stack-service.yml`. 
-| File | Content | Environment | -| --- | --- | --- | -| `BE-/docker-stack-service.yml` | Base definitions, `replicas: 1` | Test + Prod | -| `BE-/docker-stack-service.prod.yml` | `replicas: 3`, `max_replicas_per_node: 1` | Prod only | - -Prod deploy command: +For first deploy, `swarm_service_update` exports `SERVICE_IMAGE` and runs: ```bash -docker stack deploy \ - -c BE-/docker-stack-service.yml \ - -c BE-/docker-stack-service.prod.yml \ - iklimco +docker stack deploy --with-registry-auth -c deploy/docker-stack-service.yml iklimco ``` -`max_replicas_per_node: 1` is mandatory; without it, when the Swarm node count is lower than the replica count, Swarm places more than one replica on the same node. +For existing services it performs `docker service update` with `--update-order start-first` and `--update-failure-action rollback`. ### Infra Services -`docker-stack-infra.yml` (base) and `docker-stack-infra.prod.yml` (overlay) are deployed together. The overlay overrides services such as Vault, APISIX, RabbitMQ, and Redis Sentinel with `replicas: 3` and `max_replicas_per_node: 1`. Detail: `Environment_Infrastructure/roadmap/prod-env/03-infra-stack-changes.md`. +The current prod infra stack is `docker-stack-infra_db-prod.yml`. Vault is not inside this stack; it is deployed separately by `vault-bootstrap.sh` using `docker-stack-vault.yml`. 
#### cert-reloader and Vault Auto-Unseal @@ -360,53 +324,28 @@ The `cert-reloader` sidecar service runs as `replicas: 1` inside the infra stack Certificate renewal flow: ``` -SWAG renews the certificate -> writes it to SWAG_CONFIG_DIR (/mnt/storagebox/swag/config) +SWAG renews the certificate -> stores it inside the SWAG named volume cert-reloader detects the MD5 change - -> copies it to /mnt/storagebox/ssl/ directory (common mount on all app nodes) + -> copies it to /mnt/storagebox/ssl/ directory (StorageBox) +cert-distributor syncs it to /opt/iklimco/ssl on service nodes -> runs docker service update --force iklimco_vault Vault (3 replicas) restarts - -> each instance reads the new certificate from the /mnt/storagebox/ssl/ mount - -> healthcheck checks sealed status every 30 seconds - -> if sealed: reads from the vault_unseal_key Docker secret and automatically unseals + -> each instance reads the new certificate from /opt/iklimco/ssl + -> entrypoint retry-unseal loop reads from the vault_unseal_key Docker secret and unseals ``` -The auto-unseal mechanism is provided by the Vault healthcheck inside `docker-stack-infra.yml`: - -```yaml -healthcheck: - test: - - "CMD" - - "sh" - - "-c" - - >- - vault status -format=json 2>/dev/null | grep -q '"sealed":false' || - vault operator unseal $$(cat /run/secrets/vault_unseal_key 2>/dev/null) - interval: 30s - timeout: 10s - start_period: 15s - retries: 5 -``` - -The 3 replicas run their own healthchecks independently; all of them unseal separately. The certificate renewal -> restart -> auto-unseal chain requires no manual intervention. Detail: `roadmap/prod-env/06-cert-reloader.md`. +The 3 Vault replicas run their own retry-unseal loop independently. The certificate renewal -> distribution -> restart -> unseal chain requires no manual intervention after bootstrap. 
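The retry-unseal loop itself is not reproduced in this document. The following is only a minimal sketch of how such a loop could look, not the actual entrypoint. It assumes a hypothetical `retry_unseal` function, the unseal key mounted at `/run/secrets/vault_unseal_key`, and illustrative `RETRY_INTERVAL` and `MAX_ATTEMPTS` settings. It relies on the documented `vault status` exit codes (0 unsealed, 2 sealed, 1 error).

```bash
# Hypothetical sketch of a retry-unseal loop; names and limits are assumptions.
UNSEAL_KEY_FILE="${UNSEAL_KEY_FILE:-/run/secrets/vault_unseal_key}"
RETRY_INTERVAL="${RETRY_INTERVAL:-5}"   # seconds between attempts (illustrative)
MAX_ATTEMPTS="${MAX_ATTEMPTS:-60}"      # give up after this many attempts (illustrative)

retry_unseal() {
  attempt=1
  while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
    # Capture the exit code without tripping `set -e`:
    # 0 = unsealed, 2 = sealed, 1 = error (for example, server not up yet).
    rc=0
    vault status >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 0 ]; then
      echo "vault is unsealed"
      return 0
    fi
    # Only try to unseal when Vault reports "sealed" and the key is readable.
    if [ "$rc" -eq 2 ] && [ -r "$UNSEAL_KEY_FILE" ]; then
      vault operator unseal "$(cat "$UNSEAL_KEY_FILE")" >/dev/null 2>&1 || true
    fi
    attempt=$((attempt + 1))
    sleep "$RETRY_INTERVAL"
  done
  echo "vault still sealed after $MAX_ATTEMPTS attempts" >&2
  return 1
}
```

An entrypoint built this way would typically start the Vault server first and run `retry_unseal` in the background, so that each replica unseals itself independently after a restart.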
#### Vault Raft Configuration -Vault is defined as 3 replicas with Raft storage in the `docker-stack-infra.prod.yml` overlay: +Vault is defined as 3 replicas with Raft storage in `docker-stack-vault.yml`: ```yaml vault: - environment: - VAULT_LOCAL_CONFIG: >- - {"api_addr":"https://vault.iklim.co:8200", - "cluster_addr":"https://{{ .Node.Hostname }}:8201", - "storage":{"raft":{"path":"/vault/file","node_id":"{{ .Node.Hostname }}"}}, - "listener":[{"tcp":{"address":"0.0.0.0:8200", - "tls_cert_file":"/vault/certs/STAR.iklim.co.full.crt", - "tls_key_file":"/vault/certs/STAR.iklim.co_key.pem"}}], - "default_lease_ttl":"168h","max_lease_ttl":"720h","ui":true} volumes: - - /opt/iklimco/vault/data:/vault/file # separate host path on each node — created with Ansible - - ${SWAG_CERT_DIR}:/vault/certs:ro # StorageBox shared — all nodes see the same path + - vault-data-vl:/vault/file + - vault-logs-vl:/vault/logs + - /opt/iklimco/ssl:/vault/certs:ro deploy: mode: replicated replicas: 3 @@ -416,59 +355,37 @@ vault: - node.labels.type == service ``` -`{{ .Node.Hostname }}` is a Docker Swarm Go template; it gives each Vault instance a unique `node_id` and `cluster_addr`. Because `/opt/iklimco/vault/data` is a host path volume, it is not an overlay volume; it must be created separately on each app node during Ansible bootstrap. See `07-prod-ansible-bootstrap.md` — Node Directory Role. Detail: `roadmap/prod-env/07-vault-raft-plan.md`. +The Vault stack uses `vault-template-v2.json`, `vault_unseal_key`, and the `iklimco-net` external network. Bootstrap and unseal are handled by `init/vault/vault-bootstrap.sh`. ## Vault Raft Cluster Initial Setup -After the infra stack is deployed for the first time, the Vault Raft cluster is initialized manually once. These steps are not repeated on every deploy; they are applied only during initial setup. +Vault Raft cluster setup is no longer a manual post-deploy procedure. 
It is handled by `init/vault/vault-bootstrap.sh`, called through `init-infra-prod.sh` by the root prod workflow. ### Step 1 — Stack Deploy -```bash -docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco -``` +The bootstrap script deploys: -3 Vault containers start. The first initialized node becomes the leader. +```bash +docker stack deploy --with-registry-auth -c docker-stack-vault.yml iklimco +``` ### Step 2 — Vault Initialize (iklim-app-01) -```bash -VAULT_CTR=$(docker ps -q -f name=iklimco_vault) -docker exec -it "$VAULT_CTR" vault operator init -``` - -Store the unseal keys and root token from the output securely. Save the unseal key as a Docker secret: +The script runs `vault operator init -key-shares=1 -key-threshold=1` if Vault is not initialized. It stores bootstrap output under `/tmp/vault-bootstrap/main-vault-init.txt` during the run. ```bash -echo -n "" | docker secret create vault_unseal_key - +echo "bootstrap" | docker secret create vault_unseal_key - ``` -> After this step, the `vault_unseal_key` secret exists. During later certificate renewals, cert-reloader restarts Vault; the healthcheck reads this secret and automatically unseals, so no manual intervention is required. +Then it rotates `vault_unseal_key` to the real unseal key and unseals the leader and peers. ### Step 3 — Unseal the Leader -```bash -docker exec -it "$VAULT_CTR" vault operator unseal -``` +No manual unseal command is required in the normal path. ### Step 4 — Join the Other Nodes to the Raft Cluster -The Vault containers on `iklim-app-02` and `iklim-app-03` join the cluster: - -```bash -docker exec -it vault operator raft join \ - https://vault.iklim.co:8200 - -docker exec -it vault operator raft join \ - https://vault.iklim.co:8200 -``` - -Each node is also unsealed after it joins: - -```bash -docker exec -it vault operator unseal -docker exec -it vault operator unseal -``` +Peer join and peer unseal are handled by `vault-bootstrap.sh`. 
### Step 5 — Verify the Cluster @@ -646,20 +563,20 @@ Expected: valid JSON weather response. - `rabbitmq_erlang_cookie` appears in `docker secret ls`. - The `ssl`, `swag/config`, `swag/site-confs`, `grafana/data`, and `precipitation/images` directories exist on StorageBox; see `07-prod-ansible-bootstrap.md` — StorageBox Directory Structure. - The `template/swag/site-confs/default.conf`, `api.conf.tpl`, `apigw.conf.tpl`, `rabbitmq.conf.tpl`, and `grafana.conf.tpl` template files exist in the repo. -- StorageBox `prod/secrets/iklim.co/.env.prod` has correct values for `API_SUBDOMAIN`, `APIGW_SUBDOMAIN`, `RABBITMQ_SUBDOMAIN`, `GRAFANA_SUBDOMAIN`, `RESTRICTED_IPS`, `SWAG_CERT_DIR`, `SWAG_CONFIG_DIR`, and `SWAG_SITE_CONFS_DIR`. +- StorageBox `prod/secrets/iklim.co/.env` has correct values for `API_SUBDOMAIN`, `APIGW_SUBDOMAIN`, `RABBITMQ_SUBDOMAIN`, `GRAFANA_SUBDOMAIN`, `RESTRICTED_IPS`, `SWAG_CERT_DIR`, `SWAG_DNS_CONFIG_DIR`, `SWAG_SITE_CONFS_DIR`, and `SWAG_PROXY_CONFS_DIR`. - After the first deploy, `docker exec $(docker ps -q -f name=iklimco_swag) nginx -t` succeeds and returns `syntax is ok`. - The output of `cat /mnt/storagebox/swag/site-confs/api.conf | grep server_name` contains `server_name api.iklim.co;`. - The `ssls/1` PUT block does not exist inside `init/apisix-core/init.sh`. - The `registry.tarla.io/iklimco/custom-apisix:3.12.0` image exists in Harbor and its `config.yaml` contains `real_ip_header`, `real_ip_recursive`, and `set_real_ip_from` (covering `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`) configuration. - After the first deploy, real client IP appears in APISIX access logs, not the SWAG overlay IP: `docker exec $(docker ps -q -f name=iklimco_apisix | head -1) tail -5 /usr/local/apisix/logs/access.log` - `docker service ps iklimco_cert-reloader` shows that the service is running. 
-- `docker service ls` does not contain `iklimco_etcd`, `iklimco_postgresql`, `iklimco_mongodb`, `iklimco_pg-proxy`, or `iklimco_mongo-proxy`; they are removed by the post-deploy step in `deploy-prod.yml` (base stack services superseded by the `iklim-db` stack or deprecated in prod). +- `docker service ls` contains the current prod infra services from `docker-stack-infra_db-prod.yml` and the separate `iklimco_vault` service from `docker-stack-vault.yml`; deprecated base-stack services such as `iklimco_postgresql`, `iklimco_mongodb`, `iklimco_pg-proxy`, and `iklimco_mongo-proxy` are not present. - The output of `docker service logs iklimco_cert-reloader --tail 20` contains `[cert-reloader] started` and has no error lines. - The `notAfter` date of the Vault TLS endpoint certificate matches `/mnt/storagebox/ssl/STAR.iklim.co.full.crt`: `docker exec $(docker ps -q -f name=iklimco_vault | head -1) sh -c 'echo | openssl s_client -connect vault.iklim.co:8200 2>/dev/null | openssl x509 -noout -dates'` - `vault operator raft list-peers` returns 3 peers: 1 leader, 2 followers. - The `vault_unseal_key` Docker secret exists and appears in `docker secret ls`. - 3 Vault containers are not sealed: `docker exec $(docker ps -q -f name=iklimco_vault | head -1) vault status | grep Sealed` -> `Sealed false`. -- The first deploy pipeline successfully completes all 21 steps; the `Review Environment` step succeeds. +- The first deploy pipeline successfully completes all current root workflow steps; the `Review Environment` step succeeds. - After the `Bootstrap SWAG Certificate` step, `ls /mnt/storagebox/ssl/` -> `STAR.iklim.co.full.crt` and `STAR.iklim.co_key.pem` exist. - The `Run Database Init Scripts` step completes without error; PostgreSQL and MongoDB are healthy and init scripts are applied. - In the output of `docker service ls --filter label=project=co.iklim`, all infra services show `X/X`.