docs(infra): restructure and update infrastructure setup documentation

- Anglicized setup and facts markdown file names for better consistency.

- Updated 01-swarm-init-multinode.md to highlight Ansible automation of Swarm initialization and labeling.

- Overhauled 03-infra-stack-changes.md to describe the single monolithic file strategy and reflect current Redis, RabbitMQ, and etcd cluster configurations.

- Fixed minor overrides and typos in Patroni templates and Ansible bootstrap documents.

- Restructured README and roadmap mapping to align with the renamed setup documents.
Murat ÖZDEMİR 2026-06-15 16:42:18 +03:00
parent 1fd752526b
commit 67dc2986dd
19 changed files with 666 additions and 1188 deletions

README.md

@ -1,64 +1,111 @@
# 🌍 iklim.co Infrastructure and Server Management
# iklim.co Infrastructure and Server Management
This repository holds the Infrastructure-as-Code (IaC) assets, technical guides, and operational standards needed to build, manage, and modernize the **test** and **production** environments of the `iklim.co` project.
This repository contains the Infrastructure-as-Code assets, setup runbooks, operational facts documents, and planning notes used to provision, configure, operate, and modernize the `iklim.co` test and production environments.
Infrastructure management covers resource provisioning with Terraform on Hetzner Cloud, operating system configuration with Ansible, and building the microservice architecture on Docker Swarm.
Infrastructure management covers resource provisioning with Terraform on Hetzner Cloud, operating system and Swarm bootstrap with Ansible, and deployment of infrastructure and application services on Docker Swarm.
---
## Repository Structure
## 📂 Repository Structure and Main Sections
### Terraform (`terraform/`)
The documentation and code assets in this repository are divided into five main categories:
Terraform defines the Hetzner Cloud resources for the remote test and production environments:
### 1. 🛣️ Roadmap (`roadmap/`)
Contains the **business requirements, technical goals, and step-by-step implementation plans** needed to build the environments (test and prod) from scratch or to update the existing setup.
- Strategic documentation for major infrastructure changes (e.g. Redis Sentinel migration, APISIX configuration, RabbitMQ Quorum Queues).
- [roadmap/test-env/](./roadmap/test-env/) - Test environment requirements and plans.
- [roadmap/prod-env/](./roadmap/prod-env/) - Production environment HA (High Availability) and reliability plans.
- `terraform/hetzner/test`: test servers, network, firewall, Floating IP, placement, and outputs.
- `terraform/hetzner/prod`: production app/service nodes, DB nodes, private networking, firewalls, placement groups, Floating IP, and outputs.
### 2. 🛠️ Setup (`setup/`)
These are the **implementation documents** used to physically bring up the infrastructure. This section is used to manage:
- **Terraform:** Creating cloud resources (Server, Network, Firewall).
- **Ansible:** Operating system preparation, security hardening, Docker/Swarm installation.
- **CI/CD:** Creating and updating deployment workflows (Gitea Actions) and stack manifests.
- E.g.: [setup/06-prod-terraform-iaac.md](./setup/06-prod-terraform-iaac.md), [setup/07-prod-ansible-bootstrap.md](./setup/07-prod-ansible-bootstrap.md)
The dev environment is local and Docker Compose based; it is not provisioned by these Terraform stacks.
### 3. 🗺️ Setup vs Roadmap Matrix (`setup-vs-roadmap-map.md`)
Explains the relationship between the **Roadmap** documents, written from the requirements, and the **Setup** documents that implement those requirements technically.
- A mapping matrix that shows which roadmap step is implemented by which setup document.
- See [setup-vs-roadmap-map.md](./setup-vs-roadmap-map.md) for details.
### Ansible (`ansible/`)
### 4. 📊 Hetzner Sizing Report (`hetzner-sizing-report.md`)
Describes the **Hetzner server types, CPU/RAM capacities, and cost/performance analyses** chosen for the iklim infrastructure services (API Gateway, Microservices, Databases, Broker).
- The main reference point for capacity planning before environment setup.
- See [hetzner-sizing-report.md](./hetzner-sizing-report.md).
Ansible prepares the remote hosts after Terraform provisioning:
### 5. 💡 Facts (`facts/`)
Documents that capture the **actual current state of the system (source of truth) and the critical technical details to know** once the environment setups are complete.
- They answer the question "How does the system work right now?"
- [facts/firewall.md](./facts/firewall.md): Active firewall rules and port matrix.
- [facts/swarm-node-recovery-swag-failover.md](./facts/swarm-node-recovery-swag-failover.md): Manual intervention and recovery procedures when a node goes down.
- `ansible/test`: test bootstrap playbooks, inventory, and environment-specific variables.
- `ansible/prod`: production bootstrap playbooks, inventory, variables, and prod-specific role overrides.
- `ansible/roles`: shared roles such as `base`, `hardening`, `docker`, `swarm`, `node_dirs`, `storagebox`, `storagebox_ssh_key`, `act_runner`, and the common `db_stack`.
---
Production uses `roles_path = roles:../roles` in `ansible/prod/ansible.cfg`. As a result, prod-local roles such as `ansible/prod/roles/db_stack`, when present, take precedence over shared roles of the same name.
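The first-match-wins lookup implied by `roles_path = roles:../roles` can be sketched in shell. This is only an illustration of the search order, not how Ansible itself is implemented; the helper name `resolve_role` is hypothetical and not part of the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper (illustration only): print the first directory in a
# colon-separated roles_path that contains the named role.
resolve_role() {
  local role="$1" roles_path="$2" dir
  local IFS=':'            # split roles_path on colons
  for dir in $roles_path; do
    if [ -d "$dir/$role" ]; then
      echo "$dir/$role"
      return 0
    fi
  done
  return 1                 # role not found in any search directory
}

# Example, run from ansible/prod:
#   resolve_role db_stack "roles:../roles"
```

With a prod-local `roles/db_stack` present, the helper stops at the first entry, which mirrors why the prod override shadows the shared role.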
## 🧱 Setup Flow (Canonical Order)
### Setup Runbooks (`setup/`)
Follow this order when building an environment from scratch or making a major update:
Setup documents are the canonical implementation runbooks used to bring up environments or apply major infrastructure changes. Current files:
1. **Analysis:** Determine resource needs with [hetzner-sizing-report.md](./hetzner-sizing-report.md).
2. **Planning:** Review the relevant environment documents under `roadmap/` to understand the planned changes.
3. **Alignment:** Use [setup-vs-roadmap-map.md](./setup-vs-roadmap-map.md) to decide which setup documents to use.
4. **Implementation:** Follow the `setup/` documents in order (00 through 09) to run the Terraform and Ansible processes.
5. **Verification:** After setup, use the `facts/` documents as the reference for how the system works.
- [setup/00-general-roadmap.md](./setup/00-general-roadmap.md)
- [setup/01-private-network-port-matrix.md](./setup/01-private-network-port-matrix.md)
- [setup/02-test-terraform-iac.md](./setup/02-test-terraform-iac.md)
- [setup/03-test-ansible-bootstrap.md](./setup/03-test-ansible-bootstrap.md)
- [setup/04-test-db-docker-setup.md](./setup/04-test-db-docker-setup.md)
- [setup/05-test-runner-and-deploy-prerequisites.md](./setup/05-test-runner-and-deploy-prerequisites.md)
- [setup/06-prod-terraform-iac.md](./setup/06-prod-terraform-iac.md)
- [setup/07-prod-ansible-bootstrap.md](./setup/07-prod-ansible-bootstrap.md)
- [setup/08-prod-db-cluster-setup.md](./setup/08-prod-db-cluster-setup.md)
- [setup/09-prod-runner-ha-and-swarm.md](./setup/09-prod-runner-ha-and-swarm.md)
---
These documents explain how Terraform, Ansible, Swarm labels, StorageBox paths, runner prerequisites, DB services, and the production Swarm deploy model work together.
## ✅ Prerequisites and Tools
### Roadmap (`roadmap/`)
- **Terraform >= 1.6**: Infrastructure provisioning.
- **Ansible**: Configuration management.
- **Hetzner Cloud API Token**: Per-environment authorization.
- **SSH Key**: A key pair registered in the system for server access.
Roadmap documents describe the requirements, design goals, and migration plans for test and production changes:
---
*iklim.co Infrastructure Team - 2026*
- [roadmap/test-env/](./roadmap/test-env/)
- [roadmap/prod-env/](./roadmap/prod-env/)
Use the roadmap documents for intent and design context. Use the setup runbooks for the current implementation flow.
### Setup vs Roadmap Map
[setup-vs-roadmap-map.md](./setup-vs-roadmap-map.md) maps roadmap items to the setup documents and implementation areas that realize them.
### Facts (`facts/`)
Facts documents preserve current-state details and operational history:
- [facts/firewall.md](./facts/firewall.md): active firewall and port information.
- [facts/node-recovery-failover.md](./facts/node-recovery-failover.md): node recovery and failover procedures.
- [facts/prod-kurulum-gecmisi.md](./facts/prod-kurulum-gecmisi.md): production setup history and current production notes.
Use the facts documents to answer "how does the system work right now?", for historical context, and for post-setup verification.
### Hetzner Sizing Report
[hetzner-sizing-report.md](./hetzner-sizing-report.md) describes server sizing, CPU/RAM choices, and cost/performance trade-offs for infrastructure services, databases, brokers, and application workloads.
### Confluence Export (`confluence-wiki/`)
`confluence-wiki/` contains wiki-oriented/exported documentation material used when infrastructure notes need to be published or mirrored outside the repository.
## Current Production Model
Production currently uses a split infrastructure model:
- Main infra and DB stack: root `docker-stack-infra_db-prod.yml`.
- Vault stack: root `docker-stack-vault.yml`.
- Vault bootstrap: root `init/vault/vault-bootstrap.sh`; called through `init-infra-prod.sh` in the production deploy flow.
- Production pipeline source of truth: root `.gitea/workflows/deploy-prod.yml` and root `prod_env-ci_dc-pipeline.md`.
`docker-stack-infra_db-prod.yml` is intentionally a mixed stack:
- DB/cluster services such as Patroni/PostgreSQL, MongoDB, and etcd run on the `iklim-db-*` nodes and use host-mode cluster ports where needed.
- Service-node infrastructure services such as Redis, Redis Sentinel, and RabbitMQ run on app/service nodes matching `node.labels.type == service` and stay on the Docker overlay network unless explicitly exposed by the stack or the reverse proxy.
## Canonical Setup Flow
For a new environment or a major infrastructure update:
1. Review [hetzner-sizing-report.md](./hetzner-sizing-report.md).
2. Review the relevant `roadmap/` documents to understand the design intent.
3. Check [setup-vs-roadmap-map.md](./setup-vs-roadmap-map.md) to see which setup runbook implements each roadmap item.
4. Follow the numbered `setup/` runbooks in order for the target environment.
5. Use the `facts/` documents to verify current behavior, recovery procedures, firewall state, and production history.
## Required Tools
- Terraform `>= 1.6`
- Ansible
- Hetzner Cloud API token for the target environment
- Authorized SSH key pair for server access
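A quick pre-flight check for the tools listed above could look like the sketch below. The helper name `check_tools` and the binary names passed to it are assumptions about a typical workstation, not something defined in this repository. It only checks that each binary is on `PATH`; it does not verify the Terraform `>= 1.6` requirement, which `terraform version` can confirm separately.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helper: report whether each named CLI is on PATH.
check_tools() {
  local tool missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"        # non-zero if at least one tool is missing
}

# Binary names are assumptions about a typical install.
check_tools terraform ansible-playbook ssh || echo "install the missing tools first"
```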
## Notes
- The dev environment is local and Docker Compose based; the remote Terraform/Ansible automation targets the test and production environments.
- Test is a smaller remote environment built on single-node DB/App assumptions.
- Production is a high-availability remote environment with three app/service nodes and three DB nodes.


@ -15,7 +15,7 @@ etcd3:
- etcd-02:2379
- etcd-03:2379
username: root
password: "{{ vault_etcd_root_password }}"
password: "${ETCD_ROOT_PASSWORD}"
bootstrap:
dcs:


@ -1,4 +1,4 @@
# Docker Swarm Node Recovery
# Test — Docker Swarm Node Recovery
The test environment has a single manager (`iklim-app-01`) and a single worker (`iklim-db-01`). The recovery procedure differs depending on which node was rebuilt.
@ -32,17 +32,19 @@ DB data is preserved in the named volumes on `iklim-db-01`; nothing is lost.
The new `iklim-db-01` starts unaware of the Swarm (`inactive`). The manager (`iklim-app-01`) still holds the old dead node record.
> ⚠️ **Data loss:** Because `iklim-db-01` was rebuilt, all named volumes have been deleted. A restore from backup is mandatory before step 3.
### Solution
```bash
# 1. Ansible bootstrap: the new node joins automatically
cd ansible/test
ansible-playbook -i inventory/generated/test.yml test-bootstrap.yml --ask-vault-pass
# 2. On iklim-app-01: remove the old dead node record
# 1. On iklim-app-01: remove the old dead node record (must be done BEFORE bootstrap)
docker node ls # find the old node ID
docker node rm <eski-node-id>
# 2. Ansible bootstrap: the new node joins automatically
cd ansible/test
ansible-playbook -i inventory/generated/test.yml test-bootstrap.yml --ask-vault-pass
# 3. Redeploy the DB stack (after restoring from backup)
ansible-playbook -i inventory/generated/test.yml test-db-post-stack.yml --ask-vault-pass
```
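Finding the stale node ID for the `docker node rm` step can be scripted. The sketch below is a hypothetical helper, not part of the repository; it filters lines in the shape produced by `docker node ls --format '{{.ID}} {{.Hostname}} {{.Status}}'`. The node IDs in the sample input are made up.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "ID HOSTNAME STATUS" lines on stdin and print the
# IDs of entries for the given hostname whose status is Down.
stale_node_ids() {
  local host="$1"
  awk -v host="$host" '$2 == host && $3 == "Down" { print $1 }'
}

# Sample input with made-up node IDs, in the shape produced by:
#   docker node ls --format '{{.ID}} {{.Hostname}} {{.Status}}'
printf '%s\n' \
  'abc123 iklim-app-01 Ready' \
  'def456 iklim-db-01 Down' \
  'ghi789 iklim-db-01 Ready' | stale_node_ids iklim-db-01
# prints: def456
```

On the manager, the real `docker node ls` output would be piped into the helper instead of the sample lines. Reviewing the printed IDs before passing them to `docker node rm` is the safer habit.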
@ -68,7 +70,7 @@ ansible-playbook -i inventory/generated/test.yml test-db-post-stack.yml --ask-va
| Scenario | Manual Step | Is Ansible Enough? |
|---|---|---|
| Manager (`iklim-app-01`) dies | `docker swarm leave --force` (on the worker) | Yes, afterwards |
| Worker (`iklim-db-01`) dies | `docker node rm <id>` (on the manager) | Largely yes |
| Worker (`iklim-db-01`) dies | `docker node rm <id>` (on the manager, before bootstrap) | No; a backup restore is required |
| Both die | None | Yes |
## Why Prod Does Not Have This Problem
@ -81,6 +83,8 @@ Prod ortamında birden fazla manager node (en az 3) çalıştırılır. Tek mana
SWAG, cert-reloader, Prometheus, and Grafana are not cluster-native (replicated); each always runs as a single instance and is pinned by default to `iklim-app-01` (the Floating IP node). When `iklim-app-01` fails, these services stop, and DNS/HTTPS access and monitoring are interrupted. Swarm quorum continues with 2 managers; the microservices and Vault are rescheduled to other nodes.
`cert-distributor` is the exception to this rule: with `mode: global` it runs on every node where `node.labels.type == service`, and copies the certificate from StorageBox to the node-local `/opt/iklimco/ssl` (because of the Vault FUSE mount restriction). When `iklim-app-01` goes down, the `cert-distributor` instances on the other nodes keep running, so no failover is needed.
All data and configuration for these services are kept on StorageBox:
- **SWAG:** `/mnt/storagebox/swag/config`
- **SSL:** `/mnt/storagebox/ssl`
@ -91,12 +95,12 @@ All data and configuration for these services are kept on StorageBox:
### 1. Move the Services to Another Node
SWAG and cert-reloader must be moved together. Prometheus and Grafana can be moved independently or at the same time.
SWAG and cert-reloader must be moved together. Prometheus and Grafana can be moved independently or at the same time. `cert-distributor` runs in global mode, so it does not need to be moved.
```bash
# On iklim-app-02 or iklim-app-03 (an active manager):
# Move SWAG & Cert-Reloader
# Move SWAG & Cert-Reloader (replicas=1, so expect a short outage during the move)
docker service update --constraint-add "node.hostname == iklim-app-02" --constraint-rm "node.hostname == iklim-app-01" iklimco_swag
docker service update --constraint-add "node.hostname == iklim-app-02" --constraint-rm "node.hostname == iklim-app-01" iklimco_cert-reloader
@ -121,8 +125,12 @@ hcloud floating-ip assign <floating-ip-id> <iklim-app-02-server-id>
4. Open the **⋮** (three dots) menu to the right of the `iklim-prod-app-fip` row → **Reassign**.
5. Select **`iklim-app-02`** from the list that opens → click the **Reassign** button.
> **Note:** After the Floating IP is reassigned in the Hetzner panel, it must also be active on the network interface of `iklim-app-02`. If the Ansible bootstrap handles this configuration, it is automatic; to be sure, verify with `ip addr show` that the Floating IP is bound.
### 3. Verify
SWAG startup and the certificate check can take a few seconds; the first `curl` may fail even though the service shows `Running`. Wait a few seconds and try again.
```bash
docker service ls | grep -E 'swag|cert-reloader|prometheus|grafana'
curl -si https://api.iklim.co/health
@ -133,6 +141,9 @@ curl -si https://api.iklim.co/health
After the node rejoins the Swarm, you can move all services back to `iklim-app-01` and reassign the Floating IP.
```bash
# First verify that the node has really joined the Swarm (STATUS must be Ready)
docker node ls
# Move the services back
for svc in iklimco_swag iklimco_cert-reloader iklimco_prometheus iklimco_grafana; do
docker service update --constraint-add "node.hostname == iklim-app-01" --constraint-rm "node.hostname == iklim-app-02" $svc
@ -149,5 +160,62 @@ hcloud floating-ip assign <floating-ip-id> <iklim-app-01-server-id>
| Swarm quorum | Automatic: 2 managers are enough |
| Vault, microservices | Automatic: rescheduled to another node via the `node.labels.type == service` constraint |
| SWAG, cert-reloader | Manual: `docker service update --constraint-*` + Floating IP move |
| cert-distributor | Automatic: `mode: global`, already running on all service nodes |
| Prometheus, Grafana | Manual: `docker service update --constraint-*` |
| Data & Config | On StorageBox; the failover node has immediate access, no data loss |
---
# Prod — DB Node Recovery
Each DB node (`iklim-db-01`, `iklim-db-02`, `iklim-db-03`) hosts the same trio of services:
| Node | Services |
|------|-----------|
| `iklim-db-01` | `etcd-01`, `patroni-01`, `mongodb-01` |
| `iklim-db-02` | `etcd-02`, `patroni-02`, `mongodb-02` |
| `iklim-db-03` | `etcd-03`, `patroni-03`, `mongodb-03` |
## Scenario A: Node Fails Temporarily (Volumes Preserved)
etcd, Patroni, and MongoDB are all 3-member HA clusters; 2 nodes are enough for quorum.
| Service | Impact | Automatic Recovery |
|--------|------|-------------------|
| etcd | Quorum continues with 2/3 nodes | Rejoins the cluster automatically when the node returns |
| Patroni | If a replica goes down, the primary continues; if the primary goes down, a new primary is elected via etcd | Rejoins automatically as a replica when the node returns |
| MongoDB | Quorum continues with 2/3 nodes; a new primary is elected if needed | Catches up from the primary when the node returns (from the oplog, or by initial sync if it has fallen too far behind) |
**No manual step is required.** Docker Swarm restarts the services automatically (`restart_policy: on-failure`).
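The "2 of 3" figure follows from the majority rule used by etcd and MongoDB replica sets. As a small worked sketch (the function names are illustrative, not from the repository):

```bash
#!/usr/bin/env bash
# Majority quorum for an n-member cluster: floor(n/2) + 1.
quorum_size() { echo $(( $1 / 2 + 1 )); }
# Members that can be lost while a majority remains: floor((n-1)/2).
tolerated_failures() { echo $(( ($1 - 1) / 2 )); }

quorum_size 3          # prints: 2
tolerated_failures 3   # prints: 1
```

A 3-member cluster therefore survives the loss of one DB node, but not two at the same time.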
## Scenario B: Node Is Rebuilt (Volumes Deleted)
The etcd named volumes are node-local, so they are lost when the node is rebuilt. Patroni and MongoDB heal themselves; etcd needs manual intervention.
```bash
# From an active etcd container: remove the old member from the cluster
docker exec -it $(docker ps -q -f name=iklimco_etcd-01) \
  etcdctl member list --endpoints=http://etcd-01:2379,http://etcd-02:2379,http://etcd-03:2379
# Take the <member-id> of the rebuilt node from the output:
docker exec -it $(docker ps -q -f name=iklimco_etcd-01) \
  etcdctl member remove <member-id> --endpoints=http://etcd-01:2379,http://etcd-02:2379,http://etcd-03:2379
# Restart the services; replace N with the index of the rebuilt node
# (etcd joins the existing cluster with an empty volume;
# Patroni clones automatically from the primary with pg_basebackup;
# MongoDB runs an automatic initial sync from the primary if the hostname is unchanged)
docker service update --force iklimco_etcd-0N
docker service update --force iklimco_patroni-0N
docker service update --force iklimco_mongodb-0N
```
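Picking the `<member-id>` out of the `etcdctl member list` output can be scripted. The helper below is a hypothetical sketch that assumes the default simple output format (`ID, STATUS, NAME, PEER-URLS, CLIENT-URLS, IS-LEARNER`) and assumes the etcd member names match the service names (`etcd-01` to `etcd-03`); the member ID in the sample line is made up.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `etcdctl member list` output (default simple
# format: "ID, STATUS, NAME, PEER-URLS, CLIENT-URLS, IS-LEARNER") on stdin
# and print the ID of the member with the given name.
etcd_member_id() {
  local name="$1"
  awk -F', ' -v name="$name" '$3 == name { print $1 }'
}

# Sample line with a made-up member ID; the member name etcd-02 is an assumption.
echo '8e9e05c52164694d, started, etcd-02, http://etcd-02:2380, http://etcd-02:2379, false' \
  | etcd_member_id etcd-02
# prints: 8e9e05c52164694d
```

In practice the real `member list` output from a healthy etcd container would be piped into the helper, and the printed ID checked by eye before it is passed to `etcdctl member remove`.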
> **If the MongoDB hostname changes:** The replica set configuration keeps the old hostname. In `mongosh`, run `rs.remove("<eski-host>:27017")` followed by `rs.add("<yeni-host>:27017")`.
> **etcd `ETCD_INITIAL_CLUSTER_STATE`:** It is set to `new` in the stack file (for the initial setup). In the rebuild scenario, when the Swarm service is updated with `--force`, etcd starts with an empty volume and tries to join the current cluster in `existing` mode. The Bitnami etcd image detects this automatically; if problems occur, temporarily set the `ETCD_INITIAL_CLUSTER_STATE` of the affected node to `existing` in the stack file and redeploy, then revert it.
## Summary
| Service | Temporary failure | Rebuild |
|--------|-------------|-----------------|
| etcd | Automatic | Manual: `member remove` → `service update --force` |
| Patroni | Automatic | Automatic: clones the primary into an empty dir |
| MongoDB | Automatic | Automatic (same hostname); if the hostname changes, `rs.remove` + `rs.add` |


@ -2,6 +2,11 @@
Prod setup steps and current structure.
This file preserves the setup history. The primary source for the current prod
deploy flow is the `prod_env-ci_dc-pipeline.md` file in the repo root. The manual
deploy steps below are kept as a record of the initial setup and troubleshooting;
normal prod deploys now run through the root `.gitea/workflows/deploy-prod.yml`.
## Terraform
### Hetzner Cloud Configuration
@ -166,7 +171,27 @@ ansible-playbook prod-bootstrap.yml \
--vault-password-file=../.vault_pass
```
## DB Stack Deploy
## Current Production Deploy Sources
| Area | Current source |
| --- | --- |
| Root prod workflow | `.gitea/workflows/deploy-prod.yml` |
| Detailed CI/CD document | `prod_env-ci_dc-pipeline.md` |
| Main infra stack | `docker-stack-infra_db-prod.yml` |
| Vault HA stack | `docker-stack-vault.yml` |
| Vault bootstrap script | `init/vault/vault-bootstrap.sh` |
| Prod env and secret files | `prod/secrets/iklim.co/.env`, `.env.secrets.*` |
In the current layout the old stack files with the `.deleted` suffix no longer
exist and are not considered in the prod flow. The main infra stack is
`docker-stack-infra_db-prod.yml`. The Vault stack is not part of that file; it is
deployed by `vault-bootstrap.sh` using `docker-stack-vault.yml`.
## Historical Manual DB Stack Deploy (2026-05-21)
This section is kept to preserve the history of the first prod DB/infra setup. In
the current normal flow these steps are not run by hand; the root prod workflow
manages the main stack deploy, Vault bootstrap, MongoDB replica set init, and DB init scripts.
### Custom Image Build
@ -174,6 +199,9 @@ ansible-playbook prod-bootstrap.yml \
### Stack Deploy
Historical note: The `docker-stack-db-prod.yml` in this command block is no longer
the current stack file. The current main stack is `docker-stack-infra_db-prod.yml`.
```bash
# Local → app-01
scp ./docker-stack-* root@178.104.210.41:/home/iklim/
@ -198,6 +226,10 @@ history -c && history -w
### MongoDB Replica Set Init
Historical note: In the initial setup `rs.initiate` was issued by hand. In the
current root prod workflow, the `Initialize MongoDB Replica Set` step runs
`rs.initiate()` if the replica set does not exist, and `rs.add()` on the primary if a member is missing.
```bash
ssh root@<db-01-ip>
@ -242,10 +274,10 @@ history -c && history -w
curl -s http://10.20.20.11:8008/cluster | python3 -m json.tool
```
## Current State (2026-05-21)
## Historical State (2026-05-21)
| Step | Status |
| --- | --- |
| ------------------------------------------------------- | ---------- |
| Terraform: 6 servers, network, firewall, floating IP | ✅ |
| Ansible base + hardening + docker + node_dirs | ✅ |
| Ansible storagebox + storagebox_ssh_key | ✅ |
@ -256,12 +288,52 @@ curl -s http://10.20.20.11:8008/cluster | python3 -m json.tool
| DB stack deploy (etcd + MongoDB + Patroni) | ✅ |
| MongoDB replica set init (rs0: 1 primary, 2 secondary) | ✅ |
| Patroni HA cluster (1 leader, 2 replica, lag=0) | ✅ |
| Main infra stack deploy (docker-stack-infra_db-prod.yml) | ⏳ pending |
| MongoDB rs.initiate (by hand after the first deploy) | ⏳ pending |
| Main infra stack deploy (docker-stack-infra_db-prod.yml) | |
| MongoDB rs.initiate (by hand after the first deploy) | ✅ |
| Deploy pipeline first run | ⏳ pending |
## Current State (2026-06-15)
| Area | Current state |
| --- | --- |
| Prod deploy source document | `prod_env-ci_dc-pipeline.md` |
| Root prod workflow | `.gitea/workflows/deploy-prod.yml` |
| Main infra stack | `docker-stack-infra_db-prod.yml` |
| Vault HA stack | `docker-stack-vault.yml` |
| Vault deploy method | Bootstrapped/deployed by `init/vault/vault-bootstrap.sh` |
| Old `.deleted` stack files | Deleted, not in the current flow |
| Prod env file | StorageBox `prod/secrets/iklim.co/.env` -> workflow workspace `./.env` |
| Shared secrets | StorageBox `prod/secrets/iklim.co/.env.secrets.shared` |
| Service secrets | StorageBox `prod/secrets/iklim.co/.env.secrets.<svc>` |
| SWAG secrets | StorageBox `prod/secrets/iklim.co/.env.secrets.swag` |
| MongoDB replica set init | Managed as an automatic/idempotent step in the workflow |
| PostgreSQL init | Runs with `./init/postgresql/*.sql` after waiting for the Patroni primary |
| MongoDB init | Runs with `./init/mongodb/*.js` after the replica set is ready |
| DNS update | The workflow updates the `api`, `apigw`, `rabbitmq`, `grafana` A records via the GoDaddy API |
The current prod workflow broadly follows this order:
1. `.env`, `.env.secrets.shared`, the service secret files, and `.env.secrets.swag` are fetched from StorageBox.
2. PostgreSQL and MongoDB init templates are rendered under `./init/postgresql` and `./init/mongodb`.
3. Harbor pull login is performed.
4. SWAG DNS/site config files are prepared.
5. A temporary TLS placeholder cert is created for Vault if needed.
6. The `rabbitmq_erlang_cookie` Docker secret is created, or kept if it already exists.
7. `docker-stack-infra_db-prod.yml` is deployed to the `iklimco` stack.
8. The runner job container is attached to the `iklimco-net` overlay network.
9. `init-infra-prod.sh` runs; this script performs the Vault bootstrap and the RabbitMQ prod preparation.
10. Vault AppRole ID/Secret ID values and Docker secrets are generated.
11. The updated `.env` and `.env.secrets.*` files are uploaded to StorageBox.
12. etcd, APISIX, the SWAG certificate, the MongoDB replica set, the DB init scripts, and the DNS records are verified/updated.
## Important Architecture Notes
### Main Infra Stack and Vault Separation (2026-06-15)
Currently the main infra stack is `docker-stack-infra_db-prod.yml`. This stack contains the Redis master/replica/sentinel, RabbitMQ cluster, APISIX, APISIX Dashboard, Prometheus, Grafana, SWAG, cert-reloader, cert-distributor, etcd, Patroni, and MongoDB replica set services.
Vault is not part of the main infra stack. The Vault HA cluster is deployed from `docker-stack-vault.yml` by `init/vault/vault-bootstrap.sh`. The bootstrap flow creates a placeholder `vault_unseal_key`, deploys the `iklimco_vault` service, performs the Vault init/unseal, and rotates the Docker secret to the real unseal key.
### Single Stack Approach (2026-05-26)
`docker-stack-infra-prod.yml` and `docker-stack-db-prod.yml` were merged into a single file: `docker-stack-infra_db-prod.yml`. Because both files were deployed to the same `iklimco` stack name, the service names did not change.
@ -270,7 +342,9 @@ curl -s http://10.20.20.11:8008/cluster | python3 -m json.tool
**Network:** `iklimco-net` is now created by the stack (MTU=1400, attachable). The network creation task in the Ansible `swarm` role was removed.
**MongoDB rs.initiate:** After the first deploy `rs.initiate` must be issued by hand (see the DB Stack Deploy section).
**MongoDB rs.initiate:** This note belongs to the initial setup period. The current prod workflow
handles `rs.initiate()` and, when needed, `rs.add()` in the
`Initialize MongoDB Replica Set` step.
**If the network is deleted:** Redeploy the stack: `docker stack deploy -c docker-stack-infra_db-prod.yml iklimco`
@ -278,6 +352,11 @@ curl -s http://10.20.20.11:8008/cluster | python3 -m json.tool
`iklimco_vault` (the Swarm service name) is used as `retry_join.leader_api_addr`. Thanks to the stack-owned network, Docker DNS registers this VIP. With `leader_tls_server_name: vault.iklim.co`, the `*.iklim.co` certificate passes TLS verification.
In the current Vault deploy flow this setting is applied through `docker-stack-vault.yml`
and the Vault template files. The Vault stack deploy is not done directly in the root
workflow but through the chain `init-infra-prod.sh` -> `init/vault/init-prod.sh` ->
`init/vault/vault-bootstrap.sh`.
### Runner / iklimco-net (2026-05-26)
The act runner config uses `container.network: "bridge"` (previously `iklimco-net`). In the workflow, the "Connect Runner to Overlay Network" step was moved after "Deploy Swarm Stacks", so the runner job container can attach to the `iklimco-net` network created by the stack.


@ -41,6 +41,9 @@ This scheme is applied consistently across `docker-stack-infra.yml` and all 10 m
`node.role == worker` is intentionally not used anywhere. DB nodes are Swarm workers, but targeting them via `node.role == worker` would also match any future worker-only app nodes. The explicit `node.labels.role == db` label provides precise, unambiguous targeting regardless of Swarm role.
## Automation Note
**IMPORTANT:** All of the Swarm initialization, join token handling, and node labeling steps listed below are no longer performed manually. They are run **fully automatically** by `Environment_Infrastructure/ansible/prod/prod-bootstrap.yml` and the shared `swarm` role. The manual bash commands here are kept only for reference, information, and troubleshooting.
## Step 1 — Init Swarm on iklim-app-01 (the prod-runner node)
```bash
@ -102,7 +105,7 @@ docker node update --label-add role=db --label-add db-index=03 iklim-db-03
> DB nodes are Swarm **workers** only — they never become managers.
> DB services are pinned to them via `node.labels.role == db` placement constraint.
> See `08-prod-db-cluster-kurulum.md` for DB stack deployment.
> See `08-prod-db-cluster-setup.md` for DB stack deployment.
## Step 6 — Verify


@ -60,7 +60,7 @@ To get the Floating IP: `terraform output prod_floating_ip`
Logic: for each record, pipeline queries the current value via GoDaddy API. If already correct, it skips. Otherwise it creates/updates the record.
> The Floating IP is assigned to `iklim-app-01` (`06-prod-terraform-iaac.md` — `floating_ip.tf`).
> The Floating IP is assigned to `iklim-app-01` (`06-prod-terraform-iac.md` — `floating_ip.tf`).
> If failover is needed, the Floating IP can be reassigned to another app node; DNS does not change.
## Notes


@ -1,702 +1,75 @@
# 03 — docker-stack-infra.yml Changes (Prod)
# 03 — Production Infrastructure and DB Stack Model
## Context
### File strategy — overlay approach
This document records the production infrastructure target that is now implemented by the current setup runbooks. The execution source is no longer the old base-plus-prod overlay model.
Prod-specific service changes are **not written directly** into `docker-stack-infra.yml`; they are kept in a separate overlay file:
Current references:
| File | Usage |
|------|-------|
| `docker-stack-infra.yml` | Base — works as-is for test |
| `docker-stack-infra.prod.yml` | Prod overlay — additional services and overrides |
- Setup source: `../../setup/08-prod-db-cluster-setup.md` and `../../setup/09-prod-runner-ha-and-swarm.md`
- Main infra and DB stack: root `docker-stack-infra_db-prod.yml`
- Vault stack: root `docker-stack-vault.yml`
- Vault bootstrap: root `init/vault/vault-bootstrap.sh`, called through `init-infra-prod.sh`
```bash
# Test deploy:
docker stack deploy -c docker-stack-infra.yml iklimco
## Current Stack Strategy
# Prod deploy (Swarm merges both files):
docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco
```
Production uses a split stack model:
Docker Swarm merge rule: if the same service name appears in both files, the overlay wins (deploy, environment, etc.); services only present in the overlay are added.
- `docker-stack-infra_db-prod.yml`: APISIX, APISIX Dashboard, SWAG, cert services, Redis/Sentinel, RabbitMQ, Prometheus, Grafana, Patroni/PostgreSQL, MongoDB, and etcd.
- `docker-stack-vault.yml`: Vault Raft cluster only.
### Prod-specific changes summary
- APISIX: 1 → 3 replicas (overlay override)
- Redis: single-instance → Sentinel cluster — 1 master + 2 replicas + 3 sentinels (overlay adds new services)
- RabbitMQ: 1 → 3-node Erlang cluster (overlay override + env)
- Vault: 1 → 3-node Raft cluster (overlay override) — see `07-vault-raft-plan.md`
- No separate APISIX etcd: Patroni etcd is shared (`/apisix` prefix)
- `init/apisix-core/init.sh`: when `PROFILE=prod`, rate limit `policy:local` → `policy:redis`
The previous `docker-stack-infra.yml` + `docker-stack-infra.prod.yml` overlay strategy is superseded for production. Do not create or deploy `docker-stack-infra.prod.yml` for the current prod environment.
### swag-vl volume — not used in prod, not defined in overlay
## Placement Boundary
Test-env Step 9 adds the `swag-vl` named volume to the base file. In prod, SWAG mounts to the StorageBox via the `${SWAG_CONFIG_DIR}` env var, so this volume is unused by any service. No need to remove it in the overlay — Swarm does not create unused volume definitions, it remains harmless.
`docker-stack-infra_db-prod.yml` is intentionally a mixed stack. The placement model is the important boundary:
No `swag-vl` definition is made in `docker-stack-infra.prod.yml`.
- DB/cluster services run on `iklim-db-*`: Patroni/PostgreSQL, MongoDB, and etcd.
- App/service-node infrastructure runs on `iklim-app-*` with `node.labels.type == service`: Redis, Redis Sentinel, RabbitMQ, APISIX, APISIX Dashboard, SWAG, cert-reloader/cert-distributor, Prometheus, and Grafana.
- Redis and RabbitMQ are not DB-node host-mode services. They stay on the overlay network unless explicitly exposed by the stack or SWAG/APISIX.
### Monitoring Persistence
DB services that require direct cluster traffic publish host-mode ports where the current stack defines them. Redis and RabbitMQ must not be changed to host-mode just because they live in the same stack file.
Prometheus and Grafana run as single instances, but their storage profiles are different:
- **Prometheus:** keep TSDB on a local Docker volume (`prometheus-vl`). Prometheus local storage should not run on StorageBox/DAVFS because of filesystem semantics and WAL/compaction I/O.
- **Grafana:** keep `/var/lib/grafana` on StorageBox (`/mnt/storagebox/grafana/data`) so dashboards, plugins, and the SQLite database are available if the single active instance is manually moved to another node.
## Current Production Services
Grafana uses the `GRAFANA_DATA_DIR` env var with a named-volume fallback for test. Prometheus continues to use the named Docker volume. See Step 9 for implementation details.
| Area | Current model |
| --- | --- |
| APISIX | 3 replicas on service nodes; config stored in etcd with `/apisix` prefix |
| Redis | Sentinel model on service nodes; overlay-only |
| RabbitMQ | 3-node service-node cluster; management exposed through SWAG, restricted by IP |
| Vault | Separate 3-node Raft stack via `docker-stack-vault.yml` |
| PostgreSQL | 3-node Patroni cluster on DB nodes |
| MongoDB | 3-node replica set on DB nodes |
| etcd | 3-node cluster on DB nodes, shared by Patroni and APISIX |
| Prometheus | Single instance; local Docker volume |
| Grafana | Single instance; StorageBox-backed data path |
**Note:** PostgreSQL and MongoDB are not in `docker-stack-infra.yml`. See `08-prod-db-cluster-kurulum.md`.
## Monitoring Persistence
## Step 1 — Apply all test-env changes first
Prometheus TSDB remains on a local Docker volume because StorageBox/DAVFS is not suitable for Prometheus WAL and compaction I/O.
Follow every step in `test-env/03-infra-stack-changes.md`:
- Add `swag` service
- Add `cert-reloader` service
- Remove published ports for vault, apisix, rabbitmq, prometheus, grafana, apisix-dashboard
- Add `swag-vl` volume
Grafana uses `/mnt/storagebox/grafana/data` through `GRAFANA_DATA_DIR` so dashboards, plugins, and the SQLite database survive manual service movement between service nodes.
## Step 2 — Vault: 3-node Raft cluster (prod)
## APISIX and etcd
Vault starts directly with 3 replicas; the Phase 1 single-instance stage is skipped in prod.
See `07-vault-raft-plan.md` Phase 2 for detailed setup steps.
APISIX uses the DB-node etcd cluster through overlay DNS aliases such as `etcd-01`, `etcd-02`, and `etcd-03`. Patroni and APISIX use different etcd prefixes, so their data does not collide.
```yaml
vault:
  deploy:
    mode: replicated
    replicas: 3
    placement:
      max_replicas_per_node: 1
      constraints:
        - node.labels.type == service
```
The app subnet to DB subnet firewall rule for etcd client traffic is part of the current production firewall model. See `../../setup/06-prod-terraform-iac.md`.
## Step 3 — APISIX: 3 replicas + init.sh rate limit update (prod overlay)
## Redis and RabbitMQ
Add to `docker-stack-infra.prod.yml`:
Redis/Sentinel and RabbitMQ are service-node infrastructure. Their placement follows `node.labels.type == service`.
```yaml
# docker-stack-infra.prod.yml
services:
  apisix:
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
RabbitMQ-related private firewall rules belong to the app/service-node firewall model. Redis and Sentinel do not publish host-mode ports in the current prod stack and do not require Hetzner firewall openings.
  apisix-dashboard:
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
```
## Historical / Superseded by Setup
APISIX and apisix-dashboard are stateless (config lives in Patroni etcd) — 3 replicas is safe.
Swarm distributes SWAG requests to APISIX replicas via VIP (IPVS round-robin).
The following earlier roadmap ideas are retained only as historical context:
### init.sh — rate limit policy:redis (prod)
- Creating `docker-stack-infra.prod.yml` as a prod overlay.
- Deploying prod with `docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco`.
- Keeping Vault inside the prod infra overlay with `/opt/iklimco/vault/data` host-path storage.
- Treating PostgreSQL/MongoDB as separate DB stacks such as `docker-stack-db.prod.yml`.
- Validating a prod merge with `docker stack config -c docker-stack-infra.yml -c docker-stack-infra.prod.yml`.
With `policy:local`, each APISIX instance counts independently → the global limit effectively becomes 3× with 3 replicas.
Switch to `policy:redis` for `PROFILE=prod`.
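A rough sketch of the arithmetic behind this, assuming the worst case where one client's requests are spread across every replica. The function name `effective_limit` is illustrative and is not part of `init.sh`:

```bash
#!/usr/bin/env bash
# Worst-case number of requests one client can make per time window.
# With policy "local" every replica keeps its own counter, so the
# configured count is multiplied by the replica count. With "redis"
# the counter is shared, so the configured count holds cluster-wide.
effective_limit() {
  local count=$1 replicas=$2 policy=$3
  if [[ "$policy" == "local" ]]; then
    echo $(( count * replicas ))
  else
    echo "$count"
  fi
}

effective_limit 60 3 local   # prints 180
effective_limit 60 3 redis   # prints 60
```

With the global rule's `count: 60`, three replicas on `policy:local` could let a single address through up to 180 times per window, which is why the shared counter matters in prod.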
Keep the following APISIX plugin limits in `init/apisix-core/init.sh` for `test/prod` unless stated otherwise:
| Scope | Plugin | Target limit |
|-------|--------|--------------|
| WebSocket `/ws` | `limit-conn` | `conn: 5` per `remote_addr` |
| Auth routes `/v1/auth/*`, `/v1/users/*` | `limit-count` | `count: 12`, `time_window: 60` per `remote_addr` |
| Global rule | `limit-count` | `count: 60`, `time_window: 60` per `remote_addr` |
Update the rate limit and connection limit blocks in `init/apisix-core/init.sh`.
**1. Define threshold constants at the script header:**
```bash
GLOBAL_LIMIT_COUNT=60
GLOBAL_LIMIT_WINDOW=60
AUTH_LIMIT_COUNT=12
AUTH_LIMIT_WINDOW=60
WS_LIMIT_CONN=5
```
**2. Update WebSocket route plugins (test/prod):**
```bash
if [[ "$PROFILE" != "dev" ]]; then
  WS_PLUGINS=',"plugins":{"limit-conn":{"conn":'"$WS_LIMIT_CONN"',"burst":2,"default_conn_delay":0.1,"key":"remote_addr","key_type":"var","rejected_code":429}}'
else
  WS_PLUGINS=""
fi
```
**3. Update Auth route plugins (test/prod):**
```bash
if [[ "$PROFILE" != "dev" ]]; then
  AUTH_LIMIT=',"plugins":{"limit-count":{"count":'"$AUTH_LIMIT_COUNT"',"time_window":'"$AUTH_LIMIT_WINDOW"',"key_type":"var","key":"remote_addr","rejected_code":429,"policy":"local"}}'
else
  AUTH_LIMIT=""
fi
```
**4. Update Global rate limit rule (test/prod):**
```bash
if [[ "$PROFILE" != "dev" ]]; then
  if [[ "$PROFILE" == "prod" ]]; then
    RATE_POLICY="redis"
    RATE_REDIS=',"redis_host":"redis","redis_port":6379,"redis_password":"'"$REDIS_PASSWORD"'"'
  else
    RATE_POLICY="local"
    RATE_REDIS=""
  fi
  call_api "global rate limit" -X PUT "$APISIX_ADMIN_URL/global_rules/1" \
    -H "X-API-KEY: $API_KEY" -H "Content-Type: application/json" \
    -d '{"plugins":{"limit-count":{"count":'"$GLOBAL_LIMIT_COUNT"',"time_window":'"$GLOBAL_LIMIT_WINDOW"',"key_type":"var","key":"remote_addr","rejected_code":429,"policy":"'"$RATE_POLICY"'","allow_degradation":true'"$RATE_REDIS"'}}}'
fi
```
> APISIX's `limit-count` plugin does not natively support Redis Sentinel; `policy:redis` works with a single endpoint.
> The `redis` service name stays constant within Swarm overlay DNS. `allow_degradation: true` ensures that if Redis is
> temporarily unreachable (e.g. Sentinel failover ~10-30 s, or master rescheduling), APISIX passes requests through
> instead of returning errors — rate limiting is briefly suspended but API access is unaffected.
> Microservices use Spring Data Redis Sentinel natively and are unaffected by master changes.
> Docker Swarm has no inter-service anti-affinity; the `redis` master placement relies on Swarm's spread strategy
> to avoid co-locating with a replica. This is a known limitation — accepted in favour of operational simplicity.
## Step 4 — etcd: Separate APISIX etcd removed — Patroni etcd shared
The standalone `etcd` service in `docker-stack-infra.yml` is **not used in prod and must be disabled** by setting `replicas: 0` in the prod overlay.
APISIX uses the 3-node Patroni etcd cluster running on DB nodes, via the `/apisix` prefix.
### Why consolidated?
- A standalone single-instance etcd was a SPOF for APISIX.
- Patroni etcd is already 3-node HA — APISIX gets a more reliable config store.
- etcd supports prefix-based namespacing; Patroni uses `/service/`, APISIX uses `/apisix/` — no collision.
### APISIX etcd connection configuration
Update the etcd endpoints in the APISIX service in `docker-stack-infra.yml` to point to DB nodes:
```yaml
apisix:
  environment:
    APISIX_STAND_ALONE: "false"
    # via apisix/conf/config.yaml or environment:
    # etcd:
    #   host:
    #     - "http://etcd-01:2379"
    #     - "http://etcd-02:2379"
    #     - "http://etcd-03:2379"
    #   prefix: "/apisix"
```
The preferred method is mounting `config.yaml` via a Docker config or volume. etcd endpoints use **overlay DNS aliases** defined in `docker-stack-db.prod.yml` (`etcd-01`, `etcd-02`, `etcd-03`), which are reachable from app nodes via the `iklimco-net` overlay:
```yaml
# config/apisix/config.yaml
etcd:
  host:
    - "http://etcd-01:2379"
    - "http://etcd-02:2379"
    - "http://etcd-03:2379"
  prefix: "/apisix"
  timeout: 30
```
### Disable standalone etcd in prod overlay
Docker Swarm overlay files cannot delete services from the base stack, but `replicas: 0` stops the container entirely:
```yaml
# docker-stack-infra.prod.yml
services:
  etcd:
    deploy:
      replicas: 0
```
### Firewall requirement
etcd access from app nodes to DB nodes must be open (port 2379, app subnet → DB subnet). Verify from an app node:
```bash
docker run --rm --network iklimco-net alpine \
sh -c "wget -qO- http://etcd-01:2379/health"
```
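The check above probes only `etcd-01`. A minimal sketch that builds the full endpoint list, assuming the aliases follow the `etcd-NN` pattern and port 2379 shown earlier. The helper name `etcd_endpoints` is hypothetical:

```bash
#!/usr/bin/env bash
# Print one client URL per etcd member: etcd-01 .. etcd-NN on port 2379.
etcd_endpoints() {
  local members=$1 i
  for (( i = 1; i <= members; i++ )); do
    printf 'http://etcd-%02d:2379\n' "$i"
  done
}

etcd_endpoints 3
# Each printed URL can then be probed from a container attached to
# iklimco-net, for example with: wget -qO- "$url/health"
```

Probing every member separately helps distinguish a firewall problem (all members unreachable) from a single unhealthy etcd node.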
## Step 5 — Redis: Sentinel cluster (prod overlay)
Redis runs as a single instance in test. In prod, Sentinel provides HA.
![[redis-sentinel-vs-cluster.png]]
Bitnami images are used — all configuration is done via env vars, no separate `.conf` file needed.
### Prerequisites
```bash
# Create Docker secret for Redis password:
openssl rand -hex 32 | docker secret create redis_password -
```
### Topology
```
any app node: redis (1 replica, spread by Swarm — not pinned)
2 app nodes: redis-replica (2 replicas, max 1/node, spread across app nodes)
all app nodes: redis-sentinel (3 replicas, max 1/node, spread across all app nodes)
```
### docker-stack-infra.prod.yml — Redis services
The existing `redis` service is overridden in the prod overlay as **master**; `redis-replica` and `redis-sentinel` are added as new services. The service name (`redis`) remains unchanged so the APISIX connection config does not need updating.
```yaml
# docker-stack-infra.prod.yml
services:
  redis: # override base single-instance redis → master
    image: bitnamisecure/redis:latest
    environment:
      ALLOW_EMPTY_PASSWORD: "no" # quoted so YAML keeps it a string, not a boolean
      REDIS_PASSWORD: ${REDIS_PASSWORD}
      REDIS_REPLICATION_MODE: master
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.labels.type == service
      restart_policy:
        condition: any
        delay: 5s
      labels:
        project: co.iklim
  redis-replica:
    image: bitnamisecure/redis:latest
    environment:
      ALLOW_EMPTY_PASSWORD: "no"
      REDIS_REPLICATION_MODE: slave
      REDIS_MASTER_HOST: redis
      REDIS_MASTER_PORT_NUMBER: "6379"
      REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD}
      REDIS_PASSWORD: ${REDIS_PASSWORD}
    deploy:
      mode: replicated
      replicas: 2
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
        preferences:
          - spread: node.hostname
      restart_policy:
        condition: any
        delay: 5s
      labels:
        project: co.iklim
  redis-sentinel:
    image: bitnamisecure/redis-sentinel:latest
    environment:
      REDIS_SENTINEL_MASTER_NAME: prod-master
      REDIS_MASTER_HOST: redis
      REDIS_MASTER_PORT_NUMBER: "6379"
      REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD}
      REDIS_SENTINEL_QUORUM: "2"
      REDIS_SENTINEL_DOWN_AFTER_MILLISECONDS: "5000"
      REDIS_SENTINEL_FAILOVER_TIMEOUT: "10000"
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
        preferences:
          - spread: node.hostname
      restart_policy:
        condition: any
        delay: 5s
      labels:
        project: co.iklim
```
### Microservice connection (Spring Data Redis)
Microservices must use a Sentinel-aware connection:
```yaml
# application-prod.yml
spring:
  data:
    redis:
      sentinel:
        master: prod-master
        nodes:
          - redis-sentinel:26379
      password: ${REDIS_PASSWORD}
```
### Verification
```bash
# Query master identity:
docker exec $(docker ps -q -f name=iklimco_redis-sentinel | head -1) \
redis-cli -p 26379 SENTINEL get-master-addr-by-name prod-master
```
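When the command above runs without a TTY, `redis-cli` is expected to print the reply as two plain lines, host first and port second. A small sketch, under that assumption, that joins them into one `host:port` value so the master address can be compared before and after a failover test. The helper name `master_addr` and the sample address are illustrative:

```bash
#!/usr/bin/env bash
# Join a two-line reply (host, then port) read from stdin into host:port.
master_addr() {
  local host port
  read -r host
  read -r port
  echo "${host}:${port}"
}

# Illustrative input standing in for the Sentinel reply:
printf '10.0.1.12\n6379\n' | master_addr   # prints 10.0.1.12:6379
```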
## Step 6 — RabbitMQ: 3-node Erlang cluster (prod overlay)
RabbitMQ runs as a 3-node cluster with one instance per app node.
### Prerequisites
```bash
# Create Docker secret for Erlang cookie (must be identical on all nodes):
openssl rand -hex 32 | docker secret create rabbitmq_erlang_cookie -
```
### docker-stack-infra.prod.yml — RabbitMQ override
```yaml
# docker-stack-infra.prod.yml (add alongside redis services)
services:
  rabbitmq:
    image: rabbitmq:3-management
    hostname: "rabbitmq-{{.Node.Hostname}}"
    environment:
      RABBITMQ_ERLANG_COOKIE_FILE: /run/secrets/rabbitmq_erlang_cookie
      RABBITMQ_USE_LONGNAME: "true"
      RABBITMQ_NODENAME: "rabbit@rabbitmq-{{.Node.Hostname}}"
    secrets:
      - rabbitmq_erlang_cookie
    networks:
      iklimco-net:
        aliases:
          - "rabbitmq-{{.Node.Hostname}}"
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
      update_config:
        parallelism: 1
        order: stop-first
      labels:
        project: co.iklim
secrets:
  rabbitmq_erlang_cookie:
    external: true
networks:
  iklimco-net:
    external: true
```
### Cluster join procedure (first setup)
RabbitMQ nodes do not form a cluster automatically; manual join is required after first start:
```bash
# Find the RabbitMQ container on iklim-app-02:
CTR=$(docker ps -q -f name=iklimco_rabbitmq)
# Stop, join, start:
docker exec "$CTR" rabbitmqctl stop_app
docker exec "$CTR" rabbitmqctl join_cluster rabbit@rabbitmq-iklim-app-01
docker exec "$CTR" rabbitmqctl start_app
# Repeat for iklim-app-03
```
```bash
# Verify cluster status (from any node):
docker exec "$CTR" rabbitmqctl cluster_status
```
> **HA policy:** After the cluster is formed, set quorum queues as the default:
> ```bash
> docker exec "$CTR" rabbitmqctl set_policy ha-all ".*" \
> '{"queue-type":"quorum"}' --apply-to queues
> ```
## Step 7 — RabbitMQ WebSocket Sticky Sessions (Consistent Hash)
RabbitMQ Web STOMP (over WebSocket) requires a persistent connection. In a 3-node RabbitMQ cluster, routing the `rabbitmq` upstream through the default Swarm VIP does not keep a client on the same node, which can cause unnecessary inter-node traffic or dropped connections.
To optimize this, we implement **Consistent Hashing (chash)** at the APISIX layer based on the client's IP address (`remote_addr`).
### 1. Update APISIX Upstream Configuration (init.sh)
Update the `rabbitmq` upstream definition in `init/apisix-core/init.sh` to target specific cluster nodes instead of the generic service name, enabling the `chash` algorithm for prod.
```bash
# Update upstream rabbitmq block in init.sh
if [[ "$PROFILE" == "prod" ]]; then
  # Direct node DNS names to bypass Swarm VIP and allow chash to work effectively
  RABBITMQ_NODES='{"rabbitmq-iklim-app-01:15674":1, "rabbitmq-iklim-app-02:15674":1, "rabbitmq-iklim-app-03:15674":1}'
  LB_TYPE="chash"
  HASH_KEY="remote_addr"
else
  RABBITMQ_NODES='{"rabbitmq:15674":1}'
  LB_TYPE="roundrobin"
  HASH_KEY=""
fi
call_api "upstream rabbitmq" -X PUT "$APISIX_ADMIN_URL/upstreams/rabbitmq-upstream" \
-H "X-API-KEY: $API_KEY" -H "Content-Type: application/json" \
-d '{
"name": "rabbitmq-upstream",
"type": "'"$LB_TYPE"'",
"key": "'"$HASH_KEY"'",
"nodes": '"$RABBITMQ_NODES"',
"timeout": {"connect": 10, "send": 3600, "read": 3600},
"scheme": "http",
'"$HC"'
}'
```
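To illustrate why hashing on `remote_addr` gives sticky sessions, here is a deliberately simplified sketch. It uses plain hash-then-modulo selection with `cksum`, which is not APISIX's actual `chash` ring implementation, and the function name `pick_node` is hypothetical. It only demonstrates the stickiness property: the same client address always maps to the same node while the node list is unchanged.

```bash
#!/usr/bin/env bash
# Pick a node for a client IP by hashing the IP and taking it modulo
# the number of nodes. Simplified stand-in for consistent hashing.
pick_node() {
  local client_ip=$1; shift
  local nodes=("$@")
  local hash
  # cksum prints "<crc> <byte count>"; keep only the CRC.
  hash=$(printf '%s' "$client_ip" | cksum | cut -d' ' -f1)
  echo "${nodes[hash % ${#nodes[@]}]}"
}

# The same address resolves to the same node on every call.
pick_node 203.0.113.7 rabbitmq-iklim-app-01 rabbitmq-iklim-app-02 rabbitmq-iklim-app-03
pick_node 203.0.113.7 rabbitmq-iklim-app-01 rabbitmq-iklim-app-02 rabbitmq-iklim-app-03
```

Unlike this modulo version, a real consistent-hash ring also limits how many clients are remapped when a node is added or removed.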
### 2. Enable Real IP Detection in APISIX
Consistent hashing by `remote_addr` requires APISIX to see the actual client IP, not the internal IP of the SWAG (Nginx) proxy.
> **DNS Note:** For `chash` to work with node-specific names, the RabbitMQ service must have network aliases configured for each node (e.g., `rabbitmq-{{.Node.Hostname}}`) as shown in Step 6.
In the `config.yaml` inside the custom APISIX image (`custom-apisix:3.12.0`):
```yaml
nginx_config:
  http:
    real_ip_header: "X-Real-IP"
    # APISIX config key; rendered as the nginx set_real_ip_from directive
    real_ip_from:
      - "10.0.0.0/8"
```
## Step 8 — Create `docker-stack-infra.prod.yml`
Create this file in the repo root alongside `docker-stack-infra.yml`. It combines all prod-specific overrides from Steps 2 to 6 (including disabling the standalone `etcd` from Step 4):
```yaml
# docker-stack-infra.prod.yml
# Prod overlay — deploy with:
# docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco
services:
  vault:
    environment:
      VAULT_LOCAL_CONFIG: >-
        {"api_addr":"https://vault.iklim.co:8200",
        "cluster_addr":"https://{{ .Node.Hostname }}:8201",
        "storage":{"raft":{"path":"/vault/file","node_id":"{{ .Node.Hostname }}"}},
        "listener":[{"tcp":{"address":"0.0.0.0:8200",
        "tls_cert_file":"/vault/certs/STAR.iklim.co.full.crt",
        "tls_key_file":"/vault/certs/STAR.iklim.co_key.pem"}}],
        "default_lease_ttl":"168h","max_lease_ttl":"720h","ui":true}
    volumes:
      - /opt/iklimco/vault/data:/vault/file
      - ${SWAG_CERT_DIR}:/vault/certs:ro
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
  apisix:
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
  apisix-dashboard:
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
  redis:
    image: bitnamisecure/redis:latest
    environment:
      ALLOW_EMPTY_PASSWORD: "no" # quoted so YAML keeps it a string, not a boolean
      REDIS_PASSWORD: ${REDIS_PASSWORD}
      REDIS_REPLICATION_MODE: master
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.labels.type == service
      restart_policy:
        condition: any
        delay: 5s
      labels:
        project: co.iklim
  redis-replica:
    image: bitnamisecure/redis:latest
    environment:
      ALLOW_EMPTY_PASSWORD: "no"
      REDIS_REPLICATION_MODE: slave
      REDIS_MASTER_HOST: redis
      REDIS_MASTER_PORT_NUMBER: "6379"
      REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD}
      REDIS_PASSWORD: ${REDIS_PASSWORD}
    deploy:
      mode: replicated
      replicas: 2
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
        preferences:
          - spread: node.hostname
      restart_policy:
        condition: any
        delay: 5s
      labels:
        project: co.iklim
  redis-sentinel:
    image: bitnamisecure/redis-sentinel:latest
    environment:
      REDIS_SENTINEL_MASTER_NAME: prod-master
      REDIS_MASTER_HOST: redis
      REDIS_MASTER_PORT_NUMBER: "6379"
      REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD}
      REDIS_SENTINEL_QUORUM: "2"
      REDIS_SENTINEL_DOWN_AFTER_MILLISECONDS: "5000"
      REDIS_SENTINEL_FAILOVER_TIMEOUT: "10000"
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
        preferences:
          - spread: node.hostname
      restart_policy:
        condition: any
        delay: 5s
      labels:
        project: co.iklim
  rabbitmq:
    image: rabbitmq:3-management
    hostname: "rabbitmq-{{.Node.Hostname}}"
    environment:
      RABBITMQ_ERLANG_COOKIE_FILE: /run/secrets/rabbitmq_erlang_cookie
      RABBITMQ_USE_LONGNAME: "true"
      RABBITMQ_NODENAME: "rabbit@rabbitmq-{{.Node.Hostname}}"
    secrets:
      - rabbitmq_erlang_cookie
    networks:
      iklimco-net:
        aliases:
          - "rabbitmq-{{.Node.Hostname}}"
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
      update_config:
        parallelism: 1
        order: stop-first
      labels:
        project: co.iklim
secrets:
  rabbitmq_erlang_cookie:
    external: true
networks:
  iklimco-net:
    external: true
```
## Step 9 — Monitoring Data Persistence
Prometheus and Grafana run as single instances. Grafana data is placed on the StorageBox shared filesystem for manual failover. Prometheus TSDB stays on a local Docker volume because DAVFS/StorageBox is not suitable for Prometheus WAL and compaction I/O.
**Changes already applied to `docker-stack-infra.yml`:**
```yaml
prometheus:
  volumes:
    - prometheus-vl:/prometheus
grafana:
  volumes:
    - ${GRAFANA_DATA_DIR:-grafana-vl}:/var/lib/grafana
```
Test uses the named Docker volume fallback (`grafana-vl`) for Grafana, and Prometheus always uses the named Docker volume (`prometheus-vl`) — no test env change needed.
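The `${GRAFANA_DATA_DIR:-grafana-vl}` substitution follows the same rule as the shell expansion of that form: the default applies when the variable is unset or empty. A small sketch that shows which source Grafana would mount in each environment. The function name `grafana_source` is illustrative:

```bash
#!/usr/bin/env bash
# Resolve the Grafana volume source the same way the stack file does:
# use GRAFANA_DATA_DIR when set and non-empty, else the named volume.
grafana_source() {
  echo "${GRAFANA_DATA_DIR:-grafana-vl}"
}

unset GRAFANA_DATA_DIR
grafana_source                                  # prints grafana-vl

GRAFANA_DATA_DIR=/mnt/storagebox/grafana/data
grafana_source                                  # prints /mnt/storagebox/grafana/data
```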
**Add to `prod/secrets/iklim.co/.env.prod` on storagebox** (already in `env-prod/.env`):
```bash
GRAFANA_DATA_DIR=/mnt/storagebox/grafana/data
```
> `/mnt/storagebox/grafana/data` is created automatically by the Ansible `storagebox` role during bootstrap via the `storagebox_managed_directories` variable. No manual step required.
> Grafana writes its SQLite database and dashboard JSON to `/var/lib/grafana`.
> Prometheus writes its TSDB to `/prometheus` on the local `prometheus-vl` Docker volume; it is not shared between nodes.
## Step 10 — Verify
```bash
# Base file must be valid on its own (test deploy):
docker stack config -c docker-stack-infra.yml > /dev/null && echo "base OK"
# Prod merge must be valid:
docker stack config -c docker-stack-infra.yml -c docker-stack-infra.prod.yml > /dev/null && echo "prod merge OK"
```
## Step 11 — Database Proxies and Developer Access
In the production environment, the `pg-proxy` and `mongo-proxy` services (socat-based) defined in the base `docker-stack-infra.yml` are **deprecated and will not be used**.
### Rationale
- **Leader Tracking:** Simple L4 proxies (socat) cannot track the Patroni Leader or MongoDB Primary. They point to a single service VIP, so during a failover connections may land on a read-only replica.
- **HA Connection Strings:** Modern DB drivers (JDBC, libpq, MongoClient) support multi-host connection strings, which provide native failover and load balancing without an intermediate proxy.
### Developer Access Strategy
- **Direct Subnet Access:** Developers connect via WireGuard directly to the DB subnet (`10.20.20.0/24`).
- **No Translation:** Instead of mapping ports like `15432`, the standard ports (`5432`, `27017`) are used across all cluster nodes.
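As a sketch of what such multi-host connection strings can look like, the helper below joins host names into the comma-separated form that libpq and MongoDB URIs accept. The host names `db-01` to `db-03`, the database name `appdb`, and the replica set name `rs0` are placeholders, not values taken from the setup documents:

```bash
#!/usr/bin/env bash
# Join host names into "h1:port,h2:port,..." for multi-host URIs.
host_list() {
  local port=$1; shift
  local out="" h
  for h in "$@"; do
    out+="${out:+,}${h}:${port}"
  done
  echo "$out"
}

# Placeholder hosts; replace with the real DB node addresses.
PG_HOSTS=$(host_list 5432 db-01 db-02 db-03)
MONGO_HOSTS=$(host_list 27017 db-01 db-02 db-03)

# libpq: target_session_attrs=read-write makes the client skip read-only standbys.
echo "postgresql://${PG_HOSTS}/appdb?target_session_attrs=read-write"
# MongoDB: replicaSet lets the driver discover and follow the current primary.
echo "mongodb://${MONGO_HOSTS}/appdb?replicaSet=rs0"
```

Because the driver itself locates the writable node, no proxy has to know which member currently holds the Patroni leader lock or the MongoDB primary role.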
## Placement and Replica Summary — prod
| Service | File | Replicas | Placement | HA Note |
| ---------------- | ------------ | -------- | ------------------------------------------- | ------------------------------------------------------------------------------------- |
| swag | base | 1 | `node.hostname == iklim-app-01` | No clustering support; Floating IP pinned to node |
| cert-reloader | base | 1 | `node.hostname == iklim-app-01` | Cron-style task; duplicate would be problematic |
| vault | prod overlay | 3 | `node.labels.type == service`; max 1/node | Raft cluster — see `07-vault-raft-plan.md` |
| apisix | prod overlay | 3 | `node.labels.type == service`; max 1/node | Stateless; config in Patroni etcd; rate limit policy:redis |
| apisix-dashboard | prod overlay | 3 | `node.labels.type == service`; max 1/node | Stateless; reads from etcd |
| redis (master) | prod overlay | 1 | `node.labels.type == service`; Swarm spread | Sentinel cluster master; not pinned — reschedules on node failure |
| redis-replica | prod overlay | 2 | `node.labels.type == service`; max 1/node | Sentinel replica; spread:hostname |
| redis-sentinel | prod overlay | 3 | `node.labels.type == service`; max 1/node | Quorum=2; failover automatic |
| rabbitmq | prod overlay | 3 | `node.labels.type == service`; max 1/node | Erlang cluster; quorum queues |
| prometheus | base | 1 | `node.labels.type == service` | No native HA; Thanos is overkill at this scale |
| grafana | base | 1 | `node.labels.type == service` | Not critical |
> PostgreSQL and MongoDB run in separate DB stacks on `iklimco-*` nodes. See `08-prod-db-cluster-kurulum.md`.
> etcd: 3-node cluster on DB nodes — APISIX shares it via `/apisix` prefix.
For current execution, use the setup runbooks and root stack files listed in the Context section.


@ -1,121 +1,83 @@
# 07 — Vault: 3-Node Raft Cluster (Prod)
# 07 — Vault Raft Stack and Bootstrap Automation (Prod)
## Context
Vault starts directly as a 3-node Raft cluster in prod. The single-instance phase used in test is skipped.
Test used a single Vault instance (file storage, 1 replica on the manager node). Prod goes straight to Raft HA.
Production Vault is a 3-node Raft cluster, but it is no longer initialized through a manual post-deploy runbook.
## Vault service configuration
Current references:
- **Replicas:** 3 (one per service node)
- **Storage:** Raft integrated storage
- **Placement:** `node.labels.type == service` (all 3 app nodes)
- **Cert distribution:** No SSH needed — all nodes mount StorageBox, cert-reloader writes to `SWAG_CERT_DIR=/mnt/storagebox/ssl`, Vault reads from that path on every node
- Setup source: `../../setup/09-prod-runner-ha-and-swarm.md`
- Stack file: root `docker-stack-vault.yml`
- Bootstrap script: root `init/vault/vault-bootstrap.sh`
- Template: root `init/vault/vault-template-v2.json`
### Prerequisites
## Current Model
- [ ] All 3 service nodes are running and labeled `type=service`
- [ ] `/mnt/storagebox/ssl/` directory is mounted and accessible on all 3 app nodes
- [ ] Vault data directory `/opt/iklimco/vault/data/` exists on all 3 nodes (host path volumes)
Vault is deployed separately from `docker-stack-infra_db-prod.yml`.
### Vault service YAML (docker-stack-infra.prod.yml overlay)
The Vault stack uses:
```yaml
vault:
  # ... (image, secrets, healthcheck unchanged from base)
  environment:
    VAULT_LOCAL_CONFIG: >-
      {"api_addr":"https://vault.iklim.co:8200",
      "cluster_addr":"https://{{ .Node.Hostname }}:8201",
      "storage":{"raft":{"path":"/vault/file","node_id":"{{ .Node.Hostname }}"}},
      "listener":[{"tcp":{"address":"0.0.0.0:8200",
      "tls_cert_file":"/vault/certs/STAR.iklim.co.full.crt",
      "tls_key_file":"/vault/certs/STAR.iklim.co_key.pem"}}],
      "default_lease_ttl":"168h","max_lease_ttl":"720h","ui":true}
  volumes:
    - /opt/iklimco/vault/data:/vault/file # host path per node
    - ${SWAG_CERT_DIR}:/vault/certs:ro # StorageBox — shared across all nodes, no SSH distribution needed
  deploy:
    mode: replicated
    replicas: 3
    placement:
      max_replicas_per_node: 1
      constraints:
        - node.labels.type == service
- 3 replicas, one per service node when placement allows it.
- Docker volumes such as `vault-data-vl` and `vault-logs-vl`.
- `/opt/iklimco/ssl:/vault/certs:ro` for TLS certificates.
- `iklimco-net` as an external overlay network.
- `vault_unseal_key` as a Docker secret.
The production workflow calls `init-infra-prod.sh`, which calls `init/vault/vault-bootstrap.sh`. The bootstrap script handles stack deploy, initialization, unseal key secret rotation, peer join, and peer unseal.
## Certificate Flow
Vault does not read TLS certificates directly from `/mnt/storagebox/ssl`.
The current flow is:
```text
SWAG renews certificate
cert-reloader copies renewed files to /mnt/storagebox/ssl
cert-distributor syncs certificate files to /opt/iklimco/ssl on service nodes
Vault reads /opt/iklimco/ssl through the /vault/certs mount
```
> `{{ .Node.Hostname }}` is Docker Swarm's Go template for the node hostname —
> gives each Vault instance a unique `node_id`.
## Bootstrap Flow
## Raft initialization procedure (first deploy)
Normal production bootstrap is automated:
### Step 1 — Deploy the stack
1. Create or refresh the placeholder `vault_unseal_key` secret when needed.
2. Deploy `docker-stack-vault.yml`.
3. Initialize Vault with one key share and a key threshold of one if it is not already initialized.
4. Replace the placeholder `vault_unseal_key` secret with the real unseal key.
5. Unseal the leader.
6. Join peers to the Raft cluster.
7. Unseal peers.
8. Verify Raft peers and service health.
These operations belong to `vault-bootstrap.sh`, not to a manual operator checklist.
## Verification
Use the current setup verification flow:
```bash
docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco
docker service ps iklimco_vault
docker exec $(docker ps -q -f name=iklimco_vault | head -1) vault status
docker exec $(docker ps -q -f name=iklimco_vault | head -1) vault operator raft list-peers
```
All 3 Vault containers start. Only the first one to initialize becomes the leader.
Expected state:
### Step 2 — Initialize Vault on the leader (iklim-app-01)
- Vault service has 3 running tasks.
- `vault status` reports `Sealed false`.
- Raft list shows one leader and two followers.
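The expected state can also be checked mechanically. The sketch below parses the text output of `vault status` and `vault operator raft list-peers`, assuming the usual column layout (`Sealed <value>` for status; node, address, state, voter for peers). The function names and the sample peer table are illustrative and are not part of `vault-bootstrap.sh`:

```bash
#!/usr/bin/env bash
# Succeed when `vault status` text on stdin reports "Sealed false".
is_unsealed() {
  awk '$1 == "Sealed" { found = ($2 == "false") } END { exit !found }'
}

# Count peers in the given state from `raft list-peers` text on stdin.
count_state() {
  awk -v want="$1" '$3 == want { n++ } END { print n + 0 }'
}

# Illustrative sample standing in for real command output:
sample_peers='Node          Address              State       Voter
----          -------              -----       -----
iklim-app-01  iklim-app-01:8201    leader      true
iklim-app-02  iklim-app-02:8201    follower    true
iklim-app-03  iklim-app-03:8201    follower    true'

printf '%s\n' "$sample_peers" | count_state leader     # prints 1
printf '%s\n' "$sample_peers" | count_state follower   # prints 2
printf 'Sealed          false\n' | is_unsealed && echo "unsealed"
```

In practice the sample text would be replaced by piping the output of the two `docker exec ... vault` commands above into these helpers.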
```bash
VAULT_CTR=$(docker ps -q -f name=iklimco_vault)
docker exec -it "$VAULT_CTR" vault operator init
```
## Historical / Superseded by Setup
Save the unseal keys and root token securely. Store the unseal key as a Docker secret:
The previous manual procedure is superseded:
```bash
echo -n "<unseal-key>" | docker secret create vault_unseal_key -
```
- Deploying Vault through `docker-stack-infra.yml` + `docker-stack-infra.prod.yml`.
- Creating `/opt/iklimco/vault/data` host-path directories on each app node.
- Running `vault operator init` manually.
- Manually copying/storing unseal keys.
- Manually running `vault operator raft join` on peers.
- Manually unsealing each peer after join.
### Step 3 — Unseal the leader
```bash
docker exec -it "$VAULT_CTR" vault operator unseal
```
The healthcheck auto-unseals on subsequent restarts via the `vault_unseal_key` secret.
### Step 4 — Join remaining nodes to the Raft cluster
On iklim-app-02 and iklim-app-03 containers:
```bash
docker exec -it <vault-on-iklim-app-02> vault operator raft join \
https://vault.iklim.co:8200
docker exec -it <vault-on-iklim-app-03> vault operator raft join \
https://vault.iklim.co:8200
```
Unseal each node after joining:
```bash
docker exec -it <vault-on-iklim-app-02> vault operator unseal
docker exec -it <vault-on-iklim-app-03> vault operator unseal
```
### Step 5 — Verify cluster
```bash
docker exec "$VAULT_CTR" vault operator raft list-peers
```
Expected: 3 peers, one `leader`, two `follower`.
## cert-reloader — no additional changes needed for Raft
cert-reloader writes the cert to `SWAG_CERT_DIR=/mnt/storagebox/ssl`.
Since StorageBox is mounted on all app nodes, every Vault instance already sees the same path.
The cert renewal flow works unchanged with Raft:
```
cert changed → copy to /mnt/storagebox/ssl/ → docker service update --force iklimco_vault
Vault (3 replicas) restart → each auto-unseals via healthcheck
```
## Reference
- Vault Raft storage docs: https://developer.hashicorp.com/vault/docs/configuration/storage/raft
- Vault Swarm setup: https://manjit28.medium.com/setting-up-a-secure-and-highly-available-hashicorp-vault-cluster-for-secrets-and-certificates-0ce01a370582
Keep those notes only as historical context. For current prod, use `docker-stack-vault.yml` and `init/vault/vault-bootstrap.sh`.


@ -1,24 +1,23 @@
# Setup Phases — Roadmap Mapping Table
This table shows, for each roadmap step in the `roadmap/test-env` and `roadmap/prod-env` folders,
which Terraform/Ansible setup phase covers it.
This table shows, for each roadmap step in the `roadmap/test-env` and `roadmap/prod-env` folders, which Terraform/Ansible setup phase covers it.
## TEST environment
| Roadmap step | Phase that should cover it |
| --- | --- |
| Hetzner firewall (22/80/443 only) | **Terraform `02-test-terraform-iaac.md`** — `firewall.tf` |
| Server creation (`iklim-app-01`, `iklim-db-01`) | **Terraform `02-test-terraform-iaac.md`** — `servers.tf` |
| Private network + placement group (`iklim-test-spread`) | **Terraform `02-test-terraform-iaac.md`** — `network.tf`, `placement.tf` |
| Floating IP (`iklim-test-app-fip`) | **Terraform `02-test-terraform-iaac.md`** — `floating_ip.tf` |
| Hetzner firewall (22/80/443 only) | **Terraform `02-test-terraform-iac.md`** — `firewall.tf` |
| Server creation (`iklim-app-01`, `iklim-db-01`) | **Terraform `02-test-terraform-iac.md`** — `servers.tf` |
| Private network + placement group (`iklim-test-spread`) | **Terraform `02-test-terraform-iac.md`** — `network.tf`, `placement.tf` |
| Floating IP (`iklim-test-app-fip`) | **Terraform `02-test-terraform-iac.md`** — `floating_ip.tf` |
| Docker Engine installation (app + db node) | **Ansible `03-test-ansible-bootstrap.md`** — `docker` role |
| Security hardening (SSH, firewalld, fail2ban) | **Ansible `03-test-ansible-bootstrap.md`** — `hardening` role |
| Docker Swarm init + `iklim-db-01` worker join | **Ansible `03-test-ansible-bootstrap.md`** — `swarm` role |
| `type=service` and `role=db` node labels | **Ansible `03-test-ansible-bootstrap.md`** — `swarm` role |
| `/opt/iklimco/...` directories | **Ansible `03-test-ansible-bootstrap.md`** — `node_dirs` role |
| StorageBox DAVFS mount (`u469968-sub4`) | **Ansible `03-test-ansible-bootstrap.md`** — `storagebox` role |
| DB stack deploy (PostgreSQL + MongoDB on `iklim-db-01`) | **Manual `04-test-db-docker-kurulum.md`** |
| `act_runner` systemd installation | **Ansible `05-test-runner-ve-deploy-onkosullari.md`** — `act_runner` role (`test-app-post-stack.yml`) |
| DB stack deploy (PostgreSQL + MongoDB on `iklim-db-01`) | **Manual `04-test-db-docker-setup.md`** |
| `act_runner` systemd installation | **Ansible `05-test-runner-and-deploy-prerequisites.md`** — `act_runner` role (`test-app-post-stack.yml`) |
| Uploading GoDaddy credentials to storagebox | **Stays manual** — secret management, outside Terraform/Ansible |
| `docker-stack-infra.yml` port removal + SWAG/cert-reloader addition | **Pipeline `deploy-test.yml`** + **repo change** — `roadmap/test-env/03` |
| SWAG nginx proxy confs (`template/swag/site-confs/*.conf.tpl`) | **Delivered in the repo** — `roadmap/test-env/04` |
@ -31,22 +30,22 @@ Terraform/Ansible setup aşamalarından hangisinde ele alındığını gösterir
| Roadmap step | Phase that should cover it |
| --- | --- |
| Creating 6 servers (`iklim-app-01/02/03`, `iklim-db-01/02/03`) | **Terraform `06-prod-terraform-iaac.md`** — `servers.tf` |
| Private network + 2 placement groups | **Terraform `06-prod-terraform-iaac.md`** — `network.tf`, `placement.tf` |
| Firewall (only 22/80/443 public; private port matrix) | **Terraform `06-prod-terraform-iaac.md`** — `firewall.tf` |
| Floating IP (`iklim-prod-app-fip`, assigned to `iklim-app-01`) | **Terraform `06-prod-terraform-iaac.md`** — `floating_ip.tf` |
| Creating 6 servers (`iklim-app-01/02/03`, `iklim-db-01/02/03`) | **Terraform `06-prod-terraform-iac.md`** — `servers.tf` |
| Private network + 2 placement groups | **Terraform `06-prod-terraform-iac.md`** — `network.tf`, `placement.tf` |
| Firewall (only 22/80/443 public; private port matrix) | **Terraform `06-prod-terraform-iac.md`** — `firewall.tf` |
| Floating IP (`iklim-prod-app-fip`, assigned to `iklim-app-01`) | **Terraform `06-prod-terraform-iac.md`** — `floating_ip.tf` |
| Docker Engine installation (all nodes — app and db) | **Ansible `07-prod-ansible-bootstrap.md`** → `docker` role |
| Security hardening (all nodes) | **Ansible `07-prod-ansible-bootstrap.md`** → `hardening` role |
| Swarm init (`iklim-app-01`) + manager join (`iklim-app-02/03`) | **Ansible `07-prod-ansible-bootstrap.md`** → `swarm` role |
| `type=service` node label (3 app nodes) | **Ansible `07-prod-ansible-bootstrap.md`** → `swarm` role |
| `/opt/iklimco/...` directories + `/opt/iklimco/stacks` | **Ansible `07-prod-ansible-bootstrap.md`** → `node_dirs` role |
| StorageBox DAVFS mount (`u469968-sub5`) | **Ansible `07-prod-ansible-bootstrap.md`** → `storagebox` role |
| Join DB nodes to Swarm as workers | **Manual `08-prod-db-cluster-kurulum.md`** — Section 2 |
| `role=db` node label (3 db nodes) | **Manual `08-prod-db-cluster-kurulum.md`** — Section 2 |
| etcd cluster deploy (for Patroni) | **Manual `08-prod-db-cluster-kurulum.md`** — Section 5.2 |
| MongoDB replica set deploy | **Manual `08-prod-db-cluster-kurulum.md`** — Section 4 |
| Patroni + PostgreSQL HA deploy | **Manual `08-prod-db-cluster-kurulum.md`** — Section 5.4 |
| 3× `act_runner` systemd (HA runner) | **Ansible `09-prod-runner-ha-ve-swarm.md`** — `act_runner` role |
| Join DB nodes to Swarm as workers | **Manual `08-prod-db-cluster-setup.md`** — Section 2 |
| `role=db` node label (3 db nodes) | **Manual `08-prod-db-cluster-setup.md`** — Section 2 |
| etcd cluster deploy (for Patroni) | **Manual `08-prod-db-cluster-setup.md`** — Section 5.2 |
| MongoDB replica set deploy | **Manual `08-prod-db-cluster-setup.md`** — Section 4 |
| Patroni + PostgreSQL HA deploy | **Manual `08-prod-db-cluster-setup.md`** — Section 5.4 |
| 3× `act_runner` systemd (HA runner) | **Ansible `09-prod-runner-ha-and-swarm.md`** — `act_runner` role |
| Uploading GoDaddy credentials to StorageBox | **Stays manual** — secret management, outside Terraform/Ansible |
| `docker-stack-infra.yml` port removal + SWAG/cert-reloader addition | **Repo change** → `roadmap/prod-env/03` |
| SWAG nginx proxy confs (`template/swag/site-confs/*.conf.tpl`) | **Delivered in the repo** → `roadmap/prod-env/04` |
@ -61,16 +60,16 @@ Terraform/Ansible setup aşamalarından hangisinde ele alındığını gösterir
```
Environment_Infrastructure/
setup/ ← Terraform + Ansible phase documents
00-genel-yol-haritasi.md
01-private-network-port-matrisi.md
02-test-terraform-iaac.md
00-general-roadmap.md
01-private-network-port-matrix.md
02-test-terraform-iac.md
03-test-ansible-bootstrap.md
04-test-db-docker-kurulum.md
05-test-runner-ve-deploy-onkosullari.md
06-prod-terraform-iaac.md
04-test-db-docker-setup.md
05-test-runner-and-deploy-prerequisites.md
06-prod-terraform-iac.md
07-prod-ansible-bootstrap.md
08-prod-db-cluster-kurulum.md
09-prod-runner-ha-ve-swarm.md
08-prod-db-cluster-setup.md
09-prod-runner-ha-and-swarm.md
roadmap/
test-env/ ← Test environment roadmap steps
prod-env/ ← Prod roadmap steps

@ -43,9 +43,9 @@ Minimum topology for the test environment:
| Node | Role | Note |
| --- | --- | --- |
| `iklim-app-01` | Swarm manager + app worker + Gitea runner | CI/CD test deploy runs through this node |
| `iklim-db-01` | DB node | DB infrastructure will be installed manually; it will not be installed by Gitea CI/CD |
| `iklim-db-01` | DB node / Swarm worker | DB host prerequisites are prepared by Ansible; DB services are deployed as Swarm services by the environment stack/pipeline |
The test DB setup is brought only up to machine and OS preparation with Terraform/Ansible. PostgreSQL/MongoDB cluster installation is outside this phase.
The test DB setup is brought up to OS, Docker, Swarm worker, config directory, and WireGuard preparation with Terraform/Ansible. PostgreSQL/MongoDB runtime services are not installed directly on the OS; they run as Docker Swarm services.
### Prod
@ -56,23 +56,25 @@ HA topology for the prod environment:
| `iklim-app-*` | 3 | Each one is a Swarm manager + app worker |
| `iklim-db-*` | 3 | DB cluster nodes |
Prod DB infrastructure will be installed manually; it will not be installed by Gitea CI/CD. Terraform prepares the DB machines and network/firewall rules; Ansible installs OS hardening and base dependencies.
Prod DB host prerequisites are prepared by Terraform/Ansible. Runtime DB services are part of the current prod Swarm stack: etcd, Patroni/PostgreSQL, and MongoDB replica set are deployed by the prod root pipeline through `docker-stack-infra_db-prod.yml`.
## Public Port Policy
Ports open to the public internet are only:
Ports open to the public internet are normally only:
- `22/tcp` SSH, only from admin IP/CIDR sources
- `80/tcp` HTTP
- `443/tcp` HTTPS
Test has one explicit exception: `51820/udp` is opened on the DB node for WireGuard VPN, authenticated cryptographically. Prod currently does not expose `51820/udp` in Terraform.
`8200/tcp` Vault will not be opened to the public internet. Vault must be reachable only from the private network or Docker overlay.
`docker-stack-infra.yml` has been aligned with this policy: only the SWAG service publishes ports 80/443; all other services such as Vault, APISIX, RabbitMQ, Prometheus, and Grafana are reachable only through the `iklimco-net` overlay.
Current prod stack behavior is aligned with this policy: `docker-stack-infra_db-prod.yml` publishes public traffic through SWAG on 80/443. Vault is deployed separately by `vault-bootstrap.sh` using `docker-stack-vault.yml`; it is not publicly exposed.
## Private Network Policy
The detailed matrix of ports that must be opened inside the private network is in `01-private-network-port-matrisi.md`. Agents must treat that file as the source when writing firewall or Ansible UFW rules.
The detailed matrix of ports that must be opened inside the private network is in `01-private-network-port-matrix.md`. Agents must treat that file as the source when writing Terraform Hetzner firewall rules and Ansible `firewalld` rules.
## Gitea Actions Runner Decision

@ -1,8 +1,8 @@
# 07 - Private Network Port Matrix
# 01 - Private Network Port Matrix
This file defines the ports that must be opened inside the Hetzner private network for test and prod environments. Ports open to the public internet will only be `22/tcp`, `80/tcp`, and `443/tcp`. Vault `8200/tcp` will not be opened publicly.
This file defines the ports that must be opened inside the Hetzner private network for test and prod environments. Public ingress is limited to `22/tcp`, `80/tcp`, and `443/tcp`, with one current test-only exception: `51820/udp` is public on the test DB node for WireGuard. Vault `8200/tcp` will not be opened publicly.
This matrix must be treated as the source for Terraform Hetzner firewall and Ansible UFW rules.
This matrix must be treated as the source for Terraform Hetzner firewall and Ansible `firewalld` rules.
## Network Plan
@ -11,25 +11,25 @@ This matrix must be treated as the source for Terraform Hetzner firewall and Ans
| Subnet | CIDR | Purpose |
| --- | --- | --- |
| App/Swarm | `10.10.10.0/24` | `iklim-app-01` |
| DB | `10.10.20.0/24` | `test-db-01` |
| DB | `10.10.20.0/24` | `iklim-db-01` |
### Prod
| Subnet | CIDR | Purpose |
| --- | --- | --- |
| App/Swarm | `10.20.10.0/24` | `iklim-app-01/02/03` |
| DB | `10.20.20.0/24` | `prod-db-01/02/03` |
| DB | `10.20.20.0/24` | `iklim-db-01/02/03` |
## Public Ingress Standard
Public ingress for all environments:
Public ingress:
| Port | Protocol | Source | Target | Requirement |
| --- | --- | --- | --- | --- |
| `22` | TCP | Admin IP/CIDR | All nodes | SSH management |
| `80` | TCP | Internet | `iklim-app-01` (gateway) | HTTP / ACME redirect |
| `443` | TCP | Internet | `iklim-app-01` (gateway) | HTTPS |
| `51820` | UDP | `0.0.0.0/0`, `::/0` | `iklim-db-01` (DB node) | WireGuard VPN — authentication with cryptographic key |
| `51820` | UDP | `0.0.0.0/0`, `::/0` | `iklim-db-01` in test only | WireGuard VPN — authentication with cryptographic key |
Critical ports that will not be opened publicly:
@ -80,11 +80,11 @@ These ports will not be opened publicly. Access will be allowed only from requir
| `9090` | TCP | Prometheus UI/API | Admin CIDR or private ops | Prometheus service/node | Public closed |
| `3000` | TCP | Grafana UI | Admin CIDR or private ops | Grafana service/node | Public closed |
`docker-stack-infra.yml` has been updated so that only the SWAG service publishes ports 80/443 in host mode. All other services contain no published ports; access is provided only through the `iklimco-net` overlay. This table remains the source for private ingress decisions.
The current prod root stack is `docker-stack-infra_db-prod.yml`; Vault is deployed separately with `docker-stack-vault.yml` through `vault-bootstrap.sh`. Public traffic is expected to enter through SWAG on 80/443. Private service reachability is provided by the `iklimco-net` overlay and by the explicit host-mode DB/cluster ports listed below.
## DB Node Ports
Because DB infrastructure will be installed manually, the exact cluster technology is outside this document. Still, the default ports for firewall purposes are below.
DB runtime services are deployed as Docker Swarm services. Prod currently uses Patroni/PostgreSQL, etcd, and a MongoDB replica set in `docker-stack-infra_db-prod.yml`; the required firewall ports are below.
### PostgreSQL / PostGIS (Patroni + etcd)
@ -129,7 +129,7 @@ App subnet (swarm firewall) — traffic inside itself:
| Source | Target | Ports |
| --- | --- | --- |
| `10.20.10.0/24` | `10.20.10.0/24` | `2377/tcp`, `7946/tcp`, `7946/udp`, `4789/udp` (Swarm) |
| `10.20.10.0/24` | `10.20.10.0/24` | `8200/tcp`, `6379/tcp`, `5672/tcp`, `61613/tcp`, `15674/tcp`, `2379/tcp` (application services) |
| `10.20.10.0/24` | `10.20.10.0/24` | `8200/tcp`, `5672/tcp`, `61613/tcp`, `15674/tcp` (application services) |
| Admin CIDR or VPN | `10.20.10.0/24` | `15672/tcp`, `9180/tcp`, `9090/tcp`, `3000/tcp` |
App -> DB traffic (there is no related rule in the swarm firewall; it is allowed in the db firewall):
@ -157,7 +157,7 @@ DB -> App traffic (allowed in the swarm firewall):
- The public firewall does not open `8200/tcp`.
- DB ports are not open publicly.
- Swarm ports are open only inside the private app/swarm subnet.
- Swarm ports are open only between Swarm app and DB subnets.
- The App/Swarm subnet reaches the DB subnet only through required DB ports.
- The DB subnet is not opened to the app subnet with broad permissions.
- Admin UI ports are restricted through admin CIDR/VPN/private ops instead of public access.

@ -11,8 +11,8 @@ Terraform creates the following in the test environment:
- App/Swarm subnet: `10.10.10.0/24`
- DB subnet: `10.10.20.0/24`
- Firewall:
- Public ingress: only `22/tcp`, `80/tcp`, `443/tcp`
- Private ingress: test rules in `01-private-network-port-matrisi.md`
- Public ingress: `22/tcp`, `80/tcp`, `443/tcp`, plus test DB WireGuard `51820/udp`
- Private ingress: test rules in `01-private-network-port-matrix.md`
- SSH key
- Placement group: `iklim-test-spread`
- Floating IP: stable IPv4 for the swarm entry point
@ -21,7 +21,7 @@ Terraform creates the following in the test environment:
- `iklim-db-01`
- Ansible inventory output
Terraform does not install DB software. The DB node is prepared only at the machine, network, and firewall level.
Terraform does not install DB software. The DB node is prepared at the machine, network, and firewall level; Ansible later prepares Docker, Swarm worker membership, DB config directories, and WireGuard.
## Recommended File Structure
@ -69,7 +69,7 @@ The server type decision is based on the current test environment metrics in `..
| Server | Private IP | Role |
| --- | --- | --- |
| `iklim-app-01` | `10.10.10.11` | Swarm manager + app worker + Gitea runner |
| `iklim-db-01` | `10.10.20.11` | DB node prepared for manual DB installation |
| `iklim-db-01` | `10.10.20.11` | DB node / Swarm worker for DB services |
Private IPs must be statically defined inside Terraform. Ansible inventory and firewall rules remain deterministic.
@ -91,7 +91,7 @@ Public ingress:
| `80/tcp` | `0.0.0.0/0`, `::/0` | `iklim-app-01` |
| `443/tcp` | `0.0.0.0/0`, `::/0` | `iklim-app-01` |
For public ingress, `8200/tcp`, `5432/tcp`, `27017/tcp`, `5672/tcp`, `15672/tcp`, `6379/tcp`, `2379/tcp`, `9000/tcp`, `9180/tcp`, `9090/tcp`, and `3000/tcp` will not be opened.
For public ingress, `8200/tcp`, `5432/tcp`, `27017/tcp`, `5672/tcp`, `15672/tcp`, `6379/tcp`, `2379/tcp`, `9000/tcp`, `9180/tcp`, `9090/tcp`, and `3000/tcp` will not be opened. `51820/udp` is the explicit test-only public exception for WireGuard.
### App (swarm) Firewall — Private Ingress
@ -133,9 +133,9 @@ Source from DB subnet, because `iklim-db-01` joins Swarm as a worker:
| `7946/tcp,udp` | Docker Swarm node discovery | `10.10.10.0/24` (app subnet) |
| `4789/udp` | Docker Swarm VXLAN overlay | `10.10.10.0/24` (app subnet) |
IP restriction is done in the SWAG nginx configuration, not in the Hetzner firewall. None of these ports are opened publicly from the `admin_allowed_cidrs` source.
IP restriction is done in the SWAG nginx configuration, not in the Hetzner firewall. None of these management ports are opened publicly from the `admin_allowed_cidrs` source.
For other private ingress rules, `01-private-network-port-matrisi.md` will be used as the source.
For other private ingress rules, `01-private-network-port-matrix.md` will be used as the source.
## Placement Group
@ -204,6 +204,6 @@ Each server gets `lifecycle { prevent_destroy = true }`. While this block exists
- `terraform plan` works only with the test Hetzner Project token.
- 2 servers are created after `terraform apply`.
- The two servers can reach each other through the private network.
- Only `22`, `80`, and `443` are open at firewall level from the public internet.
- Only `22`, `80`, `443`, and test WireGuard `51820/udp` are open at firewall level from the public internet.
- Vault `8200` remains closed from the public internet.
- Terraform state is not committed to the repo.
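
The public-ingress criteria above can be checked mechanically. The following is a minimal sketch, not part of the repo: `unexpected_public_ports` is a hypothetical helper, and it assumes the list of publicly reachable ports has already been collected by some external scan. The allowlist itself comes from the policy in this document.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): prints every publicly open port
# that the documented policy does not allow.
# Usage: unexpected_public_ports <test|prod> <port/proto>...
unexpected_public_ports() {
  local env="$1"; shift
  # Allowlist from the public port policy: SSH, HTTP, HTTPS.
  local allowed=" 22/tcp 80/tcp 443/tcp "
  # Test-only exception: WireGuard on the DB node.
  if [ "$env" = "test" ]; then
    allowed="${allowed}51820/udp "
  fi
  local port
  for port in "$@"; do
    case "$allowed" in
      *" $port "*) ;;                  # allowed, nothing to report
      *) printf '%s\n' "$port" ;;      # outside the allowlist
    esac
  done
}
```

An empty result means the observed ports match the policy; for example, `unexpected_public_ports prod 22/tcp 8200/tcp` would report `8200/tcp`, which corresponds to the "Vault stays closed" criterion.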

@ -97,7 +97,7 @@ ansible-playbook test-bootstrap.yml --tags "hardening" --ask-vault-pass
| Host | Role |
| --- | --- |
| `iklim-app-01` | Swarm manager + app worker |
| `iklim-db-01` | OS-hardened DB node for manual DB installation |
| `iklim-db-01` | OS-hardened DB node / Swarm worker for DB services |
## Recommended File Structure
@ -281,7 +281,7 @@ Deploy prerequisites on `iklim-app-01`:
/opt/iklimco/stacks
```
Minimum for manual DB installation on the DB node:
Minimum DB-node host directories:
```text
/opt/iklimco
@ -391,7 +391,7 @@ vault_iklim_password: "IKLIM_USER_PASSWORD"
creates: "{{ storagebox_mount_point }}/.mounted_marker"
```
A marker file can be written to the directory to confirm mount success:
A marker file can be written to the directory to confirm mount success:
```yaml
- name: Write mount marker
@ -402,7 +402,7 @@ vault_iklim_password: "IKLIM_USER_PASSWORD"
6. **Create service bind mount directories**
In the test environment, the precipitation service's `image-data` volume is bind mounted on the host to `/mnt/storagebox/precipitation/images`. The directory is created by Ansible after StorageBox is mounted and left with `0755` permissions.
In the test environment, the precipitation service's `image-data` volume is bind mounted on the host to `/mnt/storagebox/precipitation/images`. The directory is created by Ansible after StorageBox is mounted and left with `0755` permissions.
```yaml
- name: Create managed StorageBox directories
@ -447,13 +447,13 @@ An ed25519 SSH key pair is generated on the server and uploaded to the StorageBo
2. **Upload the public key to StorageBox**
This step is done manually and requires the password the first time:
This step is done manually and requires the password the first time:
```bash
cat /root/.ssh/id_ed25519_storagebox.pub | ssh -p23 u469968-sub4@u469968-sub4.your-storagebox.de install-ssh-key
```
Later access works passwordlessly:
Later access works passwordlessly:
```bash
sftp -P23 u469968-sub4@u469968-sub4.your-storagebox.de
@ -461,14 +461,14 @@ An ed25519 SSH key pair is generated on the server and uploaded to the StorageBo
3. **Add private and public keys to Gitea**
Gitea -> Organization Settings -> Actions -> Secrets:
Gitea -> Organization Settings -> Actions -> Secrets:
| Secret Name | Value |
| --- | --- |
| `STORAGEBOX_SSH_PRIV` | Contents of `/root/.ssh/id_ed25519_storagebox` |
| `STORAGEBOX_SSH_PUB` | Contents of `/root/.ssh/id_ed25519_storagebox.pub` |
To get the key contents:
To get the key contents:
```bash
cat /root/.ssh/id_ed25519_storagebox

@ -1,6 +1,6 @@
# 04 - Test DB Docker Installation (Swarm Worker)
# 04 - Test DB Docker Setup (Swarm Worker)
The purpose of this phase is to add the `iklim-db-01` node to Swarm as a worker and run PostgreSQL and MongoDB as Swarm services.
The purpose of this phase is to add the `iklim-db-01` node to Swarm as a worker and prepare the host for PostgreSQL and MongoDB Swarm services.
## Architecture Decision
@ -8,12 +8,12 @@ The roadmap states that DBs will be installed "manually". In the test environmen
The installation has **two phases:**
1. **Preparation (Ansible):** The `test-db-post-stack.yml` playbook sets up DB directories, the `mongod.conf` configuration, and the WireGuard VPN service.
2. **Deploy (Gitea CI/CD):** The `deploy-test.yml` workflow deploys PostgreSQL and MongoDB services to Swarm through `docker-stack-infra.yml`.
2. **Deploy (Gitea CI/CD):** The test deploy workflow deploys PostgreSQL and MongoDB services as part of the environment stack.
**Why?**
1. **Ease of management:** Version transitions and configuration management are much faster with Docker.
2. **Overlay Network:** Application services (`iklim-app-01`) can access DBs through the `iklimco-net` overlay network in an encrypted and isolated way.
3. **Data persistence:** Data is stored in Docker named volumes on `iklim-db-01`. StorageBox is used only for backups.
3. **Data persistence:** Runtime data is kept on the DB node. StorageBox is used for shared configuration, operational files, and backup-related paths, not as the primary DB data path.
## Prerequisites
@ -67,24 +67,21 @@ On `iklim-db-01`, through the `db_stack` and `wireguard` roles:
- Places the `mongod.conf` file
- Installs and configures the WireGuard VPN server (`51820/udp`)
> Deploying DB services (PostgreSQL, MongoDB) to Swarm is the responsibility of the Gitea CI/CD workflow (`deploy-test.yml`), not Ansible. This workflow deploys all services at once through `docker-stack-infra.yml`.
> Deploying DB services (PostgreSQL, MongoDB) to Swarm is the responsibility of the Gitea CI/CD workflow, not Ansible. The Ansible playbook prepares host directories, configuration, and WireGuard.
## 4. Volume and Data Structure
DB data is stored in Docker named volumes on `iklim-db-01`:
DB data is stored on `iklim-db-01` through the stack's configured volume or bind-mount layout. The Ansible `db_stack` role prepares MongoDB configuration at:
| Volume | Content |
|---|---|
| `iklim-db_postgresql_data` | PostgreSQL data files |
| `iklim-db_mongodb_data` | MongoDB data files |
```text
/opt/iklimco/db/mongodb/config/mongod.conf
```
MongoDB logs are written to stdout and can be watched with `docker logs`. Configuration: `/opt/iklimco/db/mongodb/config/mongod.conf`
> StorageBox is **not used** for DB data. It only has a role in the backup strategy.
MongoDB logs are written to stdout and can be watched with `docker logs`.
## 5. Acceptance Criteria
- `iklim-db-01` appears as Ready and Active in the `docker node ls` command.
- `docker stack services iklimco` shows both services with 1/1 replicas.
- Access from the application node is available through the `iklim-db_postgresql` and `iklim-db_mongodb` DNS names.
- Data is preserved from named volumes after reboot; verify with `docker volume ls`.
- Data is preserved after reboot according to the stack's configured DB volume/bind-mount layout.
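
The replica criterion above can be checked with a small filter over `docker stack services` output. This is a sketch under assumptions: `under_replicated_services` is a hypothetical helper name, and it expects one `name current/desired` pair per line on stdin.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): reads "name current/desired" lines
# on stdin and prints the services whose replica counts do not match.
# Returns 1 if at least one service is under-replicated, 0 otherwise.
under_replicated_services() {
  local name replicas rc=0
  while read -r name replicas _; do
    [ -z "$name" ] && continue
    # Replicas look like "1/1"; compare the part before and after the slash.
    if [ "${replicas%%/*}" != "${replicas##*/}" ]; then
      printf '%s %s\n' "$name" "$replicas"
      rc=1
    fi
  done
  return "$rc"
}
```

On the manager node it could be fed with `docker stack services iklimco --format '{{.Name}} {{.Replicas}}' | under_replicated_services`; no output and exit status 0 would indicate that every service in the stack is at its desired replica count.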

@ -8,7 +8,7 @@ A single runner is used in the test environment for cost and simplicity:
| Host | Service Name | System User | Labels |
| --- | --- | --- | --- |
| `iklim-app-01` | `gitea-act-runner` | `gitea-runner` | `ubuntu-latest`, `ubuntu-22.04`, `ubuntu-20.04`, `test-runner` |
| `iklim-app-01` | `gitea-act-runner` | `gitea-runner` | `ubuntu-latest`, `ubuntu-22.04`, `ubuntu-20.04`, `test-runner:docker://catthehacker/ubuntu:act-22.04` |
## 1. Runner User and Permissions
@ -56,14 +56,15 @@ Critical parts of the configuration:
```yaml
runner:
labels:
- "ubuntu-latest:docker://ubuntu:latest"
- "ubuntu-22.04:docker://ubuntu:22.04"
- "ubuntu-20.04:docker://ubuntu:20.04"
- "test-runner:docker://ubuntu:22.04"
- "ubuntu-latest"
- "ubuntu-22.04"
- "ubuntu-20.04"
- "test-runner:docker://catthehacker/ubuntu:act-22.04"
container:
network: "iklimco-net" # Access to DB services through overlay
options: "-v /var/run/docker.sock:/var/run/docker.sock" # For Docker commands
network: "bridge"
options: "-v /mnt/storagebox:/mnt/storagebox"
docker_host: "unix:///var/run/docker.sock"
```
Status check:
@ -94,7 +95,7 @@ The following secrets must be defined at Gitea Organization level for pipelines
## 6. Custom Image Build and Harbor Push
`docker-stack-infra.yml` and microservice stacks use private images under `registry.tarla.io/iklimco/`. These images are built and pushed to the registry with the `ops/push-harbor-custom-images.sh` script.
Environment stack files and microservice stacks use private images under `registry.tarla.io/iklimco/`. These images are built and pushed to the registry with the `ops/push-harbor-custom-images.sh` script.
APISIX config files (`build/apisix-core/config.yaml`, `build/apisix-dashboard/conf.yaml`) are generated from templates under `template/` with `envsubst`. `push-harbor-custom-images.sh` performs this generation internally; temporary files are cleaned automatically when the build finishes.
@ -114,6 +115,6 @@ bash ops/push-harbor-custom-images.sh
1. The runner labeled `test-runner` appears as **Idle** (green) on the Gitea Runners page.
2. A workflow using `runs-on: test-runner` is triggered successfully.
3. The job container can access the Docker daemon and the `iklimco-net` overlay network.
3. The job can access the Docker daemon through `docker_host`, and deploy workflows connect job containers to `iklimco-net` when overlay access is required.
4. The `8200/tcp` (Vault) port is closed to the public internet.
5. `registry.tarla.io/iklimco/custom-apisix`, `custom-apisix-dashboard`, and `custom-prometheus` images exist in Harbor and are pullable.

@ -12,7 +12,7 @@ Terraform creates the following in the prod environment:
- DB subnet: `10.20.20.0/24`
- Firewall:
- Public ingress: only `22/tcp`, `80/tcp`, `443/tcp`
- Private ingress: prod rules in `01-private-network-port-matrisi.md`
- Private ingress: prod rules in `01-private-network-port-matrix.md`
- SSH key
- Placement groups:
- `iklim-prod-app-spread`
@ -145,6 +145,13 @@ The following ports will not be opened publicly in prod:
## Private Firewall
Firewall placement follows the Swarm placement model:
- DB/cluster services on `iklim-db-*` nodes: Patroni/PostgreSQL, MongoDB, and etcd.
- App/service-node infrastructure on `iklim-app-*` nodes: Vault, RabbitMQ, APISIX, Prometheus, Grafana, SWAG, and the Redis/Sentinel services from `docker-stack-infra_db-prod.yml`.
RabbitMQ ports are therefore documented under the app firewall. Redis and Redis Sentinel do not publish host-mode ports in the current prod stack; they stay on the Docker overlay network and do not need Hetzner firewall openings.
### App (swarm) Firewall — Private Ingress
Source from app subnet (`10.20.10.0/24`):
@ -340,7 +347,7 @@ Local state is used for now (`terraform.tfstate`). The state file is not committ
- Swarm nodes are inside the `iklim-prod-app-spread` placement group.
- DB nodes are inside the `iklim-prod-db-spread` placement group.
- Public firewall allows only `22`, `80`, and `443` ingress.
- Private firewall is compatible with `01-private-network-port-matrisi.md`.
- Private firewall is compatible with `01-private-network-port-matrix.md`.
- DB replication ports are accessible only from the DB subnet.
- Floating IP is created and assigned to `iklim-app-01`.
- Terraform state and secret tfvars are not committed.

@ -119,6 +119,8 @@ ansible/
vars.yml
vault.yml
prod-bootstrap.yml
roles/
db_stack/
roles/
base/
hardening/
@ -131,6 +133,8 @@ ansible/
db_stack/
```
`ansible/prod/ansible.cfg` sets `roles_path = roles:../roles`. Because of that ordering, `ansible/prod/roles/db_stack` is the production-specific role that is used by `prod-bootstrap.yml`; the shared `ansible/roles/db_stack` remains the common fallback/reference implementation. Production DB behavior that writes Patroni, MongoDB, and replica-set auth files to StorageBox belongs to the prod-local role.
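
The effect of that ordering can be illustrated with a simplified lookup. This is only a sketch of first-match-wins resolution, not Ansible's full role search: `resolve_role` is a hypothetical helper, and it assumes relative `roles_path` entries are resolved against the directory that contains `ansible.cfg`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): walks a colon-separated roles_path
# in order and prints the first directory that contains the role.
# Usage: resolve_role <config_dir> <roles_path> <role>
resolve_role() {
  local base="$1" roles_path="$2" role="$3" entry
  local IFS=':'                       # split roles_path on colons
  for entry in $roles_path; do
    if [ -d "$base/$entry/$role" ]; then
      printf '%s\n' "$base/$entry/$role"
      return 0                        # first match wins
    fi
  done
  return 1                            # role not found in any entry
}
```

With `roles:../roles`, `resolve_role ansible/prod roles:../roles db_stack` would print the prod-local role when it exists, while a role that exists only under `ansible/roles` (for example `swarm`) would fall through to the shared directory.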
## Base Role
Applied to all prod nodes:
@ -200,30 +204,35 @@ Prod Swarm will be set up with 3 managers:
1. `docker swarm init` on `iklim-app-01` (Advertise/data path addr: `10.20.10.11`)
2. `iklim-app-02` and `iklim-app-03` join as managers.
3. `iklim-db-01/02/03` join as workers.
4. Overlay network is created: `iklimco-net`
4. `iklimco-net` is not created by the Ansible swarm role. It is created and owned by the Swarm stack (`docker-stack-infra_db-prod.yml`) so Docker embedded DNS works for service VIPs and aliases.
5. Node labels:
- `iklim-app-*` -> `type=service`
- `iklim-db-*` -> `role=db`, `db-index=01/02/03`, for Patroni node coordination
- `iklim-db-*` -> `role=db`
- `iklim-db-*` -> `db-index=01/02/03`, for Patroni node coordination
6. All nodes remain `AVAILABILITY=Active`.
The `db-index` labels are added through `iklim-app-01` in a separate play inside `prod-bootstrap.yml`, not by the swarm role.
Labeling is intentionally split across two automation layers:
- The shared `swarm` role adds the generic environment labels: `type=service` on app nodes and `role=db` on DB nodes.
- The production playbook adds `db-index=01/02/03` through `iklim-app-01` in a separate play inside `prod-bootstrap.yml`.
This split keeps the common Swarm role reusable while letting prod add the Patroni/MongoDB coordination labels it needs.
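
The resulting label set per node can be written down as a small mapping, which is handy when verifying a bootstrap run. The sketch below is illustrative only: `expected_labels` is a hypothetical helper, and it assumes the `db-index` value is the numeric suffix of the DB hostname, as the `01/02/03` pattern above suggests.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): prints the labels a node is
# expected to carry after both automation layers have run.
# Usage: expected_labels <hostname>
expected_labels() {
  local node="$1"
  case "$node" in
    # App nodes: generic label from the shared swarm role.
    iklim-app-*) printf 'type=service\n' ;;
    # DB nodes: role=db from the swarm role, db-index from the prod playbook.
    iklim-db-*)  printf 'role=db db-index=%s\n' "${node##*-}" ;;
    # Unknown naming pattern.
    *) return 1 ;;
  esac
}
```

The output can be compared by eye with `docker node inspect --format '{{ .Spec.Labels }}' iklim-db-02` on a manager node, which prints the labels Swarm actually stores for that node.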
## Node Directory Role
On all `iklim-app-*` nodes:
```text
/opt/iklimco/ssl
/opt/iklimco/init
/opt/iklimco/stacks
/opt/iklimco/vault/data
```
`/opt/iklimco/vault/data` is the host path volume of the Vault Raft node; it must be created separately on every app node. Swarm does not manage this directory as an overlay volume; if it is missing, the Vault container will not start.
Vault data is managed by the `docker-stack-vault.yml` stack through Docker volumes. The app nodes need the local SSL directory because `cert-distributor` syncs certificates from StorageBox into `/opt/iklimco/ssl` for Vault.
On DB nodes:
```text
/opt/iklimco/db
/opt/iklimco/backup
/opt/iklimco/db/mongodb
/opt/iklimco/db/postgresql
```
## StorageBox DAVFS Mount Role
@ -256,19 +265,22 @@ Applied to `iklim-app-*` nodes. Gitea Act Runner is installed on each app node a
## DB Stack Role
Applied to `iklim-db-*` nodes. On each DB node, it creates `/opt/iklimco/db` and `/opt/iklimco/backup` directories, as well as a local reference directory for MongoDB. The actual production configuration, including node-specific `mongod.conf`, replica set auth key, and Patroni configurations, is set up on StorageBox at `/mnt/storagebox/db/mongodb-0X/config/` and `/mnt/storagebox/db/postgresql-0X/config/` in the `08-prod-db-cluster-kurulum.md` step. etcd data is stored on local Docker named volumes (not StorageBox).
Applied to `iklim-db-*` nodes. On each DB node, it creates `/opt/iklimco/db`, `/opt/iklimco/backup`, `/opt/iklimco/db/mongodb`, and `/opt/iklimco/db/postgresql`. The production configuration, including node-specific `mongod.conf`, replica set auth key, and Patroni configurations, is deployed by the Ansible `db_stack` role to StorageBox at `/mnt/storagebox/db/mongodb-0X/config/` and `/mnt/storagebox/db/postgresql-0X/config/`. etcd data is stored on local Docker named volumes.
## DB Stack Env Variables
Password variables required by the DB cluster stack (`docker-stack-db.prod.yml`) — `DATABASE_POSTGRES_ROOT_PASSWD`, `DATABASE_POSTGRES_REPLICATOR_PASSWORD`, `DATABASE_MONGODB_ROOT_PASSWD` — are stored in `prod/secrets/iklim.co/.env.secrets.shared` on StorageBox, alongside the other shared secrets. No separate file is needed.
Password variables required by the prod infra stack (`docker-stack-infra_db-prod.yml`) — including `DATABASE_POSTGRES_ROOT_PASSWD`, `DATABASE_POSTGRES_REPLICATOR_PASSWORD`, `DATABASE_MONGODB_ROOT_PASSWD`, and `ETCD_ROOT_PASSWORD` — are stored in `prod/secrets/iklim.co/.env.secrets.shared` on StorageBox, alongside the other shared secrets. No separate file is needed.
## StorageBox Directory Structure
The `storagebox` Ansible role **automatically** creates the following directories during bootstrap through `storagebox_managed_directories` (`group_vars/all/vars.yml`). No manual step is needed:
- `/mnt/storagebox/ssl` → `SWAG_CERT_DIR`
- `/mnt/storagebox/swag/config` → `SWAG_CONFIG_DIR`
- `/mnt/storagebox/swag`
- `/mnt/storagebox/swag/dns-conf` → `SWAG_DNS_CONFIG_DIR`
- `/mnt/storagebox/swag/site-confs` → `SWAG_SITE_CONFS_DIR`
- `/mnt/storagebox/swag/proxy-confs` → `SWAG_PROXY_CONFS_DIR`
- `/mnt/storagebox/swag/certbot`
- `/mnt/storagebox/grafana/data` → `GRAFANA_DATA_DIR`
- `/mnt/storagebox/precipitation/images`
@ -300,12 +312,12 @@ grep -n "swarm init\|swarm join" init/swarm-init.sh
- 3 Swarm manager nodes appear as Leader/Reachable in `docker node ls`.
- 3 DB nodes appear as Workers in `docker node ls`.
- Manager quorum is maintained: with 3 managers, the loss of 1 manager is tolerated.
- The `iklimco-net` overlay network exists.
- The `iklimco-net` overlay network is created by the Swarm stack after `docker-stack-infra_db-prod.yml` deploy.
- Node labels (`type=service`, `role=db`, `db-index=01/02/03`) are verified with inspect.
- `swarm-init.sh` does not attempt init again in an active Swarm; it is idempotent.
- `/mnt/storagebox` is mounted on every node.
- The `/opt/iklimco/vault/data` directory exists on every app node.
- The `ssl`, `swag/config`, `swag/site-confs`, `grafana/data`, and `precipitation/images` directories exist on StorageBox.
- The `/opt/iklimco/ssl` directory exists on every app node.
- The `db`, `ssl`, `swag`, `swag/dns-conf`, `swag/site-confs`, `swag/proxy-confs`, `swag/certbot`, `grafana/data`, and `precipitation/images` directories exist on StorageBox.
- The Gitea Act Runner service is running on every app node.
- `/opt/iklimco/db` and `/opt/iklimco/backup` directories exist on DB nodes. Node-specific `mongod.conf` and other DB configurations are created on StorageBox (`/mnt/storagebox/db/...`) in the `08-prod-db-cluster-kurulum.md` step.
- `/opt/iklimco/db` and `/opt/iklimco/backup` directories exist on DB nodes. Node-specific `mongod.conf` and other DB configurations are created on StorageBox (`/mnt/storagebox/db/...`) in the `08-prod-db-cluster-setup.md` step.
- Public firewall allows only `22`, `80`, and `443` ingress.
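The StorageBox directory criterion above can be spot-checked with a small helper. This is a minimal sketch, not a script from the repository: the function name `missing_storagebox_dirs` is hypothetical, and the directory list is copied from the criterion above.

```bash
# Hypothetical helper: print every expected StorageBox directory that is missing.
# Usage: missing_storagebox_dirs [ROOT]   (ROOT defaults to /mnt/storagebox)
missing_storagebox_dirs() {
  local root="${1:-/mnt/storagebox}"
  local dir
  # Directory list taken from the acceptance criterion above.
  for dir in db ssl swag swag/dns-conf swag/site-confs swag/proxy-confs \
             swag/certbot grafana/data precipitation/images; do
    [ -d "${root}/${dir}" ] || printf '%s\n' "${root}/${dir}"
  done
}
```

Running `missing_storagebox_dirs` on any node with StorageBox mounted should print nothing when the criterion is met; any printed path is a directory that still needs to be created.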

@ -27,7 +27,9 @@ iklim-db-03 (Swarm worker, 10.20.20.13)
patroni-03 [Patroni + PostgreSQL — standby]
```
DB containers discover each other through **overlay DNS aliases** (`mongodb-01`, `etcd-01`, `patroni-01`, etc.) on the shared `iklimco-net` overlay network. Each service publishes its port in `host` mode so replication traffic goes directly through the Hetzner private network while the overlay DNS resolves service names correctly. All containers are defined in the single `docker-stack-db.prod.yml` stack file at the repo root.
DB containers discover each other through **overlay DNS aliases** (`mongodb-01`, `etcd-01`, `patroni-01`, etc.) on the shared `iklimco-net` overlay network. Patroni/PostgreSQL, MongoDB, and etcd are the DB/cluster services covered by this document; they publish their cluster ports in `host` mode so replication traffic goes directly through the Hetzner private network while overlay DNS resolves service names correctly.
The current prod DB services are defined in the root `docker-stack-infra_db-prod.yml` stack file. That stack also contains non-DB infrastructure services such as Redis, Redis Sentinel, and RabbitMQ. Those services are intentionally different: they run on `node.labels.type == service` app/service nodes, do not publish host-mode ports in this stack, and communicate through the `iklimco-net` overlay network only. Do not generalize the DB host-mode rule to Redis or RabbitMQ.
## 1. Firewall Update
@ -145,6 +147,10 @@ terraform apply
## 2. Add DB Nodes to Swarm
This is handled by `Environment_Infrastructure/ansible/prod/prod-bootstrap.yml` through the `swarm` role. The role initializes Swarm on `iklim-app-01`, joins `iklim-app-02/03` as managers, joins `iklim-db-01/02/03` as workers, and labels DB nodes.
Manual equivalent, kept for troubleshooting only:
Get the join token **on one of the Swarm managers** (iklim-app-01):
```bash
@ -157,19 +163,35 @@ docker swarm join-token worker
docker swarm join --token <TOKEN> 10.20.10.11:2377
```
Label the nodes **on iklim-app-01**:
Label the nodes **on iklim-app-01**. In automation this is split into two phases:
- the shared `swarm` role adds `role=db` to DB nodes;
- the prod-specific `prod-bootstrap.yml` play adds `db-index=01/02/03`.
Manual equivalent:
```bash
docker node update --label-add role=db --label-add db-index=01 iklim-db-01
docker node update --label-add role=db --label-add db-index=02 iklim-db-02
docker node update --label-add role=db --label-add db-index=03 iklim-db-03
docker node update --label-add role=db iklim-db-01
docker node update --label-add role=db iklim-db-02
docker node update --label-add role=db iklim-db-03
docker node update --label-add db-index=01 iklim-db-01
docker node update --label-add db-index=02 iklim-db-02
docker node update --label-add db-index=03 iklim-db-03
docker node ls
```
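Once the labels are applied, they can be read back with `docker node inspect`. The helpers below are a hedged sketch: `expected_db_index` and `check_db_labels` are hypothetical names, and the only assumption about hostnames is the `iklim-db-NN` pattern used in this document.

```bash
# Hypothetical helpers for verifying DB node labels; not part of the repository.

# Derive the expected db-index label from a hostname such as iklim-db-02.
expected_db_index() {
  printf '%s\n' "${1##*-}"
}

# Compare the labels reported by Swarm with the expected values.
# Must run on a manager node; returns non-zero on a mismatch.
check_db_labels() {
  local node="$1" role index
  role="$(docker node inspect --format '{{ index .Spec.Labels "role" }}' "$node")" || return 1
  index="$(docker node inspect --format '{{ index .Spec.Labels "db-index" }}' "$node")" || return 1
  [ "$role" = "db" ] && [ "$index" = "$(expected_db_index "$node")" ]
}
```

On `iklim-app-01`, a call such as `check_db_labels iklim-db-01 || echo "label mismatch"` would flag a node whose `role` or `db-index` label differs from the expected value.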
## 3. StorageBox Directory Structure
DB data and logs are stored on **local Docker named volumes** (performance, WAL/compaction requirements). Only config files are placed on StorageBox. On each DB node, where `/mnt/storagebox` must already be mounted:
DB data is stored on local DB-node paths prepared by Ansible:
```text
/opt/iklimco/db/mongodb
/opt/iklimco/db/postgresql
```
Configuration files are placed on StorageBox. On each DB node, where `/mnt/storagebox` must already be mounted:
```bash
# On iklim-db-01:
@ -185,7 +207,7 @@ mkdir -p /mnt/storagebox/db/mongodb-03/config
mkdir -p /mnt/storagebox/db/postgresql-03/config
```
Config files (`mongod.conf`, `patroni.yml`) are deployed by the Ansible `db_stack` role into these directories. Named Docker volumes (`mongodb-01-data`, `etcd-01-data`, `postgresql-01-data`, etc.) are created automatically by the stack deploy.
Config files (`mongod.conf`, `patroni.yml`) and the MongoDB replica set key are deployed by the Ansible `db_stack` role into these directories. etcd uses Docker named volumes (`etcd-01-data`, `etcd-02-data`, `etcd-03-data`) from `docker-stack-infra_db-prod.yml`.
## 4. MongoDB Replica Set
@ -216,14 +238,18 @@ security:
### Replica Set Auth Key
The **same** key file must exist on all DB nodes:
The **same** key file must exist on all DB nodes. In the current production setup, this is automated by `ansible/prod/roles/db_stack/tasks/db_node.yml`:
- `iklim-db-01` generates `/mnt/storagebox/db/mongodb-01/config/rs-auth.key` if it is missing;
- the same key content is copied to `/mnt/storagebox/db/mongodb-02/config/rs-auth.key` and `/mnt/storagebox/db/mongodb-03/config/rs-auth.key`;
- permissions are set to `0400`.
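Whether the three key files really match can be checked after the role has run. The function below is an illustrative sketch rather than a task from `db_node.yml`: `rs_keys_consistent` is a hypothetical name, and it assumes GNU coreutils (`sha256sum`, `stat -c`) on the DB nodes.

```bash
# Hypothetical check: all given key files have identical content and mode 400.
# Returns 0 when consistent, 1 otherwise.
rs_keys_consistent() {
  local first_sum="" sum mode file
  for file in "$@"; do
    [ -f "$file" ] || return 1
    mode="$(stat -c '%a' "$file")"
    [ "$mode" = "400" ] || return 1
    # Hash via stdin so the file name does not appear in the output.
    sum="$(sha256sum < "$file")"
    if [ -z "$first_sum" ]; then
      first_sum="$sum"
    elif [ "$sum" != "$first_sum" ]; then
      return 1
    fi
  done
  return 0
}
```

It would be called with the three `rs-auth.key` paths listed above; a non-zero exit status means at least one file is missing, differs in content, or has permissions other than `400`.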
Manual recovery equivalent, kept only for troubleshooting:
```bash
# Create on iklim-db-01:
openssl rand -base64 756 > /mnt/storagebox/db/mongodb-01/config/rs-auth.key
chmod 400 /mnt/storagebox/db/mongodb-01/config/rs-auth.key
# Copy the same content to the other nodes:
cat /mnt/storagebox/db/mongodb-01/config/rs-auth.key \
> /mnt/storagebox/db/mongodb-02/config/rs-auth.key
cat /mnt/storagebox/db/mongodb-01/config/rs-auth.key \
@ -234,14 +260,16 @@ chmod 400 /mnt/storagebox/db/mongodb-0{2,3}/config/rs-auth.key
### Stack File — MongoDB
MongoDB services are defined in `docker-stack-db.prod.yml` (repo root). Each service uses a named Docker volume for data and log, and a StorageBox bind mount for config:
MongoDB services are defined in `docker-stack-infra_db-prod.yml` (repo root). Each service uses a local DB-node bind mount for data and a StorageBox bind mount for config:
```yaml
mongodb-01:
image: mongo:8.3.2
image: ${IMAGE_MONGODB}
environment:
MONGO_INITDB_ROOT_USERNAME: "${DATABASE_MONGODB_ROOT_USER}"
MONGO_INITDB_ROOT_PASSWORD: "${DATABASE_MONGODB_ROOT_PASSWD}"
volumes:
- mongodb-01-data:/data/db
- mongodb-01-log:/data/log
- /opt/iklimco/db/mongodb:/data/db
- /mnt/storagebox/db/mongodb-01/config:/data/configdb
networks:
iklimco-net:
@ -260,11 +288,18 @@ mongodb-01:
- node.hostname == iklim-db-01
```
Volumes `mongodb-01-data`, `mongodb-01-log`, etc. are declared at the bottom of `docker-stack-db.prod.yml` and are created automatically on first deploy.
The same pattern is repeated for `mongodb-02` and `mongodb-03`, with node-specific StorageBox config paths and placement constraints.
### Replica Set Initialization
Run **once** after the stack is deployed:
Replica set initialization is handled by the root prod workflow step `Initialize MongoDB Replica Set`. The workflow:
1. Connects to the first host from `DATABASE_MONGODB_HOST`.
2. Runs `rs.initiate()` if the replica set is uninitialized.
3. Checks current members if the replica set already exists.
4. Runs `rs.add()` through the primary if hosts from `DATABASE_MONGODB_HOST` are missing.
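The missing-member decision in step 4 comes down to a comparison of two host lists. The sketch below shows only that comparison, under stated assumptions: `missing_rs_members` is a hypothetical name, the workflow's real implementation is not shown in this document, and the current member list is assumed to have been collected separately (for example from the `name` fields of `rs.status().members` in `mongosh`).

```bash
# Hypothetical helper: print hosts from the wanted list that are absent from
# the current member list. Both arguments are comma-separated host:port lists.
missing_rs_members() {
  local wanted="$1" current="$2" host old_ifs="$IFS"
  IFS=','
  for host in $wanted; do
    case ",${current}," in
      *",${host},"*) ;;                 # already a member
      *) printf '%s\n' "$host" ;;       # candidate for rs.add()
    esac
  done
  IFS="$old_ifs"
}
```

Each printed host would then be passed to `rs.add()` on the primary; an empty result means the replica set already contains every host from `DATABASE_MONGODB_HOST`.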
Manual equivalent, kept for troubleshooting only:
```bash
# On iklim-app-01 (for overlay network access):
@ -293,7 +328,7 @@ Patroni coordinates PostgreSQL primary/standby roles through etcd. If the primar
### 5.1 Custom Image (Patroni + PostGIS)
Patroni is installed on top of the `postgis/postgis:18-3.6` image. This image is pushed to Harbor and used in the stack.
Patroni is installed on top of the `postgis/postgis:18-3.6` image. This image is pushed to Harbor and used in `docker-stack-infra_db-prod.yml` via `${CUSTOM_IMAGE_REGISTRY}${IMAGE_PATRONI}`.
`build/patroni-postgis/Dockerfile`:
@ -335,13 +370,13 @@ docker push registry.tarla.io/iklimco/custom-patroni-postgis:18-3.6
### 5.2 etcd Cluster
etcd services are defined in `docker-stack-db.prod.yml`. Each service uses a named Docker volume for data and has an overlay DNS alias. Environment variables reference peer URLs by alias, not by hardcoded IP:
etcd services are defined in `docker-stack-infra_db-prod.yml`. Each service uses a named Docker volume for data and has an overlay DNS alias. Environment variables reference peer URLs by alias, not by hardcoded IP:
```yaml
etcd-01:
image: bitnami/etcd:3
image: ${IMAGE_ETCD}
environment:
ALLOW_NONE_AUTHENTICATION: "yes"
ALLOW_NONE_AUTHENTICATION: "no"
ETCD_NAME: etcd-01
ETCD_INITIAL_ADVERTISE_PEER_URLS: http://etcd-01:2380
ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380
@ -350,6 +385,7 @@ etcd-01:
ETCD_INITIAL_CLUSTER: "etcd-01=http://etcd-01:2380,etcd-02=http://etcd-02:2380,etcd-03=http://etcd-03:2380"
ETCD_INITIAL_CLUSTER_STATE: new
ETCD_INITIAL_CLUSTER_TOKEN: iklimco-etcd-prod
ETCD_ROOT_PASSWORD: "${ETCD_ROOT_PASSWORD}"
volumes:
- etcd-01-data:/bitnami/etcd/data
networks:
@ -366,7 +402,7 @@ etcd-01:
**APISIX etcd usage:** In prod, APISIX shares this etcd cluster with the `/apisix` prefix. Patroni uses the `/service/` prefix and APISIX uses the `/apisix/` prefix — no collision. The overlay DNS names (`etcd-01:2379`, `etcd-02:2379`, `etcd-03:2379`) are reachable from app nodes via the `iklimco-net` overlay. Therefore, the app subnet → DB nodes port 2379 firewall rule is mandatory; it was added in Section 1.
**Important:** `ETCD_INITIAL_CLUSTER_STATE` must be `new` on the first deploy and `existing` on all later deploys. The deploy steps in Section 6 detect this automatically; no manual update is required.
**Important:** `ETCD_INITIAL_CLUSTER_STATE` is currently defined in `docker-stack-infra_db-prod.yml`. When changing etcd cluster membership, do not blindly expand `ETCD_INITIAL_CLUSTER` on a running cluster; add members through etcd membership operations first.
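To make the ordering concrete, here is a hedged sketch of a membership change. The member name `etcd-04` and the helper `build_initial_cluster` are hypothetical; the peer URL pattern `http://<name>:2380` is taken from the stack file excerpt above, and the `etcdctl` invocation is shown as a comment because it needs access to the running cluster.

```bash
# Hypothetical helper: build an ETCD_INITIAL_CLUSTER value from member names,
# assuming each member's peer URL is http://<name>:2380 as in the stack file.
build_initial_cluster() {
  local out="" name
  for name in "$@"; do
    out="${out:+${out},}${name}=http://${name}:2380"
  done
  printf '%s\n' "$out"
}

# Order of operations for a hypothetical fourth member (etcd-04), run from a
# container that has etcdctl and is attached to the iklimco-net overlay:
#   1. Register the member with the running cluster first:
#        etcdctl --endpoints=http://etcd-01:2379 --user "root:${ETCD_ROOT_PASSWORD}" \
#          member add etcd-04 --peer-urls=http://etcd-04:2380
#   2. Only then start etcd-04 with ETCD_INITIAL_CLUSTER_STATE=existing and
#        ETCD_INITIAL_CLUSTER="$(build_initial_cluster etcd-01 etcd-02 etcd-03 etcd-04)"
```

Registering the member before starting it lets the running cluster agree on the new membership; starting a node with an expanded `ETCD_INITIAL_CLUSTER` that the cluster does not yet know about is the case the note above warns against.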
### 5.3 Patroni Configuration
@ -447,17 +483,19 @@ For Node 02 and 03, only `name`, `restapi.connect_address`, and `postgresql.conn
### 5.4 Stack File — Patroni
Patroni services are defined in `docker-stack-db.prod.yml`. Each service uses the custom image, a named Docker volume for data, a StorageBox bind mount for the config file, and overlay DNS aliases:
Patroni services are defined in `docker-stack-infra_db-prod.yml`. Each service uses the custom image, a local DB-node bind mount for data, a StorageBox bind mount for the config file, and overlay DNS aliases:
```yaml
patroni-01:
image: registry.tarla.io/iklimco/custom-patroni-postgis:18-3.6
image: ${CUSTOM_IMAGE_REGISTRY}${IMAGE_PATRONI}
environment:
DATABASE_POSTGRES_ROOT_PASSWD: "${DATABASE_POSTGRES_ROOT_PASSWD}"
DATABASE_POSTGRES_REPLICATOR_PASSWORD: "${DATABASE_POSTGRES_REPLICATOR_PASSWORD}"
POSTGRES_USER: "${DATABASE_POSTGRES_ROOT_USER}"
POSTGRES_PASSWORD: "${DATABASE_POSTGRES_ROOT_PASSWD}"
REPLICATOR_PASSWORD: "${DATABASE_POSTGRES_REPLICATOR_PASSWORD}"
ETCD_ROOT_PASSWORD: "${ETCD_ROOT_PASSWORD}"
TZ: "Europe/Istanbul"
volumes:
- postgresql-01-data:/var/lib/postgresql/data
- /opt/iklimco/db/postgresql:/var/lib/postgresql/data
- /mnt/storagebox/db/postgresql-01/config/patroni.yml:/etc/patroni/patroni.yml:ro
networks:
iklimco-net:
@ -480,7 +518,7 @@ patroni-01:
- node.hostname == iklim-db-01
```
Volumes `postgresql-01-data`, `postgresql-02-data`, `postgresql-03-data` are declared at the bottom of `docker-stack-db.prod.yml` and created automatically on first deploy.
The same pattern is repeated for `patroni-02` and `patroni-03`, with node-specific StorageBox config paths and placement constraints.
### 5.5 Status Check
@ -508,11 +546,11 @@ docker exec -it $(docker ps -q -f name=iklimco_patroni-01 | head -1) \
## 6. Deploy
All DB services (etcd, MongoDB, Patroni) are in the single `docker-stack-db.prod.yml` stack. Deploy from `iklim-app-01` in the repo working directory.
All DB services (etcd, MongoDB, Patroni) are in the current root prod stack `docker-stack-infra_db-prod.yml`. Normal deployment is done by `.gitea/workflows/deploy-prod.yml`, not by running a separate DB stack manually.
### .env File
DB stack password variables (`DATABASE_POSTGRES_ROOT_PASSWD`, `DATABASE_POSTGRES_REPLICATOR_PASSWORD`, `DATABASE_MONGODB_ROOT_PASSWD`) are stored in `prod/secrets/iklim.co/.env.secrets.shared` on StorageBox. Fetch it to `iklim-app-01` before deploy:
DB stack password variables (`DATABASE_POSTGRES_ROOT_PASSWD`, `DATABASE_POSTGRES_REPLICATOR_PASSWORD`, `DATABASE_MONGODB_ROOT_PASSWD`, `ETCD_ROOT_PASSWORD`) are stored in `prod/secrets/iklim.co/.env.secrets.shared` on StorageBox. The workflow fetches this file automatically.
```bash
scp -P 23 STORAGEBOX_USER@STORAGEBOX_USER.your-storagebox.de:prod/secrets/iklim.co/.env.secrets.shared \
@ -522,44 +560,18 @@ chmod 600 /tmp/.env.secrets.shared
### Deploy Steps
The root prod workflow deploys the stack with:
```bash
# On iklim-app-01, in the repo working directory:
set -a; . /tmp/.env.secrets.shared; set +a
# Automatic ETCD_INITIAL_CLUSTER_STATE detection:
DEPLOY_FILE="docker-stack-db.prod.yml"
if docker service ls --filter name=iklimco_etcd-01 -q 2>/dev/null | grep -q .; then
echo " etcd services exist, deploying with 'existing'..."
DEPLOY_FILE=$(mktemp /tmp/docker-stack-db.XXXXXX.yml)
sed "s/ETCD_INITIAL_CLUSTER_STATE: new/ETCD_INITIAL_CLUSTER_STATE: existing/g" \
docker-stack-db.prod.yml > "$DEPLOY_FILE"
else
echo " First deploy, using 'new' state..."
fi
docker stack deploy \
--with-registry-auth \
-c "$DEPLOY_FILE" \
--resolve-image changed \
-c docker-stack-infra_db-prod.yml \
iklimco
[ "$DEPLOY_FILE" != "docker-stack-db.prod.yml" ] && rm -f "$DEPLOY_FILE"
# Wait for etcd cluster to be ready:
echo "⏳ Waiting for etcd..."
for i in $(seq 1 18); do
if docker run --rm --network iklimco-net alpine \
sh -c "wget -qO- http://etcd-01:2379/health 2>/dev/null | grep -q '\"health\":\"true\"'"; then
echo "✅ etcd ready"
break
fi
[ "$i" -eq 18 ] && echo "❌ etcd timeout" && exit 1
echo " attempt $i/18: waiting 10s..."
sleep 10
done
docker stack services iklimco
```
After the stack deploy, the workflow waits for etcd, initializes APISIX, initializes the MongoDB replica set, and runs PostgreSQL/MongoDB init scripts.
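The waiting steps share one pattern: retry a probe a bounded number of times, then fail. A minimal sketch of that pattern follows. `wait_for` and `etcd_healthy` are hypothetical names, and the health probe is modeled on the etcd check used in earlier manual deploy steps rather than copied from the current workflow.

```bash
# Hypothetical retry helper: run a command until it succeeds or attempts run out.
# Usage: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for() {
  local attempts="$1" delay="$2" i
  shift 2
  for i in $(seq 1 "$attempts"); do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Assumed health probe; needs Docker and the iklimco-net overlay, so it is
# defined here but not called.
etcd_healthy() {
  docker run --rm --network iklimco-net alpine \
    sh -c "wget -qO- http://etcd-01:2379/health 2>/dev/null | grep -q '\"health\":\"true\"'"
}
```

A deploy step could then use `wait_for 18 10 etcd_healthy || exit 1`, which gives etcd up to roughly three minutes before the job fails.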
### DB Node Placement Check
```bash
@ -572,7 +584,7 @@ All tasks must run on the expected `iklim-db-*` nodes.
### MongoDB Replica Set Initialization
Run once after the stack is deployed:
Handled by the workflow. Manual form for troubleshooting:
```bash
# From iklim-app-01 via overlay network:
@ -596,7 +608,7 @@ App containers connect to DB services through the `iklimco-net` overlay network
### MongoDB Replica Set Connection String
Variables in `env-prod/.env`:
Variables in StorageBox `prod/secrets/iklim.co/.env`:
```bash
DATABASE_MONGODB_HOST=mongodb-01:27017,mongodb-02:27017,mongodb-03:27017
@ -613,7 +625,7 @@ mongodb://<user>:<password>@mongodb-01:27017,mongodb-02:27017,mongodb-03:27017/<
### PostgreSQL — Patroni
Variables in `env-prod/.env`:
Variables in StorageBox `prod/secrets/iklim.co/.env`:
```bash
DATABASE_POSTGRES_HOST=patroni-01:5432,patroni-02:5432,patroni-03:5432
@ -647,8 +659,7 @@ curl -s http://patroni-01:8008/primary
`pg-proxy` and `mongo-proxy` are **not used** in the prod cluster setup. Access from an office computer targets the DB subnet directly.
### WireGuard Setting
`AllowedIPs` must be updated in the `.conf` file on the office computer:
`AllowedIPs = 10.8.0.1/32, 10.20.20.0/24`
`AllowedIPs` must be updated in the `.conf` file on the office computer: `AllowedIPs = 10.8.0.1/32, 10.20.20.0/24`
### Connection Parameters (Multi-Host)
Modern database tools (DBeaver, Compass, etc.) must connect in a cluster-aware way:
@ -660,7 +671,7 @@ Modern veritabanı araçları (DBeaver, Compass vb.) küme farkındalıklı bağ
## Acceptance Criteria
- `docker stack services iklimco` → 9 services visible (etcd-01/02/03, mongodb-01/02/03, patroni-01/02/03), all `1/1`
- `docker stack services iklimco` → etcd-01/02/03, mongodb-01/02/03, patroni-01/02/03 are visible and all target replicas are healthy
- `docker service ps iklimco_patroni-01/02/03` — each task runs on its expected `iklim-db-*` node
- `docker service ps iklimco_mongodb-01/02/03` — each task runs on its expected `iklim-db-*` node
- `docker service ps iklimco_etcd-01/02/03` — each task runs on its expected `iklim-db-*` node

@ -16,7 +16,7 @@ In this model, if any manager/runner is lost, the other runners can pick up pipe
## Runner Installation Model
The runner will not run as a Docker container. There is no Docker socket mount.
The runner will not run as a Docker container. It runs as a systemd service on the app nodes. Job containers start on Docker `bridge`; deploy workflows connect the job container to `iklimco-net` after the stack creates that network.
Installation:
@ -33,7 +33,7 @@ If runner jobs use Docker CLI for deploy, the `gitea-runner` user needs access t
Shared labels on all prod runners:
```text
prod-runner
prod-runner:docker://catthehacker/ubuntu:act-22.04
ubuntu-24.04
```
@ -86,20 +86,19 @@ For the GoDaddy API key: https://developer.godaddy.com/keys — create a **Produ
### Gitea `PROD_FLOATING_IP` Variable
For DNS automation, `PROD_FLOATING_IP` must be defined as a Gitea project variable. See the "Gitea Variable: PROD_FLOATING_IP" step in `06-prod-terraform-iaac.md`.
For DNS automation, `PROD_FLOATING_IP` must be defined as a Gitea project variable. See the "Gitea Variable: PROD_FLOATING_IP" step in `06-prod-terraform-iac.md`.
### Docker Secrets
Before the infra stack is deployed, the following Docker secrets must be created on `iklim-app-01`. These secrets are referenced by `docker-stack-infra.prod.yml`; if they do not exist, stack deploy fails.
Before the infra stack is deployed, `rabbitmq_erlang_cookie` must exist as a Docker secret. The current prod workflow creates it in the `Create Infrastructure Docker Secrets` step if it is missing.
```bash
# RabbitMQ Erlang cluster cookie; must be the same on all RabbitMQ nodes:
# RabbitMQ Erlang cluster cookie; must be the same on all RabbitMQ nodes.
# The workflow does this automatically if the secret is missing:
openssl rand -hex 32 | docker secret create rabbitmq_erlang_cookie -
```
> The `vault_unseal_key` secret is created after Vault is started for the first time; see `roadmap/prod-env/07-vault-raft-plan.md` Step 3. It is not required for the first infra stack deploy; it is waited for until the health check is triggered.
>
> This secret is also used during Vault restarts triggered by cert-reloader: when `cert-reloader` detects a certificate change, it runs `docker service update --force iklimco_vault`; while Vault containers restart, they read from the `vault_unseal_key` Docker secret and automatically unseal. If the secret is missing, Vault remains sealed after every certificate renewal.
> The `vault_unseal_key` secret is managed by `init/vault/vault-bootstrap.sh`. The bootstrap script creates a placeholder on first deploy, deploys `docker-stack-vault.yml`, initializes/unseals Vault, and rotates the secret to the real unseal key.
Verify secrets:
@ -120,7 +119,7 @@ Before the deploy pipeline runs, the following template files must exist in the
These files are created in the test environment (`test-env/04-swag-nginx-configs.md`); they are not created separately for prod. Template files are shared by both environments; prod-specific values are injected with environment variables during deploy.
Verify that the `prod/secrets/iklim.co/.env.prod` file on StorageBox contains the following variables:
Verify that the `prod/secrets/iklim.co/.env` file on StorageBox contains the following variables:
```bash
API_SUBDOMAIN=api.iklim.co
@ -129,11 +128,12 @@ RABBITMQ_SUBDOMAIN=rabbitmq.iklim.co
GRAFANA_SUBDOMAIN=grafana.iklim.co
RESTRICTED_IPS="78.187.87.109/32,95.70.151.248/32"
SWAG_CERT_DIR=/mnt/storagebox/ssl
SWAG_CONFIG_DIR=/mnt/storagebox/swag/config
SWAG_DNS_CONFIG_DIR=/mnt/storagebox/swag/dns-conf
SWAG_SITE_CONFS_DIR=/mnt/storagebox/swag/site-confs
SWAG_PROXY_CONFS_DIR=/mnt/storagebox/swag/proxy-confs
```
The pipeline sources these variables and renders the template files into the `$SWAG_SITE_CONFS_DIR` (`/mnt/storagebox/swag/site-confs`) directory. Because StorageBox is mounted commonly on all app nodes, even if the configuration is created on a single runner, SWAG containers on other nodes access the same files. Detail: `roadmap/prod-env/04-swag-nginx-configs.md`.
The pipeline sources these variables and renders the template files into the `$SWAG_SITE_CONFS_DIR` (`/mnt/storagebox/swag/site-confs`) directory. Because the same StorageBox is mounted on all app nodes, SWAG containers on other nodes read the same files even when the configuration is rendered on a single runner.
### APISIX Configuration
@ -194,27 +194,41 @@ All prod deploy workflows, including infra and microservices, must use the same
| 2 | Prepare Folders | |
| 3 | Set up SSH Key and Add to known_hosts | |
| 4 | Update Apt Repository and Install Required Tools | `gettext tree jq`; `jq` is required for the GoDaddy DNS API |
| 5 | Fetch Service Secret Files | Fetch `.env.secrets.*` from StorageBox |
| 6 | Initialize Workspace | Fetch `.env` and `.env.secrets.shared` from StorageBox; run `init-infra-dev.sh` |
| 7 | Upload Updated Secrets to Storagebox | |
| 8 | Provision Vault AppRole IDs and Docker Secrets | |
| 9 | Upload Updated Env to Storagebox | |
| 10 | Prepare Init Files | Cert copy lines removed |
| 11 | Initialize Docker Swarm | |
| 12 | Docker Login to Harbor | |
| 13 | **Update DNS Records** * | GoDaddy API; `api/apigw/rabbitmq/grafana` A records; idempotent |
| 14 | **Prepare SWAG Directories** * | `$SWAG_CONFIG_DIR/dns-conf`; renders nginx conf templates; reloads running SWAG |
| 15 | Bootstrap Vault TLS Placeholder | |
| 16 | Deploy Swarm Stack | base + prod overlay together |
| 17 | **Wait for etcd** * | Waits until Patroni etcd (`etcd-01:2379`) is healthy |
| 18 | **Run APISIX Init** * | `SPRING_PROFILES_ACTIVE=prod`; idempotent; writes to etcd |
| 19 | **Bootstrap SWAG Certificate** * | Waits for SWAG to obtain the cert; copies it to `SWAG_CERT_DIR` |
| 20 | **Run Database Init Scripts** * | `postgresql`/`mongodb` Swarm VIP; SQL+JS init; idempotent |
| 21 | Review Environment | |
| 5 | Fetch Prod Env From Storagebox | Fetch `.env` and `.env.secrets.shared` |
| 6 | Fetch Service Secret Files | Fetch `.env.secrets.<svc>` and `.env.secrets.swag` |
| 7 | Prepare Database Init Files | Render PostgreSQL/MongoDB init templates |
| 8 | Docker Login to Harbor | |
| 9 | Prepare SWAG Directories | Render `dns-conf` and `site-confs`; reload node-local SWAG if present |
| 10 | Bootstrap Vault TLS Placeholder | Creates temporary cert only if missing |
| 11 | Create Infrastructure Docker Secrets | Creates `rabbitmq_erlang_cookie` if missing |
| 12 | Deploy Swarm Stacks | `docker-stack-infra_db-prod.yml` |
| 13 | Connect Runner to Overlay Network | Connects job container to `iklimco-net` |
| 14 | Initialize Production Infrastructure | Runs `init-infra-prod.sh`; this triggers Vault bootstrap and RabbitMQ setup |
| 15 | Wait for Infrastructure Services | Waits for `iklimco_vault` and `iklimco_rabbitmq` |
| 16 | Provision Vault AppRole IDs and Docker Secrets | Downloads service `vault-files`, runs `init/provision-all-services.sh` |
| 17 | Upload Updated Secrets to Storagebox | Uploads `.env.secrets.*` and `.env` |
| 18 | Wait for etcd | Waits for etcd health |
| 19 | Run APISIX Init | `SPRING_PROFILES_ACTIVE=prod` |
| 20 | Bootstrap SWAG Certificate | Waits for SWAG and cert-reloader output in `SWAG_CERT_DIR` |
| 21 | Initialize MongoDB Replica Set | `rs.initiate()` or missing-member `rs.add()` |
| 22 | Run Database Init Scripts | Patroni primary + MongoDB replica set; SQL+JS init |
| 23 | Update DNS Records | GoDaddy API; `api/apigw/rabbitmq/grafana` A records |
| 24 | Review Environment | |
### Removal of Cert Scp Lines
### Stack Placement Boundary
Lines removed from the `Initialize Workspace` step:
`docker-stack-infra_db-prod.yml` is intentionally a mixed infrastructure stack. The DB/cluster services in that file are placed on DB nodes and expose host-mode cluster ports:
- Patroni/PostgreSQL, MongoDB, and etcd run on `iklim-db-*` workers.
The service-node infrastructure in the same file remains overlay-only unless a reverse proxy or explicit published port is defined by the stack:
- Redis, Redis Sentinel, and RabbitMQ run on `node.labels.type == service` app/service nodes.
- Redis and RabbitMQ must not be treated as DB-node host-mode services.
### Historical Note: Removed Cert Scp Lines
Older workflow versions copied certificate files manually in an `Initialize Workspace` step. That step no longer exists in the current root prod workflow. The removed lines are kept here only as a historical reference:
```yaml
# REMOVED — manual cert copy with scp is no longer required:
@ -222,7 +236,7 @@ scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebo
scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebox.de:prod/app/iklim.co/ssl/STAR.iklim.co_key.pem ./STAR.iklim.co_key.pem
```
Line also removed from the `Prepare Init Files` step:
This line was also removed from the old `Prepare Init Files` step:
```yaml
# REMOVED:
@ -231,97 +245,55 @@ sudo cp STAR.iklim.co.full.crt STAR.iklim.co_key.pem /opt/iklimco/ssl/
The certificate is now obtained by SWAG from Let's Encrypt and written to the `SWAG_CERT_DIR` (`/mnt/storagebox/ssl/`) directory in the `Bootstrap SWAG Certificate` step. Later renewals are handled automatically by cert-reloader.
### Bootstrap SWAG Certificate (Step 19)
### Bootstrap SWAG Certificate (Step 20)
On the first deploy, SWAG obtains the Let's Encrypt certificate with the GoDaddy DNS-01 challenge. This step waits for SWAG to obtain the certificate, for up to 10 minutes, and then copies it to the `SWAG_CERT_DIR` directory:
On the first deploy, SWAG obtains the Let's Encrypt certificate with the GoDaddy DNS-01 challenge. The current step waits for the Swarm `iklimco_swag` service to be running, then waits for `cert-reloader` to write `STAR.iklim.co.full.crt` to `SWAG_CERT_DIR`.
```yaml
- name: Bootstrap SWAG Certificate
run: |
set -a; . ./.env; set +a
echo "Waiting for SWAG container to start..."
SWAG_CTR=""
for i in $(seq 1 24); do
SWAG_CTR=$(docker ps -q -f name=iklimco_swag 2>/dev/null | head -1)
[ -n "$SWAG_CTR" ] && break
sleep 10
done
if [ -z "$SWAG_CTR" ]; then
echo "❌ SWAG container did not start"
exit 1
fi
CERT_PATH="/config/etc/letsencrypt/live/iklim.co/fullchain.pem"
echo "Waiting for cert (up to 10 min)..."
for i in $(seq 1 20); do
if docker exec "$SWAG_CTR" test -f "$CERT_PATH" 2>/dev/null; then
echo "✅ Cert obtained"
break
fi
echo " attempt $i/20 — waiting 30s..."
sleep 30
done
if ! docker exec "$SWAG_CTR" test -f "$CERT_PATH" 2>/dev/null; then
echo "❌ SWAG did not obtain cert. Logs:"
docker service logs iklimco_swag --tail 50
exit 1
fi
docker exec "$SWAG_CTR" cat "$CERT_PATH" | \
docker run --rm -i -v "${SWAG_CERT_DIR}:/output" alpine \
sh -c "cat > /output/STAR.iklim.co.full.crt && chmod 644 /output/STAR.iklim.co.full.crt"
docker exec "$SWAG_CTR" cat "/config/etc/letsencrypt/live/iklim.co/privkey.pem" | \
docker run --rm -i -v "${SWAG_CERT_DIR}:/output" alpine \
sh -c "cat > /output/STAR.iklim.co_key.pem && chmod 644 /output/STAR.iklim.co_key.pem"
echo "✅ Cert bootstrapped to ${SWAG_CERT_DIR}/"
echo "Waiting for SWAG service..."
docker service ps iklimco_swag --filter 'desired-state=running'
echo "Waiting for cert-reloader output in ${SWAG_CERT_DIR}..."
docker run --rm -v "${SWAG_CERT_DIR}:/ssl:ro" alpine \
test -f /ssl/STAR.iklim.co.full.crt
working-directory: /workspace/iklim.co
```
After this step, certificate files exist inside `SWAG_CERT_DIR` (`/mnt/storagebox/ssl/`); Vault TLS reads these files. Later renewals are handled automatically by cert-reloader. When the pipeline runs again, this step only waits for the SWAG container to be ready; certificate issuance is managed by SWAG/cert-reloader within Let's Encrypt's 90-day cycle.
After this step, certificate files exist inside `SWAG_CERT_DIR` (`/mnt/storagebox/ssl/`). `cert-distributor` syncs these files to node-local `/opt/iklimco/ssl`, where Vault reads them. Later renewals are handled automatically by SWAG, cert-reloader, and cert-distributor.
### Run Database Init Scripts (Step 20)
### Run Database Init Scripts (Step 22)
PostgreSQL and MongoDB init scripts run through Swarm overlay DNS service names (`postgresql`, `mongodb`):
PostgreSQL and MongoDB init scripts run after Patroni primary and MongoDB replica set readiness:
```yaml
- name: Run Database Init Scripts
run: |
set -a; . ./.env; . ./.env.secrets.shared; set +a
echo "⏳ Waiting for PostgreSQL..."
until docker run --rm --network iklimco-net \
-e PGPASSWORD="${DATABASE_POSTGRES_ROOT_PASSWD}" \
postgis/postgis:18-3.6 \
pg_isready -h postgresql -U "${DATABASE_POSTGRES_ROOT_USER}" -q 2>/dev/null; do
sleep 5
done
PG_URI="postgresql://${DATABASE_POSTGRES_ROOT_USER}@${DATABASE_POSTGRES_HOST}/postgres?connect_timeout=5&target_session_attrs=read-write"
MONGO_URI="mongodb://${DATABASE_MONGODB_ROOT_USER}:${DATABASE_MONGODB_ROOT_PASSWD}@${DATABASE_MONGODB_HOST}/admin?${DATABASE_MONGODB_PARAMS}"
for sql_file in $(ls ./init/postgresql/*.sql 2>/dev/null | sort); do
echo "▶ $(basename "$sql_file")"
docker run --rm -i --network iklimco-net \
-e PGPASSWORD="${DATABASE_POSTGRES_ROOT_PASSWD}" \
postgis/postgis:18-3.6 \
psql -h postgresql -U "${DATABASE_POSTGRES_ROOT_USER}" < "$sql_file"
psql "$PG_URI" < "$sql_file"
done
echo "⏳ Waiting for MongoDB..."
until docker run --rm --network iklimco-net mongo:8.3.2 \
mongosh "mongodb://${DATABASE_MONGODB_ROOT_USER}:${DATABASE_MONGODB_ROOT_PASSWD}@mongodb/admin" \
--eval "db.runCommand({ping:1})" --quiet 2>/dev/null; do
sleep 5
done
for js_file in $(ls ./init/mongodb/*.js 2>/dev/null | sort); do
echo "▶ $(basename "$js_file")"
docker run --rm -i --network iklimco-net mongo:8.3.2 \
mongosh "mongodb://${DATABASE_MONGODB_ROOT_USER}:${DATABASE_MONGODB_ROOT_PASSWD}@mongodb/admin" \
--quiet < "$js_file"
docker run --rm -i --network iklimco-net "${IMAGE_MONGODB}" \
sh -c 'cat > /tmp/init.js && mongosh "$MONGO_INIT_URI" --quiet --file /tmp/init.js' \
< "$js_file"
done
echo "✅ Database init scripts completed"
working-directory: /workspace/iklim.co
```
- `postgresql` and `mongodb`: Swarm VIP service names, resolved on the `iklimco-net` overlay; Patroni primary automatic routing happens at VIP level
- `DATABASE_POSTGRES_HOST`: multi-host Patroni target; the workflow uses `target_session_attrs=read-write` to reach the primary
- `DATABASE_MONGODB_HOST`: MongoDB replica set host list
- SQL files `./init/postgresql/*.sql` and JS files `./init/mongodb/*.js` are created in the `Prepare Init Files` step by the `init_postgresql`/`init_mongodb` functions in `common-functions-prod.sh`
- Idempotent: `CREATE IF NOT EXISTS` / `createCollection` semantics; runs safely again on later deploys
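The multi-host PostgreSQL target relies on libpq behavior: the client tries the listed hosts in order and, with `target_session_attrs=read-write`, only accepts a server that allows writes, which in a Patroni cluster is the primary. The sketch below assembles the same URI shape as the step above; `build_pg_uri` is a hypothetical helper, not a function from the repository.

```bash
# Hypothetical helper mirroring how the step above assembles PG_URI.
# Arguments: USER, comma-separated HOST:PORT list.
build_pg_uri() {
  local user="$1" hosts="$2"
  # The password is deliberately left out; the step passes it through PGPASSWORD.
  printf 'postgresql://%s@%s/postgres?connect_timeout=5&target_session_attrs=read-write\n' \
    "$user" "$hosts"
}
```

With such a URI, `psql "$PG_URI" -Atc 'SELECT pg_is_in_recovery()'` should print `f`, because only a server that is not in recovery satisfies `read-write`.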
@ -331,27 +303,19 @@ In prod, all 3 app nodes are manager + app worker, so services can be distribute
### Microservices
Each microservice has two stack files:
Prod microservice workflows do not rebuild application images. They read `deploy/prod.env`, promote the tested Harbor digest to a stable prod tag, and call `swarm_service_update` with `deploy/docker-stack-service.yml`.
| File | Content | Environment |
| --- | --- | --- |
| `BE-<Service>/docker-stack-service.yml` | Base definitions, `replicas: 1` | Test + Prod |
| `BE-<Service>/docker-stack-service.prod.yml` | `replicas: 3`, `max_replicas_per_node: 1` | Prod only |
Prod deploy command:
For first deploy, `swarm_service_update` exports `SERVICE_IMAGE` and runs:
```bash
docker stack deploy \
-c BE-<Service>/docker-stack-service.yml \
-c BE-<Service>/docker-stack-service.prod.yml \
iklimco
docker stack deploy --with-registry-auth -c deploy/docker-stack-service.yml iklimco
```
`max_replicas_per_node: 1` is mandatory; without it, when the Swarm node count is lower than the replica count, Swarm places more than one replica on the same node.
For existing services it performs `docker service update` with `--update-order start-first` and `--update-failure-action rollback`.
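
The internals of `swarm_service_update` are not shown in this document. The following dry-run sketch only illustrates the two paths described above; the function name `plan_service_rollout`, its arguments, and the way service existence is passed in are all assumptions. It prints the command instead of running it.

```bash
# Hypothetical dry-run helper: prints the rollout command for a service.
#   $1 = Swarm service name, $2 = image reference, $3 = "yes" if the service exists
plan_service_rollout() {
  service="$1"
  image="$2"
  exists="$3"
  if [ "$exists" = "yes" ]; then
    # Existing service: start the new task first, roll back on failure.
    echo "docker service update --with-registry-auth --update-order start-first --update-failure-action rollback --image $image $service"
  else
    # First deploy: the stack file reads the image from SERVICE_IMAGE.
    echo "SERVICE_IMAGE=$image docker stack deploy --with-registry-auth -c deploy/docker-stack-service.yml iklimco"
  fi
}
```

In a real script, the third argument could come from `docker service inspect "$service" >/dev/null 2>&1 && echo yes || echo no`.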
### Infra Services
`docker-stack-infra.yml` (base) and `docker-stack-infra.prod.yml` (overlay) are deployed together. The overlay overrides services such as Vault, APISIX, RabbitMQ, and Redis Sentinel with `replicas: 3` and `max_replicas_per_node: 1`. Detail: `Environment_Infrastructure/roadmap/prod-env/03-infra-stack-changes.md`.
The current prod infra stack is `docker-stack-infra_db-prod.yml`. Vault is not inside this stack; it is deployed separately by `vault-bootstrap.sh` using `docker-stack-vault.yml`.
#### cert-reloader and Vault Auto-Unseal
@ -360,53 +324,28 @@ The `cert-reloader` sidecar service runs as `replicas: 1` inside the infra stack
Certificate renewal flow:
```
SWAG renews the certificate -> writes it to SWAG_CONFIG_DIR (/mnt/storagebox/swag/config)
SWAG renews the certificate -> stores it inside the SWAG named volume
cert-reloader detects the MD5 change
-> copies it to /mnt/storagebox/ssl/ directory (common mount on all app nodes)
-> copies it to /mnt/storagebox/ssl/ directory (StorageBox)
cert-distributor syncs it to /opt/iklimco/ssl on service nodes
-> runs docker service update --force iklimco_vault
Vault (3 replicas) restarts
-> each instance reads the new certificate from the /mnt/storagebox/ssl/ mount
-> healthcheck checks sealed status every 30 seconds
-> if sealed: reads from the vault_unseal_key Docker secret and automatically unseals
-> each instance reads the new certificate from /opt/iklimco/ssl
-> entrypoint retry-unseal loop reads from the vault_unseal_key Docker secret and unseals
```
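
The MD5 comparison in the flow above can be sketched as follows. This is not the actual `cert-reloader` script; the function name `cert_changed` and the example paths are assumptions, and `md5sum` from GNU coreutils is assumed to be available in the container.

```bash
# Hypothetical check: succeeds (exit 0) when the source certificate differs
# from the copy already distributed, or when no copy exists yet.
cert_changed() {
  src="$1"   # certificate as renewed by SWAG
  dst="$2"   # previously copied certificate
  [ -f "$dst" ] || return 0
  src_sum=$(md5sum "$src" | cut -d' ' -f1)
  dst_sum=$(md5sum "$dst" | cut -d' ' -f1)
  [ "$src_sum" != "$dst_sum" ]
}
```

A caller would copy the file and trigger the Vault restart only when the check succeeds, for example `if cert_changed "$new_cert" /mnt/storagebox/ssl/STAR.iklim.co.full.crt; then ...; fi`.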
The auto-unseal mechanism is provided by the Vault healthcheck inside `docker-stack-infra.yml`:
```yaml
healthcheck:
test:
- "CMD"
- "sh"
- "-c"
- >-
vault status -format=json 2>/dev/null | grep -q '"sealed":false' ||
vault operator unseal $$(cat /run/secrets/vault_unseal_key 2>/dev/null)
interval: 30s
timeout: 10s
start_period: 15s
retries: 5
```
The 3 replicas run their own healthchecks independently; all of them unseal separately. The certificate renewal -> restart -> auto-unseal chain requires no manual intervention. Detail: `roadmap/prod-env/06-cert-reloader.md`.
The 3 Vault replicas run their own retry-unseal loop independently. The certificate renewal -> distribution -> restart -> unseal chain requires no manual intervention after bootstrap.
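
A retry-unseal loop of the kind mentioned above might look like the sketch below. It is an assumption-laden illustration, not the entrypoint shipped with the stack: the function name `retry_unseal`, the `UNSEAL_KEY_FILE` variable, and the attempt and delay defaults are hypothetical. It relies on `vault status` exiting with 0 when unsealed and 2 when sealed, and on Docker mounting secrets under `/run/secrets/`.

```bash
# Hypothetical retry loop: polls Vault and submits the unseal key while sealed.
#   $1 = maximum attempts (default 30), $2 = delay in seconds (default 5)
retry_unseal() {
  attempts="${1:-30}"
  delay="${2:-5}"
  key_file="${UNSEAL_KEY_FILE:-/run/secrets/vault_unseal_key}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    rc=0
    vault status >/dev/null 2>&1 || rc=$?
    case "$rc" in
      0) return 0 ;;  # already unsealed
      2) vault operator unseal "$(cat "$key_file")" >/dev/null 2>&1 || true ;;
    esac              # any other code: Vault not reachable yet, retry
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

Because each replica runs its own loop against its local Vault process, no coordination between the three containers is needed.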
#### Vault Raft Configuration
Vault is defined as 3 replicas with Raft storage in the `docker-stack-infra.prod.yml` overlay:
Vault is defined as 3 replicas with Raft storage in `docker-stack-vault.yml`:
```yaml
vault:
environment:
VAULT_LOCAL_CONFIG: >-
{"api_addr":"https://vault.iklim.co:8200",
"cluster_addr":"https://{{ .Node.Hostname }}:8201",
"storage":{"raft":{"path":"/vault/file","node_id":"{{ .Node.Hostname }}"}},
"listener":[{"tcp":{"address":"0.0.0.0:8200",
"tls_cert_file":"/vault/certs/STAR.iklim.co.full.crt",
"tls_key_file":"/vault/certs/STAR.iklim.co_key.pem"}}],
"default_lease_ttl":"168h","max_lease_ttl":"720h","ui":true}
volumes:
- /opt/iklimco/vault/data:/vault/file # separate host path on each node — created with Ansible
- ${SWAG_CERT_DIR}:/vault/certs:ro # StorageBox shared — all nodes see the same path
- vault-data-vl:/vault/file
- vault-logs-vl:/vault/logs
- /opt/iklimco/ssl:/vault/certs:ro
deploy:
mode: replicated
replicas: 3
@ -416,59 +355,37 @@ vault:
- node.labels.type == service
```
`{{ .Node.Hostname }}` is a Docker Swarm Go template; it gives each Vault instance a unique `node_id` and `cluster_addr`. Because `/opt/iklimco/vault/data` is a host path volume, it is not an overlay volume; it must be created separately on each app node during Ansible bootstrap. See `07-prod-ansible-bootstrap.md` — Node Directory Role. Detail: `roadmap/prod-env/07-vault-raft-plan.md`.
The Vault stack uses `vault-template-v2.json`, `vault_unseal_key`, and the `iklimco-net` external network. Bootstrap and unseal are handled by `init/vault/vault-bootstrap.sh`.
## Vault Raft Cluster Initial Setup
After the infra stack is deployed for the first time, the Vault Raft cluster is initialized manually once. These steps are not repeated on every deploy; they are applied only during initial setup.
Vault Raft cluster setup is no longer a manual post-deploy procedure. It is handled by `init/vault/vault-bootstrap.sh`, called through `init-infra-prod.sh` by the root prod workflow.
### Step 1 — Stack Deploy
```bash
docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco
```
The bootstrap script deploys:
3 Vault containers start. The first initialized node becomes the leader.
```bash
docker stack deploy --with-registry-auth -c docker-stack-vault.yml iklimco
```
### Step 2 — Vault Initialize (iklim-app-01)
```bash
VAULT_CTR=$(docker ps -q -f name=iklimco_vault)
docker exec -it "$VAULT_CTR" vault operator init
```
Store the unseal keys and root token from the output securely. Save the unseal key as a Docker secret:
The script runs `vault operator init -key-shares=1 -key-threshold=1` if Vault is not yet initialized. During the run it writes the bootstrap output to `/tmp/vault-bootstrap/main-vault-init.txt`.
```bash
echo -n "<unseal-key>" | docker secret create vault_unseal_key -
echo "bootstrap" | docker secret create vault_unseal_key -
```
> After this step, the `vault_unseal_key` secret exists. During later certificate renewals, cert-reloader restarts Vault; the healthcheck reads this secret and automatically unseals, so no manual intervention is required.
Then it rotates `vault_unseal_key` to the real unseal key and unseals the leader and peers.
### Step 3 — Unseal the Leader
```bash
docker exec -it "$VAULT_CTR" vault operator unseal
```
No manual unseal command is required in the normal path.
### Step 4 — Join the Other Nodes to the Raft Cluster
The Vault containers on `iklim-app-02` and `iklim-app-03` join the cluster:
```bash
docker exec -it <vault-on-iklim-app-02> vault operator raft join \
https://vault.iklim.co:8200
docker exec -it <vault-on-iklim-app-03> vault operator raft join \
https://vault.iklim.co:8200
```
Each node is also unsealed after it joins:
```bash
docker exec -it <vault-on-iklim-app-02> vault operator unseal
docker exec -it <vault-on-iklim-app-03> vault operator unseal
```
Peer join and peer unseal are handled by `vault-bootstrap.sh`.
### Step 5 — Verify the Cluster
@ -646,20 +563,20 @@ Expected: valid JSON weather response.
- `rabbitmq_erlang_cookie` appears in `docker secret ls`.
- The `ssl`, `swag/config`, `swag/site-confs`, `grafana/data`, and `precipitation/images` directories exist on StorageBox; see `07-prod-ansible-bootstrap.md` — StorageBox Directory Structure.
- The `template/swag/site-confs/default.conf`, `api.conf.tpl`, `apigw.conf.tpl`, `rabbitmq.conf.tpl`, and `grafana.conf.tpl` template files exist in the repo.
- StorageBox `prod/secrets/iklim.co/.env.prod` has correct values for `API_SUBDOMAIN`, `APIGW_SUBDOMAIN`, `RABBITMQ_SUBDOMAIN`, `GRAFANA_SUBDOMAIN`, `RESTRICTED_IPS`, `SWAG_CERT_DIR`, `SWAG_CONFIG_DIR`, and `SWAG_SITE_CONFS_DIR`.
- StorageBox `prod/secrets/iklim.co/.env` has correct values for `API_SUBDOMAIN`, `APIGW_SUBDOMAIN`, `RABBITMQ_SUBDOMAIN`, `GRAFANA_SUBDOMAIN`, `RESTRICTED_IPS`, `SWAG_CERT_DIR`, `SWAG_DNS_CONFIG_DIR`, `SWAG_SITE_CONFS_DIR`, and `SWAG_PROXY_CONFS_DIR`.
- After the first deploy, `docker exec $(docker ps -q -f name=iklimco_swag) nginx -t` succeeds and returns `syntax is ok`.
- The output of `cat /mnt/storagebox/swag/site-confs/api.conf | grep server_name` contains `server_name api.iklim.co;`.
- The `ssls/1` PUT block does not exist inside `init/apisix-core/init.sh`.
- The `registry.tarla.io/iklimco/custom-apisix:3.12.0` image exists in Harbor and its `config.yaml` contains `real_ip_header`, `real_ip_recursive`, and `set_real_ip_from` (covering `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`) configuration.
- After the first deploy, real client IP appears in APISIX access logs, not the SWAG overlay IP: `docker exec $(docker ps -q -f name=iklimco_apisix | head -1) tail -5 /usr/local/apisix/logs/access.log`
- `docker service ps iklimco_cert-reloader` shows that the service is running.
- `docker service ls` does not contain `iklimco_etcd`, `iklimco_postgresql`, `iklimco_mongodb`, `iklimco_pg-proxy`, or `iklimco_mongo-proxy`; they are removed by the post-deploy step in `deploy-prod.yml` (base stack services superseded by the `iklim-db` stack or deprecated in prod).
- `docker service ls` contains the current prod infra services from `docker-stack-infra_db-prod.yml` and the separate `iklimco_vault` service from `docker-stack-vault.yml`; deprecated base-stack services such as `iklimco_postgresql`, `iklimco_mongodb`, `iklimco_pg-proxy`, and `iklimco_mongo-proxy` are not present.
- The output of `docker service logs iklimco_cert-reloader --tail 20` contains `[cert-reloader] started` and has no error lines.
- The `notAfter` date of the Vault TLS endpoint certificate matches `/mnt/storagebox/ssl/STAR.iklim.co.full.crt`: `docker exec $(docker ps -q -f name=iklimco_vault | head -1) sh -c 'echo | openssl s_client -connect vault.iklim.co:8200 2>/dev/null | openssl x509 -noout -dates'`
- `vault operator raft list-peers` returns 3 peers: 1 leader, 2 followers.
- The `vault_unseal_key` Docker secret exists and appears in `docker secret ls`.
- All 3 Vault containers are unsealed: `docker exec $(docker ps -q -f name=iklimco_vault | head -1) vault status | grep Sealed` -> `Sealed false`.
- The first deploy pipeline successfully completes all 21 steps; the `Review Environment` step succeeds.
- The first deploy pipeline successfully completes all current root workflow steps; the `Review Environment` step succeeds.
- After the `Bootstrap SWAG Certificate` step, `ls /mnt/storagebox/ssl/` -> `STAR.iklim.co.full.crt` and `STAR.iklim.co_key.pem` exist.
- The `Run Database Init Scripts` step completes without error; PostgreSQL and MongoDB are healthy and init scripts are applied.
- In the output of `docker service ls --filter label=project=co.iklim`, all infra services show `X/X`.
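
The `X/X` replica check in the last item can be automated with a small filter. The sketch below is an illustration under assumptions: the function name `unconverged_services` is hypothetical, and it expects `name replicas` pairs on standard input as produced by `docker service ls --format '{{.Name}} {{.Replicas}}'`.

```bash
# Hypothetical filter: prints the names of services whose running replica
# count differs from the desired count (for example 2/3).
unconverged_services() {
  awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}
```

Usage: `docker service ls --filter label=project=co.iklim --format '{{.Name}} {{.Replicas}}' | unconverged_services`. Empty output means every listed service has converged.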