docs(infra): restructure and update infrastructure setup documentation
- Anglicized setup and facts markdown file names for better consistency.
- Updated 01-swarm-init-multinode.md to highlight Ansible automation of Swarm initialization and labeling.
- Overhauled 03-infra-stack-changes.md to describe the single monolithic file strategy and reflect current Redis, RabbitMQ, and etcd cluster configurations.
- Fixed minor overrides and typos in Patroni templates and Ansible bootstrap documents.
- Restructured README and roadmap mapping to align with the renamed setup documents.
parent
1fd752526b
commit
67dc2986dd
139
README.md
@ -1,64 +1,111 @@
|
|||||||
# 🌍 iklim.co Altyapı ve Sunucu Yönetimi
|
# iklim.co Altyapı ve Sunucu Yönetimi
|
||||||
|
|
||||||
This repository holds the Infrastructure-as-Code (IaC) assets, technical guides, and operational standards needed to build, manage, and modernize the **test** and **production** environments of the `iklim.co` project.
|
This repository contains the Infrastructure-as-Code assets, setup runbooks, operational facts documents, and planning notes used to provision, configure, operate, and modernize the `iklim.co` test and production environments.
|
||||||
|
|
||||||
Infrastructure management covers resource provisioning with Terraform on Hetzner Cloud, operating-system configuration with Ansible, and building the microservice architecture on Docker Swarm.
|
Infrastructure management covers resource provisioning with Terraform on Hetzner Cloud, operating-system configuration and Swarm bootstrap with Ansible, and deployment of infrastructure and application services on Docker Swarm.
|
||||||
|
|
||||||
---
|
## Depo Yapısı
|
||||||
|
|
||||||
## 📂 Depo Yapısı ve Temel Bölümler
|
### Terraform (`terraform/`)
|
||||||
|
|
||||||
Bu depodaki dökümantasyon ve kod varlıkları beş ana kategoriye ayrılmıştır:
|
Terraform, uzak test ve production ortamları için Hetzner Cloud kaynaklarını tanımlar:
|
||||||
|
|
||||||
### 1. 🛣️ Roadmap (`roadmap/`)
|
- `terraform/hetzner/test`: test sunucuları, network, firewall, Floating IP, placement ve outputs.
|
||||||
Ortamların (test ve prod) sıfırdan kurulması veya mevcut yapının güncellenmesi için gerekli olan **iş gereksinimlerini, teknik hedefleri ve adım adım uygulama planlarını** içerir.
|
- `terraform/hetzner/prod`: production app/service node'ları, DB node'ları, private networking, firewall'lar, placement group'lar, Floating IP ve outputs.
|
||||||
- Altyapıda yapılacak büyük değişikliklerin (örn: Redis Sentinel geçişi, APISIX konfigürasyonu, RabbitMQ Quorum Queues) stratejik dökümantasyonudur.
|
|
||||||
- [roadmap/test-env/](./roadmap/test-env/) - Test ortamı gereksinimleri ve planları.
|
|
||||||
- [roadmap/prod-env/](./roadmap/prod-env/) - Üretim ortamı HA (High Availability) ve güvenilirlik planları.
|
|
||||||
|
|
||||||
### 2. 🛠️ Setup (`setup/`)
|
Dev ortamı lokal ve Docker Compose tabanlıdır; bu Terraform stack'leri tarafından provision edilmez.
|
||||||
Altyapının fiziksel olarak ayağa kaldırılması için kullanılan **uygulama dökümanlarıdır**. Bu bölüm şunları yönetmek için kullanılır:
|
|
||||||
- **Terraform:** Bulut kaynaklarının (Server, Network, Firewall) üretilmesi.
|
|
||||||
- **Ansible:** İşletim sistemi hazırlığı, güvenlik sertleştirme (hardening), Docker/Swarm kurulumu.
|
|
||||||
- **CI/CD:** Deployment workflow'larının (Gitea Actions) ve stack manifest'lerinin oluşturulması/güncellenmesi.
|
|
||||||
- Örn: [setup/06-prod-terraform-iaac.md](./setup/06-prod-terraform-iaac.md), [setup/07-prod-ansible-bootstrap.md](./setup/07-prod-ansible-bootstrap.md)
|
|
||||||
|
|
||||||
### 3. 🗺️ Setup vs Roadmap Matrisi (`setup-vs-roadmap-map.md`)
|
### Ansible (`ansible/`)
|
||||||
Explains the relationship between the **Roadmap** documents, which are written from the requirements, and the **Setup** documents that implement those requirements technically.
|
|
||||||
- Hangi roadmap adımının hangi setup dökümanı ile uygulandığını gösteren bir eşleşme matrisidir.
|
|
||||||
- [setup-vs-roadmap-map.md](./setup-vs-roadmap-map.md) dökümanından detaylara ulaşılabilir.
|
|
||||||
|
|
||||||
### 4. 📊 Hetzner Sizing Report (`hetzner-sizing-report.md`)
|
Ansible, Terraform provisioning sonrası uzak host'ları hazırlar:
|
||||||
İklim altyapı servisleri (API Gateway, Microservices, Databases, Broker) için seçilen **Hetzner sunucu tiplerini, CPU/RAM kapasitelerini ve maliyet/performans analizlerini** anlatır.
|
|
||||||
- Ortam kurulumundan önce kapasite planlaması için temel referans noktasıdır.
|
|
||||||
- [hetzner-sizing-report.md](./hetzner-sizing-report.md) dökümanını inceleyin.
|
|
||||||
|
|
||||||
### 5. 💡 Facts (`facts/`)
|
- `ansible/test`: test bootstrap playbook'ları, inventory ve ortama özel değişkenler.
|
||||||
Ortam kurulumları tamamlandıktan sonra ortaya çıkan, **sistemin o anki gerçek durumunu (source of truth) ve bilinmesi gereken kritik teknik detayları** barındıran dökümanlardır.
|
- `ansible/prod`: production bootstrap playbook'ları, inventory, değişkenler ve prod'a özel rol override'ları.
|
||||||
- "Sistem şu an nasıl çalışıyor?" sorusunun cevabıdır.
|
- `ansible/roles`: `base`, `hardening`, `docker`, `swarm`, `node_dirs`, `storagebox`, `storagebox_ssh_key`, `act_runner` ve ortak `db_stack` gibi paylaşılan roller.
|
||||||
- [facts/firewall.md](./facts/firewall.md): Aktif firewall kuralları ve port matrisi.
|
|
||||||
- [facts/swarm-node-recovery-swag-failover.md](./facts/swarm-node-recovery-swag-failover.md): Node düşmesi durumunda manuel müdahale ve recovery prosedürleri.
|
|
||||||
|
|
||||||
---
|
Production sets `roles_path = roles:../roles` in `ansible/prod/ansible.cfg`. Ansible searches these directories in order, so a prod-local role such as `ansible/prod/roles/db_stack`, when present, takes precedence over the shared role of the same name in `ansible/roles`.
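
The lookup order can be illustrated with a small sketch. This is not Ansible's own resolver, only an imitation of its first-match behaviour; the function name `resolve_role` and its arguments are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: first-match role lookup over a colon-separated roles_path.
# Usage: resolve_role BASE_DIR ROLES_PATH ROLE_NAME
#   BASE_DIR   - directory that holds ansible.cfg (relative entries resolve from here)
#   ROLES_PATH - e.g. "roles:../roles"
resolve_role() {
  local base="$1" roles_path="$2" role="$3" dir
  local IFS=':'
  for dir in $roles_path; do
    # The first directory that contains the role wins.
    if [ -d "$base/$dir/$role" ]; then
      printf '%s\n' "$dir/$role"
      return 0
    fi
  done
  return 1
}
```

Called as `resolve_role ansible/prod "roles:../roles" db_stack`, the sketch prints `roles/db_stack` when the prod-local role exists and `../roles/db_stack` otherwise.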
|
||||||
|
|
||||||
## 🧱 Kurulum Akışı (Kanonik Sıra)
|
### Setup Runbook'ları (`setup/`)
|
||||||
|
|
||||||
Bir ortamı sıfırdan kurarken veya majör bir güncelleme yaparken şu sırayı takip edin:
|
Setup dokümanları, ortamları ayağa kaldırmak veya büyük altyapı değişikliklerini uygulamak için kullanılan kanonik uygulama runbook'larıdır. Güncel dosyalar:
|
||||||
|
|
||||||
1. **Analiz:** [hetzner-sizing-report.md](./hetzner-sizing-report.md) ile kaynak ihtiyacını belirleyin.
|
- [setup/00-general-roadmap.md](./setup/00-general-roadmap.md)
|
||||||
2. **Planlama:** `roadmap/` altındaki ilgili ortam dökümanlarını inceleyerek yapılacak değişiklikleri anlayın.
|
- [setup/01-private-network-port-matrix.md](./setup/01-private-network-port-matrix.md)
|
||||||
3. **Hizalama:** [setup-vs-roadmap-map.md](./setup-vs-roadmap-map.md) ile hangi setup dökümanlarını kullanacağınızı netleştirin.
|
- [setup/02-test-terraform-iac.md](./setup/02-test-terraform-iac.md)
|
||||||
4. **Uygulama:** `setup/` dökümanlarını (00'dan 09'a kadar) sırasıyla takip ederek Terraform ve Ansible süreçlerini işletin.
|
- [setup/03-test-ansible-bootstrap.md](./setup/03-test-ansible-bootstrap.md)
|
||||||
5. **Doğrulama:** Kurulum sonrası sistemin çalışma prensipleri için `facts/` dökümanlarını referans alın.
|
- [setup/04-test-db-docker-setup.md](./setup/04-test-db-docker-setup.md)
|
||||||
|
- [setup/05-test-runner-and-deploy-prerequisites.md](./setup/05-test-runner-and-deploy-prerequisites.md)
|
||||||
|
- [setup/06-prod-terraform-iac.md](./setup/06-prod-terraform-iac.md)
|
||||||
|
- [setup/07-prod-ansible-bootstrap.md](./setup/07-prod-ansible-bootstrap.md)
|
||||||
|
- [setup/08-prod-db-cluster-setup.md](./setup/08-prod-db-cluster-setup.md)
|
||||||
|
- [setup/09-prod-runner-ha-and-swarm.md](./setup/09-prod-runner-ha-and-swarm.md)
|
||||||
|
|
||||||
---
|
Bu dokümanlar Terraform, Ansible, Swarm label'ları, StorageBox path'leri, runner ön koşulları, DB servisleri ve production Swarm deploy modelinin birlikte nasıl çalıştığını açıklar.
|
||||||
|
|
||||||
## ✅ Ön Koşullar ve Araçlar
|
### Roadmap (`roadmap/`)
|
||||||
|
|
||||||
- **Terraform >= 1.6**: Altyapı provisioning.
|
Roadmap dokümanları test ve production değişiklikleri için gereksinimleri, tasarım hedeflerini ve migration planlarını açıklar:
|
||||||
- **Ansible**: Konfigürasyon yönetimi.
|
|
||||||
- **Hetzner Cloud API Token**: Ortam bazlı yetkilendirme.
|
|
||||||
- **SSH Key**: Sunucu erişimi için sisteme tanımlı anahtar çifti.
|
|
||||||
|
|
||||||
---
|
- [roadmap/test-env/](./roadmap/test-env/)
|
||||||
*iklim.co Infrastructure Team - 2026*
|
- [roadmap/prod-env/](./roadmap/prod-env/)
|
||||||
|
|
||||||
|
Roadmap dokümanlarını amaç ve tasarım bağlamı için kullanın. Güncel uygulama akışı için setup runbook'larını kullanın.
|
||||||
|
|
||||||
|
### Setup vs Roadmap Map
|
||||||
|
|
||||||
|
[setup-vs-roadmap-map.md](./setup-vs-roadmap-map.md), roadmap maddelerini bu maddeleri hayata geçiren setup dokümanları ve implementation alanları ile eşler.
|
||||||
|
|
||||||
|
### Facts (`facts/`)
|
||||||
|
|
||||||
|
Facts dokümanları güncel durum detaylarını ve operasyonel geçmişi korur:
|
||||||
|
|
||||||
|
- [facts/firewall.md](./facts/firewall.md): aktif firewall ve port bilgileri.
|
||||||
|
- [facts/node-recovery-failover.md](./facts/node-recovery-failover.md): node recovery ve failover prosedürleri.
|
||||||
|
- [facts/prod-kurulum-gecmisi.md](./facts/prod-kurulum-gecmisi.md): production kurulum geçmişi ve güncel production notları.
|
||||||
|
|
||||||
|
Facts dokümanlarını “sistem şu an nasıl çalışıyor?” sorusu, tarihsel bağlam ve setup sonrası doğrulama için kullanın.
|
||||||
|
|
||||||
|
### Hetzner Sizing Report
|
||||||
|
|
||||||
|
[hetzner-sizing-report.md](./hetzner-sizing-report.md), altyapı servisleri, veritabanları, broker'lar ve uygulama workload'ları için sunucu sizing, CPU/RAM seçimleri ve maliyet/performans değerlendirmelerini açıklar.
|
||||||
|
|
||||||
|
### Confluence Export (`confluence-wiki/`)
|
||||||
|
|
||||||
|
`confluence-wiki/`, altyapı notlarının repository dışına yayınlanması veya mirror edilmesi gerektiğinde kullanılan wiki odaklı/export edilmiş dokümantasyon materyallerini içerir.
|
||||||
|
|
||||||
|
## Güncel Production Modeli
|
||||||
|
|
||||||
|
Production currently uses a split infrastructure model:
|
||||||
|
|
||||||
|
- Ana infra ve DB stack: root `docker-stack-infra_db-prod.yml`.
|
||||||
|
- Vault stack: root `docker-stack-vault.yml`.
|
||||||
|
- Vault bootstrap: root `init/vault/vault-bootstrap.sh`; production deploy akışında `init-infra-prod.sh` üzerinden çağrılır.
|
||||||
|
- Production pipeline source of truth: root `.gitea/workflows/deploy-prod.yml` ve root `prod_env-ci_dc-pipeline.md`.
|
||||||
|
|
||||||
|
`docker-stack-infra_db-prod.yml` is deliberately a mixed stack:
|
||||||
|
|
||||||
|
- DB/cluster services such as Patroni/PostgreSQL, MongoDB, and etcd run on the `iklim-db-*` nodes and use host-mode cluster ports where needed.
|
||||||
|
- Service-node infrastructure services such as Redis, Redis Sentinel, and RabbitMQ run on app/service nodes matching `node.labels.type == service` and stay on the Docker overlay network unless the stack or the reverse proxy exposes them explicitly.
|
||||||
|
|
||||||
|
## Kanonik Kurulum Akışı
|
||||||
|
|
||||||
|
Yeni bir ortam veya büyük bir altyapı güncellemesi için:
|
||||||
|
|
||||||
|
1. [hetzner-sizing-report.md](./hetzner-sizing-report.md) dosyasını inceleyin.
|
||||||
|
2. Tasarım amacını anlamak için ilgili `roadmap/` dokümanlarını inceleyin.
|
||||||
|
3. Her roadmap maddesinin hangi setup runbook'u ile uygulandığını görmek için [setup-vs-roadmap-map.md](./setup-vs-roadmap-map.md) dosyasını kontrol edin.
|
||||||
|
4. Hedef ortam için numaralı `setup/` runbook'larını sırayla takip edin.
|
||||||
|
5. Güncel davranışı, recovery prosedürlerini, firewall durumunu ve production geçmişini doğrulamak için `facts/` dokümanlarını kullanın.
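
Step 4 relies on the two-digit prefixes of the `setup/` files. As a minimal sketch, assuming the naming shown in this README (`NN-test-*`, `NN-prod-*`, and shared files without an environment tag), the runbooks for one environment could be listed in order like this; `list_runbooks` is a hypothetical helper, not part of the repository.

```bash
#!/usr/bin/env bash
# Sketch only: print the numbered runbooks that apply to one target environment.
# Usage: list_runbooks SETUP_DIR TARGET   (TARGET is "test" or "prod")
list_runbooks() {
  local dir="$1" target="$2" f name
  # Shell globs expand in sorted order, so two-digit prefixes keep the sequence.
  for f in "$dir"/[0-9][0-9]-*.md; do
    [ -e "$f" ] || continue            # the glob matched nothing
    name=$(basename "$f")
    case "$name" in
      [0-9][0-9]-test-*) [ "$target" = test ] || continue ;;
      [0-9][0-9]-prod-*) [ "$target" = prod ] || continue ;;
    esac
    printf '%s\n' "$name"
  done
}
```

For example, `list_runbooks setup prod` would print the shared runbooks followed by the `prod` ones.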
|
||||||
|
|
||||||
|
## Gerekli Araçlar
|
||||||
|
|
||||||
|
- Terraform `>= 1.6`
|
||||||
|
- Ansible
|
||||||
|
- Hedef ortam için Hetzner Cloud API token
|
||||||
|
- Sunucu erişimi için yetkili SSH key pair
|
||||||
|
|
||||||
|
## Notlar
|
||||||
|
|
||||||
|
- The dev environment is local and Docker Compose based; the remote Terraform/Ansible automation targets the test and production environments.
|
||||||
|
- Test is a smaller remote environment built on single-node DB/App assumptions.
|
||||||
|
- Production is a high-availability remote environment with three app/service nodes and three DB nodes.
|
||||||
|
|||||||
@ -15,7 +15,7 @@ etcd3:
|
|||||||
- etcd-02:2379
|
- etcd-02:2379
|
||||||
- etcd-03:2379
|
- etcd-03:2379
|
||||||
username: root
|
username: root
|
||||||
password: "{{ vault_etcd_root_password }}"
|
password: "${ETCD_ROOT_PASSWORD}"
|
||||||
|
|
||||||
bootstrap:
|
bootstrap:
|
||||||
dcs:
|
dcs:
|
||||||
|
|||||||
@ -1,4 +1,4 @@
|
|||||||
# Docker Swarm — Node Recovery
|
# Test — Docker Swarm Node Recovery
|
||||||
|
|
||||||
Test ortamında tek manager (`iklim-app-01`) ve tek worker (`iklim-db-01`) bulunur. Hangi node'un yeniden kurulduğuna göre recovery süreci farklılaşır.
|
Test ortamında tek manager (`iklim-app-01`) ve tek worker (`iklim-db-01`) bulunur. Hangi node'un yeniden kurulduğuna göre recovery süreci farklılaşır.
|
||||||
|
|
||||||
@ -32,17 +32,19 @@ DB verileri `iklim-db-01`'deki named volume'larda korunur, kayıp yaşanmaz.
|
|||||||
|
|
||||||
Yeni `iklim-db-01` Swarm'dan habersiz başlar (`inactive`). Manager (`iklim-app-01`) eski dead node kaydını tutar.
|
Yeni `iklim-db-01` Swarm'dan habersiz başlar (`inactive`). Manager (`iklim-app-01`) eski dead node kaydını tutar.
|
||||||
|
|
||||||
|
> ⚠️ **Data loss:** Because `iklim-db-01` is rebuilt, all of its named volumes are gone. Restoring from backup before step 3 is mandatory.
|
||||||
|
|
||||||
### Çözüm
|
### Çözüm
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# 1. Ansible bootstrap — yeni node otomatik join olur
|
# 1. On iklim-app-01: remove the stale dead-node record (must be done BEFORE bootstrap)
|
||||||
cd ansible/test
|
|
||||||
ansible-playbook -i inventory/generated/test.yml test-bootstrap.yml --ask-vault-pass
|
|
||||||
|
|
||||||
# 2. iklim-app-01 üzerinde — eski dead node kaydını temizle
|
|
||||||
docker node ls # eski node ID'yi bul
|
docker node ls # eski node ID'yi bul
|
||||||
docker node rm <eski-node-id>
|
docker node rm <eski-node-id>
|
||||||
|
|
||||||
|
# 2. Ansible bootstrap: the new node joins automatically
|
||||||
|
cd ansible/test
|
||||||
|
ansible-playbook -i inventory/generated/test.yml test-bootstrap.yml --ask-vault-pass
|
||||||
|
|
||||||
# 3. DB stack'i yeniden deploy et (backup'tan restore sonrası)
|
# 3. DB stack'i yeniden deploy et (backup'tan restore sonrası)
|
||||||
ansible-playbook -i inventory/generated/test.yml test-db-post-stack.yml --ask-vault-pass
|
ansible-playbook -i inventory/generated/test.yml test-db-post-stack.yml --ask-vault-pass
|
||||||
```
|
```
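
Step 1 needs the ID of the dead node record. A minimal sketch for picking it out of the node list follows; it assumes the input is produced with `docker node ls --format '{{.ID}} {{.Hostname}} {{.Status}}'`, and `down_node_ids` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch only: read "ID HOSTNAME STATUS" lines on stdin and print the IDs
# of records for the given hostname whose status is Down.
# Usage: docker node ls --format '{{.ID}} {{.Hostname}} {{.Status}}' | down_node_ids HOSTNAME
down_node_ids() {
  awk -v host="$1" '$2 == host && $3 == "Down" { print $1 }'
}
```

After a rebuild the same hostname can appear twice (the stale `Down` record and the new `Ready` one), which is why the filter checks both hostname and status before anything is passed to `docker node rm`.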
|
||||||
@ -68,7 +70,7 @@ ansible-playbook -i inventory/generated/test.yml test-db-post-stack.yml --ask-va
|
|||||||
| Senaryo | Manuel Adım | Ansible Yeterli mi? |
|
| Senaryo | Manuel Adım | Ansible Yeterli mi? |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| Manager (`iklim-app-01`) ölür | `docker swarm leave --force` (worker'da) | Sonrasında evet |
|
| Manager (`iklim-app-01`) ölür | `docker swarm leave --force` (worker'da) | Sonrasında evet |
|
||||||
| Worker (`iklim-db-01`) ölür | `docker node rm <id>` (manager'da) | Büyük ölçüde evet |
|
| Worker (`iklim-db-01`) ölür | `docker node rm <id>` (manager'da, bootstrap'tan önce) | Hayır — backup restore gerekir |
|
||||||
| Her ikisi ölür | Yok | Evet |
|
| Her ikisi ölür | Yok | Evet |
|
||||||
|
|
||||||
## Neden Prod'da Bu Sorun Yok
|
## Neden Prod'da Bu Sorun Yok
|
||||||
@ -81,6 +83,8 @@ Prod ortamında birden fazla manager node (en az 3) çalıştırılır. Tek mana
|
|||||||
|
|
||||||
SWAG, cert-reloader, Prometheus ve Grafana cluster-native (replicated) değildir; her zaman tek instance çalışırlar ve varsayılan olarak `iklim-app-01`'e (Floating IP node) sabitlenmişlerdir. `iklim-app-01` çöktüğünde bu servisler durur; DNS/HTTPS erişimi ve izleme (monitoring) kesilir. Swarm quorum 2 manager ile devam eder; mikroservisler ve Vault başka node'lara taşınır.
|
SWAG, cert-reloader, Prometheus ve Grafana cluster-native (replicated) değildir; her zaman tek instance çalışırlar ve varsayılan olarak `iklim-app-01`'e (Floating IP node) sabitlenmişlerdir. `iklim-app-01` çöktüğünde bu servisler durur; DNS/HTTPS erişimi ve izleme (monitoring) kesilir. Swarm quorum 2 manager ile devam eder; mikroservisler ve Vault başka node'lara taşınır.
|
||||||
|
|
||||||
|
`cert-distributor` is the exception to this rule: with `mode: global` it runs on every node where `node.labels.type == service`, and it copies the certificate from StorageBox to the node-local `/opt/iklimco/ssl` (because of the Vault FUSE mount restriction). When `iklim-app-01` goes down, the `cert-distributor` instances on the other nodes keep running, so no failover is needed.
|
||||||
|
|
||||||
Tüm bu servislerin verileri ve konfigürasyonları StorageBox'ta tutulur:
|
Tüm bu servislerin verileri ve konfigürasyonları StorageBox'ta tutulur:
|
||||||
- **SWAG:** `/mnt/storagebox/swag/config`
|
- **SWAG:** `/mnt/storagebox/swag/config`
|
||||||
- **SSL:** `/mnt/storagebox/ssl`
|
- **SSL:** `/mnt/storagebox/ssl`
|
||||||
@ -91,12 +95,12 @@ Tüm bu servislerin verileri ve konfigürasyonları StorageBox'ta tutulur:
|
|||||||
|
|
||||||
### 1. Servisleri Başka Node'a Taşı
|
### 1. Servisleri Başka Node'a Taşı
|
||||||
|
|
||||||
SWAG ve cert-reloader birlikte taşınmalıdır. Prometheus ve Grafana da bağımsız olarak veya aynı anda taşınabilir.
|
SWAG and cert-reloader must be moved together. Prometheus and Grafana can be moved independently or at the same time. `cert-distributor` runs in global mode, so it does not need to be moved.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# iklim-app-02 veya iklim-app-03 üzerinde (aktif manager):
|
# iklim-app-02 veya iklim-app-03 üzerinde (aktif manager):
|
||||||
|
|
||||||
# SWAG & Cert-Reloader taşıma
|
# Move SWAG & cert-reloader (replicas=1, so expect a short outage during the move)
|
||||||
docker service update --constraint-add "node.hostname == iklim-app-02" --constraint-rm "node.hostname == iklim-app-01" iklimco_swag
|
docker service update --constraint-add "node.hostname == iklim-app-02" --constraint-rm "node.hostname == iklim-app-01" iklimco_swag
|
||||||
docker service update --constraint-add "node.hostname == iklim-app-02" --constraint-rm "node.hostname == iklim-app-01" iklimco_cert-reloader
|
docker service update --constraint-add "node.hostname == iklim-app-02" --constraint-rm "node.hostname == iklim-app-01" iklimco_cert-reloader
|
||||||
|
|
||||||
@ -121,8 +125,12 @@ hcloud floating-ip assign <floating-ip-id> <iklim-app-02-server-id>
|
|||||||
4. `iklim-prod-app-fip` satırının sağındaki **⋮** (üç nokta) menüsünü aç → **Reassign**.
|
4. `iklim-prod-app-fip` satırının sağındaki **⋮** (üç nokta) menüsünü aç → **Reassign**.
|
||||||
5. Açılan listeden **`iklim-app-02`**'yi seç → **Reassign** butonuna tıkla.
|
5. Açılan listeden **`iklim-app-02`**'yi seç → **Reassign** butonuna tıkla.
|
||||||
|
|
||||||
|
> **Note:** After the Floating IP is reassigned in the Hetzner panel, it must also be active on the network interface of `iklim-app-02`. If the Ansible bootstrap handles this configuration, it happens automatically; to be sure, run `ip addr show` and confirm that the Floating IP is bound.
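
The bind check can be scripted. The sketch below assumes the one-line output format of `ip -o -4 addr show`, where the fourth field is `ADDRESS/PREFIX`; `has_ip` is a hypothetical helper and `203.0.113.10` is a placeholder address, not the real Floating IP.

```bash
#!/usr/bin/env bash
# Sketch only: succeed if the given IPv4 address appears in
# `ip -o -4 addr show` output supplied on stdin.
# Usage: ip -o -4 addr show | has_ip 203.0.113.10
has_ip() {
  awk -v want="$1" '
    { split($4, a, "/"); if (a[1] == want) found = 1 }
    END { exit found ? 0 : 1 }
  '
}
```

The comparison is exact, so a shorter address that is a prefix of another one does not match by accident.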
|
||||||
|
|
||||||
### 3. Doğrula
|
### 3. Doğrula
|
||||||
|
|
||||||
|
SWAG startup and its certificate check can take a few seconds; the first `curl` may fail even though the service shows as `Running`. Wait a few seconds and try again.
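
The wait-and-retry advice can be wrapped in a small helper. This is a generic sketch, not part of the repository; `retry` is a hypothetical function name, and the attempt count and delay are arbitrary.

```bash
#!/usr/bin/env bash
# Sketch only: run a command until it succeeds or the attempts run out.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while :; do
    "$@" && return 0                       # success: stop retrying
    [ "$i" -ge "$attempts" ] && return 1   # out of attempts
    i=$((i + 1))
    sleep "$delay"
  done
}
```

For the health check in this section, one possible call is `retry 10 3 curl -fsS -o /dev/null https://api.iklim.co/health`, where `-f` makes `curl` return a non-zero status on HTTP errors so the loop keeps trying.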
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
docker service ls | grep -E 'swag|cert-reloader|prometheus|grafana'
|
docker service ls | grep -E 'swag|cert-reloader|prometheus|grafana'
|
||||||
curl -si https://api.iklim.co/health
|
curl -si https://api.iklim.co/health
|
||||||
@ -133,6 +141,9 @@ curl -si https://api.iklim.co/health
|
|||||||
Node Swarm'a yeniden katıldıktan sonra tüm servisleri tekrar `iklim-app-01`'e taşıyıp Floating IP'yi geri aktarabilirsiniz.
|
Node Swarm'a yeniden katıldıktan sonra tüm servisleri tekrar `iklim-app-01`'e taşıyıp Floating IP'yi geri aktarabilirsiniz.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
|
# First confirm the node has really rejoined the Swarm (STATUS must be Ready)
|
||||||
|
docker node ls
|
||||||
|
|
||||||
# Servisleri geri taşı
|
# Servisleri geri taşı
|
||||||
for svc in iklimco_swag iklimco_cert-reloader iklimco_prometheus iklimco_grafana; do
|
for svc in iklimco_swag iklimco_cert-reloader iklimco_prometheus iklimco_grafana; do
|
||||||
docker service update --constraint-add "node.hostname == iklim-app-01" --constraint-rm "node.hostname == iklim-app-02" $svc
|
docker service update --constraint-add "node.hostname == iklim-app-01" --constraint-rm "node.hostname == iklim-app-02" $svc
|
||||||
@ -149,5 +160,62 @@ hcloud floating-ip assign <floating-ip-id> <iklim-app-01-server-id>
|
|||||||
| Swarm quorum | Otomatik — 2 manager yeterli |
|
| Swarm quorum | Otomatik — 2 manager yeterli |
|
||||||
| Vault, mikroservisler | Otomatik — `node.labels.type == service` constraint ile başka node'a schedule edilir |
|
| Vault, mikroservisler | Otomatik — `node.labels.type == service` constraint ile başka node'a schedule edilir |
|
||||||
| SWAG, cert-reloader | Manuel — `docker service update --constraint-*` + Floating IP taşıma |
|
| SWAG, cert-reloader | Manuel — `docker service update --constraint-*` + Floating IP taşıma |
|
||||||
|
| cert-distributor | Otomatik — `mode: global`, tüm servis node'larında zaten çalışır |
|
||||||
| Prometheus, Grafana | Manuel — `docker service update --constraint-*` |
|
| Prometheus, Grafana | Manuel — `docker service update --constraint-*` |
|
||||||
| Veriler & Konfig | StorageBox'ta; failover node hemen erişir, veri kaybı yaşanmaz |
|
| Veriler & Konfig | StorageBox'ta; failover node hemen erişir, veri kaybı yaşanmaz |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
# Prod — DB Node Recovery
|
||||||
|
|
||||||
|
Her DB node'u (`iklim-db-01`, `iklim-db-02`, `iklim-db-03`) aynı servis üçlüsünü barındırır:
|
||||||
|
|
||||||
|
| Node | Servisler |
|
||||||
|
|------|-----------|
|
||||||
|
| `iklim-db-01` | `etcd-01`, `patroni-01`, `mongodb-01` |
|
||||||
|
| `iklim-db-02` | `etcd-02`, `patroni-02`, `mongodb-02` |
|
||||||
|
| `iklim-db-03` | `etcd-03`, `patroni-03`, `mongodb-03` |
|
||||||
|
|
||||||
|
## Senaryo A: Node Geçici Olarak Çöker (Volume'lar Korunur)
|
||||||
|
|
||||||
|
etcd and MongoDB are 3-member clusters that keep quorum with 2 members; Patroni is a 3-member HA cluster that relies on the etcd quorum for leader election.
|
||||||
|
|
||||||
|
| Service | Impact | Automatic recovery |
|
||||||
|
|--------|------|-------------------|
|
||||||
|
| etcd | Quorum holds with 2/3 nodes | The member rejoins the cluster automatically when the node returns |
|
||||||
|
| Patroni | If a replica goes down the primary keeps serving; if the primary goes down a new primary is elected through etcd | The node rejoins automatically as a replica when it returns |
|
||||||
|
| MongoDB | Quorum holds with 2/3 nodes; a new primary is elected if needed | The member catches up from the oplog when the node returns; a full initial sync is needed only if it fell outside the oplog window |
|
||||||
|
|
||||||
|
**No manual step is needed.** Docker Swarm restarts the services automatically (`restart_policy: on-failure`).
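
The "2 of 3" figure follows from the usual majority rule: a cluster of `n` voting members needs `floor(n/2) + 1` of them and therefore tolerates `floor((n-1)/2)` failures. A minimal sketch of that arithmetic, with hypothetical helper names:

```bash
#!/usr/bin/env bash
# Sketch only: majority-quorum arithmetic for an n-member cluster.
quorum_size()        { echo $(( $1 / 2 + 1 )); }     # members needed for a majority
tolerated_failures() { echo $(( ($1 - 1) / 2 )); }   # members that may be lost
```

With three DB nodes this gives a quorum of 2 and one tolerated failure, which is why a second node loss stops etcd and MongoDB from accepting writes.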
|
||||||
|
|
||||||
|
## Senaryo B: Node Yeniden Kurulur (Volume'lar Silinir)
|
||||||
|
|
||||||
|
etcd named volumes are node-local, so they are lost when the node is rebuilt. Patroni and MongoDB heal themselves; etcd needs manual intervention.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
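# Run on a DB node that was NOT rebuilt; if etcd-01 is the rebuilt member, use a healthy one in the name filter
# From a healthy etcd container: remove the stale member from the cluster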
|
||||||
|
docker exec -it $(docker ps -q -f name=iklimco_etcd-01) \
|
||||||
|
etcdctl member list --endpoints=http://etcd-01:2379,http://etcd-02:2379,http://etcd-03:2379
|
||||||
|
# Take the <member-id> of the rebuilt node from the output:
|
||||||
|
docker exec -it $(docker ps -q -f name=iklimco_etcd-01) \
|
||||||
|
etcdctl member remove <member-id> --endpoints=http://etcd-01:2379,http://etcd-02:2379,http://etcd-03:2379
|
||||||
|
|
||||||
|
# Restart the services (etcd starts with an empty volume and joins the existing cluster;
|
||||||
|
# Patroni clones from the primary automatically with pg_basebackup;
|
||||||
|
# MongoDB runs an automatic initial sync from the primary if the hostname is unchanged)
|
||||||
|
docker service update --force iklimco_etcd-0N
|
||||||
|
docker service update --force iklimco_patroni-0N
|
||||||
|
docker service update --force iklimco_mongodb-0N
|
||||||
|
```
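
Reading the member ID by eye is error-prone. As a sketch, assuming the default comma-separated output of `etcdctl member list` (ID, status, name, peer URLs, client URLs, learner flag), the ID for a member name can be extracted like this; `etcd_member_id` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch only: read `etcdctl member list` (default format) on stdin and
# print the ID of the member with the given name.
# Usage: etcdctl member list --endpoints=... | etcd_member_id etcd-02
etcd_member_id() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}
```

When piping from `docker exec`, drop the `-t` flag, since a pseudo-TTY adds carriage returns to the output.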
|
||||||
|
|
||||||
|
> **If the MongoDB hostname changes:** The replica set configuration still holds the old hostname. In `mongosh` on the primary, run `rs.remove("<old-host>:27017")` and then `rs.add("<new-host>:27017")`.
|
||||||
|
|
||||||
|
> **etcd `ETCD_INITIAL_CLUSTER_STATE`:** The stack file sets it to `new` (for the initial install). In the rebuild scenario, once the Swarm service is updated with `--force`, etcd starts with an empty volume and tries to join the running cluster in `existing` mode. The Bitnami etcd image is expected to detect this automatically; if it does not, temporarily set `ETCD_INITIAL_CLUSTER_STATE` to `existing` for that node in the stack file, redeploy, and then revert the change.
|
||||||
|
|
||||||
|
## Özet
|
||||||
|
|
||||||
|
| Service | Temporary outage | Rebuild |
|
||||||
|
|--------|-------------|-----------------|
|
||||||
|
| etcd | Automatic | Manual: `member remove` → `service update --force` |
|
||||||
|
| Patroni | Automatic | Automatic: clones the primary into the empty data directory |
|
||||||
|
| MongoDB | Automatic | Automatic (same hostname); if the hostname changes, `rs.remove` + `rs.add` |
|
||||||
@ -2,6 +2,11 @@
|
|||||||
|
|
||||||
Prod kurulum adımları ve mevcut yapı.
|
Prod kurulum adımları ve mevcut yapı.
|
||||||
|
|
||||||
|
This file preserves the installation history. The primary source for the current prod deploy flow
|
||||||
|
is `prod_env-ci_dc-pipeline.md` in the repository root. The manual deploy steps below
|
||||||
|
are kept as a record of the initial installation and troubleshooting; normal prod deploys
|
||||||
|
now run through the root `.gitea/workflows/deploy-prod.yml`.
|
||||||
|
|
||||||
## Terraform
|
## Terraform
|
||||||
|
|
||||||
### Hetzner Cloud Yapılandırması
|
### Hetzner Cloud Yapılandırması
|
||||||
@ -166,7 +171,27 @@ ansible-playbook prod-bootstrap.yml \
|
|||||||
--vault-password-file=../.vault_pass
|
--vault-password-file=../.vault_pass
|
||||||
```
|
```
|
||||||
|
|
||||||
## DB Stack Deploy
|
## Güncel Production Deploy Kaynakları
|
||||||
|
|
||||||
|
| Alan | Güncel kaynak |
|
||||||
|
| --- | --- |
|
||||||
|
| Root prod workflow | `.gitea/workflows/deploy-prod.yml` |
|
||||||
|
| Detaylı CI/CD dokümanı | `prod_env-ci_dc-pipeline.md` |
|
||||||
|
| Ana infra stack | `docker-stack-infra_db-prod.yml` |
|
||||||
|
| Vault HA stack | `docker-stack-vault.yml` |
|
||||||
|
| Vault bootstrap script | `init/vault/vault-bootstrap.sh` |
|
||||||
|
| Prod env ve secret dosyaları | `prod/secrets/iklim.co/.env`, `.env.secrets.*` |
|
||||||
|
|
||||||
|
In the current layout, the old stack files with the `.deleted` suffix no longer exist and play no part
|
||||||
|
in the prod flow. The main infra stack is `docker-stack-infra_db-prod.yml`.
|
||||||
|
The Vault stack is not part of that file; it is deployed by `vault-bootstrap.sh`
|
||||||
|
using `docker-stack-vault.yml`.
|
||||||
|
|
||||||
|
## Tarihsel Manuel DB Stack Deploy (2026-05-21)
|
||||||
|
|
||||||
|
This section is kept to preserve the history of the first prod DB/infra installation. In the current
|
||||||
|
normal flow these steps are not run by hand; the root prod workflow manages the main stack deploy,
|
||||||
|
Vault bootstrap, MongoDB replica set init, and the DB init scripts.
|
||||||
|
|
||||||
### Custom Image Build
|
### Custom Image Build
|
||||||
|
|
||||||
@ -174,6 +199,9 @@ ansible-playbook prod-bootstrap.yml \
|
|||||||
|
|
||||||
### Stack Deploy
|
### Stack Deploy
|
||||||
|
|
||||||
|
Historical note: `docker-stack-db-prod.yml` in this command block is no longer the current stack
|
||||||
|
file. The current main stack is `docker-stack-infra_db-prod.yml`.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Lokal → app-01
|
# Lokal → app-01
|
||||||
scp ./docker-stack-* root@178.104.210.41:/home/iklim/
|
scp ./docker-stack-* root@178.104.210.41:/home/iklim/
|
||||||
@ -198,6 +226,10 @@ history -c && history -w
|
|||||||
|
|
||||||
### MongoDB Replica Set Init
|
### MongoDB Replica Set Init
|
||||||
|
|
||||||
|
Historical note: `rs.initiate` was run by hand during the initial installation. In the current root prod
|
||||||
|
workflow, the `Initialize MongoDB Replica Set` step runs `rs.initiate()` when no replica set exists
|
||||||
|
and runs `rs.add()` on the primary when a member is missing.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
ssh root@<db-01-ip>
|
ssh root@<db-01-ip>
|
||||||
|
|
||||||
@ -242,26 +274,66 @@ history -c && history -w
|
|||||||
curl -s http://10.20.20.11:8008/cluster | python3 -m json.tool
|
curl -s http://10.20.20.11:8008/cluster | python3 -m json.tool
|
||||||
```
|
```
|
||||||
|
|
||||||
## Mevcut Durum (2026-05-21)
|
## Tarihsel Durum (2026-05-21)
|
||||||
|
|
||||||
| Adım | Durum |
|
| Adım | Durum |
|
||||||
|
| ------------------------------------------------------- | ---------- |
|
||||||
|
| Terraform — 6 sunucu, ağ, firewall, floating IP | ✅ |
|
||||||
|
| Ansible base + hardening + docker + node_dirs | ✅ |
|
||||||
|
| Ansible storagebox + storagebox_ssh_key | ✅ |
|
||||||
|
| Ansible swarm (3 manager app + 3 worker db) | ✅ |
|
||||||
|
| Ansible db_labels | ✅ |
|
||||||
|
| Ansible db_stack (StorageBox DB dizinleri + config) | ✅ |
|
||||||
|
| Ansible act_runner (3 prod runner Gitea'da Idle) | ✅ |
|
||||||
|
| DB stack deploy (etcd + MongoDB + Patroni) | ✅ |
|
||||||
|
| MongoDB replica set init (rs0: 1 primary, 2 secondary) | ✅ |
|
||||||
|
| Patroni HA cluster (1 leader, 2 replica, lag=0) | ✅ |
|
||||||
|
| Ana infra stack deploy (docker-stack-infra_db-prod.yml) | ✅ |
|
||||||
|
| MongoDB rs.initiate (ilk deploy sonrası elle) | ✅ |
|
||||||
|
| Deploy pipeline ilk çalışma | ⏳ bekliyor |
|
||||||
|
|
||||||
|
## Güncel Durum (2026-06-15)
|
||||||
|
|
||||||
|
| Alan | Güncel durum |
|
||||||
| --- | --- |
|
| --- | --- |
|
||||||
| Terraform — 6 sunucu, ağ, firewall, floating IP | ✅ |
|
| Prod deploy kaynak dokümanı | `prod_env-ci_dc-pipeline.md` |
|
||||||
| Ansible base + hardening + docker + node_dirs | ✅ |
|
| Root prod workflow | `.gitea/workflows/deploy-prod.yml` |
|
||||||
| Ansible storagebox + storagebox_ssh_key | ✅ |
|
| Ana infra stack | `docker-stack-infra_db-prod.yml` |
|
||||||
| Ansible swarm (3 manager app + 3 worker db) | ✅ |
|
| Vault HA stack | `docker-stack-vault.yml` |
|
||||||
| Ansible db_labels | ✅ |
|
| Vault deploy yöntemi | `init/vault/vault-bootstrap.sh` tarafından bootstrap/deploy |
|
||||||
| Ansible db_stack (StorageBox DB dizinleri + config) | ✅ |
|
| Eski `.deleted` stack dosyaları | Silindi, güncel akışta yok |
|
||||||
| Ansible act_runner (3 prod runner Gitea'da Idle) | ✅ |
|
| Prod env dosyası | StorageBox `prod/secrets/iklim.co/.env` -> workflow workspace `./.env` |
|
||||||
| DB stack deploy (etcd + MongoDB + Patroni) | ✅ |
|
| Shared secrets | StorageBox `prod/secrets/iklim.co/.env.secrets.shared` |
|
||||||
| MongoDB replica set init (rs0: 1 primary, 2 secondary) | ✅ |
|
| Service secrets | StorageBox `prod/secrets/iklim.co/.env.secrets.<svc>` |
|
||||||
| Patroni HA cluster (1 leader, 2 replica, lag=0) | ✅ |
|
| SWAG secrets | StorageBox `prod/secrets/iklim.co/.env.secrets.swag` |
|
||||||
| Ana infra stack deploy (docker-stack-infra_db-prod.yml) | ⏳ bekliyor |
|
| MongoDB replica set init | Workflow içinde otomatik/idempotent adım olarak yönetiliyor |
|
||||||
| MongoDB rs.initiate (ilk deploy sonrası elle) | ⏳ bekliyor |
|
| PostgreSQL init | Patroni primary beklenerek `./init/postgresql/*.sql` ile çalışıyor |
|
||||||
| Deploy pipeline ilk çalışma | ⏳ bekliyor |
|
| MongoDB init | Replica set hazırlandıktan sonra `./init/mongodb/*.js` ile çalışıyor |
|
||||||
|
| DNS update | Workflow GoDaddy API ile `api`, `apigw`, `rabbitmq`, `grafana` A kayıtlarını güncelliyor |
|
||||||
|
|
||||||
|
The current prod workflow broadly follows this order:

1. `.env`, `.env.secrets.shared`, the service secret files, and `.env.secrets.swag` are fetched from StorageBox.
2. PostgreSQL and MongoDB init templates are rendered under `./init/postgresql` and `./init/mongodb`.
3. Harbor pull login is performed.
4. SWAG DNS/site config files are prepared.
5. A temporary TLS placeholder cert for Vault is created if needed.
6. The `rabbitmq_erlang_cookie` Docker secret is created, or kept if it already exists.
7. `docker-stack-infra_db-prod.yml` is deployed to the `iklimco` stack.
8. The runner job container is attached to the `iklimco-net` overlay network.
9. `init-infra-prod.sh` runs; this script performs the Vault bootstrap and the RabbitMQ prod preparation.
10. Vault AppRole ID/Secret ID values and Docker secrets are generated.
11. The updated `.env` and `.env.secrets.*` files are uploaded to StorageBox.
12. etcd, APISIX, the SWAG certificate, the MongoDB replica set, the DB init scripts, and the DNS records are verified/updated.
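
Step 6 in the list above (create the secret, or keep it if it already exists) could be sketched roughly as follows. This is an illustration, not the actual workflow step: it assumes the `docker` CLI is run against a Swarm manager, and the function name `ensure_secret` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: create a Docker secret only when it is missing.
# An existing secret is left untouched, so reruns do not change the value.
ensure_secret() {
  name="$1"
  value="$2"
  if docker secret inspect "$name" >/dev/null 2>&1; then
    echo "kept: $name"
  elif printf '%s' "$value" | docker secret create "$name" - >/dev/null; then
    echo "created: $name"
  else
    echo "failed: $name" >&2
    return 1
  fi
}
```

A call such as `ensure_secret rabbitmq_erlang_cookie "$(openssl rand -hex 32)"` would then be safe to repeat on every pipeline run, because an existing cookie is never overwritten.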
|
||||||
|
|
||||||
## Important Architecture Notes
|
||||||
|
|
||||||
|
### Main Infra Stack and Vault Separation (2026-06-15)

The current main infra stack is `docker-stack-infra_db-prod.yml`. It contains the Redis master/replica/sentinel, RabbitMQ cluster, APISIX, APISIX Dashboard, Prometheus, Grafana, SWAG, cert-reloader, cert-distributor, etcd, Patroni, and MongoDB replica set services.

Vault is not part of the main infra stack. The Vault HA cluster is defined in `docker-stack-vault.yml` and deployed by `init/vault/vault-bootstrap.sh`. The bootstrap flow creates a placeholder `vault_unseal_key`, deploys the `iklimco_vault` service, runs Vault init/unseal, and then rotates the Docker secret to the real unseal key.
|
||||||
|
|
||||||
### Single Stack Approach (2026-05-26)

`docker-stack-infra-prod.yml` and `docker-stack-db-prod.yml` were merged into a single file: `docker-stack-infra_db-prod.yml`. Both files were already deployed under the same `iklimco` stack name, so service names did not change.
|
||||||
@ -270,7 +342,9 @@ curl -s http://10.20.20.11:8008/cluster | python3 -m json.tool
|
|||||||
|
|
||||||
**Network:** `iklimco-net` is now created by the stack (MTU=1400, attachable). The network-creation task in the Ansible `swarm` role was removed.
|
||||||
|
|
||||||
**MongoDB rs.initiate:** After the first deploy, `rs.initiate` must be run manually (see the DB Stack Deploy section).
|
**MongoDB rs.initiate:** This note belongs to the initial setup period. The current prod workflow handles `rs.initiate()` and, when needed, `rs.add()` in its `Initialize MongoDB Replica Set` step.
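
An idempotent replica set init of this kind could be sketched as below. This is an assumption-laden illustration, not the workflow's real step: the helper name `rs_init_script` and the member host names are hypothetical, and treating any `rs.status()` error as "not initialized" is a simplification (an auth error would also land in the `catch` branch).

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a mongosh script that initiates the replica set
# only when rs.status() fails, so the step can be re-run safely.
rs_init_script() {
  rs_name="$1"
  shift
  members=""
  i=0
  for host in "$@"; do
    if [ -n "$members" ]; then members="$members,"; fi
    members="$members{_id:$i,host:\"$host\"}"
    i=$((i + 1))
  done
  cat <<EOF
try { rs.status(); print("already initialized"); }
catch (e) { rs.initiate({_id:"$rs_name",members:[$members]}); print("initiated"); }
EOF
}
```

The generated script could then be passed to the shell, for example `mongosh --quiet --eval "$(rs_init_script rs0 mongo-01:27017 mongo-02:27017 mongo-03:27017)"`, where the three host names are placeholders.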
|
||||||
|
|
||||||
**If the network is deleted:** redeploy the stack with `docker stack deploy -c docker-stack-infra_db-prod.yml iklimco`.
|
||||||
|
|
||||||
@ -278,6 +352,11 @@ curl -s http://10.20.20.11:8008/cluster | python3 -m json.tool
|
|||||||
|
|
||||||
`retry_join.leader_api_addr` is set to `iklimco_vault` (the Swarm service name). Because the network is stack-owned, Docker DNS registers this VIP. With `leader_tls_server_name: vault.iklim.co`, the `*.iklim.co` certificate passes TLS verification.
|
||||||
|
|
||||||
|
In the current Vault deploy flow this setting is applied through `docker-stack-vault.yml` and the Vault template files. The Vault stack is not deployed directly from the root workflow; it goes through the chain `init-infra-prod.sh` -> `init/vault/init-prod.sh` -> `init/vault/vault-bootstrap.sh`.
|
||||||
|
|
||||||
### Runner / iklimco-net (2026-05-26)

The act runner config uses `container.network: "bridge"` (previously `iklimco-net`). In the workflow, the "Connect Runner to Overlay Network" step was moved after "Deploy Swarm Stacks", so that the runner job container can attach to the `iklimco-net` network created by the stack.
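
A "connect after deploy" step of this kind might look like the sketch below. It is only an illustration under assumptions: the function name `connect_to_overlay` is hypothetical, the container name is supplied by the caller, and the overlay must be attachable (as noted for `iklimco-net` above).

```bash
#!/usr/bin/env bash
# Hypothetical helper: attach a container to an overlay network unless it is
# already attached. Must run after the stack has created the network.
connect_to_overlay() {
  network="$1"
  container="$2"
  if docker inspect --format '{{json .NetworkSettings.Networks}}' "$container" \
      | grep -q "\"$network\""; then
    echo "already connected"
  else
    docker network connect "$network" "$container" && echo "connected"
  fi
}
```

Running the check first keeps the step idempotent: `docker network connect` returns an error when the container is already attached to the network.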
|
||||||
|
|||||||
@ -41,6 +41,9 @@ This scheme is applied consistently across `docker-stack-infra.yml` and all 10 m
|
|||||||
|
|
||||||
`node.role == worker` is intentionally not used anywhere. DB nodes are Swarm workers, but targeting them via `node.role == worker` would also match any future worker-only app nodes. The explicit `node.labels.role == db` label provides precise, unambiguous targeting regardless of Swarm role.
|
`node.role == worker` is intentionally not used anywhere. DB nodes are Swarm workers, but targeting them via `node.role == worker` would also match any future worker-only app nodes. The explicit `node.labels.role == db` label provides precise, unambiguous targeting regardless of Swarm role.
|
||||||
|
|
||||||
|
## Automation Note

**IMPORTANT:** All Swarm initialization, join token handling, and node labeling steps listed below are no longer performed manually. They are run **fully automatically** by `Environment_Infrastructure/ansible/prod/prod-bootstrap.yml` and the shared `swarm` role. The manual bash commands here are kept only for reference and troubleshooting.
|
||||||
|
|
||||||
## Step 1 — Init Swarm on iklim-app-01 (the prod-runner node)
|
## Step 1 — Init Swarm on iklim-app-01 (the prod-runner node)
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
@ -102,7 +105,7 @@ docker node update --label-add role=db --label-add db-index=03 iklim-db-03
|
|||||||
|
|
||||||
> DB nodes are Swarm **workers** only — they never become managers.
|
> DB nodes are Swarm **workers** only — they never become managers.
|
||||||
> DB services are pinned to them via `node.labels.role == db` placement constraint.
|
> DB services are pinned to them via `node.labels.role == db` placement constraint.
|
||||||
> See `08-prod-db-cluster-kurulum.md` for DB stack deployment.
|
> See `08-prod-db-cluster-setup.md` for DB stack deployment.
|
||||||
|
|
||||||
## Step 6 — Verify
|
## Step 6 — Verify
|
||||||
|
|
||||||
|
|||||||
@ -60,7 +60,7 @@ To get the Floating IP: `terraform output prod_floating_ip`
|
|||||||
|
|
||||||
Logic: for each record, pipeline queries the current value via GoDaddy API. If already correct, it skips. Otherwise it creates/updates the record.
|
Logic: for each record, pipeline queries the current value via GoDaddy API. If already correct, it skips. Otherwise it creates/updates the record.
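
The skip/create/update decision described above can be sketched as a small pure function. The names `dns_action` and `plan_dns` are hypothetical, and the GoDaddy API lookup itself is deliberately left out: the sketch assumes the current values have already been fetched and are fed in as `name current_ip` lines.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decide what to do with one A record.
dns_action() {
  current="$1"
  desired="$2"
  if [ -z "$current" ]; then
    echo create            # record does not exist yet
  elif [ "$current" = "$desired" ]; then
    echo skip              # already points at the desired IP
  else
    echo update            # exists but points elsewhere
  fi
}

# Read "name current_ip" lines from stdin and print the planned action.
plan_dns() {
  desired_ip="$1"
  while read -r name current_ip; do
    echo "$name: $(dns_action "$current_ip" "$desired_ip")"
  done
}
```

Keeping the decision separate from the API calls makes the skip path easy to verify without touching real DNS.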
|
||||||
|
|
||||||
> The Floating IP is assigned to `iklim-app-01` (`06-prod-terraform-iaac.md` — `floating_ip.tf`).
|
> The Floating IP is assigned to `iklim-app-01` (`06-prod-terraform-iac.md` — `floating_ip.tf`).
|
||||||
> If failover is needed, the Floating IP can be reassigned to another app node; DNS does not change.
|
> If failover is needed, the Floating IP can be reassigned to another app node; DNS does not change.
|
||||||
|
|
||||||
## Notes
|
## Notes
|
||||||
|
|||||||
@ -1,702 +1,75 @@
|
|||||||
# 03 — docker-stack-infra.yml Changes (Prod)
|
# 03 — Production Infrastructure and DB Stack Model
|
||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
### File strategy — overlay approach
|
This document records the production infrastructure target that is now implemented by the current setup runbooks. The execution source is no longer the old base-plus-prod overlay model.
|
||||||
|
|
||||||
Prod-specific service changes are **not written directly** into `docker-stack-infra.yml`; they are kept in a separate overlay file:
|
Current references:
|
||||||
|
|
||||||
| File | Usage |
|
- Setup source: `../../setup/08-prod-db-cluster-setup.md` and `../../setup/09-prod-runner-ha-and-swarm.md`
|
||||||
|------|-------|
|
- Main infra and DB stack: root `docker-stack-infra_db-prod.yml`
|
||||||
| `docker-stack-infra.yml` | Base — works as-is for test |
|
- Vault stack: root `docker-stack-vault.yml`
|
||||||
| `docker-stack-infra.prod.yml` | Prod overlay — additional services and overrides |
|
- Vault bootstrap: root `init/vault/vault-bootstrap.sh`, called through `init-infra-prod.sh`
|
||||||
|
|
||||||
```bash
|
## Current Stack Strategy
|
||||||
# Test deploy:
|
|
||||||
docker stack deploy -c docker-stack-infra.yml iklimco
|
|
||||||
|
|
||||||
# Prod deploy (Swarm merges both files):
|
Production uses a split stack model:
|
||||||
docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco
|
|
||||||
```
|
|
||||||
|
|
||||||
Docker Swarm merge rule: if the same service name appears in both files, the overlay wins (deploy, environment, etc.); services only present in the overlay are added.
|
- `docker-stack-infra_db-prod.yml`: APISIX, APISIX Dashboard, SWAG, cert services, Redis/Sentinel, RabbitMQ, Prometheus, Grafana, Patroni/PostgreSQL, MongoDB, and etcd.
|
||||||
|
- `docker-stack-vault.yml`: Vault Raft cluster only.
|
||||||
|
|
||||||
### Prod-specific changes summary
|
The previous `docker-stack-infra.yml` + `docker-stack-infra.prod.yml` overlay strategy is superseded for production. Do not create or deploy `docker-stack-infra.prod.yml` for the current prod environment.
|
||||||
- APISIX: 1 → 3 replicas (overlay override)
|
|
||||||
- Redis: single-instance → Sentinel cluster — 1 master + 2 replicas + 3 sentinels (overlay adds new services)
|
|
||||||
- RabbitMQ: 1 → 3-node Erlang cluster (overlay override + env)
|
|
||||||
- Vault: 1 → 3-node Raft cluster (overlay override) — see `07-vault-raft-plan.md`
|
|
||||||
- No separate APISIX etcd: Patroni etcd is shared (`/apisix` prefix)
|
|
||||||
- `init/apisix-core/init.sh`: when `PROFILE=prod`, rate limit `policy:local` → `policy:redis`
|
|
||||||
|
|
||||||
### swag-vl volume — not used in prod, not defined in overlay
|
## Placement Boundary
|
||||||
|
|
||||||
Test-env Step 9 adds the `swag-vl` named volume to the base file. In prod, SWAG mounts to the StorageBox via the `${SWAG_CONFIG_DIR}` env var, so this volume is unused by any service. No need to remove it in the overlay — Swarm does not create unused volume definitions, it remains harmless.
|
`docker-stack-infra_db-prod.yml` is intentionally a mixed stack. The placement model is the important boundary:
|
||||||
|
|
||||||
No `swag-vl` definition is made in `docker-stack-infra.prod.yml`.
|
- DB/cluster services run on `iklim-db-*`: Patroni/PostgreSQL, MongoDB, and etcd.
|
||||||
|
- App/service-node infrastructure runs on `iklim-app-*` with `node.labels.type == service`: Redis, Redis Sentinel, RabbitMQ, APISIX, APISIX Dashboard, SWAG, cert-reloader/cert-distributor, Prometheus, and Grafana.
|
||||||
|
- Redis and RabbitMQ are not DB-node host-mode services. They stay on the overlay network unless explicitly exposed by the stack or SWAG/APISIX.
|
||||||
|
|
||||||
### Monitoring Persistence
|
DB services that require direct cluster traffic publish host-mode ports where the current stack defines them. Redis and RabbitMQ must not be changed to host-mode just because they live in the same stack file.
|
||||||
|
|
||||||
Prometheus and Grafana run as single instances, but their storage profiles are different:
|
## Current Production Services
|
||||||
- **Prometheus:** keep TSDB on a local Docker volume (`prometheus-vl`). Prometheus local storage should not run on StorageBox/DAVFS because of filesystem semantics and WAL/compaction I/O.
|
|
||||||
- **Grafana:** keep `/var/lib/grafana` on StorageBox (`/mnt/storagebox/grafana/data`) so dashboards, plugins, and the SQLite database are available if the single active instance is manually moved to another node.
|
|
||||||
|
|
||||||
Grafana uses the `GRAFANA_DATA_DIR` env var with a named-volume fallback for test. Prometheus continues to use the named Docker volume. See Step 9 for implementation details.
|
| Area | Current model |
|
||||||
|
| --- | --- |
|
||||||
|
| APISIX | 3 replicas on service nodes; config stored in etcd with `/apisix` prefix |
|
||||||
|
| Redis | Sentinel model on service nodes; overlay-only |
|
||||||
|
| RabbitMQ | 3-node service-node cluster; management exposed through SWAG, restricted by IP |
|
||||||
|
| Vault | Separate 3-node Raft stack via `docker-stack-vault.yml` |
|
||||||
|
| PostgreSQL | 3-node Patroni cluster on DB nodes |
|
||||||
|
| MongoDB | 3-node replica set on DB nodes |
|
||||||
|
| etcd | 3-node cluster on DB nodes, shared by Patroni and APISIX |
|
||||||
|
| Prometheus | Single instance; local Docker volume |
|
||||||
|
| Grafana | Single instance; StorageBox-backed data path |
|
||||||
|
|
||||||
**Note:** PostgreSQL and MongoDB are not in `docker-stack-infra.yml`. See `08-prod-db-cluster-kurulum.md`.
|
## Monitoring Persistence
|
||||||
|
|
||||||
## Step 1 — Apply all test-env changes first
|
Prometheus TSDB remains on a local Docker volume because StorageBox/DAVFS is not suitable for Prometheus WAL and compaction I/O.
|
||||||
|
|
||||||
Follow every step in `test-env/03-infra-stack-changes.md`:
|
Grafana uses `/mnt/storagebox/grafana/data` through `GRAFANA_DATA_DIR` so dashboards, plugins, and the SQLite database survive manual service movement between service nodes.
|
||||||
- Add `swag` service
|
|
||||||
- Add `cert-reloader` service
|
|
||||||
- Remove published ports for vault, apisix, rabbitmq, prometheus, grafana, apisix-dashboard
|
|
||||||
- Add `swag-vl` volume
|
|
||||||
|
|
||||||
## Step 2 — Vault: 3-node Raft cluster (prod)
|
## APISIX and etcd
|
||||||
|
|
||||||
Vault starts directly with 3 replicas; the Phase 1 single-instance stage is skipped in prod.
|
APISIX uses the DB-node etcd cluster through overlay DNS aliases such as `etcd-01`, `etcd-02`, and `etcd-03`. Patroni and APISIX use different etcd prefixes, so their data does not collide.
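
A reachability check across the three etcd aliases could be sketched as follows. This is a hedged illustration: the function names are hypothetical, the fetch command is passed in by the caller, and the response is assumed to contain `"health":"true"` as returned by etcd's `/health` endpoint.

```bash
#!/usr/bin/env bash
# Hypothetical helper: true when an etcd /health body reports healthy.
etcd_is_healthy() {
  case "$1" in
    *'"health":"true"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# $1 is the command used to fetch a URL (assumed to print the body to stdout);
# the remaining arguments are etcd host names on the overlay network.
check_etcd_endpoints() {
  fetch="$1"
  shift
  failed=0
  for host in "$@"; do
    body="$($fetch "http://$host:2379/health" 2>/dev/null)" || body=""
    if etcd_is_healthy "$body"; then
      echo "$host ok"
    else
      echo "$host FAILED"
      failed=1
    fi
  done
  return "$failed"
}
```

From a container attached to `iklimco-net`, this might be invoked as `check_etcd_endpoints "wget -qO-" etcd-01 etcd-02 etcd-03`; a non-zero exit status means at least one endpoint did not report healthy.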
|
||||||
See `07-vault-raft-plan.md` Phase 2 for detailed setup steps.
|
|
||||||
|
|
||||||
```yaml
|
The app subnet to DB subnet firewall rule for etcd client traffic is part of the current production firewall model. See `../../setup/06-prod-terraform-iac.md`.
|
||||||
vault:
|
|
||||||
deploy:
|
|
||||||
mode: replicated
|
|
||||||
replicas: 3
|
|
||||||
placement:
|
|
||||||
max_replicas_per_node: 1
|
|
||||||
constraints:
|
|
||||||
- node.labels.type == service
|
|
||||||
```
|
|
||||||
|
|
||||||
## Step 3 — APISIX: 3 replicas + init.sh rate limit update (prod overlay)
|
## Redis and RabbitMQ
|
||||||
|
|
||||||
Add to `docker-stack-infra.prod.yml`:
|
Redis/Sentinel and RabbitMQ are service-node infrastructure. Their placement follows `node.labels.type == service`.
|
||||||
|
|
||||||
```yaml
|
RabbitMQ-related private firewall rules belong to the app/service-node firewall model. Redis and Sentinel do not publish host-mode ports in the current prod stack and do not require Hetzner firewall openings.
|
||||||
# docker-stack-infra.prod.yml
|
|
||||||
services:
|
|
||||||
apisix:
|
|
||||||
deploy:
|
|
||||||
mode: replicated
|
|
||||||
replicas: 3
|
|
||||||
placement:
|
|
||||||
max_replicas_per_node: 1
|
|
||||||
constraints:
|
|
||||||
- node.labels.type == service
|
|
||||||
|
|
||||||
apisix-dashboard:
|
## Historical / Superseded by Setup
|
||||||
deploy:
|
|
||||||
mode: replicated
|
|
||||||
replicas: 3
|
|
||||||
placement:
|
|
||||||
max_replicas_per_node: 1
|
|
||||||
constraints:
|
|
||||||
- node.labels.type == service
|
|
||||||
```
|
|
||||||
|
|
||||||
APISIX and apisix-dashboard are stateless (config lives in Patroni etcd) — 3 replicas is safe.
|
The following earlier roadmap ideas are retained only as historical context:
|
||||||
Swarm distributes SWAG requests to APISIX replicas via VIP (IPVS round-robin).
|
|
||||||
|
|
||||||
### init.sh — rate limit policy:redis (prod)
|
- Creating `docker-stack-infra.prod.yml` as a prod overlay.
|
||||||
|
- Deploying prod with `docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco`.
|
||||||
|
- Keeping Vault inside the prod infra overlay with `/opt/iklimco/vault/data` host-path storage.
|
||||||
|
- Treating PostgreSQL/MongoDB as separate DB stacks such as `docker-stack-db.prod.yml`.
|
||||||
|
- Validating a prod merge with `docker stack config -c docker-stack-infra.yml -c docker-stack-infra.prod.yml`.
|
||||||
|
|
||||||
With `policy:local`, each APISIX instance counts independently → the global limit effectively becomes 3× with 3 replicas.
|
For current execution, use the setup runbooks and root stack files listed in the Context section.
|
||||||
Switch to `policy:redis` for `PROFILE=prod`.
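
To make the 3× effect concrete, the arithmetic can be sketched as below. The function name `effective_limit` is hypothetical, and the result is an upper bound that assumes a single client's requests are spread across all replicas.

```bash
#!/usr/bin/env bash
# Hypothetical helper: worst-case requests a client can make per window.
# With policy "local" every replica keeps its own counter, so the limits add up;
# with a shared counter (policy "redis") the configured count applies globally.
effective_limit() {
  count="$1"
  replicas="$2"
  policy="$3"
  if [ "$policy" = "local" ]; then
    echo $((count * replicas))
  else
    echo "$count"
  fi
}
```

With the global rule of 60 requests per window and 3 replicas, `effective_limit 60 3 local` yields 180, while `effective_limit 60 3 redis` stays at 60.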
|
|
||||||
|
|
||||||
Keep the following APISIX plugin limits in `init/apisix-core/init.sh` for `test/prod` unless stated otherwise:
|
|
||||||
|
|
||||||
| Scope | Plugin | Target limit |
|
|
||||||
|-------|--------|--------------|
|
|
||||||
| WebSocket `/ws` | `limit-conn` | `conn: 5` per `remote_addr` |
|
|
||||||
| Auth routes `/v1/auth/*`, `/v1/users/*` | `limit-count` | `count: 12`, `time_window: 60` per `remote_addr` |
|
|
||||||
| Global rule | `limit-count` | `count: 60`, `time_window: 60` per `remote_addr` |
|
|
||||||
|
|
||||||
Update the rate limit and connection limit blocks in `init/apisix-core/init.sh`.
|
|
||||||
|
|
||||||
**1. Define threshold constants at the script header:**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
GLOBAL_LIMIT_COUNT=60
|
|
||||||
GLOBAL_LIMIT_WINDOW=60
|
|
||||||
AUTH_LIMIT_COUNT=12
|
|
||||||
AUTH_LIMIT_WINDOW=60
|
|
||||||
WS_LIMIT_CONN=5
|
|
||||||
```
|
|
||||||
|
|
||||||
**2. Update WebSocket route plugins (test/prod):**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
if [[ "$PROFILE" != "dev" ]]; then
|
|
||||||
WS_PLUGINS=',"plugins":{"limit-conn":{"conn":'"$WS_LIMIT_CONN"',"burst":2,"default_conn_delay":0.1,"key":"remote_addr","key_type":"var","rejected_code":429}}'
|
|
||||||
else
|
|
||||||
WS_PLUGINS=""
|
|
||||||
fi
|
|
||||||
```
|
|
||||||
|
|
||||||
**3. Update Auth route plugins (test/prod):**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
if [[ "$PROFILE" != "dev" ]]; then
|
|
||||||
AUTH_LIMIT=',"plugins":{"limit-count":{"count":'"$AUTH_LIMIT_COUNT"',"time_window":'"$AUTH_LIMIT_WINDOW"',"key_type":"var","key":"remote_addr","rejected_code":429,"policy":"local"}}'
|
|
||||||
else
|
|
||||||
AUTH_LIMIT=""
|
|
||||||
fi
|
|
||||||
```
|
|
||||||
|
|
||||||
**4. Update Global rate limit rule (test/prod):**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
if [[ "$PROFILE" != "dev" ]]; then
|
|
||||||
if [[ "$PROFILE" == "prod" ]]; then
|
|
||||||
RATE_POLICY="redis"
|
|
||||||
RATE_REDIS=',"redis_host":"redis","redis_port":6379,"redis_password":"'"$REDIS_PASSWORD"'"'
|
|
||||||
else
|
|
||||||
RATE_POLICY="local"
|
|
||||||
RATE_REDIS=""
|
|
||||||
fi
|
|
||||||
|
|
||||||
call_api "global rate limit" -X PUT "$APISIX_ADMIN_URL/global_rules/1" \
|
|
||||||
-H "X-API-KEY: $API_KEY" -H "Content-Type: application/json" \
|
|
||||||
-d '{"plugins":{"limit-count":{"count":'"$GLOBAL_LIMIT_COUNT"',"time_window":'"$GLOBAL_LIMIT_WINDOW"',"key_type":"var","key":"remote_addr","rejected_code":429,"policy":"'"$RATE_POLICY"'","allow_degradation":true'"$RATE_REDIS"'}}}'
|
|
||||||
fi
|
|
||||||
```
|
|
||||||
|
|
||||||
> APISIX's `limit-count` plugin does not natively support Redis Sentinel; `policy:redis` works with a single endpoint.
|
|
||||||
> The `redis` service name stays constant within Swarm overlay DNS. `allow_degradation: true` ensures that if Redis is
|
|
||||||
> temporarily unreachable (e.g. Sentinel failover ~10-30 s, or master rescheduling), APISIX passes requests through
|
|
||||||
> instead of returning errors — rate limiting is briefly suspended but API access is unaffected.
|
|
||||||
> Microservices use Spring Data Redis Sentinel natively and are unaffected by master changes.
|
|
||||||
> Docker Swarm has no inter-service anti-affinity; the `redis` master placement relies on Swarm's spread strategy
|
|
||||||
> to avoid co-locating with a replica. This is a known limitation — accepted in favour of operational simplicity.
|
|
||||||
|
|
||||||
## Step 4 — etcd: Separate APISIX etcd removed — Patroni etcd shared
|
|
||||||
|
|
||||||
The standalone `etcd` service in `docker-stack-infra.yml` is **not used in prod and must be disabled** by setting `replicas: 0` in the prod overlay.
|
|
||||||
APISIX uses the 3-node Patroni etcd cluster running on DB nodes, via the `/apisix` prefix.
|
|
||||||
|
|
||||||
### Why consolidated?
|
|
||||||
- A standalone single-instance etcd was a SPOF for APISIX.
|
|
||||||
- Patroni etcd is already 3-node HA — APISIX gets a more reliable config store.
|
|
||||||
- etcd supports prefix-based namespacing; Patroni uses `/service/`, APISIX uses `/apisix/` — no collision.
|
|
||||||
|
|
||||||
### APISIX etcd connection configuration
|
|
||||||
|
|
||||||
Update the etcd endpoints in the APISIX service in `docker-stack-infra.yml` to point to DB nodes:
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
apisix:
|
|
||||||
environment:
|
|
||||||
APISIX_STAND_ALONE: "false"
|
|
||||||
# via apisix/conf/config.yaml or environment:
|
|
||||||
# etcd:
|
|
||||||
# host:
|
|
||||||
# - "http://etcd-01:2379"
|
|
||||||
# - "http://etcd-02:2379"
|
|
||||||
# - "http://etcd-03:2379"
|
|
||||||
# prefix: "/apisix"
|
|
||||||
```
|
|
||||||
|
|
||||||
The preferred method is mounting `config.yaml` via a Docker config or volume. etcd endpoints use **overlay DNS aliases** defined in `docker-stack-db.prod.yml` — `etcd-01`, `etcd-02`, `etcd-03` — which are reachable from app nodes via the `iklimco-net` overlay:
|
|
||||||
|
|
||||||
```yaml
# config/apisix/config.yaml
etcd:
  host:
    - "http://etcd-01:2379"
    - "http://etcd-02:2379"
    - "http://etcd-03:2379"
  prefix: "/apisix"
  timeout: 30
```
|
|
||||||
|
|
||||||
### Disable standalone etcd in prod overlay
|
|
||||||
|
|
||||||
Docker Swarm overlay files cannot delete services from the base stack, but `replicas: 0` stops the container entirely:
|
|
||||||
|
|
||||||
```yaml
# docker-stack-infra.prod.yml
services:
  etcd:
    deploy:
      replicas: 0
```
|
|
||||||
|
|
||||||
### Firewall requirement
|
|
||||||
|
|
||||||
etcd access from app nodes to DB nodes must be open (port 2379, app subnet → DB subnet). Verify from an app node:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
docker run --rm --network iklimco-net alpine \
|
|
||||||
sh -c "wget -qO- http://etcd-01:2379/health"
|
|
||||||
```
|
|
||||||
|
|
||||||
## Step 5 — Redis: Sentinel cluster (prod overlay)
|
|
||||||
|
|
||||||
Redis runs as a single instance in test. In prod, Sentinel provides HA.
|
|
||||||
![[redis-sentinel-vs-cluster.png]]
|
|
||||||
Bitnami images are used — all configuration is done via env vars, no separate `.conf` file needed.
|
|
||||||
|
|
||||||
### Prerequisites
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Create Docker secret for Redis password:
|
|
||||||
openssl rand -hex 32 | docker secret create redis_password -
|
|
||||||
```
|
|
||||||
|
|
||||||
### Topology
|
|
||||||
|
|
||||||
```
|
|
||||||
any app node: redis (1 replica, spread by Swarm — not pinned)
|
|
||||||
2 app nodes: redis-replica (2 replicas, max 1/node, spread across app nodes)
|
|
||||||
all app nodes: redis-sentinel (3 replicas, max 1/node, spread across all app nodes)
|
|
||||||
```
|
|
||||||
|
|
||||||
### docker-stack-infra.prod.yml — Redis services
|
|
||||||
|
|
||||||
The existing `redis` service is overridden in the prod overlay as **master**; `redis-replica` and `redis-sentinel` are added as new services. The service name (`redis`) remains unchanged so the APISIX connection config does not need updating.
|
|
||||||
|
|
||||||
```yaml
# docker-stack-infra.prod.yml
services:
  redis:   # override base single-instance redis → master
    image: bitnamisecure/redis:latest
    environment:
      ALLOW_EMPTY_PASSWORD: "no"   # quoted: a bare `no` is parsed as a YAML boolean
      REDIS_PASSWORD: ${REDIS_PASSWORD}
      REDIS_REPLICATION_MODE: master
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.labels.type == service
      restart_policy:
        condition: any
        delay: 5s
      labels:
        project: co.iklim

  redis-replica:
    image: bitnamisecure/redis:latest
    environment:
      ALLOW_EMPTY_PASSWORD: "no"   # quoted for the same reason as above
      REDIS_REPLICATION_MODE: slave
      REDIS_MASTER_HOST: redis
      REDIS_MASTER_PORT_NUMBER: "6379"
      REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD}
      REDIS_PASSWORD: ${REDIS_PASSWORD}
    deploy:
      mode: replicated
      replicas: 2
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
        preferences:
          - spread: node.hostname
      restart_policy:
        condition: any
        delay: 5s
      labels:
        project: co.iklim

  redis-sentinel:
    image: bitnamisecure/redis-sentinel:latest
    environment:
      REDIS_SENTINEL_MASTER_NAME: prod-master
      REDIS_MASTER_HOST: redis
      REDIS_MASTER_PORT_NUMBER: "6379"
      REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD}
      REDIS_SENTINEL_QUORUM: "2"
      REDIS_SENTINEL_DOWN_AFTER_MILLISECONDS: "5000"
      REDIS_SENTINEL_FAILOVER_TIMEOUT: "10000"
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
        preferences:
          - spread: node.hostname
      restart_policy:
        condition: any
        delay: 5s
      labels:
        project: co.iklim
```
|
|
||||||
|
|
||||||
### Microservice connection (Spring Data Redis)
|
|
||||||
|
|
||||||
Microservices must use a Sentinel-aware connection:
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
# application-prod.yml
|
|
||||||
spring:
|
|
||||||
data:
|
|
||||||
redis:
|
|
||||||
sentinel:
|
|
||||||
master: prod-master
|
|
||||||
nodes:
|
|
||||||
- redis-sentinel:26379
|
|
||||||
password: ${REDIS_PASSWORD}
|
|
||||||
```
|
|
||||||
|
|
||||||
### Verification
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Query master identity:
|
|
||||||
docker exec $(docker ps -q -f name=iklimco_redis-sentinel | head -1) \
|
|
||||||
redis-cli -p 26379 SENTINEL get-master-addr-by-name prod-master
|
|
||||||
```
|
|
||||||
|
|
||||||
## Step 6 — RabbitMQ: 3-node Erlang cluster (prod overlay)
|
|
||||||
|
|
||||||
RabbitMQ runs as a 3-node cluster with one instance per app node.
|
|
||||||
|
|
||||||
### Prerequisites
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Create Docker secret for Erlang cookie (must be identical on all nodes):
|
|
||||||
openssl rand -hex 32 | docker secret create rabbitmq_erlang_cookie -
|
|
||||||
```
|
|
||||||
|
|
||||||
### docker-stack-infra.prod.yml — RabbitMQ override
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
# docker-stack-infra.prod.yml (add alongside redis services)
|
|
||||||
services:
|
|
||||||
rabbitmq:
|
|
||||||
image: rabbitmq:3-management
|
|
||||||
hostname: "rabbitmq-{{.Node.Hostname}}"
|
|
||||||
environment:
|
|
||||||
RABBITMQ_ERLANG_COOKIE_FILE: /run/secrets/rabbitmq_erlang_cookie
|
|
||||||
RABBITMQ_USE_LONGNAME: "true"
|
|
||||||
RABBITMQ_NODENAME: "rabbit@rabbitmq-{{.Node.Hostname}}"
|
|
||||||
secrets:
|
|
||||||
- rabbitmq_erlang_cookie
|
|
||||||
networks:
|
|
||||||
iklimco-net:
|
|
||||||
aliases:
|
|
||||||
- "rabbitmq-{{.Node.Hostname}}"
|
|
||||||
deploy:
|
|
||||||
mode: replicated
|
|
||||||
replicas: 3
|
|
||||||
placement:
|
|
||||||
max_replicas_per_node: 1
|
|
||||||
constraints:
|
|
||||||
- node.labels.type == service
|
|
||||||
update_config:
|
|
||||||
parallelism: 1
|
|
||||||
order: stop-first
|
|
||||||
labels:
|
|
||||||
project: co.iklim
|
|
||||||
|
|
||||||
secrets:
|
|
||||||
rabbitmq_erlang_cookie:
|
|
||||||
external: true
|
|
||||||
|
|
||||||
networks:
|
|
||||||
iklimco-net:
|
|
||||||
external: true
|
|
||||||
```
|
|
||||||
|
|
||||||
### Cluster join procedure (first setup)
|
|
||||||
|
|
||||||
RabbitMQ nodes do not form a cluster automatically; manual join is required after first start:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Find the RabbitMQ container on iklim-app-02:
|
|
||||||
CTR=$(docker ps -q -f name=iklimco_rabbitmq)
|
|
||||||
|
|
||||||
# Stop, join, start:
|
|
||||||
docker exec "$CTR" rabbitmqctl stop_app
|
|
||||||
docker exec "$CTR" rabbitmqctl join_cluster rabbit@rabbitmq-iklim-app-01
|
|
||||||
docker exec "$CTR" rabbitmqctl start_app
|
|
||||||
|
|
||||||
# Repeat for iklim-app-03
|
|
||||||
```
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Verify cluster status (from any node):
|
|
||||||
docker exec "$CTR" rabbitmqctl cluster_status
|
|
||||||
```
|
|
||||||
|
|
||||||
> **HA policy:** After the cluster is formed, set quorum queues as the default:
|
|
||||||
> ```bash
|
|
||||||
> docker exec "$CTR" rabbitmqctl set_policy ha-all ".*" \
|
|
||||||
> '{"queue-type":"quorum"}' --apply-to queues
|
|
||||||
> ```
|
|
||||||
|
|
||||||
## Step 7 — RabbitMQ WebSocket Sticky Sessions (Consistent Hash)
|
|
||||||
|
|
||||||
RabbitMQ Web STOMP (over WebSocket) requires a persistent connection. In a 3-node RabbitMQ cluster, if an APISIX instance uses the default Swarm VIP for the `rabbitmq` upstream, it may cause unnecessary inter-node traffic or connection drops if the session doesn't persist on the same node.
|
|
||||||
|
|
||||||
To optimize this, we implement **Consistent Hashing (chash)** at the APISIX layer based on the client's IP address (`remote_addr`).
|
|
||||||
|
|
||||||
### 1. Update APISIX Upstream Configuration (init.sh)
|
|
||||||
|
|
||||||
Update the `rabbitmq` upstream definition in `init/apisix-core/init.sh` to target specific cluster nodes instead of the generic service name, enabling the `chash` algorithm for prod.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Update upstream rabbitmq block in init.sh
|
|
||||||
if [[ "$PROFILE" == "prod" ]]; then
|
|
||||||
# Direct node DNS names to bypass Swarm VIP and allow chash to work effectively
|
|
||||||
RABBITMQ_NODES='{"rabbitmq-iklim-app-01:15674":1, "rabbitmq-iklim-app-02:15674":1, "rabbitmq-iklim-app-03:15674":1}'
|
|
||||||
LB_TYPE="chash"
|
|
||||||
HASH_KEY="remote_addr"
|
|
||||||
else
|
|
||||||
RABBITMQ_NODES='{"rabbitmq:15674":1}'
|
|
||||||
LB_TYPE="roundrobin"
|
|
||||||
HASH_KEY=""
|
|
||||||
fi
|
|
||||||
|
|
||||||
call_api "upstream rabbitmq" -X PUT "$APISIX_ADMIN_URL/upstreams/rabbitmq-upstream" \
|
|
||||||
-H "X-API-KEY: $API_KEY" -H "Content-Type: application/json" \
|
|
||||||
-d '{
|
|
||||||
"name": "rabbitmq-upstream",
|
|
||||||
"type": "'"$LB_TYPE"'",
|
|
||||||
"key": "'"$HASH_KEY"'",
|
|
||||||
"nodes": '"$RABBITMQ_NODES"',
|
|
||||||
"timeout": {"connect": 10, "send": 3600, "read": 3600},
|
|
||||||
"scheme": "http",
|
|
||||||
'"$HC"'
|
|
||||||
}'
|
|
||||||
```
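The payload above always sends a `key` field, even when `LB_TYPE` is `roundrobin` and `HASH_KEY` is empty. The following is a minimal sketch, not the actual `init.sh` code, of a hypothetical helper `build_rabbitmq_upstream` that emits `hash_on` and `key` only for `chash`. It assumes the Admin API may not accept an empty `key` (not confirmed here) and omits the `$HC` health-check fragment for brevity.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of init.sh): build the upstream JSON body so
# that "hash_on"/"key" appear only when the balancer type is chash.
build_rabbitmq_upstream() {
  local lb_type="$1" nodes_json="$2" hash_key="${3:-}"
  local hash_fields=""
  if [[ "$lb_type" == "chash" ]]; then
    # hash_on=vars means the key names an Nginx variable such as remote_addr.
    hash_fields=$(printf '"hash_on": "vars", "key": "%s", ' "$hash_key")
  fi
  printf '{"name": "rabbitmq-upstream", "type": "%s", %s"nodes": %s, "timeout": {"connect": 10, "send": 3600, "read": 3600}, "scheme": "http"}\n' \
    "$lb_type" "$hash_fields" "$nodes_json"
}

# Test profile: single service name, no hash fields in the output.
build_rabbitmq_upstream roundrobin '{"rabbitmq:15674":1}'
# Prod profile: per-node names, hashed on the client address.
build_rabbitmq_upstream chash '{"rabbitmq-iklim-app-01:15674":1, "rabbitmq-iklim-app-02:15674":1, "rabbitmq-iklim-app-03:15674":1}' remote_addr
```

The output of either call could be passed to `call_api ... -d "$payload"` in place of the inline JSON.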
|
|
||||||
|
|
||||||
### 2. Enable Real IP Detection in APISIX
|
|
||||||
|
|
||||||
Consistent hashing by `remote_addr` requires APISIX to see the actual client IP, not the internal IP of the SWAG (Nginx) proxy.
|
|
||||||
|
|
||||||
> **DNS Note:** For `chash` to work with node-specific names, the RabbitMQ service must have network aliases configured for each node (e.g., `rabbitmq-{{.Node.Hostname}}`) as shown in Step 6.
|
|
||||||
|
|
||||||
In the `config.yaml` inside the custom APISIX image (`custom-apisix:3.12.0`):
|
|
||||||
|
|
||||||
```yaml
nginx_config:
  http:
    real_ip_header: "X-Real-IP"
    real_ip_from:
      - "10.0.0.0/8"
```
|
|
||||||
|
|
||||||
## Step 8 — Create `docker-stack-infra.prod.yml`
|
|
||||||
|
|
||||||
Create this file in the repo root alongside `docker-stack-infra.yml`. It combines all prod-specific overrides from Steps 2–6 (including disabling the standalone `etcd` from Step 4):
|
|
||||||
|
|
||||||
```yaml
# docker-stack-infra.prod.yml
# Prod overlay. Deploy with:
#   docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco

services:

  vault:
    environment:
      VAULT_LOCAL_CONFIG: >-
        {"api_addr":"https://vault.iklim.co:8200",
        "cluster_addr":"https://{{ .Node.Hostname }}:8201",
        "storage":{"raft":{"path":"/vault/file","node_id":"{{ .Node.Hostname }}"}},
        "listener":[{"tcp":{"address":"0.0.0.0:8200",
        "tls_cert_file":"/vault/certs/STAR.iklim.co.full.crt",
        "tls_key_file":"/vault/certs/STAR.iklim.co_key.pem"}}],
        "default_lease_ttl":"168h","max_lease_ttl":"720h","ui":true}
    volumes:
      - /opt/iklimco/vault/data:/vault/file
      - ${SWAG_CERT_DIR}:/vault/certs:ro
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service

  apisix:
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service

  apisix-dashboard:
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service

  redis:
    image: bitnamisecure/redis:latest
    environment:
      # Quoted so YAML does not parse it as the boolean false.
      ALLOW_EMPTY_PASSWORD: "no"
      REDIS_PASSWORD: ${REDIS_PASSWORD}
      REDIS_REPLICATION_MODE: master
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.labels.type == service
      restart_policy:
        condition: any
        delay: 5s
      labels:
        project: co.iklim

  redis-replica:
    image: bitnamisecure/redis:latest
    environment:
      ALLOW_EMPTY_PASSWORD: "no"
      REDIS_REPLICATION_MODE: slave
      REDIS_MASTER_HOST: redis
      REDIS_MASTER_PORT_NUMBER: "6379"
      REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD}
      REDIS_PASSWORD: ${REDIS_PASSWORD}
    deploy:
      mode: replicated
      replicas: 2
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
        preferences:
          - spread: node.hostname
      restart_policy:
        condition: any
        delay: 5s
      labels:
        project: co.iklim

  redis-sentinel:
    image: bitnamisecure/redis-sentinel:latest
    environment:
      REDIS_SENTINEL_MASTER_NAME: prod-master
      REDIS_MASTER_HOST: redis
      REDIS_MASTER_PORT_NUMBER: "6379"
      REDIS_MASTER_PASSWORD: ${REDIS_PASSWORD}
      REDIS_SENTINEL_QUORUM: "2"
      REDIS_SENTINEL_DOWN_AFTER_MILLISECONDS: "5000"
      REDIS_SENTINEL_FAILOVER_TIMEOUT: "10000"
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
        preferences:
          - spread: node.hostname
      restart_policy:
        condition: any
        delay: 5s
      labels:
        project: co.iklim

  rabbitmq:
    image: rabbitmq:3-management
    hostname: "rabbitmq-{{.Node.Hostname}}"
    environment:
      RABBITMQ_ERLANG_COOKIE_FILE: /run/secrets/rabbitmq_erlang_cookie
      RABBITMQ_USE_LONGNAME: "true"
      RABBITMQ_NODENAME: "rabbit@rabbitmq-{{.Node.Hostname}}"
    secrets:
      - rabbitmq_erlang_cookie
    networks:
      iklimco-net:
        aliases:
          - "rabbitmq-{{.Node.Hostname}}"
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.type == service
      update_config:
        parallelism: 1
        order: stop-first
      labels:
        project: co.iklim

secrets:
  rabbitmq_erlang_cookie:
    external: true

networks:
  iklimco-net:
    external: true
```
|
|
||||||
|
|
||||||
## Step 9 — Monitoring Data Persistence
|
|
||||||
|
|
||||||
Prometheus and Grafana run as single instances. Grafana data is placed on the StorageBox shared filesystem for manual failover. Prometheus TSDB stays on a local Docker volume because DAVFS/StorageBox is not suitable for Prometheus WAL and compaction I/O.
|
|
||||||
|
|
||||||
**Changes already applied to `docker-stack-infra.yml`:**
|
|
||||||
|
|
||||||
```yaml
prometheus:
  volumes:
    - prometheus-vl:/prometheus

grafana:
  volumes:
    - ${GRAFANA_DATA_DIR:-grafana-vl}:/var/lib/grafana
```
|
|
||||||
|
|
||||||
Test uses the named Docker volume fallback (`grafana-vl`) for Grafana, and Prometheus always uses the named Docker volume (`prometheus-vl`) — no test env change needed.
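The `${GRAFANA_DATA_DIR:-grafana-vl}` expression follows the same default-value rule as shell parameter expansion: the fallback applies when the variable is unset or empty. A small sketch of that behavior, using a hypothetical helper name `resolve_grafana_source` that does not exist in the repo:

```bash
#!/usr/bin/env bash
# Hypothetical helper: resolve the Grafana volume source the same way the
# ${GRAFANA_DATA_DIR:-grafana-vl} expression in the stack file does.
resolve_grafana_source() {
  # $1 is the (possibly empty) GRAFANA_DATA_DIR value.
  local data_dir="${1:-}"
  printf '%s\n' "${data_dir:-grafana-vl}"
}

resolve_grafana_source ""                              # test: falls back to the named volume
resolve_grafana_source "/mnt/storagebox/grafana/data"  # prod: StorageBox path is used
```

This is why the test environment needs no change: as long as `GRAFANA_DATA_DIR` is not exported there, the named volume is used.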
|
|
||||||
|
|
||||||
**Add to `prod/secrets/iklim.co/.env.prod` on storagebox** (already in `env-prod/.env`):
|
|
||||||
|
|
||||||
```bash
|
|
||||||
GRAFANA_DATA_DIR=/mnt/storagebox/grafana/data
|
|
||||||
```
|
|
||||||
|
|
||||||
> `/mnt/storagebox/grafana/data` is created automatically by the Ansible `storagebox` role during bootstrap via the `storagebox_managed_directories` variable. No manual step required.
|
|
||||||
|
|
||||||
> Grafana writes its SQLite database and dashboard JSON to `/var/lib/grafana`.
|
|
||||||
> Prometheus writes its TSDB to `/prometheus` on the local `prometheus-vl` Docker volume; it is not shared between nodes.
|
|
||||||
|
|
||||||
## Step 10 — Verify
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Base file must be valid on its own (test deploy):
|
|
||||||
docker stack config -c docker-stack-infra.yml > /dev/null && echo "base OK"
|
|
||||||
|
|
||||||
# Prod merge must be valid:
|
|
||||||
docker stack config -c docker-stack-infra.yml -c docker-stack-infra.prod.yml > /dev/null && echo "prod merge OK"
|
|
||||||
```
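The commands above only validate the stack files. After a deploy, a further check could confirm that every service reached its desired replica count. The following is a sketch under assumptions: `check_replicas` is a hypothetical helper, and it expects lines in the form produced by `docker service ls --format '{{.Name}} {{.Replicas}}'`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "name running/desired [extra...]" lines on stdin
# and report services whose running count differs from the desired count.
check_replicas() {
  local name replicas _rest bad=0
  while read -r name replicas _rest; do
    [[ -z "$name" ]] && continue
    # "3/3" -> running=3, desired=3; text such as "(max 1 per node)" lands in _rest.
    if [[ "${replicas%%/*}" != "${replicas##*/}" ]]; then
      printf 'NOT READY: %s %s\n' "$name" "$replicas"
      bad=1
    fi
  done
  return "$bad"
}

# Usage on a manager node (assumes the stack name iklimco):
#   docker service ls --filter label=com.docker.stack.namespace=iklimco \
#     --format '{{.Name}} {{.Replicas}}' | check_replicas
```

The function returns non-zero when any service is not fully converged, so it can gate a later pipeline step.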
|
|
||||||
|
|
||||||
## Step 11 — Database Proxies and Developer Access
|
|
||||||
|
|
||||||
In the production environment, the `pg-proxy` and `mongo-proxy` services (socat-based) defined in the base `docker-stack-infra.yml` are **deprecated and will not be used**.
|
|
||||||
|
|
||||||
### Rationale
|
|
||||||
- **Leader Tracking:** Simple L4 proxies (socat) cannot track the Patroni Leader or MongoDB Primary. They forward to a single service VIP, so during a failover a connection can land on a read-only replica.
|
|
||||||
- **HA Connection Strings:** Modern DB drivers (JDBC, libpq, MongoClient) support multi-host connection strings, which provide native failover and load balancing without an intermediate proxy.
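
As an illustration of the multi-host connection strings mentioned above, the sketch below builds a libpq-style PostgreSQL URI and a MongoDB replica-set URI. The host addresses, the database name `appdb`, and the replica set name `rs0` are placeholders chosen for the example, not values taken from this repo.

```bash
#!/usr/bin/env bash
# Hypothetical helpers that join several hosts into one connection URI.

# Usage: build_pg_uri <db> <host>...
build_pg_uri() {
  local db="$1"; shift
  local hosts="" h
  for h in "$@"; do
    hosts+="${hosts:+,}${h}:5432"
  done
  # target_session_attrs=read-write makes libpq skip hosts that only accept
  # read-only sessions, so the client ends up on the current primary.
  printf 'postgresql://%s/%s?target_session_attrs=read-write\n' "$hosts" "$db"
}

# Usage: build_mongo_uri <db> <replica-set-name> <host>...
build_mongo_uri() {
  local db="$1" rs="$2"; shift 2
  local hosts="" h
  for h in "$@"; do
    hosts+="${hosts:+,}${h}:27017"
  done
  # With replicaSet set, the driver discovers the primary itself.
  printf 'mongodb://%s/%s?replicaSet=%s\n' "$hosts" "$db" "$rs"
}

build_pg_uri appdb 10.20.20.11 10.20.20.12 10.20.20.13
build_mongo_uri appdb rs0 10.20.20.11 10.20.20.12 10.20.20.13
```

Credentials are deliberately left out; in practice they would come from the secret store rather than being embedded in the URI.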
|
|
||||||
|
|
||||||
### Developer Access Strategy
|
|
||||||
- **Direct Subnet Access:** Developers connect via WireGuard directly to the DB subnet (`10.20.20.0/24`).
|
|
||||||
- **No Port Translation:** The standard ports (`5432`, `27017`) are used on all cluster nodes instead of remapped ports such as `15432`.
|
|
||||||
|
|
||||||
## Placement and Replica Summary — prod
|
|
||||||
|
|
||||||
| Service | File | Replicas | Placement | HA Note |
|
|
||||||
| ---------------- | ------------ | -------- | ------------------------------------------- | ------------------------------------------------------------------------------------- |
|
|
||||||
| swag | base | 1 | `node.hostname == iklim-app-01` | No clustering support; Floating IP pinned to node |
|
|
||||||
| cert-reloader | base | 1 | `node.hostname == iklim-app-01` | Cron-style task; duplicate would be problematic |
|
|
||||||
| vault | prod overlay | 3 | `node.labels.type == service`; max 1/node | Raft cluster — see `07-vault-raft-plan.md` |
|
|
||||||
| apisix | prod overlay | 3 | `node.labels.type == service`; max 1/node | Stateless; config in Patroni etcd; rate limit policy:redis |
|
|
||||||
| apisix-dashboard | prod overlay | 3 | `node.labels.type == service`; max 1/node | Stateless; reads from etcd |
|
|
||||||
| redis (master) | prod overlay | 1 | `node.labels.type == service`; Swarm spread | Sentinel cluster master; not pinned — reschedules on node failure |
|
|
||||||
| redis-replica | prod overlay | 2 | `node.labels.type == service`; max 1/node | Sentinel replica; spread:hostname |
|
|
||||||
| redis-sentinel | prod overlay | 3 | `node.labels.type == service`; max 1/node | Quorum=2; failover automatic |
|
|
||||||
| rabbitmq | prod overlay | 3 | `node.labels.type == service`; max 1/node | Erlang cluster; quorum queues |
|
|
||||||
| prometheus | base | 1 | `node.labels.type == service` | No native HA; Thanos is overkill at this scale |
|
|
||||||
| grafana | base | 1 | `node.labels.type == service` | Not critical |
|
|
||||||
|
|
||||||
> PostgreSQL and MongoDB run in separate DB stacks on `iklimco-*` nodes. See `08-prod-db-cluster-kurulum.md`.
|
|
||||||
> etcd: 3-node cluster on DB nodes — APISIX shares it via `/apisix` prefix.
|
|
||||||
|
|||||||
@ -1,121 +1,83 @@
|
|||||||
# 07 — Vault: 3-Node Raft Cluster (Prod)
|
# 07 — Vault Raft Stack and Bootstrap Automation (Prod)
|
||||||
|
|
||||||
## Context
|
## Context
|
||||||
Vault starts directly as a 3-node Raft cluster in prod. The single-instance phase used in test is skipped.
|
|
||||||
|
|
||||||
Test used a single Vault instance (file storage, 1 replica on the manager node). Prod goes straight to Raft HA.
|
Production Vault is a 3-node Raft cluster, but it is no longer initialized through a manual post-deploy runbook.
|
||||||
|
|
||||||
## Vault service configuration
|
Current references:
|
||||||
|
|
||||||
- **Replicas:** 3 (one per service node)
|
- Setup source: `../../setup/09-prod-runner-ha-and-swarm.md`
|
||||||
- **Storage:** Raft integrated storage
|
- Stack file: root `docker-stack-vault.yml`
|
||||||
- **Placement:** `node.labels.type == service` (all 3 app nodes)
|
- Bootstrap script: root `init/vault/vault-bootstrap.sh`
|
||||||
- **Cert distribution:** No SSH needed — all nodes mount StorageBox, cert-reloader writes to `SWAG_CERT_DIR=/mnt/storagebox/ssl`, Vault reads from that path on every node
|
- Template: root `init/vault/vault-template-v2.json`
|
||||||
|
|
||||||
### Prerequisites
|
## Current Model
|
||||||
|
|
||||||
- [ ] All 3 service nodes are running and labeled `type=service`
|
Vault is deployed separately from `docker-stack-infra_db-prod.yml`.
|
||||||
- [ ] `/mnt/storagebox/ssl/` directory is mounted and accessible on all 3 app nodes
|
|
||||||
- [ ] Vault data directory `/opt/iklimco/vault/data/` exists on all 3 nodes (host path volumes)
|
|
||||||
|
|
||||||
### Vault service YAML (docker-stack-infra.prod.yml overlay)
|
The Vault stack uses:
|
||||||
|
|
||||||
```yaml
|
- 3 replicas, one per service node when placement allows it.
|
||||||
vault:
|
- Docker volumes such as `vault-data-vl` and `vault-logs-vl`.
|
||||||
# ... (image, secrets, healthcheck unchanged from base)
|
- `/opt/iklimco/ssl:/vault/certs:ro` for TLS certificates.
|
||||||
environment:
|
- `iklimco-net` as an external overlay network.
|
||||||
VAULT_LOCAL_CONFIG: >-
|
- `vault_unseal_key` as a Docker secret.
|
||||||
{"api_addr":"https://vault.iklim.co:8200",
|
|
||||||
"cluster_addr":"https://{{ .Node.Hostname }}:8201",
|
The production workflow calls `init-infra-prod.sh`, which calls `init/vault/vault-bootstrap.sh`. The bootstrap script handles stack deploy, initialization, unseal key secret rotation, peer join, and peer unseal.
|
||||||
"storage":{"raft":{"path":"/vault/file","node_id":"{{ .Node.Hostname }}"}},
|
|
||||||
"listener":[{"tcp":{"address":"0.0.0.0:8200",
|
## Certificate Flow
|
||||||
"tls_cert_file":"/vault/certs/STAR.iklim.co.full.crt",
|
|
||||||
"tls_key_file":"/vault/certs/STAR.iklim.co_key.pem"}}],
|
Vault does not read TLS certificates directly from `/mnt/storagebox/ssl`.
|
||||||
"default_lease_ttl":"168h","max_lease_ttl":"720h","ui":true}
|
|
||||||
volumes:
|
The current flow is:
|
||||||
- /opt/iklimco/vault/data:/vault/file # host path per node
|
|
||||||
- ${SWAG_CERT_DIR}:/vault/certs:ro # StorageBox — shared across all nodes, no SSH distribution needed
|
```text
|
||||||
deploy:
|
SWAG renews certificate
|
||||||
mode: replicated
|
cert-reloader copies renewed files to /mnt/storagebox/ssl
|
||||||
replicas: 3
|
cert-distributor syncs certificate files to /opt/iklimco/ssl on service nodes
|
||||||
placement:
|
Vault reads /opt/iklimco/ssl through the /vault/certs mount
|
||||||
max_replicas_per_node: 1
|
|
||||||
constraints:
|
|
||||||
- node.labels.type == service
|
|
||||||
```
|
```
|
||||||
|
|
||||||
> `{{ .Node.Hostname }}` is Docker Swarm's Go template for the node hostname —
|
## Bootstrap Flow
|
||||||
> gives each Vault instance a unique `node_id`.
|
|
||||||
|
|
||||||
## Raft initialization procedure (first deploy)
|
Normal production bootstrap is automated:
|
||||||
|
|
||||||
### Step 1 — Deploy the stack
|
1. Create or refresh the placeholder `vault_unseal_key` secret when needed.
|
||||||
|
2. Deploy `docker-stack-vault.yml`.
|
||||||
|
3. Initialize Vault with a single key share and a key threshold of one, if it is not already initialized.
|
||||||
|
4. Replace the placeholder `vault_unseal_key` secret with the real unseal key.
|
||||||
|
5. Unseal the leader.
|
||||||
|
6. Join peers to the Raft cluster.
|
||||||
|
7. Unseal peers.
|
||||||
|
8. Verify Raft peers and service health.
|
||||||
|
|
||||||
|
These operations belong to `vault-bootstrap.sh`, not to a manual operator checklist.
|
||||||
|
|
||||||
|
## Verification
|
||||||
|
|
||||||
|
Use the current setup verification flow:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco
|
docker service ps iklimco_vault
|
||||||
|
docker exec $(docker ps -q -f name=iklimco_vault | head -1) vault status
|
||||||
|
docker exec $(docker ps -q -f name=iklimco_vault | head -1) vault operator raft list-peers
|
||||||
```
|
```
|
||||||
|
|
||||||
All 3 Vault containers start. Only the first one to initialize becomes the leader.
|
Expected state:
|
||||||
|
|
||||||
### Step 2 — Initialize Vault on the leader (iklim-app-01)
|
- Vault service has 3 running tasks.
|
||||||
|
- `vault status` reports `Sealed false`.
|
||||||
|
- Raft list shows one leader and two followers.
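
The expected Raft state can also be checked mechanically. This is a sketch only: `check_raft_peers` is a hypothetical helper, and it assumes the usual four-column table (`Node`, `Address`, `State`, `Voter`) printed by `vault operator raft list-peers`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `vault operator raft list-peers` output on stdin
# and succeed only when it shows exactly one leader and two followers.
check_raft_peers() {
  local counts leaders followers
  # The third column holds the peer state; header and dash lines do not match.
  counts=$(awk '$3 == "leader" {l++} $3 == "follower" {f++} END {print l+0, f+0}')
  leaders="${counts% *}"
  followers="${counts#* }"
  printf 'leaders=%s followers=%s\n' "$leaders" "$followers"
  [[ "$leaders" -eq 1 && "$followers" -eq 2 ]]
}

# Usage on a manager node:
#   docker exec "$(docker ps -q -f name=iklimco_vault | head -1)" \
#     vault operator raft list-peers | check_raft_peers
```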
|
||||||
|
|
||||||
```bash
|
## Historical / Superseded by Setup
|
||||||
VAULT_CTR=$(docker ps -q -f name=iklimco_vault)
|
|
||||||
docker exec -it "$VAULT_CTR" vault operator init
|
|
||||||
```
|
|
||||||
|
|
||||||
Save the unseal keys and root token securely. Store the unseal key as a Docker secret:
|
The previous manual procedure is superseded:
|
||||||
|
|
||||||
```bash
|
- Deploying Vault through `docker-stack-infra.yml` + `docker-stack-infra.prod.yml`.
|
||||||
echo -n "<unseal-key>" | docker secret create vault_unseal_key -
|
- Creating `/opt/iklimco/vault/data` host-path directories on each app node.
|
||||||
```
|
- Running `vault operator init` manually.
|
||||||
|
- Manually copying/storing unseal keys.
|
||||||
|
- Manually running `vault operator raft join` on peers.
|
||||||
|
- Manually unsealing each peer after join.
|
||||||
|
|
||||||
### Step 3 — Unseal the leader
|
Keep those notes only as historical context. For current prod, use `docker-stack-vault.yml` and `init/vault/vault-bootstrap.sh`.
|
||||||
|
|
||||||
```bash
|
|
||||||
docker exec -it "$VAULT_CTR" vault operator unseal
|
|
||||||
```
|
|
||||||
|
|
||||||
The healthcheck auto-unseals on subsequent restarts via the `vault_unseal_key` secret.
|
|
||||||
|
|
||||||
### Step 4 — Join remaining nodes to the Raft cluster
|
|
||||||
|
|
||||||
On iklim-app-02 and iklim-app-03 containers:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
docker exec -it <vault-on-iklim-app-02> vault operator raft join \
|
|
||||||
https://vault.iklim.co:8200
|
|
||||||
|
|
||||||
docker exec -it <vault-on-iklim-app-03> vault operator raft join \
|
|
||||||
https://vault.iklim.co:8200
|
|
||||||
```
|
|
||||||
|
|
||||||
Unseal each node after joining:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
docker exec -it <vault-on-iklim-app-02> vault operator unseal
|
|
||||||
docker exec -it <vault-on-iklim-app-03> vault operator unseal
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 5 — Verify cluster
|
|
||||||
|
|
||||||
```bash
|
|
||||||
docker exec "$VAULT_CTR" vault operator raft list-peers
|
|
||||||
```
|
|
||||||
|
|
||||||
Expected: 3 peers, one `leader`, two `follower`.
|
|
||||||
|
|
||||||
## cert-reloader — no additional changes needed for Raft
|
|
||||||
|
|
||||||
cert-reloader writes the cert to `SWAG_CERT_DIR=/mnt/storagebox/ssl`.
|
|
||||||
Since StorageBox is mounted on all app nodes, every Vault instance already sees the same path.
|
|
||||||
|
|
||||||
The cert renewal flow works unchanged with Raft:
|
|
||||||
```
|
|
||||||
cert changed → copy to /mnt/storagebox/ssl/ → docker service update --force iklimco_vault
|
|
||||||
Vault (3 replicas) restart → each auto-unseals via healthcheck
|
|
||||||
```
|
|
||||||
|
|
||||||
## Reference
|
|
||||||
- Vault Raft storage docs: https://developer.hashicorp.com/vault/docs/configuration/storage/raft
|
|
||||||
- Vault Swarm setup: https://manjit28.medium.com/setting-up-a-secure-and-highly-available-hashicorp-vault-cluster-for-secrets-and-certificates-0ce01a370582
|
|
||||||
|
|||||||
@ -1,24 +1,23 @@
|
|||||||
# Setup Phases: Roadmap Mapping Table
|
# Setup Phases: Roadmap Mapping Table
|
||||||
|
|
||||||
This table shows, for each roadmap step in the `roadmap/test-env` and `roadmap/prod-env` folders,
|
This table shows which Terraform/Ansible setup phase handles each roadmap step in the `roadmap/test-env` and `roadmap/prod-env` folders.
|
||||||
which Terraform/Ansible setup phase handles it.
|
|
||||||
|
|
||||||
## TEST environment
|
## TEST environment
|
||||||
|
|
||||||
| Roadmap step | Phase that should handle it |
|
| Roadmap step | Phase that should handle it |
|
||||||
| --- | --- |
|
| --- | --- |
|
||||||
| Hetzner firewall (only 22/80/443) | **Terraform `02-test-terraform-iaac.md`**: `firewall.tf` |
|
| Hetzner firewall (only 22/80/443) | **Terraform `02-test-terraform-iac.md`**: `firewall.tf` |
|
||||||
| Server creation (`iklim-app-01`, `iklim-db-01`) | **Terraform `02-test-terraform-iaac.md`**: `servers.tf` |
|
| Server creation (`iklim-app-01`, `iklim-db-01`) | **Terraform `02-test-terraform-iac.md`**: `servers.tf` |
|
||||||
| Private network + placement group (`iklim-test-spread`) | **Terraform `02-test-terraform-iaac.md`**: `network.tf`, `placement.tf` |
|
| Private network + placement group (`iklim-test-spread`) | **Terraform `02-test-terraform-iac.md`**: `network.tf`, `placement.tf` |
|
||||||
| Floating IP (`iklim-test-app-fip`) | **Terraform `02-test-terraform-iaac.md`**: `floating_ip.tf` |
|
| Floating IP (`iklim-test-app-fip`) | **Terraform `02-test-terraform-iac.md`**: `floating_ip.tf` |
|
||||||
| Docker Engine installation (app + db node) | **Ansible `03-test-ansible-bootstrap.md`**: `docker` role |
|
| Docker Engine installation (app + db node) | **Ansible `03-test-ansible-bootstrap.md`**: `docker` role |
|
||||||
| Security hardening (SSH, firewalld, fail2ban) | **Ansible `03-test-ansible-bootstrap.md`**: `hardening` role |
|
| Security hardening (SSH, firewalld, fail2ban) | **Ansible `03-test-ansible-bootstrap.md`**: `hardening` role |
|
||||||
| Docker Swarm init + `iklim-db-01` worker join | **Ansible `03-test-ansible-bootstrap.md`**: `swarm` role |
|
| Docker Swarm init + `iklim-db-01` worker join | **Ansible `03-test-ansible-bootstrap.md`**: `swarm` role |
|
||||||
| `type=service` and `role=db` node labels | **Ansible `03-test-ansible-bootstrap.md`**: `swarm` role |
|
| `type=service` and `role=db` node labels | **Ansible `03-test-ansible-bootstrap.md`**: `swarm` role |
|
||||||
| `/opt/iklimco/...` directories | **Ansible `03-test-ansible-bootstrap.md`**: `node_dirs` role |
|
| `/opt/iklimco/...` directories | **Ansible `03-test-ansible-bootstrap.md`**: `node_dirs` role |
|
||||||
| StorageBox DAVFS mount (`u469968-sub4`) | **Ansible `03-test-ansible-bootstrap.md`**: `storagebox` role |
|
| StorageBox DAVFS mount (`u469968-sub4`) | **Ansible `03-test-ansible-bootstrap.md`**: `storagebox` role |
|
||||||
| DB stack deploy (PostgreSQL + MongoDB on `iklim-db-01`) | **Manual `04-test-db-docker-kurulum.md`** |
|
| DB stack deploy (PostgreSQL + MongoDB on `iklim-db-01`) | **Manual `04-test-db-docker-setup.md`** |
|
||||||
| `act_runner` systemd installation | **Ansible `05-test-runner-ve-deploy-onkosullari.md`**: `act_runner` role (`test-app-post-stack.yml`) |
|
| `act_runner` systemd installation | **Ansible `05-test-runner-and-deploy-prerequisites.md`**: `act_runner` role (`test-app-post-stack.yml`) |
|
||||||
| Uploading GoDaddy credentials to the StorageBox | **Stays manual**: secret management, outside Terraform/Ansible |
|
| Uploading GoDaddy credentials to the StorageBox | **Stays manual**: secret management, outside Terraform/Ansible |
|
||||||
| `docker-stack-infra.yml` port removal + adding SWAG/cert-reloader | **Pipeline `deploy-test.yml`** + **repo change**: `roadmap/test-env/03` |
|
| `docker-stack-infra.yml` port removal + adding SWAG/cert-reloader | **Pipeline `deploy-test.yml`** + **repo change**: `roadmap/test-env/03` |
|
||||||
| SWAG nginx proxy confs (`template/swag/site-confs/*.conf.tpl`) | **Delivered in the repo**: `roadmap/test-env/04` |
|
| SWAG nginx proxy confs (`template/swag/site-confs/*.conf.tpl`) | **Delivered in the repo**: `roadmap/test-env/04` |
|
||||||
@ -31,22 +30,22 @@ Terraform/Ansible setup aşamalarından hangisinde ele alındığını gösterir
|
|||||||
|
|
||||||
| Roadmap step | Phase that should handle it |
|
| Roadmap step | Phase that should handle it |
|
||||||
| --- | --- |
|
| --- | --- |
|
||||||
| Creating 6 servers (`iklim-app-01/02/03`, `iklim-db-01/02/03`) | **Terraform `06-prod-terraform-iaac.md`**: `servers.tf` |
|
| Creating 6 servers (`iklim-app-01/02/03`, `iklim-db-01/02/03`) | **Terraform `06-prod-terraform-iac.md`**: `servers.tf` |
|
||||||
| Private network + 2 placement groups | **Terraform `06-prod-terraform-iaac.md`**: `network.tf`, `placement.tf` |
|
| Private network + 2 placement groups | **Terraform `06-prod-terraform-iac.md`**: `network.tf`, `placement.tf` |
|
||||||
| Firewall (only 22/80/443 public; private port matrix) | **Terraform `06-prod-terraform-iaac.md`**: `firewall.tf` |
|
| Firewall (only 22/80/443 public; private port matrix) | **Terraform `06-prod-terraform-iac.md`**: `firewall.tf` |
|
||||||
| Floating IP (`iklim-prod-app-fip`, assigned to `iklim-app-01`) | **Terraform `06-prod-terraform-iaac.md`**: `floating_ip.tf` |
|
| Floating IP (`iklim-prod-app-fip`, assigned to `iklim-app-01`) | **Terraform `06-prod-terraform-iac.md`**: `floating_ip.tf` |
|
||||||
| Docker Engine installation (all nodes, app and db) | **Ansible `07-prod-ansible-bootstrap.md`**: `docker` role |
|
| Docker Engine installation (all nodes, app and db) | **Ansible `07-prod-ansible-bootstrap.md`**: `docker` role |
|
||||||
| Security hardening (all nodes) | **Ansible `07-prod-ansible-bootstrap.md`**: `hardening` role |
|
| Security hardening (all nodes) | **Ansible `07-prod-ansible-bootstrap.md`**: `hardening` role |
|
||||||
| Swarm init (`iklim-app-01`) + manager join (`iklim-app-02/03`) | **Ansible `07-prod-ansible-bootstrap.md`**: `swarm` role |
|
| Swarm init (`iklim-app-01`) + manager join (`iklim-app-02/03`) | **Ansible `07-prod-ansible-bootstrap.md`**: `swarm` role |
|
||||||
| `type=service` node label (3 app nodes) | **Ansible `07-prod-ansible-bootstrap.md`**: `swarm` role |
|
| `type=service` node label (3 app nodes) | **Ansible `07-prod-ansible-bootstrap.md`**: `swarm` role |
|
||||||
| `/opt/iklimco/...` directories + `/opt/iklimco/stacks` | **Ansible `07-prod-ansible-bootstrap.md`**: `node_dirs` role |
|
| `/opt/iklimco/...` directories + `/opt/iklimco/stacks` | **Ansible `07-prod-ansible-bootstrap.md`**: `node_dirs` role |
|
||||||
| StorageBox DAVFS mount (`u469968-sub5`) | **Ansible `07-prod-ansible-bootstrap.md`**: `storagebox` role |
|
| StorageBox DAVFS mount (`u469968-sub5`) | **Ansible `07-prod-ansible-bootstrap.md`**: `storagebox` role |
|
||||||
| Join the DB nodes to the Swarm as workers | **Manual `08-prod-db-cluster-kurulum.md`**: Section 2 |
|
| Join the DB nodes to the Swarm as workers | **Manual `08-prod-db-cluster-setup.md`**: Section 2 |
|
||||||
| `role=db` node label (3 db nodes) | **Manual `08-prod-db-cluster-kurulum.md`**: Section 2 |
|
| `role=db` node label (3 db nodes) | **Manual `08-prod-db-cluster-setup.md`**: Section 2 |
|
||||||
| etcd cluster deploy (for Patroni) | **Manual `08-prod-db-cluster-kurulum.md`**: Section 5.2 |
|
| etcd cluster deploy (for Patroni) | **Manual `08-prod-db-cluster-setup.md`**: Section 5.2 |
|
||||||
| MongoDB replica set deploy | **Manual `08-prod-db-cluster-kurulum.md`**: Section 4 |
|
| MongoDB replica set deploy | **Manual `08-prod-db-cluster-setup.md`**: Section 4 |
|
||||||
| Patroni + PostgreSQL HA deploy | **Manual `08-prod-db-cluster-kurulum.md`**: Section 5.4 |
|
| Patroni + PostgreSQL HA deploy | **Manual `08-prod-db-cluster-setup.md`**: Section 5.4 |
|
||||||
| 3× `act_runner` systemd (HA runner) | **Ansible `09-prod-runner-ha-ve-swarm.md`**: `act_runner` role |
|
| 3× `act_runner` systemd (HA runner) | **Ansible `09-prod-runner-ha-and-swarm.md`**: `act_runner` role |
|
||||||
| Uploading GoDaddy credentials to the StorageBox | **Stays manual**: secret management, outside Terraform/Ansible |
|
| Uploading GoDaddy credentials to the StorageBox | **Stays manual**: secret management, outside Terraform/Ansible |
|
||||||
| `docker-stack-infra.yml` port removal + adding SWAG/cert-reloader | **Repo change**: `roadmap/prod-env/03` |
|
| `docker-stack-infra.yml` port removal + adding SWAG/cert-reloader | **Repo change**: `roadmap/prod-env/03` |
|
||||||
| SWAG nginx proxy confs (`template/swag/site-confs/*.conf.tpl`) | **Delivered in the repo**: `roadmap/prod-env/04` |
|
||||||
@ -61,16 +60,16 @@ Terraform/Ansible setup aşamalarından hangisinde ele alındığını gösterir
|
|||||||
```
|
```
|
||||||
Environment_Infrastructure/
|
Environment_Infrastructure/
|
||||||
setup/ ← Terraform + Ansible phase documents
|
setup/ ← Terraform + Ansible phase documents
|
||||||
00-genel-yol-haritasi.md
|
00-general-roadmap.md
|
||||||
01-private-network-port-matrisi.md
|
01-private-network-port-matrix.md
|
||||||
02-test-terraform-iaac.md
|
02-test-terraform-iac.md
|
||||||
03-test-ansible-bootstrap.md
|
03-test-ansible-bootstrap.md
|
||||||
04-test-db-docker-kurulum.md
|
04-test-db-docker-setup.md
|
||||||
05-test-runner-ve-deploy-onkosullari.md
|
05-test-runner-and-deploy-prerequisites.md
|
||||||
06-prod-terraform-iaac.md
|
06-prod-terraform-iac.md
|
||||||
07-prod-ansible-bootstrap.md
|
07-prod-ansible-bootstrap.md
|
||||||
08-prod-db-cluster-kurulum.md
|
08-prod-db-cluster-setup.md
|
||||||
09-prod-runner-ha-ve-swarm.md
|
09-prod-runner-ha-and-swarm.md
|
||||||
roadmap/
|
roadmap/
|
||||||
test-env/ ← Test environment roadmap steps
|
test-env/ ← Test environment roadmap steps
|
||||||
prod-env/ ← Prod roadmap steps
|
prod-env/ ← Prod roadmap steps
|
||||||
|
|||||||
@ -43,9 +43,9 @@ Minimum topology for the test environment:
|
|||||||
| Node | Role | Note |
|
| Node | Role | Note |
|
||||||
| --- | --- | --- |
|
| --- | --- | --- |
|
||||||
| `iklim-app-01` | Swarm manager + app worker + Gitea runner | CI/CD test deploy runs through this node |
|
| `iklim-app-01` | Swarm manager + app worker + Gitea runner | CI/CD test deploy runs through this node |
|
||||||
| `iklim-db-01` | DB node | DB infrastructure will be installed manually; it will not be installed by Gitea CI/CD |
|
| `iklim-db-01` | DB node / Swarm worker | DB host prerequisites are prepared by Ansible; DB services are deployed as Swarm services by the environment stack/pipeline |
|
||||||
|
|
||||||
The test DB setup is brought only up to machine and OS preparation with Terraform/Ansible. PostgreSQL/MongoDB cluster installation is outside this phase.
|
The test DB setup is brought up to OS, Docker, Swarm worker, config directory, and WireGuard preparation with Terraform/Ansible. PostgreSQL/MongoDB runtime services are not installed directly on the OS; they run as Docker Swarm services.
|
||||||
|
|
||||||
### Prod
|
### Prod
|
||||||
|
|
||||||
@ -56,23 +56,25 @@ HA topology for the prod environment:
|
|||||||
| `iklim-app-*` | 3 | Each one is a Swarm manager + app worker |
|
| `iklim-app-*` | 3 | Each one is a Swarm manager + app worker |
|
||||||
| `iklim-db-*` | 3 | DB cluster nodes |
|
| `iklim-db-*` | 3 | DB cluster nodes |
|
||||||
|
|
||||||
Prod DB infrastructure will be installed manually; it will not be installed by Gitea CI/CD. Terraform prepares the DB machines and network/firewall rules; Ansible installs OS hardening and base dependencies.
|
Prod DB host prerequisites are prepared by Terraform/Ansible. Runtime DB services are part of the current prod Swarm stack: etcd, Patroni/PostgreSQL, and MongoDB replica set are deployed by the prod root pipeline through `docker-stack-infra_db-prod.yml`.
|
||||||
|
|
||||||
## Public Port Policy
|
## Public Port Policy
|
||||||
|
|
||||||
Ports open to the public internet are only:
|
Ports open to the public internet are normally only:
|
||||||
|
|
||||||
- `22/tcp` SSH, only from admin IP/CIDR sources
|
- `22/tcp` SSH, only from admin IP/CIDR sources
|
||||||
- `80/tcp` HTTP
|
- `80/tcp` HTTP
|
||||||
- `443/tcp` HTTPS
|
- `443/tcp` HTTPS
|
||||||
|
|
||||||
|
Test has one explicit exception: `51820/udp` is opened on the DB node for WireGuard VPN, authenticated cryptographically. Prod currently does not expose `51820/udp` in Terraform.
|
||||||
|
|
||||||
`8200/tcp` Vault will not be opened to the public internet. Vault must be reachable only from the private network or Docker overlay.
|
`8200/tcp` Vault will not be opened to the public internet. Vault must be reachable only from the private network or Docker overlay.
|
||||||
|
|
||||||
`docker-stack-infra.yml` has been aligned with this policy: only the SWAG service publishes ports 80/443; all other services such as Vault, APISIX, RabbitMQ, Prometheus, and Grafana are reachable only through the `iklimco-net` overlay.
|
Current prod stack behavior is aligned with this policy: `docker-stack-infra_db-prod.yml` publishes public traffic through SWAG on 80/443. Vault is deployed separately by `vault-bootstrap.sh` using `docker-stack-vault.yml`; it is not publicly exposed.
|
||||||
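The public port policy above can be read as a small allowlist. The following is a minimal sketch of that reading, not code from the repository: the function name `is_public_port_allowed` and its argument format are assumptions made for illustration.

```bash
# Hypothetical helper (not part of the repo): encodes the public ingress
# policy described above as an allowlist lookup.
# Usage: is_public_port_allowed <env> <port/proto>
is_public_port_allowed() {
  env_name="$1"
  port="$2"
  case "$port" in
    22/tcp|80/tcp|443/tcp)
      # 22/tcp is additionally limited to admin IP/CIDR sources by the firewall.
      echo "allowed" ;;
    51820/udp)
      # WireGuard is the explicit test-only exception (DB node).
      if [ "$env_name" = "test" ]; then echo "allowed"; else echo "denied"; fi ;;
    *)
      # Everything else, including Vault 8200/tcp, stays closed publicly.
      echo "denied" ;;
  esac
}

is_public_port_allowed test 51820/udp   # prints "allowed"
is_public_port_allowed prod 51820/udp   # prints "denied"
is_public_port_allowed prod 8200/tcp    # prints "denied"
```

A check like this could be used in a review script to flag a firewall rule that opens a port outside the documented policy.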
|
|
||||||
## Private Network Policy
|
## Private Network Policy
|
||||||
|
|
||||||
The detailed matrix of ports that must be opened inside the private network is in `01-private-network-port-matrisi.md`. Agents must treat that file as the source when writing firewall or Ansible UFW rules.
|
The detailed matrix of ports that must be opened inside the private network is in `01-private-network-port-matrix.md`. Agents must treat that file as the source when writing Terraform Hetzner firewall rules and Ansible `firewalld` rules.
|
||||||
|
|
||||||
## Gitea Actions Runner Decision
|
## Gitea Actions Runner Decision
|
||||||
|
|
||||||
@ -1,8 +1,8 @@
|
|||||||
# 07 - Private Network Port Matrix
|
# 01 - Private Network Port Matrix
|
||||||
|
|
||||||
This file defines the ports that must be opened inside the Hetzner private network for test and prod environments. Ports open to the public internet will only be `22/tcp`, `80/tcp`, and `443/tcp`. Vault `8200/tcp` will not be opened publicly.
|
This file defines the ports that must be opened inside the Hetzner private network for test and prod environments. Public ingress is limited to `22/tcp`, `80/tcp`, and `443/tcp`, with one current test-only exception: `51820/udp` is public on the test DB node for WireGuard. Vault `8200/tcp` will not be opened publicly.
|
||||||
|
|
||||||
This matrix must be treated as the source for Terraform Hetzner firewall and Ansible UFW rules.
|
This matrix must be treated as the source for Terraform Hetzner firewall and Ansible `firewalld` rules.
|
||||||
|
|
||||||
## Network Plan
|
## Network Plan
|
||||||
|
|
||||||
@ -11,25 +11,25 @@ This matrix must be treated as the source for Terraform Hetzner firewall and Ans
|
|||||||
| Subnet | CIDR | Purpose |
|
| Subnet | CIDR | Purpose |
|
||||||
| --- | --- | --- |
|
| --- | --- | --- |
|
||||||
| App/Swarm | `10.10.10.0/24` | `iklim-app-01` |
|
| App/Swarm | `10.10.10.0/24` | `iklim-app-01` |
|
||||||
| DB | `10.10.20.0/24` | `test-db-01` |
|
| DB | `10.10.20.0/24` | `iklim-db-01` |
|
||||||
|
|
||||||
### Prod
|
### Prod
|
||||||
|
|
||||||
| Subnet | CIDR | Purpose |
|
| Subnet | CIDR | Purpose |
|
||||||
| --- | --- | --- |
|
| --- | --- | --- |
|
||||||
| App/Swarm | `10.20.10.0/24` | `iklim-app-01/02/03` |
|
| App/Swarm | `10.20.10.0/24` | `iklim-app-01/02/03` |
|
||||||
| DB | `10.20.20.0/24` | `prod-db-01/02/03` |
|
| DB | `10.20.20.0/24` | `iklim-db-01/02/03` |
|
||||||
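The test and prod subnet plans above can be sketched as a lookup from a private IP to its environment and role. This is an illustrative helper only; the name `subnet_role` is an assumption, and the prefix match works solely because every planned subnet is a `/24`.

```bash
# Hypothetical helper: maps a private IP to its environment and subnet role.
# A plain prefix match is enough only because every planned subnet is a /24.
subnet_role() {
  case "$1" in
    10.10.10.*) echo "test app/swarm" ;;
    10.10.20.*) echo "test db" ;;
    10.20.10.*) echo "prod app/swarm" ;;
    10.20.20.*) echo "prod db" ;;
    *)          echo "unknown" ;;
  esac
}

subnet_role 10.10.20.11   # prints "test db"
subnet_role 10.20.10.11   # prints "prod app/swarm"
subnet_role 192.168.1.10  # prints "unknown"
```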
|
|
||||||
## Public Ingress Standard
|
## Public Ingress Standard
|
||||||
|
|
||||||
Public ingress for all environments:
|
Public ingress:
|
||||||
|
|
||||||
| Port | Protocol | Source | Target | Requirement |
|
| Port | Protocol | Source | Target | Requirement |
|
||||||
| --- | --- | --- | --- | --- |
|
| --- | --- | --- | --- | --- |
|
||||||
| `22` | TCP | Admin IP/CIDR | All nodes | SSH management |
|
| `22` | TCP | Admin IP/CIDR | All nodes | SSH management |
|
||||||
| `80` | TCP | Internet | `iklim-app-01` (gateway) | HTTP / ACME redirect |
|
| `80` | TCP | Internet | `iklim-app-01` (gateway) | HTTP / ACME redirect |
|
||||||
| `443` | TCP | Internet | `iklim-app-01` (gateway) | HTTPS |
|
| `443` | TCP | Internet | `iklim-app-01` (gateway) | HTTPS |
|
||||||
| `51820` | UDP | `0.0.0.0/0`, `::/0` | `iklim-db-01` (DB node) | WireGuard VPN — authentication with cryptographic key |
|
| `51820` | UDP | `0.0.0.0/0`, `::/0` | `iklim-db-01` in test only | WireGuard VPN — authentication with cryptographic key |
|
||||||
|
|
||||||
Critical ports that will not be opened publicly:
|
Critical ports that will not be opened publicly:
|
||||||
|
|
||||||
@ -80,11 +80,11 @@ These ports will not be opened publicly. Access will be allowed only from requir
|
|||||||
| `9090` | TCP | Prometheus UI/API | Admin CIDR or private ops | Prometheus service/node | Public closed |
|
| `9090` | TCP | Prometheus UI/API | Admin CIDR or private ops | Prometheus service/node | Public closed |
|
||||||
| `3000` | TCP | Grafana UI | Admin CIDR or private ops | Grafana service/node | Public closed |
|
| `3000` | TCP | Grafana UI | Admin CIDR or private ops | Grafana service/node | Public closed |
|
||||||
|
|
||||||
`docker-stack-infra.yml` has been updated so that only the SWAG service publishes ports 80/443 in host mode. All other services contain no published ports; access is provided only through the `iklimco-net` overlay. This table remains the source for private ingress decisions.
|
The current prod root stack is `docker-stack-infra_db-prod.yml`; Vault is deployed separately with `docker-stack-vault.yml` through `vault-bootstrap.sh`. Public traffic is expected to enter through SWAG on 80/443. Private service reachability is provided by the `iklimco-net` overlay and by the explicit host-mode DB/cluster ports listed below.
|
||||||
|
|
||||||
## DB Node Ports
|
## DB Node Ports
|
||||||
|
|
||||||
Because DB infrastructure will be installed manually, the exact cluster technology is outside this document. Still, the default ports for firewall purposes are below.
|
DB runtime services are deployed as Docker Swarm services. Prod currently uses Patroni/PostgreSQL, etcd, and a MongoDB replica set in `docker-stack-infra_db-prod.yml`; the required firewall ports are below.
|
||||||
|
|
||||||
### PostgreSQL / PostGIS (Patroni + etcd)
|
### PostgreSQL / PostGIS (Patroni + etcd)
|
||||||
|
|
||||||
@ -129,7 +129,7 @@ App subnet (swarm firewall) — traffic inside itself:
|
|||||||
| Source | Target | Ports |
|
| Source | Target | Ports |
|
||||||
| --- | --- | --- |
|
| --- | --- | --- |
|
||||||
| `10.20.10.0/24` | `10.20.10.0/24` | `2377/tcp`, `7946/tcp`, `7946/udp`, `4789/udp` (Swarm) |
|
| `10.20.10.0/24` | `10.20.10.0/24` | `2377/tcp`, `7946/tcp`, `7946/udp`, `4789/udp` (Swarm) |
|
||||||
| `10.20.10.0/24` | `10.20.10.0/24` | `8200/tcp`, `6379/tcp`, `5672/tcp`, `61613/tcp`, `15674/tcp`, `2379/tcp` (application services) |
|
| `10.20.10.0/24` | `10.20.10.0/24` | `8200/tcp`, `5672/tcp`, `61613/tcp`, `15674/tcp` (application services) |
|
||||||
| Admin CIDR or VPN | `10.20.10.0/24` | `15672/tcp`, `9180/tcp`, `9090/tcp`, `3000/tcp` |
|
| Admin CIDR or VPN | `10.20.10.0/24` | `15672/tcp`, `9180/tcp`, `9090/tcp`, `3000/tcp` |
|
||||||
|
|
||||||
App -> DB traffic (there is no related rule in the swarm firewall; it is allowed in the db firewall):
|
App -> DB traffic (there is no related rule in the swarm firewall; it is allowed in the db firewall):
|
||||||
@ -157,7 +157,7 @@ DB -> App traffic (allowed in the swarm firewall):
|
|||||||
|
|
||||||
- The public firewall does not open `8200/tcp`.
|
- The public firewall does not open `8200/tcp`.
|
||||||
- DB ports are not open publicly.
|
- DB ports are not open publicly.
|
||||||
- Swarm ports are open only inside the private app/swarm subnet.
|
- Swarm ports are open only between the app/Swarm and DB subnets.
|
||||||
- The App/Swarm subnet reaches the DB subnet only through required DB ports.
|
- The App/Swarm subnet reaches the DB subnet only through required DB ports.
|
||||||
- The DB subnet is not opened to the app subnet with broad permissions.
|
- The DB subnet is not opened to the app subnet with broad permissions.
|
||||||
- Admin UI ports are restricted through admin CIDR/VPN/private ops instead of public access.
|
- Admin UI ports are restricted through admin CIDR/VPN/private ops instead of public access.
|
||||||
@ -11,8 +11,8 @@ Terraform creates the following in the test environment:
|
|||||||
- App/Swarm subnet: `10.10.10.0/24`
|
- App/Swarm subnet: `10.10.10.0/24`
|
||||||
- DB subnet: `10.10.20.0/24`
|
- DB subnet: `10.10.20.0/24`
|
||||||
- Firewall:
|
- Firewall:
|
||||||
- Public ingress: only `22/tcp`, `80/tcp`, `443/tcp`
|
- Public ingress: `22/tcp`, `80/tcp`, `443/tcp`, plus test DB WireGuard `51820/udp`
|
||||||
- Private ingress: test rules in `01-private-network-port-matrisi.md`
|
- Private ingress: test rules in `01-private-network-port-matrix.md`
|
||||||
- SSH key
|
- SSH key
|
||||||
- Placement group: `iklim-test-spread`
|
- Placement group: `iklim-test-spread`
|
||||||
- Floating IP: stable IPv4 for the swarm entry point
|
- Floating IP: stable IPv4 for the swarm entry point
|
||||||
@ -21,7 +21,7 @@ Terraform creates the following in the test environment:
|
|||||||
- `iklim-db-01`
|
- `iklim-db-01`
|
||||||
- Ansible inventory output
|
- Ansible inventory output
|
||||||
|
|
||||||
Terraform does not install DB software. The DB node is prepared only at the machine, network, and firewall level.
|
Terraform does not install DB software. The DB node is prepared at the machine, network, and firewall level; Ansible later prepares Docker, Swarm worker membership, DB config directories, and WireGuard.
|
||||||
|
|
||||||
## Recommended File Structure
|
## Recommended File Structure
|
||||||
|
|
||||||
@ -69,7 +69,7 @@ The server type decision is based on the current test environment metrics in `..
|
|||||||
| Server | Private IP | Role |
|
| Server | Private IP | Role |
|
||||||
| --- | --- | --- |
|
| --- | --- | --- |
|
||||||
| `iklim-app-01` | `10.10.10.11` | Swarm manager + app worker + Gitea runner |
|
| `iklim-app-01` | `10.10.10.11` | Swarm manager + app worker + Gitea runner |
|
||||||
| `iklim-db-01` | `10.10.20.11` | DB node prepared for manual DB installation |
|
| `iklim-db-01` | `10.10.20.11` | DB node / Swarm worker for DB services |
|
||||||
|
|
||||||
Private IPs must be statically defined inside Terraform. Ansible inventory and firewall rules remain deterministic.
|
Private IPs must be statically defined inside Terraform. Ansible inventory and firewall rules remain deterministic.
|
||||||
|
|
||||||
@ -91,7 +91,7 @@ Public ingress:
|
|||||||
| `80/tcp` | `0.0.0.0/0`, `::/0` | `iklim-app-01` |
|
| `80/tcp` | `0.0.0.0/0`, `::/0` | `iklim-app-01` |
|
||||||
| `443/tcp` | `0.0.0.0/0`, `::/0` | `iklim-app-01` |
|
| `443/tcp` | `0.0.0.0/0`, `::/0` | `iklim-app-01` |
|
||||||
|
|
||||||
For public ingress, `8200/tcp`, `5432/tcp`, `27017/tcp`, `5672/tcp`, `15672/tcp`, `6379/tcp`, `2379/tcp`, `9000/tcp`, `9180/tcp`, `9090/tcp`, and `3000/tcp` will not be opened.
|
For public ingress, `8200/tcp`, `5432/tcp`, `27017/tcp`, `5672/tcp`, `15672/tcp`, `6379/tcp`, `2379/tcp`, `9000/tcp`, `9180/tcp`, `9090/tcp`, and `3000/tcp` will not be opened. `51820/udp` is the explicit test-only public exception for WireGuard.
|
||||||
|
|
||||||
### App (swarm) Firewall — Private Ingress
|
### App (swarm) Firewall — Private Ingress
|
||||||
|
|
||||||
@ -133,9 +133,9 @@ Source from DB subnet, because `iklim-db-01` joins Swarm as a worker:
|
|||||||
| `7946/tcp,udp` | Docker Swarm node discovery | `10.10.10.0/24` (app subnet) |
|
| `7946/tcp,udp` | Docker Swarm node discovery | `10.10.10.0/24` (app subnet) |
|
||||||
| `4789/udp` | Docker Swarm VXLAN overlay | `10.10.10.0/24` (app subnet) |
|
| `4789/udp` | Docker Swarm VXLAN overlay | `10.10.10.0/24` (app subnet) |
|
||||||
|
|
||||||
IP restriction is done in the SWAG nginx configuration, not in the Hetzner firewall. None of these ports are opened publicly from the `admin_allowed_cidrs` source.
|
IP restriction is done in the SWAG nginx configuration, not in the Hetzner firewall. None of these management ports are opened publicly from the `admin_allowed_cidrs` source.
|
||||||
|
|
||||||
For other private ingress rules, `01-private-network-port-matrisi.md` will be used as the source.
|
For other private ingress rules, `01-private-network-port-matrix.md` will be used as the source.
|
||||||
|
|
||||||
## Placement Group
|
## Placement Group
|
||||||
|
|
||||||
@ -204,6 +204,6 @@ Each server gets `lifecycle { prevent_destroy = true }`. While this block exists
|
|||||||
- `terraform plan` works only with the test Hetzner Project token.
|
- `terraform plan` works only with the test Hetzner Project token.
|
||||||
- 2 servers are created after `terraform apply`.
|
- 2 servers are created after `terraform apply`.
|
||||||
- The two servers can reach each other through the private network.
|
- The two servers can reach each other through the private network.
|
||||||
- Only `22`, `80`, and `443` are open at firewall level from the public internet.
|
- Only `22`, `80`, `443`, and test WireGuard `51820/udp` are open at firewall level from the public internet.
|
||||||
- Vault `8200` remains closed from the public internet.
|
- Vault `8200` remains closed from the public internet.
|
||||||
- Terraform state is not committed to the repo.
|
- Terraform state is not committed to the repo.
|
||||||
@ -97,7 +97,7 @@ ansible-playbook test-bootstrap.yml --tags "hardening" --ask-vault-pass
|
|||||||
| Host | Role |
|
| Host | Role |
|
||||||
| --- | --- |
|
| --- | --- |
|
||||||
| `iklim-app-01` | Swarm manager + app worker |
|
| `iklim-app-01` | Swarm manager + app worker |
|
||||||
| `iklim-db-01` | OS-hardened DB node for manual DB installation |
|
| `iklim-db-01` | OS-hardened DB node / Swarm worker for DB services |
|
||||||
|
|
||||||
## Recommended File Structure
|
## Recommended File Structure
|
||||||
|
|
||||||
@ -281,7 +281,7 @@ Deploy prerequisites on `iklim-app-01`:
|
|||||||
/opt/iklimco/stacks
|
/opt/iklimco/stacks
|
||||||
```
|
```
|
||||||
|
|
||||||
Minimum for manual DB installation on the DB node:
|
Minimum DB-node host directories:
|
||||||
|
|
||||||
```text
|
```text
|
||||||
/opt/iklimco
|
/opt/iklimco
|
||||||
@ -391,7 +391,7 @@ vault_iklim_password: "IKLIM_USER_PASSWORD"
|
|||||||
creates: "{{ storagebox_mount_point }}/.mounted_marker"
|
creates: "{{ storagebox_mount_point }}/.mounted_marker"
|
||||||
```
|
```
|
||||||
|
|
||||||
A marker file can be written to the directory to confirm mount success:
|
A marker file can be written to the directory to confirm mount success:
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
- name: Write mount marker
|
- name: Write mount marker
|
||||||
@ -402,7 +402,7 @@ vault_iklim_password: "IKLIM_USER_PASSWORD"
|
|||||||
|
|
||||||
6. **Create service bind mount directories**
|
6. **Create service bind mount directories**
|
||||||
|
|
||||||
In the test environment, the precipitation service's `image-data` volume is bind mounted on the host to `/mnt/storagebox/precipitation/images`. The directory is created by Ansible after StorageBox is mounted and left with `0755` permissions.
|
In the test environment, the precipitation service's `image-data` volume is bind mounted on the host to `/mnt/storagebox/precipitation/images`. The directory is created by Ansible after StorageBox is mounted and left with `0755` permissions.
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
- name: Create managed StorageBox directories
|
- name: Create managed StorageBox directories
|
||||||
@ -447,13 +447,13 @@ An ed25519 SSH key pair is generated on the server and uploaded to the StorageBo
|
|||||||
|
|
||||||
2. **Upload the public key to StorageBox**
|
2. **Upload the public key to StorageBox**
|
||||||
|
|
||||||
This step is done manually and requires the password the first time:
|
This step is done manually and requires the password the first time:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
cat /root/.ssh/id_ed25519_storagebox.pub | ssh -p23 u469968-sub4@u469968-sub4.your-storagebox.de install-ssh-key
|
cat /root/.ssh/id_ed25519_storagebox.pub | ssh -p23 u469968-sub4@u469968-sub4.your-storagebox.de install-ssh-key
|
||||||
```
|
```
|
||||||
|
|
||||||
Subsequent access works without a password:
|
Subsequent access works without a password:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
sftp -P23 u469968-sub4@u469968-sub4.your-storagebox.de
|
sftp -P23 u469968-sub4@u469968-sub4.your-storagebox.de
|
||||||
@ -461,14 +461,14 @@ An ed25519 SSH key pair is generated on the server and uploaded to the StorageBo
|
|||||||
|
|
||||||
3. **Add private and public keys to Gitea**
|
3. **Add private and public keys to Gitea**
|
||||||
|
|
||||||
Gitea -> Organization Settings -> Actions -> Secrets:
|
Gitea -> Organization Settings -> Actions -> Secrets:
|
||||||
|
|
||||||
| Secret Name | Value |
|
| Secret Name | Value |
|
||||||
| --- | --- |
|
| --- | --- |
|
||||||
| `STORAGEBOX_SSH_PRIV` | Contents of `/root/.ssh/id_ed25519_storagebox` |
|
| `STORAGEBOX_SSH_PRIV` | Contents of `/root/.ssh/id_ed25519_storagebox` |
|
||||||
| `STORAGEBOX_SSH_PUB` | Contents of `/root/.ssh/id_ed25519_storagebox.pub` |
|
| `STORAGEBOX_SSH_PUB` | Contents of `/root/.ssh/id_ed25519_storagebox.pub` |
|
||||||
|
|
||||||
To get the key contents:
|
To get the key contents:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
cat /root/.ssh/id_ed25519_storagebox
|
cat /root/.ssh/id_ed25519_storagebox
|
||||||
|
|||||||
@ -1,6 +1,6 @@
|
|||||||
# 04 - Test DB Docker Installation (Swarm Worker)
|
# 04 - Test DB Docker Setup (Swarm Worker)
|
||||||
|
|
||||||
The purpose of this phase is to add the `iklim-db-01` node to Swarm as a worker and run PostgreSQL and MongoDB as Swarm services.
|
The purpose of this phase is to add the `iklim-db-01` node to Swarm as a worker and prepare the host for PostgreSQL and MongoDB Swarm services.
|
||||||
|
|
||||||
## Architecture Decision
|
## Architecture Decision
|
||||||
|
|
||||||
@ -8,12 +8,12 @@ The roadmap states that DBs will be installed "manually". In the test environmen
|
|||||||
|
|
||||||
The installation has **two phases:**
|
The installation has **two phases:**
|
||||||
1. **Preparation (Ansible):** The `test-db-post-stack.yml` playbook sets up DB directories, the `mongod.conf` configuration, and the WireGuard VPN service.
|
1. **Preparation (Ansible):** The `test-db-post-stack.yml` playbook sets up DB directories, the `mongod.conf` configuration, and the WireGuard VPN service.
|
||||||
2. **Deploy (Gitea CI/CD):** The `deploy-test.yml` workflow deploys PostgreSQL and MongoDB services to Swarm through `docker-stack-infra.yml`.
|
2. **Deploy (Gitea CI/CD):** The test deploy workflow deploys PostgreSQL and MongoDB services as part of the environment stack.
|
||||||
|
|
||||||
**Why?**
|
**Why?**
|
||||||
1. **Ease of management:** Version transitions and configuration management are much faster with Docker.
|
1. **Ease of management:** Version transitions and configuration management are much faster with Docker.
|
||||||
2. **Overlay Network:** Application services (`iklim-app-01`) can access DBs through the `iklimco-net` overlay network in an encrypted and isolated way.
|
2. **Overlay Network:** Application services (`iklim-app-01`) can access DBs through the `iklimco-net` overlay network in an encrypted and isolated way.
|
||||||
3. **Data persistence:** Data is stored in Docker named volumes on `iklim-db-01`. StorageBox is used only for backups.
|
3. **Data persistence:** Runtime data is kept on the DB node. StorageBox is used for shared configuration, operational files, and backup-related paths, not as the primary DB data path.
|
||||||
|
|
||||||
## Prerequisites
|
## Prerequisites
|
||||||
|
|
||||||
@ -67,24 +67,21 @@ On `iklim-db-01`, through the `db_stack` and `wireguard` roles:
|
|||||||
- Places the `mongod.conf` file
|
- Places the `mongod.conf` file
|
||||||
- Installs and configures the WireGuard VPN server (`51820/udp`)
|
- Installs and configures the WireGuard VPN server (`51820/udp`)
|
||||||
|
|
||||||
> Deploying DB services (PostgreSQL, MongoDB) to Swarm is the responsibility of the Gitea CI/CD workflow (`deploy-test.yml`), not Ansible. This workflow deploys all services at once through `docker-stack-infra.yml`.
|
> Deploying DB services (PostgreSQL, MongoDB) to Swarm is the responsibility of the Gitea CI/CD workflow, not Ansible. The Ansible playbook prepares host directories, configuration, and WireGuard.
|
||||||
|
|
||||||
## 4. Volume and Data Structure
|
## 4. Volume and Data Structure
|
||||||
|
|
||||||
DB data is stored in Docker named volumes on `iklim-db-01`:
|
DB data is stored on `iklim-db-01` through the stack's configured volume or bind-mount layout. The Ansible `db_stack` role prepares MongoDB configuration at:
|
||||||
|
|
||||||
| Volume | Content |
|
```text
|
||||||
|---|---|
|
/opt/iklimco/db/mongodb/config/mongod.conf
|
||||||
| `iklim-db_postgresql_data` | PostgreSQL data files |
|
```
|
||||||
| `iklim-db_mongodb_data` | MongoDB data files |
|
|
||||||
|
|
||||||
MongoDB logs are written to stdout and can be watched with `docker logs`. Configuration: `/opt/iklimco/db/mongodb/config/mongod.conf`
|
MongoDB logs are written to stdout and can be watched with `docker logs`.
|
||||||
|
|
||||||
> StorageBox is **not used** for DB data. It only has a role in the backup strategy.
|
|
||||||
|
|
||||||
## 5. Acceptance Criteria
|
## 5. Acceptance Criteria
|
||||||
|
|
||||||
- `iklim-db-01` appears as Ready and Active in the `docker node ls` command.
|
- `iklim-db-01` appears as Ready and Active in the `docker node ls` command.
|
||||||
- `docker stack services iklimco` shows both services with 1/1 replicas.
|
- `docker stack services iklimco` shows both services with 1/1 replicas.
|
||||||
- Access from the application node is available through the `iklim-db_postgresql` and `iklim-db_mongodb` DNS names.
|
- Access from the application node is available through the `iklim-db_postgresql` and `iklim-db_mongodb` DNS names.
|
||||||
- Data is preserved from named volumes after reboot; verify with `docker volume ls`.
|
- Data is preserved after reboot according to the stack's configured DB volume/bind-mount layout.
|
||||||
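The replica criterion above can be checked mechanically. The sketch below assumes the input comes from `docker stack services iklimco --format '{{.Name}} {{.Replicas}}'`; the function name `all_replicas_ready` is hypothetical, and the demo feeds it sample text so it runs without a Docker daemon.

```bash
# Reads "<name> <running>/<desired>" lines from stdin and fails on the first
# service whose running count differs from its desired count.
all_replicas_ready() {
  while read -r name replicas; do
    replicas="${replicas%% *}"     # drop suffixes such as "(max 1 per node)"
    running="${replicas%%/*}"
    desired="${replicas##*/}"
    if [ "$running" != "$desired" ]; then
      echo "not ready: $name ($replicas)"
      return 1
    fi
  done
  return 0
}

# Real usage (requires a Swarm manager):
#   docker stack services iklimco --format '{{.Name}} {{.Replicas}}' | all_replicas_ready
# Demo with sample input; the service names are placeholders:
printf 'svc-postgresql 1/1\nsvc-mongodb 0/1\n' | all_replicas_ready   # prints "not ready: svc-mongodb (0/1)"
```

Note that a service scaled to `0/0` would pass this check, so it complements rather than replaces a look at `docker node ls`.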
@ -8,7 +8,7 @@ A single runner is used in the test environment for cost and simplicity:
|
|||||||
|
|
||||||
| Host | Service Name | System User | Labels |
|
| Host | Service Name | System User | Labels |
|
||||||
| --- | --- | --- | --- |
|
| --- | --- | --- | --- |
|
||||||
| `iklim-app-01` | `gitea-act-runner` | `gitea-runner` | `ubuntu-latest`, `ubuntu-22.04`, `ubuntu-20.04`, `test-runner` |
|
| `iklim-app-01` | `gitea-act-runner` | `gitea-runner` | `ubuntu-latest`, `ubuntu-22.04`, `ubuntu-20.04`, `test-runner:docker://catthehacker/ubuntu:act-22.04` |
|
||||||
|
|
||||||
## 1. Runner User and Permissions
|
## 1. Runner User and Permissions
|
||||||
|
|
||||||
@ -56,14 +56,15 @@ Critical parts of the configuration:
|
|||||||
```yaml
|
```yaml
|
||||||
runner:
|
runner:
|
||||||
labels:
|
labels:
|
||||||
- "ubuntu-latest:docker://ubuntu:latest"
|
- "ubuntu-latest"
|
||||||
- "ubuntu-22.04:docker://ubuntu:22.04"
|
- "ubuntu-22.04"
|
||||||
- "ubuntu-20.04:docker://ubuntu:20.04"
|
- "ubuntu-20.04"
|
||||||
- "test-runner:docker://ubuntu:22.04"
|
- "test-runner:docker://catthehacker/ubuntu:act-22.04"
|
||||||
|
|
||||||
container:
|
container:
|
||||||
network: "iklimco-net" # Access to DB services through overlay
|
network: "bridge"
|
||||||
options: "-v /var/run/docker.sock:/var/run/docker.sock" # For Docker commands
|
options: "-v /mnt/storagebox:/mnt/storagebox"
|
||||||
|
docker_host: "unix:///var/run/docker.sock"
|
||||||
```
|
```
|
||||||
|
|
||||||
Status check:
|
Status check:
|
||||||
@ -94,7 +95,7 @@ The following secrets must be defined at Gitea Organization level for pipelines
|
|||||||
|
|
||||||
## 6. Custom Image Build and Harbor Push
|
## 6. Custom Image Build and Harbor Push
|
||||||
|
|
||||||
`docker-stack-infra.yml` and microservice stacks use private images under `registry.tarla.io/iklimco/`. These images are built and pushed to the registry with the `ops/push-harbor-custom-images.sh` script.
|
Environment stack files and microservice stacks use private images under `registry.tarla.io/iklimco/`. These images are built and pushed to the registry with the `ops/push-harbor-custom-images.sh` script.
|
||||||
|
|
||||||
APISIX config files (`build/apisix-core/config.yaml`, `build/apisix-dashboard/conf.yaml`) are generated from templates under `template/` with `envsubst`. `push-harbor-custom-images.sh` performs this generation internally; temporary files are cleaned automatically when the build finishes.
|
APISIX config files (`build/apisix-core/config.yaml`, `build/apisix-dashboard/conf.yaml`) are generated from templates under `template/` with `envsubst`. `push-harbor-custom-images.sh` performs this generation internally; temporary files are cleaned automatically when the build finishes.
|
||||||
|
|
||||||
@ -114,6 +115,6 @@ bash ops/push-harbor-custom-images.sh
|
|||||||
|
|
||||||
1. The runner labeled `test-runner` appears as **Idle** (green) on the Gitea Runners page.
|
1. The runner labeled `test-runner` appears as **Idle** (green) on the Gitea Runners page.
|
||||||
2. A workflow using `runs-on: test-runner` is triggered successfully.
|
2. A workflow using `runs-on: test-runner` is triggered successfully.
|
||||||
3. The job container can access the Docker daemon and the `iklimco-net` overlay network.
|
3. The job can access the Docker daemon through `docker_host`, and deploy workflows connect job containers to `iklimco-net` when overlay access is required.
|
||||||
4. The `8200/tcp` (Vault) port is closed to the public internet.
|
4. The `8200/tcp` (Vault) port is closed to the public internet.
|
||||||
5. `registry.tarla.io/iklimco/custom-apisix`, `custom-apisix-dashboard`, and `custom-prometheus` images exist in Harbor and are pullable.
|
5. `registry.tarla.io/iklimco/custom-apisix`, `custom-apisix-dashboard`, and `custom-prometheus` images exist in Harbor and are pullable.
|
||||||
@ -12,7 +12,7 @@ Terraform creates the following in the prod environment:
|
|||||||
- DB subnet: `10.20.20.0/24`
|
- DB subnet: `10.20.20.0/24`
|
||||||
- Firewall:
|
- Firewall:
|
||||||
- Public ingress: only `22/tcp`, `80/tcp`, `443/tcp`
|
- Public ingress: only `22/tcp`, `80/tcp`, `443/tcp`
|
||||||
- Private ingress: prod rules in `01-private-network-port-matrisi.md`
|
- Private ingress: prod rules in `01-private-network-port-matrix.md`
|
||||||
- SSH key
|
- SSH key
|
||||||
- Placement groups:
|
- Placement groups:
|
||||||
- `iklim-prod-app-spread`
|
- `iklim-prod-app-spread`
|
||||||
@ -145,6 +145,13 @@ The following ports will not be opened publicly in prod:
|
|||||||
|
|
||||||
## Private Firewall
|
## Private Firewall
|
||||||
|
|
||||||
|
Firewall placement follows the Swarm placement model:
|
||||||
|
|
||||||
|
- DB/cluster services on `iklim-db-*` nodes: Patroni/PostgreSQL, MongoDB, and etcd.
|
||||||
|
- App/service-node infrastructure on `iklim-app-*` nodes: Vault, RabbitMQ, APISIX, Prometheus, Grafana, SWAG, and the Redis/Sentinel services from `docker-stack-infra_db-prod.yml`.
|
||||||
|
|
||||||
|
RabbitMQ ports are therefore documented under the app firewall. Redis and Redis Sentinel do not publish host-mode ports in the current prod stack; they stay on the Docker overlay network and do not need Hetzner firewall openings.
|
||||||
|
|
||||||
### App (swarm) Firewall — Private Ingress
|
### App (swarm) Firewall — Private Ingress
|
||||||
|
|
||||||
Source from app subnet (`10.20.10.0/24`):
|
Source from app subnet (`10.20.10.0/24`):
|
||||||
@ -340,7 +347,7 @@ Local state is used for now (`terraform.tfstate`). The state file is not committ
|
|||||||
- Swarm nodes are inside the `iklim-prod-app-spread` placement group.
|
- Swarm nodes are inside the `iklim-prod-app-spread` placement group.
|
||||||
- DB nodes are inside the `iklim-prod-db-spread` placement group.
|
- DB nodes are inside the `iklim-prod-db-spread` placement group.
|
||||||
- Public firewall allows only `22`, `80`, and `443` ingress.
|
- Public firewall allows only `22`, `80`, and `443` ingress.
|
||||||
- Private firewall is compatible with `01-private-network-port-matrisi.md`.
|
- Private firewall is compatible with `01-private-network-port-matrix.md`.
|
||||||
- DB replication ports are accessible only from the DB subnet.
|
- DB replication ports are accessible only from the DB subnet.
|
||||||
- Floating IP is created and assigned to `iklim-app-01`.
|
- Floating IP is created and assigned to `iklim-app-01`.
|
||||||
- Terraform state and secret tfvars are not committed.
|
- Terraform state and secret tfvars are not committed.
|
||||||
@ -119,6 +119,8 @@ ansible/
|
|||||||
vars.yml
|
vars.yml
|
||||||
vault.yml
|
vault.yml
|
||||||
prod-bootstrap.yml
|
prod-bootstrap.yml
|
||||||
|
roles/
|
||||||
|
db_stack/
|
||||||
roles/
|
roles/
|
||||||
base/
|
base/
|
||||||
hardening/
|
hardening/
|
||||||
@ -131,6 +133,8 @@ ansible/
|
|||||||
db_stack/
|
db_stack/
|
||||||
```
|
```
|
||||||
|
|
||||||
|
`ansible/prod/ansible.cfg` sets `roles_path = roles:../roles`. Because of that ordering, `ansible/prod/roles/db_stack` is the production-specific role that is used by `prod-bootstrap.yml`; the shared `ansible/roles/db_stack` remains the common fallback/reference implementation. Production DB behavior that writes Patroni, MongoDB, and replica-set auth files to StorageBox belongs to the prod-local role.
|
||||||
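The effect of the `roles_path` ordering can be illustrated with a first-match search over a colon-separated path. This is a simplified model of the lookup, not Ansible's actual resolver; `resolve_role` and the temporary directory layout are assumptions for the demo.

```bash
# resolve_role <roles_path> <role>
# Prints the first directory in the colon-separated path that contains the role.
resolve_role() {
  old_ifs="$IFS"
  IFS=":"
  for dir in $1; do
    if [ -d "$dir/$2" ]; then
      IFS="$old_ifs"
      echo "$dir/$2"
      return 0
    fi
  done
  IFS="$old_ifs"
  return 1
}

# Demo layout in a temporary directory (mirrors ansible/prod/roles and ansible/roles).
demo="$(mktemp -d)"
mkdir -p "$demo/prod/roles/db_stack" "$demo/roles/db_stack" "$demo/roles/base"

resolve_role "$demo/prod/roles:$demo/roles" db_stack  # prints the prod-local .../prod/roles/db_stack
resolve_role "$demo/prod/roles:$demo/roles" base      # prints the shared .../roles/base
rm -rf "$demo"
```

Because the prod-local directory is listed first, a role that exists in both places resolves to the prod copy, while roles that exist only in the shared directory still resolve there.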
|
|
||||||
## Base Role
|
## Base Role
|
||||||
|
|
||||||
Applied to all prod nodes:
|
Applied to all prod nodes:
|
||||||
@ -200,30 +204,35 @@ Prod Swarm will be set up with 3 managers:
|
|||||||
1. `docker swarm init` on `iklim-app-01` (Advertise/data path addr: `10.20.10.11`)
|
1. `docker swarm init` on `iklim-app-01` (Advertise/data path addr: `10.20.10.11`)
|
||||||
2. `iklim-app-02` and `iklim-app-03` join as managers.
|
2. `iklim-app-02` and `iklim-app-03` join as managers.
|
||||||
3. `iklim-db-01/02/03` join as workers.
|
3. `iklim-db-01/02/03` join as workers.
|
||||||
4. Overlay network is created: `iklimco-net`
|
4. `iklimco-net` is not created by the Ansible swarm role. It is created and owned by the Swarm stack (`docker-stack-infra_db-prod.yml`) so Docker embedded DNS works for service VIPs and aliases.
|
||||||
5. Node labels:
|
5. Node labels:
|
||||||
- `iklim-app-*` -> `type=service`
|
- `iklim-app-*` -> `type=service`
|
||||||
- `iklim-db-*` -> `role=db`, `db-index=01/02/03`, for Patroni node coordination
|
- `iklim-db-*` -> `role=db`
|
||||||
|
- `iklim-db-*` -> `db-index=01/02/03`, for Patroni node coordination
|
||||||
6. All nodes remain `AVAILABILITY=Active`.
|
6. All nodes remain `AVAILABILITY=Active`.
|
||||||
|
|
||||||
The `db-index` labels are added through `iklim-app-01` in a separate play inside `prod-bootstrap.yml`, not by the swarm role.
|
Labeling is intentionally split across two automation layers:
|
||||||
|
|
||||||
|
- The shared `swarm` role adds the generic environment labels: `type=service` on app nodes and `role=db` on DB nodes.
|
||||||
|
- The production playbook adds `db-index=01/02/03` through `iklim-app-01` in a separate play inside `prod-bootstrap.yml`.
|
||||||
|
|
||||||
|
This split keeps the common Swarm role reusable while letting prod add the Patroni/MongoDB coordination labels it needs.
|
||||||
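The two labeling layers described above can be sketched as the `docker node update --label-add` commands they would amount to. This is a dry-run illustration that only prints commands; the function names are hypothetical, and how the Ansible role and playbook actually apply the labels is not shown in this document.

```bash
# Dry-run sketch: prints the label commands instead of executing them.
# Node names and label keys come from the list above; the split into two
# functions mirrors the two automation layers and is illustrative only.
generic_label_cmds() {
  for n in 01 02 03; do
    echo "docker node update --label-add type=service iklim-app-${n}"
    echo "docker node update --label-add role=db iklim-db-${n}"
  done
}

db_index_label_cmds() {
  for n in 01 02 03; do
    echo "docker node update --label-add db-index=${n} iklim-db-${n}"
  done
}

generic_label_cmds
db_index_label_cmds
```

On a manager node, `docker node inspect <node> --format '{{ .Spec.Labels }}'` can be used to confirm which labels are present.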
|
|
||||||
## Node Directory Role
|
## Node Directory Role
|
||||||
|
|
||||||
On all `iklim-app-*` nodes:
|
On all `iklim-app-*` nodes:
|
||||||
```text
|
```text
|
||||||
/opt/iklimco/ssl
|
/opt/iklimco/ssl
|
||||||
/opt/iklimco/init
|
|
||||||
/opt/iklimco/stacks
|
|
||||||
/opt/iklimco/vault/data
|
|
||||||
```
|
```
|
||||||
|
|
||||||
`/opt/iklimco/vault/data` is the host path volume of the Vault Raft node; it must be created separately on every app node. Swarm does not manage this directory as an overlay volume; if it is missing, the Vault container will not start.
|
Vault data is managed by the `docker-stack-vault.yml` stack through Docker volumes. The app nodes need the local SSL directory because `cert-distributor` syncs certificates from StorageBox into `/opt/iklimco/ssl` for Vault.
|
||||||
|
|
||||||
On DB nodes:
|
On DB nodes:
|
||||||
```text
|
```text
|
||||||
/opt/iklimco/db
|
/opt/iklimco/db
|
||||||
/opt/iklimco/backup
|
/opt/iklimco/backup
|
||||||
|
/opt/iklimco/db/mongodb
|
||||||
|
/opt/iklimco/db/postgresql
|
||||||
```
|
```
|
||||||
|
|
||||||
## StorageBox DAVFS Mount Role
|
## StorageBox DAVFS Mount Role
|
||||||
@ -256,19 +265,22 @@ Applied to `iklim-app-*` nodes. Gitea Act Runner is installed on each app node a
|
|||||||
|
|
||||||
## DB Stack Role
|
## DB Stack Role
|
||||||
|
|
||||||
Applied to `iklim-db-*` nodes. On each DB node, it creates `/opt/iklimco/db` and `/opt/iklimco/backup` directories, as well as a local reference directory for MongoDB. The actual production configuration, including node-specific `mongod.conf`, replica set auth key, and Patroni configurations, is set up on StorageBox at `/mnt/storagebox/db/mongodb-0X/config/` and `/mnt/storagebox/db/postgresql-0X/config/` in the `08-prod-db-cluster-kurulum.md` step. etcd data is stored on local Docker named volumes (not StorageBox).
|
Applied to `iklim-db-*` nodes. On each DB node, it creates `/opt/iklimco/db`, `/opt/iklimco/backup`, `/opt/iklimco/db/mongodb`, and `/opt/iklimco/db/postgresql`. The production configuration, including node-specific `mongod.conf`, replica set auth key, and Patroni configurations, is deployed by the Ansible `db_stack` role to StorageBox at `/mnt/storagebox/db/mongodb-0X/config/` and `/mnt/storagebox/db/postgresql-0X/config/`. etcd data is stored on local Docker named volumes.
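
As a rough check that the role has prepared a DB node, the local directories listed above can be probed with a small helper. This is a sketch only; `missing_db_dirs` and its optional prefix argument are hypothetical and exist purely for illustration:

```bash
# Print each expected DB-node directory that is missing.
# The optional prefix argument is a hypothetical convenience for dry runs
# against a scratch directory; on a real DB node call it with no argument.
missing_db_dirs() {
  local prefix="${1:-}" dir
  for dir in \
    /opt/iklimco/db \
    /opt/iklimco/backup \
    /opt/iklimco/db/mongodb \
    /opt/iklimco/db/postgresql; do
    [ -d "${prefix}${dir}" ] || echo "${dir}"
  done
}
```

An empty result means all four directories exist; any printed path points at a directory the role has not created yet.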
|
||||||
|
|
||||||
## DB Stack Env Variables
|
## DB Stack Env Variables
|
||||||
|
|
||||||
Password variables required by the DB cluster stack (`docker-stack-db.prod.yml`) — `DATABASE_POSTGRES_ROOT_PASSWD`, `DATABASE_POSTGRES_REPLICATOR_PASSWORD`, `DATABASE_MONGODB_ROOT_PASSWD` — are stored in `prod/secrets/iklim.co/.env.secrets.shared` on StorageBox, alongside the other shared secrets. No separate file is needed.
|
Password variables required by the prod infra stack (`docker-stack-infra_db-prod.yml`) — including `DATABASE_POSTGRES_ROOT_PASSWD`, `DATABASE_POSTGRES_REPLICATOR_PASSWORD`, `DATABASE_MONGODB_ROOT_PASSWD`, and `ETCD_ROOT_PASSWORD` — are stored in `prod/secrets/iklim.co/.env.secrets.shared` on StorageBox, alongside the other shared secrets. No separate file is needed.
|
||||||
|
|
||||||
## StorageBox Directory Structure
|
## StorageBox Directory Structure
|
||||||
|
|
||||||
The `storagebox` Ansible role **automatically** creates the following directories during bootstrap via `storagebox_managed_directories` (`group_vars/all/vars.yml`). No manual step is required:
|
The `storagebox` Ansible role **automatically** creates the following directories during bootstrap via `storagebox_managed_directories` (`group_vars/all/vars.yml`). No manual step is required:
|
||||||
|
|
||||||
- `/mnt/storagebox/ssl` → `SWAG_CERT_DIR`
|
- `/mnt/storagebox/ssl` → `SWAG_CERT_DIR`
|
||||||
- `/mnt/storagebox/swag/config` → `SWAG_CONFIG_DIR`
|
- `/mnt/storagebox/swag`
|
||||||
|
- `/mnt/storagebox/swag/dns-conf` → `SWAG_DNS_CONFIG_DIR`
|
||||||
- `/mnt/storagebox/swag/site-confs` → `SWAG_SITE_CONFS_DIR`
|
- `/mnt/storagebox/swag/site-confs` → `SWAG_SITE_CONFS_DIR`
|
||||||
|
- `/mnt/storagebox/swag/proxy-confs` → `SWAG_PROXY_CONFS_DIR`
|
||||||
|
- `/mnt/storagebox/swag/certbot`
|
||||||
- `/mnt/storagebox/grafana/data` → `GRAFANA_DATA_DIR`
|
- `/mnt/storagebox/grafana/data` → `GRAFANA_DATA_DIR`
|
||||||
- `/mnt/storagebox/precipitation/images`
|
- `/mnt/storagebox/precipitation/images`
|
||||||
|
|
||||||
@ -300,12 +312,12 @@ grep -n "swarm init\|swarm join" init/swarm-init.sh
|
|||||||
- 3 Swarm manager nodes appear as Leader/Reachable in `docker node ls`.
|
- 3 Swarm manager nodes appear as Leader/Reachable in `docker node ls`.
|
||||||
- 3 DB nodes appear as Workers in `docker node ls`.
|
- 3 DB nodes appear as Workers in `docker node ls`.
|
||||||
- Manager quorum is provided: 3 managers, 1 loss tolerated.
|
- Manager quorum is provided: 3 managers, 1 loss tolerated.
|
||||||
- The `iklimco-net` overlay network exists.
|
- The `iklimco-net` overlay network is created by the Swarm stack after `docker-stack-infra_db-prod.yml` deploy.
|
||||||
- Node labels (`type=service`, `role=db`, `db-index=01/02/03`) are verified with inspect.
|
- Node labels (`type=service`, `role=db`, `db-index=01/02/03`) are verified with inspect.
|
||||||
- `swarm-init.sh` does not attempt init again in an active Swarm; it is idempotent.
|
- `swarm-init.sh` does not attempt init again in an active Swarm; it is idempotent.
|
||||||
- `/mnt/storagebox` is mounted on every node.
|
- `/mnt/storagebox` is mounted on every node.
|
||||||
- The `/opt/iklimco/vault/data` directory exists on every app node.
|
- The `/opt/iklimco/ssl` directory exists on every app node.
|
||||||
- The `ssl`, `swag/config`, `swag/site-confs`, `grafana/data`, and `precipitation/images` directories exist on StorageBox.
|
- The `db`, `ssl`, `swag`, `swag/dns-conf`, `swag/site-confs`, `swag/proxy-confs`, `swag/certbot`, `grafana/data`, and `precipitation/images` directories exist on StorageBox.
|
||||||
- The Gitea Act Runner service is running on every app node.
|
- The Gitea Act Runner service is running on every app node.
|
||||||
- `/opt/iklimco/db` and `/opt/iklimco/backup` directories exist on DB nodes. Node-specific `mongod.conf` and other DB configurations are created on StorageBox (`/mnt/storagebox/db/...`) in the `08-prod-db-cluster-kurulum.md` step.
|
- `/opt/iklimco/db` and `/opt/iklimco/backup` directories exist on DB nodes. Node-specific `mongod.conf` and other DB configurations are created on StorageBox (`/mnt/storagebox/db/...`) in the `08-prod-db-cluster-setup.md` step.
|
||||||
- Public firewall allows only `22`, `80`, and `443` ingress.
|
- Public firewall allows only `22`, `80`, and `443` ingress.
|
||||||
|
|||||||
@ -27,7 +27,9 @@ iklim-db-03 (Swarm worker, 10.20.20.13)
|
|||||||
patroni-03 [Patroni + PostgreSQL — standby]
|
patroni-03 [Patroni + PostgreSQL — standby]
|
||||||
```
|
```
|
||||||
|
|
||||||
DB containers discover each other through **overlay DNS aliases** (`mongodb-01`, `etcd-01`, `patroni-01`, etc.) on the shared `iklimco-net` overlay network. Each service publishes its port in `host` mode so replication traffic goes directly through the Hetzner private network while the overlay DNS resolves service names correctly. All containers are defined in the single `docker-stack-db.prod.yml` stack file at the repo root.
|
DB containers discover each other through **overlay DNS aliases** (`mongodb-01`, `etcd-01`, `patroni-01`, etc.) on the shared `iklimco-net` overlay network. Patroni/PostgreSQL, MongoDB, and etcd are the DB/cluster services covered by this document; they publish their cluster ports in `host` mode so replication traffic goes directly through the Hetzner private network while overlay DNS resolves service names correctly.
|
||||||
|
|
||||||
|
The current prod DB services are defined in the root `docker-stack-infra_db-prod.yml` stack file. That stack also contains non-DB infrastructure services such as Redis, Redis Sentinel, and RabbitMQ. Those services are intentionally different: they run on `node.labels.type == service` app/service nodes, do not publish host-mode ports in this stack, and communicate through the `iklimco-net` overlay network only. Do not generalize the DB host-mode rule to Redis or RabbitMQ.
|
||||||
|
|
||||||
## 1. Firewall Update
|
## 1. Firewall Update
|
||||||
|
|
||||||
@ -145,6 +147,10 @@ terraform apply
|
|||||||
|
|
||||||
## 2. Add DB Nodes to Swarm
|
## 2. Add DB Nodes to Swarm
|
||||||
|
|
||||||
|
This is handled by `Environment_Infrastructure/ansible/prod/prod-bootstrap.yml` through the `swarm` role. The role initializes Swarm on `iklim-app-01`, joins `iklim-app-02/03` as managers, joins `iklim-db-01/02/03` as workers, and labels DB nodes.
|
||||||
|
|
||||||
|
Manual equivalent, kept for troubleshooting only:
|
||||||
|
|
||||||
Get the join token **on one of the Swarm managers** (iklim-app-01):
|
Get the join token **on one of the Swarm managers** (iklim-app-01):
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
@ -157,19 +163,35 @@ docker swarm join-token worker
|
|||||||
docker swarm join --token <TOKEN> 10.20.10.11:2377
|
docker swarm join --token <TOKEN> 10.20.10.11:2377
|
||||||
```
|
```
|
||||||
|
|
||||||
Label the nodes **on iklim-app-01**:
|
Label the nodes **on iklim-app-01**. In automation this is split into two phases:
|
||||||
|
|
||||||
|
- the shared `swarm` role adds `role=db` to DB nodes;
|
||||||
|
- the prod-specific `prod-bootstrap.yml` play adds `db-index=01/02/03`.
|
||||||
|
|
||||||
|
Manual equivalent:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
docker node update --label-add role=db --label-add db-index=01 iklim-db-01
|
docker node update --label-add role=db iklim-db-01
|
||||||
docker node update --label-add role=db --label-add db-index=02 iklim-db-02
|
docker node update --label-add role=db iklim-db-02
|
||||||
docker node update --label-add role=db --label-add db-index=03 iklim-db-03
|
docker node update --label-add role=db iklim-db-03
|
||||||
|
|
||||||
|
docker node update --label-add db-index=01 iklim-db-01
|
||||||
|
docker node update --label-add db-index=02 iklim-db-02
|
||||||
|
docker node update --label-add db-index=03 iklim-db-03
|
||||||
|
|
||||||
docker node ls
|
docker node ls
|
||||||
```
|
```
|
||||||
|
|
||||||
## 3. StorageBox Directory Structure
|
## 3. StorageBox Directory Structure
|
||||||
|
|
||||||
DB data and logs are stored on **local Docker named volumes** (performance, WAL/compaction requirements). Only config files are placed on StorageBox. On each DB node, where `/mnt/storagebox` must already be mounted:
|
DB data is stored on local DB-node paths prepared by Ansible:
|
||||||
|
|
||||||
|
```text
|
||||||
|
/opt/iklimco/db/mongodb
|
||||||
|
/opt/iklimco/db/postgresql
|
||||||
|
```
|
||||||
|
|
||||||
|
Configuration files are placed on StorageBox. On each DB node, where `/mnt/storagebox` must already be mounted:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# On iklim-db-01:
|
# On iklim-db-01:
|
||||||
@ -185,7 +207,7 @@ mkdir -p /mnt/storagebox/db/mongodb-03/config
|
|||||||
mkdir -p /mnt/storagebox/db/postgresql-03/config
|
mkdir -p /mnt/storagebox/db/postgresql-03/config
|
||||||
```
|
```
|
||||||
|
|
||||||
Config files (`mongod.conf`, `patroni.yml`) are deployed by the Ansible `db_stack` role into these directories. Named Docker volumes (`mongodb-01-data`, `etcd-01-data`, `postgresql-01-data`, etc.) are created automatically by the stack deploy.
|
Config files (`mongod.conf`, `patroni.yml`) and the MongoDB replica set key are deployed by the Ansible `db_stack` role into these directories. etcd uses Docker named volumes (`etcd-01-data`, `etcd-02-data`, `etcd-03-data`) from `docker-stack-infra_db-prod.yml`.
|
||||||
|
|
||||||
## 4. MongoDB Replica Set
|
## 4. MongoDB Replica Set
|
||||||
|
|
||||||
@ -216,14 +238,18 @@ security:
|
|||||||
|
|
||||||
### Replica Set Auth Key
|
### Replica Set Auth Key
|
||||||
|
|
||||||
The **same** key file must exist on all DB nodes:
|
The **same** key file must exist on all DB nodes. In the current production setup, this is automated by `ansible/prod/roles/db_stack/tasks/db_node.yml`:
|
||||||
|
|
||||||
|
- `iklim-db-01` generates `/mnt/storagebox/db/mongodb-01/config/rs-auth.key` if it is missing;
|
||||||
|
- the same key content is copied to `/mnt/storagebox/db/mongodb-02/config/rs-auth.key` and `/mnt/storagebox/db/mongodb-03/config/rs-auth.key`;
|
||||||
|
- permissions are set to `0400`.
|
||||||
|
|
||||||
|
Manual recovery equivalent, kept only for troubleshooting:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Create on iklim-db-01:
|
|
||||||
openssl rand -base64 756 > /mnt/storagebox/db/mongodb-01/config/rs-auth.key
|
openssl rand -base64 756 > /mnt/storagebox/db/mongodb-01/config/rs-auth.key
|
||||||
chmod 400 /mnt/storagebox/db/mongodb-01/config/rs-auth.key
|
chmod 400 /mnt/storagebox/db/mongodb-01/config/rs-auth.key
|
||||||
|
|
||||||
# Copy the same content to the other nodes:
|
|
||||||
cat /mnt/storagebox/db/mongodb-01/config/rs-auth.key \
|
cat /mnt/storagebox/db/mongodb-01/config/rs-auth.key \
|
||||||
> /mnt/storagebox/db/mongodb-02/config/rs-auth.key
|
> /mnt/storagebox/db/mongodb-02/config/rs-auth.key
|
||||||
cat /mnt/storagebox/db/mongodb-01/config/rs-auth.key \
|
cat /mnt/storagebox/db/mongodb-01/config/rs-auth.key \
|
||||||
@ -234,14 +260,16 @@ chmod 400 /mnt/storagebox/db/mongodb-0{2,3}/config/rs-auth.key
|
|||||||
|
|
||||||
### Stack File — MongoDB
|
### Stack File — MongoDB
|
||||||
|
|
||||||
MongoDB services are defined in `docker-stack-db.prod.yml` (repo root). Each service uses a named Docker volume for data and log, and a StorageBox bind mount for config:
|
MongoDB services are defined in `docker-stack-infra_db-prod.yml` (repo root). Each service uses a local DB-node bind mount for data and a StorageBox bind mount for config:
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
mongodb-01:
|
mongodb-01:
|
||||||
image: mongo:8.3.2
|
image: ${IMAGE_MONGODB}
|
||||||
|
environment:
|
||||||
|
MONGO_INITDB_ROOT_USERNAME: "${DATABASE_MONGODB_ROOT_USER}"
|
||||||
|
MONGO_INITDB_ROOT_PASSWORD: "${DATABASE_MONGODB_ROOT_PASSWD}"
|
||||||
volumes:
|
volumes:
|
||||||
- mongodb-01-data:/data/db
|
- /opt/iklimco/db/mongodb:/data/db
|
||||||
- mongodb-01-log:/data/log
|
|
||||||
- /mnt/storagebox/db/mongodb-01/config:/data/configdb
|
- /mnt/storagebox/db/mongodb-01/config:/data/configdb
|
||||||
networks:
|
networks:
|
||||||
iklimco-net:
|
iklimco-net:
|
||||||
@ -260,11 +288,18 @@ mongodb-01:
|
|||||||
- node.hostname == iklim-db-01
|
- node.hostname == iklim-db-01
|
||||||
```
|
```
|
||||||
|
|
||||||
Volumes `mongodb-01-data`, `mongodb-01-log`, etc. are declared at the bottom of `docker-stack-db.prod.yml` and are created automatically on first deploy.
|
The same pattern is repeated for `mongodb-02` and `mongodb-03`, with node-specific StorageBox config paths and placement constraints.
|
||||||
|
|
||||||
### Replica Set Initialization
|
### Replica Set Initialization
|
||||||
|
|
||||||
Run **once** after the stack is deployed:
|
Replica set initialization is handled by the root prod workflow step `Initialize MongoDB Replica Set`. The workflow:
|
||||||
|
|
||||||
|
1. Connects to the first host from `DATABASE_MONGODB_HOST`.
|
||||||
|
2. Runs `rs.initiate()` if the replica set is uninitialized.
|
||||||
|
3. Checks current members if the replica set already exists.
|
||||||
|
4. Runs `rs.add()` through the primary if hosts from `DATABASE_MONGODB_HOST` are missing.
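
The host handling in the steps above can be sketched with two small shell helpers. This is an illustration under assumptions: the function names are hypothetical, the real workflow step may be structured differently, and the `mongosh` wiring in the comment leaves out authentication flags:

```bash
# Hypothetical helpers; the real workflow step may be structured differently.

# First host:port entry of a comma-separated DATABASE_MONGODB_HOST value.
first_mongo_host() {
  printf '%s\n' "${1%%,*}"
}

# Print every desired host (arg 1, comma-separated) that is absent from the
# current member list (arg 2, comma-separated), one per line.
missing_members() {
  local desired="$1" current=",$2," host
  local IFS=','
  for host in $desired; do
    case "$current" in
      *",$host,"*) ;;
      *) printf '%s\n' "$host" ;;
    esac
  done
}

# Example wiring (assumes mongosh is available and that authentication
# flags are added as the deployment requires):
#   for h in $(missing_members "$DATABASE_MONGODB_HOST" "$current"); do
#     mongosh --host "$primary" --eval "rs.add('$h')"
#   done
```

Keeping the comparison separate from the `mongosh` calls makes the step safe to rerun: when every host is already a member, `missing_members` prints nothing and no `rs.add()` is issued.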
|
||||||
|
|
||||||
|
Manual equivalent, kept for troubleshooting only:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# On iklim-app-01 (for overlay network access):
|
# On iklim-app-01 (for overlay network access):
|
||||||
@ -293,7 +328,7 @@ Patroni coordinates PostgreSQL primary/standby roles through etcd. If the primar
|
|||||||
|
|
||||||
### 5.1 Custom Image (Patroni + PostGIS)
|
### 5.1 Custom Image (Patroni + PostGIS)
|
||||||
|
|
||||||
Patroni is installed on top of the `postgis/postgis:18-3.6` image. This image is pushed to Harbor and used in the stack.
|
Patroni is installed on top of the `postgis/postgis:18-3.6` image. This image is pushed to Harbor and used in `docker-stack-infra_db-prod.yml` via `${CUSTOM_IMAGE_REGISTRY}${IMAGE_PATRONI}`.
|
||||||
|
|
||||||
`build/patroni-postgis/Dockerfile`:
|
`build/patroni-postgis/Dockerfile`:
|
||||||
|
|
||||||
@ -335,13 +370,13 @@ docker push registry.tarla.io/iklimco/custom-patroni-postgis:18-3.6
|
|||||||
|
|
||||||
### 5.2 etcd Cluster
|
### 5.2 etcd Cluster
|
||||||
|
|
||||||
etcd services are defined in `docker-stack-db.prod.yml`. Each service uses a named Docker volume for data and has an overlay DNS alias. Environment variables reference peer URLs by alias, not by hardcoded IP:
|
etcd services are defined in `docker-stack-infra_db-prod.yml`. Each service uses a named Docker volume for data and has an overlay DNS alias. Environment variables reference peer URLs by alias, not by hardcoded IP:
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
etcd-01:
|
etcd-01:
|
||||||
image: bitnami/etcd:3
|
image: ${IMAGE_ETCD}
|
||||||
environment:
|
environment:
|
||||||
ALLOW_NONE_AUTHENTICATION: "yes"
|
ALLOW_NONE_AUTHENTICATION: "no"
|
||||||
ETCD_NAME: etcd-01
|
ETCD_NAME: etcd-01
|
||||||
ETCD_INITIAL_ADVERTISE_PEER_URLS: http://etcd-01:2380
|
ETCD_INITIAL_ADVERTISE_PEER_URLS: http://etcd-01:2380
|
||||||
ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380
|
ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380
|
||||||
@ -350,6 +385,7 @@ etcd-01:
|
|||||||
ETCD_INITIAL_CLUSTER: "etcd-01=http://etcd-01:2380,etcd-02=http://etcd-02:2380,etcd-03=http://etcd-03:2380"
|
ETCD_INITIAL_CLUSTER: "etcd-01=http://etcd-01:2380,etcd-02=http://etcd-02:2380,etcd-03=http://etcd-03:2380"
|
||||||
ETCD_INITIAL_CLUSTER_STATE: new
|
ETCD_INITIAL_CLUSTER_STATE: new
|
||||||
ETCD_INITIAL_CLUSTER_TOKEN: iklimco-etcd-prod
|
ETCD_INITIAL_CLUSTER_TOKEN: iklimco-etcd-prod
|
||||||
|
ETCD_ROOT_PASSWORD: "${ETCD_ROOT_PASSWORD}"
|
||||||
volumes:
|
volumes:
|
||||||
- etcd-01-data:/bitnami/etcd/data
|
- etcd-01-data:/bitnami/etcd/data
|
||||||
networks:
|
networks:
|
||||||
@ -366,7 +402,7 @@ etcd-01:
|
|||||||
|
|
||||||
**APISIX etcd usage:** In prod, APISIX shares this etcd cluster with the `/apisix` prefix. Patroni uses the `/service/` prefix and APISIX uses the `/apisix/` prefix — no collision. The overlay DNS names (`etcd-01:2379`, `etcd-02:2379`, `etcd-03:2379`) are reachable from app nodes via the `iklimco-net` overlay. Therefore, the app subnet → DB nodes port 2379 firewall rule is mandatory; it was added in Section 1.
|
**APISIX etcd usage:** In prod, APISIX shares this etcd cluster with the `/apisix` prefix. Patroni uses the `/service/` prefix and APISIX uses the `/apisix/` prefix — no collision. The overlay DNS names (`etcd-01:2379`, `etcd-02:2379`, `etcd-03:2379`) are reachable from app nodes via the `iklimco-net` overlay. Therefore, the app subnet → DB nodes port 2379 firewall rule is mandatory; it was added in Section 1.
|
||||||
|
|
||||||
**Important:** `ETCD_INITIAL_CLUSTER_STATE` must be `new` on the first deploy and `existing` on all later deploys. The deploy steps in Section 6 detect this automatically; no manual update is required.
|
**Important:** `ETCD_INITIAL_CLUSTER_STATE` is currently defined in `docker-stack-infra_db-prod.yml`. When changing etcd cluster membership, do not blindly expand `ETCD_INITIAL_CLUSTER` on a running cluster; add members through etcd membership operations first.
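
A minimal sketch of what "membership operations first" could look like. The helper only prints the `etcdctl` command instead of running it; `etcd_member_add_cmd` and the fourth node `etcd-04` are hypothetical and used for illustration only:

```bash
# Print (do not run) the etcdctl command that registers a new member.
# Member name and peer URL are supplied by the caller; etcd-04 below is a
# hypothetical fourth node used only for illustration.
etcd_member_add_cmd() {
  local name="$1" peer_url="$2" endpoints="${3:-http://etcd-01:2379}"
  printf 'etcdctl --endpoints=%s --user root:"$ETCD_ROOT_PASSWORD" member add %s --peer-urls=%s\n' \
    "$endpoints" "$name" "$peer_url"
}

etcd_member_add_cmd etcd-04 http://etcd-04:2380
```

Once the member is registered on the running cluster, the new etcd service would start with `ETCD_INITIAL_CLUSTER_STATE` set to `existing` rather than `new`.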
|
||||||
|
|
||||||
### 5.3 Patroni Configuration
|
### 5.3 Patroni Configuration
|
||||||
|
|
||||||
@ -447,17 +483,19 @@ For Node 02 and 03, only `name`, `restapi.connect_address`, and `postgresql.conn
|
|||||||
|
|
||||||
### 5.4 Stack File — Patroni
|
### 5.4 Stack File — Patroni
|
||||||
|
|
||||||
Patroni services are defined in `docker-stack-db.prod.yml`. Each service uses the custom image, a named Docker volume for data, a StorageBox bind mount for the config file, and overlay DNS aliases:
|
Patroni services are defined in `docker-stack-infra_db-prod.yml`. Each service uses the custom image, a local DB-node bind mount for data, a StorageBox bind mount for the config file, and overlay DNS aliases:
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
patroni-01:
|
patroni-01:
|
||||||
image: registry.tarla.io/iklimco/custom-patroni-postgis:18-3.6
|
image: ${CUSTOM_IMAGE_REGISTRY}${IMAGE_PATRONI}
|
||||||
environment:
|
environment:
|
||||||
DATABASE_POSTGRES_ROOT_PASSWD: "${DATABASE_POSTGRES_ROOT_PASSWD}"
|
POSTGRES_USER: "${DATABASE_POSTGRES_ROOT_USER}"
|
||||||
DATABASE_POSTGRES_REPLICATOR_PASSWORD: "${DATABASE_POSTGRES_REPLICATOR_PASSWORD}"
|
POSTGRES_PASSWORD: "${DATABASE_POSTGRES_ROOT_PASSWD}"
|
||||||
|
REPLICATOR_PASSWORD: "${DATABASE_POSTGRES_REPLICATOR_PASSWORD}"
|
||||||
|
ETCD_ROOT_PASSWORD: "${ETCD_ROOT_PASSWORD}"
|
||||||
TZ: "Europe/Istanbul"
|
TZ: "Europe/Istanbul"
|
||||||
volumes:
|
volumes:
|
||||||
- postgresql-01-data:/var/lib/postgresql/data
|
- /opt/iklimco/db/postgresql:/var/lib/postgresql/data
|
||||||
- /mnt/storagebox/db/postgresql-01/config/patroni.yml:/etc/patroni/patroni.yml:ro
|
- /mnt/storagebox/db/postgresql-01/config/patroni.yml:/etc/patroni/patroni.yml:ro
|
||||||
networks:
|
networks:
|
||||||
iklimco-net:
|
iklimco-net:
|
||||||
@ -480,7 +518,7 @@ patroni-01:
|
|||||||
- node.hostname == iklim-db-01
|
- node.hostname == iklim-db-01
|
||||||
```
|
```
|
||||||
|
|
||||||
Volumes `postgresql-01-data`, `postgresql-02-data`, `postgresql-03-data` are declared at the bottom of `docker-stack-db.prod.yml` and created automatically on first deploy.
|
The same pattern is repeated for `patroni-02` and `patroni-03`, with node-specific StorageBox config paths and placement constraints.
|
||||||
|
|
||||||
### 5.5 Status Check
|
### 5.5 Status Check
|
||||||
|
|
||||||
@ -508,11 +546,11 @@ docker exec -it $(docker ps -q -f name=iklimco_patroni-01 | head -1) \
|
|||||||
|
|
||||||
## 6. Deploy
|
## 6. Deploy
|
||||||
|
|
||||||
All DB services (etcd, MongoDB, Patroni) are in the single `docker-stack-db.prod.yml` stack. Deploy from `iklim-app-01` in the repo working directory.
|
All DB services (etcd, MongoDB, Patroni) are in the current root prod stack `docker-stack-infra_db-prod.yml`. Normal deployment is done by `.gitea/workflows/deploy-prod.yml`, not by running a separate DB stack manually.
|
||||||
|
|
||||||
### .env File
|
### .env File
|
||||||
|
|
||||||
DB stack password variables (`DATABASE_POSTGRES_ROOT_PASSWD`, `DATABASE_POSTGRES_REPLICATOR_PASSWORD`, `DATABASE_MONGODB_ROOT_PASSWD`) are stored in `prod/secrets/iklim.co/.env.secrets.shared` on StorageBox. Fetch it to `iklim-app-01` before deploy:
|
DB stack password variables (`DATABASE_POSTGRES_ROOT_PASSWD`, `DATABASE_POSTGRES_REPLICATOR_PASSWORD`, `DATABASE_MONGODB_ROOT_PASSWD`, `ETCD_ROOT_PASSWORD`) are stored in `prod/secrets/iklim.co/.env.secrets.shared` on StorageBox. The workflow fetches this file automatically.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
scp -P 23 STORAGEBOX_USER@STORAGEBOX_USER.your-storagebox.de:prod/secrets/iklim.co/.env.secrets.shared \
|
scp -P 23 STORAGEBOX_USER@STORAGEBOX_USER.your-storagebox.de:prod/secrets/iklim.co/.env.secrets.shared \
|
||||||
@ -522,44 +560,18 @@ chmod 600 /tmp/.env.secrets.shared
|
|||||||
|
|
||||||
### Deploy Steps
|
### Deploy Steps
|
||||||
|
|
||||||
|
The root prod workflow deploys the stack with:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# On iklim-app-01, in the repo working directory:
|
|
||||||
set -a; . /tmp/.env.secrets.shared; set +a
|
|
||||||
|
|
||||||
# Automatic ETCD_INITIAL_CLUSTER_STATE detection:
|
|
||||||
DEPLOY_FILE="docker-stack-db.prod.yml"
|
|
||||||
if docker service ls --filter name=iklimco_etcd-01 -q 2>/dev/null | grep -q .; then
|
|
||||||
echo "ℹ️ etcd services exist, deploying with 'existing'..."
|
|
||||||
DEPLOY_FILE=$(mktemp /tmp/docker-stack-db.XXXXXX.yml)
|
|
||||||
sed "s/ETCD_INITIAL_CLUSTER_STATE: new/ETCD_INITIAL_CLUSTER_STATE: existing/g" \
|
|
||||||
docker-stack-db.prod.yml > "$DEPLOY_FILE"
|
|
||||||
else
|
|
||||||
echo "ℹ️ First deploy, using 'new' state..."
|
|
||||||
fi
|
|
||||||
|
|
||||||
docker stack deploy \
|
docker stack deploy \
|
||||||
--with-registry-auth \
|
--with-registry-auth \
|
||||||
-c "$DEPLOY_FILE" \
|
--resolve-image changed \
|
||||||
|
-c docker-stack-infra_db-prod.yml \
|
||||||
iklimco
|
iklimco
|
||||||
|
|
||||||
[ "$DEPLOY_FILE" != "docker-stack-db.prod.yml" ] && rm -f "$DEPLOY_FILE"
|
|
||||||
|
|
||||||
# Wait for etcd cluster to be ready:
|
|
||||||
echo "⏳ Waiting for etcd..."
|
|
||||||
for i in $(seq 1 18); do
|
|
||||||
if docker run --rm --network iklimco-net alpine \
|
|
||||||
sh -c "wget -qO- http://etcd-01:2379/health 2>/dev/null | grep -q '\"health\":\"true\"'"; then
|
|
||||||
echo "✅ etcd ready"
|
|
||||||
break
|
|
||||||
fi
|
|
||||||
[ "$i" -eq 18 ] && echo "❌ etcd timeout" && exit 1
|
|
||||||
echo " attempt $i/18, waiting 10s..."
|
|
||||||
sleep 10
|
|
||||||
done
|
|
||||||
|
|
||||||
docker stack services iklimco
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
After the stack deploy, the workflow waits for etcd, initializes APISIX, initializes the MongoDB replica set, and runs PostgreSQL/MongoDB init scripts.
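
The "wait for etcd" part of that sequence can be sketched as a generic retry helper. `wait_for` is a hypothetical name, and the health probe in the comment is an assumption; the actual workflow step may probe etcd differently, especially with authentication enabled:

```bash
# Retry a command until it succeeds or the attempt budget is exhausted.
# Usage: wait_for <attempts> <delay_seconds> <command> [args...]
wait_for() {
  local attempts="$1" delay="$2" i
  shift 2
  for i in $(seq 1 "$attempts"); do
    if "$@"; then
      return 0
    fi
    # Sleep only between attempts, not after the last failure.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Example with an 18 x 10s budget; the health probe is an assumption:
#   wait_for 18 10 docker run --rm --network iklimco-net alpine \
#     sh -c 'wget -qO- http://etcd-01:2379/health | grep -q "\"health\":\"true\""'
```

Returning a non-zero status on timeout lets the calling workflow step fail fast instead of continuing to the APISIX and MongoDB initialization steps against an unready cluster.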
|
||||||
|
|
||||||
### DB Node Placement Check
|
### DB Node Placement Check
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
@ -572,7 +584,7 @@ All tasks must run on the expected `iklim-db-*` nodes.
|
|||||||
|
|
||||||
### MongoDB Replica Set Initialization
|
### MongoDB Replica Set Initialization
|
||||||
|
|
||||||
Run once after the stack is deployed:
|
Handled by the workflow. Manual form for troubleshooting:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# From iklim-app-01 via overlay network:
|
# From iklim-app-01 via overlay network:
|
||||||
@ -596,7 +608,7 @@ App containers connect to DB services through the `iklimco-net` overlay network
|
|||||||
|
|
||||||
### MongoDB Replica Set Connection String
|
### MongoDB Replica Set Connection String
|
||||||
|
|
||||||
Variables in `env-prod/.env`:
|
Variables in StorageBox `prod/secrets/iklim.co/.env`:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
DATABASE_MONGODB_HOST=mongodb-01:27017,mongodb-02:27017,mongodb-03:27017
|
DATABASE_MONGODB_HOST=mongodb-01:27017,mongodb-02:27017,mongodb-03:27017
|
||||||
@ -613,7 +625,7 @@ mongodb://<user>:<password>@mongodb-01:27017,mongodb-02:27017,mongodb-03:27017/<
|
|||||||
|
|
||||||
### PostgreSQL — Patroni
|
### PostgreSQL — Patroni
|
||||||
|
|
||||||
Variables in `env-prod/.env`:
|
Variables in StorageBox `prod/secrets/iklim.co/.env`:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
DATABASE_POSTGRES_HOST=patroni-01:5432,patroni-02:5432,patroni-03:5432
|
DATABASE_POSTGRES_HOST=patroni-01:5432,patroni-02:5432,patroni-03:5432
|
||||||
@ -647,8 +659,7 @@ curl -s http://patroni-01:8008/primary
|
|||||||
In the prod cluster setup, `pg-proxy` and `mongo-proxy` are **not used**. For access from the office computer, the DB subnet is targeted directly.
|
In the prod cluster setup, `pg-proxy` and `mongo-proxy` are **not used**. For access from the office computer, the DB subnet is targeted directly.
|
||||||
|
|
||||||
### WireGuard Setting
|
### WireGuard Setting
|
||||||
`AllowedIPs` must be updated in the `.conf` file on the office computer:
|
`AllowedIPs` must be updated in the `.conf` file on the office computer: `AllowedIPs = 10.8.0.1/32, 10.20.20.0/24`
|
||||||
`AllowedIPs = 10.8.0.1/32, 10.20.20.0/24`
|
|
||||||
|
|
||||||
### Connection Parameters (Multi-Host)
|
### Connection Parameters (Multi-Host)
|
||||||
Modern database tools (DBeaver, Compass, etc.) should connect in a cluster-aware manner:
|
Modern database tools (DBeaver, Compass, etc.) should connect in a cluster-aware manner:
|
||||||
@ -660,7 +671,7 @@ Modern veritabanı araçları (DBeaver, Compass vb.) küme farkındalıklı bağ
|
|||||||
|
|
||||||
## Acceptance Criteria
|
## Acceptance Criteria
|
||||||
|
|
||||||
- `docker stack services iklimco` — 9 services visible (etcd-01/02/03, mongodb-01/02/03, patroni-01/02/03), all `1/1`
|
- `docker stack services iklimco` — etcd-01/02/03, mongodb-01/02/03, patroni-01/02/03 are visible and all target replicas are healthy
|
||||||
- `docker service ps iklimco_patroni-01/02/03` — each task runs on its expected `iklim-db-*` node
|
- `docker service ps iklimco_patroni-01/02/03` — each task runs on its expected `iklim-db-*` node
|
||||||
- `docker service ps iklimco_mongodb-01/02/03` — each task runs on its expected `iklim-db-*` node
|
- `docker service ps iklimco_mongodb-01/02/03` — each task runs on its expected `iklim-db-*` node
|
||||||
- `docker service ps iklimco_etcd-01/02/03` — each task runs on its expected `iklim-db-*` node
|
- `docker service ps iklimco_etcd-01/02/03` — each task runs on its expected `iklim-db-*` node
|
||||||
@ -16,7 +16,7 @@ In this model, if any manager/runner is lost, the other runners can pick up pipe
|
|||||||
|
|
||||||
## Runner Installation Model
|
## Runner Installation Model
|
||||||
|
|
||||||
The runner will not run as a Docker container. There is no Docker socket mount.
|
The runner will not run as a Docker container. It runs as a systemd service on the app nodes. Job containers start on Docker `bridge`; deploy workflows connect the job container to `iklimco-net` after the stack creates that network.
|
||||||
|
|
||||||
Installation:
|
Installation:
|
||||||
|
|
||||||
@ -33,7 +33,7 @@ If runner jobs use Docker CLI for deploy, the `gitea-runner` user needs access t
|
|||||||
Shared labels on all prod runners:
|
Shared labels on all prod runners:
|
||||||
|
|
||||||
```text
|
```text
|
||||||
prod-runner
|
prod-runner:docker://catthehacker/ubuntu:act-22.04
|
||||||
ubuntu-24.04
|
ubuntu-24.04
|
||||||
```
|
```
|
||||||
|
|
||||||
@ -86,20 +86,19 @@ For the GoDaddy API key: https://developer.godaddy.com/keys — create a **Produ
|
|||||||
|
|
||||||
### Gitea `PROD_FLOATING_IP` Variable
|
### Gitea `PROD_FLOATING_IP` Variable
|
||||||
|
|
||||||
For DNS automation, `PROD_FLOATING_IP` must be defined as a Gitea project variable. See the "Gitea Variable: PROD_FLOATING_IP" step in `06-prod-terraform-iaac.md`.
|
For DNS automation, `PROD_FLOATING_IP` must be defined as a Gitea project variable. See the "Gitea Variable: PROD_FLOATING_IP" step in `06-prod-terraform-iac.md`.
|
||||||
|
|
||||||
### Docker Secrets
|
### Docker Secrets
|
||||||
|
|
||||||
Before the infra stack is deployed, the following Docker secrets must be created on `iklim-app-01`. These secrets are referenced by `docker-stack-infra.prod.yml`; if they do not exist, stack deploy fails.
|
Before the infra stack is deployed, `rabbitmq_erlang_cookie` must exist as a Docker secret. The current prod workflow creates it in the `Create Infrastructure Docker Secrets` step if it is missing.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# RabbitMQ Erlang cluster cookie; must be the same on all RabbitMQ nodes:
|
# RabbitMQ Erlang cluster cookie; must be the same on all RabbitMQ nodes.
|
||||||
|
# The workflow does this automatically if the secret is missing:
|
||||||
openssl rand -hex 32 | docker secret create rabbitmq_erlang_cookie -
|
openssl rand -hex 32 | docker secret create rabbitmq_erlang_cookie -
|
||||||
```
|
```
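
The "create only if missing" behavior can be sketched as follows. This is an illustration, not the workflow's actual step; `ensure_rabbitmq_cookie` is a hypothetical name, and the function must run against a Swarm manager:

```bash
# Create the RabbitMQ Erlang cookie secret only when it does not exist yet.
# Must run against a Swarm manager.
ensure_rabbitmq_cookie() {
  if docker secret inspect rabbitmq_erlang_cookie >/dev/null 2>&1; then
    echo "rabbitmq_erlang_cookie already exists"
  else
    openssl rand -hex 32 \
      | docker secret create rabbitmq_erlang_cookie - >/dev/null || return 1
    echo "rabbitmq_erlang_cookie created"
  fi
}
```

Guarding the create call matters because `docker secret create` fails when a secret with the same name already exists, and regenerating the cookie would break an already formed RabbitMQ cluster.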
|
||||||
|
|
||||||
> The `vault_unseal_key` secret is created after Vault is started for the first time; see `roadmap/prod-env/07-vault-raft-plan.md` Step 3. It is not required for the first infra stack deploy; it is waited for until the health check is triggered.
|
> The `vault_unseal_key` secret is managed by `init/vault/vault-bootstrap.sh`. The bootstrap script creates a placeholder on first deploy, deploys `docker-stack-vault.yml`, initializes/unseals Vault, and rotates the secret to the real unseal key.
|
||||||
>
|
|
||||||
> This secret is also used during Vault restarts triggered by cert-reloader: when `cert-reloader` detects a certificate change, it runs `docker service update --force iklimco_vault`; while Vault containers restart, they read from the `vault_unseal_key` Docker secret and automatically unseal. If the secret is missing, Vault remains sealed after every certificate renewal.
|
|
||||||
|
|
||||||
Verify secrets:
|
Verify secrets:
|
||||||
|
|
||||||
@ -120,7 +119,7 @@ Before the deploy pipeline runs, the following template files must exist in the
|
|||||||
|
|
||||||
These files are created in the test environment (`test-env/04-swag-nginx-configs.md`); they are not created separately for prod. Template files are shared by both environments; prod-specific values are injected with environment variables during deploy.
|
These files are created in the test environment (`test-env/04-swag-nginx-configs.md`); they are not created separately for prod. Template files are shared by both environments; prod-specific values are injected with environment variables during deploy.
|
||||||
|
|
||||||
Verify that the `prod/secrets/iklim.co/.env.prod` file on StorageBox contains the following variables:
|
Verify that the `prod/secrets/iklim.co/.env` file on StorageBox contains the following variables:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
API_SUBDOMAIN=api.iklim.co
|
API_SUBDOMAIN=api.iklim.co
|
||||||
@ -129,11 +128,12 @@ RABBITMQ_SUBDOMAIN=rabbitmq.iklim.co
|
|||||||
GRAFANA_SUBDOMAIN=grafana.iklim.co
|
GRAFANA_SUBDOMAIN=grafana.iklim.co
|
||||||
RESTRICTED_IPS="78.187.87.109/32,95.70.151.248/32"
|
RESTRICTED_IPS="78.187.87.109/32,95.70.151.248/32"
|
||||||
SWAG_CERT_DIR=/mnt/storagebox/ssl
|
SWAG_CERT_DIR=/mnt/storagebox/ssl
|
||||||
SWAG_CONFIG_DIR=/mnt/storagebox/swag/config
|
SWAG_DNS_CONFIG_DIR=/mnt/storagebox/swag/dns-conf
|
||||||
SWAG_SITE_CONFS_DIR=/mnt/storagebox/swag/site-confs
|
SWAG_SITE_CONFS_DIR=/mnt/storagebox/swag/site-confs
|
||||||
|
SWAG_PROXY_CONFS_DIR=/mnt/storagebox/swag/proxy-confs
|
||||||
```
|
```
|
||||||
|
|
||||||
The pipeline sources these variables and renders the template files into the `$SWAG_SITE_CONFS_DIR` (`/mnt/storagebox/swag/site-confs`) directory. Because StorageBox is mounted on all app nodes, SWAG containers on every node read the same files even when the configuration is rendered on a single runner. Detail: `roadmap/prod-env/04-swag-nginx-configs.md`.
|
The pipeline sources these variables and renders the template files into the `$SWAG_SITE_CONFS_DIR` (`/mnt/storagebox/swag/site-confs`) directory. Because StorageBox is mounted on all app nodes, SWAG containers on every node read the same files even when the configuration is rendered on a single runner.
|
||||||
|
|
||||||
### APISIX Configuration
|
### APISIX Configuration
|
||||||
|
|
||||||
@ -194,27 +194,41 @@ All prod deploy workflows, including infra and microservices, must use the same
|
|||||||
| 2 | Prepare Folders | |
|
| 2 | Prepare Folders | |
|
||||||
| 3 | Set up SSH Key and Add to known_hosts | |
|
| 3 | Set up SSH Key and Add to known_hosts | |
|
||||||
| 4 | Update Apt Repository and Install Required Tools | `gettext tree jq` — `jq` is required for the GoDaddy DNS API |
|
| 4 | Update Apt Repository and Install Required Tools | `gettext tree jq` — `jq` is required for the GoDaddy DNS API |
|
||||||
| 5 | Fetch Service Secret Files | Fetch `.env.secrets.*` from StorageBox |
|
| 5 | Fetch Prod Env From Storagebox | Fetch `.env` and `.env.secrets.shared` |
|
||||||
| 6 | Initialize Workspace | Fetch `.env` and `.env.secrets.shared` from StorageBox; run `init-infra-dev.sh` |
|
| 6 | Fetch Service Secret Files | Fetch `.env.secrets.<svc>` and `.env.secrets.swag` |
|
||||||
| 7 | Upload Updated Secrets to Storagebox | |
|
| 7 | Prepare Database Init Files | Render PostgreSQL/MongoDB init templates |
|
||||||
| 8 | Provision Vault AppRole IDs and Docker Secrets | |
|
| 8 | Docker Login to Harbor | |
|
||||||
| 9 | Upload Updated Env to Storagebox | |
|
| 9 | Prepare SWAG Directories | Render `dns-conf` and `site-confs`; reload node-local SWAG if present |
|
||||||
| 10 | Prepare Init Files | Cert copy lines removed |
|
| 10 | Bootstrap Vault TLS Placeholder | Creates temporary cert only if missing |
|
||||||
| 11 | Initialize Docker Swarm | |
|
| 11 | Create Infrastructure Docker Secrets | Creates `rabbitmq_erlang_cookie` if missing |
|
||||||
| 12 | Docker Login to Harbor | |
|
| 12 | Deploy Swarm Stacks | `docker-stack-infra_db-prod.yml` |
|
||||||
| 13 | **Update DNS Records** * | GoDaddy API; `api/apigw/rabbitmq/grafana` A records; idempotent |
|
| 13 | Connect Runner to Overlay Network | Connects job container to `iklimco-net` |
|
||||||
| 14 | **Prepare SWAG Directories** * | `$SWAG_CONFIG_DIR/dns-conf`; renders nginx conf templates; reloads running SWAG |
|
| 14 | Initialize Production Infrastructure | Runs `init-infra-prod.sh`; this triggers Vault bootstrap and RabbitMQ setup |
|
||||||
| 15 | Bootstrap Vault TLS Placeholder | |
|
| 15 | Wait for Infrastructure Services | Waits for `iklimco_vault` and `iklimco_rabbitmq` |
|
||||||
| 16 | Deploy Swarm Stack | base + prod overlay together |
|
| 16 | Provision Vault AppRole IDs and Docker Secrets | Downloads service `vault-files`, runs `init/provision-all-services.sh` |
|
||||||
| 17 | **Wait for etcd** * | Waits until Patroni etcd (`etcd-01:2379`) is healthy |
|
| 17 | Upload Updated Secrets to Storagebox | Uploads `.env.secrets.*` and `.env` |
|
||||||
| 18 | **Run APISIX Init** * | `SPRING_PROFILES_ACTIVE=prod`; idempotent; writes to etcd |
|
| 18 | Wait for etcd | Waits for etcd health |
|
||||||
| 19 | **Bootstrap SWAG Certificate** * | Waits for SWAG to obtain the cert; copies it to `SWAG_CERT_DIR` |
|
| 19 | Run APISIX Init | `SPRING_PROFILES_ACTIVE=prod` |
|
||||||
| 20 | **Run Database Init Scripts** * | `postgresql`/`mongodb` Swarm VIP; SQL+JS init; idempotent |
|
| 20 | Bootstrap SWAG Certificate | Waits for SWAG and cert-reloader output in `SWAG_CERT_DIR` |
|
||||||
| 21 | Review Environment | |
|
| 21 | Initialize MongoDB Replica Set | `rs.initiate()` or missing-member `rs.add()` |
|
||||||
|
| 22 | Run Database Init Scripts | Patroni primary + MongoDB replica set; SQL+JS init |
|
||||||
|
| 23 | Update DNS Records | GoDaddy API; `api/apigw/rabbitmq/grafana` A records |
|
||||||
|
| 24 | Review Environment | |
|
||||||
|
|
||||||
### Removal of Cert Scp Lines
|
### Stack Placement Boundary
|
||||||
|
|
||||||
Lines removed from the `Initialize Workspace` step:
|
`docker-stack-infra_db-prod.yml` is intentionally a mixed infrastructure stack. The DB/cluster services in that file are placed on DB nodes and expose host-mode cluster ports:
|
||||||
|
|
||||||
|
- Patroni/PostgreSQL, MongoDB, and etcd run on `iklim-db-*` workers.
|
||||||
|
|
||||||
|
The service-node infrastructure in the same file remains overlay-only unless a reverse proxy or explicit published port is defined by the stack:
|
||||||
|
|
||||||
|
- Redis, Redis Sentinel, and RabbitMQ run on `node.labels.type == service` app/service nodes.
|
||||||
|
- Redis and RabbitMQ must not be treated as DB-node host-mode services.
|
||||||
|
|
||||||
|
### Historical Note: Removed Cert Scp Lines
|
||||||
|
|
||||||
|
Older workflow versions copied certificate files manually in an `Initialize Workspace` step. That step no longer exists in the current root prod workflow. The removed lines are kept here only as a historical reference:
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
# REMOVED — manual cert copy with scp is no longer required:
|
# REMOVED — manual cert copy with scp is no longer required:
|
||||||
@ -222,7 +236,7 @@ scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebo
|
|||||||
scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebox.de:prod/app/iklim.co/ssl/STAR.iklim.co_key.pem ./STAR.iklim.co_key.pem
|
scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebox.de:prod/app/iklim.co/ssl/STAR.iklim.co_key.pem ./STAR.iklim.co_key.pem
|
||||||
```
|
```
|
||||||
|
|
||||||
Line also removed from the `Prepare Init Files` step:
|
This line was also removed from the old `Prepare Init Files` step:
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
# REMOVED:
|
# REMOVED:
|
||||||
@ -231,97 +245,55 @@ sudo cp STAR.iklim.co.full.crt STAR.iklim.co_key.pem /opt/iklimco/ssl/
|
|||||||
|
|
||||||
The certificate is now obtained by SWAG from Let's Encrypt and written to the `SWAG_CERT_DIR` (`/mnt/storagebox/ssl/`) directory in the `Bootstrap SWAG Certificate` step. Later renewals are handled automatically by cert-reloader.
|
The certificate is now obtained by SWAG from Let's Encrypt and written to the `SWAG_CERT_DIR` (`/mnt/storagebox/ssl/`) directory in the `Bootstrap SWAG Certificate` step. Later renewals are handled automatically by cert-reloader.
|
||||||
|
|
||||||
### Bootstrap SWAG Certificate (Step 19)
|
### Bootstrap SWAG Certificate (Step 20)
|
||||||
|
|
||||||
On the first deploy, SWAG obtains the Let's Encrypt certificate with the GoDaddy DNS-01 challenge. This step waits for SWAG to obtain the certificate, for up to 10 minutes, and then copies it to the `SWAG_CERT_DIR` directory:
|
On the first deploy, SWAG obtains the Let's Encrypt certificate with the GoDaddy DNS-01 challenge. The current step waits for the Swarm `iklimco_swag` service to be running, then waits for `cert-reloader` to write `STAR.iklim.co.full.crt` to `SWAG_CERT_DIR`.
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
- name: Bootstrap SWAG Certificate
|
- name: Bootstrap SWAG Certificate
|
||||||
run: |
|
run: |
|
||||||
set -a; . ./.env; set +a
|
set -a; . ./.env; set +a
|
||||||
echo "Waiting for SWAG container to start..."
|
echo "Waiting for SWAG service..."
|
||||||
SWAG_CTR=""
|
docker service ps iklimco_swag --filter 'desired-state=running'
|
||||||
for i in $(seq 1 24); do
|
echo "Waiting for cert-reloader output in ${SWAG_CERT_DIR}..."
|
||||||
SWAG_CTR=$(docker ps -q -f name=iklimco_swag 2>/dev/null | head -1)
|
docker run --rm -v "${SWAG_CERT_DIR}:/ssl:ro" alpine \
|
||||||
[ -n "$SWAG_CTR" ] && break
|
test -f /ssl/STAR.iklim.co.full.crt
|
||||||
sleep 10
|
|
||||||
done
|
|
||||||
|
|
||||||
if [ -z "$SWAG_CTR" ]; then
|
|
||||||
echo "❌ SWAG container did not start"
|
|
||||||
exit 1
|
|
||||||
fi
|
|
||||||
|
|
||||||
CERT_PATH="/config/etc/letsencrypt/live/iklim.co/fullchain.pem"
|
|
||||||
echo "Waiting for cert (up to 10 min)..."
|
|
||||||
for i in $(seq 1 20); do
|
|
||||||
if docker exec "$SWAG_CTR" test -f "$CERT_PATH" 2>/dev/null; then
|
|
||||||
echo "✅ Cert obtained"
|
|
||||||
break
|
|
||||||
fi
|
|
||||||
echo " attempt $i/20 — waiting 30s..."
|
|
||||||
sleep 30
|
|
||||||
done
|
|
||||||
|
|
||||||
if ! docker exec "$SWAG_CTR" test -f "$CERT_PATH" 2>/dev/null; then
|
|
||||||
echo "❌ SWAG did not obtain cert. Logs:"
|
|
||||||
docker service logs iklimco_swag --tail 50
|
|
||||||
exit 1
|
|
||||||
fi
|
|
||||||
|
|
||||||
docker exec "$SWAG_CTR" cat "$CERT_PATH" | \
|
|
||||||
docker run --rm -i -v "${SWAG_CERT_DIR}:/output" alpine \
|
|
||||||
sh -c "cat > /output/STAR.iklim.co.full.crt && chmod 644 /output/STAR.iklim.co.full.crt"
|
|
||||||
docker exec "$SWAG_CTR" cat "/config/etc/letsencrypt/live/iklim.co/privkey.pem" | \
|
|
||||||
docker run --rm -i -v "${SWAG_CERT_DIR}:/output" alpine \
|
|
||||||
sh -c "cat > /output/STAR.iklim.co_key.pem && chmod 644 /output/STAR.iklim.co_key.pem"
|
|
||||||
echo "✅ Cert bootstrapped to ${SWAG_CERT_DIR}/"
|
|
||||||
working-directory: /workspace/iklim.co
|
working-directory: /workspace/iklim.co
|
||||||
```
|
```
|
||||||
|
|
||||||
After this step, certificate files exist inside `SWAG_CERT_DIR` (`/mnt/storagebox/ssl/`); Vault TLS reads these files. Later renewals are handled automatically by cert-reloader. When the pipeline runs again, this step only waits for the SWAG container to be ready; certificate issuance is managed by SWAG/cert-reloader within Let's Encrypt's 90-day cycle.
|
After this step, certificate files exist inside `SWAG_CERT_DIR` (`/mnt/storagebox/ssl/`). `cert-distributor` syncs these files to node-local `/opt/iklimco/ssl`, where Vault reads them. Later renewals are handled automatically by SWAG, cert-reloader, and cert-distributor.
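
The "wait for cert-reloader output" behaviour can be sketched as a bounded polling loop. This is only an illustration: `wait_for_file` is a hypothetical helper, the attempt count and delay are arbitrary, and it assumes `SWAG_CERT_DIR` is readable from the shell that runs it (the workflow itself checks the file through an `alpine` container).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the workflow): poll until a file exists
# or the attempts run out.
# Usage: wait_for_file <path> [attempts] [sleep_seconds]
wait_for_file() {
  local path="$1" attempts="${2:-20}" delay="${3:-30}" i
  for i in $(seq 1 "$attempts"); do
    if [ -f "$path" ]; then
      echo "found: $path"
      return 0
    fi
    echo "attempt $i/$attempts: $path not present yet"
    sleep "$delay"
  done
  # Every attempt failed: let the caller decide how to report it.
  return 1
}

# Example call (commented out so the sketch has no side effects):
# wait_for_file "${SWAG_CERT_DIR}/STAR.iklim.co.full.crt" 20 30
```

A non-zero return lets the step fail explicitly instead of continuing with a missing certificate.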
|
||||||
|
|
||||||
### Run Database Init Scripts (Step 20)
|
### Run Database Init Scripts (Step 22)
|
||||||
|
|
||||||
PostgreSQL and MongoDB init scripts run through Swarm overlay DNS service names (`postgresql`, `mongodb`):
|
PostgreSQL and MongoDB init scripts run after Patroni primary and MongoDB replica set readiness:
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
- name: Run Database Init Scripts
|
- name: Run Database Init Scripts
|
||||||
run: |
|
run: |
|
||||||
set -a; . ./.env; . ./.env.secrets.shared; set +a
|
set -a; . ./.env; . ./.env.secrets.shared; set +a
|
||||||
|
|
||||||
echo "⏳ Waiting for PostgreSQL..."
|
PG_URI="postgresql://${DATABASE_POSTGRES_ROOT_USER}@${DATABASE_POSTGRES_HOST}/postgres?connect_timeout=5&target_session_attrs=read-write"
|
||||||
until docker run --rm --network iklimco-net \
|
MONGO_URI="mongodb://${DATABASE_MONGODB_ROOT_USER}:${DATABASE_MONGODB_ROOT_PASSWD}@${DATABASE_MONGODB_HOST}/admin?${DATABASE_MONGODB_PARAMS}"
|
||||||
-e PGPASSWORD="${DATABASE_POSTGRES_ROOT_PASSWD}" \
|
|
||||||
postgis/postgis:18-3.6 \
|
|
||||||
pg_isready -h postgresql -U "${DATABASE_POSTGRES_ROOT_USER}" -q 2>/dev/null; do
|
|
||||||
sleep 5
|
|
||||||
done
|
|
||||||
for sql_file in $(ls ./init/postgresql/*.sql 2>/dev/null | sort); do
|
for sql_file in $(ls ./init/postgresql/*.sql 2>/dev/null | sort); do
|
||||||
echo "▶ $(basename "$sql_file")"
|
echo "▶ $(basename "$sql_file")"
|
||||||
docker run --rm -i --network iklimco-net \
|
docker run --rm -i --network iklimco-net \
|
||||||
-e PGPASSWORD="${DATABASE_POSTGRES_ROOT_PASSWD}" \
|
-e PGPASSWORD="${DATABASE_POSTGRES_ROOT_PASSWD}" \
|
||||||
postgis/postgis:18-3.6 \
|
postgis/postgis:18-3.6 \
|
||||||
psql -h postgresql -U "${DATABASE_POSTGRES_ROOT_USER}" < "$sql_file"
|
psql "$PG_URI" < "$sql_file"
|
||||||
done
|
done
|
||||||
|
|
||||||
echo "⏳ Waiting for MongoDB..."
|
|
||||||
until docker run --rm --network iklimco-net mongo:8.3.2 \
|
|
||||||
mongosh "mongodb://${DATABASE_MONGODB_ROOT_USER}:${DATABASE_MONGODB_ROOT_PASSWD}@mongodb/admin" \
|
|
||||||
--eval "db.runCommand({ping:1})" --quiet 2>/dev/null; do
|
|
||||||
sleep 5
|
|
||||||
done
|
|
||||||
for js_file in $(ls ./init/mongodb/*.js 2>/dev/null | sort); do
|
for js_file in $(ls ./init/mongodb/*.js 2>/dev/null | sort); do
|
||||||
echo "▶ $(basename "$js_file")"
|
echo "▶ $(basename "$js_file")"
|
||||||
docker run --rm -i --network iklimco-net mongo:8.3.2 \
|
docker run --rm -i --network iklimco-net -e MONGO_INIT_URI="${MONGO_URI}" "${IMAGE_MONGODB}" \
|
||||||
mongosh "mongodb://${DATABASE_MONGODB_ROOT_USER}:${DATABASE_MONGODB_ROOT_PASSWD}@mongodb/admin" \
|
sh -c 'cat > /tmp/init.js && mongosh "$MONGO_INIT_URI" --quiet --file /tmp/init.js' \
|
||||||
--quiet < "$js_file"
|
< "$js_file"
|
||||||
done
|
done
|
||||||
echo "✅ Database init scripts completed"
|
echo "✅ Database init scripts completed"
|
||||||
working-directory: /workspace/iklim.co
|
working-directory: /workspace/iklim.co
|
||||||
```
|
```
|
||||||
|
|
||||||
- `postgresql` and `mongodb`: Swarm VIP service names, resolved on the `iklimco-net` overlay; Patroni primary automatic routing happens at VIP level
|
- `DATABASE_POSTGRES_HOST`: multi-host Patroni target; the workflow uses `target_session_attrs=read-write` to reach the primary
|
||||||
|
- `DATABASE_MONGODB_HOST`: MongoDB replica set host list
|
||||||
- SQL files `./init/postgresql/*.sql` and JS files `./init/mongodb/*.js` are created in the `Prepare Init Files` step by the `init_postgresql`/`init_mongodb` functions in `common-functions-prod.sh`
|
- SQL files `./init/postgresql/*.sql` and JS files `./init/mongodb/*.js` are created in the `Prepare Init Files` step by the `init_postgresql`/`init_mongodb` functions in `common-functions-prod.sh`
|
||||||
- Idempotent: `CREATE IF NOT EXISTS` / `createCollection` semantics; runs safely again on later deploys
|
- Idempotent: `CREATE IF NOT EXISTS` / `createCollection` semantics; runs safely again on later deploys
|
||||||
|
|
||||||
@ -331,27 +303,19 @@ In prod, all 3 app nodes are manager + app worker, so services can be distribute
|
|||||||
|
|
||||||
### Microservices
|
### Microservices
|
||||||
|
|
||||||
Each microservice has two stack files:
|
Prod microservice workflows do not rebuild application images. They read `deploy/prod.env`, promote the tested Harbor digest to a stable prod tag, and call `swarm_service_update` with `deploy/docker-stack-service.yml`.
|
||||||
|
|
||||||
| File | Content | Environment |
|
For first deploy, `swarm_service_update` exports `SERVICE_IMAGE` and runs:
|
||||||
| --- | --- | --- |
|
|
||||||
| `BE-<Service>/docker-stack-service.yml` | Base definitions, `replicas: 1` | Test + Prod |
|
|
||||||
| `BE-<Service>/docker-stack-service.prod.yml` | `replicas: 3`, `max_replicas_per_node: 1` | Prod only |
|
|
||||||
|
|
||||||
Prod deploy command:
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
docker stack deploy \
|
docker stack deploy --with-registry-auth -c deploy/docker-stack-service.yml iklimco
|
||||||
-c BE-<Service>/docker-stack-service.yml \
|
|
||||||
-c BE-<Service>/docker-stack-service.prod.yml \
|
|
||||||
iklimco
|
|
||||||
```
|
```
|
||||||
|
|
||||||
`max_replicas_per_node: 1` is mandatory; without it, when the Swarm node count is lower than the replica count, Swarm places more than one replica on the same node.
|
For existing services it performs `docker service update` with `--update-order start-first` and `--update-failure-action rollback`.
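
The body of `swarm_service_update` is not shown here, so the following is only a sketch of what the existing-service path might look like with the two flags named above. `build_update_cmd` is a hypothetical dry-run helper that prints the command instead of executing it; the service and image names in the example are placeholders.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run builder: prints the `docker service update` command
# for an existing service rather than running it.
# Usage: build_update_cmd <service> <image>
build_update_cmd() {
  local service="$1" image="$2"
  printf '%s ' docker service update \
    --with-registry-auth \
    --update-order start-first \
    --update-failure-action rollback \
    --image "$image" "$service"
  echo
}

# Example with placeholder names:
# build_update_cmd iklimco_example registry.example/app:prod
```

`start-first` starts the replacement task before stopping the old one, and `rollback` reverts to the previous service spec if the update fails.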
|
||||||
|
|
||||||
### Infra Services
|
### Infra Services
|
||||||
|
|
||||||
`docker-stack-infra.yml` (base) and `docker-stack-infra.prod.yml` (overlay) are deployed together. The overlay overrides services such as Vault, APISIX, RabbitMQ, and Redis Sentinel with `replicas: 3` and `max_replicas_per_node: 1`. Detail: `Environment_Infrastructure/roadmap/prod-env/03-infra-stack-changes.md`.
|
The current prod infra stack is `docker-stack-infra_db-prod.yml`. Vault is not inside this stack; it is deployed separately by `vault-bootstrap.sh` using `docker-stack-vault.yml`.
|
||||||
|
|
||||||
#### cert-reloader and Vault Auto-Unseal
|
#### cert-reloader and Vault Auto-Unseal
|
||||||
|
|
||||||
@ -360,53 +324,28 @@ The `cert-reloader` sidecar service runs as `replicas: 1` inside the infra stack
|
|||||||
Certificate renewal flow:
|
Certificate renewal flow:
|
||||||
|
|
||||||
```
|
```
|
||||||
SWAG renews the certificate -> writes it to SWAG_CONFIG_DIR (/mnt/storagebox/swag/config)
|
SWAG renews the certificate -> stores it inside the SWAG named volume
|
||||||
cert-reloader detects the MD5 change
|
cert-reloader detects the MD5 change
|
||||||
-> copies it to /mnt/storagebox/ssl/ directory (common mount on all app nodes)
|
-> copies it to /mnt/storagebox/ssl/ directory (StorageBox)
|
||||||
|
cert-distributor syncs it to /opt/iklimco/ssl on service nodes
|
||||||
-> runs docker service update --force iklimco_vault
|
-> runs docker service update --force iklimco_vault
|
||||||
Vault (3 replicas) restarts
|
Vault (3 replicas) restarts
|
||||||
-> each instance reads the new certificate from the /mnt/storagebox/ssl/ mount
|
-> each instance reads the new certificate from /opt/iklimco/ssl
|
||||||
-> healthcheck checks sealed status every 30 seconds
|
-> entrypoint retry-unseal loop reads from the vault_unseal_key Docker secret and unseals
|
||||||
-> if sealed: reads from the vault_unseal_key Docker secret and automatically unseals
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The auto-unseal mechanism is provided by the Vault healthcheck inside `docker-stack-infra.yml`:
|
The 3 Vault replicas run their own retry-unseal loop independently. The certificate renewal -> distribution -> restart -> unseal chain requires no manual intervention after bootstrap.
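
The actual entrypoint script is not reproduced in this document. As a rough sketch of what a retry-unseal loop could look like, assuming a hypothetical `retry_unseal` function, arbitrary retry values, and the `vault` CLI being on `PATH` inside the container:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a retry-unseal loop; not the real entrypoint.
# Usage: retry_unseal [key_file] [attempts] [sleep_seconds]
retry_unseal() {
  local key_file="${1:-/run/secrets/vault_unseal_key}"
  local attempts="${2:-30}" delay="${3:-5}" i
  for i in $(seq 1 "$attempts"); do
    # `vault status` exits 0 when unsealed, 2 when sealed, 1 on error.
    if vault status >/dev/null 2>&1; then
      return 0
    fi
    # Only try to unseal once the secret is mounted and non-empty.
    if [ -s "$key_file" ]; then
      vault operator unseal "$(cat "$key_file")" >/dev/null 2>&1 || true
    fi
    sleep "$delay"
  done
  return 1
}
```

Because each replica would run such a loop on its own, a restart triggered by cert-reloader needs no coordination between nodes.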
|
||||||
|
|
||||||
```yaml
|
|
||||||
healthcheck:
|
|
||||||
test:
|
|
||||||
- "CMD"
|
|
||||||
- "sh"
|
|
||||||
- "-c"
|
|
||||||
- >-
|
|
||||||
vault status -format=json 2>/dev/null | grep -q '"sealed":false' ||
|
|
||||||
vault operator unseal $$(cat /run/secrets/vault_unseal_key 2>/dev/null)
|
|
||||||
interval: 30s
|
|
||||||
timeout: 10s
|
|
||||||
start_period: 15s
|
|
||||||
retries: 5
|
|
||||||
```
|
|
||||||
|
|
||||||
The 3 replicas run their own healthchecks independently; all of them unseal separately. The certificate renewal -> restart -> auto-unseal chain requires no manual intervention. Detail: `roadmap/prod-env/06-cert-reloader.md`.
|
|
||||||
|
|
||||||
#### Vault Raft Configuration
|
#### Vault Raft Configuration
|
||||||
|
|
||||||
Vault is defined as 3 replicas with Raft storage in the `docker-stack-infra.prod.yml` overlay:
|
Vault is defined as 3 replicas with Raft storage in `docker-stack-vault.yml`:
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
vault:
|
vault:
|
||||||
environment:
|
|
||||||
VAULT_LOCAL_CONFIG: >-
|
|
||||||
{"api_addr":"https://vault.iklim.co:8200",
|
|
||||||
"cluster_addr":"https://{{ .Node.Hostname }}:8201",
|
|
||||||
"storage":{"raft":{"path":"/vault/file","node_id":"{{ .Node.Hostname }}"}},
|
|
||||||
"listener":[{"tcp":{"address":"0.0.0.0:8200",
|
|
||||||
"tls_cert_file":"/vault/certs/STAR.iklim.co.full.crt",
|
|
||||||
"tls_key_file":"/vault/certs/STAR.iklim.co_key.pem"}}],
|
|
||||||
"default_lease_ttl":"168h","max_lease_ttl":"720h","ui":true}
|
|
||||||
volumes:
|
volumes:
|
||||||
- /opt/iklimco/vault/data:/vault/file # separate host path on each node — created with Ansible
|
- vault-data-vl:/vault/file
|
||||||
- ${SWAG_CERT_DIR}:/vault/certs:ro # StorageBox shared — all nodes see the same path
|
- vault-logs-vl:/vault/logs
|
||||||
|
- /opt/iklimco/ssl:/vault/certs:ro
|
||||||
deploy:
|
deploy:
|
||||||
mode: replicated
|
mode: replicated
|
||||||
replicas: 3
|
replicas: 3
|
||||||
@ -416,59 +355,37 @@ vault:
|
|||||||
- node.labels.type == service
|
- node.labels.type == service
|
||||||
```
|
```
|
||||||
|
|
||||||
`{{ .Node.Hostname }}` is a Docker Swarm Go template; it gives each Vault instance a unique `node_id` and `cluster_addr`. Because `/opt/iklimco/vault/data` is a host path volume, it is not an overlay volume; it must be created separately on each app node during Ansible bootstrap. See `07-prod-ansible-bootstrap.md` — Node Directory Role. Detail: `roadmap/prod-env/07-vault-raft-plan.md`.
|
The Vault stack uses `vault-template-v2.json`, `vault_unseal_key`, and the `iklimco-net` external network. Bootstrap and unseal are handled by `init/vault/vault-bootstrap.sh`.
|
||||||
|
|
||||||
## Vault Raft Cluster Initial Setup
|
## Vault Raft Cluster Initial Setup
|
||||||
|
|
||||||
After the infra stack is deployed for the first time, the Vault Raft cluster is initialized manually once. These steps are not repeated on every deploy; they are applied only during initial setup.
|
Vault Raft cluster setup is no longer a manual post-deploy procedure. It is handled by `init/vault/vault-bootstrap.sh`, called through `init-infra-prod.sh` by the root prod workflow.
|
||||||
|
|
||||||
### Step 1 — Stack Deploy
|
### Step 1 — Stack Deploy
|
||||||
|
|
||||||
```bash
|
The bootstrap script deploys:
|
||||||
docker stack deploy -c docker-stack-infra.yml -c docker-stack-infra.prod.yml iklimco
|
|
||||||
```
|
|
||||||
|
|
||||||
3 Vault containers start. The first initialized node becomes the leader.
|
```bash
|
||||||
|
docker stack deploy --with-registry-auth -c docker-stack-vault.yml iklimco
|
||||||
|
```
|
||||||
|
|
||||||
### Step 2 — Vault Initialize (iklim-app-01)
|
### Step 2 — Vault Initialize (iklim-app-01)
|
||||||
|
|
||||||
```bash
|
The script runs `vault operator init -key-shares=1 -key-threshold=1` if Vault is not initialized. It stores bootstrap output under `/tmp/vault-bootstrap/main-vault-init.txt` during the run.
|
||||||
VAULT_CTR=$(docker ps -q -f name=iklimco_vault)
|
|
||||||
docker exec -it "$VAULT_CTR" vault operator init
|
|
||||||
```
|
|
||||||
|
|
||||||
Store the unseal keys and root token from the output securely. Save the unseal key as a Docker secret:
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
echo -n "<unseal-key>" | docker secret create vault_unseal_key -
|
echo "bootstrap" | docker secret create vault_unseal_key -
|
||||||
```
|
```
|
||||||
|
|
||||||
> After this step, the `vault_unseal_key` secret exists. During later certificate renewals, cert-reloader restarts Vault; the healthcheck reads this secret and automatically unseals, so no manual intervention is required.
|
Then it rotates `vault_unseal_key` to the real unseal key and unseals the leader and peers.
|
||||||
|
|
||||||
### Step 3 — Unseal the Leader
|
### Step 3 — Unseal the Leader
|
||||||
|
|
||||||
```bash
|
No manual unseal command is required in the normal path.
|
||||||
docker exec -it "$VAULT_CTR" vault operator unseal
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 4 — Join the Other Nodes to the Raft Cluster
|
### Step 4 — Join the Other Nodes to the Raft Cluster
|
||||||
|
|
||||||
The Vault containers on `iklim-app-02` and `iklim-app-03` join the cluster:
|
Peer join and peer unseal are handled by `vault-bootstrap.sh`.
|
||||||
|
|
||||||
```bash
|
|
||||||
docker exec -it <vault-on-iklim-app-02> vault operator raft join \
|
|
||||||
https://vault.iklim.co:8200
|
|
||||||
|
|
||||||
docker exec -it <vault-on-iklim-app-03> vault operator raft join \
|
|
||||||
https://vault.iklim.co:8200
|
|
||||||
```
|
|
||||||
|
|
||||||
Each node is also unsealed after it joins:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
docker exec -it <vault-on-iklim-app-02> vault operator unseal
|
|
||||||
docker exec -it <vault-on-iklim-app-03> vault operator unseal
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 5 — Verify the Cluster
|
### Step 5 — Verify the Cluster
|
||||||
|
|
||||||
@ -646,20 +563,20 @@ Expected: valid JSON weather response.
|
|||||||
- `rabbitmq_erlang_cookie` appears in `docker secret ls`.
|
- `rabbitmq_erlang_cookie` appears in `docker secret ls`.
|
||||||
- The `ssl`, `swag/config`, `swag/site-confs`, `grafana/data`, and `precipitation/images` directories exist on StorageBox; see `07-prod-ansible-bootstrap.md` — StorageBox Directory Structure.
|
- The `ssl`, `swag/config`, `swag/site-confs`, `grafana/data`, and `precipitation/images` directories exist on StorageBox; see `07-prod-ansible-bootstrap.md` — StorageBox Directory Structure.
|
||||||
- The `template/swag/site-confs/default.conf`, `api.conf.tpl`, `apigw.conf.tpl`, `rabbitmq.conf.tpl`, and `grafana.conf.tpl` template files exist in the repo.
|
- The `template/swag/site-confs/default.conf`, `api.conf.tpl`, `apigw.conf.tpl`, `rabbitmq.conf.tpl`, and `grafana.conf.tpl` template files exist in the repo.
|
||||||
- StorageBox `prod/secrets/iklim.co/.env.prod` has correct values for `API_SUBDOMAIN`, `APIGW_SUBDOMAIN`, `RABBITMQ_SUBDOMAIN`, `GRAFANA_SUBDOMAIN`, `RESTRICTED_IPS`, `SWAG_CERT_DIR`, `SWAG_CONFIG_DIR`, and `SWAG_SITE_CONFS_DIR`.
|
- StorageBox `prod/secrets/iklim.co/.env` has correct values for `API_SUBDOMAIN`, `APIGW_SUBDOMAIN`, `RABBITMQ_SUBDOMAIN`, `GRAFANA_SUBDOMAIN`, `RESTRICTED_IPS`, `SWAG_CERT_DIR`, `SWAG_DNS_CONFIG_DIR`, `SWAG_SITE_CONFS_DIR`, and `SWAG_PROXY_CONFS_DIR`.
|
||||||
- After the first deploy, `docker exec $(docker ps -q -f name=iklimco_swag) nginx -t` succeeds and returns `syntax is ok`.
|
- After the first deploy, `docker exec $(docker ps -q -f name=iklimco_swag) nginx -t` succeeds and returns `syntax is ok`.
|
||||||
- The output of `cat /mnt/storagebox/swag/site-confs/api.conf | grep server_name` contains `server_name api.iklim.co;`.
|
- The output of `cat /mnt/storagebox/swag/site-confs/api.conf | grep server_name` contains `server_name api.iklim.co;`.
|
||||||
- The `ssls/1` PUT block does not exist inside `init/apisix-core/init.sh`.
|
- The `ssls/1` PUT block does not exist inside `init/apisix-core/init.sh`.
|
||||||
- The `registry.tarla.io/iklimco/custom-apisix:3.12.0` image exists in Harbor and its `config.yaml` contains `real_ip_header`, `real_ip_recursive`, and `set_real_ip_from` (covering `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`) configuration.
|
- The `registry.tarla.io/iklimco/custom-apisix:3.12.0` image exists in Harbor and its `config.yaml` contains `real_ip_header`, `real_ip_recursive`, and `set_real_ip_from` (covering `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`) configuration.
|
||||||
- After the first deploy, real client IP appears in APISIX access logs, not the SWAG overlay IP: `docker exec $(docker ps -q -f name=iklimco_apisix | head -1) tail -5 /usr/local/apisix/logs/access.log`
|
- After the first deploy, real client IP appears in APISIX access logs, not the SWAG overlay IP: `docker exec $(docker ps -q -f name=iklimco_apisix | head -1) tail -5 /usr/local/apisix/logs/access.log`
|
||||||
- `docker service ps iklimco_cert-reloader` shows that the service is running.
|
- `docker service ps iklimco_cert-reloader` shows that the service is running.
|
||||||
- `docker service ls` does not contain `iklimco_etcd`, `iklimco_postgresql`, `iklimco_mongodb`, `iklimco_pg-proxy`, or `iklimco_mongo-proxy`; they are removed by the post-deploy step in `deploy-prod.yml` (base stack services superseded by the `iklim-db` stack or deprecated in prod).
|
- `docker service ls` contains the current prod infra services from `docker-stack-infra_db-prod.yml` and the separate `iklimco_vault` service from `docker-stack-vault.yml`; deprecated base-stack services such as `iklimco_postgresql`, `iklimco_mongodb`, `iklimco_pg-proxy`, and `iklimco_mongo-proxy` are not present.
|
||||||
- The output of `docker service logs iklimco_cert-reloader --tail 20` contains `[cert-reloader] started` and has no error lines.
|
- The output of `docker service logs iklimco_cert-reloader --tail 20` contains `[cert-reloader] started` and has no error lines.
|
||||||
- The `notAfter` date of the Vault TLS endpoint certificate matches `/mnt/storagebox/ssl/STAR.iklim.co.full.crt`: `docker exec $(docker ps -q -f name=iklimco_vault | head -1) sh -c 'echo | openssl s_client -connect vault.iklim.co:8200 2>/dev/null | openssl x509 -noout -dates'`
|
- The `notAfter` date of the Vault TLS endpoint certificate matches `/mnt/storagebox/ssl/STAR.iklim.co.full.crt`: `docker exec $(docker ps -q -f name=iklimco_vault | head -1) sh -c 'echo | openssl s_client -connect vault.iklim.co:8200 2>/dev/null | openssl x509 -noout -dates'`
|
||||||
- `vault operator raft list-peers` returns 3 peers: 1 leader, 2 followers.
|
- `vault operator raft list-peers` returns 3 peers: 1 leader, 2 followers.
|
||||||
- The `vault_unseal_key` Docker secret exists and appears in `docker secret ls`.
|
- The `vault_unseal_key` Docker secret exists and appears in `docker secret ls`.
|
||||||
- 3 Vault containers are not sealed: `docker exec $(docker ps -q -f name=iklimco_vault | head -1) vault status | grep Sealed` -> `Sealed false`.
|
- 3 Vault containers are not sealed: `docker exec $(docker ps -q -f name=iklimco_vault | head -1) vault status | grep Sealed` -> `Sealed false`.
|
||||||
- The first deploy pipeline successfully completes all 21 steps; the `Review Environment` step succeeds.
|
- The first deploy pipeline successfully completes all current root workflow steps; the `Review Environment` step succeeds.
|
||||||
- After the `Bootstrap SWAG Certificate` step, `ls /mnt/storagebox/ssl/` -> `STAR.iklim.co.full.crt` and `STAR.iklim.co_key.pem` exist.
|
- After the `Bootstrap SWAG Certificate` step, `ls /mnt/storagebox/ssl/` -> `STAR.iklim.co.full.crt` and `STAR.iklim.co_key.pem` exist.
|
||||||
- The `Run Database Init Scripts` step completes without error; PostgreSQL and MongoDB are healthy and init scripts are applied.
|
- The `Run Database Init Scripts` step completes without error; PostgreSQL and MongoDB are healthy and init scripts are applied.
|
||||||
- In the output of `docker service ls --filter label=project=co.iklim`, all infra services show `X/X`.
|
- In the output of `docker service ls --filter label=project=co.iklim`, all infra services show `X/X`.
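
The `X/X` check in the last item can be scripted. A minimal sketch, assuming a hypothetical `check_replicas` helper that reads `<name> <running>/<desired>` lines on stdin; the `docker service ls` call is left as a comment so the helper can be tried without a Swarm.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads "<name> <running>/<desired> [extra]" lines on
# stdin, prints every service whose running count differs from the desired
# count, and returns non-zero if any such service exists.
check_replicas() {
  local name replicas rc=0
  while read -r name replicas _; do
    [ -n "$name" ] || continue
    if [ "${replicas%%/*}" != "${replicas##*/}" ]; then
      echo "not converged: $name ($replicas)"
      rc=1
    fi
  done
  return "$rc"
}

# Intended usage on a manager node:
# docker service ls --filter label=project=co.iklim \
#   --format '{{.Name}} {{.Replicas}}' | check_replicas
```

Any suffix after the replica count, such as `(max 1 per node)`, is ignored by the third `read` variable.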
|
||||||