diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..db5f27a --- /dev/null +++ b/.gitignore @@ -0,0 +1,48 @@ +# Terraform local/runtime files +.terraform/ +*.tfstate +*.tfstate.* +crash.log +crash.*.log +override.tf +override.tf.json +*_override.tf +*_override.tf.json + +# Terraform secret variable files +*.tfvars +*.tfvars.json +terraform.tfvars +terraform.tfvars.json + +# Ansible local/runtime files +*.retry +.ansible/ +ansible-vault-password* +vault-password* + +# Secret material +.env +.env.* +!.env.example +secrets/ +secret/ +*.pem +*.key +id_rsa +id_rsa.pub +id_ed25519 +id_ed25519.pub +*_private_key +*_private_key.pub + +# Gitea runner tokens/config generated with secrets +act_runner.token +gitea-runner-registration-token* +runner-registration-token* +runner-config.secret.yaml + +# OS/editor noise +.DS_Store +*.swp +*.swo diff --git a/roadmap/prod-env/01-swarm-init-multinode.md b/roadmap/prod-env/01-swarm-init-multinode.md new file mode 100644 index 0000000..f8c88e0 --- /dev/null +++ b/roadmap/prod-env/01-swarm-init-multinode.md @@ -0,0 +1,101 @@ +# 01 — Docker Swarm Init (Prod — Multi-Node) + +## Context +- **Repo:** `iklim.co` root +- **Environment:** prod +- **Topology:** + - 3 × service nodes — all act as **Swarm managers AND app workers** (Raft quorum: 1 can fail) + - 3 × DB nodes — **NOT part of Docker Swarm** (separate DB cluster, out of scope) +- All 6 nodes are in the same private network. +- Pipeline trigger: push to `prod-env` branch → Gitea runner on `prod-runner` (first service node). +- Swarm has 3 nodes total; all are manager-eligible and carry workloads (no dedicated worker-only nodes). 
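The quorum note above ("1 can fail") follows directly from Raft majority arithmetic; a minimal sketch:

```shell
# Raft needs a majority of managers; with N managers, floor((N-1)/2) can fail.
managers=3                            # the three service nodes
quorum=$(( managers / 2 + 1 ))        # votes needed to elect/keep a leader
tolerance=$(( (managers - 1) / 2 ))   # managers that may fail without losing quorum
echo "managers=$managers quorum=$quorum tolerance=$tolerance"
# → managers=3 quorum=2 tolerance=1
```

This is also why 2 managers are worse than 1: quorum becomes 2, so a single failure stalls the cluster.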
## Node labeling plan

| Node | Role | Swarm role | Labels |
|------|------|------------|--------|
| service-1 | API services, SWAG, Vault | Manager + Worker | `type=service` |
| service-2 | API services replicas | Manager + Worker | `type=service` |
| service-3 | API services replicas | Manager + Worker | `type=service` |

> DB nodes (`db-1/2/3`) are **not part of Docker Swarm**. They run as a separate cluster
> and are provisioned independently. No Swarm join or label step applies to them.

## Step 1 — Init Swarm on service-1 (the prod-runner node)

```bash
MANAGER_IP=$(hostname -I | awk '{print $1}')
if ! docker info --format '{{.Swarm.LocalNodeState}}' | grep -q "active"; then
  docker swarm init --advertise-addr "$MANAGER_IP"
  echo "✅ Swarm initialized on $MANAGER_IP"
else
  echo "ℹ️ Swarm already active"
fi
```

## Step 2 — Get manager join token

```bash
docker swarm join-token manager   # for service-2, service-3
```

Save this token — it is needed on service-2 and service-3.

## Step 3 — Join service-2 and service-3 as managers

SSH into service-2 and service-3 and run (substitute the token from Step 2 and service-1's private IP):
```bash
docker swarm join --token <MANAGER_TOKEN> <SERVICE_1_PRIVATE_IP>:2377
```

## Step 4 — Label all Swarm nodes

On service-1, after service-2 and service-3 have joined:

```bash
for node in service-1 service-2 service-3; do
  docker node update --label-add type=service "$node"
done
```

> Replace `service-1`, etc. with the actual node hostnames shown in `docker node ls`.
> DB nodes are not in Swarm — no join or label step for them.

## Step 5 — Verify

```bash
docker node ls
```

Expected: 3 nodes, all with `MANAGER STATUS` = `Leader` or `Reachable`.
All 3 nodes remain in `AVAILABILITY=Active` (not drained) so they also carry workloads.

```bash
docker node inspect service-1 --format '{{.Spec.Labels}}'
```

Expected: `map[type:service]`.

## Step 6 — Confirm `init/swarm-init.sh` multi-node awareness

The script is idempotent (skips init if already active).
Verify: + +```bash +grep -n "swarm init\|swarm join" init/swarm-init.sh +``` + +The prod pipeline runs on service-1 only. service-2/3 are joined via Ansible (`swarm` role), +not via the Gitea pipeline. + +## Placement constraints used in `docker-stack-infra.yml` + +| Constraint | Resolves to | +|------------|-------------| +| `node.role == manager` | service-1, service-2, service-3 | +| `node.labels.type == service` | service-1, service-2, service-3 | + +SWAG, Vault, cert-reloader: pinned to `node.role == manager`. +Microservices: no constraint (distributed across all 3 service nodes by Swarm scheduler). + +> `node.labels.type == db` constraint is **not used** — DB nodes are not in Swarm. +> PostgreSQL and MongoDB run outside Swarm as a separately managed cluster. diff --git a/roadmap/prod-env/02-godaddy-credentials.md b/roadmap/prod-env/02-godaddy-credentials.md new file mode 100644 index 0000000..48584e5 --- /dev/null +++ b/roadmap/prod-env/02-godaddy-credentials.md @@ -0,0 +1,63 @@ +# 02 — GoDaddy DNS Credentials for SWAG (Prod) + +## Context +Identical to test-env-setup/02, except the storagebox path is `prod/` instead of `test/`. + +## ⚠️ Security — Rotate credentials before use + +If credentials were shared in any chat log, Slack message, or email, **revoke them immediately**: +1. Go to: https://developer.godaddy.com/keys +2. Revoke the exposed key +3. Create a new Production key pair + +**Never commit credentials to the repository.** + +## Step 1 — Add credentials to storagebox `.env.secrets.shared` (prod path) + +Open the file at storagebox path: +``` +prod/secrets/iklim.co/.env.secrets.shared +``` + +Add: +```bash +GODADDY_KEY= +GODADDY_SECRET= +``` + +## Step 2 — Repo template file + +Same file as test: `swag/dns-conf/godaddy.ini.tpl` (already created in test step 02). +No additional action needed in the repo. 
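Step 2's template becomes the final `godaddy.ini` purely by variable substitution. A self-contained sketch of that rendering using plain shell expansion; the pipeline itself uses `envsubst`, and the key/secret values here are dummies:

```shell
# Dummy values standing in for the real secrets sourced from storagebox.
GODADDY_KEY=demo-key
GODADDY_SECRET=demo-secret

# The template body, as in swag/dns-conf/godaddy.ini.tpl.
tpl='dns_godaddy_key = ${GODADDY_KEY}
dns_godaddy_secret = ${GODADDY_SECRET}'

# Demo-only expansion via a heredoc. Never eval a real secret file:
# envsubst does plain substitution and executes nothing.
rendered=$(eval "cat <<EOF
$tpl
EOF")
echo "$rendered"
```

The rendered output contains `dns_godaddy_key = demo-key`, mirroring what `envsubst` writes to `/opt/iklimco/swag/dns-conf/godaddy.ini`.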
## Step 3 — (Handled by pipeline) Write credentials file on prod host

The deploy pipeline (see `08-deploy-pipeline-update.md`) runs on service-1:

```bash
mkdir -p /opt/iklimco/swag/dns-conf
envsubst < swag/dns-conf/godaddy.ini.tpl > /opt/iklimco/swag/dns-conf/godaddy.ini
chmod 600 /opt/iklimco/swag/dns-conf/godaddy.ini
```

## Step 4 — GoDaddy A records for prod subdomains

In the GoDaddy DNS panel for `iklim.co`, add/update A records pointing to service-1's public IP:

| Record | Value |
|--------|-------|
| `api` | `<service-1 public IP>` |
| `apigw` | `<service-1 public IP>` |
| `rabbitmq` | `<service-1 public IP>` |
| `grafana` | `<service-1 public IP>` |

> Swarm's routing mesh means any node IP would work, but service-1 is the designated
> entry point (runs SWAG). Using a single IP keeps DNS simple.
>
> For HA: add a load balancer or use Hetzner's floating IP in front of the 3 service nodes.
> DNS then points to the floating IP. This is a future improvement.

## Notes
- Test and prod SWAG instances both obtain `*.iklim.co` independently from Let's Encrypt.
  There is no conflict — they use the same domain on different servers.
- `DNSPROPAGATION=90` covers GoDaddy's typical 30–90s propagation delay.

diff --git a/roadmap/prod-env/03-infra-stack-changes.md b/roadmap/prod-env/03-infra-stack-changes.md
new file mode 100644
index 0000000..45e3a4d
--- /dev/null
+++ b/roadmap/prod-env/03-infra-stack-changes.md
@@ -0,0 +1,98 @@

# 03 — docker-stack-infra.yml Changes (Prod)

## Context
- **File:** `docker-stack-infra.yml` (repo root — shared between test and prod)
- All changes from `test-env-setup/03-infra-stack-changes.md` apply here identically.
- **Additional prod-specific changes:**
  - PostgreSQL and MongoDB placement constraints point to `type=db` nodes.
  - Microservices have no constraint (distributed across service nodes by Swarm).
  - Replica counts for stateless services are increased.
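How a label constraint resolves can be sanity-checked before deploying. A tiny, self-contained simulation of `node.labels.type == db` matching; the hostnames and labels below are illustrative sample data, not real `docker node ls` output:

```shell
# Each line: <hostname> <label>, mimicking node metadata from `docker node inspect`.
nodes='service-1 type=service
service-2 type=service
service-3 type=service
db-1 type=db
db-2 type=db
db-3 type=db'

# Swarm schedules a `node.labels.type == db` service only onto matching nodes.
matched=$(echo "$nodes" | awk '$2 == "type=db" { print $1 }')
echo "$matched"
# → db-1, db-2, db-3 (one per line)
```

If no node matches a constraint, the service's tasks stay in `pending` indefinitely, so checking labels first avoids a silent stall.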
+ +## Step 1 — Apply all test-env changes first + +Follow every step in `test-env-setup/03-infra-stack-changes.md`: +- Add `swag` service +- Add `cert-reloader` service +- Remove published ports for vault, apisix, rabbitmq, prometheus, grafana, apisix-dashboard +- Add `swag-vl` volume + +## Step 2 — Update PostgreSQL placement constraint + +Change `postgres` service placement to use the `type=db` label: + +```yaml +# CHANGE in postgres service: + placement: + constraints: + - node.labels.type == db +``` + +## Step 3 — Update MongoDB placement constraint + +```yaml +# CHANGE in mongo service: + placement: + constraints: + - node.labels.type == db +``` + +## Step 4 — Pin Vault to manager node (initial prod — single instance) + +Vault starts as a single instance pinned to the manager node. +Raft cluster migration is handled separately in `07-vault-raft-plan.md`. + +```yaml +# Vault placement stays as: + placement: + constraints: + - node.role == manager +``` + +## Step 5 — Increase APISIX replicas for prod + +```yaml +# CHANGE in apisix service deploy block: + mode: replicated + replicas: 2 # was 1 +``` + +APISIX is stateless (config in etcd) — multiple replicas are safe. +Swarm load-balances SWAG's requests across APISIX replicas via VIP. + +## Step 6 — etcd: 3-node cluster for prod + +For prod, etcd should run as a 3-node cluster (minimum for Raft quorum). +The current single-instance etcd definition needs to be replaced with a 3-node +StatefulSet-style setup using separate service definitions or a dedicated +`docker-stack-etcd.yml`. + +> **Scope note:** etcd clustering for prod is complex and out of scope for initial launch. +> Deploy with single etcd for initial prod launch. Add etcd clustering as a follow-up task. +> Track in: `Technical Debt/TODO.md` + +## Step 7 — Verify the complete file + +After all edits, validate the YAML: + +```bash +docker stack config -c docker-stack-infra.yml > /dev/null && echo "✅ YAML valid" +``` + +No output errors = valid. 
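Step 5 relies on the scheduler spreading the two APISIX replicas on a best-effort basis. If a hard guarantee that they land on different nodes is wanted, Swarm's `max_replicas_per_node` placement option (compose file format 3.8+) can enforce it; a sketch, not currently part of the stack file:

```yaml
  apisix:
    deploy:
      mode: replicated
      replicas: 2
      placement:
        max_replicas_per_node: 1   # at most one APISIX task per node
```

With this set, if only one node is available the second replica stays `pending` instead of co-locating, which is usually the preferable failure mode for a gateway.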
+ +## Placement summary for prod + +| Service | Placement | +|---------|-----------| +| swag | `node.role == manager` | +| cert-reloader | `node.role == manager` | +| vault | `node.role == manager` | +| apisix (2 replicas) | no constraint (any node) | +| apisix-dashboard | no constraint | +| postgres | `node.labels.type == db` | +| mongo | `node.labels.type == db` | +| redis | `node.role == manager` | +| rabbitmq | `node.role == manager` | +| etcd | `node.role == manager` | +| prometheus | `node.role == manager` | +| grafana | `node.role == manager` | diff --git a/roadmap/prod-env/04-swag-nginx-configs.md b/roadmap/prod-env/04-swag-nginx-configs.md new file mode 100644 index 0000000..94abed5 --- /dev/null +++ b/roadmap/prod-env/04-swag-nginx-configs.md @@ -0,0 +1,71 @@ +# 04 — SWAG Nginx Proxy Configs (Prod) + +## Context +Same template files as test (`swag/proxy-confs/*.conf.tpl`), different env vars. +The pipeline processes templates with prod-specific subdomain values. + +## Required env vars (in `.env` on storagebox `prod/secrets/iklim.co/.env.prod`) + +```bash +API_SUBDOMAIN=api.iklim.co +APIGW_SUBDOMAIN=apigw.iklim.co +RABBITMQ_SUBDOMAIN=rabbitmq.iklim.co +GRAFANA_SUBDOMAIN=grafana.iklim.co +RESTRICTED_IP_1=78.187.87.109 +RESTRICTED_IP_2=95.70.151.248 +``` + +## Template files (already created in test step 04) + +- `swag/site-confs/default.conf` +- `swag/proxy-confs/api.conf.tpl` +- `swag/proxy-confs/apigw.conf.tpl` +- `swag/proxy-confs/rabbitmq.conf.tpl` +- `swag/proxy-confs/grafana.conf.tpl` + +No new files to create — the same templates work for both environments. + +## Deploy step (handled by pipeline — see `08-deploy-pipeline-update.md`) + +```bash +set -a; . 
./.env; set +a +export RESTRICTED_IP_1="78.187.87.109" +export RESTRICTED_IP_2="95.70.151.248" + +sudo mkdir -p /opt/iklimco/swag/proxy-confs /opt/iklimco/swag/site-confs + +for tpl in swag/proxy-confs/*.conf.tpl; do + out="/opt/iklimco/swag/proxy-confs/$(basename "${tpl%.tpl}")" + envsubst < "$tpl" | sudo tee "$out" > /dev/null + echo "✅ $out" +done + +sudo cp swag/site-confs/default.conf /opt/iklimco/swag/site-confs/default.conf +``` + +With `API_SUBDOMAIN=api.iklim.co`, the output file `/opt/iklimco/swag/proxy-confs/api.conf` +will contain `server_name api.iklim.co;` — correct for prod. + +## Verification + +After deploy, on service-1: +```bash +cat /opt/iklimco/swag/proxy-confs/api.conf | grep server_name +``` +Expected: `server_name api.iklim.co;` + +```bash +docker exec $(docker ps -q -f name=iklimco_swag) nginx -t +``` +Expected: `syntax is ok` + +```bash +curl -si https://api.iklim.co/health +``` +Expected: APISIX response with valid `*.iklim.co` cert. + +## Notes +- `Prometheus` is intentionally NOT exposed via SWAG. Access it via Grafana + (internal connection: `http://prometheus:9090`) or SSH tunnel. +- If additional restricted-access subdomains are needed in the future, create a new + `swag/proxy-confs/.conf.tpl` following the same pattern. diff --git a/roadmap/prod-env/05-apisix-remove-ssl.md b/roadmap/prod-env/05-apisix-remove-ssl.md new file mode 100644 index 0000000..2b89344 --- /dev/null +++ b/roadmap/prod-env/05-apisix-remove-ssl.md @@ -0,0 +1,37 @@ +# 05 — APISIX: Remove SSL / Configure Trusted Proxy (Prod) + +## Context +Identical to `test-env-setup/05-apisix-remove-ssl.md`. + +The same `init/apisix-core/init.sh` and custom APISIX image are used for both environments. +Changes made for test already apply to prod. 
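The `real_ip_header` / `set_real_ip_from` requirement maps to the `nginx_config` section of APISIX's `config.yaml`. A sketch of what that fragment typically looks like; verify the exact key names against the `config-default.yaml` shipped in the pinned image:

```yaml
nginx_config:
  http:
    real_ip_header: X-Forwarded-For   # header SWAG forwards the client IP in
    real_ip_from:
      - 10.0.0.0/8                    # trust the Swarm overlay network
```

Without this, APISIX logs and IP-based plugins see SWAG's overlay address instead of the real client.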
+ +## Checklist + +- [ ] `ssls/1` PUT block removed from `init/apisix-core/init.sh` +- [ ] `dev` SSL block removed or confirmed non-impactful for prod +- [ ] Custom APISIX image (`custom-apisix:3.12.0`) config.yaml contains `real_ip_header` + and `set_real_ip_from` for overlay CIDR (`10.0.0.0/8`) +- [ ] New image built and pushed to Harbor if config.yaml was changed: + ```bash + docker build -t registry.tarla.io/iklimco/custom-apisix:3.12.0 . + docker push registry.tarla.io/iklimco/custom-apisix:3.12.0 + ``` + +## Prod-specific note + +APISIX runs with `replicas: 2` in prod. Both replicas receive the same configuration +from etcd — no additional steps needed beyond the single init run. + +The `init/apisix-core/init.sh` is called once (from the pipeline) and configures the +shared etcd state that all APISIX instances read from. + +## Verification + +```bash +# From a whitelisted IP, make a request and check real IP in APISIX logs +docker exec $(docker ps -q -f name=iklimco_apisix | head -1) \ + tail -5 /usr/local/apisix/logs/access.log +``` + +Client IP should appear in the log, not SWAG's internal overlay IP. diff --git a/roadmap/prod-env/06-cert-reloader.md b/roadmap/prod-env/06-cert-reloader.md new file mode 100644 index 0000000..8b4b59c --- /dev/null +++ b/roadmap/prod-env/06-cert-reloader.md @@ -0,0 +1,57 @@ +# 06 — cert-reloader Sidecar Service (Prod) + +## Context +Same service definition as test (see `test-env-setup/06-cert-reloader.md`). +Prod-specific consideration: Vault is single-instance on the manager node (same as SWAG), +so the cert copy to `/opt/iklimco/ssl/` works without cross-node distribution. + +When Vault is expanded to a 3-node Raft cluster (see `07-vault-raft-plan.md`), the +cert-reloader must be updated to distribute the cert to the other Vault nodes. 
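cert-reloader's core job, noticing that SWAG wrote a new cert, reduces to comparing file hashes. A self-contained sketch of that detection logic; temp files stand in for the SWAG volume:

```shell
# Temp dir stands in for /swag-config/etc/letsencrypt/live/iklim.co.
tmp=$(mktemp -d)
printf 'cert-v1' > "$tmp/fullchain.pem"

last=""
curr=$(md5sum "$tmp/fullchain.pem" | cut -d' ' -f1)
if [ "$curr" != "$last" ]; then first="changed"; last="$curr"; else first="unchanged"; fi

printf 'cert-v1' > "$tmp/fullchain.pem"   # rewritten, but identical content
curr=$(md5sum "$tmp/fullchain.pem" | cut -d' ' -f1)
if [ "$curr" != "$last" ]; then second="changed"; last="$curr"; else second="unchanged"; fi

echo "$first $second"
# → changed unchanged
rm -rf "$tmp"
```

Hashing the content (rather than comparing mtimes) means a rewrite with identical bytes does not trigger a needless Vault restart.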
+ +## Current behavior (single-Vault prod) + +``` +SWAG (manager) renews cert → swag-vl +cert-reloader (manager) detects change → copies to /opt/iklimco/ssl/ → reloads Vault +Vault (manager) reads /opt/iklimco/ssl/ → serves new cert +``` + +No cross-node distribution needed. + +## Future behavior (3-node Vault Raft — see step 07) + +When Vault runs on service-1, service-2, service-3: + +``` +cert-reloader detects cert change +→ copies cert to /opt/iklimco/ssl/ on service-1 (local) +→ SSH copy to service-2:/opt/iklimco/ssl/ +→ SSH copy to service-3:/opt/iklimco/ssl/ +→ docker service update --force iklimco_vault (restarts all 3 replicas) +``` + +This requires: +- An SSH key that cert-reloader can use to reach service-2 and service-3 +- That key mounted as a Docker secret into cert-reloader +- Known_hosts for service-2 and service-3 pre-configured + +Script update for this phase is tracked in `07-vault-raft-plan.md`. + +## Verification + +```bash +docker service ps iklimco_cert-reloader +docker service logs iklimco_cert-reloader --tail 20 +``` + +Expected: `[cert-reloader] started`, no error lines. + +Confirm Vault cert is current after SWAG renewal: +```bash +# Check cert expiry on Vault's TLS endpoint from inside the overlay +docker exec $(docker ps -q -f name=iklimco_vault) \ + sh -c 'echo | openssl s_client -connect vault.iklim.co:8200 2>/dev/null \ + | openssl x509 -noout -dates' +``` + +`notAfter` should match the cert in `/opt/iklimco/ssl/STAR.iklim.co.full.crt`. diff --git a/roadmap/prod-env/07-vault-raft-plan.md b/roadmap/prod-env/07-vault-raft-plan.md new file mode 100644 index 0000000..68c407c --- /dev/null +++ b/roadmap/prod-env/07-vault-raft-plan.md @@ -0,0 +1,105 @@ +# 07 — Vault: Initial Single Instance + Raft Cluster Migration Plan (Prod) + +## Context +Vault starts as a single instance on the manager node (service-1) for the initial prod launch. +This matches the current `docker-stack-infra.yml` configuration (file storage, single replica). 
+ +Raft HA cluster is planned for a later phase. + +## Phase 1 — Initial prod launch (current) + +- **Replicas:** 1 +- **Storage:** file (`/vault/file`) on service-1 +- **Placement:** `node.role == manager` (service-1) +- **Cert:** from `/opt/iklimco/ssl/` (populated by cert-reloader from SWAG volume) +- **TLS:** `VAULT_LOCAL_CONFIG` unchanged — `api_addr: https://vault.iklim.co:8200` + +No changes to `docker-stack-infra.yml` vault service for Phase 1. + +## Phase 2 — Vault Raft Cluster (future) + +### What changes +- **Replicas:** 3 (one per service node) +- **Storage:** Raft integrated (replaces file storage) +- **Placement:** `node.labels.type == service` (all 3 service nodes) +- **Cert distribution:** cert-reloader SSH-copies renewed cert to service-2, service-3 + +### Prerequisites before migration +- [ ] All 3 service nodes are running and labeled `type=service` +- [ ] Vault data backed up from Phase 1 (snapshot via `vault operator raft snapshot save`) +- [ ] SSH key created for cert-reloader to reach service-2 and service-3 +- [ ] SSH key stored as Docker secret `cert_reloader_ssh_key` +- [ ] `/opt/iklimco/ssl/` directory exists on service-2 and service-3 +- [ ] Vault data directory `/opt/iklimco/vault/data/` exists on all 3 nodes (host path volumes) + +### Vault service update for Raft + +```yaml +vault: + # ... 
(image, secrets, healthcheck unchanged)
  environment:
    VAULT_LOCAL_CONFIG: >-
      {"api_addr":"https://vault.iklim.co:8200",
      "cluster_addr":"https://{{ .Node.Hostname }}:8201",
      "storage":{"raft":{"path":"/vault/file","node_id":"{{ .Node.Hostname }}"}},
      "listener":[{"tcp":{"address":"0.0.0.0:8200",
      "tls_cert_file":"/vault/certs/STAR.iklim.co.full.crt",
      "tls_key_file":"/vault/certs/STAR.iklim.co_key.txt"}}],
      "default_lease_ttl":"168h","max_lease_ttl":"720h","ui":true}
  volumes:
    - /opt/iklimco/vault/data:/vault/file   # host path per node
    - /opt/iklimco/ssl:/vault/certs:ro
  deploy:
    mode: replicated
    replicas: 3
    placement:
      constraints:
        - node.labels.type == service
```

> `{{ .Node.Hostname }}` is Docker Swarm's Go template for the node hostname —
> it gives each Vault instance a unique `node_id`.

### Raft join procedure (after deploying 3-replica Vault)

Only the leader needs to be bootstrapped; the others join via `vault operator raft join`:

```bash
# On the primary Vault (service-1 container):
VAULT_CTR=$(docker ps -q -f name=iklimco_vault)

# Unseal if needed (prompts for an unseal key)
docker exec -it "$VAULT_CTR" vault operator unseal

# Check Raft peers
docker exec "$VAULT_CTR" vault operator raft list-peers
```

On service-2 and service-3, exec into the local Vault container and join it to the leader:
```bash
docker exec -it "$(docker ps -q -f name=iklimco_vault)" \
  vault operator raft join https://vault.iklim.co:8200
```

### cert-reloader update for Raft

Update the cert-reloader command in `docker-stack-infra.yml` to SSH-copy the cert
to service-2 and service-3 after renewal:

```bash
# After copying to local /opt/iklimco/ssl/:
ssh -i /run/secrets/cert_reloader_ssh_key service-2 \
  "cat > /opt/iklimco/ssl/STAR.iklim.co.full.crt" < /opt/iklimco/ssl/STAR.iklim.co.full.crt
# (repeat for service-3 and the private key)
docker service update --force iklimco_vault
```

Add the Docker secret to cert-reloader. Docker mounts secrets world-readable by default
and `ssh` refuses group/world-readable private keys, so set an explicit mode:
```yaml
secrets:
  - source: cert_reloader_ssh_key
    mode: 0400
```

## Reference
- Vault Raft storage docs:
https://developer.hashicorp.com/vault/docs/configuration/storage/raft +- Vault Swarm setup: https://manjit28.medium.com/setting-up-a-secure-and-highly-available-hashicorp-vault-cluster-for-secrets-and-certificates-0ce01a370582 diff --git a/roadmap/prod-env/08-deploy-pipeline-update.md b/roadmap/prod-env/08-deploy-pipeline-update.md new file mode 100644 index 0000000..0844f06 --- /dev/null +++ b/roadmap/prod-env/08-deploy-pipeline-update.md @@ -0,0 +1,130 @@ +# 08 — Deploy Pipeline Update (Prod) + +## Context +- **File:** `.gitea/workflows/deploy-prod.yml` +- Same changes as test pipeline (`test-env-setup/07-deploy-pipeline-update.md`), + adapted for prod paths and prod runner. + +## Step 1 — Remove manual cert scp lines from `Initialize Servers` + +```yaml +# DELETE from "Initialize Servers" step: + scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebox.de:prod/app/iklim.co/ssl/STAR.iklim.co.full.crt ./STAR.iklim.co.full.crt + scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebox.de:prod/app/iklim.co/ssl/STAR.iklim.co_key.txt ./STAR.iklim.co_key.txt +``` + +Also remove from `Prepare Init Files`: +```yaml +# DELETE or make conditional: + sudo cp STAR.iklim.co.full.crt STAR.iklim.co_key.txt /opt/iklimco/ssl/ +``` + +## Step 2 — Add `Prepare SWAG Directories` step + +Insert **before** `Deploy Swarm Stack`: + +```yaml + - name: Prepare SWAG Directories + run: | + set -a; . ./.env; . 
./.env.secrets.shared; set +a + + sudo mkdir -p /opt/iklimco/swag/dns-conf + envsubst < swag/dns-conf/godaddy.ini.tpl | sudo tee /opt/iklimco/swag/dns-conf/godaddy.ini > /dev/null + sudo chmod 600 /opt/iklimco/swag/dns-conf/godaddy.ini + echo "✅ godaddy.ini written" + + sudo mkdir -p /opt/iklimco/swag/proxy-confs /opt/iklimco/swag/site-confs + + export RESTRICTED_IP_1="78.187.87.109" + export RESTRICTED_IP_2="95.70.151.248" + + for tpl in swag/proxy-confs/*.conf.tpl; do + out="/opt/iklimco/swag/proxy-confs/$(basename "${tpl%.tpl}")" + envsubst < "$tpl" | sudo tee "$out" > /dev/null + echo "✅ $out" + done + + sudo cp swag/site-confs/default.conf /opt/iklimco/swag/site-confs/default.conf + echo "✅ SWAG directories ready" + working-directory: /workspace/iklim.co +``` + +> `.env` is sourced first so `API_SUBDOMAIN=api.iklim.co` (prod values) are used. +> Ensure these vars are in `prod/secrets/iklim.co/.env.prod` on storagebox. + +## Step 3 — Add `Bootstrap SWAG Certificate` step + +Insert **after** `Deploy Swarm Stack`: + +```yaml + - name: Bootstrap SWAG Certificate + run: | + echo "Waiting for SWAG container to start..." + SWAG_CTR="" + for i in $(seq 1 24); do + SWAG_CTR=$(docker ps -q -f name=iklimco_swag 2>/dev/null | head -1) + [ -n "$SWAG_CTR" ] && break + sleep 10 + done + + if [ -z "$SWAG_CTR" ]; then + echo "❌ SWAG container did not start" + exit 1 + fi + + CERT_PATH="/config/etc/letsencrypt/live/iklim.co/fullchain.pem" + echo "Waiting for cert (up to 10 min)..." + for i in $(seq 1 20); do + if docker exec "$SWAG_CTR" test -f "$CERT_PATH" 2>/dev/null; then + echo "✅ Cert obtained" + break + fi + echo " attempt $i/20 — waiting 30s..." + sleep 30 + done + + if ! docker exec "$SWAG_CTR" test -f "$CERT_PATH" 2>/dev/null; then + echo "❌ SWAG did not obtain cert. 
Logs:" + docker service logs iklimco_swag --tail 50 + exit 1 + fi + + sudo mkdir -p /opt/iklimco/ssl + docker exec "$SWAG_CTR" cat "$CERT_PATH" | \ + sudo tee /opt/iklimco/ssl/STAR.iklim.co.full.crt > /dev/null + docker exec "$SWAG_CTR" cat "/config/etc/letsencrypt/live/iklim.co/privkey.pem" | \ + sudo tee /opt/iklimco/ssl/STAR.iklim.co_key.txt > /dev/null + echo "✅ Cert bootstrapped to /opt/iklimco/ssl/" + working-directory: /workspace/iklim.co +``` + +## Step 4 — Ensure subdomain env vars are in prod `.env` + +Add to `prod/secrets/iklim.co/.env.prod` on storagebox: + +```bash +API_SUBDOMAIN=api.iklim.co +APIGW_SUBDOMAIN=apigw.iklim.co +RABBITMQ_SUBDOMAIN=rabbitmq.iklim.co +GRAFANA_SUBDOMAIN=grafana.iklim.co +``` + +## Step 5 — Final step order for prod pipeline + +1. Checkout Branch +2. Prepare Folders +3. Set up SSH Key +4. Install Required Tools +5. Fetch Service Secret Files +6. Initialize Servers ← cert scp lines removed +7. Upload Updated Secrets to Storagebox +8. Provision Vault AppRole IDs and Docker Secrets +9. Upload Updated Env to Storagebox +10. Prepare Init Files ← cert copy lines removed +11. Initialize Docker Swarm +12. Stop Docker Compose Services +13. Docker Login to Harbor +14. **Prepare SWAG Directories** ← NEW +15. Deploy Swarm Stack +16. **Bootstrap SWAG Certificate** ← NEW +17. Review Environment diff --git a/roadmap/prod-env/09-verify.md b/roadmap/prod-env/09-verify.md new file mode 100644 index 0000000..0e20ea2 --- /dev/null +++ b/roadmap/prod-env/09-verify.md @@ -0,0 +1,120 @@ +# 09 — Verification Checklist (Prod) + +## Context +Run after a successful prod pipeline deployment. + +## 1 — Swarm cluster health + +```bash +docker node ls +``` +Expected: 3 managers (`Leader` + 2 `Reachable`), 3 workers (`Ready`). + +```bash +docker service ls --filter label=project=co.iklim +``` +All services show `REPLICAS X/X` (target met). 
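The `REPLICAS X/X` check can be automated by flagging any service whose running count lags its target. A sketch, run here against sample output; the service names and numbers are illustrative:

```shell
# Sample output in the shape of:
#   docker service ls --format '{{.Name}} {{.Replicas}}'
sample='iklimco_swag 1/1
iklimco_apisix 1/2
iklimco_vault 1/1'

# Split "running/target" and print any service that has not converged.
lagging=$(echo "$sample" | awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 }')
echo "${lagging:-all services converged}"
# → iklimco_apisix
```

Piping real `docker service ls` output through the same `awk` gives a one-line health gate for the pipeline's final review step.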
## 2 — SWAG cert is valid

```bash
docker exec $(docker ps -q -f name=iklimco_swag) certbot certificates
```
Expected: `*.iklim.co`, `VALID: XX days` (Let's Encrypt, not the old manual cert).

TLS check from outside:
```bash
echo | openssl s_client -connect api.iklim.co:443 -servername api.iklim.co 2>/dev/null \
  | openssl x509 -noout -subject -dates
```
Expected: `CN=*.iklim.co`, `notAfter` > 2026-07-15 (i.e. the Let's Encrypt cert, not the old manual one that expires on that date).

## 3 — Public API

```bash
curl -si https://api.iklim.co/health
```
HTTP 2xx, no TLS errors.

## 4 — IP restriction working

From a non-whitelisted IP:
```bash
curl -si https://grafana.iklim.co
curl -si https://apigw.iklim.co
curl -si https://rabbitmq.iklim.co
```
All expected: HTTP 403.

From a whitelisted IP (78.187.87.109 or 95.70.151.248):
```bash
curl -si https://grafana.iklim.co    # HTTP 200 Grafana
curl -si https://apigw.iklim.co      # HTTP 200 APISIX Dashboard
curl -si https://rabbitmq.iklim.co   # HTTP 200 RabbitMQ Management
```

## 5 — Vault not reachable externally

```bash
# From outside — must fail (use any service node's public IP)
curl -sk --connect-timeout 5 https://<NODE_PUBLIC_IP>:8200/v1/sys/health
# Expected: connection refused or timeout
```

```bash
# From inside the overlay — must succeed
docker exec $(docker ps -q -f name=iklimco_apisix | head -1) \
  curl -sk https://vault.iklim.co:8200/v1/sys/health
# Expected: {"sealed":false,...}
```

## 6 — cert-reloader watching

```bash
docker service logs iklimco_cert-reloader --tail 5
```
Expected: `[cert-reloader] started`, no errors.

## 7 — No unexpected published ports

```bash
docker service ls --format "{{.Name}}\t{{.Ports}}" \
  --filter label=project=co.iklim
```
Only `iklimco_swag` should show `*:80->80/tcp, *:443->443/tcp`.

## 8 — DB nodes running correct services

```bash
docker service ps iklimco_postgres
docker service ps iklimco_mongo
```
Tasks should show node names matching `db-1`, `db-2`, or `db-3`.
+ +## 9 — APISIX replicas + +```bash +docker service ps iklimco_apisix +``` +Expected: 2 tasks, both `Running`, on different nodes. + +## 10 — fail2ban active + +```bash +docker exec $(docker ps -q -f name=iklimco_swag) fail2ban-client status +``` +Expected: multiple jails listed. + +## 11 — Microservice health (post-deploy) + +After microservices are deployed (separate pipeline), verify via the public API: +```bash +curl -si https://api.iklim.co/v1/weather/current?lat=39&lon=35 +``` +Expected: valid JSON weather response. + +## ⚠️ Old cert expiry reminder +The manually managed `*.iklim.co` cert expires **2026-07-15**. +SWAG's Let's Encrypt cert auto-renews every ~60 days. +After first SWAG cert is confirmed valid, the manual cert in storagebox can be archived +and is no longer used. diff --git a/roadmap/test-env/01-swarm-init.md b/roadmap/test-env/01-swarm-init.md new file mode 100644 index 0000000..9418120 --- /dev/null +++ b/roadmap/test-env/01-swarm-init.md @@ -0,0 +1,74 @@ +# 01 — Docker Swarm Init (Test) + +## Context +- **Repo:** `iklim.co` root +- **Environment:** test +- **Server:** single node — same machine is both Swarm manager and worker +- Pipeline trigger: push to `test-env` branch → Gitea runner executes directly on the test server +- `init/swarm-init.sh` already exists in the repo and is called by the pipeline + +## Prerequisites +- Docker Engine installed on test server +- User running the pipeline has Docker access (group `docker` or root) + +## Step 1 — Verify / update `init/swarm-init.sh` + +Check that the script handles idempotent init: + +```bash +grep -n "swarm init" init/swarm-init.sh +``` + +The script must contain logic similar to: + +```bash +if ! docker info --format '{{.Swarm.LocalNodeState}}' | grep -q "active"; then + docker swarm init --advertise-addr $(hostname -I | awk '{print $1}') + echo "✅ Swarm initialized" +else + echo "ℹ️ Swarm already active, skipping init" +fi +``` + +If this guard is missing, add it. 
Without it, the step fails on second deploy. + +## Step 2 — Run via pipeline + +The pipeline step `Initialize Docker Swarm` in `.gitea/workflows/deploy-test.yml` already calls: + +```bash +/bin/bash init/swarm-init.sh +``` + +No manual action needed after the script is correct. + +## Step 3 — Apply node label + +The `type=service` label is required for placement constraints in `docker-stack-infra.yml`. +Run once after Swarm init (Ansible handles this in automated setup): + +```bash +docker node update --label-add type=service $(docker node ls -q) +``` + +## Step 4 — Verify + +SSH into the test server and run: + +```bash +docker node ls +``` + +Expected: one node, `STATUS=Ready`, `AVAILABILITY=Active`, `MANAGER STATUS=Leader`. + +```bash +docker node inspect self --format '{{.Spec.Labels}}' +``` + +Expected: `map[type:service]`. + +## Notes +- Single-node Swarm: node is simultaneously manager and worker (`AVAILABILITY=Active`, not drained). +- Placement constraints `node.role == manager` and `node.labels.type == service` both resolve to this machine. +- No worker-join or manager-join steps needed for test. +- Docker Swarm overlay network `iklimco-net` is created automatically on first `docker stack deploy`. diff --git a/roadmap/test-env/02-godaddy-credentials.md b/roadmap/test-env/02-godaddy-credentials.md new file mode 100644 index 0000000..2fe90e4 --- /dev/null +++ b/roadmap/test-env/02-godaddy-credentials.md @@ -0,0 +1,73 @@ +# 02 — GoDaddy DNS Credentials for SWAG (Test) + +## Context +SWAG uses certbot with `certbot-dns-godaddy` plugin to obtain and auto-renew the +`*.iklim.co` wildcard certificate via DNS-01 challenge. +GoDaddy API credentials must be available at deploy time. + +## ⚠️ Security — Rotate credentials before use + +If credentials were shared in any chat log, Slack message, or email, **revoke them immediately**: + +1. Go to: https://developer.godaddy.com/keys +2. Revoke the exposed key +3. Create a new Production key pair +4. 
Use the new Key + Secret everywhere below + +**Never commit credentials to the repository.** + +## Step 1 — Add credentials to storagebox `.env.secrets.swag` + +Open (or create) the file at storagebox path: +``` +test/secrets/iklim.co/.env.secrets.swag +``` + +Add: +```bash +GODADDY_KEY= +GODADDY_SECRET= +``` + +These are fetched by the deploy pipeline's `Fetch Service Secret Files` step and sourced into the environment before further steps run. + +## Step 2 — Template file in the repo + +`swag/dns-conf/godaddy.ini.tpl` already exists in the repository root: + +```ini +dns_godaddy_key = ${GODADDY_KEY} +dns_godaddy_secret = ${GODADDY_SECRET} +``` + +This template is processed at deploy time (Step 07) with `envsubst`. + +## Step 3 — (Handled by pipeline) Write the actual credentials file on the host + +The deploy pipeline (see `07-deploy-pipeline-update.md`) runs: + +```bash +mkdir -p /opt/iklimco/swag/dns-conf +envsubst < swag/dns-conf/godaddy.ini.tpl > /opt/iklimco/swag/dns-conf/godaddy.ini +chmod 600 /opt/iklimco/swag/dns-conf/godaddy.ini +``` + +`GODADDY_KEY` and `GODADDY_SECRET` are already in the environment (sourced from `.env.secrets.swag`). + +The file is bind-mounted into the SWAG container at `/config/dns-conf/godaddy.ini` (read-only). + +## Step 4 — Verify (after SWAG is deployed) + +Inside the SWAG container: +```bash +docker exec $(docker ps -q -f name=iklimco_swag) cat /config/dns-conf/godaddy.ini +``` + +Expected output: file with real key/secret values, not `${...}` placeholders. + +## Notes +- `DNSPROPAGATION=90` is configured in SWAG's environment — GoDaddy DNS changes can take up to 90s. +- SWAG stores the obtained cert at `/config/etc/letsencrypt/live/iklim.co/` inside the container + (persisted in the `swag-vl` Docker named volume). +- cert-reloader service watches this volume and copies renewed certs to `/opt/iklimco/ssl/` + for Vault (see `06-cert-reloader.md`). 
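Step 4's expectation ("real values, not `${...}` placeholders") can be checked mechanically by scanning the rendered file for unexpanded variables, the telltale symptom of a secret missing from the environment. A sketch against sample strings; the key value is a dummy:

```shell
# Sample rendered output; the second line simulates a missing secret.
rendered='dns_godaddy_key = abc123
dns_godaddy_secret = ${GODADDY_SECRET}'

if printf '%s\n' "$rendered" | grep -q '\${[A-Za-z_]*}'; then
  status="unexpanded placeholders found"
else
  status="render OK"
fi
echo "$status"
# → unexpanded placeholders found
```

Running the same `grep` against the real `/opt/iklimco/swag/dns-conf/godaddy.ini` in the pipeline would fail the deploy before SWAG attempts a doomed DNS-01 challenge.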
diff --git a/roadmap/test-env/03-infra-stack-changes.md b/roadmap/test-env/03-infra-stack-changes.md new file mode 100644 index 0000000..59a5c57 --- /dev/null +++ b/roadmap/test-env/03-infra-stack-changes.md @@ -0,0 +1,241 @@ +# 03 — docker-stack-infra.yml Changes (Test) + +## Context +- **File:** `docker-stack-infra.yml` (repo root) +- **Goal:** Add SWAG as TLS-terminating reverse proxy; remove all published ports from internal + services (they become reachable only via SWAG through the `iklimco-net` overlay network); + remove Vault's external port entirely. + +## Changes Summary + +| Service | Before | After | +|---------|--------|-------| +| **swag** | does not exist | add: ports 80+443, manager-pinned | +| **cert-reloader** | does not exist | add: manager-pinned, Docker socket | +| **vault** | publishes 8200 | no published port | +| **apisix** | publishes 8080, 8443, 9180 | no published ports | +| **rabbitmq** | publishes 5672, 15672, 61613, 15674 | no published ports | +| **prometheus** | publishes 9090 | no published port | +| **grafana** | publishes 3000 | no published port | +| **apisix-dashboard** | publishes 9000 | no published port | + +> **RabbitMQ STOMP note:** Ports 61613 (STOMP) and 15674 (WebSocket STOMP) are removed because +> APISIX already proxies WebSocket STOMP to RabbitMQ via the overlay network. Verify that +> APISIX has a stream/WebSocket route for STOMP before removing these if external clients +> connect to STOMP directly (not via APISIX). 
+ +## Step 1 — Add `swag` service + +Add after the `apisix-dashboard` service block: + +```yaml + swag: + image: lscr.io/linuxserver/swag:latest + cap_add: + - NET_ADMIN + environment: + - PUID=1000 + - PGID=1000 + - TZ=Europe/Istanbul + - URL=iklim.co + - SUBDOMAINS=wildcard + - VALIDATION=dns + - DNSPLUGIN=godaddy + - ONLY_SUBDOMAINS=false + - EMAIL=muratozdemir@tarla.io + - DNSPROPAGATION=90 + volumes: + - swag-vl:/config + - /opt/iklimco/swag/dns-conf:/config/dns-conf:ro + - /opt/iklimco/swag/proxy-confs:/config/nginx/proxy-confs:ro + - /opt/iklimco/swag/site-confs:/config/nginx/site-confs:ro + ports: + - target: 80 + published: 80 + protocol: tcp + mode: host + - target: 443 + published: 443 + protocol: tcp + mode: host + deploy: + mode: replicated + replicas: 1 + placement: + constraints: + - node.role == manager + restart_policy: + condition: on-failure + delay: 5s + labels: + project: co.iklim +``` + +## Step 2 — Add `cert-reloader` service + +Add after the `swag` service block: + +```yaml + cert-reloader: + image: docker:27-cli + volumes: + - swag-vl:/swag-config:ro + - /opt/iklimco/ssl:/host-ssl + - /var/run/docker.sock:/var/run/docker.sock + entrypoint: ["/bin/sh", "-c"] + command: + - | + CERT_DIR=/swag-config/etc/letsencrypt/live/iklim.co + HOST_DIR=/host-ssl + LAST_HASH="" + echo "[cert-reloader] started" + while true; do + sleep 3600 + if [ -f "$$CERT_DIR/fullchain.pem" ]; then + CURR=$$(md5sum "$$CERT_DIR/fullchain.pem" | cut -d' ' -f1) + if [ "$$CURR" != "$$LAST_HASH" ]; then + echo "[cert-reloader] cert changed — copying and reloading Vault" + cp "$$CERT_DIR/fullchain.pem" "$$HOST_DIR/STAR.iklim.co.full.crt" + cp "$$CERT_DIR/privkey.pem" "$$HOST_DIR/STAR.iklim.co_key.txt" + docker service update --force iklimco_vault + LAST_HASH="$$CURR" + echo "[cert-reloader] done" + fi + fi + done + deploy: + mode: replicated + replicas: 1 + placement: + constraints: + - node.role == manager + restart_policy: + condition: on-failure + delay: 10s + labels: + 
project: co.iklim +``` + +> `$$` is required in Docker Swarm YAML to escape `$` and prevent host-side variable expansion. + +## Step 3 — Remove `vault` published port + +Find the `vault` service `ports:` block and **delete it entirely**: + +```yaml +# DELETE this entire block from vault service: + ports: + - target: 8200 + published: 8200 + protocol: tcp + mode: host +``` + +Vault remains reachable within `iklimco-net` via the overlay alias `vault.iklim.co:8200`. +The `VAULT_LOCAL_CONFIG` `api_addr` and `networks.default.aliases` entries stay unchanged. + +## Step 4 — Remove `apisix` published ports + +Find the `apisix` service `ports:` block and **delete it entirely**: + +```yaml +# DELETE this entire block from apisix service: + ports: + - target: 9080 + published: 8080 + protocol: tcp + mode: host + - target: 9443 + published: 8443 + protocol: tcp + mode: host + - target: 9180 + published: 9180 + protocol: tcp + mode: host +``` + +APISIX admin API (9180) access: use `docker exec` or SSH tunnel. +APISIX is reachable from SWAG via `http://apisix:9080` on the overlay network. 
+ +## Step 5 — Remove `apisix-dashboard` published port + +```yaml +# DELETE from apisix-dashboard: + ports: + - target: 9000 + published: 9000 + protocol: tcp + mode: host +``` + +## Step 6 — Remove `rabbitmq` published ports + +```yaml +# DELETE from rabbitmq: + ports: + - target: 5672 + published: 5672 + protocol: tcp + mode: host + - target: 15672 + published: 15672 + protocol: tcp + mode: host + - target: 61613 + published: 61613 + protocol: tcp + mode: host + - target: 15674 + published: 15674 + protocol: tcp + mode: host +``` + +## Step 7 — Remove `prometheus` published port + +```yaml +# DELETE from prometheus: + ports: + - target: 9090 + published: 9090 + protocol: tcp + mode: host +``` + +## Step 8 — Remove `grafana` published port + +```yaml +# DELETE from grafana: + ports: + - target: 3000 + published: 3000 + protocol: tcp + mode: host +``` + +## Step 9 — Add `swag-vl` volume + +In the `volumes:` section at the bottom of the file, add: + +```yaml + swag-vl: + labels: + project: co.iklim +``` + +## Verification + +After deploy: +```bash +docker service ls --filter label=project=co.iklim +``` + +Confirm `iklimco_swag` and `iklimco_cert-reloader` appear in the list. + +```bash +docker service ps iklimco_swag +docker service ps iklimco_cert-reloader +``` + +Both should show `Running`. diff --git a/roadmap/test-env/04-swag-nginx-configs.md b/roadmap/test-env/04-swag-nginx-configs.md new file mode 100644 index 0000000..994388a --- /dev/null +++ b/roadmap/test-env/04-swag-nginx-configs.md @@ -0,0 +1,193 @@ +# 04 — SWAG Nginx Proxy Configs (Test) + +## Context +SWAG reads nginx configs from bind-mounted directories: +- `/config/nginx/proxy-confs/` → `swag/proxy-confs/` in repo, deployed to `/opt/iklimco/swag/proxy-confs/` +- `/config/nginx/site-confs/` → `swag/site-confs/` in repo, deployed to `/opt/iklimco/swag/site-confs/` + +Templates use `${VAR}` placeholders processed with `envsubst` at deploy time. 
+ +## Required env vars (in `.env` on storagebox `test/secrets/iklim.co/.env`) + +```bash +API_SUBDOMAIN=api-test.iklim.co +APIGW_SUBDOMAIN=apigw-test.iklim.co +RABBITMQ_SUBDOMAIN=rabbitmq-test.iklim.co +GRAFANA_SUBDOMAIN=grafana-test.iklim.co +RESTRICTED_IP_1=78.187.87.109 +RESTRICTED_IP_2=95.70.151.248 +``` + +## Files to create + +### `swag/site-confs/default.conf` +Default catch-all: HTTP→HTTPS redirect + 444 for unknown HTTPS hosts. + +```nginx +server { + listen 80 default_server; + listen [::]:80 default_server; + server_name _; + return 301 https://$host$request_uri; +} + +server { + listen 443 ssl http2 default_server; + listen [::]:443 ssl http2 default_server; + server_name _; + include /config/nginx/ssl.conf; + return 444; +} +``` + +### `swag/proxy-confs/api.conf.tpl` +Public API gateway — no IP restriction. + +```nginx +server { + listen 443 ssl http2; + listen [::]:443 ssl http2; + server_name ${API_SUBDOMAIN}; + + include /config/nginx/ssl.conf; + include /config/nginx/resolver.conf; + + client_max_body_size 50m; + + location / { + include /config/nginx/proxy.conf; + include /config/nginx/resolver.conf; + set $upstream_app apisix; + set $upstream_port 9080; + set $upstream_proto http; + proxy_pass $upstream_proto://$upstream_app:$upstream_port; + } +} +``` + +### `swag/proxy-confs/apigw.conf.tpl` +APISIX Dashboard — IP restricted. + +```nginx +server { + listen 443 ssl http2; + listen [::]:443 ssl http2; + server_name ${APIGW_SUBDOMAIN}; + + include /config/nginx/ssl.conf; + include /config/nginx/resolver.conf; + + client_max_body_size 0; + + location / { + allow ${RESTRICTED_IP_1}; + allow ${RESTRICTED_IP_2}; + deny all; + + include /config/nginx/proxy.conf; + include /config/nginx/resolver.conf; + set $upstream_app apisix-dashboard; + set $upstream_port 9000; + set $upstream_proto http; + proxy_pass $upstream_proto://$upstream_app:$upstream_port; + } +} +``` + +### `swag/proxy-confs/rabbitmq.conf.tpl` +RabbitMQ Management UI — IP restricted. 
+ +```nginx +server { + listen 443 ssl http2; + listen [::]:443 ssl http2; + server_name ${RABBITMQ_SUBDOMAIN}; + + include /config/nginx/ssl.conf; + include /config/nginx/resolver.conf; + + client_max_body_size 0; + + location / { + allow ${RESTRICTED_IP_1}; + allow ${RESTRICTED_IP_2}; + deny all; + + include /config/nginx/proxy.conf; + include /config/nginx/resolver.conf; + set $upstream_app rabbitmq; + set $upstream_port 15672; + set $upstream_proto http; + proxy_pass $upstream_proto://$upstream_app:$upstream_port; + } +} +``` + +### `swag/proxy-confs/grafana.conf.tpl` +Grafana — IP restricted. + +```nginx +server { + listen 443 ssl http2; + listen [::]:443 ssl http2; + server_name ${GRAFANA_SUBDOMAIN}; + + include /config/nginx/ssl.conf; + include /config/nginx/resolver.conf; + + client_max_body_size 0; + + location / { + allow ${RESTRICTED_IP_1}; + allow ${RESTRICTED_IP_2}; + deny all; + + include /config/nginx/proxy.conf; + include /config/nginx/resolver.conf; + set $upstream_app grafana; + set $upstream_port 3000; + set $upstream_proto http; + proxy_pass $upstream_proto://$upstream_app:$upstream_port; + } +} +``` + +## Deploy step (handled by pipeline — see `07-deploy-pipeline-update.md`) + +```bash +# Process templates and write to host +mkdir -p /opt/iklimco/swag/proxy-confs /opt/iklimco/swag/site-confs + +set -a; . 
./.env; set +a +export RESTRICTED_IP_1="78.187.87.109" +export RESTRICTED_IP_2="95.70.151.248" + +for tpl in swag/proxy-confs/*.conf.tpl; do + out="/opt/iklimco/swag/proxy-confs/$(basename "${tpl%.tpl}")" + envsubst < "$tpl" > "$out" + echo "✅ $out" +done + +cp swag/site-confs/default.conf /opt/iklimco/swag/site-confs/default.conf +``` + +## Verification + +After deploy, check SWAG nginx config is valid: +```bash +docker exec $(docker ps -q -f name=iklimco_swag) nginx -t +``` + +Check subdomains resolve (from outside the server): +```bash +curl -sk https://api-test.iklim.co/health # expects APISIX response +curl -sk https://grafana-test.iklim.co # expects 403 Forbidden (wrong IP) +``` + +## Notes +- `include /config/nginx/resolver.conf` enables dynamic upstream resolution via Docker DNS — + required for overlay service names like `apisix`, `grafana`, etc. +- SWAG's `proxy.conf` already sets `X-Real-IP`, `X-Forwarded-For`, `X-Forwarded-Proto` and + WebSocket upgrade headers. No manual addition needed. +- `*.iklim.co` cert covers both `api.iklim.co` and `api-test.iklim.co` subdomains — + both test and prod servers can independently obtain and use it. diff --git a/roadmap/test-env/05-apisix-remove-ssl.md b/roadmap/test-env/05-apisix-remove-ssl.md new file mode 100644 index 0000000..1602787 --- /dev/null +++ b/roadmap/test-env/05-apisix-remove-ssl.md @@ -0,0 +1,86 @@ +# 05 — APISIX: Remove SSL / Configure Trusted Proxy (Test) + +## Context +- **File:** `init/apisix-core/init.sh` +- SWAG now terminates TLS. APISIX receives plain HTTP from SWAG via the overlay network. +- The `ssls/1` cert upload is no longer needed. +- APISIX must trust SWAG's `X-Real-IP` header to see real client IPs (for rate limiting, fail2ban). 
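What "seeing real client IPs" means is easiest to show from the log side; a sketch over fabricated access-log lines (203.0.113.x is a documentation address range, and the log format is a generic combined-log shape, not APISIX's exact format):

```bash
# Hypothetical access-log lines: the first field is the client IP the
# proxy recorded. Without real_ip handling it would be SWAG's 10.x
# overlay address on every request.
log=$(mktemp)
cat > "$log" <<'EOF'
203.0.113.9 - - [01/Jan/2025:12:00:00 +0000] "GET /health HTTP/1.1" 200 15
10.0.1.4 - - [01/Jan/2025:12:00:01 +0000] "GET /health HTTP/1.1" 200 15
EOF

# Classify each logged IP: overlay addresses mean the proxy headers
# were not honored.
out=$(awk '{ if ($1 ~ /^10\./) print $1, "overlay (real_ip not applied)"; else print $1, "real client IP" }' "$log")
echo "$out"
rm -f "$log"
```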
+ +## Step 1 — Remove the SSL cert upload block from `init/apisix-core/init.sh` + +Locate and **delete** this entire block: + +```bash +# DELETE THIS BLOCK: +if [[ "$PROFILE" == "test" || "$PROFILE" == "prod" ]]; then + if [[ -f "STAR.iklim.co.full.crt" && -f "STAR.iklim.co_key.txt" ]]; then + call_api "ssl iklim.co" -X PUT "$APISIX_ADMIN_URL/ssls/1" \ + -H "X-API-KEY: $API_KEY" -H "Content-Type: application/json" \ + -d '{"cert":"'"$(cat STAR.iklim.co.full.crt)"'","key":"'"$(cat STAR.iklim.co_key.txt)"'","snis":["*.iklim.co"]}' + else + echo "iklim.co ssl certificates not found!" + fi +fi +``` + +Also delete the `dev` SSL block if it only serves the `ssls/1` endpoint: + +```bash +# DELETE THIS BLOCK (if only used for cert upload): +if [[ "$PROFILE" == "dev" ]]; then + if [[ -f "localhost.crt" && -f "localhost.key" ]]; then + call_api "ssl dev" -X PUT "$APISIX_ADMIN_URL/ssls/1" \ + -H "X-API-KEY: $API_KEY" -H "Content-Type: application/json" \ + -d '{"cert":"'"$(cat localhost.crt)"'","key":"'"$(cat localhost.key)"'","snis":["localhost"]}' + else + echo "localhost ssl certificates not found!" + fi +fi +``` + +> If the `dev` block is still needed for local development, keep it but ensure it does not +> affect test/prod behavior. + +## Step 2 — APISIX trusted proxy configuration (custom image) + +APISIX's custom image (`registry.tarla.io/iklimco/custom-apisix:3.12.0`) includes a +`config.yaml`. That config must set real IP headers so APISIX sees real client IPs, not +SWAG's overlay IP. + +Locate the APISIX `config.yaml` in the custom image build source and ensure it contains: + +```yaml +nginx_config: + http: + real_ip_header: "X-Real-IP" + real_ip_recursive: "on" + set_real_ip_from: + - "10.0.0.0/8" + - "172.16.0.0/12" + - "192.168.0.0/16" +``` + +Docker Swarm overlay networks use `10.x.x.x` addressing. These CIDR ranges cover all +typical overlay subnet allocations. 
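The trust decision nginx makes with `set_real_ip_from` is plain CIDR membership; a hand-rolled sketch of that check over the three ranges above (the peer IPs are made up, and `ip_to_int`/`in_cidr` are illustrative helpers):

```bash
# Convert a dotted quad to a 32-bit integer.
ip_to_int() {
  IFS=. read -r o1 o2 o3 o4 <<EOF
$1
EOF
  echo $(( (o1 << 24) + (o2 << 16) + (o3 << 8) + o4 ))
}

# usage: in_cidr IP NETWORK/BITS ; succeeds if IP is inside the range.
in_cidr() {
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (0xffffffff << (32 - bits)) & 0xffffffff ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

result=$(
  for peer in 10.0.3.7 172.20.1.5 203.0.113.9; do
    trusted=no
    for cidr in 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16; do
      if in_cidr "$peer" "$cidr"; then trusted=yes; fi
    done
    echo "$peer trusted=$trusted"
  done
)
echo "$result"
```

Only requests arriving from a trusted peer get their `X-Real-IP` honored; anything else (like the public 203.0.113.9 here) is treated as the client itself.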
If the custom image config does not have these, add them and rebuild+push the image to Harbor
before deploying.

## Step 3 — Remove APISIX TLS upstream configs (if any)

If any APISIX upstream in `init/apisix-core/init.sh` uses `scheme: https` pointing to
backend microservices, change it to `scheme: http`. Backends are internal HTTP-only.

With the `ssls/1` object removed and port 9443 no longer published, APISIX's HTTPS listener goes unused; SWAG reaches APISIX only on `9080` (HTTP).

## Verification

After deploy, confirm APISIX receives real client IPs:
```bash
# From a machine with a known IP, make a request to api-test.iklim.co
# Then check the APISIX access log
docker exec $(docker ps -q -f name=iklimco_apisix) \
  tail -20 /usr/local/apisix/logs/access.log
```

The IP in the log should be the actual client IP, not SWAG's overlay IP (`10.x.x.x`).

diff --git a/roadmap/test-env/06-cert-reloader.md b/roadmap/test-env/06-cert-reloader.md new file mode 100644 index 0000000..c597d87 --- /dev/null +++ b/roadmap/test-env/06-cert-reloader.md @@ -0,0 +1,79 @@

# 06 — cert-reloader Sidecar Service (Test)

## Context
- **Purpose:** Watches SWAG's certificate volume for changes; copies renewed certs to
  `/opt/iklimco/ssl/` on the host; forces Vault to reload its TLS cert.
- **Replaces:** `ops/vault-reload-after-swag-renewal.sh` (which was designed for manual use).
  The sidecar automates this after every SWAG renewal.
- **Runs on:** manager node (same node as SWAG and Vault, ensuring volume + socket access).
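The sidecar's core is a hash-compare loop; a Docker-free sketch of just that detection logic, run against throwaway files instead of the SWAG volume (the three-pass loop stands in for `while true; do sleep 3600`):

```bash
dir=$(mktemp -d)
cert="$dir/fullchain.pem"
echo "cert-v1" > "$cert"

last_hash=""
reloads=0
for i in 1 2 3; do
  curr=$(md5sum "$cert" | cut -d' ' -f1)
  if [ "$curr" != "$last_hash" ]; then
    # The real service copies the certs to /opt/iklimco/ssl and runs
    # `docker service update --force iklimco_vault` here.
    reloads=$((reloads + 1))
    last_hash="$curr"
  fi
  if [ "$i" = 2 ]; then echo "cert-v2" > "$cert"; fi  # simulate a renewal
done
echo "reloads=$reloads"
rm -rf "$dir"
```

The first pass always triggers (empty hash), the second is a no-op, and the simulated renewal triggers the third, so a steady-state renewal costs exactly one reload.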
+ +## How it works + +``` +SWAG renews cert + → writes new fullchain.pem to swag-vl:/config/etc/letsencrypt/live/iklim.co/ +cert-reloader wakes every 3600s + → detects MD5 change on fullchain.pem + → copies fullchain.pem + privkey.pem to /opt/iklimco/ssl/ (host bind mount) + → docker service update --force iklimco_vault +Vault restarts + → reads new cert from /opt/iklimco/ssl/ (already mounted as /vault/certs) +``` + +## Step 1 — Service definition (already in `03-infra-stack-changes.md`) + +The `cert-reloader` service is added to `docker-stack-infra.yml` as documented in step 03. +No separate action needed here beyond that file change. + +## Step 2 — Ensure `/opt/iklimco/ssl/` exists on the host + +The `Prepare Init Files` step in the pipeline already creates this directory and copies +the initial cert. The cert-reloader handles subsequent renewals. + +On first deploy, the bootstrap cert (copied during pipeline init) is used until SWAG +obtains its first Let's Encrypt cert (see `07-deploy-pipeline-update.md`). + +## Step 3 — Verify cert-reloader is running + +```bash +docker service ps iklimco_cert-reloader +docker service logs iklimco_cert-reloader --tail 20 +``` + +Expected log on startup: +``` +[cert-reloader] started +``` + +## Step 4 — Trigger a manual test (optional, for verification) + +Force a cert copy and Vault reload without waiting for renewal: + +```bash +SWAG_VOL=$(docker volume inspect iklimco_swag-vl --format '{{.Mountpoint}}') +CERT="$SWAG_VOL/etc/letsencrypt/live/iklim.co/fullchain.pem" + +if [ -f "$CERT" ]; then + cp "$CERT" /opt/iklimco/ssl/STAR.iklim.co.full.crt + KEYF="$SWAG_VOL/etc/letsencrypt/live/iklim.co/privkey.pem" + cp "$KEYF" /opt/iklimco/ssl/STAR.iklim.co_key.txt + docker service update --force iklimco_vault + echo "✅ Manual reload triggered" +else + echo "⚠️ Cert not yet obtained by SWAG" +fi +``` + +## Notes +- Docker socket (`/var/run/docker.sock`) is mounted into cert-reloader — this is intentional + and necessary. 
The service is pinned to the manager and is minimal (`docker:27-cli` image).
- cert-reloader checks every 3600s (1 hour). Let's Encrypt certs are valid for 90 days and SWAG renews them roughly 30 days before expiry, so a 1-hour check interval is more than sufficient.
- If Vault restarts (due to a cert reload), it must be **unsealed** again. Vault's healthcheck in `docker-stack-infra.yml` already handles auto-unseal via the `vault_unseal_key` Docker secret. Verify this works after a cert reload.

## Future — Multi-node Vault (prod)
When Vault runs as a 3-node Raft cluster on different physical machines,
cert-reloader must also SSH-copy the cert to the other nodes' `/opt/iklimco/ssl/`.
This is handled in `prod-env-setup/06-cert-reloader.md`.

diff --git a/roadmap/test-env/07-deploy-pipeline-update.md b/roadmap/test-env/07-deploy-pipeline-update.md new file mode 100644 index 0000000..81e2f86 --- /dev/null +++ b/roadmap/test-env/07-deploy-pipeline-update.md @@ -0,0 +1,151 @@

# 07 — Deploy Pipeline Update (Test)

## Context
- **File:** `.gitea/workflows/deploy-test.yml`
- Changes:
  1. Remove manual `scp STAR.iklim.co.full.crt` steps (SWAG now owns the cert lifecycle).
  2. Add SWAG host directory preparation (dns-conf, nginx proxy-confs).
  3. Add a cert bootstrap step: on first deploy, wait for SWAG to obtain the cert, then copy
     it to `/opt/iklimco/ssl/` so Vault can start.
  4. Ensure `GODADDY_KEY` and `GODADDY_SECRET` are available from `.env.secrets.swag`.
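Change 3 (the cert bootstrap) is a bounded poll-with-timeout; stripped of Docker, the pattern looks like this (`wait_for` is an illustrative helper, not part of the pipeline):

```bash
# Poll a condition with a fixed attempt budget; return 1 on timeout.
# usage: wait_for ATTEMPTS DELAY_SECONDS CMD [ARGS...]
wait_for() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then return 0; fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

f=$(mktemp -u)             # path that does not exist yet
( sleep 1; touch "$f" ) &  # appears shortly, like a cert after issuance
if wait_for 10 1 test -f "$f"; then
  status="condition met"
else
  status="timed out"
fi
echo "$status"
rm -f "$f"
```

The pipeline step below uses the same shape twice: once for the SWAG container to exist, once for `fullchain.pem` to appear inside it.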
+ +## Step 1 — Update `Initialize Servers` step + +**Remove** the two `scp` lines that copy the TLS cert files: + +```yaml +# DELETE these two lines from the "Initialize Servers" step: + scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebox.de:test/app/iklim.co/ssl/STAR.iklim.co.full.crt ./STAR.iklim.co.full.crt + scp -P 23 ${{ vars.STORAGEBOX_USER }}@${{ vars.STORAGEBOX_USER }}.your-storagebox.de:test/app/iklim.co/ssl/STAR.iklim.co_key.txt ./STAR.iklim.co_key.txt +``` + +Also remove any references to `STAR.iklim.co.full.crt` and `STAR.iklim.co_key.txt` in +the `Prepare Init Files` step's `sudo cp` commands: + +```yaml +# DELETE or make conditional: + sudo cp STAR.iklim.co.full.crt STAR.iklim.co_key.txt /opt/iklimco/ssl/ 2>/dev/null || true +``` + +## Step 2 — Add `Prepare SWAG Directories` step + +Insert this step **before** `Deploy Swarm Stack`: + +```yaml + - name: Prepare SWAG Directories + run: | + set -a; . ./.env; . ./.env.secrets.swag; set +a + + # GoDaddy credentials file + sudo mkdir -p /opt/iklimco/swag/dns-conf + envsubst < swag/dns-conf/godaddy.ini.tpl | sudo tee /opt/iklimco/swag/dns-conf/godaddy.ini > /dev/null + sudo chmod 600 /opt/iklimco/swag/dns-conf/godaddy.ini + echo "✅ godaddy.ini written" + + # Nginx proxy conf files + sudo mkdir -p /opt/iklimco/swag/proxy-confs /opt/iklimco/swag/site-confs + + export RESTRICTED_IP_1="78.187.87.109" + export RESTRICTED_IP_2="95.70.151.248" + + for tpl in swag/proxy-confs/*.conf.tpl; do + out="/opt/iklimco/swag/proxy-confs/$(basename "${tpl%.tpl}")" + envsubst < "$tpl" | sudo tee "$out" > /dev/null + echo "✅ $out" + done + + sudo cp swag/site-confs/default.conf /opt/iklimco/swag/site-confs/default.conf + echo "✅ SWAG directories ready" + working-directory: /workspace/iklim.co +``` + +> `GODADDY_KEY` and `GODADDY_SECRET` must be present in `.env.secrets.swag` (see step 02). +> `API_SUBDOMAIN`, `APIGW_SUBDOMAIN`, etc. must be in `.env` (see step 04). 
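The `$(basename "${tpl%.tpl}")` expression in the loop above derives each output name; in isolation:

```bash
# `%.tpl` strips the suffix, `basename` drops the directory:
# swag/proxy-confs/api.conf.tpl -> api.conf
mapping=$(
  for tpl in swag/proxy-confs/api.conf.tpl swag/proxy-confs/grafana.conf.tpl; do
    echo "$tpl -> $(basename "${tpl%.tpl}")"
  done
)
echo "$mapping"
```

So every `*.conf.tpl` template in the repo lands as the matching `*.conf` under `/opt/iklimco/swag/proxy-confs/`.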
+ +## Step 3 — Add `Bootstrap SWAG Certificate` step + +Insert this step **after** `Deploy Swarm Stack` and **before** any step that depends on +Vault being accessible (e.g., `Provision Vault AppRole IDs`): + +```yaml + - name: Bootstrap SWAG Certificate + run: | + echo "Waiting for SWAG container to start..." + SWAG_CTR="" + for i in $(seq 1 24); do + SWAG_CTR=$(docker ps -q -f name=iklimco_swag 2>/dev/null | head -1) + [ -n "$SWAG_CTR" ] && break + sleep 10 + done + + if [ -z "$SWAG_CTR" ]; then + echo "❌ SWAG container did not start in time" + exit 1 + fi + + CERT_PATH="/config/etc/letsencrypt/live/iklim.co/fullchain.pem" + echo "Waiting for SWAG to obtain Let's Encrypt cert (up to 10 min)..." + for i in $(seq 1 20); do + if docker exec "$SWAG_CTR" test -f "$CERT_PATH" 2>/dev/null; then + echo "✅ Cert obtained by SWAG" + break + fi + echo " attempt $i/20 — waiting 30s..." + sleep 30 + done + + if ! docker exec "$SWAG_CTR" test -f "$CERT_PATH" 2>/dev/null; then + echo "❌ SWAG did not obtain cert in time. Check logs:" + docker service logs iklimco_swag --tail 50 + exit 1 + fi + + # Copy cert to host for Vault bootstrap + sudo mkdir -p /opt/iklimco/ssl + docker exec "$SWAG_CTR" cat "$CERT_PATH" | \ + sudo tee /opt/iklimco/ssl/STAR.iklim.co.full.crt > /dev/null + docker exec "$SWAG_CTR" cat "/config/etc/letsencrypt/live/iklim.co/privkey.pem" | \ + sudo tee /opt/iklimco/ssl/STAR.iklim.co_key.txt > /dev/null + echo "✅ Cert bootstrapped to /opt/iklimco/ssl/" + working-directory: /workspace/iklim.co +``` + +> **First deploy only:** SWAG contacts Let's Encrypt via GoDaddy DNS challenge. +> This step waits up to 10 minutes. On subsequent deploys the cert is already in +> `swag-vl` (persisted volume) and SWAG starts immediately — wait loop exits fast. + +## Step 4 — Re-order steps + +Final step order in the pipeline: + +1. Checkout Branch +2. Prepare Folders +3. Set up SSH Key +4. Update Apt / Install Tools +5. Fetch Service Secret Files +6. Initialize Servers +7. 
Upload Updated Secrets to Storagebox
8. Provision Vault AppRole IDs and Docker Secrets
9. Upload Updated Env to Storagebox
10. Prepare Init Files ← `sudo cp STAR.iklim.co.*.crt` lines removed
11. Initialize Docker Swarm
12. Stop Docker Compose Services
13. Docker Login to Harbor
14. **Prepare SWAG Directories** ← NEW
15. Deploy Swarm Stack
16. **Bootstrap SWAG Certificate** ← NEW
17. Review Environment

> Step 8 (Provision Vault) runs before SWAG is deployed because it creates Docker secrets and
> AppRole IDs — Vault must be reachable for this. On re-deploys, Vault is already
> running with the previous cert. On first deploy, step 16 handles the cert wait before
> any further Vault interaction is needed post-deploy.
>
> If Vault provisioning (step 8) fails on first deploy because Vault has no cert yet,
> move step 16 before step 8. Adjust based on observed behavior.

## Notes
- `.env` must contain the subdomain env vars added in step 04. Add them to storagebox
  `test/secrets/iklim.co/.env` before the first deploy.
- `RESTRICTED_IP_1` and `RESTRICTED_IP_2` are hardcoded in the pipeline step above.
  Move them to `.env` if they change often.

diff --git a/roadmap/test-env/08-verify.md b/roadmap/test-env/08-verify.md new file mode 100644 index 0000000..73b9a75 --- /dev/null +++ b/roadmap/test-env/08-verify.md @@ -0,0 +1,125 @@

# 08 — Verification Checklist (Test)

## Context
Run these checks after a successful pipeline deployment to the test environment.

## 1 — Swarm services are up

```bash
docker service ls --filter label=project=co.iklim
```

All services should show `REPLICAS 1/1`.

```bash
docker service ps iklimco_swag
docker service ps iklimco_cert-reloader
docker service ps iklimco_vault
docker service ps iklimco_apisix
```

No tasks in `Failed` or `Rejected` state.
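Check 1 can be scripted instead of eyeballed; a sketch over captured sample output (the service list and the `0/1` failure below are fabricated):

```bash
# Flag services whose REPLICAS column is not N/N in saved
# `docker service ls` output (sample data, not a live query).
svc=$(mktemp)
cat > "$svc" <<'EOF'
ID    NAME                    MODE        REPLICAS
aaa   iklimco_swag            replicated  1/1
bbb   iklimco_vault           replicated  1/1
ccc   iklimco_apisix          replicated  0/1
EOF

degraded=$(awk 'NR > 1 { split($4, r, "/"); if (r[1] != r[2]) print $2 }' "$svc")
echo "degraded: ${degraded:-none}"
rm -f "$svc"
```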
+ +## 2 — SWAG obtained the cert + +```bash +docker exec $(docker ps -q -f name=iklimco_swag) \ + certbot certificates +``` + +Expected: certificate for `*.iklim.co`, `VALID: XX days`. + +```bash +docker exec $(docker ps -q -f name=iklimco_swag) \ + ls /config/etc/letsencrypt/live/iklim.co/ +``` + +Expected: `fullchain.pem`, `privkey.pem`, `cert.pem`, `chain.pem`. + +## 3 — Nginx config is valid + +```bash +docker exec $(docker ps -q -f name=iklimco_swag) nginx -t +``` + +Expected: `syntax is ok` and `test is successful`. + +## 4 — Public API endpoint + +```bash +curl -si https://api-test.iklim.co/health +``` + +Expected: HTTP 2xx or APISIX response (not a cert error, not a 502). + +TLS cert check: +```bash +echo | openssl s_client -connect api-test.iklim.co:443 -servername api-test.iklim.co 2>/dev/null \ + | openssl x509 -noout -subject -dates +``` + +Expected: `subject=CN=*.iklim.co`, dates valid, `notAfter` > today. + +## 5 — IP-restricted subdomains block non-whitelisted IPs + +From a non-whitelisted IP: +```bash +curl -si https://grafana-test.iklim.co +``` +Expected: HTTP 403. + +From a whitelisted IP (78.187.87.109 or 95.70.151.248): +```bash +curl -si https://grafana-test.iklim.co +``` +Expected: HTTP 200 (Grafana login page). + +## 6 — Vault is reachable internally (not externally) + +From outside the server: +```bash +curl -sk https://vault.iklim.co:8200/v1/sys/health +# or +curl -sk https://:8200/v1/sys/health +``` +Expected: **connection refused** or **timeout** — Vault must not be reachable externally. + +From inside the Swarm (exec into any service container): +```bash +docker exec $(docker ps -q -f name=iklimco_apisix | head -1) \ + curl -sk https://vault.iklim.co:8200/v1/sys/health +``` +Expected: JSON response `{"sealed":false,...}`. + +## 7 — cert-reloader is watching + +```bash +docker service logs iklimco_cert-reloader --tail 10 +``` +Expected: `[cert-reloader] started` — no errors. 
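Several of the TLS checks above (notably the `notAfter` date check in step 4) can be rehearsed offline against a throwaway self-signed certificate, generated here purely for illustration:

```bash
# Generate a short-lived self-signed cert, then apply the same
# expiry test used against the live endpoint.
dir=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=*.iklim.co" \
  -keyout "$dir/key.pem" -out "$dir/cert.pem" -days 30 2>/dev/null

# -checkend N exits 0 if the cert is still valid N seconds from now.
if openssl x509 -in "$dir/cert.pem" -noout -checkend 0 >/dev/null; then
  verdict="cert still valid"
else
  verdict="cert expired"
fi
echo "$verdict"
rm -rf "$dir"
```

Against the live endpoint, the same `-checkend` test can replace the manual comparison of `notAfter` with today's date.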
## 8 — Vault cert path is correct

```bash
VAULT_CTR=$(docker ps -q -f name=iklimco_vault)
docker exec "$VAULT_CTR" ls /vault/certs/
```
Expected: `STAR.iklim.co.full.crt` and `STAR.iklim.co_key.txt`.

## 9 — fail2ban is active (SWAG)

```bash
docker exec $(docker ps -q -f name=iklimco_swag) \
  fail2ban-client status
```
Expected: list of jails including `nginx-http-auth`, `nginx-botsearch`, etc.

## 10 — No services have published unexpected ports

```bash
docker service ls --format "{{.Name}}\t{{.Ports}}" \
  --filter label=project=co.iklim
```

Only `iklimco_swag` should have published ports (`*:80->80`, `*:443->443`).
All other services should show an empty PORTS column.

diff --git a/setup-vs-roadmap-map.md b/setup-vs-roadmap-map.md new file mode 100644 index 0000000..77e5988 --- /dev/null +++ b/setup-vs-roadmap-map.md @@ -0,0 +1,55 @@

# Setup Phases — Roadmap Mapping Table

This table shows which Terraform/Ansible setup phase covers each of the roadmap steps in the
`roadmap/test-env` and `roadmap/prod-env` folders.
## TEST environment

| Roadmap step | Phase where it is handled |
| --- | --- |
| Hetzner firewall (only 22/80/443) | **Terraform `01-test-terraform-iaac.md`** — `firewall.tf` |
| Server creation (`test-swarm-01`, `test-db-01`) | **Terraform `01-test-terraform-iaac.md`** — `servers.tf` |
| Private network + placement group | **Terraform `01-test-terraform-iaac.md`** — `network.tf`, `placement.tf` |
| Docker Engine installation | **Ansible `02-test-ansible-bootstrap.md`** — `docker` role |
| Security hardening (SSH, UFW, fail2ban) | **Ansible `02-test-ansible-bootstrap.md`** — `hardening` role |
| Docker Swarm init (`init/swarm-init.sh`) | **Ansible `02-test-ansible-bootstrap.md`** — `swarm` role (the pipeline script keeps running idempotently) |
| `type=service` node label | **Ansible `02-test-ansible-bootstrap.md`** — `swarm` role |
| `/opt/iklimco/...` directories | **Ansible `02-test-ansible-bootstrap.md`** — `node_dirs` role |
| `act_runner` systemd installation | **Ansible `03-test-runner-ve-deploy-onkosullari.md`** — `gitea_runner` role |
| Uploading GoDaddy credentials to the storagebox | **Remains manual** — secret management, outside Terraform/Ansible |

## PROD environment

| Roadmap step | Phase where it is handled |
| --- | --- |
| Creating 6 servers (3 swarm + 3 db) | **Terraform `04-prod-terraform-iaac.md`** — `servers.tf` |
| Private network + 2 placement groups | **Terraform `04-prod-terraform-iaac.md`** — `network.tf`, `placement.tf` |
| Firewall (only 22/80/443 public) | **Terraform `04-prod-terraform-iaac.md`** — `firewall.tf` |
| Docker Engine installation (`prod-swarm-*`) | **Ansible `05-prod-ansible-bootstrap.md`** — `docker` role |
| Security hardening (all nodes) | **Ansible `05-prod-ansible-bootstrap.md`** — `hardening` role |
| Swarm init (`prod-swarm-01`) | **Ansible `05-prod-ansible-bootstrap.md`** — `swarm` role |
| Manager join (`prod-swarm-02`, `prod-swarm-03`) | **Ansible `05-prod-ansible-bootstrap.md`** — `swarm` role |
| `type=service` node label (3 swarm nodes) | **Ansible `05-prod-ansible-bootstrap.md`** — `swarm` role |
| `/opt/iklimco/...` directories | **Ansible `05-prod-ansible-bootstrap.md`** — `node_dirs` role |
| 3× `act_runner` systemd (HA runners) | **Ansible `06-prod-runner-ha-ve-swarm.md`** — `gitea_runner` role |
| Uploading GoDaddy credentials to the storagebox | **Remains manual** — secret management, outside Terraform/Ansible |
| Joining DB nodes to the Swarm | **Out of scope** — the DB cluster is managed separately |

## Folder structure

```
Environment_Infrastructure/
  setup/ ← Terraform + Ansible phase documents
    00-genel-yol-haritasi.md
    01-test-terraform-iaac.md
    02-test-ansible-bootstrap.md
    03-test-runner-ve-deploy-onkosullari.md
    04-prod-terraform-iaac.md
    05-prod-ansible-bootstrap.md
    06-prod-runner-ha-ve-swarm.md
    07-private-network-port-matrisi.md
roadmap/
  test-env/ ← Test environment roadmap steps
  prod-env/ ← Prod roadmap steps
setup-vs-roadmap-map.md ← This file
```

diff --git a/setup/00-genel-yol-haritasi.md b/setup/00-genel-yol-haritasi.md new file mode 100644 index 0000000..057db36 --- /dev/null +++ b/setup/00-genel-yol-haritasi.md @@ -0,0 +1,164 @@

# 00 - General Roadmap

This file is the main context for agents that will build the test/prod infrastructure on Hetzner Cloud with Terraform and Ansible in the `Environment_Infrastructure` repo. Each phase document is written to be self-sufficient; nevertheless, this document is the overall decision record.

## Goal

The iklim.co infrastructure will be built on two separate Hetzner Cloud Projects:

- the `test` Hetzner Cloud Project
- the `prod` Hetzner Cloud Project

This separation is considered mandatory.
API tokens, networks, firewalls, placement groups, servers, costs, and the risk of accidental deletion are thereby separated per environment.

## Terraform and Ansible Responsibility Boundary

Terraform creates only IaaS resources:

- Hetzner Cloud servers
- Private network and subnet
- Firewall
- SSH key
- Placement group
- Optional volume, floating IP, load balancer, or DNS record
- Ansible inventory output

Ansible prepares the resulting Linux machines:

- Base Linux packages
- Security hardening
- Docker Engine installation
- Docker Swarm init/join
- Gitea Actions `act_runner` systemd installation
- Shared directories and deploy prerequisites

No Docker, Swarm, runner, or application deployment is done inside Terraform. No Hetzner Cloud resources are created inside Ansible.

## Environment Topologies

### Test

The test environment is a minimal topology:

| Node | Role | Note |
| --- | --- | --- |
| `test-swarm-01` | Swarm manager + app worker + Gitea runner | CI/CD test deploys run through this node |
| `test-db-01` | DB node | The DB stack is installed manually, not via Gitea CI/CD |

Terraform/Ansible take the test DB only as far as machine and OS preparation. PostgreSQL/MongoDB cluster installation is outside this phase.

### Prod

The prod environment is an HA topology:

| Node group | Count | Role |
| --- | ---: | --- |
| `prod-swarm-*` | 3 | Each one is a Swarm manager + app worker |
| `prod-db-*` | 3 | DB cluster nodes |

The prod DB infrastructure is installed manually, not via Gitea CI/CD. Terraform prepares the DB machines and the network/firewall rules; Ansible applies OS hardening and installs base dependencies.

## Public Port Policy

The only ports open to the public internet are:

- `22/tcp` SSH, only from admin IP/CIDR sources
- `80/tcp` HTTP
- `443/tcp` HTTPS

`8200/tcp` (Vault) is never opened to the public internet. Vault must be reachable only from the private network or from inside the Docker overlay.
Some services in the existing application stack files may be publishing host ports. Since the Hetzner Cloud firewall blocks public ingress, those ports must remain unreachable from the public internet. In the long run, however, the stack manifests should also be simplified to comply with this policy.

## Private Network Policy

The detailed matrix of ports that must be open inside the private network is in `07-private-network-port-matrisi.md`. Agents must treat that file as the source of truth when writing firewall or Ansible UFW rules.

## Gitea Actions Runner Decision

`act_runner` will not run as a Docker container, and the Docker socket will not be mounted into a container.

Preferred setup:

- `act_runner` is installed as a Linux systemd service.
- A dedicated `gitea-runner` user is created for the runner.
- CI/CD jobs may create containers when needed; this requires Docker CLI/daemon access on the runner host.
- Because Docker group membership grants near-root privileges, only trusted Gitea repos/jobs should use these runner labels.

For prod HA, `act_runner` will be installed on all 3 Swarm manager nodes rather than a single machine, so the pipeline keeps working if one manager/runner is lost. Runner labels must be both shared and node-specific:

- Shared: `prod-runner`
- Node-specific: `prod-swarm-01`, `prod-swarm-02`, `prod-swarm-03`

A single runner is sufficient for test:

- Shared: `test-runner`
- Node-specific: `test-swarm-01`

## Deploy Lock Decision

In prod, 3 runners are required for HA; however, they can run more than one deploy job at the same time. Prod deploys must therefore be serialized with an automatic lock on the StorageBox.

Lock files/directories will not be created manually. They are created with an atomic `mkdir` at the start of the workflow and removed with `rmdir` when the deploy finishes.
Suggested StorageBox paths:

```text
prod/locks/prod-deploy.lock
prod/locks/prod-infra.lock
prod/locks/services/<service>.lock
```

The simplest and safest starting model is a single global prod deploy lock:

```text
prod/locks/prod-deploy.lock
```

This model serializes all prod deploys. A per-service lock model can be adopted later if needed.

Example flow:

```bash
ssh storagebox 'mkdir -p prod/locks && mkdir prod/locks/prod-deploy.lock'
# deploy steps
ssh storagebox 'rmdir prod/locks/prod-deploy.lock'
```

Because `mkdir` is atomic, the command fails if the lock already exists; in that case the job must wait or exit with a clean error. Even if the workflow fails, a cleanup step must attempt to remove the lock. To detect stale locks, a timestamp, the runner name, and workflow metadata can be written into the lock directory.

## Hetzner Physical Host Separation

Hetzner Cloud offers no direct rack selection. To avoid landing on the same physical host, a `Placement Group` is used. A `spread`-type placement group aims to place the group's cloud servers on different physical hosts.

Constraints:

- A spread placement group reduces the impact of a single physical host failure.
- It gives no guarantee against a wider outage within the same datacenter or location.
- For location-level disaster recovery, a multi-location/region layout should be designed later.
- Per Hetzner's documentation, a spread placement group is limited to 10 servers.
At least two placement groups are recommended for prod:

- `prod-swarm-spread`: the 3 Swarm manager/app nodes
- `prod-db-spread`: the 3 DB nodes

Optional for test:

- `test-spread`: `test-swarm-01` and `test-db-01`

Sources:

- Hetzner Terraform provider: https://registry.terraform.io/providers/hetznercloud/hcloud/latest
- Hetzner Networks: https://docs.hetzner.com/cloud/networks/overview/
- Hetzner Firewalls: https://docs.hetzner.com/cloud/firewalls/overview
- Hetzner Placement Groups: https://docs.hetzner.com/cloud/placement-groups/overview
- Docker Swarm overlay ports: https://docs.docker.com/engine/network/drivers/overlay/
- Gitea act_runner: https://docs.gitea.com/usage/actions/act-runner

diff --git a/setup/01-test-terraform-iaac.md b/setup/01-test-terraform-iaac.md new file mode 100644 index 0000000..d9a101f --- /dev/null +++ b/setup/01-test-terraform-iaac.md @@ -0,0 +1,119 @@

# 01 - Test Terraform IaC

The goal of this stage is to create the minimum IaaS resources inside the test Hetzner Cloud Project with Terraform. This document is written to be applicable on its own.

## Scope

Terraform creates the following in the test environment:

- Private network: `iklim-test-net`
- Subnets:
  - App/Swarm subnet: `10.10.10.0/24`
  - DB subnet: `10.10.20.0/24`
- Firewall:
  - Public ingress: only `22/tcp`, `80/tcp`, `443/tcp`
  - Private ingress: the test rules in `07-private-network-port-matrisi.md`
- SSH key
- Placement group: `test-spread`
- Servers:
  - `test-swarm-01`
  - `test-db-01`
- Ansible inventory output

Terraform does not install any DB software. The DB node is prepared only at the machine, network, and firewall level.

## Suggested File Layout

```text
terraform/
  hetzner/
    test/
      versions.tf
      providers.tf
      variables.tf
      locals.tf
      network.tf
      firewall.tf
      placement.tf
      servers.tf
      outputs.tf
      terraform.tfvars.example
```

`terraform.tfvars` will not be committed. It must be ignored in `.gitignore`.
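As a guard against accidentally committing secret variable files, a small sanity check can confirm the critical patterns really are listed in `.gitignore`. A sketch only; it builds a sample `.gitignore` in a temp directory for demonstration, whereas a real pre-commit hook would check the repo's own file:

```shell
#!/usr/bin/env bash
# Sketch: verify that secret Terraform file patterns appear in .gitignore.
# Uses a self-created sample file so the snippet is self-contained.
set -euo pipefail

workdir=$(mktemp -d)
cat > "$workdir/.gitignore" <<'EOF'
*.tfvars
*.tfvars.json
*.tfstate
*.tfstate.*
EOF

check_ignored() {
  # Exact whole-line, fixed-string match against the sample .gitignore
  grep -qxF "$1" "$workdir/.gitignore"
}

for pattern in '*.tfvars' '*.tfstate'; do
  if check_ignored "$pattern"; then
    echo "ok: $pattern is ignored"
  else
    echo "MISSING: $pattern" >&2
    exit 1
  fi
done
```

Pointing `check_ignored` at the repo's real `.gitignore` (or using `git check-ignore` inside the repo) would make this a usable CI step.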
## Variables

Minimum variables:

```hcl
hcloud_token              = "secret"
environment               = "test"
location                  = "fsn1"
image                     = "ubuntu-24.04"
server_type_swarm         = "cx32"
server_type_db            = "cx42"
admin_ssh_public_key_path = "~/.ssh/id_ed25519.pub"
admin_allowed_cidrs       = ["X.X.X.X/32"]
```

Start with a single `location`. Multi-region/location disaster recovery is out of scope for this stage and should be added to the documentation later.

## Server Roles

| Server | Private IP | Role |
| --- | --- | --- |
| `test-swarm-01` | `10.10.10.11` | Swarm manager + app worker + Gitea runner |
| `test-db-01` | `10.10.20.11` | DB node prepared for manual DB installation |

Private IPs must be pinned in Terraform so that the Ansible inventory and the firewall rules stay deterministic.

## Firewall Rules

Public ingress:

| Port | Source | Target |
| --- | --- | --- |
| `22/tcp` | `admin_allowed_cidrs` | All test nodes |
| `80/tcp` | `0.0.0.0/0`, `::/0` | `test-swarm-01` |
| `443/tcp` | `0.0.0.0/0`, `::/0` | `test-swarm-01` |

The following will not be opened for public ingress: `8200/tcp`, `5432/tcp`, `27017/tcp`, `5672/tcp`, `15672/tcp`, `6379/tcp`, `2379/tcp`, `9180/tcp`, `9090/tcp`, `3000/tcp`.

Private ingress follows `07-private-network-port-matrisi.md`.

## Placement Group

The `test-spread` placement group will have `type = "spread"`. With two servers in test, the group aims to spread `test-swarm-01` and `test-db-01` across different physical hosts.

Note: a spread placement group is not a different-rack or different-location guarantee; it reduces the impact of a single physical host failure.

## Expected Terraform Outputs

`outputs.tf` must produce at least:

```hcl
output "ansible_inventory_yaml" {
  sensitive = false
}

output "test_private_ips" {
  sensitive = false
}

output "test_public_ips" {
  sensitive = false
}
```

The inventory output can later be written to `ansible/inventory/generated/test.yml`.
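To illustrate the shape such an inventory could take, here is a minimal sketch rendered from the pinned IPs above. Group names and the YAML layout are assumptions; in practice the content would come from `terraform output` rather than a hand-written heredoc:

```shell
#!/usr/bin/env bash
# Sketch: render a minimal Ansible YAML inventory for the test hosts from the
# pinned private IPs above. Group names (swarm/db) are illustrative.
set -euo pipefail

render_test_inventory() {
  cat <<'EOF'
all:
  children:
    swarm:
      hosts:
        test-swarm-01:
          ansible_host: 10.10.10.11
    db:
      hosts:
        test-db-01:
          ansible_host: 10.10.20.11
EOF
}

render_test_inventory
```

Writing the result to `ansible/inventory/generated/test.yml` keeps the Terraform IP plan and the Ansible targets in one place.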
If the inventory file contains no secrets, it may be committed; if it contains secrets or tokens, it must not be.

## Acceptance Criteria

- `terraform plan` runs with the test Hetzner Project token only.
- After `terraform apply`, 2 servers exist.
- The two servers can reach each other over the private network.
- From the public internet, only `22`, `80`, and `443` are open at the firewall level.
- Vault's `8200` stays closed to the public.
- Terraform state is not committed to the repo.

diff --git a/setup/02-test-ansible-bootstrap.md b/setup/02-test-ansible-bootstrap.md new file mode 100644 index 0000000..0fb563b --- /dev/null +++ b/setup/02-test-ansible-bootstrap.md @@ -0,0 +1,140 @@

# 02 - Test Ansible Bootstrap

The goal of this stage is to get the test machines created by Terraform ready in terms of Linux, hardening, Docker, and Swarm. DB software installation is outside this stage.

## Target Machines

| Host | Role |
| --- | --- |
| `test-swarm-01` | Swarm manager + app worker |
| `test-db-01` | OS-hardened DB node prepared for manual DB installation |

## Suggested File Layout

```text
ansible/
  ansible.cfg
  inventory/
    generated/
      test.yml
  group_vars/
    all.yml
    test.yml
  playbooks/
    test-bootstrap.yml
  roles/
    base/
    hardening/
    docker/
    swarm/
    node_dirs/
```

## Base Role

Applied to all test nodes:

- `apt update`
- base packages:
  - `curl`
  - `wget`
  - `git`
  - `jq`
  - `unzip`
  - `ca-certificates`
  - `gnupg`
  - `lsb-release`
  - `ufw`
  - `fail2ban`
  - `chrony`
  - `python3`
  - `python3-pip`
- timezone: `Europe/Istanbul`
- hostname configuration
- controlled reboot if the system requires one

## Security Hardening Role

Applied to all test nodes:

- SSH password login is disabled.
- Root SSH login is disabled.
- Only SSH key login remains.
- `PermitEmptyPasswords no`
- `MaxAuthTries 3`
- The `fail2ban` SSH jail is enabled.
- `unattended-upgrades` is enabled.
- UFW defaults:
  - incoming: deny
  - outgoing: allow
- Public SSH is opened only from the admin CIDR.

Note: Docker's iptables rules can interact with UFW. The Hetzner Cloud firewall is treated as the primary outer perimeter; UFW serves as a second layer on the host.

## Docker Role

Mandatory only on `test-swarm-01`. On `test-db-01` it can be kept optional, depending on the manual DB installation strategy.

Docker is installed from the official Docker apt repository:

- Docker GPG key
- Docker apt source
- packages:
  - `docker-ce`
  - `docker-ce-cli`
  - `containerd.io`
  - `docker-buildx-plugin`
  - `docker-compose-plugin`
- Docker service enabled + started

The Docker convenience script will not be used. The package-repository route is preferred for a production-like test environment.

## Swarm Role

On `test-swarm-01`:

- `docker swarm init`
- advertise addr: `10.10.10.11`
- data path addr: `10.10.10.11`
- overlay network:
  - `iklimco-net`
  - driver: `overlay`
  - attachable: `true`
- The node is labeled `type=service`:
  ```bash
  docker node update --label-add type=service test-swarm-01
  ```
- The node stays `AVAILABILITY=Active` (not drained); the single node is both manager and worker.

Since test is a single-node Swarm, no join token is used.

## Node Directory Role

Deploy prerequisites on `test-swarm-01`:

```text
/opt/iklimco
/opt/iklimco/ssl
/opt/iklimco/init
/opt/iklimco/init/postgresql
/opt/iklimco/init/mongodb
```

Minimum on the DB node for manual DB installation:

```text
/opt/iklimco
/opt/iklimco/db
/opt/iklimco/backup
```

## Acceptance Criteria

- `ansible -i inventory/generated/test.yml all -m ping` succeeds.
- `docker info` works on `test-swarm-01`.
- Swarm is active on `test-swarm-01`; the node is `AVAILABILITY=Active` (not drained).
- `iklimco-net` appears in `docker network ls`.
- The output of `docker node inspect test-swarm-01 --format '{{.Spec.Labels}}'` contains `map[type:service]`.
- No public DB port is open on `test-db-01`.
- Public ports are limited to `22`, `80`, `443` at the Hetzner firewall + UFW level.

diff --git a/setup/03-test-runner-ve-deploy-onkosullari.md b/setup/03-test-runner-ve-deploy-onkosullari.md new file mode 100644 index 0000000..2cd641a --- /dev/null +++ b/setup/03-test-runner-ve-deploy-onkosullari.md @@ -0,0 +1,117 @@

# 03 - Test Runner and Deploy Prerequisites

The goal of this stage is to install the Gitea Actions runner as a systemd service in the test environment and to prepare the host prerequisites for the existing test CI/CD pipelines.

## Runner Placement

A single runner is sufficient for test:

| Host | Runner |
| --- | --- |
| `test-swarm-01` | `act_runner` systemd service |

The runner will not run as a Docker container. `/var/run/docker.sock` will not be mounted into a runner container.

## Why a Systemd Runner

In the current CI/CD flow, Gitea jobs perform their own preparation and run the deploy commands themselves. Running the runner as a Docker container would require mounting the Docker socket, and that model creates extra privilege risk. The systemd model has no socket mount; however, since the runner uses Docker on the host, the runner user's Docker access is still considered high privilege.

Therefore:

- The runner is used only for trusted Gitea instances/repos.
- The runner token is stored in Ansible Vault or as a CI secret.
- The runner config and token are not committed to the repo.

## Runner User

The Ansible `gitea_runner` role:

- Creates the `gitea-runner` system user.
- The user's shell can be `/bin/bash` if needed.
- The user is added to the `docker` group if it will use Docker.
- Home directory: `/var/lib/gitea-runner`
- Config directory: `/etc/gitea-act-runner`

Because Docker group membership grants near-root privileges, this decision is accepted deliberately.
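A sketch of the systemd unit the role could template out, tying the user, binary, and config paths above together. The unit contents are an assumption (the repo's actual unit may differ); `act_runner daemon --config` follows the upstream CLI but should be checked against the installed version. The script defaults to writing a local file for review instead of `/etc/systemd/system/`:

```shell
#!/usr/bin/env bash
# Sketch: emit a gitea-act-runner systemd unit. Writes to the current directory
# by default; an Ansible template task would target /etc/systemd/system/.
set -euo pipefail

UNIT_PATH="${1:-gitea-act-runner.service}"

cat > "$UNIT_PATH" <<'EOF'
[Unit]
Description=Gitea Actions runner (act_runner)
After=network-online.target docker.service
Wants=network-online.target

[Service]
User=gitea-runner
Group=gitea-runner
WorkingDirectory=/var/lib/gitea-runner
ExecStart=/usr/local/bin/act_runner daemon --config /etc/gitea-act-runner/config.yaml
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

echo "wrote $UNIT_PATH"
```

On the host this would be followed by `systemctl daemon-reload && systemctl enable --now gitea-act-runner`.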
## Runner Binary Installation

Installation steps:

1. Download the `act_runner` Linux amd64 binary.
2. Place it at `/usr/local/bin/act_runner`.
3. Set executable permissions.
4. Generate the config or write it from a template.
5. Register the runner.
6. Enable + start the systemd unit.

## Runner Label Policy

Test runner labels:

```text
test-runner
test-swarm-01
ubuntu-24.04
docker
swarm-manager
```

The `runs-on` value in the existing workflows must be aligned with one of these labels for test. The old `ubuntu-latest` usage may not be enough to match a self-hosted Gitea runner; this should be clarified against the Gitea Actions label configuration.

## Deploy Prerequisites

Required on `test-swarm-01` for the test deploy pipelines:

- Docker Engine
- Docker Compose plugin
- Git
- curl
- jq
- gettext/envsubst
- tree
- ssh/scp client
- Harbor registry access
- StorageBox access
- Access to the Gitea repo
- Swarm manager privileges
- the `iklimco-net` overlay network

CI/CD will not install the DB infrastructure. The test DB node will be ready; DB software and cluster/manual setup are separate.

## Deploy Lock Note

Since test has a single runner, no deploy race between runners is expected. Still, back-to-back pushes to the same branch or manual re-runs can make deploys of the same service overlap.

A lock is not mandatory for test; but to build the same habit as prod, these StorageBox paths can be used:

```text
test/locks/test-deploy.lock
test/locks/services/<service>.lock
```

The lock is never created manually. The workflow takes it with an atomic `mkdir` at the start and releases it with `rmdir` at the end.
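The registration step (step 5 above) could look like the dry-run sketch below. The instance URL and environment variable names are placeholders; the flags follow the upstream `act_runner register` CLI but should be verified against the installed version. The command is only printed here, never executed:

```shell
#!/usr/bin/env bash
# Dry-run sketch of registering the test runner with its label set.
# GITEA_INSTANCE is a placeholder URL; the token stays in the environment.
set -euo pipefail

GITEA_INSTANCE="${GITEA_INSTANCE:-https://gitea.example.internal}"  # placeholder
RUNNER_NAME="${RUNNER_NAME:-test-swarm-01}"
RUNNER_LABELS="test-runner,test-swarm-01,ubuntu-24.04,docker,swarm-manager"

build_register_cmd() {
  # The token is expanded at run time from the environment and never logged.
  printf '%s' "act_runner register --no-interactive \
--instance ${GITEA_INSTANCE} \
--name ${RUNNER_NAME} \
--labels ${RUNNER_LABELS} \
--config /etc/gitea-act-runner/config.yaml \
--token \$GITEA_RUNNER_TOKEN"
}

build_register_cmd
echo
```

Printing the command first makes the registration step reviewable in CI logs without leaking the token.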
## Secret Requirements

Secrets needed for runner installation and pipeline runs:

- Gitea runner registration token
- Harbor username/password or token
- StorageBox credential
- SSH deploy key
- No Hetzner token is needed here; it is used in the Terraform stage

These secrets will not be written into the repo.

## Acceptance Criteria

- `systemctl status gitea-act-runner` shows active.
- The test runner appears online in the Gitea UI.
- The runner labels match the test workflows' `runs-on`.
- A simple test workflow runs on the runner.
- If a runner job can execute Docker commands, the deploy prerequisite is met.
- `8200/tcp` is not open to the public internet.

diff --git a/setup/04-prod-terraform-iaac.md b/setup/04-prod-terraform-iaac.md new file mode 100644 index 0000000..185d5d8 --- /dev/null +++ b/setup/04-prod-terraform-iaac.md @@ -0,0 +1,132 @@

# 04 - Prod Terraform IaC

The goal of this stage is to create the HA-oriented IaaS resources inside the prod Hetzner Cloud Project with Terraform. This document can be handed to the prod Terraform agent on its own.

## Scope

Terraform creates the following in prod:

- Private network: `iklim-prod-net`
- Subnets:
  - App/Swarm subnet: `10.20.10.0/24`
  - DB subnet: `10.20.20.0/24`
- Firewall:
  - Public ingress: only `22/tcp`, `80/tcp`, `443/tcp`
  - Private ingress: the prod rules in `07-private-network-port-matrisi.md`
- SSH key
- Placement groups:
  - `prod-swarm-spread`
  - `prod-db-spread`
- Servers:
  - `prod-swarm-01`
  - `prod-swarm-02`
  - `prod-swarm-03`
  - `prod-db-01`
  - `prod-db-02`
  - `prod-db-03`
- Ansible inventory output

DB cluster software will not be installed via Terraform. The DB nodes are prepared only at the machine, network, and firewall level.
## Suggested File Layout

```text
terraform/
  hetzner/
    prod/
      versions.tf
      providers.tf
      variables.tf
      locals.tf
      network.tf
      firewall.tf
      placement.tf
      servers.tf
      outputs.tf
      terraform.tfvars.example
```

`terraform.tfvars`, state files, and tokens will not be committed to the repo.

## Variables

Minimum variables:

```hcl
hcloud_token              = "secret"
environment               = "prod"
location                  = "fsn1"
image                     = "ubuntu-24.04"
server_type_swarm         = "cx42"
server_type_db            = "cx52"
admin_ssh_public_key_path = "~/.ssh/id_ed25519.pub"
admin_allowed_cidrs       = ["X.X.X.X/32"]
```

Server type values may change with capacity needs. This document fixes the topology and security decisions; sizing can be revised later.

## Server Roles and Private IP Plan

| Server | Private IP | Role |
| --- | --- | --- |
| `prod-swarm-01` | `10.20.10.11` | Swarm manager + app worker + runner |
| `prod-swarm-02` | `10.20.10.12` | Swarm manager + app worker + runner |
| `prod-swarm-03` | `10.20.10.13` | Swarm manager + app worker + runner |
| `prod-db-01` | `10.20.20.11` | Manual DB cluster node |
| `prod-db-02` | `10.20.20.12` | Manual DB cluster node |
| `prod-db-03` | `10.20.20.13` | Manual DB cluster node |

Private IPs must be pinned.

## Placement Group Decision

Two separate spread placement groups for prod:

```text
prod-swarm-spread: prod-swarm-01/02/03
prod-db-spread:    prod-db-01/02/03
```

This way the Swarm quorum nodes are spread across different physical hosts among themselves, and the DB nodes likewise.

Notes:

- Hetzner does not offer direct rack selection.
- A spread placement group targets different physical hosts.
- Multi-location/region disaster recovery is out of scope for this stage.
- When scale grows, multi-location DR should be designed separately.
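The IP plan above is deterministic (hosts 11 to 13 in each subnet), so it can be derived instead of hand-maintained, which keeps the table and any Terraform locals from drifting apart. A purely illustrative sketch:

```shell
#!/usr/bin/env bash
# Sketch: derive the pinned prod private IPs from the subnet plan above.
# prod-swarm-NN -> 10.20.10.(10+NN), prod-db-NN -> 10.20.20.(10+NN).
set -euo pipefail

private_ip() {
  local name="$1"
  local index="${name##*-}"        # e.g. prod-swarm-03 -> 03
  local host=$((10 + 10#$index))   # 01 -> 11, 02 -> 12, 03 -> 13
  case "$name" in
    prod-swarm-*) echo "10.20.10.$host" ;;
    prod-db-*)    echo "10.20.20.$host" ;;
    *) echo "unknown subnet for $name" >&2; return 1 ;;
  esac
}

for node in prod-swarm-01 prod-swarm-02 prod-swarm-03 \
            prod-db-01 prod-db-02 prod-db-03; do
  printf '%-14s %s\n' "$node" "$(private_ip "$node")"
done
```

The same mapping could live in a Terraform `locals` block; the point is that one rule, not six hand-typed addresses, defines the plan.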
## Public Firewall

Public ingress:

| Port | Source | Target |
| --- | --- | --- |
| `22/tcp` | `admin_allowed_cidrs` | All prod nodes |
| `80/tcp` | `0.0.0.0/0`, `::/0` | Prod gateway entrypoint |
| `443/tcp` | `0.0.0.0/0`, `::/0` | Prod gateway entrypoint |

The following ports will not be opened publicly in prod:

- `8200/tcp` Vault
- `5432/tcp` PostgreSQL
- `27017/tcp` MongoDB
- `6379/tcp` Redis
- `5672/tcp`, `15672/tcp`, `61613/tcp`, `15674/tcp` RabbitMQ
- `2377/tcp`, `7946/tcp`, `7946/udp`, `4789/udp` Docker Swarm
- `9180/tcp` APISIX Admin API
- `9090/tcp` Prometheus
- `3000/tcp` Grafana

If these services are needed, they can be reached via the private network, a VPN, a bastion, or an extra rule restricted to the admin CIDR. The default public policy stays closed.

## Acceptance Criteria

- `terraform plan` runs with the prod Hetzner Project token only.
- 6 servers are created.
- The Swarm nodes are in the `prod-swarm-spread` placement group.
- The DB nodes are in the `prod-db-spread` placement group.
- The public firewall allows ingress only on `22`, `80`, `443`.
- The private firewall complies with `07-private-network-port-matrisi.md`.
- Terraform state and secret tfvars are not committed.

diff --git a/setup/05-prod-ansible-bootstrap.md b/setup/05-prod-ansible-bootstrap.md new file mode 100644 index 0000000..25079df --- /dev/null +++ b/setup/05-prod-ansible-bootstrap.md @@ -0,0 +1,146 @@

# 05 - Prod Ansible Bootstrap

The goal of this stage is to get the prod machines created by Terraform ready in terms of Linux, security hardening, Docker, and Swarm. The DB cluster software will be installed manually; on the DB nodes this playbook performs only OS and base security preparation.
## Target Machines

| Host | Role |
| --- | --- |
| `prod-swarm-01` | Swarm manager + app worker |
| `prod-swarm-02` | Swarm manager + app worker |
| `prod-swarm-03` | Swarm manager + app worker |
| `prod-db-01` | Manual DB cluster node |
| `prod-db-02` | Manual DB cluster node |
| `prod-db-03` | Manual DB cluster node |

## Suggested File Layout

```text
ansible/
  ansible.cfg
  inventory/
    generated/
      prod.yml
  group_vars/
    all.yml
    prod.yml
  playbooks/
    prod-bootstrap.yml
  roles/
    base/
    hardening/
    docker/
    swarm/
    node_dirs/
```

## Base Role

Applied to all prod nodes:

- Package cache update
- Base packages:
  - `curl`
  - `wget`
  - `git`
  - `jq`
  - `unzip`
  - `ca-certificates`
  - `gnupg`
  - `lsb-release`
  - `ufw`
  - `fail2ban`
  - `chrony`
  - `python3`
  - `python3-pip`
- timezone: `Europe/Istanbul`
- hostname configuration
- chrony/NTP enabled

## Security Hardening Role

Applied to all prod nodes:

- SSH password auth is disabled.
- Root SSH login is disabled.
- Only SSH key auth remains.
- `PermitEmptyPasswords no`
- `MaxAuthTries 3`
- `fail2ban` is enabled.
- `unattended-upgrades` is enabled.
- UFW default: incoming deny, outgoing allow.
- SSH is opened only from the admin CIDR.
- DB ports are not opened publicly.

The Hetzner Cloud Firewall is the primary perimeter. UFW is a second defensive layer on the host.

## Docker Role

Mandatory only on the `prod-swarm-*` nodes.

Packages to install:

- `docker-ce`
- `docker-ce-cli`
- `containerd.io`
- `docker-buildx-plugin`
- `docker-compose-plugin`

Installed from the official Docker apt repository. The convenience script will not be used.

Docker is not mandatory on the DB nodes. If the manual DB strategy ends up container-based, it should be covered later in a separate DB document.

## Swarm Role

Prod Swarm is built with 3 managers:

1. `docker swarm init` on `prod-swarm-01`
2. Advertise/data path addr: `10.20.10.11`
3.
The manager join token is obtained.
4. `prod-swarm-02` and `prod-swarm-03` join as managers.
5. The overlay network is created:
   - `iklimco-net`
   - driver: `overlay`
   - attachable: `true`
6. All 3 nodes are labeled `type=service`:
   ```bash
   for node in prod-swarm-01 prod-swarm-02 prod-swarm-03; do
     docker node update --label-add type=service "$node"
   done
   ```
7. No node is drained. All 3 nodes stay `AVAILABILITY=Active` and act as both manager and app worker.

> The DB nodes (`prod-db-*`) are not joined to the Swarm. The DB cluster is managed separately.

## Node Directory Role

On all `prod-swarm-*` nodes:

```text
/opt/iklimco
/opt/iklimco/ssl
/opt/iklimco/init
/opt/iklimco/init/postgresql
/opt/iklimco/init/mongodb
```

On the DB nodes, for manual DB installation:

```text
/opt/iklimco
/opt/iklimco/db
/opt/iklimco/backup
```

## Acceptance Criteria

- `ansible -i inventory/generated/prod.yml all -m ping` succeeds.
- The 3 Swarm nodes appear as managers in `docker node ls`; all are `AVAILABILITY=Active`.
- Manager quorum holds (3 managers; the loss of 1 is tolerated).
- The `iklimco-net` overlay network exists.
- The output of `docker node inspect prod-swarm-01 --format '{{.Spec.Labels}}'` contains `map[type:service]`.
- The DB nodes do not appear in `docker node ls`.
- The public firewall allows ingress only on `22`, `80`, `443`.
- The DB nodes open no public DB port.
- DB software installation is not performed by this playbook.

diff --git a/setup/06-prod-runner-ha-ve-swarm.md b/setup/06-prod-runner-ha-ve-swarm.md new file mode 100644 index 0000000..f2f32a2 --- /dev/null +++ b/setup/06-prod-runner-ha-ve-swarm.md @@ -0,0 +1,158 @@

# 06 - Prod Runner HA and Swarm Deploy Model

The goal of this stage is to set up the Gitea Actions runners for HA in prod and to define the prerequisites for distributing services across the 3 Swarm nodes.
## Runner Count

A single runner is functionally sufficient but not HA. Since prod targets HA, `act_runner` will be installed as a systemd service on all 3 Swarm manager nodes:

| Host | Runner |
| --- | --- |
| `prod-swarm-01` | `act_runner` systemd |
| `prod-swarm-02` | `act_runner` systemd |
| `prod-swarm-03` | `act_runner` systemd |

In this model, if any manager/runner is lost, the remaining runners can pick up pipeline jobs.

## Runner Installation Model

The runner will not run as a Docker container. No Docker socket mount.

Installation:

- `gitea-runner` system user
- `/usr/local/bin/act_runner`
- `/etc/gitea-act-runner/config.yaml`
- `/var/lib/gitea-runner`
- `gitea-act-runner.service`

If runner jobs use the Docker CLI for deploys, the `gitea-runner` user needs Docker daemon access. Docker group membership is considered near-root privilege; only trusted repos/jobs should use these runner labels.

## Runner Label Policy

Labels shared by all prod runners:

```text
prod-runner
docker
swarm-manager
ubuntu-24.04
```

Node-specific labels:

```text
prod-swarm-01
prod-swarm-02
prod-swarm-03
```

If the existing prod workflows use `runs-on: prod-runner`, any of the 3 runners can take the job. To pin a job to a specific node, use the node-specific label.

## Deploy Race Risk

With multiple runners, several deploy jobs can run at the same time. That is good for HA but creates contention risk on shared resources.

Risky areas:

- Concurrent `docker stack deploy` on the same stack
- Concurrent `docker service update` on the same service
- Concurrent updates of the same `.env` or manifest file on the StorageBox
- The root infrastructure pipeline and a microservice deploy pipeline running at the same time

Required safeguards:

- The prod root infrastructure deploy must run manually/with approval.
- A prod deploy of the same service must not be triggered more than once at a time.
- Prod deploy workflows must use an automatic deploy lock on the StorageBox.

## StorageBox Deploy Lock Model

With 3 runners in prod, the deploy lock is considered mandatory. The lock is not kept on the local filesystem, because the runners run on different machines and cannot see each other's `/tmp` or `/var/lock` directories.

The lock lives on the StorageBox:

```text
prod/locks/prod-deploy.lock
prod/locks/prod-infra.lock
prod/locks/services/<service>.lock
```

Starting model:

```text
prod/locks/prod-deploy.lock
```

This single global lock serializes all prod deploys and is the least complex model. If deploy durations grow, a per-service lock can be adopted later.

The lock file/directory is never created manually. The workflow takes the lock with an atomic `mkdir` at the start and releases it with `rmdir` at the end.

Example:

```bash
LOCK_DIR="prod/locks/prod-deploy.lock"
LOCK_META="owner.txt"

ssh "$STORAGEBOX_SSH" "mkdir -p prod/locks && mkdir '$LOCK_DIR'"
ssh "$STORAGEBOX_SSH" "printf '%s\n' 'runner=${GITEA_RUNNER_NAME:-unknown}' 'run=${GITHUB_RUN_ID:-unknown}' 'created_at=$(date -u +%FT%TZ)' > '$LOCK_DIR/$LOCK_META'"

# deploy steps

ssh "$STORAGEBOX_SSH" "rm -f '$LOCK_DIR/$LOCK_META' && rmdir '$LOCK_DIR'"
```

Behavior:

- If `mkdir '$LOCK_DIR'` succeeds, the lock is taken.
- If `mkdir '$LOCK_DIR'` fails, another deploy is assumed to be running.
- Even if the job fails, the cleanup step must run `rm/rmdir`.
- Stale-lock cleanup must be manual/approved; forced automatic removal is not applied in the first phase.
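The `mkdir` semantics can be verified without a StorageBox. A local simulation (directory names are illustrative, and `ssh` is left out) shows that a second acquisition attempt fails while the lock is held, and how a stale-lock age could be computed from the recorded timestamp:

```shell
#!/usr/bin/env bash
# Local simulation of the atomic mkdir lock above; no StorageBox involved.
# The layout mirrors prod/locks/prod-deploy.lock inside a temp directory.
set -euo pipefail

LOCK_ROOT="$(mktemp -d)/locks"
LOCK_DIR="$LOCK_ROOT/prod-deploy.lock"

acquire_lock() {
  mkdir -p "$LOCK_ROOT"
  if mkdir "$LOCK_DIR" 2>/dev/null; then
    date -u +%s > "$LOCK_DIR/created_at"   # metadata for stale-lock inspection
    return 0
  fi
  return 1   # lock already held by another deploy
}

release_lock() {
  rm -f "$LOCK_DIR/created_at"
  rmdir "$LOCK_DIR"
}

lock_age_seconds() {
  echo $(( $(date -u +%s) - $(cat "$LOCK_DIR/created_at") ))
}

acquire_lock && echo "first acquire: ok"
if acquire_lock; then echo "second acquire: ok"; else echo "second acquire: blocked"; fi
echo "lock age: $(lock_age_seconds)s"
release_lock
echo "released"
```

Because `mkdir` either creates the directory or fails atomically, there is no check-then-create race; the same property holds over `ssh` against the StorageBox.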
Lock levels:

| Lock | Purpose |
| --- | --- |
| `prod/locks/prod-deploy.lock` | First phase: global lock for all prod deploys |
| `prod/locks/prod-infra.lock` | Later: separate root infra deploys from microservice deploys |
| `prod/locks/services/<service>.lock` | Later: move to per-service parallel deploys |

## Swarm Service Distribution

Since all 3 prod nodes are manager + app worker, services can be distributed across the 3 nodes.

The `docker-stack-service.yml` deploy settings for application services can later be revised along these principles:

- `replicas: 3` for stateless services
- `placement` selects only app-capable nodes
- `update_config` tuned for rolling updates
- `restart_policy` stays enabled
- Stateful services are not replicated on the app workers; the stateful layer lives separately on the DB nodes

In the current repo, the microservice stack files are deployed per service. This document defines the runner and Swarm prerequisites for the prod HA target; each microservice's replica count is a separate application deploy refactor.

## Gateway and Public Traffic

The public internet must enter the gateway layer only via `80/tcp` and `443/tcp`.

The existing stack files may be publishing APISIX on `8080/8443`. Since only `80/443` are open in the target prod architecture's public firewall, one of two options must be chosen:

1. The APISIX/SWAG host publish ports are aligned with `80/443`.
2. A Hetzner Load Balancer or a reverse proxy accepts `80/443` and forwards to the Swarm gateway ports over the private network.

This decision is separate from the Terraform/Ansible bootstrap; it requires an application infrastructure manifest revision.

## Acceptance Criteria

- The 3 prod runners appear online in the Gitea UI.
- Every runner has the `prod-runner` label.
- Any of the runners can execute a simple Docker command.
- `docker node ls` shows 3 managers.
- When one runner/node is shut down, another runner can pick up new jobs.
- Prod workflows use the global `prod/locks/prod-deploy.lock` on the StorageBox.
- The lock is managed automatically by the workflow via `mkdir/rmdir`, not manually.
- Public ingress is limited to `22`, `80`, `443`.

diff --git a/setup/07-private-network-port-matrisi.md b/setup/07-private-network-port-matrisi.md new file mode 100644 index 0000000..223c2d3 --- /dev/null +++ b/setup/07-private-network-port-matrisi.md @@ -0,0 +1,149 @@

# 07 - Private Network Port Matrix

This file defines the ports that must be open inside the Hetzner private network in the test and prod environments. The only ports open to the public internet are `22/tcp`, `80/tcp`, and `443/tcp`. Vault's `8200/tcp` will not be public.

This matrix must be treated as the source of truth for the Terraform Hetzner firewall and the Ansible UFW rules.

## Network Plan

### Test

| Subnet | CIDR | Purpose |
| --- | --- | --- |
| App/Swarm | `10.10.10.0/24` | `test-swarm-01` |
| DB | `10.10.20.0/24` | `test-db-01` |

### Prod

| Subnet | CIDR | Purpose |
| --- | --- | --- |
| App/Swarm | `10.20.10.0/24` | `prod-swarm-01/02/03` |
| DB | `10.20.20.0/24` | `prod-db-01/02/03` |

## Public Ingress Standard

Public ingress for all environments:

| Port | Protocol | Source | Target | Requirement |
| --- | --- | --- | --- | --- |
| `22` | TCP | Admin IP/CIDR | All nodes | SSH administration |
| `80` | TCP | Internet | Gateway entrypoint | HTTP / ACME redirect |
| `443` | TCP | Internet | Gateway entrypoint | HTTPS |

Critical ports that will not be opened publicly:

| Port | Service |
| --- | --- |
| `8200/tcp` | Vault |
| `5432/tcp` | PostgreSQL |
| `27017/tcp` | MongoDB |
| `6379/tcp` | Redis |
| `5672/tcp`, `15672/tcp`, `61613/tcp`, `15674/tcp` | RabbitMQ |
| `2377/tcp`, `7946/tcp`, `7946/udp`, `4789/udp` | Docker Swarm |
| `9180/tcp` | APISIX Admin API |
| `9090/tcp` | Prometheus |
| `3000/tcp` | Grafana |

## Docker Swarm Private Ports

Mandatory ports between Docker Swarm nodes:

| Port | Protocol | Source | Target | Description |
| --- | --- | --- | --- | --- |
| `2377` | TCP | Swarm nodes | Swarm manager nodes | Swarm control plane / join |
| `7946` | TCP | All Swarm nodes | All Swarm nodes | Node discovery / gossip |
| `7946` | UDP | All Swarm nodes | All Swarm nodes | Node discovery / gossip |
| `4789` | UDP | All Swarm nodes | All Swarm nodes | Overlay VXLAN data path |

Although in test these ports are strictly needed only for the single Swarm node, they can be defined within the app subnet to make adding workers easier later.

In prod, these ports must be open between all `prod-swarm-*` nodes inside the `10.20.10.0/24` app/swarm subnet.

Source: Docker overlay network documentation, https://docs.docker.com/engine/network/drivers/overlay/

## Application and Infra Service Private Ports

These ports will not be opened publicly. Access is allowed only from the required sources inside the private network or the Docker overlay.

| Port | Protocol | Service | Source | Target | Note |
| --- | --- | --- | --- | --- | --- |
| `8200` | TCP | Vault API/UI | Swarm app nodes / runner | Vault service/node | Public closed; runtime services must reach Vault over private/overlay |
Runtime servisleri Vault'a private/overlay uzerinden erismeli | +| `6379` | TCP | Redis | Swarm app node'lari | Redis service/node | Public kapali | +| `5672` | TCP | RabbitMQ AMQP | Swarm app node'lari | RabbitMQ service/node | Public kapali | +| `15672` | TCP | RabbitMQ Management | Admin CIDR veya private ops | RabbitMQ service/node | Public kapali; tercihen VPN/bastion | +| `61613` | TCP | RabbitMQ STOMP | Gerekli app node'lari | RabbitMQ service/node | Public kapali | +| `15674` | TCP | RabbitMQ Web STOMP | Gerekli app/gateway node'lari | RabbitMQ service/node | Public kapali | +| `2379` | TCP | etcd client | APISIX service/node | etcd service/node | Public kapali | +| `2380` | TCP | etcd peer | etcd cluster node'lari | etcd cluster node'lari | Tek replica ise gerekmeyebilir; cluster olursa gerekli | +| `9180` | TCP | APISIX Admin API | Admin CIDR veya private ops | APISIX service/node | Public kapali | +| `9090` | TCP | Prometheus UI/API | Admin CIDR veya private ops | Prometheus service/node | Public kapali | +| `3000` | TCP | Grafana UI | Admin CIDR veya private ops | Grafana service/node | Public kapali | + +Mevcut `docker-stack-infra.yml` bazi servisleri host mode ile publish ediyor olabilir. Hetzner firewall public ingress'i kapatsa bile private ingress kararini bu tablo belirler. + +## DB Node Portlari + +DB altyapisi manuel kurulacagi icin kesin cluster teknolojisi bu dokumanin disindadir. Yine de firewall icin varsayilan portlar asagidadir. 
+
+### PostgreSQL / PostGIS
+
+| Port | Protocol | Source | Target | Note |
+| --- | --- | --- | --- | --- |
+| `5432` | TCP | App/Swarm subnet | PostgreSQL node/cluster endpoint | Application DB connections |
+| `5432` | TCP | DB subnet | PostgreSQL nodes | Streaming replication may use the same port |
+
+If Patroni is used, additional ports must be finalized later in the DB runbook:
+
+| Port | Protocol | Purpose |
+| --- | --- | --- |
+| `8008` | TCP | Patroni REST API |
+| `2379-2380` | TCP | etcd client/peer, if etcd is used for Patroni |
+| `5000-5001` | TCP | If HAProxy or a similar DB endpoint is used |
+
+These extra ports should only be opened once the relevant technology has actually been chosen.
+
+### MongoDB
+
+| Port | Protocol | Source | Target | Note |
+| --- | --- | --- | --- | --- |
+| `27017` | TCP | App/Swarm subnet | MongoDB node/replica set endpoint | Application DB connections |
+| `27017` | TCP | DB subnet | MongoDB replica set nodes | Replica set internal traffic |
+
+If sharding is introduced later, additional MongoDB roles on ports such as `27018/27019` may come into play; they will not be opened at this stage.
+
+## Test Private Rules
+
+Minimum rules for the test environment:
+
+| Source | Target | Ports |
+| --- | --- | --- |
+| `10.10.10.0/24` | `10.10.10.0/24` | `2377/tcp`, `7946/tcp`, `7946/udp`, `4789/udp` |
+| `10.10.10.0/24` | `10.10.20.0/24` | `5432/tcp`, `27017/tcp` |
+| `10.10.10.0/24` | `10.10.10.0/24` | `8200/tcp`, `6379/tcp`, `5672/tcp`, `61613/tcp`, `15674/tcp` |
+| Admin CIDR or VPN | `10.10.10.0/24` | `15672/tcp`, `9180/tcp`, `9090/tcp`, `3000/tcp` |
+
+Because test has a single DB node, the PostgreSQL/MongoDB replication ports inside the DB subnet may see no actual use there.
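As a sanity check, the test matrix above can be expanded mechanically into UFW rules. A minimal sketch, assuming the admin CIDR is a placeholder (`203.0.113.0/24`, overridable via `ADMIN_CIDR`); the commands are printed for review rather than executed:

```sh
#!/bin/sh
# Sketch: expand the test private-rule matrix into ufw commands.
# Each rule is applied on the node that owns the destination service,
# so the destination is the node itself ("to any").
set -eu

APP_SUBNET="10.10.10.0/24"                  # test app/swarm subnet
ADMIN_CIDR="${ADMIN_CIDR:-203.0.113.0/24}"  # placeholder admin CIDR

RULES=""
add_rule() { # add_rule <source> <port> <proto>
  RULES="${RULES}ufw allow from $1 to any port $2 proto $3
"
}

# Swarm control plane / gossip / overlay, app subnet only
for spec in 2377/tcp 7946/tcp 7946/udp 4789/udp; do
  add_rule "$APP_SUBNET" "${spec%/*}" "${spec#*/}"
done

# App subnet -> DB node ports (PostgreSQL, MongoDB); applied on the DB node
for port in 5432 27017; do add_rule "$APP_SUBNET" "$port" tcp; done

# Infra service ports reachable inside the app subnet
for port in 8200 6379 5672 61613 15674; do add_rule "$APP_SUBNET" "$port" tcp; done

# Admin-only UI ports, restricted to the admin CIDR
for port in 15672 9180 9090 3000; do add_rule "$ADMIN_CIDR" "$port" tcp; done

printf '%s' "$RULES"
```

The prod matrix expands the same way with the `10.20.10.0/24` and `10.20.20.0/24` subnets plus `2379/tcp` for etcd.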
+
+## Prod Private Rules
+
+Minimum rules for the prod environment:
+
+| Source | Target | Ports |
+| --- | --- | --- |
+| `10.20.10.0/24` | `10.20.10.0/24` | `2377/tcp`, `7946/tcp`, `7946/udp`, `4789/udp` |
+| `10.20.10.0/24` | `10.20.20.0/24` | `5432/tcp`, `27017/tcp` |
+| `10.20.20.0/24` | `10.20.20.0/24` | `5432/tcp`, `27017/tcp` |
+| `10.20.10.0/24` | `10.20.10.0/24` | `8200/tcp`, `6379/tcp`, `5672/tcp`, `61613/tcp`, `15674/tcp`, `2379/tcp` |
+| Admin CIDR or VPN | `10.20.10.0/24` | `15672/tcp`, `9180/tcp`, `9090/tcp`, `3000/tcp` |
+
+If Patroni, HAProxy, Mongo sharding, or a separate monitoring-agent architecture is chosen, additional ports must be added to this matrix in a controlled way.
+
+## Acceptance Criteria
+
+- The public firewall does not open `8200/tcp`.
+- DB ports are not publicly reachable.
+- Swarm ports are open only inside the private app/swarm subnet.
+- The app/swarm subnet reaches the DB subnet only on the required DB ports.
+- The DB subnet is not granted broad access back to the app subnet.
+- Admin UI ports are restricted to the admin CIDR/VPN/private ops instead of being public.
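The public-ingress criterion can also be smoke-tested per node. A minimal sketch; the `public_allowed` helper and the sample port list are illustrative, and on a real node the input would come from the actually listening sockets (e.g. `ss -ltn`):

```sh
#!/bin/sh
# Sketch: encode the public-ingress standard (only 22/80/443 public)
# as a checkable helper, then evaluate a sample list of ports.
set -eu

PUBLIC_ALLOWED="22 80 443"

public_allowed() { # exit 0 if the given port may be exposed publicly
  for p in $PUBLIC_ALLOWED; do
    if [ "$p" = "$1" ]; then return 0; fi
  done
  return 1
}

# Sample ports to evaluate; replace with the node's listening ports.
for port in 22 443 8200 5432 2377; do
  if public_allowed "$port"; then
    echo "port $port: public OK"
  else
    echo "port $port: must stay private"
  fi
done
```

Wired into CI, a non-empty "must stay private" result for a publicly reachable port would fail the acceptance check.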