This commit introduces a reordered and renumbered set of setup documentation files to better reflect the deployment stages for both test and production environments. Key changes:

* A new `setup-vs-roadmap-map.md` file providing a clear mapping between roadmap tasks and their corresponding setup phases.
* Significantly expanded Ansible bootstrap documentation for both test and production, detailing the Docker, Swarm, security-hardening, and StorageBox SSH key management roles.
* Formalized database Docker and Swarm cluster setup instructions for test and production, including explicit steps for Swarm worker integration of DB nodes.
* Updated roadmap documentation (`roadmap/prod-env/*`) to align with the refined setup, incorporating correct private IP addresses for Swarm joins, new node labels, and floating IP usage for GoDaddy DNS records.
# 01 — Docker Swarm Init (Prod — Multi-Node)

## Context

- **Repo:** `iklim.co` root
- **Environment:** prod
- **Topology:**
  - 3 × app nodes (`iklim-app-01/02/03`) — all act as **Swarm managers AND app workers** (Raft quorum: one manager can fail)
  - 3 × DB nodes (`iklim-db-01/02/03`) — join the Swarm as **workers** with the `role=db` label; DB services are placed exclusively on them
- **Sizing:** app nodes are `cpx42`, DB nodes are `cpx32`; see `../../hetzner-sizing-report.md`
- All 6 nodes are on the same private network.
- Pipeline trigger: push to the `prod-env` branch → Gitea runner on `prod-runner` (the first app node).
- App Swarm managers: all 3 app nodes are manager-eligible and carry app workloads (there are no dedicated worker-only app nodes).
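
Before initializing the Swarm, it can help to confirm private-network reachability from iklim-app-01. A sketch only — `10.20.10.11` is the manager IP used in the join commands in this document, but the other five addresses are assumptions; substitute your actual inventory:

```bash
# Ping each node once over the private network; report failures without aborting.
# Only 10.20.10.11 is confirmed by this doc — the other IPs are illustrative.
for ip in 10.20.10.11 10.20.10.12 10.20.10.13 10.20.10.21 10.20.10.22 10.20.10.23; do
  if ping -c1 -W1 "$ip" >/dev/null 2>&1; then
    echo "OK   $ip"
  else
    echo "FAIL $ip" >&2
  fi
done
```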

## Node labeling plan

| Node | Role | Swarm role | Labels |
|------|------|------------|--------|
| `iklim-app-01` | API services, SWAG, Vault | Manager + Worker | `type=service` |
| `iklim-app-02` | API service replicas | Manager + Worker | `type=service` |
| `iklim-app-03` | API service replicas | Manager + Worker | `type=service` |
| `iklim-db-01` | PostgreSQL (Patroni), etcd | Worker | `role=db` |
| `iklim-db-02` | PostgreSQL (Patroni), etcd | Worker | `role=db` |
| `iklim-db-03` | MongoDB replica + PostgreSQL (Patroni), etcd | Worker | `role=db` |

## Step 1 — Init Swarm on iklim-app-01 (the prod-runner node)

```bash
# hostname -I lists all addresses; the first is assumed to be this node's
# private-network IP (10.20.10.11, the address used in the join commands)
MANAGER_IP=$(hostname -I | awk '{print $1}')

if ! docker info --format '{{.Swarm.LocalNodeState}}' | grep -q "active"; then
  docker swarm init --advertise-addr "$MANAGER_IP"
  echo "Swarm initialized on $MANAGER_IP"
else
  echo "Swarm already active"
fi
```

## Step 2 — Get manager join token

```bash
docker swarm join-token manager   # for iklim-app-02, iklim-app-03
```

Save this token — it is needed on iklim-app-02 and iklim-app-03.
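
If you prefer to script this step, the token can be captured non-interactively with the `-q` flag, which prints only the token instead of the full join command (a sketch — the manager address matches the join commands used in this document):

```bash
# -q suppresses the surrounding "docker swarm join ..." text and prints only the token
MANAGER_TOKEN=$(docker swarm join-token -q manager)
echo "Run on iklim-app-02/03: docker swarm join --token $MANAGER_TOKEN 10.20.10.11:2377"
```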

## Step 3 — Join iklim-app-02 and iklim-app-03 as managers

SSH into iklim-app-02 and iklim-app-03 and run (10.20.10.11 is iklim-app-01's private IP):

```bash
docker swarm join --token <MANAGER_TOKEN> 10.20.10.11:2377
```

## Step 4 — Label app nodes

On iklim-app-01, after iklim-app-02 and iklim-app-03 have joined:

```bash
for node in iklim-app-01 iklim-app-02 iklim-app-03; do
  docker node update --label-add type=service "$node"
done
```

## Step 5 — Join DB nodes as Swarm workers

Get the worker join token on iklim-app-01:

```bash
docker swarm join-token worker
```

SSH into each DB node and join:

```bash
docker swarm join --token <WORKER_TOKEN> 10.20.10.11:2377
```

Then label them on iklim-app-01:

```bash
for node in iklim-db-01 iklim-db-02 iklim-db-03; do
  docker node update --label-add role=db "$node"
done
```

> DB nodes are Swarm **workers** only — they never become managers.
> DB services are pinned to them via a `node.labels.role == db` placement constraint.
> See `08-prod-db-cluster-kurulum.md` for DB stack deployment.
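
In a stack file, the constraint from the note above takes the following shape — an illustrative fragment only, with a hypothetical service and image name rather than the contents of the real DB stack files:

```yaml
services:
  patroni:
    image: example/patroni:latest   # hypothetical image name
    deploy:
      placement:
        constraints:
          - node.labels.role == db   # only iklim-db-01/02/03 match
```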

## Step 6 — Verify

```bash
docker node ls
```

Expected: 6 nodes — 3 managers with `MANAGER STATUS` of `Leader` or `Reachable`, and 3 workers with an empty `MANAGER STATUS`; all 6 should show `STATUS` = `Ready`.

```bash
docker node inspect iklim-app-01 --format '{{.Spec.Labels}}'
docker node inspect iklim-db-01 --format '{{.Spec.Labels}}'
```

Expected: `map[type:service]` for app nodes, `map[role:db]` for DB nodes.

## Step 7 — Confirm `init/swarm-init.sh` multi-node awareness

The script is idempotent (it skips init if Swarm is already active). Verify:

```bash
grep -n "swarm init\|swarm join" init/swarm-init.sh
```

The prod pipeline runs on iklim-app-01 only; iklim-app-02/03 are joined via Ansible (the `swarm` role), not via the Gitea pipeline.
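
A join task in the `swarm` role might look like the following sketch, using the `community.docker.docker_swarm` module. The variable names are assumptions, not taken from the actual role:

```yaml
# roles/swarm/tasks/join.yml — illustrative only; variable names are hypothetical
- name: Join node to the prod Swarm as a manager
  community.docker.docker_swarm:
    state: join
    advertise_addr: "{{ private_ip }}"            # assumed inventory variable
    remote_addrs: ["10.20.10.11:2377"]            # iklim-app-01 manager endpoint
    join_token: "{{ swarm_manager_join_token }}"  # assumed variable
```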

## Placement constraints used in `docker-stack-infra.yml`

| Constraint | Resolves to |
|------------|-------------|
| `node.role == manager` | iklim-app-01, iklim-app-02, iklim-app-03 |
| `node.labels.type == service` | iklim-app-01, iklim-app-02, iklim-app-03 |
| `node.labels.role == db` | iklim-db-01, iklim-db-02, iklim-db-03 |

- SWAG, Vault, cert-reloader: pinned to `node.role == manager`.
- Microservices: no constraint (spread across all app nodes by the Swarm scheduler).
- DB services (Patroni, etcd, MongoDB): pinned to `node.labels.role == db` in the separate DB stacks.
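
In `docker-stack-infra.yml`, these constraints sit under each service's `deploy.placement` key. A minimal sketch — the image tag is illustrative, not the real stack contents:

```yaml
services:
  swag:
    image: lscr.io/linuxserver/swag:latest   # illustrative image tag
    deploy:
      placement:
        constraints:
          - node.role == manager   # resolves to iklim-app-01/02/03
```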