
# 01 — Docker Swarm Init (Prod — Multi-Node)
## Context
- **Repo:** `iklim.co` root
- **Environment:** prod
- **Topology:**
  - 3 × service nodes — all act as **Swarm managers AND app workers** (Raft quorum: 1 can fail)
  - 3 × DB nodes — **NOT part of Docker Swarm** (separate DB cluster, out of scope)
- All 6 nodes are in the same private network.
- Pipeline trigger: push to `prod-env` branch → Gitea runner on `prod-runner` (first service node).
- Swarm has 3 nodes total; all are manager-eligible and carry workloads (no dedicated worker-only nodes).
## Node labeling plan
| Node | Role | Swarm role | Labels |
|------|------|------------|--------|
| service-1 | API services, SWAG, Vault | Manager + Worker | `type=service` |
| service-2 | API services replicas | Manager + Worker | `type=service` |
| service-3 | API services replicas | Manager + Worker | `type=service` |
> DB nodes (`db-1/2/3`) are **not part of Docker Swarm**. They run as a separate cluster
> and are provisioned independently. No Swarm join or label step applies to them.
## Step 1 — Init Swarm on service-1 (the prod-runner node)
```bash
MANAGER_IP=$(hostname -I | awk '{print $1}')
# Exact string comparison: a plain 'grep -q active' would also match "inactive".
if [ "$(docker info --format '{{.Swarm.LocalNodeState}}')" != "active" ]; then
  docker swarm init --advertise-addr "$MANAGER_IP"
  echo "✅ Swarm initialized on $MANAGER_IP"
else
  echo "Swarm already active"
fi
```
## Step 2 — Get manager join token
```bash
docker swarm join-token manager # for service-2, service-3
```
Save this token — needed on service-2 and service-3.
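For scripting, the token can also be captured non-interactively. A sketch: the `-q` flag prints only the token itself, and real Swarm join tokens always start with the `SWMTKN-` prefix.

```shell
# Capture the manager join token without the human-readable wrapper text.
# Requires an initialized swarm on this node:
#   MANAGER_TOKEN=$(docker swarm join-token -q manager)
# A quick sanity check on whatever was captured:
valid_swarm_token() {
  case "$1" in
    SWMTKN-*) return 0 ;;  # swarm join tokens always carry this prefix
    *)        return 1 ;;
  esac
}
```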
## Step 3 — Join service-2 and service-3 as managers
SSH into service-2 and service-3, then run:
```bash
docker swarm join --token <MANAGER_TOKEN> <service-1-ip>:2377
```
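If you prefer to drive the joins from service-1 over SSH instead of logging in to each node, the join command line can be composed like this (a sketch; `MANAGER_TOKEN` and `SERVICE1_IP` are assumed to be set beforehand):

```shell
# Compose the join command for a given token and manager IP.
# Port 2377 is Swarm's cluster-management port.
build_join_cmd() {
  printf 'docker swarm join --token %s %s:2377' "$1" "$2"
}
# Hypothetical usage from service-1:
#   for host in service-2 service-3; do
#     ssh "$host" "$(build_join_cmd "$MANAGER_TOKEN" "$SERVICE1_IP")"
#   done
```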
## Step 4 — Label all Swarm nodes
On service-1, after service-2 and service-3 have joined:
```bash
for node in service-1 service-2 service-3; do
docker node update --label-add type=service "$node"
done
```
> Replace `service-1`, etc. with actual node hostnames shown in `docker node ls`.
> DB nodes are not in Swarm — no join or label step for them.
## Step 5 — Verify
```bash
docker node ls
```
Expected: 3 nodes; exactly one shows `MANAGER STATUS` = `Leader`, the other two `Reachable`.
All 3 nodes remain in `AVAILABILITY=Active` (not drained) so they also carry workloads.
```bash
docker node inspect service-1 --format '{{.Spec.Labels}}'
```
Expected: `map[type:service]`.
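The manager-status check can be scripted too. A sketch that consumes `docker node ls` output:

```shell
# Exit 0 only if exactly three nodes report Leader or Reachable.
# Pipe in: docker node ls --format '{{.Hostname}} {{.ManagerStatus}}'
check_quorum() {
  awk '$2 == "Leader" || $2 == "Reachable" { ok++ } END { exit (ok == 3 ? 0 : 1) }'
}
```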
## Step 6 — Confirm `init/swarm-init.sh` multi-node awareness
The script is idempotent (skips init if already active). Verify:
```bash
grep -n "swarm init\|swarm join" init/swarm-init.sh
```
The prod pipeline runs on service-1 only. service-2/3 are joined via Ansible (`swarm` role),
not via the Gitea pipeline.
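For reference, the idempotency guard such a script typically contains looks like this (a sketch, not the actual contents of `init/swarm-init.sh`):

```shell
# Decide whether 'docker swarm init' is needed, given the local swarm state.
# The state string comes from: docker info --format '{{.Swarm.LocalNodeState}}'
swarm_needs_init() {
  [ "$1" != "active" ]
}
# Hypothetical usage:
#   if swarm_needs_init "$(docker info --format '{{.Swarm.LocalNodeState}}')"; then
#     docker swarm init --advertise-addr "$MANAGER_IP"
#   fi
```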
## Placement constraints used in `docker-stack-infra.yml`
| Constraint | Resolves to |
|------------|-------------|
| `node.role == manager` | service-1, service-2, service-3 |
| `node.labels.type == service` | service-1, service-2, service-3 |
SWAG, Vault, cert-reloader: pinned to `node.role == manager`.
Microservices: no constraint (distributed across all 3 service nodes by Swarm scheduler).
> `node.labels.type == db` constraint is **not used** — DB nodes are not in Swarm.
> PostgreSQL and MongoDB run outside Swarm as a separately managed cluster.
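As an illustration only (service names and image tags below are placeholders, not the actual contents of `docker-stack-infra.yml`), such placement constraints might look like:

```yaml
version: "3.8"
services:
  swag:
    image: lscr.io/linuxserver/swag:latest    # placeholder tag
    deploy:
      placement:
        constraints:
          - node.role == manager              # resolves to service-1/2/3
  api:
    image: registry.example.com/api:latest    # placeholder image
    deploy:
      replicas: 3
      # no placement constraint: the scheduler spreads replicas
      # across all three service nodes
```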