9 Commits

Author SHA1 Message Date
8a056a381b fix(monitoring): prevent Vault crash and DNS null error
- Vault: Wrap resp.json() in a try-except block to prevent JSONDecodeError when hitting an HTML error page (e.g. 502/503). This prevents the entire agent from crashing and missing heartbeats.
- Uptime Kuma DNS: Explicitly set dns_resolve_server to 1.1.1.1 in Python API payload to prevent Uptime Kuma backend from crashing on null properties.
2026-06-26 23:23:02 +03:00
969c4a2301 fix(monitoring): resolve health-agent bugs and flapping monitors
- Vault flapping: Fix resp evaluation on HTTP 429
- Storagebox block: Move mount check to a daemon thread
- Push monitors: Increase interval to 75s and restore 60s sleep
- Redis Sentinel: Fix authentication in sentinel_kwargs
- Ext Https Api: Update URL to /health
2026-06-26 22:51:15 +03:00
94e6b57c52 fix(health-agent): check all 3 patroni node configs on storagebox; switch ping monitors to TCP port 22 (ICMP blocked from Docker) 2026-06-26 21:54:49 +03:00
fa7ed41063 fix(health-agent): reload uk_tokens.yml on every push call instead of caching at startup 2026-06-26 21:35:44 +03:00
3c2e872bf4 refactor(health-agent): rename monitor keys to Title Case With Space
Update all hardcoded push monitor names in check files to match the new Title Case With Space format in monitors.yml. The uk_tokens.yml keys are derived from monitor names so the push() calls must match exactly.
2026-06-26 20:52:35 +03:00
d51c073556 fix(health-agent): fix uk_tokens.yml load race and LogRecord msg conflict
- config.py: Replace exists()+open() with try/except open() to avoid TOCTOU race on SSHFS mounts where stat can succeed but open can fail with FileNotFoundError.
- uptime_kuma.py: Rename msg key to push_msg in logger extra dicts. Python LogRecord reserves the msg field; passing it in extra raises ValueError which was being silently swallowed by the except block, masking successful pushes as errors.
2026-06-26 20:37:42 +03:00
58d5c24f41 feat(health-agent): add CI/CD pipeline, Uptime Kuma setup, and runtime configuration
Some checks failed
Deploy Environment Monitoring to Production Environment / deploy (push) Failing after 10s
Deploy workflows:
- Integrate health-agent build (test) and image promotion (prod) into monitoring stack workflows
- Add storagebox download of health-agent runtime (.env.monitoring.health-agent-runtime → health-agent/.env) and setup (.env.monitoring.health-agent-setup → health-agent/.env.setup) env files
- Add "Run Uptime Kuma Setup" step: runs setup_uptime_kuma.py inside the built image only when uk_tokens.yml is missing, writes tokens to HEALTH_AGENT_CONFIG_GENERATED_DIR (/mnt/storagebox/monitoring/uk_generated)
- Add health-agent/** and health-agent/deploy/prod.env path triggers to test and prod workflows respectively
- Add HARBOR_CI_TOKEN login and HARBOR_PULL_TOKEN login before stack deploy in both workflows
- Source health-agent/.env before docker stack deploy to expose HEALTH_AGENT_CONFIG_GENERATED_DIR

Dockerfile:
- Copy config/ and scripts/ into image so setup_uptime_kuma.py can run inside the container

setup_uptime_kuma.py:
- Load .env and .env.setup automatically via python-dotenv (no manual export needed)
- Write uk_tokens.yml to config/generated/ (aligned with container volume mount)

Health checks:
- PATRONI_HOSTS and VAULT_HOSTS are now configurable via env vars (comma-separated host:port); no code change needed when node count changes
- REDIS_SENTINEL_HOSTS now correctly parses host:port format; default updated to redis-sentinel:26379
- Fix NameError in check_patroni_cluster() caused by leftover node variable after loop refactor
- Remove verify_ssl=False from Vault check; vault.iklim.co has a valid certificate

Ops:
- Add ops/build-and-push-health-agent.sh for manual bypass of CI pipeline
- Add health-agent/deploy/prod.env template for prod image promotion manifest

Project structure:
- Move .env.example and .env.setup.example to health-agent/env-example/ (root .gitignore excludes health-agent/.env*)
- Add root .gitignore: excludes uk_tokens.yml, __pycache__, .venv, and env files
- Remove health-agent/.gitignore (superseded by root .gitignore)
2026-06-26 18:45:17 +03:00
7ab186b961 feat(health-agent): add --once and --dry-run flags to main.py 2026-06-26 16:43:21 +03:00
f742bfdd11 feat(health-agent): add monitors.yml with env-aware node IP mapping from Ansible inventory 2026-06-25 18:59:14 +03:00