26 Commits

Author SHA1 Message Date
475eb762b9 health-agent redeploy with new image
All checks were successful
Deploy Environment Monitoring to Production Environment / deploy (push) Successful in 17s
2026-06-26 23:13:38 +03:00
b49ca276f0 fix(monitoring): support existing monitor updates and vault nodes
- setup_uptime_kuma: Use api.edit_monitor to update existing monitors with new configuration instead of skipping them.
- setup_uptime_kuma: Add port and accepted_statuscodes to DNS monitors to prevent NodeJS null reading errors in Kuma.
- http.py: Parse VAULT_HOSTS environment variable for Vault cluster nodes instead of hardcoding 'vault'.
2026-06-26 23:07:37 +03:00
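
Note on b49ca276f0: the VAULT_HOSTS parsing could look roughly like the sketch below. It is not the code from http.py. The helper names `parse_hosts` and `load_vault_hosts` are hypothetical, the comma-separated host:port format is taken from the 58d5c24f41 message, and falling back to port 8200 (Vault's default listener port) when no port is given is an assumption.

```python
import os


def parse_hosts(raw: str, default_port: int) -> list[tuple[str, int]]:
    """Split a comma-separated host[:port] string into (host, port) pairs."""
    hosts = []
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing commas and blank entries
        host, sep, port = entry.rpartition(":")
        if sep:
            hosts.append((host, int(port)))
        else:
            # No port given: use the caller's default (assumption, see above).
            hosts.append((entry, default_port))
    return hosts


def load_vault_hosts(env=os.environ) -> list[tuple[str, int]]:
    # Fall back to the previously hardcoded single node when the variable is unset.
    return parse_hosts(env.get("VAULT_HOSTS", "vault"), default_port=8200)
```

With this shape, changing the node count only requires editing the environment variable, for example `VAULT_HOSTS=vault-1:8200,vault-2:8200`.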
2a482ce4df health-agent redeploy with new image
All checks were successful
Deploy Environment Monitoring to Production Environment / deploy (push) Successful in 16s
2026-06-26 22:53:35 +03:00
969c4a2301 fix(monitoring): resolve health-agent bugs and flapping monitors
- Vault flapping: Fix response evaluation on HTTP 429
- Storagebox block: Move mount check to a daemon thread
- Push monitors: Increase interval to 75s and restore 60s sleep
- Redis Sentinel: Fix authentication in sentinel_kwargs
- Ext Https Api: Update URL to /health
2026-06-26 22:51:15 +03:00
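
Note on 969c4a2301 ("Storagebox block"): a minimal sketch of moving a potentially blocking mount check into a daemon thread, so that a hung network mount cannot stall the main check loop. The `MountWatcher` class, its `check` parameter, and the 30 second poll interval are hypothetical and not taken from the repository.

```python
import os
import threading
import time


class MountWatcher:
    """Polls a mount point in a daemon thread and caches the last result."""

    def __init__(self, path, interval=30.0, check=os.path.ismount):
        self.path = path
        self.interval = interval
        self._check = check
        self._mounted = None  # None until the first check completes
        self._lock = threading.Lock()
        # daemon=True: the thread never keeps the process alive on shutdown.
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        while True:
            result = self._check(self.path)  # may block on a hung mount
            with self._lock:
                self._mounted = result
            time.sleep(self.interval)

    def status(self):
        """Return the last known result without touching the filesystem."""
        with self._lock:
            return self._mounted
```

The main loop then reads `status()` instead of calling the filesystem itself; a `None` result means no check has finished yet, which a caller might treat as "unknown" rather than "down".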
b73ae4e5fb revert(health-agent): revert ping monitors back to PING type 2026-06-26 21:55:42 +03:00
94e6b57c52 fix(health-agent): check all 3 patroni node configs on storagebox; switch ping monitors to TCP port 22 (ICMP blocked from Docker) 2026-06-26 21:54:49 +03:00
fa7ed41063 fix(health-agent): reload uk_tokens.yml on every push call instead of caching at startup 2026-06-26 21:35:44 +03:00
0551b01c64 health-agent redeploy with new image
All checks were successful
Deploy Environment Monitoring to Production Environment / deploy (push) Successful in 16s
2026-06-26 21:27:14 +03:00
2827b227d5 fix(health-agent): fix notification param name and type — notificationIDList expects a list of IDs not a dict 2026-06-26 21:23:45 +03:00
a5fc058978 health-agent redeploy with new image
Some checks failed
Deploy Environment Monitoring to Production Environment / deploy (push) Failing after 15s
2026-06-26 21:15:06 +03:00
e4acd0e57b fix(health-agent): skip uk_tokens.yml write when tokens dict is empty to prevent setup skip loop 2026-06-26 21:10:10 +03:00
8b10653ff4 fix(health-agent): fix ping maxretries param and status page group lookup
Fix ping monitor creation error ('max_retries' is not a valid uptime-kuma-api param; correct name is 'maxretries'). Fix status pages never linking groups: re-fetching get_monitors() after add_monitor() races with WebSocket delivery so newly created groups are missing; use group_map populated in Section 1 directly instead.
2026-06-26 21:07:11 +03:00
95dd439a34 health-agent redeploy with new image
All checks were successful
Deploy Environment Monitoring to Production Environment / deploy (push) Successful in 36s
2026-06-26 20:53:59 +03:00
3c2e872bf4 refactor(health-agent): rename monitor keys to Title Case With Space
Update all hardcoded push monitor names in check files to match the new Title Case With Space format in monitors.yml. The uk_tokens.yml keys are derived from monitor names so the push() calls must match exactly.
2026-06-26 20:52:35 +03:00
bc8b3d0934 refactor: convert all monitor names to Title Case and update health-agent digest 2026-06-26 20:47:31 +03:00
d51c073556 fix(health-agent): fix uk_tokens.yml load race and LogRecord msg conflict
- config.py: Replace exists()+open() with try/except open() to avoid TOCTOU race on SSHFS mounts where stat can succeed but open can fail with FileNotFoundError.
- uptime_kuma.py: Rename msg key to push_msg in logger extra dicts. Python LogRecord reserves the msg field; passing it in extra raises KeyError, which was being silently swallowed by the except block, masking successful pushes as errors.
2026-06-26 20:37:42 +03:00
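
Note on d51c073556: both fixes can be illustrated with the standard library alone. The sketch below is not the repository code; the function names and the logger name are hypothetical. It shows opening a file directly instead of checking `exists()` first, and shows that the `logging` module rejects an `extra` key that collides with a built-in LogRecord attribute.

```python
import logging
from typing import Optional

logger = logging.getLogger("health-agent-example")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.NullHandler())


def read_tokens_file(path: str) -> Optional[str]:
    """Open directly and handle failure, instead of exists() followed by open()."""
    try:
        with open(path, encoding="utf-8") as fh:
            return fh.read()
    except FileNotFoundError:
        # The file can vanish (or never be openable) between a stat and an open.
        return None


def log_push(monitor: str, message: str) -> None:
    # "msg" is a LogRecord attribute, so the extra key uses a different name.
    logger.info("push ok", extra={"monitor": monitor, "push_msg": message})


def reserved_key_is_rejected() -> bool:
    """Return True if logging refuses an extra dict containing the key 'msg'."""
    try:
        logger.info("push ok", extra={"msg": "OK"})
    except KeyError:
        return True
    return False
```

Catching a broad `Exception` around the logging call is what would hide this failure, so a narrower except clause around the push itself makes the same mistake easier to spot.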
8d5fe55b14 health-agent redeploy with new image
All checks were successful
Deploy Environment Monitoring to Production Environment / deploy (push) Successful in 31s
2026-06-26 20:06:48 +03:00
0ef4f0b6f8 refactor: rename iklimco-monitoring stack to monitoring 2026-06-26 19:24:01 +03:00
58d5c24f41 feat(health-agent): add CI/CD pipeline, Uptime Kuma setup, and runtime configuration
Some checks failed
Deploy Environment Monitoring to Production Environment / deploy (push) Failing after 10s
Deploy workflows:
- Integrate health-agent build (test) and image promotion (prod) into monitoring stack workflows
- Add storagebox download of health-agent runtime (.env.monitoring.health-agent-runtime → health-agent/.env) and setup (.env.monitoring.health-agent-setup → health-agent/.env.setup) env files
- Add "Run Uptime Kuma Setup" step: runs setup_uptime_kuma.py inside the built image only when uk_tokens.yml is missing, writes tokens to HEALTH_AGENT_CONFIG_GENERATED_DIR (/mnt/storagebox/monitoring/uk_generated)
- Add health-agent/** and health-agent/deploy/prod.env path triggers to test and prod workflows respectively
- Add HARBOR_CI_TOKEN login and HARBOR_PULL_TOKEN login before stack deploy in both workflows
- Source health-agent/.env before docker stack deploy to expose HEALTH_AGENT_CONFIG_GENERATED_DIR

Dockerfile:
- Copy config/ and scripts/ into image so setup_uptime_kuma.py can run inside the container

setup_uptime_kuma.py:
- Load .env and .env.setup automatically via python-dotenv (no manual export needed)
- Write uk_tokens.yml to config/generated/ (aligned with container volume mount)

Health checks:
- PATRONI_HOSTS and VAULT_HOSTS are now configurable via env vars (comma-separated host:port); no code change needed when node count changes
- REDIS_SENTINEL_HOSTS now correctly parses host:port format; default updated to redis-sentinel:26379
- Fix NameError in check_patroni_cluster() caused by leftover node variable after loop refactor
- Remove verify_ssl=False from Vault check; vault.iklim.co has a valid certificate

Ops:
- Add ops/build-and-push-health-agent.sh for manual bypass of CI pipeline
- Add health-agent/deploy/prod.env template for prod image promotion manifest

Project structure:
- Move .env.example and .env.setup.example to health-agent/env-example/ (root .gitignore excludes health-agent/.env*)
- Add root .gitignore: excludes uk_tokens.yml, __pycache__, .venv, and env files
- Remove health-agent/.gitignore (superseded by root .gitignore)
2026-06-26 18:45:17 +03:00
062d3ff90d docs(health-agent): document --once and --dry-run flags in README 2026-06-26 16:47:53 +03:00
7ab186b961 feat(health-agent): add --once and --dry-run flags to main.py 2026-06-26 16:43:21 +03:00
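
Note on 7ab186b961: the flag semantics are documented in the README rather than in this log, so the sketch below rests on assumptions: `--once` is taken to mean "run one check cycle and exit" and `--dry-run` to mean "run checks but do not push results". The `run` loop, its callbacks, and the 60 second interval are hypothetical placeholders.

```python
import argparse
import time


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--once", action="store_true",
                        help="run a single check cycle and exit (assumed meaning)")
    parser.add_argument("--dry-run", action="store_true",
                        help="run checks without pushing results (assumed meaning)")
    return parser


def run(args, run_checks, push, sleep=time.sleep, interval=60.0):
    """Run check cycles; run_checks yields (name, ok) pairs, push reports them."""
    while True:
        for name, ok in run_checks():
            if args.dry_run:
                print(f"[dry-run] {name}: {'up' if ok else 'down'}")
            else:
                push(name, ok)
        if args.once:
            return
        sleep(interval)
```

Combining both flags gives a quick local smoke test, for example `python main.py --once --dry-run`.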
28d726d2d8 fix(health-agent): correct Uptime Kuma URLs in example env files 2026-06-25 20:55:58 +03:00
07a364b2bc fix(health-agent): correct UK_URL placeholder in .env.setup.example 2026-06-25 20:54:28 +03:00
208f4768b9 chore(health-agent): switch to uptime-kuma-api-v2, fix .env.setup.example credentials 2026-06-25 20:50:23 +03:00
72a91072fb feat(health-agent): add README, workflows, and translate monitors.yml to English
- Add health-agent README with architecture, config, and deployment docs
- Add deploy-monitoring-test.yml workflow (mirrors prod, test-runner, test storagebox paths)
- Add health-agent service to docker-stack-monitoring.yml
- Add .env.example with all runtime variables and .gitignore for generated files
- Add config/generated/.gitkeep to track empty generated directory
- Translate all Turkish group names and status page titles in monitors.yml to English
- Remove users.yml.example (Dozzle was removed in previous commit)
2026-06-25 19:20:25 +03:00
f742bfdd11 feat(health-agent): add monitors.yml with env-aware node IP mapping from Ansible inventory 2026-06-25 18:59:14 +03:00