ops(cmo): self-healing stack — auto-restart, healthchecks, pm2 tunnel, heartbeat (!29) · Merge requests · Tools / CMOAgent

isidro requested to merge claude/ops-resilience into main Apr 27, 2026

Why

On 2026-04-22 cmoagent-db and cmoagent-app exited with code 255 and stayed dead for 5 days. cmoagent-watcher and cmoagent-notifier crashlooped 289+ times on socket.gaierror (could not resolve "db") and nobody noticed because Slack daily briefs / trends digests just went silent. Cloudflare tunnel "cmoagent" was also stopped, leaving https://cmo.laparedita.com offline.

What

Three independent layers so the same outage cannot happen silently again:

docker-compose.yml — restart: unless-stopped on db and app (was missing). HTTP healthcheck on app, file-mtime healthchecks on watcher/notifier that flip the container unhealthy if the loop hasn't iterated within 3× the interval.
scripts/cmoagent-tunnel.pm2.config.js — Cloudflare named tunnel registered as a pm2 process so it auto-restarts and survives machine reboots (pm2 save && pm2 startup).
scripts/cmoagent-heartbeat.sh — cron-friendly probe that verifies portal reachable + all 4 containers running + notifier state file fresh. Posts to Slack on failure, silent on success.

Verification

All 4 containers report (healthy) after docker compose up -d.
pm2 status cmoagent-tunnel → online; https://cmo.laparedita.com/ → 401 (portal up + auth).
bash scripts/cmoagent-heartbeat.sh → [heartbeat] healthy.
Full pytest suite green: 347 passed.

Test plan

After merge, run pm2 startup (one-time) so tunnel survives reboot.
Add the heartbeat cron line documented in README.
Kill cmoagent-db-1 manually to confirm app/watcher/notifier reconnect (and that the heartbeat catches it if longer than 30 min).

ops(cmo): self-healing stack — auto-restart, healthchecks, pm2 tunnel, heartbeat

Why

What

Verification

Test plan

Merge request reports