Skip to content

ops(cmo): self-healing stack — auto-restart, healthchecks, pm2 tunnel, heartbeat

isidro requested to merge claude/ops-resilience into main

Why

On 2026-04-22 cmoagent-db and cmoagent-app exited with code 255 and stayed dead for 5 days. cmoagent-watcher and cmoagent-notifier crashlooped 289+ times on socket.gaierror (could not resolve "db") and nobody noticed because Slack daily briefs / trends digests just went silent. Cloudflare tunnel "cmoagent" was also stopped, leaving https://cmo.laparedita.com offline.

What

Three independent layers so the same outage cannot happen silently again:

  1. docker-compose.ymlrestart: unless-stopped on db and app (was missing). HTTP healthcheck on app, file-mtime healthchecks on watcher/notifier that flip the container unhealthy if the loop hasn't iterated within 3× the interval.
  2. scripts/cmoagent-tunnel.pm2.config.js — Cloudflare named tunnel registered as a pm2 process so it auto-restarts and survives machine reboots (pm2 save && pm2 startup).
  3. scripts/cmoagent-heartbeat.sh — cron-friendly probe that verifies portal reachable + all 4 containers running + notifier state file fresh. Posts to Slack on failure, silent on success.

Verification

  • All 4 containers report (healthy) after docker compose up -d.
  • pm2 status cmoagent-tunnel → online; https://cmo.laparedita.com/ → 401 (portal up + auth).
  • bash scripts/cmoagent-heartbeat.sh[heartbeat] healthy.
  • Full pytest suite green: 347 passed.

Test plan

  • After merge, run pm2 startup (one-time) so tunnel survives reboot.
  • Add the heartbeat cron line documented in README.
  • Kill cmoagent-db-1 manually to confirm app/watcher/notifier reconnect (and that the heartbeat catches it if longer than 30 min).

Merge request reports