< Back to all hacks

#54 Healthcheck (2-Phase Gateway Check)

Boot
Problem
Gateway process can be alive but not responding (hung state).
Solution
Cron every 2 min: if process alive but HTTP times out, wait 5s retry, kill only if both checks fail.
Lesson
Process-alive checks are insufficient. HTTP healthchecks catch hung/deadlocked processes.

Context

The gateway process can be in three states: not running (crashed), running and responsive (healthy), or running but not responding (hung). A simple pgrep check catches the first case but misses the third. On a 1 GB device with aggressive memory pressure, the gateway occasionally enters a hung state where V8 is stuck in garbage collection or waiting on a blocked I/O operation.

The healthcheck uses a two-phase approach: first check if the process exists, then verify it responds to HTTP requests. Two consecutive HTTP failures trigger a kill + restart.

Implementation

The healthcheck runs as a cron job every 2 minutes:

#!/data/data/com.termux/files/usr/bin/bash
# healthcheck.sh — 2-phase gateway health verification

PID=$(pgrep -f "openclaw-gateway")

# Phase 1: Process check
if [ -z "$PID" ]; then
  # Not running, start it
  echo "$(date) Gateway not running, starting" >> $PREFIX/tmp/healthcheck.log
  start-openclaw
  exit 0
fi

# Phase 2: HTTP check (5s timeout)
if ! curl -sf -m 5 http://localhost:9000/api/status > /dev/null; then
  # First failure — wait and retry
  sleep 5
  if ! curl -sf -m 5 http://localhost:9000/api/status > /dev/null; then
    # Second failure — kill and restart
    echo "$(date) Gateway hung (PID $PID), killing" >> $PREFIX/tmp/healthcheck.log
    kill -9 $PID
    sleep 2
    start-openclaw
  fi
fi

Install the cron job:

# Add to crontab (busybox crontab):
$PREFIX/bin/applets/crontab -e
# Add line:
*/2 * * * * $PREFIX/bin/healthcheck.sh

Verification

# Check cron is running the healthcheck:
$PREFIX/bin/applets/crontab -l | grep healthcheck
# Expected: */2 * * * * .../healthcheck.sh

# Check healthcheck log:
cat $PREFIX/tmp/healthcheck.log
# Expected: restart events with timestamps (if any)

# Simulate hung gateway (test only):
kill -STOP $(pgrep -f openclaw-gateway)  # Pause the process
# Wait 2-4 minutes for cron to detect and restart
cat $PREFIX/tmp/healthcheck.log | tail -3
# Expected: "Gateway hung" message

Gotchas

  • The double-check with 5-second delay prevents false positives during legitimate slow responses (e.g., large file uploads, GC pauses)
  • kill -9 (SIGKILL) is used instead of SIGTERM because a hung process won't respond to graceful signals
  • The curl -m 5 timeout must be shorter than the cron interval. With 2-minute cron and 5-second timeout, there's ample margin
  • pgrep -f "openclaw-gateway" matches the process title set by process.title = 'openclaw-gateway' in start-openclaw
  • On the Moto E2, busybox cron is at $PREFIX/bin/applets/crond, not the standard /usr/sbin/crond

Result

MetricBeforeAfter
Hung detectionNone (manual SSH)Auto (2 min)
Restart on crashManualAuto
False positive rateN/A~0 (double-check)
Max downtimeHours (until noticed)~4 min