#54 Healthcheck (2-Phase Gateway Check)

Boot

Context

The gateway process can be in one of three states: not running (crashed), running and responsive (healthy), or running but not responding (hung). A simple pgrep check catches the first case but misses the third, because a hung process still appears in the process list. On a 1 GB device under heavy memory pressure, the gateway occasionally hangs while V8 is stuck in garbage collection or waiting on a blocked I/O operation.

The healthcheck uses a two-phase approach: first check if the process exists, then verify it responds to HTTP requests. Two consecutive HTTP failures trigger a kill + restart.
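
As a rough illustration of how the two checks map onto the three states, the decision can be written as a small shell function. The name classify_gateway and its two arguments are hypothetical and do not appear in the original script; the function only takes the pgrep output and the exit status of the HTTP probe and names the resulting state.

```shell
# Hypothetical helper: classify_gateway PID HTTP_STATUS
#   PID          output of pgrep (empty string if no process was found)
#   HTTP_STATUS  exit status of the HTTP probe (0 = the gateway answered)
classify_gateway() {
  local pid="$1" http_status="$2"
  if [ -z "$pid" ]; then
    echo "crashed"    # Phase 1 failed: no process
  elif [ "$http_status" -eq 0 ]; then
    echo "healthy"    # Phase 2 passed: the process answers HTTP
  else
    echo "hung"       # the process exists but does not answer
  fi
}

classify_gateway ""     7    # prints: crashed
classify_gateway "1234" 0    # prints: healthy
classify_gateway "1234" 28   # prints: hung (28 is curl's timeout exit code)
```

Only the "hung" result should lead to a kill, and only after the probe has failed twice in a row.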

Implementation

The healthcheck runs as a cron job every 2 minutes:

#!/data/data/com.termux/files/usr/bin/bash
# healthcheck.sh: 2-phase gateway health verification

LOG="$PREFIX/tmp/healthcheck.log"
URL="http://localhost:9000/api/status"

# pgrep prints one PID per line if more than one process matches
PID=$(pgrep -f "openclaw-gateway")

# Phase 1: Process check
if [ -z "$PID" ]; then
  # Not running, start it
  echo "$(date) Gateway not running, starting" >> "$LOG"
  start-openclaw
  exit 0
fi

# Phase 2: HTTP check (5s timeout per attempt)
if ! curl -sf -m 5 "$URL" > /dev/null; then
  # First failure: wait and retry
  sleep 5
  if ! curl -sf -m 5 "$URL" > /dev/null; then
    # Second failure: kill and restart
    echo "$(date) Gateway hung (PID $PID), killing" >> "$LOG"
    # $PID is intentionally unquoted so every matched PID is killed
    kill -9 $PID
    sleep 2
    start-openclaw
  fi
fi
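
The nested retry above could also be factored into a reusable helper. The following is a minimal sketch, not part of the original script: probe_with_retry is a hypothetical name, and it takes the attempt count, the delay in seconds, and the probe command as arguments so it can be exercised without a running gateway.

```shell
# Hypothetical helper: probe_with_retry ATTEMPTS DELAY COMMAND...
# Runs COMMAND up to ATTEMPTS times, sleeping DELAY seconds after each
# failed attempt except the last. Returns 0 on the first success and
# 1 if every attempt fails.
probe_with_retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Only sleep if another attempt is coming
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Demonstration with stand-in commands instead of curl:
probe_with_retry 2 0 true  && echo "responsive"
probe_with_retry 2 0 false || echo "hung after 2 attempts"
```

In the healthcheck itself, the equivalent call would be probe_with_retry 2 5 curl -sf -m 5 -o /dev/null http://localhost:9000/api/status, which keeps the same two attempts and 5-second pause.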

Install the cron job:

# Add to crontab (busybox crontab):
$PREFIX/bin/applets/crontab -e
# Add line:
*/2 * * * * $PREFIX/bin/healthcheck.sh
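
For unattended setup, the entry could be added without opening an editor. The sketch below is an assumption rather than the original procedure: merge_healthcheck_entry is a hypothetical helper that reads an existing crontab on stdin and prints it with exactly one healthcheck line, so running it repeatedly does not create duplicates. The backup.sh line is sample data only.

```shell
# Hypothetical helper: reads a crontab on stdin and prints it with
# exactly one healthcheck entry (any previous entry is replaced).
merge_healthcheck_entry() {
  local entry="$1"
  # Drop previous healthcheck lines; grep exits 1 when nothing is left,
  # which is fine for an empty crontab.
  grep -v 'healthcheck\.sh' || true
  printf '%s\n' "$entry"
}

# Single quotes keep $PREFIX literal, as in the crontab line above.
ENTRY='*/2 * * * * $PREFIX/bin/healthcheck.sh'

# Preview against a sample crontab (nothing is installed):
printf '%s\n' '0 3 * * * backup.sh' '*/5 * * * * $PREFIX/bin/healthcheck.sh' \
  | merge_healthcheck_entry "$ENTRY"

# To apply for real (assumes this crontab build accepts "-" for stdin):
# $PREFIX/bin/applets/crontab -l 2>/dev/null \
#   | merge_healthcheck_entry "$ENTRY" \
#   | $PREFIX/bin/applets/crontab -
```

The preview prints the sample backup line followed by the single */2 healthcheck entry; the old */5 line is dropped.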

Verification

# Check cron is running the healthcheck:
$PREFIX/bin/applets/crontab -l | grep healthcheck
# Expected: */2 * * * * .../healthcheck.sh

# Check healthcheck log:
cat $PREFIX/tmp/healthcheck.log
# Expected: restart events with timestamps (if any)

# Simulate hung gateway (test only):
kill -STOP $(pgrep -f openclaw-gateway)  # Pause the process
# Wait 2-4 minutes for cron to detect and restart
tail -n 3 "$PREFIX/tmp/healthcheck.log"
# Expected: "Gateway hung" message

Gotchas

The double-check with a 5-second delay prevents false positives during legitimately slow responses (e.g., large file uploads, GC pauses).
kill -9 (SIGKILL) is used instead of SIGTERM because a hung process may never get to run its SIGTERM handler, whereas SIGKILL cannot be caught or ignored.
The curl -m 5 timeout must be shorter than the cron interval. With a 2-minute interval and a 5-second timeout per attempt, there is ample margin.
pgrep -f "openclaw-gateway" matches the process title set by process.title = 'openclaw-gateway' in start-openclaw.
On the Moto E2, busybox crond is at $PREFIX/bin/applets/crond, not the standard /usr/sbin/crond.
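
The timing margin mentioned above can be checked with a little arithmetic. This is a sketch using the values from healthcheck.sh; the variable names are illustrative, and the total excludes however long start-openclaw itself takes to return, which is not stated here.

```shell
# Values taken from healthcheck.sh; variable names are illustrative.
CURL_TIMEOUT=5      # curl -m 5, applied to each of the two probes
RETRY_DELAY=5       # sleep between the two probes
RESTART_DELAY=2     # sleep between kill -9 and start-openclaw
CRON_INTERVAL=120   # */2 in the crontab entry, in seconds

# Worst case: both probes time out, then the gateway is killed and restarted.
WORST_CASE=$((CURL_TIMEOUT + RETRY_DELAY + CURL_TIMEOUT + RESTART_DELAY))
MARGIN=$((CRON_INTERVAL - WORST_CASE))

echo "worst-case run: ${WORST_CASE}s of ${CRON_INTERVAL}s"   # 17s of 120s
echo "margin: ${MARGIN}s"                                    # 103s
```

If the timeout or retry delay is raised, rerunning this sum shows whether one run could still be in progress when cron starts the next.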

Result

Metric	Before	After
Hung detection	None (manual SSH)	Auto (2 min)
Restart on crash	Manual	Auto
False positive rate	N/A	~0 (double-check)
Max downtime	Hours (until noticed)	~4 min
