guide #054

Healthcheck (2-Phase Gateway Check)

Problem
Gateway process can be alive but not responding (hung state).

Solution
A cron job runs every 2 minutes. If the process is alive but the HTTP check times out, it waits 5 seconds and retries, and kills the process only if both checks fail.

Context

The gateway process can be in three states: not running (crashed), running and responsive (healthy), or running but not responding (hung). A simple pgrep check catches the first case but misses the third. On a 1 GB device with aggressive memory pressure, the gateway occasionally enters a hung state where V8 is stuck in garbage collection or waiting on a blocked I/O operation.

The healthcheck uses a two-phase approach: first check if the process exists, then verify it responds to HTTP requests. Two consecutive HTTP failures trigger a kill + restart.
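
The mapping from the two checks to the three states can be sketched as a small classifier. This is an illustrative sketch rather than part of the deployed script: process_alive, http_ok and classify_gateway are hypothetical helper names, while the process pattern, URL and timeout are the ones healthcheck.sh uses.

```shell
# Sketch: map the two checks onto the three gateway states.
# process_alive, http_ok and classify_gateway are hypothetical helper names.

process_alive() {
  # Same process pattern as healthcheck.sh.
  pgrep -f "openclaw-gateway" > /dev/null
}

http_ok() {
  # Same endpoint and 5-second timeout as healthcheck.sh.
  curl -sf -m 5 http://localhost:9000/api/status > /dev/null
}

# classify_gateway PROCESS_CHECK HTTP_CHECK
# Each argument is a command name whose exit status is the check result.
classify_gateway() {
  if ! "$1"; then
    echo "crashed"   # no process: start it
  elif "$2"; then
    echo "healthy"   # process answers HTTP: nothing to do
  else
    echo "hung"      # process exists but HTTP failed: retry, then kill
  fi
}
```

On the device, `classify_gateway process_alive http_ok` prints one of the three state names. Passing the checks in as arguments also allows a dry run with `true` and `false`, without touching the real gateway.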

Implementation

The healthcheck runs as a cron job every 2 minutes:

#!/data/data/com.termux/files/usr/bin/bash
# healthcheck.sh: 2-phase gateway health verification

LOG="$PREFIX/tmp/healthcheck.log"
URL="http://localhost:9000/api/status"
PID=$(pgrep -f "openclaw-gateway")

# Phase 1: Process check
if [ -z "$PID" ]; then
  # Not running, start it
  echo "$(date) Gateway not running, starting" >> "$LOG"
  start-openclaw
  exit 0
fi

# Phase 2: HTTP check (5s timeout per attempt)
if ! curl -sf -m 5 "$URL" > /dev/null; then
  # First failure: wait and retry
  sleep 5
  if ! curl -sf -m 5 "$URL" > /dev/null; then
    # Second failure: kill and restart
    echo "$(date) Gateway hung (PID $PID), killing" >> "$LOG"
    # $PID is left unquoted on purpose: pgrep may return several PIDs
    kill -9 $PID
    sleep 2
    start-openclaw
  fi
fi

Install the cron job:

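# Make the script executable (the cron entry runs it directly):
chmod +x $PREFIX/bin/healthcheck.sh
# Edit the crontab (busybox crontab):
$PREFIX/bin/applets/crontab -e
# Add this line:
*/2 * * * * $PREFIX/bin/healthcheck.sh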
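
For scripted setups, the same entry can be added without opening an editor. The sketch below is one possible approach, not the guide's own procedure: merge_cron_line is a hypothetical helper, and the usage in the comments assumes the busybox crontab applet accepts `-` to read the new crontab from stdin, which is worth confirming on the device first.

```shell
# Sketch: add a crontab entry idempotently.
# merge_cron_line is a hypothetical helper. It reads an existing crontab on
# stdin and prints it with the given line present exactly once, at the end.

merge_cron_line() {
  # Keep every line except exact copies of the new entry. grep exits with
  # status 1 when nothing is left to print, hence the "|| true".
  grep -Fxv -- "$1" || true
  # Append the entry itself.
  printf '%s\n' "$1"
}

# Assumed usage (not executed here):
#   CRONTAB="$PREFIX/bin/applets/crontab"
#   "$CRONTAB" -l 2>/dev/null \
#     | merge_cron_line '*/2 * * * * $PREFIX/bin/healthcheck.sh' \
#     | "$CRONTAB" -
```

The single quotes keep `$PREFIX` literal, so the installed line matches the one shown above and a second run does not create a duplicate.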

Verification

# Check cron is running the healthcheck:
$PREFIX/bin/applets/crontab -l | grep healthcheck
# Expected: */2 * * * * .../healthcheck.sh

# Check healthcheck log:
cat $PREFIX/tmp/healthcheck.log
# Expected: restart events with timestamps (if any)

# Simulate hung gateway (test only):
kill -STOP $(pgrep -f openclaw-gateway)  # Pause the process
# Wait 2-4 minutes for cron to detect and restart
tail -n 3 $PREFIX/tmp/healthcheck.log
# Expected: "Gateway hung" message
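
To see how often each kind of restart has happened, the log can be summarized by message type. This is a minimal sketch: summarize_healthcheck_log is a hypothetical helper, and it relies only on the two message strings that healthcheck.sh writes.

```shell
# Sketch: count restart events recorded by healthcheck.sh.
# summarize_healthcheck_log is a hypothetical helper; $1 is the log path.

summarize_healthcheck_log() {
  # The log is only created by the first restart event, so it may not exist.
  if [ ! -f "$1" ]; then
    echo "hung_restarts=0 crash_restarts=0"
    return
  fi
  # grep -c prints the count even when it is 0, but then exits with status 1,
  # hence the "|| true".
  hung=$(grep -c "Gateway hung" "$1" || true)
  crashed=$(grep -c "Gateway not running" "$1" || true)
  echo "hung_restarts=$hung crash_restarts=$crashed"
}
```

On the device this would be called as `summarize_healthcheck_log "$PREFIX/tmp/healthcheck.log"`.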

Gotchas

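  • The double-check with a 5-second delay avoids false positives from legitimately slow responses (e.g., large file uploads, GC pauses). A stall has to outlast both 5-second timeouts and the 5-second wait, roughly 15 seconds in total, before the gateway is killed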
  • kill -9 (SIGKILL) is used instead of SIGTERM because a hung process may never get to run its graceful-shutdown handler, whereas SIGKILL cannot be caught or ignored
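  • A full healthcheck run must finish well within the cron interval so that runs never overlap. The worst case is two 5-second curl timeouts, the 5-second wait, and the 2-second pause after the kill: about 17 seconds plus gateway startup, against a 120-second interval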
  • pgrep -f "openclaw-gateway" matches the process title set by process.title = 'openclaw-gateway' in start-openclaw
  • On the Moto E2, busybox cron is at $PREFIX/bin/applets/crond, not the standard /usr/sbin/crond
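
The timing margin mentioned in the list above can be checked with quick arithmetic. The values below are read from healthcheck.sh and the cron schedule; the variable names are illustrative, and the startup time of start-openclaw is left out because the guide does not state it.

```shell
# Sketch: worst-case duration of one healthcheck run versus the cron interval.
# Variable names are illustrative; values come from healthcheck.sh.

CURL_TIMEOUT=5      # curl -m 5, applied to each of the two HTTP checks
RETRY_DELAY=5       # sleep 5 between the first and second check
POST_KILL_DELAY=2   # sleep 2 after kill -9
CRON_INTERVAL=120   # */2 schedule, in seconds

worst_case=$((2 * CURL_TIMEOUT + RETRY_DELAY + POST_KILL_DELAY))
echo "worst case: ${worst_case}s of ${CRON_INTERVAL}s"
# prints "worst case: 17s of 120s"
```

If the timeout or retry delay is ever raised, rerunning this sum shows whether consecutive cron runs could start to overlap.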

Result

Metric               Before                  After
Hung detection       None (manual SSH)       Auto (2 min)
Restart on crash     Manual                  Auto
False positive rate  N/A                     ~0 (double-check)
Max downtime         Hours (until noticed)   ~4 min