
How we test agents.

The complete test rig, task suite, scoring rubric, and security checklist we use for every comparison. Reproducible, documented, updated as the ecosystem changes.

The test rig

Every agent gets installed on the same baseline VPS. We use Hetzner CX22 (€5/mo, 2 vCPU, 4 GB RAM, 40 GB SSD) as the standard rig because it represents the realistic floor for most self-hosters. Anything that can't run on a CX22 we flag explicitly.

The reference OS is Debian 12 with a minimal install. We add Docker, an unprivileged user, and Tailscale for remote dashboard access. Nothing else by default.
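
A preflight check along the lines of the sketch below can confirm a box matches that baseline before a run. The script and its thresholds are illustrative, not part of our published tooling.

```python
#!/usr/bin/env python3
"""Preflight: does this box look like the baseline CX22 rig?

Illustrative sketch only; the thresholds mirror the spec above
(2 vCPU, 4 GB RAM, Debian 12, Docker, Tailscale), nothing more.
"""
import os
import shutil
import sys


def mem_total_gb() -> float:
    # Total RAM from /proc/meminfo (Linux only), reported in GiB.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) / 1024 / 1024  # kB -> GiB
    return 0.0


def debian_version() -> str:
    # VERSION_ID from /etc/os-release, e.g. "12" on bookworm.
    try:
        with open("/etc/os-release") as f:
            fields = dict(line.rstrip().split("=", 1) for line in f if "=" in line)
        return fields.get("VERSION_ID", "").strip('"')
    except FileNotFoundError:
        return ""


checks = {
    "2+ vCPUs": (os.cpu_count() or 0) >= 2,
    ">= 3.5 GiB RAM": mem_total_gb() >= 3.5,  # VPSes report a bit under the nominal 4 GB
    "Debian 12": debian_version() == "12",
    "docker on PATH": shutil.which("docker") is not None,
    "tailscale on PATH": shutil.which("tailscale") is not None,
}

for name, ok in checks.items():
    print(f"{'ok  ' if ok else 'FAIL'} {name}")

sys.exit(0 if all(checks.values()) else 1)
```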

For agents that require Apple Silicon (NanoClaw) or a local GPU (ZeroClaw with mid-size or larger local models), we use a Mac Studio M3 Max (64 GB unified memory) as the secondary rig.

The task suite

Every agent runs the same five tasks, in the same order, with the same default LLM (Claude 4.5 Sonnet via the Anthropic API); a sketch of the suite as data follows the list.

  1. Single-step tool call. Read a file, summarise. Tests basic tool invocation reliability.
  2. Multi-step planning. Plan and execute a non-trivial task across at least three tool calls. Tests agent reasoning over multiple steps.
  3. Browser automation. Navigate to a known page, extract structured data. Tests browser tool integration.
  4. Long context. Ingest a 50,000-token document and answer four follow-up questions referencing different sections.
  5. Error recovery. Hand the agent a task that requires a tool that doesn't exist. Score the recovery behaviour.
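
Writing the suite down as data helps keep it pinned between revisions. The sketch below is one illustrative encoding; the field names, abbreviated prompts, and file path are ours, not a format any of the agents consume.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    prompt: str  # what the agent is asked to do (abbreviated here)
    tests: str   # the capability the task is meant to exercise


TASK_SUITE = [
    Task("single-step tool call",
         "Read ./report.txt and summarise it in three sentences.",  # hypothetical file
         "basic tool invocation reliability"),
    Task("multi-step planning",
         "Plan and execute a non-trivial task across at least three tool calls.",
         "reasoning over multiple steps"),
    Task("browser automation",
         "Navigate to a known page and extract structured data.",
         "browser tool integration"),
    Task("long context",
         "Ingest a 50,000-token document, then answer four follow-up questions.",
         "retrieval across a long context"),
    Task("error recovery",
         "Complete a task that requires a tool that does not exist.",
         "recovery behaviour"),
]
```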

The scoring rubric

Each task scores from 0 to 5 on three dimensions: correctness, latency, and cost. The aggregate is reported as a single 0-15 number per task.

We do not aggregate further into a single “agent score” because single scores hide tradeoffs that matter. A faster agent that is less correct is not strictly worse than a slower agent that is more correct; which wins depends on the use case.
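
Expressed as code, the rubric is small. The sketch below is illustrative; the class and property names are ours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskScore:
    correctness: int  # 0-5
    latency: int      # 0-5
    cost: int         # 0-5

    def __post_init__(self):
        for value in (self.correctness, self.latency, self.cost):
            if not 0 <= value <= 5:
                raise ValueError("each dimension is scored 0-5")

    @property
    def total(self) -> int:
        # The only aggregate we report: a 0-15 number per task.
        return self.correctness + self.latency + self.cost


# Deliberately no further helper that collapses a list of TaskScores into
# one "agent score"; that single number would hide the tradeoffs above.
```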

The security audit checklist

For every agent we cover, we walk through a standard checklist; a structured sketch of it follows the list. The list is exhaustive on purpose; not every item applies to every agent.

  • Default authentication on the dashboard?
  • WebSocket origin validation?
  • Bind address default — localhost or 0.0.0.0?
  • Tool execution sandbox by default?
  • Approval flow for tool calls?
  • Credential storage — plaintext, encrypted, OS keyring?
  • Network egress allowlist?
  • Filesystem scope — read/write boundaries?
  • MCP / plugin signature validation?
  • Audit log — exists, writable, tamper-evident?
  • CVE disclosure policy and SLA?
  • Patch cadence (rolling 90-day average)?
  • Security contact address and PGP availability?
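
One way to record a walk through the checklist is a flat mapping from item to finding, as in the sketch below. The keys and example values describe a hypothetical agent, not any real audit result.

```python
# Hypothetical audit record; keys track the checklist above, values are findings.
audit = {
    "dashboard_auth_default":      "none (token auth can be enabled)",
    "websocket_origin_validation": "yes",
    "bind_address_default":        "0.0.0.0",
    "tool_sandbox_default":        "no",
    "tool_call_approval_flow":     "opt-in",
    "credential_storage":          "plaintext config file",
    "network_egress_allowlist":    "not available",
    "filesystem_scope":            "unrestricted",
    "plugin_signature_validation": "no",
    "audit_log":                   "exists, not tamper-evident",
    "cve_disclosure_policy":       "none published",
    "patch_cadence_days_90d":      None,  # unknown or not applicable
    "security_contact":            "security@ address, no PGP key",
}

# Items that do not apply to a given agent are set to None or omitted.
```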

What we do not test

We are honest about the limits of our testing. Things we do not systematically benchmark:

  • Long-term reliability. A two-week test cannot tell you what crashes after six months of agent operation.
  • Multi-user load. Most agents we cover ship for single-user; we don't pretend they're tested for ten concurrent users.
  • Adversarial security testing. We do not run penetration tests. We read the threat model, audit the defaults, and check for known issues. We are not pretending to be a red team.
  • Cost at scale. Our cost numbers are for the test rig and the test suite. Production cost depends entirely on your workload and we won't guess.

Why this methodology

Two principles drive the choices above.

One. Most readers don't need a benchmark of maximum theoretical performance. They need to know whether the agent works on a $5/month VPS with the LLM they already pay for. The CX22 + Claude rig represents that user.

Two. Reproducible beats comprehensive. A methodology you can replicate at home, with results you can challenge, beats an opaque industrial benchmark every time. Everything we do is documented; everything is reproducible.

Disagreements welcome

If you think a task is misaligned with how agents are actually used, or a security check is missing, or a scoring dimension is wrong, email us at contact@pocketclaw.dev. The methodology is updated quarterly with the landscape report.


Related: 2026 landscape report, comparisons, disclosure.