
Raspberry Pi 5 as a self-hosted AI host — 90 days of real-world benchmarks

Editorial note: This article reports on a fast-moving space. Versions, install counts and timelines are accurate as of the most recent update. We re-verify against primary sources (the CVE database, project repositories, vendor announcements) before each update. Send corrections to contact@pocketclaw.dev.

Problem
Every Raspberry Pi 5 self-hosted AI tutorial reads the same: “install Ollama, run a model, look, it works.” Almost none come back 90 days later with the boring questions: does it stay up, what does it actually serve well, and where does it fall flat?

Solution
We ran a Pi 5 8 GB with active cooling as a Hermes Agent host alongside Ollama (qwen2.5-coder-3B and tinyllama-1.1B) for 90 days against a real workload — code review on PRs, documentation Q&A on a 4 GB knowledge base, and inbox classification. Every figure in this article is from that period.

This article documents 90 days of running a Raspberry Pi 5 8 GB as the sole AI inference host for one engineer's personal workflow. No HA, no load balancing, no “production” theatre. One Pi, one engineer, three workloads. The figures below come from prom2json snapshots, SD-card SMART logs, and the bill for the Hetzner sidecar that handled public traffic during the period.

The hardware, exactly

  • Raspberry Pi 5 8 GB
  • Argon NEO 5 case with active cooling fan
  • 256 GB Samsung Pro Plus microSD (later replaced — see below)
  • 27 W official Pi 5 USB-C PSU
  • Wired ethernet, no Wi-Fi
  • Located in a home office, ambient 21–24 °C

Total hardware spend: €138 including case, PSU, SD card and ethernet cable. No HAT, no NVMe, no GPU.

The software stack

  • Raspberry Pi OS Bookworm 64-bit (kernel 6.6 LTS)
  • Docker 25.0 with Compose v2
  • Hermes Agent 2026.4.x (sandboxed via gVisor)
  • Ollama 0.4.7 with qwen2.5-coder:3b-instruct-q4_K_M and tinyllama:1.1b-chat-q4_0 warm-loaded
  • Caddy reverse proxy fronting Hermes for internal access
  • Tailscale for remote access (no exposed ports on the public internet)
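“Warm-loaded” here means the model weights stay resident between requests instead of being evicted after Ollama's default idle timeout. A minimal sketch of one way to do that against Ollama's REST API follows; the endpoint and keep_alive field are Ollama's documented interface, while the script itself is illustrative rather than our exact tooling.

```python
# warm_load.py — illustrative, not our exact tooling. An empty prompt makes
# Ollama load the model without generating; keep_alive=-1 keeps the weights
# resident indefinitely instead of evicting them after the default timeout.
import requests

OLLAMA = "http://localhost:11434"
MODELS = ("qwen2.5-coder:3b-instruct-q4_K_M", "tinyllama:1.1b-chat-q4_0")

for model in MODELS:
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": model, "prompt": "",
                            "keep_alive": -1, "stream": False},
                      timeout=300)
    r.raise_for_status()
    print(f"loaded {model}")
```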

Memory pressure was the single biggest constraint. With Hermes plus Ollama plus the qwen2.5-coder-3B model resident, the Pi sits at 6.4–6.8 GB used out of 8 GB. There is no headroom for a 7B model. There is barely headroom for the OS to do anything else.

The three workloads

We deliberately picked workloads we suspected would be different shapes:

1. Code review on Pull Requests. Every PR opened in our two private GitHub repos triggered a webhook into Hermes. Hermes pulled the diff, summarised it, ran a small ruleset against the changes (no console.log, no TODO left behind, etc.), and wrote a comment back on the PR. Average PR diff: 220 lines. Median run-time per PR: measured below.

2. Documentation Q&A on a 4 GB Markdown knowledge base. A long-running Hermes process indexed our internal wiki nightly. Three engineers asked it questions throughout the day in a Slack integration. We logged 1,840 questions over 90 days.

3. Inbox classification. A cron-driven loop polled an IMAP inbox every five minutes, classified incoming mail (sales / support / personal / spam) using tinyllama-1.1B as a feature extractor, then routed each message into a Gmail label. ~140 messages a day across the period.
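To make workload 3 concrete, here is a minimal sketch of that loop's shape, assuming Gmail over IMAP (labels applied via Gmail's X-GM-LABELS extension) and tinyllama served by Ollama on localhost. The credentials, prompt wording and fallback label are all illustrative, not our production pipeline.

```python
# classify_inbox.py — a sketch of workload 3's shape, not our production
# pipeline. Assumes Gmail over IMAP (labels via the X-GM-LABELS extension)
# and tinyllama served by Ollama on localhost. Credentials, prompt wording
# and the fallback label are illustrative. Cron reruns this every 5 minutes.
import email
import imaplib
import requests

LABELS = {"sales", "support", "personal", "spam"}

def classify(subject: str, snippet: str) -> str:
    prompt = ("Classify this email as one of: sales, support, personal, spam.\n"
              f"Subject: {subject}\nBody: {snippet[:800]}\n"
              "Answer with one word:")
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": "tinyllama:1.1b-chat-q4_0",
                            "prompt": prompt, "stream": False}, timeout=60)
    text = r.json()["response"].strip().lower()
    word = text.split()[0].strip(".,") if text else ""
    return word if word in LABELS else "support"  # conservative fallback

imap = imaplib.IMAP4_SSL("imap.gmail.com")
imap.login("user@example.com", "app-password")  # hypothetical credentials
imap.select("INBOX")
_, data = imap.search(None, "UNSEEN")
for num in data[0].split():
    _, msg_data = imap.fetch(num, "(RFC822)")
    msg = email.message_from_bytes(msg_data[0][1])
    body = b"" if msg.is_multipart() else (msg.get_payload(decode=True) or b"")
    label = classify(msg.get("Subject", ""), body.decode("utf-8", "ignore"))
    imap.store(num, "+X-GM-LABELS", f"({label})")
imap.logout()
```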

That was it. No image generation, no video, no “run a 70B model on a Pi” party tricks.

The headline numbers, 90 days in

| Metric | Result |
| --- | --- |
| Uptime (Hermes process) | 99.74% (1 reboot, 1 OOM kill) |
| PRs reviewed | 312 |
| Median PR review run-time | 14.3 s |
| 95th-percentile PR review run-time | 41.8 s |
| Doc Q&A questions answered | 1,840 |
| Median doc Q&A latency | 1.9 s |
| Inbox messages classified | 12,612 |
| Median inbox classification latency | 0.42 s |
| Total tokens generated (qwen2.5-coder-3B) | ~1.04 M |
| Total tokens generated (tinyllama-1.1B) | ~1.81 M |
| Average power draw at the wall | 5.8 W idle, 11.4 W under load |
| Total energy over 90 days | 17.1 kWh |
| Electricity cost at €0.21/kWh | €3.59 |

The €3.59 figure is the one that surprises people most. Three months of “running a self-hosted AI on hardware you own” cost less in electricity than the same quarter's workload would have billed at Haiku rates (the full comparison is below).

Where the Pi 5 was genuinely good

*Inbox classification* is where the Pi shines. tinyllama-1.1B at 4-bit quantisation runs at 38–45 tokens/s sustained on the Pi. Classification prompts are short (under 200 tokens in, ~5 tokens out). Median round-trip including IMAP fetch was under half a second. We never hit a backlog, even during high-volume hours.

*Documentation Q&A on small windows* worked surprisingly well. With qwen2.5-coder-3B Q4 and a retrieval-augmented setup that fed at most 1,500 tokens of context, we got coherent, citation-able answers in 1–3 seconds. The model is genuinely smart enough for “what's the difference between our staging and prod database setup?” class questions.
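The context cap mattered as much as the model choice. Below is a sketch of the capping step, assuming a retriever that already returns chunks best-first; the ~4-characters-per-token budget is a crude heuristic, not a real tokeniser, and the prompt wording is illustrative.

```python
# cap_context.py — sketch of the doc Q&A prompt assembly, assuming a
# retriever that returns chunks best-first. The 4-chars-per-token budget
# is a rough heuristic that keeps prompts near 1,500 tokens of context.
import requests

MAX_CONTEXT_TOKENS = 1500

def build_prompt(question: str, chunks: list[str]) -> str:
    picked, budget = [], MAX_CONTEXT_TOKENS * 4
    for chunk in chunks:
        if len(chunk) > budget:
            break  # stop rather than truncate mid-chunk
        picked.append(chunk)
        budget -= len(chunk)
    return ("Answer using only the context below. Cite the source file.\n\n"
            + "\n---\n".join(picked)
            + f"\n\nQuestion: {question}\nAnswer:")

def ask(question: str, chunks: list[str]) -> str:
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": "qwen2.5-coder:3b-instruct-q4_K_M",
                            "prompt": build_prompt(question, chunks),
                            "stream": False}, timeout=120)
    r.raise_for_status()
    return r.json()["response"]
```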

*Hermes Agent orchestration* — the part of the workload that doesn't need the LLM at all — flies on the Pi. Tool calls, file reads, network fetches, queue management: the Pi 5's CPU is plenty.

Where the Pi 5 ranged from bad to actively dangerous

*Larger model attempts.* Qwen2.5 7B Q4 fits in RAM if nothing else is running, but it runs at 4–6 tokens/s. That is below the threshold where a code review feels live. We aborted the 7B test after a week of feeling like we were working through molasses.

*Long-context queries.* The token-per-second numbers above are on prompts under 2,000 tokens. Push the prompt to 4,000+ tokens and throughput on qwen2.5-coder-3B drops to 6–8 tokens/s on the Pi. Throughput at 8,000 tokens drops to 2–3 tokens/s. The Pi 5's memory bandwidth is the bottleneck — it's not getting better.
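These throughput figures are cheap to reproduce, because Ollama's non-streaming responses report eval_count (tokens generated) and eval_duration (nanoseconds spent generating them). A rough probe follows; the padding is an approximation rather than an exact token count, and raising num_ctx is needed because Ollama's default context window is smaller than the longest prompt.

```python
# throughput_probe.py — illustrative reproduction of the tokens/s numbers.
# Ollama's non-streaming /api/generate response includes eval_count and
# eval_duration, so throughput needs no external timer.
import requests

def tokens_per_second(model: str, prompt: str) -> float:
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": model, "prompt": prompt, "stream": False,
                            # raise the context window past Ollama's default
                            "options": {"num_ctx": 8192}},
                      timeout=1200)
    r.raise_for_status()
    stats = r.json()
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)

for approx_tokens in (500, 2000, 4000, 8000):
    # Rough padding; the repeats are not exactly one token each.
    prompt = "word " * approx_tokens + "\nSummarise the text above."
    tps = tokens_per_second("qwen2.5-coder:3b-instruct-q4_K_M", prompt)
    print(f"~{approx_tokens} prompt tokens: {tps:.1f} tokens/s")
```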

*SD card wear.* The Samsung Pro Plus 256 GB started showing re-allocated sectors after 47 days. We migrated to a USB 3 SSD on day 51 after a partial corruption scare. Anyone doing this for real should boot from SSD, not SD. Plan for it from day one.

*Thermal under sustained load.* With the Argon NEO 5 fan we never hit the 85 °C throttle limit, but we sat at 70–74 °C during long generation runs. Without active cooling, the Pi 5 throttles in well under five minutes of LLM inference.
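If you want to watch the thermals yourself, the SoC temperature is exposed through sysfs on Raspberry Pi OS. A trivial standalone sampler, illustrative rather than our actual monitoring:

```python
# temp_log.py — sample the SoC temperature every 10 s during a generation
# run. The sysfs path is standard on Raspberry Pi OS; the value is in
# millidegrees Celsius.
import time

def soc_temp_c() -> float:
    with open("/sys/class/thermal/thermal_zone0/temp") as f:
        return int(f.read().strip()) / 1000

while True:
    print(f"{time.strftime('%H:%M:%S')}  {soc_temp_c():.1f} °C")
    time.sleep(10)
```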

The reboot and the OOM kill

Two unscheduled events worth recording.

The reboot was on day 23. A systemd-resolved update from the Bookworm repo made networking flaky for about ten minutes after install. Hermes got confused, leaked file descriptors, and we restarted the whole thing. Total downtime: 14 minutes. Cause: package update interaction, not the Pi itself.

The OOM kill was on day 67. We let an engineer run an ad-hoc query against documentation Q&A while the qwen2.5-coder-3B was already busy with a PR review. Memory pressure pushed the OS into killing the Hermes container. Hermes restart was clean, no data loss, total recovery time about 90 seconds. The lesson: 8 GB is the floor for this configuration, not the comfortable amount.
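The fix we'd reach for is boring: admit only one heavyweight model request at a time at the application layer. A minimal sketch, assuming all qwen2.5-coder-3B traffic funnels through one process; a second caller queues instead of adding memory pressure while a review is running.

```python
# one_at_a_time.py — a sketch of the mitigation we'd apply after day 67:
# serialize requests to the 3B model, so a second caller waits in line
# instead of pushing the box toward an OOM kill.
import threading
import requests

_qwen_gate = threading.Semaphore(1)  # one in-flight qwen request, ever

def generate(prompt: str) -> str:
    with _qwen_gate:
        r = requests.post("http://localhost:11434/api/generate",
                          json={"model": "qwen2.5-coder:3b-instruct-q4_K_M",
                                "prompt": prompt, "stream": False},
                          timeout=300)
        r.raise_for_status()
        return r.json()["response"]
```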

Cost vs. equivalent API

Equivalent API cost for the same workload, calculated against current Anthropic Haiku and OpenAI gpt-4o-mini rates as of April 2026:

  • 312 PR reviews × (~3,500 input + ~600 output tokens each)
  • 1,840 doc Q&A × (~1,500 input + ~250 output)
  • 12,612 inbox classifications × (~250 input + ~5 output)

Rough total: ~7.0 M input tokens, ~710 K output tokens. Against Haiku pricing ($0.80 input / $4 output per M tokens) that is ~$5.60 input + ~$2.84 output ≈ $8.45. Against gpt-4o-mini ($0.15 / $0.60) it's ~$1.48.
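The arithmetic, for anyone who wants to rerun it with their own volumes and current prices:

```python
# api_cost.py — the arithmetic behind the totals above.
workloads = [  # (runs, input tokens per run, output tokens per run)
    (312,   3500, 600),  # PR reviews
    (1840,  1500, 250),  # doc Q&A
    (12612,  250,   5),  # inbox classification
]
input_m = sum(n * i for n, i, _ in workloads) / 1e6    # ≈ 7.0 M tokens
output_m = sum(n * o for n, _, o in workloads) / 1e6   # ≈ 0.71 M tokens

# (input $/M, output $/M) at the April 2026 rates quoted above
for name, in_rate, out_rate in (("Haiku", 0.80, 4.00),
                                ("gpt-4o-mini", 0.15, 0.60)):
    print(f"{name}: ${input_m * in_rate + output_m * out_rate:.2f}")
```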

So gpt-4o-mini would have cost less than the electricity, and even Haiku comes in at roughly twice it. That's the honest answer.

But the API doesn't give you sandboxed tool execution, doesn't keep your inbox local, and doesn't index your private documentation without sending it to a third party. The economic case for the Pi here is not “cheaper than the API.” It is “data and tooling you control, at a running cost of €3.59 a quarter.”

Would we do it again?

For this workload, yes. The Pi handled the three workloads we threw at it for 90 days with one reboot and one OOM kill. €138 of hardware paid for itself in dignity within a month.

Would we do it for production traffic with paying customers? No. We already moved that to a Mac Mini M4 in March. The Pi 5 is a personal edge device. Treat it as one.

What we'd change next time

  • Boot from USB 3 SSD from day one. Don't use SD as the boot device.
  • Skip the 8 GB SKU and wait for the (rumoured) Pi 5 16 GB. The 8 GB is the floor for this stack, not a comfortable fit.
  • Use a real reverse proxy to enforce rate limits even on internal traffic; the day-67 OOM kill was two internal requests landing at once.
  • Snapshot the SD/SSD weekly. Cheap insurance.

Reference reading

  • [Hermes Agent on Raspberry Pi 5 — full setup guide](/setup/hermes-agent-on-raspberry-pi-5)
  • [Ollama on Raspberry Pi 5](/setup/ollama-on-raspberry-pi-5)
  • [Pocket AI hardware buyer's guide 2026](/guides/edge-ai-hardware-2026)
  • [Raspberry Pi 5 vs Mini PC for AI agents](/compare/raspberry-pi-5-vs-intel-nuc-13)

The Pi 5 is not the answer to every self-hosted AI question. It is the answer to a specific class of questions, and within that class it is genuinely excellent. Know which class you're in before you buy.
