
Raspberry Pi 5 as a self-hosted AI host — 90 days of real-world benchmarks

Editorial note: This article reports on a fast-moving space. Versions, install counts and timelines are accurate as of the most recent update. We re-verify against primary sources (the CVE database, project repositories, vendor announcements) before each update. Send corrections to contact@pocketclaw.dev.

Problem
Every Raspberry Pi 5 self-hosted AI tutorial reads the same: “install Ollama, run a model, look, it works.” Almost none come back 90 days later with the boring questions: does it stay up, what does it actually serve well, and where does it fall flat?

Solution
We ran a Pi 5 8 GB with active cooling as a Hermes Agent host alongside Ollama (qwen2.5-coder-3B and tinyllama-1.1B) for 90 days against a real workload — code review on PRs, documentation Q&A on a 4 GB knowledge base, and inbox classification. Every figure in this article is from that period.

This article documents 90 days of running a Raspberry Pi 5 8 GB as the sole AI inference host for one engineer's personal workflow. No HA, no load balancing, no “production” theatre. One Pi, one engineer, three workloads. The figures below come from prom2json snapshots, SD-card SMART logs, and the bill for the Hetzner sidecar that handled public traffic during the period.

The hardware, exactly

  • Raspberry Pi 5 8 GB
  • Argon NEO 5 case with active cooling fan
  • 256 GB Samsung Pro Plus microSD (later replaced — see below)
  • 27 W official Pi 5 USB-C PSU
  • Wired ethernet, no Wi-Fi
  • Located in a home office, ambient 21–24 °C

Total hardware spend: €138 including case, PSU, SD card and ethernet cable. No HAT, no NVMe, no GPU.

The software stack

  • Raspberry Pi OS Bookworm 64-bit (kernel 6.6 LTS)
  • Docker 25.0 with Compose v2
  • Hermes Agent 2026.4.x (sandboxed via gVisor)
  • Ollama 0.4.7 with qwen2.5-coder:3b-instruct-q4_K_M and tinyllama:1.1b-chat-q4_0 warm-loaded
  • Caddy reverse proxy fronting Hermes for internal access
  • Tailscale for remote access (no exposed ports on the public internet)
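“Warm-loaded” here means the model weights stay resident between requests instead of being evicted after Ollama's default idle timeout. A minimal sketch of one way to do that against Ollama's REST API follows; the endpoint and keep_alive field are Ollama's documented interface, while the script itself is illustrative rather than our exact tooling.

```python
# warm_load.py — illustrative, not our exact tooling. An empty prompt makes
# Ollama load the model without generating; keep_alive=-1 keeps the weights
# resident indefinitely instead of evicting them after the default timeout.
import requests

OLLAMA = "http://localhost:11434"
MODELS = ("qwen2.5-coder:3b-instruct-q4_K_M", "tinyllama:1.1b-chat-q4_0")

for model in MODELS:
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": model, "prompt": "",
                            "keep_alive": -1, "stream": False},
                      timeout=300)
    r.raise_for_status()
    print(f"loaded {model}")
```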

Memory pressure was the single biggest constraint. With Hermes plus Ollama plus the qwen2.5-coder-3B model resident, the Pi sits at 6.4–6.8 GB used out of 8 GB. There is no headroom for a 7B model. There is barely headroom for the OS to do anything else.

The three workloads

We deliberately picked workloads we suspected would be different shapes:

1. Code review on Pull Requests. Every PR opened in our two private GitHub repos triggered a webhook into Hermes. Hermes pulled the diff, summarised it, ran a small ruleset against the changes (no console.log, no TODO left behind, etc.), and wrote a comment back on the PR. Average PR diff: 220 lines. Median run-time per PR: measured below.

2. Documentation Q&A on a 4 GB Markdown knowledge base. A long-running Hermes process indexed our internal wiki nightly. Three engineers asked it questions throughout the day in a Slack integration. We logged 1,840 questions over 90 days.

3. Inbox classification. A cron-driven loop polled an IMAP inbox every five minutes, classified incoming mail (sales / support / personal / spam) using tinyllama-1.1B as a feature extractor, then routed each message into a Gmail label. ~140 messages a day across the period.
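To make workload 3 concrete, here is a minimal sketch of that loop's shape, assuming Gmail over IMAP (labels applied via Gmail's X-GM-LABELS extension) and tinyllama served by Ollama on localhost. The credentials, prompt wording and fallback label are all illustrative, not our production pipeline.

```python
# classify_inbox.py — a sketch of workload 3's shape, not our production
# pipeline. Assumes Gmail over IMAP (labels via the X-GM-LABELS extension)
# and tinyllama served by Ollama on localhost. Credentials, prompt wording
# and the fallback label are illustrative. Cron reruns this every 5 minutes.
import email
import imaplib
import requests

LABELS = {"sales", "support", "personal", "spam"}

def classify(subject: str, snippet: str) -> str:
    prompt = ("Classify this email as one of: sales, support, personal, spam.\n"
              f"Subject: {subject}\nBody: {snippet[:800]}\n"
              "Answer with one word:")
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": "tinyllama:1.1b-chat-q4_0",
                            "prompt": prompt, "stream": False}, timeout=60)
    text = r.json()["response"].strip().lower()
    word = text.split()[0].strip(".,") if text else ""
    return word if word in LABELS else "support"  # conservative fallback

imap = imaplib.IMAP4_SSL("imap.gmail.com")
imap.login("user@example.com", "app-password")  # hypothetical credentials
imap.select("INBOX")
_, data = imap.search(None, "UNSEEN")
for num in data[0].split():
    _, msg_data = imap.fetch(num, "(RFC822)")
    msg = email.message_from_bytes(msg_data[0][1])
    body = b"" if msg.is_multipart() else (msg.get_payload(decode=True) or b"")
    label = classify(msg.get("Subject", ""), body.decode("utf-8", "ignore"))
    imap.store(num, "+X-GM-LABELS", f"({label})")
imap.logout()
```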

That was it. No image generation, no video, no “run a 70B model on a Pi” party tricks.

The headline numbers, 90 days in

| Metric | Result |
| --- | --- |
| Uptime (Hermes process) | 99.74% (1 reboot, 1 OOM kill) |
| PRs reviewed | 312 |
| Median PR review run-time | 14.3 s |
| 95th-percentile PR review run-time | 41.8 s |
| Doc Q&A questions answered | 1,840 |
| Median doc Q&A latency | 1.9 s |
| Inbox messages classified | 12,612 |
| Median inbox classification latency | 0.42 s |
| Total tokens generated (qwen2.5-coder-3B) | ~1.04 M |
| Total tokens generated (tinyllama-1.1B) | ~1.81 M |
| Average power draw at the wall | 5.8 W idle, 11.4 W under load |
| Total energy over 90 days | 17.1 kWh |
| Electricity cost at €0.21/kWh | €3.59 |

The €3.59 figure is the one that surprises people most. Three months of “running a self-hosted AI on hardware you own” cost less in electricity than the same quarter's workload would have billed at Haiku rates (the full comparison is below).

Where the Pi 5 was genuinely good

*Inbox classification* is where the Pi shines. tinyllama-1.1B at 4-bit quantisation runs at 38–45 tokens/s sustained on the Pi. Classification prompts are short (under 200 tokens in, ~5 tokens out). Median round-trip including IMAP fetch was under half a second. We never hit a backlog, even during high-volume hours.

*Documentation Q&A on small windows* worked surprisingly well. With qwen2.5-coder-3B Q4 and a retrieval-augmented setup that fed at most 1,500 tokens of context, we got coherent, citation-able answers in 1–3 seconds. The model is genuinely smart enough for “what's the difference between our staging and prod database setup?” class questions.
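The context cap mattered as much as the model choice. Below is a sketch of the capping step, assuming a retriever that already returns chunks best-first; the ~4-characters-per-token budget is a crude heuristic, not a real tokeniser, and the prompt wording is illustrative.

```python
# cap_context.py — sketch of the doc Q&A prompt assembly, assuming a
# retriever that returns chunks best-first. The 4-chars-per-token budget
# is a rough heuristic that keeps prompts near 1,500 tokens of context.
import requests

MAX_CONTEXT_TOKENS = 1500

def build_prompt(question: str, chunks: list[str]) -> str:
    picked, budget = [], MAX_CONTEXT_TOKENS * 4
    for chunk in chunks:
        if len(chunk) > budget:
            break  # stop rather than truncate mid-chunk
        picked.append(chunk)
        budget -= len(chunk)
    return ("Answer using only the context below. Cite the source file.\n\n"
            + "\n---\n".join(picked)
            + f"\n\nQuestion: {question}\nAnswer:")

def ask(question: str, chunks: list[str]) -> str:
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": "qwen2.5-coder:3b-instruct-q4_K_M",
                            "prompt": build_prompt(question, chunks),
                            "stream": False}, timeout=120)
    r.raise_for_status()
    return r.json()["response"]
```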

*Hermes Agent orchestration* — the part of the workload that doesn't need the LLM at all — flies on the Pi. Tool calls, file reads, network fetches, queue management: the Pi 5's CPU is plenty.

Where the Pi 5 ranged from bad to actively dangerous

*Larger model attempts.* Qwen2.5 7B Q4 fits in RAM if nothing else is running, but it runs at 4–6 tokens/s. That is below the threshold where a code review feels live. We aborted the 7B test after a week of feeling like we were working through molasses.

*Long-context queries.* The token-per-second numbers above are on prompts under 2,000 tokens. Push the prompt to 4,000+ tokens and throughput on qwen2.5-coder-3B drops to 6–8 tokens/s on the Pi. Throughput at 8,000 tokens drops to 2–3 tokens/s. The Pi 5's memory bandwidth is the bottleneck — it's not getting better.
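These throughput figures are cheap to reproduce, because Ollama's non-streaming responses report eval_count (tokens generated) and eval_duration (nanoseconds spent generating them). A rough probe follows; the padding is an approximation rather than an exact token count, and raising num_ctx is needed because Ollama's default context window is smaller than the longest prompt.

```python
# throughput_probe.py — illustrative reproduction of the tokens/s numbers.
# Ollama's non-streaming /api/generate response includes eval_count and
# eval_duration, so throughput needs no external timer.
import requests

def tokens_per_second(model: str, prompt: str) -> float:
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": model, "prompt": prompt, "stream": False,
                            # raise the context window past Ollama's default
                            "options": {"num_ctx": 8192}},
                      timeout=1200)
    r.raise_for_status()
    stats = r.json()
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)

for approx_tokens in (500, 2000, 4000, 8000):
    # Rough padding; the repeats are not exactly one token each.
    prompt = "word " * approx_tokens + "\nSummarise the text above."
    tps = tokens_per_second("qwen2.5-coder:3b-instruct-q4_K_M", prompt)
    print(f"~{approx_tokens} prompt tokens: {tps:.1f} tokens/s")
```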

*SD card wear.* The Samsung Pro Plus 256 GB started showing re-allocated sectors after 47 days. We migrated to a USB 3 SSD on day 51 after a partial corruption scare. Anyone doing this for real should boot from SSD, not SD. Plan for it from day one.

*Thermal under sustained load.* With the Argon NEO 5 fan we never hit the 85 °C throttle limit, but we sat at 70–74 °C during long generation runs. Without active cooling, the Pi 5 throttles in well under five minutes of LLM inference.
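If you want to watch the thermals yourself, the SoC temperature is exposed through sysfs on Raspberry Pi OS. A trivial standalone sampler, illustrative rather than our actual monitoring:

```python
# temp_log.py — sample the SoC temperature every 10 s during a generation
# run. The sysfs path is standard on Raspberry Pi OS; the value is in
# millidegrees Celsius.
import time

def soc_temp_c() -> float:
    with open("/sys/class/thermal/thermal_zone0/temp") as f:
        return int(f.read().strip()) / 1000

while True:
    print(f"{time.strftime('%H:%M:%S')}  {soc_temp_c():.1f} °C")
    time.sleep(10)
```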

The reboot and the OOM kill

Two unscheduled events worth recording.

The reboot was on day 23. A systemd-resolved update from the Bookworm repo made networking flaky for about ten minutes after install. Hermes got confused, leaked file descriptors, and we restarted the whole thing. Total downtime: 14 minutes. Cause: package update interaction, not the Pi itself.

The OOM kill was on day 67. We let an engineer run an ad-hoc query against documentation Q&A while the qwen2.5-coder-3B was already busy with a PR review. Memory pressure pushed the OS into killing the Hermes container. Hermes restart was clean, no data loss, total recovery time about 90 seconds. The lesson: 8 GB is the floor for this configuration, not the comfortable amount.
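The fix we'd reach for is boring: admit only one heavyweight model request at a time at the application layer. A minimal sketch, assuming all qwen2.5-coder-3B traffic funnels through one process; a second caller queues instead of adding memory pressure while a review is running.

```python
# one_at_a_time.py — a sketch of the mitigation we'd apply after day 67:
# serialize requests to the 3B model, so a second caller waits in line
# instead of pushing the box toward an OOM kill.
import threading
import requests

_qwen_gate = threading.Semaphore(1)  # one in-flight qwen request, ever

def generate(prompt: str) -> str:
    with _qwen_gate:
        r = requests.post("http://localhost:11434/api/generate",
                          json={"model": "qwen2.5-coder:3b-instruct-q4_K_M",
                                "prompt": prompt, "stream": False},
                          timeout=300)
        r.raise_for_status()
        return r.json()["response"]
```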

Cost vs. equivalent API

Equivalent API cost for the same workload, calculated against current Anthropic Haiku and OpenAI gpt-4o-mini rates as of April 2026:

  • 312 PR reviews × (~3,500 input + ~600 output tokens each)
  • 1,840 doc Q&A × (~1,500 input + ~250 output)
  • 12,612 inbox classifications × (~250 input + ~5 output)

Rough total: ~7.0 M input tokens, ~710 K output tokens. Against Haiku pricing ($0.80 input / $4 output per M tokens) that is ~$5.60 input + ~$2.84 output ≈ $8.45. Against gpt-4o-mini ($0.15 / $0.60) it's ~$1.48.
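The arithmetic, for anyone who wants to rerun it with their own volumes and current prices:

```python
# api_cost.py — the arithmetic behind the totals above.
workloads = [  # (runs, input tokens per run, output tokens per run)
    (312,   3500, 600),  # PR reviews
    (1840,  1500, 250),  # doc Q&A
    (12612,  250,   5),  # inbox classification
]
input_m = sum(n * i for n, i, _ in workloads) / 1e6    # ≈ 7.0 M tokens
output_m = sum(n * o for n, _, o in workloads) / 1e6   # ≈ 0.71 M tokens

# (input $/M, output $/M) at the April 2026 rates quoted above
for name, in_rate, out_rate in (("Haiku", 0.80, 4.00),
                                ("gpt-4o-mini", 0.15, 0.60)):
    print(f"{name}: ${input_m * in_rate + output_m * out_rate:.2f}")
```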

So gpt-4o-mini would have cost less than the electricity, and even Haiku comes in at roughly twice it. That's the honest answer.

But the API doesn't give you sandboxed tool execution, doesn't keep your inbox local, and doesn't index your private documentation without sending it to a third party. The economic case for the Pi here is not “cheaper than the API.” It is “data and tooling you control, at a running cost of €3.59 a quarter.”

Would we do it again?

For this workload, yes. The Pi handled the three workloads we threw at it for 90 days with one reboot and one OOM kill. €138 of hardware paid for itself in dignity within a month.

Would we do it for production traffic with paying customers? No. We already moved that to a Mac Mini M4 in March. The Pi 5 is a personal edge device. Treat it as one.

What we'd change next time

  • Boot from USB 3 SSD from day one. Don't use SD as the boot device.
  • Skip the 8 GB SKU and wait for the (rumoured) Pi 5 16 GB. The 8 GB is the floor for this stack, not a comfortable fit.
  • Use a real reverse proxy to enforce rate limits even on internal traffic; the day-67 OOM kill was two internal requests landing at once.
  • Snapshot the SD/SSD weekly. Cheap insurance.

Reference reading

  • [Hermes Agent on Raspberry Pi 5 — full setup guide](/setup/hermes-agent-on-raspberry-pi-5)
  • [Ollama on Raspberry Pi 5](/setup/ollama-on-raspberry-pi-5)
  • [Pocket AI hardware buyer's guide 2026](/guides/edge-ai-hardware-2026)
  • [Raspberry Pi 5 vs Mini PC for AI agents](/compare/raspberry-pi-5-vs-intel-nuc-13)

The Pi 5 is not the answer to every self-hosted AI question. It is the answer to a specific class of questions, and within that class it is genuinely excellent. Know which class you're in before you buy.
