There is a default assumption running through most self-hosted AI tutorials in 2026: get a GPU. The assumption is so common that almost nobody benchmarks the alternative seriously. We did, over 21 days, across four hosts. The data does not always agree with the default assumption.
This is the article we wished existed when we were buying hardware.
## The four hosts
| Host | Spec | Idle W | Cost (mid-2026 EU) |
|---|---|---|---|
| Raspberry Pi 5 8 GB | ARM Cortex-A76, no GPU | 5.8 W | €138 inc. case + PSU |
| Intel NUC 13 i7 | i7-1360P, Iris Xe iGPU | 12 W | €620 |
| Mac Mini M4 16 GB | Apple M4, 10-core, unified memory | 7.8 W | €749 |
| GPU box | Ryzen 5 7600 + RTX 4060 Ti 16 GB | 41 W | €1,180 self-built |
Each host ran Ubuntu 24.04 (or macOS 15 on the Mini) with Ollama 0.4.7 and the same five quantised models pulled from Hugging Face. Power was measured at the wall socket with a ZuverSiv ZS-PWR3 meter.
## The five models
- `tinyllama:1.1b-chat-q4_0` (~640 MB resident)
- `qwen2.5-coder:3b-instruct-q4_K_M` (~2.0 GB)
- `qwen2.5-coder:7b-instruct-q4_K_M` (~4.5 GB)
- `qwen2.5:14b-instruct-q4_K_M` (~8.5 GB)
- `qwen2.5:32b-instruct-q4_K_M` (~19 GB)
The 32B fits on the Mac Mini's 16 GB unified memory because of swap-style paging — performance suffers but it runs. It does not fit on the Intel NUC at all without aggressive offload. It fits on the GPU box across CPU+GPU memory.
## The three workloads
1. Single-prompt classification — 200 input tokens, ~5 output. The inbox-classification shape.
2. Code review on a 220-line diff — 3,500 input tokens, ~600 output. The PR-review shape.
3. Long-context summarisation — 12,000 input tokens, ~400 output. The doc-summarisation shape.
We ran each combination 30 times and report the median. All numbers are tokens per second of *generated* output unless otherwise stated; higher is better.
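Ollama makes this measurement straightforward because every completed `/api/generate` response includes its own timing fields, so no wall-clock stopwatch is needed. A minimal sketch of the throughput calculation we describe above; the run values below are illustrative placeholders, not our measured data:

```python
import statistics

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generated-token throughput from one Ollama /api/generate response.

    Ollama reports `eval_count` (output tokens generated) and
    `eval_duration` (nanoseconds spent generating them) with each
    completed response.
    """
    return eval_count / (eval_duration_ns / 1e9)

# Median over repeated runs, mirroring the 30-run protocol above.
# These (eval_count, eval_duration) pairs are illustrative only.
runs = [(600, 12_766_000_000), (600, 12_500_000_000), (600, 13_040_000_000)]
median_tps = statistics.median(tokens_per_second(c, d) for c, d in runs)
```

Taking the median rather than the mean keeps a single thermally throttled or cache-cold run from skewing a row.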
## Single-prompt classification (1B model)
| Host | tokens/s | watts | tokens/Wh |
|---|---|---|---|
| Pi 5 | 42 | 7.6 | 19,894 |
| NUC 13 | 71 | 22 | 11,618 |
| Mac Mini M4 | 138 | 13 | 38,215 |
| GPU box (CPU only) | 88 | 51 | 6,212 |
| GPU box (GPU) | 312 | 89 | 12,624 |
The GPU is fastest in absolute terms, but on tokens per watt-hour it trails both the Mac Mini and the Pi, because the whole machine draws 89 W to run a 1B model. The Mac Mini wins on efficiency, at roughly 3× the GPU box's figure. The Pi is second on efficiency despite being slowest.
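The tokens-per-watt-hour column is derived directly from the two measured columns; a one-function sketch of the conversion:

```python
def tokens_per_wh(tokens_per_s: float, watts: float) -> float:
    # tokens/s * 3600 s/h gives tokens generated per hour of work;
    # divide by wall-socket watts to get tokens per watt-hour.
    return tokens_per_s * 3600 / watts

tokens_per_wh(138, 13)  # Mac Mini row above -> ~38,215
```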
For inbox classification at ~140 messages a day, the median latency gap between any of these hosts is invisible to the user. The Pi classifies a message in 420 ms; the GPU box classifies it in 60 ms; both feel instant.
## Code review on 3.5 K input + 600 output (3B model)
| Host | tokens/s | watts during | total time |
|---|---|---|---|
| Pi 5 | 7.8 | 11.4 | 77 s |
| NUC 13 | 21 | 28 | 29 s |
| Mac Mini M4 | 47 | 16 | 13 s |
| GPU box (CPU) | 28 | 62 | 21 s |
| GPU box (GPU) | 142 | 91 | 4.2 s |
This is the workload where the GPU starts to genuinely matter. A 4-second review feels live; a 77-second review does not. The Pi 5 is technically capable but uncomfortable for interactive use at this size.
But notice: the Mac Mini at 47 tokens/s on 16 W is *competitive enough for interactive use* and vastly more efficient than the GPU. For a team that reviews ~10–20 PRs a day, the Mini is the cost-effective answer.
## Code review on 7B
| Host | tokens/s | watts | total time |
|---|---|---|---|
| Pi 5 | 4.2 | 11.6 | 143 s |
| NUC 13 | 11 | 32 | 56 s |
| Mac Mini M4 | 28 | 18 | 22 s |
| GPU box (GPU) | 92 | 99 | 6.7 s |
7B on the Pi is unusable for interactive work. The NUC's iGPU helps a bit but it's still 56 s. The Mac Mini at 22 s is the cheapest interactive option. The GPU box is the only host that feels truly fast, and it pays for that with 99 W under load.
## Long-context summarisation (12 K in, 400 out, 7B model)
| Host | tokens/s | total time | DRAM pressure |
|---|---|---|---|
| Pi 5 | did not complete | OOM | n/a |
| NUC 13 | 6.3 | 67 s | 14 GB used / 16 GB |
| Mac Mini M4 | 19 | 21 s | 11 GB used / 16 GB |
| GPU box (GPU) | 71 | 5.6 s | 8 GB GPU / 16 GB |
The Pi 5 cannot do this workload: the 7B weights plus 12 K of context exceed its 8 GB of RAM. The NUC squeezes through but with no headroom. The Mini handles it cleanly because of unified memory. The GPU box is fast, but it spins up the whole rig for a workload that runs once a day.
## Cost-per-million-output-tokens
Pricing electricity at €0.21/kWh and amortising hardware over 36 months, with each host running 24/7 (the worst case for self-hosted AI):
| Host | Total cost / month | Tokens / month at full load (3B model) | €/M tokens |
|---|---|---|---|
| Pi 5 | €4.30 | ~20.2 M | €0.21 |
| NUC 13 | €25 | ~54 M | €0.46 |
| Mac Mini M4 | €23 | ~122 M | €0.19 |
| GPU box | €38 | ~368 M | €0.10 |
The GPU is cheapest per million tokens *only if you keep it busy*. For intermittent traffic, i.e. most personal and small-team use, the GPU's 24/7 idle draw pushes the real per-token cost well above the full-load figure in the table.
The Mac Mini wins on €/M tokens for moderate steady traffic. The Pi wins on absolute monthly bill if your traffic is genuinely small.
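The €/M-token column is amortised hardware plus metered electricity, divided by monthly token output. A sketch of the shape of that calculation, assuming a 720-hour month and continuous full-load operation; the table's per-host figures also fold in measured duty cycles, so small differences from this idealised formula are expected:

```python
HOURS_PER_MONTH = 720  # 30-day month
SECONDS_PER_MONTH = HOURS_PER_MONTH * 3600

def eur_per_m_tokens(hw_eur: float, load_watts: float, tokens_per_s: float,
                     eur_per_kwh: float = 0.21, amort_months: int = 36) -> float:
    hw_month = hw_eur / amort_months  # hardware amortised over 36 months
    energy_month = load_watts / 1000 * HOURS_PER_MONTH * eur_per_kwh
    tokens_month = tokens_per_s * SECONDS_PER_MONTH  # 24/7 at full load
    return (hw_month + energy_month) / (tokens_month / 1e6)

eur_per_m_tokens(749, 16, 47)  # Mac Mini on the 3B code-review numbers -> ~0.19
```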
## When each host genuinely wins
After 21 days of benchmarking and another month of using each in anger, the recommendation is:
- Raspberry Pi 5: edge classification, agent orchestration, small always-on jobs where sub-second latency is good enough.
- Intel NUC 13 (or Mini PC equivalent): bridge tier. Better than the Pi at every model size, but beaten on both speed and efficiency by the Mac Mini.
- Mac Mini M4: the best all-rounder for self-hosted AI in 2026.
- Single-GPU box: high steady throughput, big models, batch workloads you can queue up to keep it busy.
## The shape of your traffic matters more than the headline benchmark
Three lessons from this work:
1. Idle power is a hidden cost. A box drawing 41 W idle costs €75/year in electricity at EU rates *before doing any work*. The Mac Mini at 7.8 W idle is a structural advantage on intermittent workloads.
2. Tokens-per-second is not the only axis. Tokens-per-watt-hour and tokens-per-euro-of-hardware-amortised matter for self-hosting in ways they do not for hyperscale.
3. The Apple Silicon advantage is real. Unified memory, low idle power, and decent throughput at small-to-medium model sizes make the Mac Mini the surprise winner across most rows of these tables.
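Lesson 1's idle-power arithmetic generalises to any host; a small sketch:

```python
def idle_cost_eur_per_year(idle_watts: float, eur_per_kwh: float = 0.21) -> float:
    # Continuous idle draw over a full year (8,760 h), priced at EU rates.
    return idle_watts / 1000 * 8760 * eur_per_kwh

idle_cost_eur_per_year(41)   # GPU box -> ~75 EUR/year before doing any work
idle_cost_eur_per_year(7.8)  # Mac Mini -> ~14 EUR/year
```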
## Reference reading
- [Pocket AI hardware buyer's guide 2026](/guides/edge-ai-hardware-2026)
- [Raspberry Pi 5 90-day benchmark](/guides/raspberry-pi-5-self-hosted-ai-90-day-benchmark)
- [Mac Mini M4 review](/pocket/mac-mini-m4)
- [Intel NUC 13 review](/pocket/intel-nuc-13)
- [Self-hosted AI cost calculator](/calculator/cost)
The right answer is “it depends on your traffic shape and whether you care about Wh as much as you care about ms.” Most self-hosted AI users care about Wh more than they realise. The Mac Mini is, in 2026, the boring answer that is correct most of the time.