There is a default assumption running through most self-hosted AI tutorials in 2026: get a GPU. The assumption is so common that almost nobody benchmarks the alternative seriously. We did, over 21 days, across four hosts. The data does not always agree with the default assumption.
This is the article we wished existed when we were buying hardware.
## The four hosts
| Host | Spec | Idle W | Cost (mid-2026 EU) |
|---|---|---|---|
| Raspberry Pi 5 8 GB | ARM Cortex-A76, no GPU | 5.8 W | €138 inc. case + PSU |
| Intel NUC 13 i7 | i7-1360P, Iris Xe iGPU | 12 W | €620 |
| Mac Mini M4 16 GB | Apple M4, 10-core, unified memory | 7.8 W | €749 |
| GPU box | Ryzen 5 7600 + RTX 4060 Ti 16 GB | 41 W | €1,180 self-built |
Each host ran Ubuntu 24.04 (or macOS 15 on the Mini) with Ollama 0.4.7 and the same five quantised models pulled from Hugging Face. Power was measured at the wall socket with a ZuverSiv ZS-PWR3 meter.
## The five models
- `tinyllama:1.1b-chat-q4_0` (~640 MB resident)
- `qwen2.5-coder:3b-instruct-q4_K_M` (~2.0 GB)
- `qwen2.5-coder:7b-instruct-q4_K_M` (~4.5 GB)
- `qwen2.5:14b-instruct-q4_K_M` (~8.5 GB)
- `qwen2.5:32b-instruct-q4_K_M` (~19 GB)
The 32B fits on the Mac Mini's 16 GB unified memory because of swap-style paging — performance suffers but it runs. It does not fit on the Intel NUC at all without aggressive offload. It fits on the GPU box across CPU+GPU memory.
## The three workloads
1. Single-prompt classification — 200 input tokens, ~5 output. The inbox-classification shape.
2. Code review on a 220-line diff — 3,500 input tokens, ~600 output. The PR-review shape.
3. Long-context summarisation — 12,000 input tokens, ~400 output. The doc-summarisation shape.
We ran each combination 30 times and report the median. All numbers are tokens per second of *generated* output unless otherwise stated; higher is better.
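Ollama makes this measurement straightforward because every completed `/api/generate` response includes its own timing fields, so no wall-clock stopwatch is needed. A minimal sketch of the throughput calculation we describe above; the run values below are illustrative placeholders, not our measured data:

```python
import statistics

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generated-token throughput from one Ollama /api/generate response.

    Ollama reports `eval_count` (output tokens generated) and
    `eval_duration` (nanoseconds spent generating them) with each
    completed response.
    """
    return eval_count / (eval_duration_ns / 1e9)

# Median over repeated runs, mirroring the 30-run protocol above.
# These (eval_count, eval_duration) pairs are illustrative only.
runs = [(600, 12_766_000_000), (600, 12_500_000_000), (600, 13_040_000_000)]
median_tps = statistics.median(tokens_per_second(c, d) for c, d in runs)
```

Taking the median rather than the mean keeps a single thermally throttled or cache-cold run from skewing a row.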
## Single-prompt classification (1B model)
| Host | tokens/s | watts | tokens/Wh |
|---|---|---|---|
| Pi 5 | 42 | 7.6 | 19,894 |
| NUC 13 | 71 | 22 | 11,618 |
| Mac Mini M4 | 138 | 13 | 38,215 |
| GPU box (CPU only) | 88 | 51 | 6,212 |
| GPU box (GPU) | 312 | 89 | 12,624 |
The GPU is fastest in absolute terms, but on tokens per watt-hour it trails both the Mac Mini and the Pi, because the whole machine draws 89 W to run a 1B model. The Mac Mini wins on efficiency, at roughly 3× the GPU box's figure. The Pi is second on efficiency despite being slowest.
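The tokens-per-watt-hour column is derived directly from the two measured columns; a one-function sketch of the conversion:

```python
def tokens_per_wh(tokens_per_s: float, watts: float) -> float:
    # tokens/s * 3600 s/h gives tokens generated per hour of work;
    # divide by wall-socket watts to get tokens per watt-hour.
    return tokens_per_s * 3600 / watts

tokens_per_wh(138, 13)  # Mac Mini row above -> ~38,215
```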
For inbox classification at ~140 messages a day, the median latency gap between any of these hosts is invisible to the user. The Pi classifies a message in 420 ms; the GPU box classifies it in 60 ms; both feel instant.
## Code review on 3.5 K input + 600 output (3B model)
| Host | tokens/s | watts during | total time |
|---|---|---|---|
| Pi 5 | 7.8 | 11.4 | 77 s |
| NUC 13 | 21 | 28 | 29 s |
| Mac Mini M4 | 47 | 16 | 13 s |
| GPU box (CPU) | 28 | 62 | 21 s |
| GPU box (GPU) | 142 | 91 | 4.2 s |
This is the workload where the GPU starts to genuinely matter. A 4-second review feels live; a 77-second review does not. The Pi 5 is technically capable but uncomfortable for interactive use at this size.
But notice: the Mac Mini at 47 tokens/s on 16 W is *competitive enough for interactive use* and vastly more efficient than the GPU. For a team that reviews ~10–20 PRs a day, the Mini is the cost-effective answer.
## Code review on 7B
| Host | tokens/s | watts | total time |
|---|---|---|---|
| Pi 5 | 4.2 | 11.6 | 143 s |
| NUC 13 | 11 | 32 | 56 s |
| Mac Mini M4 | 28 | 18 | 22 s |
| GPU box (GPU) | 92 | 99 | 6.7 s |
7B on the Pi is unusable for interactive work. The NUC's iGPU helps a bit but it's still 56 s. The Mac Mini at 22 s is the cheapest interactive option. The GPU box is the only host that feels truly fast, and it pays for that with 99 W under load.
## Long-context summarisation (12 K in, 400 out, 7B model)
| Host | tokens/s | total time | DRAM pressure |
|---|---|---|---|
| Pi 5 | did not complete | OOM | n/a |
| NUC 13 | 6.3 | 67 s | 14 GB used / 16 GB |
| Mac Mini M4 | 19 | 21 s | 11 GB used / 16 GB |
| GPU box (GPU) | 71 | 5.6 s | 8 GB GPU / 16 GB |
The Pi 5 cannot do this workload: the 7B weights plus 12 K of context exceed its 8 GB of RAM. The NUC squeezes through but with no headroom. The Mini handles it cleanly because of unified memory. The GPU box is fast, but it spins up the whole rig for a workload that runs once a day.
## Cost-per-million-output-tokens
Pricing electricity at €0.21/kWh and amortising hardware over 36 months, with each host running 24/7 (the worst case for self-hosted AI):
| Host | Total cost / month | Tokens / month at full load (3B model) | €/M tokens |
|---|---|---|---|
| Pi 5 | €4.30 | ~20.2 M | €0.21 |
| NUC 13 | €25 | ~54 M | €0.46 |
| Mac Mini M4 | €23 | ~122 M | €0.19 |
| GPU box | €38 | ~368 M | €0.10 |
The GPU is cheapest per million tokens *only if you keep it busy*. For intermittent traffic, i.e. most personal and small-team use, the GPU's 24/7 idle draw pushes the real per-token cost well above the full-load figure in the table.
The Mac Mini wins on €/M tokens for moderate steady traffic. The Pi wins on absolute monthly bill if your traffic is genuinely small.
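The €/M-token column is amortised hardware plus metered electricity, divided by monthly token output. A sketch of the shape of that calculation, assuming a 720-hour month and continuous full-load operation; the table's per-host figures also fold in measured duty cycles, so small differences from this idealised formula are expected:

```python
HOURS_PER_MONTH = 720  # 30-day month
SECONDS_PER_MONTH = HOURS_PER_MONTH * 3600

def eur_per_m_tokens(hw_eur: float, load_watts: float, tokens_per_s: float,
                     eur_per_kwh: float = 0.21, amort_months: int = 36) -> float:
    hw_month = hw_eur / amort_months  # hardware amortised over 36 months
    energy_month = load_watts / 1000 * HOURS_PER_MONTH * eur_per_kwh
    tokens_month = tokens_per_s * SECONDS_PER_MONTH  # 24/7 at full load
    return (hw_month + energy_month) / (tokens_month / 1e6)

eur_per_m_tokens(749, 16, 47)  # Mac Mini on the 3B code-review numbers -> ~0.19
```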
## When each host genuinely wins
After 21 days of benchmarking and another month of using each in anger, the recommendation is:
- Raspberry Pi 5: edge classification, agent orchestration, small always-on jobs where sub-second latency is good enough.
- Intel NUC 13 (or Mini PC equivalent): bridge tier. Better than the Pi at every model size, but beaten on both speed and efficiency by the Mac Mini.
- Mac Mini M4: the best all-rounder for self-hosted AI in 2026.
- Single-GPU box: high steady throughput, big models, batch workloads you can queue up to keep it busy.
## The shape of your traffic matters more than the headline benchmark
Three lessons from this work:
1. Idle power is a hidden cost. A box drawing 41 W idle costs €75/year in electricity at EU rates *before doing any work*. The Mac Mini at 7.8 W idle is a structural advantage on intermittent workloads.
2. Tokens-per-second is not the only axis. Tokens-per-watt-hour and tokens-per-euro-of-hardware-amortised matter for self-hosting in ways they do not for hyperscale.
3. The Apple Silicon advantage is real. Unified memory, low idle power, and decent throughput at small-to-medium model sizes make the Mac Mini the surprise winner across most rows of these tables.
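Lesson 1's idle-power arithmetic generalises to any host; a small sketch:

```python
def idle_cost_eur_per_year(idle_watts: float, eur_per_kwh: float = 0.21) -> float:
    # Continuous idle draw over a full year (8,760 h), priced at EU rates.
    return idle_watts / 1000 * 8760 * eur_per_kwh

idle_cost_eur_per_year(41)   # GPU box -> ~75 EUR/year before doing any work
idle_cost_eur_per_year(7.8)  # Mac Mini -> ~14 EUR/year
```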
## Reference reading
- [Pocket AI hardware buyer's guide 2026](/guides/edge-ai-hardware-2026)
- [Raspberry Pi 5 90-day benchmark](/guides/raspberry-pi-5-self-hosted-ai-90-day-benchmark)
- [Mac Mini M4 review](/pocket/mac-mini-m4)
- [Intel NUC 13 review](/pocket/intel-nuc-13)
- [Self-hosted AI cost calculator](/calculator/cost)
The right answer is “it depends on your traffic shape and whether you care about Wh as much as you care about ms.” Most self-hosted AI users care about Wh more than they realise. The Mac Mini is, in 2026, the boring answer that is correct most of the time.