We ran a small internal tool on the OpenAI Assistants API from October 2025 to March 2026. It answered questions from our sales team about deck content — “what was our 2024 churn number?”, “which case study mentions enterprise SSO?”, “summarise the Q4 ICP doc”. Sales loved it. Finance hated the bill: €1,420/month at peak.
In late February, three things converged: the bill, the sandbox concerns we'd been deferring, and the new OpenClaw 2026.4 release. We migrated. This article is the week-by-week record.
Week 0: the inventory
Before touching anything, we wrote down what the existing system actually did. This took two days and was more useful than any other single thing we did during the migration.
The Assistant had:
- Three function tools (lookup_deck, lookup_metric, escalate_to_human)
- One file_search vector store with 312 PDFs
- A custom system prompt of 1,840 tokens
- A streaming front-end on Slack
- An average thread length of 7 turns before being abandoned
- ~14,000 messages/month across ~600 active threads
Cost breakdown of the €1,420/month:
- Token charges (gpt-4o on threads + retrieval): €960
- File_search storage and retrieval ops: €380
- Function-call overhead and metadata: ~€80
We were not going to replicate file_search performance with hand-rolled embeddings. That became a guiding constraint.
Week 1: the scaffolding
Set up a €25/month Hetzner CX32 (8 GB RAM, 4 vCPU, Frankfurt). Installed OpenClaw 2026.4.7 with sandbox-on, Caddy reverse proxy, Tailscale for admin access. This took an afternoon.
Hooked OpenClaw to OpenRouter for the LLM (we deliberately did not run a local model — see the “what we kept” section). Tested basic chat: worked.
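A minimal version of that smoke test, assuming OpenRouter's OpenAI-compatible endpoint (the model slug is illustrative; pick whatever you've enabled on your account):

```python
# Smoke test: one round-trip through OpenRouter's OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # an OpenRouter key, not an OpenAI key
)

resp = client.chat.completions.create(
    model="anthropic/claude-3.5-haiku",  # illustrative slug
    messages=[{"role": "user", "content": "Reply with the word 'ok'."}],
)
print(resp.choices[0].message.content)
```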
Spent the rest of the week porting the system prompt. OpenClaw's prompt
schema is more structured than Assistants' free-form text — there are
slots for persona, tools, safety, examples. We had to break the
1,840-token prompt across those slots. The first attempt produced
noticeably worse responses than Assistants. We iterated.
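To make the slot idea concrete, here's a hypothetical sketch of how one free-form prompt breaks apart. The field names and wording are illustrative only, not OpenClaw's actual schema; the tool names are ours from the inventory above.

```python
# Hypothetical sketch of a slotted prompt config. Field names are
# illustrative -- consult the OpenClaw 2026.4 docs for the real schema.
prompt_config = {
    "persona": (
        "You answer questions from the sales team about our deck library. "
        "Be concise; cite the deck and slide number when you can."
    ),
    "tools": (
        "Prefer lookup_deck for content questions and lookup_metric for "
        "numbers. Call escalate_to_human when confidence is low."
    ),
    "safety": "Never quote customer names from decks marked confidential.",
    "examples": [
        {"user": "what was our 2024 churn number?",
         "assistant": "Per the Q4 metrics deck (slide 7): ..."},
    ],
}
```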
Week 2: tools and the function-calling translation
OpenClaw's tool schema is JSON-schema based, similar to OpenAI's function calling but stricter about argument validation. Two of our three tools ported cleanly. The third (escalate_to_human) had been written sloppily on the OpenAI side — it accepted free-form arguments and we'd been getting away with it. OpenClaw rejected the schema. We spent two days tightening the contract.
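The shape of that tightening, roughly (field names illustrative, not our exact contract):

```python
# Before: free-form arguments that Assistants tolerated and OpenClaw rejects.
loose_schema = {
    "name": "escalate_to_human",
    "parameters": {"type": "object"},  # anything goes
}

# After: every argument typed, constrained, and required.
tight_schema = {
    "name": "escalate_to_human",
    "parameters": {
        "type": "object",
        "properties": {
            "reason": {
                "type": "string",
                "enum": ["low_confidence", "missing_document", "user_request"],
            },
            "thread_summary": {"type": "string", "maxLength": 500},
            "urgency": {"type": "string", "enum": ["low", "normal", "high"]},
        },
        "required": ["reason", "thread_summary"],
        "additionalProperties": False,
    },
}
```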
Net result: better validation, fewer broken escalations in production, but two days of unscheduled work.
We discovered during this week that OpenClaw 2026.4 has a built-in audit log of every tool call — timestamp, arguments, return value, duration. Assistants does not give you this. We hadn't realised how much we wanted it until we had it.
Week 3: the retrieval problem
This was the hardest week.
OpenAI's file_search uses an opaque vector store that you don't manage. You upload PDFs, OpenAI extracts, chunks, embeds, indexes, retrieves. You don't see the chunks. You don't choose the embedding model. It just works (or doesn't).
Self-hosting forces choices. We tried three setups:
1. Built-in OpenClaw retrieval (Marcoeko encoder). Retrieval was OK on plain-text PDFs but mediocre on PDFs with tables. Our deck library has many tables.
2. Qdrant with bge-large-en-v1.5 embeddings. Better quality but chunking decisions kept biting us. Tables got split mid-row.
3. Qdrant with bge-large-en-v1.5 + a custom chunker that respects table boundaries. This finally worked, but it took five days to build the chunker.
We ended up with a Qdrant instance on the same VPS, ingesting our 312 PDFs nightly via a small Python script. End-to-end retrieval quality is 80–85% of what file_search delivered. Not parity. Acceptable.
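The load-bearing part of that chunker is a guard that refuses to split inside a table, even when that blows the chunk budget. A stripped-down sketch of the idea (the real script also does PDF extraction and the nightly Qdrant upsert, omitted here; names are ours to illustrate):

```python
# Table-aware chunking: never split a block flagged as a table, even if
# that makes a chunk oversized. Blocks are (text, is_table) pairs produced
# upstream by the PDF extractor (omitted).
from dataclasses import dataclass

MAX_CHARS = 1500  # rough chunk budget; tune alongside retrieval top-k

@dataclass
class Block:
    text: str
    is_table: bool

def chunk(blocks: list[Block]) -> list[str]:
    chunks, current = [], ""
    for block in blocks:
        if block.is_table:
            # Flush accumulated prose, then emit the table whole.
            if current:
                chunks.append(current)
                current = ""
            chunks.append(block.text)
        elif current and len(current) + len(block.text) > MAX_CHARS:
            chunks.append(current)
            current = block.text
        else:
            current += ("\n" if current else "") + block.text
    if current:
        chunks.append(current)
    return chunks
```

Embedding the resulting chunks with bge-large-en-v1.5 and upserting into Qdrant is then a routine loop, e.g. with sentence-transformers and qdrant-client.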
Week 4: streaming and Slack
OpenClaw exposes a Server-Sent Events streaming endpoint that maps relatively cleanly to what our Slack adapter expected from OpenAI's streaming. Integration took three days.
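The relay loop is mechanical: read SSE `data:` lines, accumulate tokens, edit the Slack message in place on a throttle. A sketch, with a hypothetical OpenClaw endpoint and event shape (only the slack_sdk side is standard API):

```python
# Relay an SSE token stream into a single Slack message via chat.update.
# The OpenClaw endpoint URL and event payload here are assumptions.
import json, time
import requests
from slack_sdk import WebClient

slack = WebClient(token="xoxb-...")

def relay(prompt: str, channel: str) -> None:
    ts = slack.chat_postMessage(channel=channel, text="…")["ts"]
    buffer, last_edit = "", 0.0
    with requests.post(
        "https://agent.internal/v1/chat/stream",  # hypothetical endpoint
        json={"message": prompt},
        stream=True,
        timeout=60,
    ) as resp:
        for line in resp.iter_lines(decode_unicode=True):
            if not line or not line.startswith("data:"):
                continue
            event = json.loads(line[len("data:"):])
            buffer += event.get("token", "")
            # Slack rate-limits chat.update; edit at most ~1x/second.
            if time.time() - last_edit > 1.0:
                slack.chat_update(channel=channel, ts=ts, text=buffer)
                last_edit = time.time()
    slack.chat_update(channel=channel, ts=ts, text=buffer)
```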
The wrinkle was thread state. OpenAI Assistants persists threads
server-side; OpenClaw pushes that responsibility to the client. We
added a Postgres threads table on the VPS, keyed on (slack_user_id,
slack_channel_id). State lives there now. We could have used SQLite —
in retrospect we should have. Postgres for ~600 small rows is overkill.
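For completeness, the thread-state write is a single upsert keyed on that pair; the same SQL ports to SQLite with minor changes (which, per the note above, is what we'd pick today). A sketch with psycopg:

```python
# Thread state keyed on (slack_user_id, slack_channel_id).
import psycopg

DDL = """
CREATE TABLE IF NOT EXISTS threads (
    slack_user_id    TEXT NOT NULL,
    slack_channel_id TEXT NOT NULL,
    history          TEXT NOT NULL DEFAULT '[]',  -- JSON-encoded turns
    updated_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (slack_user_id, slack_channel_id)
);
"""

def init(conn: psycopg.Connection) -> None:
    conn.execute(DDL)
    conn.commit()

def save_turn(conn: psycopg.Connection, user: str, channel: str, history: str) -> None:
    conn.execute(
        """
        INSERT INTO threads (slack_user_id, slack_channel_id, history)
        VALUES (%s, %s, %s)
        ON CONFLICT (slack_user_id, slack_channel_id)
        DO UPDATE SET history = EXCLUDED.history, updated_at = now()
        """,
        (user, channel, history),
    )
    conn.commit()
```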
Week 5: parallel run and dogfooding
We ran both systems in parallel for the entire week, with a feature flag in Slack routing each user to one or the other. We collected feedback through a mandatory thumbs-up/down at the end of each thread.
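The flag was deterministic bucketing on the Slack user ID, so each person stayed on one backend all week. Something like:

```python
# Deterministic A/B routing: hash the Slack user ID so each user sticks
# to one backend for the whole parallel-run week.
import hashlib

def backend_for(slack_user_id: str) -> str:
    digest = hashlib.sha256(slack_user_id.encode()).digest()
    return "openclaw" if digest[0] % 2 == 0 else "openai"
```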
Results:
- 412 threads total during the week
- OpenAI side: 67% thumbs up, 12% thumbs down, 21% no rating
- OpenClaw side: 61% thumbs up, 16% thumbs down, 23% no rating
Six-point gap. Sales noticed. Common complaints on OpenClaw:
- “It missed the table on slide 4.” (Retrieval quality gap.)
- “It doesn't remember what I asked yesterday.” (We hadn't enabled cross-session memory yet.)
- “It's slower.” (Average response: 6.1 s vs OpenAI 3.4 s.)
We addressed memory in week 6. Speed and retrieval gaps remain real.
Week 6: cutover, polish, and the cost crash
Cut over fully on day 38. OpenAI account paused but not closed.
Tightened the system prompt based on dogfooding feedback. Added cross-session memory via OpenClaw's built-in store. Tuned retrieval top-k from 5 to 8 to compensate for the chunker's occasional misses.
End-of-week thumbs-up rate climbed to 64%, within three percentage points of the OpenAI baseline.
The cost crash
| Month | Setup | Bill |
|---|---|---|
| Jan 2026 | OpenAI Assistants only | €1,420 |
| Feb 2026 | OpenAI Assistants only | €1,392 |
| Mar 2026 | OpenAI + parallel OpenClaw | €1,460 |
| Apr 2026 | OpenClaw only | €218 |
The €218 breaks down as:
- Hetzner CX32: €25
- OpenRouter LLM (Claude Haiku for cheap classification, Sonnet for answers): €172
- Qdrant storage and ops: ~€8
- Domain, monitoring, misc: €13
That's an 85% cost reduction at slightly worse quality and slightly slower latency. Sales accepted the trade-off when we showed them the old vs. new bills.
What we kept on OpenAI
We did not migrate everything.
- Whisper transcription. OpenAI's Whisper-1 endpoint still works fine for us and was never part of the cost problem.
- DALL-E for one-off internal graphics. Low volume, not worth migrating.
- A fallback path on OpenClaw failure. OpenClaw has been stable, but the paused OpenAI account gives us a switch to flip if that changes.
What we'd do differently
- Start retrieval engineering in week 1, not week 3. It was the bottleneck.
- Run the parallel-run period earlier and longer. We learned more in
- Use SQLite for thread state. Postgres was overkill.
- Write the chunker first, before ingesting any documents.
- Keep OpenAI Whisper. Don't migrate things that work and don't hurt.
Reference reading
- [OpenClaw vs Hermes Agent — head-to-head](/compare/openclaw-vs-hermes)
- [Self-hosted RAG on a €25 VPS](/setup/document-rag-self-hosted)
- [How to choose a self-hosted AI agent in 2026](/guides/how-to-choose-self-hosted-ai-agent-2026)
- [Solo developer self-hosted AI — a year-long retrospective](/guides/solo-developer-self-hosted-ai-retrospective-2026)
Self-hosting is not free. The bill drops by ~85%; the engineer-time to get there does not. The decision is real cost vs. real time vs. real control. Run the numbers for your specific workload before you commit.