Self-hosted Llama vs Claude API: our real cost breakdown
When a token bill is the problem, and when a GPU is. One month of real numbers from a working agency.
Every studio and agency we talk to right now has the same quiet anxiety: the AI workflows that were free to experiment with last year are about to have a bill attached. Before that bill lands, it is worth doing the math honestly.
We run AI inside our own workflows and inside OHM Agency's daily work — copy, summaries, asset tagging, brief processing. Across both, our monthly token-equivalent volume is around 18 million input tokens and 4 million output tokens. That is a realistic mid-sized agency footprint.
On Claude Sonnet and GPT-4-class APIs, that volume costs roughly $170 to $260 per month depending on the model mix. Not ruinous. But add a client-facing product and it scales with usage, not with team size. That is the scary shape.
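A back-of-envelope version of that bill. The per-million-token prices below are placeholders, not any provider's published rates; they are chosen so a cheaper and a pricier model mix bracket the $170 to $260 range above:

```python
# Monthly token volume from the workloads described above.
INPUT_TOKENS_M = 18   # million input tokens per month
OUTPUT_TOKENS_M = 4   # million output tokens per month

def monthly_cost(price_in_per_m: float, price_out_per_m: float) -> float:
    """Monthly API spend in USD for this token volume."""
    return INPUT_TOKENS_M * price_in_per_m + OUTPUT_TOKENS_M * price_out_per_m

# Assumed prices for a cheaper and a pricier model mix (illustrative only).
low = monthly_cost(5.0, 20.0)    # -> 170.0
high = monthly_cost(10.0, 20.0)  # -> 260.0
```

The point of writing it out: input volume dominates our bill at this mix, which matters later when deciding what moves off the API.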
We tested the same workloads on a self-hosted stack: Llama 3 8B and Mistral Small running on a single RTX 4090 (a consumer card already in the studio) via Ollama. Total operational cost: roughly €40 per month in electricity, no token meter. Quality on summarisation and tagging is indistinguishable from Sonnet. On complex multi-step reasoning the frontier models still win, which is why we keep a small Claude budget for the hard 20%.
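The €40 figure survives a napkin check. A minimal sketch, assuming an average draw and a tariff that are guesses rather than measurements:

```python
# Sanity check on the ~EUR 40/month electricity figure.
# AVG_DRAW_W and EUR_PER_KWH are assumptions, not measured values:
# a 4090 peaks far higher, but inference is bursty, so the average is modest.
AVG_DRAW_W = 180           # assumed average card + host draw under mixed load
HOURS_PER_MONTH = 24 * 30
EUR_PER_KWH = 0.30         # assumed tariff

kwh = AVG_DRAW_W / 1000 * HOURS_PER_MONTH  # ~130 kWh
cost = kwh * EUR_PER_KWH                   # ~EUR 39
```

If your tariff or duty cycle differs, the number moves, but it stays fixed per month rather than scaling with client usage.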
The honest conclusion: self-host wins when the workload is high-volume and structurally simple (tagging, classification, summaries, first drafts). API wins when the quality ceiling matters more than the cost. Most agency work is the first kind.
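That split can be expressed as a trivial router. The task categories and backend names here are illustrative, not our production code:

```python
from dataclasses import dataclass

# Structurally simple, high-volume work that the local models handle well.
LOCAL_TASKS = {"tagging", "classification", "summary", "first_draft"}

@dataclass
class Task:
    kind: str
    needs_multistep_reasoning: bool = False

def route(task: Task) -> str:
    """Send simple bulk work to the local GPU; keep hard reasoning on the API."""
    if task.needs_multistep_reasoning or task.kind not in LOCAL_TASKS:
        return "claude-api"
    return "local-llama"

route(Task("tagging"))                                        # -> "local-llama"
route(Task("brief_analysis", needs_multistep_reasoning=True)) # -> "claude-api"
```

The useful property is that the routing rule is a whitelist: anything you have not explicitly validated on the local model defaults to the frontier API.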
There are hidden costs nobody flags: running your own inference means you own evaluation, drift, and the day the model misbehaves. Budget a week per quarter for this. But compared to an open-ended token bill attached to client growth, a fixed €40 electricity bill and one engineering week a quarter is a trade most studios would take.
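The quarterly eval chore can be as small as comparing current model output against a frozen golden set and flagging drift. A sketch for the tagging workload; the Jaccard threshold is an arbitrary placeholder:

```python
def drift_check(golden: dict[str, set[str]],
                current: dict[str, set[str]],
                min_jaccard: float = 0.8) -> list[str]:
    """Return ids of assets whose current tags drifted from the golden tags.

    Overlap is measured as Jaccard similarity (intersection over union);
    the 0.8 threshold is a placeholder to tune against your own data.
    """
    flagged = []
    for asset_id, gold_tags in golden.items():
        got = current.get(asset_id, set())
        union = gold_tags | got
        jaccard = len(gold_tags & got) / len(union) if union else 1.0
        if jaccard < min_jaccard:
            flagged.append(asset_id)
    return flagged
```

Run it after every model or prompt change; a growing flagged list is the early warning that buys back that engineering week.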
If you are running anything more than occasional AI in your workflow, the question is no longer whether to self-host at all. It is which 80% of your workload goes onto your own GPU, and which 20% stays on the frontier API.