How to architect AI for cost control
Four reference patterns. Pick the one that matches your stage; combine as you scale.
Pattern
Centralized AI gateway
When to use
10+ engineers, multiple providers, want one budget + audit log.
Tradeoff
A single chokepoint gives you great control at the cost of a modest latency penalty (5–20 ms).
Vendors
Tokmeter is the managed gateway — budgets, virtual keys, mini-tier routing, and audit log are built in. Air-gapped self-host is the only scenario where an OSS proxy is the right call.
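To make two of the gateway's jobs concrete, here is a minimal sketch of a model allowlist combined with mini-tier routing. It is an illustration only, not Tokmeter's implementation or API: the model names, the `Request` shape, and the 2,000-character threshold are hypothetical placeholders.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; substitute whatever your providers offer.
ALLOWED_MODELS = {"large-model", "mini-model"}

# Assumed rule: short prompts that ask for the large model are downgraded.
MINI_TIER_MAX_CHARS = 2_000


class ModelNotAllowed(Exception):
    """Raised when a client requests a model outside the allowlist."""


@dataclass(frozen=True)
class Request:
    team: str
    model: str
    prompt: str


def route(request: Request) -> str:
    """Return the model the gateway would actually call for this request."""
    if request.model not in ALLOWED_MODELS:
        raise ModelNotAllowed(request.model)
    if request.model == "large-model" and len(request.prompt) <= MINI_TIER_MAX_CHARS:
        return "mini-model"  # cheaper tier for small requests
    return request.model
```

Because every client goes through one `route` call, the allowlist and the downgrade rule live in a single place instead of being repeated in each tool's configuration.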
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│   Cursor    │   │ Claude Code │   │  Internal   │
│   client    │   │   client    │   │   app/API   │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ▼
               ┌────────────────────┐
               │  Tokmeter gateway  │ ← budgets, logging, caching,
               │   (managed SaaS)   │   routing, model allowlist
               └─────┬────────┬─────┘
                     │        │
               ┌─────▼──┐  ┌──▼──────┐
               │OpenAI  │  │Anthropic│
               └────────┘  └─────────┘
Pattern
Per-team virtual key fan-out
When to use
Need clean per-team accounting and independent budgets without a full gateway.
Tradeoff
Simpler than a gateway, but you lose cross-team routing and shared caches.
Vendors
Tokmeter issues per-team virtual keys with hard $ caps and live attribution. Native provider primitives (Anthropic workspaces, OpenAI projects) are a fallback when you can't introduce a gateway.
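The hard-cap behavior can be sketched as a small per-team ledger. This is a simplified model written for illustration, not how Tokmeter or any provider implements virtual keys; the `VirtualKey` class, team names, and cap amounts are all hypothetical.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a charge would push a key past its monthly cap."""


@dataclass
class VirtualKey:
    team: str
    monthly_cap_usd: float
    spent_usd: float = 0.0

    def charge(self, cost_usd: float) -> float:
        """Record a request's cost and return the remaining budget."""
        if cost_usd < 0:
            raise ValueError("cost must be non-negative")
        if self.spent_usd + cost_usd > self.monthly_cap_usd:
            # Hard cap: reject instead of letting the team overspend.
            raise BudgetExceeded(f"{self.team} would exceed ${self.monthly_cap_usd:.2f}")
        self.spent_usd += cost_usd
        return self.monthly_cap_usd - self.spent_usd


# One independent envelope per team; the caps here are made-up numbers.
keys = {
    "team-a": VirtualKey("team-a", monthly_cap_usd=500.0),
    "team-b": VirtualKey("team-b", monthly_cap_usd=200.0),
}
```

Each key tracks only its own spend, which is what makes the accounting clean and also why this pattern cannot share caches or route across teams.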
  ┌─────────────────────────┐
  │  Provider (Anthropic)   │
  └───┬─────┬─────┬─────┬───┘
      │     │     │     │
┌─────▼┐ ┌──▼─┐ ┌─▼──┐ ┌▼────┐
│ team │ │team│ │team│ │team │ ← Tokmeter virtual keys
│  A   │ │ B  │ │ C  │ │  D  │   with usage limits
└──────┘ └────┘ └────┘ └─────┘
    │       │     │       │
┌───▼───────▼─────▼───────▼──┐
│     Engineering teams      │
└────────────────────────────┘
Pattern
MCP server inventory & attribution
When to use
You've deployed MCP servers and need to know which ones drive spend.
Tradeoff
Adds a proxy hop, but it's the only way to attribute MCP-driven spend per server.
Vendors
Tokmeter MCP routing tags every tool call and enforces per-server budgets. No custom proxy to build.
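Once every tool call carries a server tag, attribution reduces to grouping the call log by that tag. The sketch below assumes a hypothetical log record shape (`headers`, `tool`, `tokens`); only the `x-mcp-server` tag name comes from this page, and none of this is Tokmeter's actual log format.

```python
from collections import defaultdict
from typing import Iterable

# Tag name as used on this page; the record fields are assumptions.
MCP_SERVER_HEADER = "x-mcp-server"


def spend_by_server(calls: Iterable[dict]) -> dict[str, dict[str, int]]:
    """Sum call counts and token usage per MCP server."""
    totals: dict[str, dict[str, int]] = defaultdict(lambda: {"calls": 0, "tokens": 0})
    for call in calls:
        # Untagged calls are grouped separately so they stay visible.
        server = call.get("headers", {}).get(MCP_SERVER_HEADER, "untagged")
        totals[server]["calls"] += 1
        totals[server]["tokens"] += call.get("tokens", 0)
    return dict(totals)


# Made-up sample log for illustration.
calls = [
    {"headers": {"x-mcp-server": "db"}, "tool": "query", "tokens": 1200},
    {"headers": {"x-mcp-server": "git"}, "tool": "diff", "tokens": 800},
    {"headers": {"x-mcp-server": "db"}, "tool": "query", "tokens": 300},
    {"headers": {}, "tool": "fetch", "tokens": 50},
]
report = spend_by_server(calls)
```

Keeping an explicit `untagged` bucket shows how much traffic bypasses the proxy, which is the spend you would otherwise be unable to attribute.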
     ┌──────────────┐
     │  AI client   │
     └──────┬───────┘
            │ MCP requests (tagged x-mcp-server)
            ▼
┌──────────────────────┐
│  Tokmeter MCP proxy  │ ← tags every tool call with server id,
└─┬──────┬──────┬──────┘   enforces per-MCP rate + token limits
  │      │      │
┌─▼─┐  ┌─▼─┐  ┌─▼─┐
│DB │  │Git│  │Web│ ← MCP servers
└───┘  └───┘  └───┘
  │      │      │
  └──────┴──────┴──→ logged with tool name, payload size, model used
Pattern
Observability stack
When to use
You can't optimize what you can't see. Day-one for any serious AI program.
Tradeoff
One tool, instrumented once. Spend, traces, evals, and chargeback in the same place.
Vendors
Tokmeter covers spend, traces, and evals on the same data. A second APM tool is only needed if you're already standardized on one and want a single pane there.
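As a rough illustration of what "spend and latency on the same data" means, the sketch below rolls a list of traces up into cost and SLO breaches per workflow. The trace fields, model names, and per-million-token prices are hypothetical; they are not real provider pricing and not Tokmeter's schema.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens; not real provider pricing.
PRICES = {
    "large-model": {"input": 10.0, "output": 30.0},
    "mini-model": {"input": 1.0, "output": 3.0},
}


def trace_cost_usd(trace: dict) -> float:
    """Price one trace from its token counts."""
    price = PRICES[trace["model"]]
    return (
        trace["input_tokens"] * price["input"]
        + trace["output_tokens"] * price["output"]
    ) / 1_000_000


def workflow_report(traces: list[dict], slo_ms: float) -> dict[str, dict]:
    """Roll traces up into total cost and SLO breaches per workflow."""
    report: dict[str, dict] = defaultdict(lambda: {"cost_usd": 0.0, "slo_breaches": 0})
    for trace in traces:
        row = report[trace["workflow"]]
        row["cost_usd"] += trace_cost_usd(trace)
        if trace["latency_ms"] > slo_ms:
            row["slo_breaches"] += 1
    return dict(report)


# Made-up traces for illustration.
traces = [
    {"workflow": "code-review", "model": "large-model",
     "input_tokens": 100_000, "output_tokens": 10_000, "latency_ms": 4200},
    {"workflow": "code-review", "model": "mini-model",
     "input_tokens": 200_000, "output_tokens": 0, "latency_ms": 900},
    {"workflow": "triage", "model": "mini-model",
     "input_tokens": 1_000_000, "output_tokens": 100_000, "latency_ms": 1500},
]
report = workflow_report(traces, slo_ms=3000)
```

Because cost and latency come from the same trace record, a chargeback figure and an SLO breach can be traced back to the same request.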
┌─────────────────────────────────────────┐
│         Tokmeter gateway / SDK          │
└──────────────┬──────────────────────────┘
               │ traces, prompts, costs, latencies
               ▼
        ┌──────────────┐
        │   Tokmeter   │── spend dashboards, evals, per-workflow cost
        └──────┬───────┘
               │
   ┌───────────┼───────────┐
   ▼           ▼           ▼
 Spend      Quality     Latency
dashboards   evals        SLOs
A recommended stack for a 100-engineer org
1. Tokmeter as the gateway. All AI traffic (IDE tools, internal apps, agents) flows through it. Single budget, single audit log, mini-tier routing on by default.
2. Virtual keys per team with monthly $ caps. Engineering, Data, and ML/AI each get their own envelope, so no one steps on anyone.
3. Tokmeter observability on the same data: traces, prompt versions, eval scoring, cost per workflow. No second tool to instrument.
4. Prompt caching turned on everywhere (Anthropic + OpenAI). Single-config win.
5. A "no direct provider URLs" network policy. Forces all traffic through the gateway. Without this, your governance is theater.
6. MCP proxy if you've adopted MCP. Tokmeter tags every tool call; otherwise spend attribution will lie to you.
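The "no direct provider URLs" policy is enforced at the network layer, but a configuration audit can catch violations before they reach the firewall. The sketch below is one possible check, not a Tokmeter feature: it flags any service whose configured base URL points straight at a provider's public API host. The gateway hostname and service names are hypothetical.

```python
from urllib.parse import urlparse

# Public API hosts for the two providers named in this guide.
DIRECT_PROVIDER_HOSTS = {"api.openai.com", "api.anthropic.com"}


def violates_policy(base_url: str) -> bool:
    """True if a configured base URL bypasses the gateway."""
    host = (urlparse(base_url).hostname or "").lower()
    return host in DIRECT_PROVIDER_HOSTS


def audit(configs: dict[str, str]) -> list[str]:
    """Return the names of services whose base URL points at a provider."""
    return sorted(name for name, url in configs.items() if violates_policy(url))


# Hypothetical service configs; gateway.example.internal is a placeholder.
configs = {
    "svc-a": "https://api.openai.com/v1",
    "svc-b": "https://gateway.example.internal/v1",
    "svc-c": "https://API.ANTHROPIC.COM",
}
offenders = audit(configs)
```

A check like this complements the network rule rather than replacing it: the audit tells teams what to fix, while the egress policy guarantees nothing slips through.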