Docs
Promptoll is OpenAI-compatible. If you know the OpenAI SDK, you know Promptoll.
Quickstart
1. Create an account
Sign up and create an API key from Console → API Keys.
2. Point the OpenAI SDK at Promptoll
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://promptoll.com/v1",
    api_key="zai_•••",
)
```
3. Pick a model — or let us pick
```python
client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "hi"}],
)
```
Routing modes
| Mode | Behavior |
| --- | --- |
| `auto` | Cheapest model in your catalog |
| `quality` | Highest quality tier available |
| `<model-id>` | Use this model. If arbitrage is enabled, we may swap it for a cheaper equivalent. |
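The routing rules above can be sketched as a local selection function. Note this is purely illustrative: Promptoll resolves routing server-side, and the `catalog` structure, `price` and `quality_tier` fields, and `choose_model` itself are assumptions for the sketch, not part of the API.

```python
# Hypothetical sketch of the routing table; the real resolution happens
# server-side, and all field names here are illustrative assumptions.
def choose_model(mode: str, catalog: list[dict]) -> str:
    """Resolve a routing mode to a concrete model id."""
    if mode == "auto":
        # Cheapest model in the catalog.
        return min(catalog, key=lambda m: m["price"])["id"]
    if mode == "quality":
        # Highest quality tier available.
        return max(catalog, key=lambda m: m["quality_tier"])["id"]
    # Anything else is treated as an explicit model id.
    return mode

# Made-up catalog entries for the sketch.
catalog = [
    {"id": "small-1", "price": 10, "quality_tier": 1},
    {"id": "big-9", "price": 90, "quality_tier": 3},
]
```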
Response headers
| Header | Meaning |
| --- | --- |
| `X-Model-Used` | Model that actually produced the response |
| `X-Latency-Ms` | Upstream + routing time in ms |
| `X-Cost-Microcents` | What we pay upstream (µ¢) |
| `X-Price-Microcents` | What you are charged (µ¢) |
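The two microcent headers make it easy to track spend per request. A microcent is one millionth of a cent, i.e. 10⁻⁸ dollars, so converting is a single division; the header values below are made-up examples, not real prices.

```python
def microcents_to_usd(microcents: int) -> float:
    # 1 microcent = 1e-6 cents = 1e-8 dollars.
    return microcents / 1e8

# Headers as they would appear on a response (values here are invented).
headers = {
    "X-Cost-Microcents": "125000",
    "X-Price-Microcents": "150000",
}
cost = microcents_to_usd(int(headers["X-Cost-Microcents"]))
price = microcents_to_usd(int(headers["X-Price-Microcents"]))
margin = price - cost  # what Promptoll keeps on this request, in USD
```

With the OpenAI Python SDK, the raw headers are reachable via the `with_raw_response` accessor, e.g. `client.chat.completions.with_raw_response.create(...)` and then the returned response's `.headers`.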
Streaming
Pass `stream=True` to receive the response as server-sent events (SSE). Promptoll forwards every chunk from the upstream model unchanged and bills on the final usage event.
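Consuming the stream looks the same as with the OpenAI SDK: iterate the chunks and concatenate the content deltas, skipping chunks that carry no choices (such as the final usage event). The accumulator below is a sketch; the `fake_chunk` objects only stand in for the SDK's chunk shape so the example runs offline.

```python
from types import SimpleNamespace

def collect_stream(chunks) -> str:
    """Concatenate the content deltas of a chat-completion stream."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # e.g. the final usage-only event
            continue
        delta = chunk.choices[0].delta
        if delta.content:  # role-only deltas have content=None
            parts.append(delta.content)
    return "".join(parts)

# Stand-ins mimicking the OpenAI streaming chunk shape (assumed for the demo).
def fake_chunk(text):
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )

chunks = [fake_chunk("Hel"), fake_chunk("lo"), fake_chunk(None)]
```

In real use, `chunks` would be the iterator returned by `client.chat.completions.create(..., stream=True)`.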