Setup
To get started, sign in to Baseten with Truss and then install the OpenAI SDK.
Sign in to Baseten:
uvx truss login --browser
Install the OpenAI SDK:
pip install openai
Pick the model you want to deploy. Each tab is a self-contained recipe.
deepseek-ai/DeepSeek-V4-Flash is a VERIFY-parameter MoE model (VERIFY active per token) with up to 128K context.
This preset serves DeepSeek V4 Flash on B200:4 with FP8 KV cache, the deep_gemm_mega_moe backend, expert parallelism, and MTP speculative decoding, tuned for low time-to-first-token.
Write the config
Create and move into the project directory:
mkdir deepseek-v4-flash-latency && cd deepseek-v4-flash-latency
Then create a file named config.yaml and paste the following:
model_name: "model:deepseek-v4-flash preset:latency"
model_metadata:
  example_model_input:
    messages:
      - role: user
        content: "What is the meaning of life?"
    stream: true
    model: deepseek-ai/DeepSeek-V4-Flash
    max_tokens: 32768
    temperature: 1.0
  tags:
    - openai-compatible
base_image:
  image: vllm/vllm-openai:v0.20.0
weights:
  - source: "hf://deepseek-ai/DeepSeek-V4-Flash@main"
    mount_location: "/models/deepseek-v4-flash"
    auth_secret_name: "hf_access_token"
resources:
  accelerator: B200:4
  use_gpu: true
runtime:
  predict_concurrency: 64
  health_checks:
    restart_check_delay_seconds: 1800
    restart_threshold_seconds: 1200
    stop_traffic_threshold_seconds: 120
environment_variables:
  HF_HUB_ENABLE_HF_TRANSFER: "1"
  VLLM_LOGGING_LEVEL: WARNING
  VLLM_ENGINE_READY_TIMEOUT_S: "3600"
  COMPILATION_CONFIG: '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'
secrets:
  hf_access_token: null
docker_server:
  start_command: >-
    sh -c "vllm serve /models/deepseek-v4-flash
    --served-model-name deepseek-ai/DeepSeek-V4-Flash
    --host 0.0.0.0
    --port 8000
    --trust-remote-code
    --kv-cache-dtype fp8
    --block-size 256
    --tensor-parallel-size 4
    --moe-backend deep_gemm_mega_moe
    --enable-expert-parallel
    --attention_config.use_fp4_indexer_cache=True
    --tokenizer-mode deepseek_v4
    --tool-call-parser deepseek_v4
    --enable-auto-tool-choice
    --reasoning-parser deepseek_v4
    --speculative_config.method mtp
    --speculative_config.num_speculative_tokens 2"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
The container mounts the DeepSeek V4 Flash weights at /models/deepseek-v4-flash and serves the OpenAI-compatible API on port 8000. The FP8 KV cache roughly halves KV cache memory relative to a 16-bit cache, the deep_gemm_mega_moe backend selects the MoE kernel, and the MTP speculator proposes two draft tokens per step, so each accepted draft token saves a decoding step.
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
| --kv-cache-dtype | fp8 | KV cache numeric precision. fp8: ~2× KV cache density with negligible quality impact on most models. |
| --block-size | 256 | KV cache block size in tokens for paged attention. Larger blocks reduce fragmentation overhead; smaller blocks pack short requests more tightly. |
| --tensor-parallel-size | 4 | Number of GPUs to shard the model across. |
| --moe-backend | deep_gemm_mega_moe | MoE expert dispatch kernel. Engine-specific values select between routing implementations tuned for different hardware or model layouts. |
| --enable-expert-parallel | (no value) | Use expert parallelism for MoE layers: each GPU holds a subset of whole experts instead of a tensor-parallel slice of every expert. |
| --attention_config.use_fp4_indexer_cache | True | Use the FP4 indexer cache path for attention, lowering KV cache memory at the cost of indexer precision. |
| --tokenizer-mode | deepseek_v4 | Selects a custom tokenizer implementation. Required for models that ship a non-standard tokenizer alongside the checkpoint. |
| --tool-call-parser | deepseek_v4 | Server-side parser that emits structured tool_calls on the response. |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --reasoning-parser | deepseek_v4 | Server-side parser that separates reasoning output into reasoning_content. |
| --speculative_config.method | mtp | Speculative decoding method. mtp: Multi-token prediction head speculation. |
| --speculative_config.num_speculative_tokens | 2 | Number of tokens the draft speculator proposes per step. |
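When you tune a preset, it can help to check which flags a start_command actually sets before pushing. The following is a minimal sketch using only the Python standard library; parse_flags is a hypothetical helper, not part of Truss or vLLM, and the command string is an abbreviated copy of the one in config.yaml.

```python
import shlex
from typing import Dict, Optional

# Abbreviated copy of the start_command from config.yaml (assumption: you
# paste your own command here; only a few flags are kept for brevity).
START_COMMAND = (
    "vllm serve /models/deepseek-v4-flash "
    "--served-model-name deepseek-ai/DeepSeek-V4-Flash "
    "--trust-remote-code "
    "--kv-cache-dtype fp8 "
    "--tensor-parallel-size 4 "
    "--attention_config.use_fp4_indexer_cache=True "
    "--speculative_config.num_speculative_tokens 2"
)


def parse_flags(command: str) -> Dict[str, Optional[str]]:
    """Map each --flag to its value; switches without a value map to None."""
    tokens = shlex.split(command)
    flags: Dict[str, Optional[str]] = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("--"):
            if "=" in tok:
                # Form: --name=value
                name, value = tok.split("=", 1)
                flags[name] = value
            elif i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
                # Form: --name value
                flags[tok] = tokens[i + 1]
                i += 1
            else:
                # Form: --switch
                flags[tok] = None
        i += 1
    return flags


flags = parse_flags(START_COMMAND)
for name, value in flags.items():
    print(f"{name:45} {value if value is not None else '(no value)'}")
```

The sketch only tokenizes the command; it does not validate that the engine accepts a given flag or value.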
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model deepseek-v4-flash-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
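If you script deployments, the model ID and base URL can be derived from the logs URL instead of copied by hand. This is a minimal sketch: both helper names are hypothetical, and the URL shapes are taken from the example output and the base_url used in this guide.

```python
import re

# Logs URL as printed in the example push output.
LOGS_URL = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"


def model_id_from_logs_url(url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    match = re.search(r"/models/([^/]+)/", url)
    if match is None:
        raise ValueError(f"no model ID found in {url!r}")
    return match.group(1)


def base_url_for(model_id: str) -> str:
    """Build the OpenAI-compatible base URL for the production environment."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


model_id = model_id_from_logs_url(LOGS_URL)
print(model_id)
print(base_url_for(model_id))
```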
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
import os
from openai import OpenAI

# Replace {model_id} in base_url with your model ID before running.
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)
print(response.choices[0].message.content)
Or send the same request with curl:
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
The server parses the model’s chain of thought into a separate reasoning_content field on the response. Read it alongside the final answer:
# Reuses the client created in the previous example.
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
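Because reasoning_content is not part of the standard OpenAI response schema, direct attribute access may fail on a response that does not carry it. The sketch below reads the field defensively. split_reasoning is a hypothetical helper, and the SimpleNamespace objects are stand-ins for response.choices[0].message with made-up text, so the example runs without a deployment.

```python
from types import SimpleNamespace


def split_reasoning(message):
    """Return (reasoning, answer); reasoning is None when the field is absent."""
    reasoning = getattr(message, "reasoning_content", None)
    return reasoning, message.content


# Stand-ins for response.choices[0].message (illustrative text only).
with_reasoning = SimpleNamespace(
    reasoning_content="Count the letters one by one.",
    content="Three.",
)
without_reasoning = SimpleNamespace(content="Three.")

print(split_reasoning(with_reasoning))
print(split_reasoning(without_reasoning))
```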
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
# JSON Schema description of a single function the model may call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
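The server only returns the tool call; running the function and sending the result back is up to the client. The following sketch shows one way to do that under the OpenAI chat format, where each call carries an id, a function name, and its arguments as a JSON string. The get_weather implementation, TOOL_REGISTRY, and run_tool_calls are hypothetical, and the SimpleNamespace object stands in for an entry of response.choices[0].message.tool_calls.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> str:
    # Hypothetical local implementation; returns canned text for illustration.
    return f"Sunny in {location}"


# Maps tool names advertised in the tools array to local callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls):
    """Execute each tool call and build the tool messages to send back."""
    results = []
    for call in tool_calls or []:  # tool_calls is None when no tool was chosen
        func = TOOL_REGISTRY[call.function.name]
        kwargs = json.loads(call.function.arguments)  # arguments are a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": func(**kwargs),
        })
    return results


# Stand-in for response.choices[0].message.tool_calls.
fake_calls = [
    SimpleNamespace(
        id="call_0",
        function=SimpleNamespace(
            name="get_weather",
            arguments='{"location": "Paris"}',
        ),
    )
]
tool_messages = run_tool_calls(fake_calls)
print(tool_messages)
```

To get a final answer, append the assistant message that contained the tool calls, followed by these tool messages, to the messages list and call client.chat.completions.create again.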
deepseek-ai/DeepSeek-V4-Pro is a VERIFY-parameter MoE model (VERIFY active per token) with up to 128K context.
This preset serves DeepSeek V4 Pro on B200:8 with FP8 KV cache, the deep_gemm_mega_moe backend, expert parallelism, and MTP speculative decoding, tuned for low time-to-first-token at full scale.
Write the config
Create and move into the project directory:
mkdir deepseek-v4-pro-latency && cd deepseek-v4-pro-latency
Then create a file named config.yaml and paste the following:
model_name: "model:deepseek-v4-pro preset:latency"
model_metadata:
  example_model_input:
    messages:
      - role: user
        content: "What is the meaning of life?"
    stream: true
    model: deepseek-ai/DeepSeek-V4-Pro
    max_tokens: 32768
    temperature: 1.0
  tags:
    - openai-compatible
base_image:
  image: vllm/vllm-openai:v0.20.0
weights:
  - source: "hf://deepseek-ai/DeepSeek-V4-Pro@main"
    mount_location: "/models/deepseek-v4-pro"
    auth_secret_name: "hf_access_token"
resources:
  accelerator: B200:8
  use_gpu: true
runtime:
  predict_concurrency: 64
  health_checks:
    restart_check_delay_seconds: 1800
    restart_threshold_seconds: 1200
    stop_traffic_threshold_seconds: 120
environment_variables:
  HF_HUB_ENABLE_HF_TRANSFER: "1"
  VLLM_LOGGING_LEVEL: WARNING
  VLLM_ENGINE_READY_TIMEOUT_S: "3600"
  COMPILATION_CONFIG: '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'
secrets:
  hf_access_token: null
docker_server:
  start_command: >-
    sh -c "vllm serve /models/deepseek-v4-pro
    --served-model-name deepseek-ai/DeepSeek-V4-Pro
    --host 0.0.0.0
    --port 8000
    --trust-remote-code
    --kv-cache-dtype fp8
    --block-size 256
    --tensor-parallel-size 8
    --moe-backend deep_gemm_mega_moe
    --enable-expert-parallel
    --attention_config.use_fp4_indexer_cache=True
    --tokenizer-mode deepseek_v4
    --tool-call-parser deepseek_v4
    --enable-auto-tool-choice
    --reasoning-parser deepseek_v4
    --speculative_config.method mtp
    --speculative_config.num_speculative_tokens 2"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
The container mounts the DeepSeek V4 Pro weights at /models/deepseek-v4-pro and serves the OpenAI-compatible API on port 8000. Tensor parallelism is set to 8, one shard per B200 GPU. The FP8 KV cache roughly halves KV cache memory relative to a 16-bit cache, the deep_gemm_mega_moe backend selects the MoE kernel, and the MTP speculator proposes two draft tokens per step.
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
| --kv-cache-dtype | fp8 | KV cache numeric precision. fp8: ~2× KV cache density with negligible quality impact on most models. |
| --block-size | 256 | KV cache block size in tokens for paged attention. Larger blocks reduce fragmentation overhead; smaller blocks pack short requests more tightly. |
| --tensor-parallel-size | 8 | Number of GPUs to shard the model across. |
| --moe-backend | deep_gemm_mega_moe | MoE expert dispatch kernel. Engine-specific values select between routing implementations tuned for different hardware or model layouts. |
| --enable-expert-parallel | (no value) | Use expert parallelism for MoE layers: each GPU holds a subset of whole experts instead of a tensor-parallel slice of every expert. |
| --attention_config.use_fp4_indexer_cache | True | Use the FP4 indexer cache path for attention, lowering KV cache memory at the cost of indexer precision. |
| --tokenizer-mode | deepseek_v4 | Selects a custom tokenizer implementation. Required for models that ship a non-standard tokenizer alongside the checkpoint. |
| --tool-call-parser | deepseek_v4 | Server-side parser that emits structured tool_calls on the response. |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --reasoning-parser | deepseek_v4 | Server-side parser that separates reasoning output into reasoning_content. |
| --speculative_config.method | mtp | Speculative decoding method. mtp: Multi-token prediction head speculation. |
| --speculative_config.num_speculative_tokens | 2 | Number of tokens the draft speculator proposes per step. |
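To get a feel for what num_speculative_tokens set to 2 can buy, here is a rough estimate of tokens produced per decoding step. It uses the standard geometric estimate from the speculative decoding literature and assumes each draft token is accepted independently with the same probability, which is a simplification. The acceptance rates below are illustrative values, not measurements of this model, and expected_tokens_per_step is a hypothetical helper.

```python
def expected_tokens_per_step(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per verification step.

    With acceptance probability a and k draft tokens, the estimate is
    1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a).
    The leading 1 is the token the target model always contributes.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if acceptance_rate == 1.0:
        # Every draft accepted: k drafts plus one token from the target model.
        return float(num_draft_tokens + 1)
    return (1.0 - acceptance_rate ** (num_draft_tokens + 1)) / (1.0 - acceptance_rate)


# Illustrative acceptance rates (assumptions, not benchmarks).
for rate in (0.5, 0.8, 0.9):
    print(rate, round(expected_tokens_per_step(rate, 2), 2))
```

Under this estimate, two draft tokens can never yield more than three tokens per step, so the real speedup depends on how often the drafts are accepted for your traffic.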
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model deepseek-v4-pro-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
import os
from openai import OpenAI

# Replace {model_id} in base_url with your model ID before running.
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)
print(response.choices[0].message.content)
Or send the same request with curl:
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
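The example_model_input in the config sets stream: true. With streaming enabled, OpenAI-compatible servers send the response as server-sent events: a sequence of data: lines that each carry a JSON chunk, ending with data: [DONE]. The sketch below parses such a stream with the standard library only. iter_content is a hypothetical helper, and the sample lines are hand-written so the example runs without a live deployment.

```python
import json
from typing import Iterable, Iterator


def iter_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by a chat-completions event stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample events (illustrative, not captured from a server).
SAMPLE_STREAM = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Machine "}}]}',
    'data: {"choices": [{"delta": {"content": "learning"}}]}',
    "data: [DONE]",
]

print("".join(iter_content(SAMPLE_STREAM)))
```

If you use the OpenAI SDK instead, passing stream=True to client.chat.completions.create returns an iterator of chunks, so this manual parsing is only needed for raw HTTP clients.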
The server parses the model’s chain of thought into a separate reasoning_content field on the response. Read it alongside the final answer:
# Reuses the client created in the previous example.
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
# JSON Schema description of a single function the model may call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)