DeepSeek V4 - Baseten

Reasoning, Tool calling, Agentic, Long context

Setup

To get started, sign into Baseten with Truss and then install the OpenAI SDK.

Sign in to Baseten

uvx truss login --browser

Install the OpenAI SDK

uv pip install openai

Pick the model you want to deploy. The Flash and Pro recipes below are each self-contained.

deepseek-ai/DeepSeek-V4-Flash is a VERIFY-parameter MoE model (VERIFY active per token) with up to 128K context. This preset serves DeepSeek V4 Flash on B200:4 with FP8 KV cache, the deep_gemm_mega_moe backend, expert parallelism, and MTP speculative decoding, tuned for low time-to-first-token.

Hardware: B200 × 4
Engine: vLLM 0.20.0
Context: 128K
Concurrency: 64 (the predict_concurrency value in the config below)

Write the config

Create and move into the project directory:

mkdir deepseek-v4-flash-latency && cd deepseek-v4-flash-latency

Then create a file named config.yaml and paste the following:

config.yaml

model_name: "model:deepseek-v4-flash preset:latency"

model_metadata:
  example_model_input:
    messages:
      - role: user
        content: "What is the meaning of life?"
    stream: true
    model: deepseek-ai/DeepSeek-V4-Flash
    max_tokens: 32768
    temperature: 1.0
  tags:
    - openai-compatible

base_image:
  image: vllm/vllm-openai:v0.20.0

weights:
  - source: "hf://deepseek-ai/DeepSeek-V4-Flash@main"
    mount_location: "/models/deepseek-v4-flash"
    auth_secret_name: "hf_access_token"

resources:
  accelerator: B200:4
  use_gpu: true

runtime:
  predict_concurrency: 64
  health_checks:
    restart_check_delay_seconds: 1800
    restart_threshold_seconds: 1200
    stop_traffic_threshold_seconds: 120

environment_variables:
  HF_HUB_ENABLE_HF_TRANSFER: "1"
  VLLM_LOGGING_LEVEL: WARNING
  VLLM_ENGINE_READY_TIMEOUT_S: "3600"
  COMPILATION_CONFIG: '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'

secrets:
  hf_access_token: null

docker_server:
  start_command: >-
    sh -c "vllm serve /models/deepseek-v4-flash
    --served-model-name deepseek-ai/DeepSeek-V4-Flash
    --host 0.0.0.0
    --port 8000
    --trust-remote-code
    --kv-cache-dtype fp8
    --block-size 256
    --tensor-parallel-size 4
    --moe-backend deep_gemm_mega_moe
    --enable-expert-parallel
    --attention_config.use_fp4_indexer_cache=True
    --tokenizer-mode deepseek_v4
    --tool-call-parser deepseek_v4
    --enable-auto-tool-choice
    --reasoning-parser deepseek_v4
    --speculative_config.method mtp
    --speculative_config.num_speculative_tokens 2"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000

The container mounts the DeepSeek V4 Flash weights at /models/deepseek-v4-flash and serves the OpenAI-compatible API on port 8000. The FP8 KV cache and the deep_gemm_mega_moe backend keep memory-bandwidth pressure in check, and the MTP speculator proposes two draft tokens per step so that a single verification pass can emit more than one token.
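As a rough way to reason about the two-draft-token setting, the sketch below estimates how many tokens one target-model step yields. It assumes a simplified model in which each draft token is accepted independently with the same probability and the first rejection discards the remaining drafts. The recipe does not state real acceptance rates, so `acceptance_rate` is a hypothetical input.

```python
def expected_tokens_per_step(acceptance_rate: float, num_speculative_tokens: int = 2) -> float:
    """Expected tokens emitted per target-model step under a simplified model.

    Assumption: each draft token is accepted independently with probability
    `acceptance_rate`, and the first rejection discards the remaining drafts.
    The target model always contributes one token of its own.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    # 1 (target token) + a + a^2 + ... + a^k
    return sum(acceptance_rate ** i for i in range(num_speculative_tokens + 1))


# Hypothetical acceptance rates, chosen only to show the shape of the curve.
for rate in (0.0, 0.5, 1.0):
    print(f"acceptance={rate:.1f} -> {expected_tokens_per_step(rate):.2f} tokens/step")
```

With two draft tokens the estimate ranges from 1 token per step (no draft accepted) to 3 (both accepted), which is why the benefit depends on how predictable the output is.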

Flags

The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

Flag	Value	What it does
`--trust-remote-code`	(no value)	Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).
`--kv-cache-dtype`	`fp8`	KV cache numeric precision. fp8: ~2× KV cache density with negligible quality impact on most models.
`--block-size`	`256`	KV cache block size in tokens for paged attention. Larger blocks reduce fragmentation overhead; smaller blocks pack short requests more tightly.
`--tensor-parallel-size`	`4`	Number of GPUs to shard the model across.
`--moe-backend`	`deep_gemm_mega_moe`	MoE expert dispatch kernel. Engine-specific values select between routing implementations tuned for different hardware or model layouts.
`--enable-expert-parallel`	(no value)	Use expert parallelism for MoE layers: place whole experts on different GPUs instead of splitting every expert's weights across tensor-parallel ranks.
`--attention_config.use_fp4_indexer_cache`	`True`	Use the FP4 indexer cache path for attention, lowering KV cache memory at the cost of indexer precision.
`--tokenizer-mode`	`deepseek_v4`	Selects a custom tokenizer implementation. Required for models that ship a non-standard tokenizer alongside the checkpoint.
`--tool-call-parser`	`deepseek_v4`	Server-side parser that emits structured `tool_calls` on the response.
`--enable-auto-tool-choice`	(no value)	Let the model choose when to call tools without requiring `tool_choice: "required"`.
`--reasoning-parser`	`deepseek_v4`	Server-side parser that separates reasoning output into `reasoning_content`.
`--speculative_config.method`	`mtp`	Speculative decoding method. mtp: Multi-token prediction head speculation.
`--speculative_config.num_speculative_tokens`	`2`	Number of tokens the draft speculator proposes per step.
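To make the `--block-size` trade-off concrete, this sketch counts how many 256-token blocks a single request occupies and how many token slots sit unused in its last block. It assumes that "128K" means 131,072 tokens and that a sequence is allocated whole blocks; it ignores any engine-specific cache layout.

```python
import math

BLOCK_SIZE = 256          # from --block-size
MAX_CONTEXT = 128 * 1024  # assumption: "128K" means 131,072 tokens


def blocks_needed(num_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of KV cache blocks a sequence of `num_tokens` occupies."""
    return math.ceil(num_tokens / block_size)


def wasted_slots(num_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Unused token slots in the last, partially filled block."""
    return blocks_needed(num_tokens, block_size) * block_size - num_tokens


print(blocks_needed(MAX_CONTEXT))             # prints 512
print(blocks_needed(300), wasted_slots(300))  # prints 2 212
```

A 300-token request leaves 212 slots idle at this block size, which is the "smaller blocks pack short requests more tightly" side of the trade-off.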

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model deepseek-v4-flash-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678

Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
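If you capture the push output in a script, a small helper can pull the model ID out of the logs URL. This is a minimal sketch that assumes the URL keeps the /models/<id>/logs/<deployment> shape shown in the example output; the function name is made up for illustration.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a Baseten logs URL."""
    segments = [s for s in urlparse(logs_url).path.split("/") if s]
    try:
        return segments[segments.index("models") + 1]
    except (ValueError, IndexError):
        raise ValueError(f"no model ID found in {logs_url!r}") from None


# Uses the example URL from the sample output above.
print(model_id_from_logs_url("https://app.baseten.co/models/abcd1234/logs/wxyz5678"))  # prints abcd1234
```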

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set, then call your deployment to run inference:

Python (main.py)

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

cURL

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
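Note that {model_id} in the snippets above is literal placeholder text, not Python string formatting, so pasting the URL unchanged would send requests to a host that does not exist. One way to avoid that is to build the base URL from a variable, as in this sketch. The environment variable name BASETEN_MODEL_ID is an assumption made for this example, not something Baseten defines.

```python
import os

BASE_URL_TEMPLATE = "https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


def base_url_for(model_id: str) -> str:
    """Fill the {model_id} placeholder in the endpoint shown above."""
    if not model_id or "{" in model_id or "}" in model_id:
        raise ValueError("model_id must be the real ID, not the placeholder")
    return BASE_URL_TEMPLATE.format(model_id=model_id)


# BASETEN_MODEL_ID is a hypothetical variable name chosen for this sketch;
# "abcd1234" is the example ID from the sample push output.
model_id = os.environ.get("BASETEN_MODEL_ID", "abcd1234")
print(base_url_for(model_id))
```

The returned string can be passed as `base_url` when constructing the `OpenAI` client.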

The server parses the model’s chain of thought into a separate reasoning_content field on the response. Read it alongside the final answer:

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)            # final answer
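Because reasoning_content is a server-added field rather than one the OpenAI SDK declares, it is safer to read it defensively in case a response arrives without it. The sketch below shows one way to do that; it uses stand-in objects shaped like `response.choices[0].message` instead of a live call, and the sample strings are not real model output.

```python
from types import SimpleNamespace


def split_reasoning(message) -> tuple[str, str]:
    """Return (reasoning, answer), tolerating a missing or null reasoning field."""
    reasoning = getattr(message, "reasoning_content", None) or ""
    answer = getattr(message, "content", None) or ""
    return reasoning, answer


# Stand-ins shaped like response.choices[0].message; the text is illustrative only.
with_reasoning = SimpleNamespace(reasoning_content="Count each letter.", content="Three.")
without_reasoning = SimpleNamespace(content="Three.")

print(split_reasoning(with_reasoning))     # prints ('Count each letter.', 'Three.')
print(split_reasoning(without_reasoning))  # prints ('', 'Three.')
```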

To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
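Each entry in tool_calls carries a function name and its arguments as a JSON string, so your code still has to decode the arguments and run the function. The sketch below shows that step under the OpenAI tool-call shape. The `get_weather` implementation and the `LOCAL_TOOLS` registry are hypothetical, and the example call is a stand-in object, not output from the model.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> str:
    """Hypothetical local implementation of the get_weather tool."""
    return f"Weather lookup for {location} is not implemented in this sketch."


# Maps tool names from the `tools` array to local callables (illustrative registry).
LOCAL_TOOLS = {"get_weather": get_weather}


def run_tool_call(tool_call) -> str:
    """Decode one tool call's JSON arguments and invoke the matching function."""
    name = tool_call.function.name
    if name not in LOCAL_TOOLS:
        raise KeyError(f"model requested unknown tool {name!r}")
    arguments = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string
    return LOCAL_TOOLS[name](**arguments)


# Stand-in shaped like one entry of response.choices[0].message.tool_calls.
example_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
)
print(run_tool_call(example_call))
```

Looking the name up in an explicit registry, rather than calling whatever the model names, keeps the set of executable functions under your control.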

deepseek-ai/DeepSeek-V4-Pro is a VERIFY-parameter MoE model (VERIFY active per token) with up to 128K context. This preset serves DeepSeek V4 Pro on B200:8 with FP8 KV cache, the deep_gemm_mega_moe backend, expert parallelism, and MTP speculative decoding, tuned for low time-to-first-token at full scale.

Hardware: B200 × 8
Engine: vLLM 0.20.0
Context: 128K
Concurrency: 64 (the predict_concurrency value in the config below)

Write the config

Create and move into the project directory:

mkdir deepseek-v4-pro-latency && cd deepseek-v4-pro-latency

Then create a file named config.yaml and paste the following:

config.yaml

model_name: "model:deepseek-v4-pro preset:latency"

model_metadata:
  example_model_input:
    messages:
      - role: user
        content: "What is the meaning of life?"
    stream: true
    model: deepseek-ai/DeepSeek-V4-Pro
    max_tokens: 32768
    temperature: 1.0
  tags:
    - openai-compatible

base_image:
  image: vllm/vllm-openai:v0.20.0

weights:
  - source: "hf://deepseek-ai/DeepSeek-V4-Pro@main"
    mount_location: "/models/deepseek-v4-pro"
    auth_secret_name: "hf_access_token"

resources:
  accelerator: B200:8
  use_gpu: true

runtime:
  predict_concurrency: 64
  health_checks:
    restart_check_delay_seconds: 1800
    restart_threshold_seconds: 1200
    stop_traffic_threshold_seconds: 120

environment_variables:
  HF_HUB_ENABLE_HF_TRANSFER: "1"
  VLLM_LOGGING_LEVEL: WARNING
  VLLM_ENGINE_READY_TIMEOUT_S: "3600"
  COMPILATION_CONFIG: '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'

secrets:
  hf_access_token: null

docker_server:
  start_command: >-
    sh -c "vllm serve /models/deepseek-v4-pro
    --served-model-name deepseek-ai/DeepSeek-V4-Pro
    --host 0.0.0.0
    --port 8000
    --trust-remote-code
    --kv-cache-dtype fp8
    --block-size 256
    --tensor-parallel-size 8
    --moe-backend deep_gemm_mega_moe
    --enable-expert-parallel
    --attention_config.use_fp4_indexer_cache=True
    --tokenizer-mode deepseek_v4
    --tool-call-parser deepseek_v4
    --enable-auto-tool-choice
    --reasoning-parser deepseek_v4
    --speculative_config.method mtp
    --speculative_config.num_speculative_tokens 2"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000

The container mounts the DeepSeek V4 Pro weights at /models/deepseek-v4-pro and serves the OpenAI-compatible API on port 8000. Tensor parallelism is set to 8 to span all eight B200 GPUs, the FP8 KV cache and the deep_gemm_mega_moe backend keep memory-bandwidth pressure in check, and the MTP speculator proposes two draft tokens per step.
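In this recipe the GPU count in `resources.accelerator` and the `--tensor-parallel-size` flag are set to the same number, and they are easy to leave out of sync when adapting the config. The following sketch is a pre-push sanity check over the two strings. It is not part of Truss or vLLM; the helper names are made up, and it assumes an accelerator string without a `:N` suffix means one GPU.

```python
import re


def gpu_count(accelerator: str) -> int:
    """Parse the GPU count from an accelerator string such as 'B200:8'."""
    _name, _sep, count = accelerator.partition(":")
    return int(count) if count else 1  # assumption: no suffix means one GPU


def tensor_parallel_size(start_command: str) -> int:
    """Read --tensor-parallel-size from a start command; treat a missing flag as 1."""
    match = re.search(r"--tensor-parallel-size[=\s]+(\d+)", start_command)
    return int(match.group(1)) if match else 1


def check_parallelism(accelerator: str, start_command: str) -> int:
    """Return the shared GPU count, or raise if the two settings disagree."""
    gpus = gpu_count(accelerator)
    tp = tensor_parallel_size(start_command)
    if gpus != tp:
        raise ValueError(f"accelerator has {gpus} GPUs but --tensor-parallel-size is {tp}")
    return gpus


# Abbreviated start command; only the flag being checked matters here.
command = "vllm serve /models/deepseek-v4-pro --tensor-parallel-size 8 --enable-expert-parallel"
print(check_parallelism("B200:8", command))  # prints 8
```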

Flags

The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

Flag	Value	What it does
`--trust-remote-code`	(no value)	Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).
`--kv-cache-dtype`	`fp8`	KV cache numeric precision. fp8: ~2× KV cache density with negligible quality impact on most models.
`--block-size`	`256`	KV cache block size in tokens for paged attention. Larger blocks reduce fragmentation overhead; smaller blocks pack short requests more tightly.
`--tensor-parallel-size`	`8`	Number of GPUs to shard the model across.
`--moe-backend`	`deep_gemm_mega_moe`	MoE expert dispatch kernel. Engine-specific values select between routing implementations tuned for different hardware or model layouts.
`--enable-expert-parallel`	(no value)	Use expert parallelism for MoE layers: place whole experts on different GPUs instead of splitting every expert's weights across tensor-parallel ranks.
`--attention_config.use_fp4_indexer_cache`	`True`	Use the FP4 indexer cache path for attention, lowering KV cache memory at the cost of indexer precision.
`--tokenizer-mode`	`deepseek_v4`	Selects a custom tokenizer implementation. Required for models that ship a non-standard tokenizer alongside the checkpoint.
`--tool-call-parser`	`deepseek_v4`	Server-side parser that emits structured `tool_calls` on the response.
`--enable-auto-tool-choice`	(no value)	Let the model choose when to call tools without requiring `tool_choice: "required"`.
`--reasoning-parser`	`deepseek_v4`	Server-side parser that separates reasoning output into `reasoning_content`.
`--speculative_config.method`	`mtp`	Speculative decoding method. mtp: Multi-token prediction head speculation.
`--speculative_config.num_speculative_tokens`	`2`	Number of tokens the draft speculator proposes per step.

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model deepseek-v4-pro-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678

Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set, then call your deployment to run inference:

Python (main.py)

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

cURL

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
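If you would rather not install the OpenAI SDK, the same request can be assembled with the Python standard library. This sketch mirrors the URL, headers, and JSON body of the cURL call above and only builds the request object; the model ID and API key passed in at the bottom are placeholders.

```python
import json
import urllib.request


def build_chat_request(model_id: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that the cURL example issues."""
    url = (
        f"https://model-{model_id}.api.baseten.co"
        "/environments/production/sync/v1/chat/completions"
    )
    payload = {
        "model": "deepseek-ai/DeepSeek-V4-Pro",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Api-Key {api_key}",
        },
        method="POST",
    )


# Placeholder ID and key; substitute your own before sending.
request = build_chat_request("abcd1234", "placeholder-key", "What is machine learning?")
print(request.get_method(), request.full_url)
```

To send it, pass the object to `urllib.request.urlopen(request)` and decode the JSON response body.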

The server parses the model’s chain of thought into a separate reasoning_content field on the response. Read it alongside the final answer:

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)            # final answer

To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
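After you run a tool locally, the usual OpenAI-style pattern is to send the result back so the model can write its final answer. The sketch below builds that follow-up message list from plain dictionaries. It assumes the server accepts the standard assistant and tool message shapes; the call ID and the tool result text are placeholders, not real output.

```python
import json


def append_tool_result(messages: list[dict], tool_call: dict, result: str) -> list[dict]:
    """Return a new message list with the assistant's tool call and its result appended."""
    return messages + [
        {"role": "assistant", "content": None, "tool_calls": [tool_call]},
        {"role": "tool", "tool_call_id": tool_call["id"], "content": result},
    ]


history = [{"role": "user", "content": "What's the weather in Paris?"}]

# Placeholder tool call in the shape the server returns; "call_0" is a made-up ID.
call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_weather", "arguments": json.dumps({"location": "Paris"})},
}

follow_up = append_tool_result(history, call, "placeholder weather result")
print([m["role"] for m in follow_up])  # prints ['user', 'assistant', 'tool']
```

Passing `follow_up` as `messages` in a second `client.chat.completions.create` call, with the same `tools` array, lets the model answer using the tool output.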
