Deploy DeepSeek R1

DeepSeek R1 is a 671B parameter mixture-of-experts model optimized for reasoning tasks. It performs well on math, code generation, and multi-step logic problems. On Baseten, you can deploy it with a config-only Truss using the BIS-LLM engine.

Deploy DeepSeek R1

Deploy DeepSeek R1 using a config.yaml that specifies the BIS-LLM inference stack. The model requires an 8-GPU node of H100 or B200 accelerators.
model_name: deepseek-r1
resources:
  accelerator: H100:8
  cpu: '8'
  memory: 80Gi
  use_gpu: true
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: "deepseek-ai/DeepSeek-R1"
    quantization_type: fp8
  runtime:
    max_seq_len: 131072
    max_batch_size: 128
    tensor_parallel_size: 8
    enable_chunked_prefill: true
    served_model_name: "deepseek-r1"
Deploy the model using the Truss CLI:
truss push

Run inference

DeepSeek R1 on BIS-LLM provides an OpenAI-compatible API. You can use the standard OpenAI Python SDK or cURL to make requests.
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    # Replace {MODEL_ID} with your deployed model's ID from the Baseten dashboard.
    base_url="https://model-{MODEL_ID}.api.baseten.co/environments/production/sync/v1"
)

response = client.chat.completions.create(
    model="deepseek-r1",
    messages=[
        {"role": "user", "content": "Explain the concept of quantum entanglement."}
    ],
    max_tokens=1024
)

print(response.choices[0].message.content)
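The same endpoint can be called with cURL. `{MODEL_ID}` is a placeholder for your deployed model's ID, and requests are authenticated with Baseten's `Api-Key` authorization header:

```shell
curl -X POST "https://model-{MODEL_ID}.api.baseten.co/environments/production/sync/v1/chat/completions" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1",
    "messages": [
      {"role": "user", "content": "Explain the concept of quantum entanglement."}
    ],
    "max_tokens": 1024
  }'
```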

Configuration and tuning

DeepSeek R1 is a Mixture of Experts (MoE) model: of its 671B total parameters, only about 37B are active for any given token. This architecture provides massive model capacity with relatively efficient inference, since each forward pass touches only the experts the router selects.
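The routing idea can be sketched in a few lines of NumPy. This is an illustrative top-k gating toy, not DeepSeek's actual router (which uses far more experts, shared experts, and its own load-balancing scheme); the point is that only `top_k` of the `n_experts` weight matrices are touched per token:

```python
import numpy as np

rng = np.random.default_rng(0)

n_experts, top_k, d_model = 8, 2, 16
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
gate_w = rng.standard_normal((d_model, n_experts))

def moe_forward(x):
    """Route a single token through its top-k experts only."""
    logits = x @ gate_w
    topk = np.argsort(logits)[-top_k:]        # indices of the k highest-scoring experts
    weights = np.exp(logits[topk])
    weights /= weights.sum()                  # softmax over the selected experts only
    # Only top_k expert matrices are multiplied -- the source of MoE's efficiency.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, topk))

token = rng.standard_normal(d_model)
out = moe_forward(token)
print(out.shape)  # (16,)
```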

Hardware and quantization

We recommend deploying DeepSeek R1 on an 8-GPU H100 node with FP8 quantization. This provides a good balance between inference speed and model quality. For even higher performance, you can use B200 GPUs with FP4 quantization, which significantly reduces memory usage and increases throughput.

Context window

The model supports up to a 128K-token context window (max_seq_len: 131072). When configuring max_seq_len, ensure your hardware has enough free GPU memory for the KV cache at your expected batch size and sequence length.
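A rough sizing sketch using the standard per-token KV formula for multi-head attention (2 × layers × KV heads × head dim × bytes per element). The layer and head counts below are illustrative assumptions, not DeepSeek R1's actual attention layout: R1 uses Multi-head Latent Attention (MLA), which compresses the KV cache well below this estimate. The scaling with batch size and sequence length, however, is the same:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=1):
    """KV cache size in GB for standard MHA (one K and one V entry per layer)."""
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return elems * bytes_per_elem / 1e9

# Hypothetical config, FP8 cache (1 byte per element), full 128K context:
for batch in (1, 32, 128):
    gb = kv_cache_gb(n_layers=61, n_kv_heads=128, head_dim=128,
                     seq_len=131072, batch_size=batch)
    print(f"batch={batch:3d}: {gb:,.0f} GB")
```

The quadratic-looking blowup as batch and sequence length grow together is why large-batch, long-context serving depends on KV cache compression techniques like MLA.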