Deploy DeepSeek R1
Deploy DeepSeek R1 using a `config.yaml` that specifies the BIS-LLM inference stack. The model requires an 8-GPU node of H100 or B200 accelerators.
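As a minimal sketch, a `config.yaml` for this deployment might look like the following. The field names and values here are illustrative assumptions, not the exact BIS-LLM schema; consult the BIS-LLM documentation for the real keys:

```yaml
# Illustrative sketch only -- these keys are assumptions, not the
# exact BIS-LLM config schema.
model_name: deepseek-r1
resources:
  accelerator: H100:8      # 8-GPU H100 node; B200:8 for FP4 deployments
inference_stack:
  engine: bis-llm          # assumed engine identifier
  quantization: fp8        # fp4 on B200
  max_seq_len: 131072      # up to 128k context
```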
Run inference
DeepSeek R1 on BIS-LLM provides an OpenAI-compatible API, so you can make requests with the standard OpenAI Python SDK or with cURL.
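As a sketch of the request format, the snippet below builds an OpenAI-compatible chat-completion payload. The endpoint URL and API key are hypothetical placeholders for your deployment's values, and `deepseek-r1` is an assumed model identifier:

```python
import json

# Hypothetical placeholders -- substitute your deployment's actual
# endpoint URL and API key.
BASE_URL = "https://your-deployment.example.com/v1"
API_KEY = "YOUR_API_KEY"

def build_chat_request(prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-compatible chat-completion request body."""
    return {
        "model": "deepseek-r1",  # assumed model name for this deployment
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("Why is the sky blue?")
print(json.dumps(payload, indent=2))

# To send it, POST to the chat completions route, e.g. with requests:
#   requests.post(f"{BASE_URL}/chat/completions",
#                 headers={"Authorization": f"Bearer {API_KEY}"},
#                 json=payload)
```

The same payload works verbatim as the `-d` body of a cURL request against `{BASE_URL}/chat/completions`.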
Configuration and tuning
DeepSeek R1 is a Mixture of Experts (MoE) model: only a subset of its 671B parameters is active for any given token. This architecture allows for massive capacity with relatively efficient inference.

Hardware and quantization
We recommend deploying DeepSeek R1 on an 8-GPU H100 node with FP8 quantization, which offers a good balance between inference speed and model quality. For even higher performance, you can use B200 GPUs with FP4 quantization, which significantly reduces memory usage and increases throughput.

Context window

The model supports up to 128k tokens of context. When configuring `max_seq_len`, ensure your hardware has sufficient KV cache memory to support your expected batch size and sequence length.
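The quantization and KV cache sizing above can be sanity-checked with back-of-the-envelope arithmetic. The figures below are rough assumptions for intuition only, not exact BIS-LLM numbers; in particular, DeepSeek R1 uses Multi-head Latent Attention (MLA), which compresses the KV cache well below the generic estimate shown, so treat it as an upper bound:

```python
# Rough sizing arithmetic -- illustrative assumptions, not engine-exact.

PARAMS = 671e9  # total parameter count (671B)

def weight_memory_gb(bytes_per_param: float) -> float:
    """Approximate weight memory, ignoring activations and engine overhead."""
    return PARAMS * bytes_per_param / 1e9

fp8_gb = weight_memory_gb(1.0)   # FP8: 1 byte per parameter
fp4_gb = weight_memory_gb(0.5)   # FP4: half a byte per parameter
print(f"FP8 weights: ~{fp8_gb:.0f} GB, FP4 weights: ~{fp4_gb:.0f} GB")

# Generic per-token KV cache for standard attention:
#   2 (K and V) * layers * kv_heads * head_dim * bytes_per_value.
# Values below are illustrative; MLA compresses this substantially.
layers, kv_heads, head_dim, kv_bytes = 61, 128, 128, 1
per_token_kib = 2 * layers * kv_heads * head_dim * kv_bytes / 1024
print(f"Upper-bound KV cache per token: ~{per_token_kib:.0f} KiB")
```

Multiplying the per-token figure by `max_seq_len` and batch size shows why long contexts at high concurrency can exhaust GPU memory even when the weights fit.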
Related
- Model APIs — Instant access to DeepSeek R1 without dedicated infrastructure.
- BIS-LLM documentation — Learn more about the engine powering this deployment.
- Truss documentation — General guide to packaging and deploying models.