- Llama 3.0 and later (including DeepSeek-R1 distills)
- Qwen 2.5 and later (including Math, Coder, and DeepSeek-R1 distills)
- Mistral (all LLMs)
The Engine-Builder does not support vision-language models like Llama 3.2 11B or Pixtral. For these models, we recommend vLLM.
Example: Deploy Qwen 2.5 3B on an H100
This configuration builds an inference engine to serve Qwen 2.5 3B on an H100 GPU. Running this model is fast and cheap, which makes it a good example for documentation, and the deployment process is very similar for larger models like GLM-4.7.
Setup
This guide uses uvx to run Truss commands without a separate install step. Sign in to Baseten and install the OpenAI SDK. Browser login opens a tab to approve this device, so there’s no API key to copy and paste.
Sign in to Baseten
Install the OpenAI SDK
Hugging Face access for gated models. Some models require that you accept terms and conditions on Hugging Face before deployment. To prevent issues:
- Accept the license for any gated models you wish to access, like Gemma 3.
- Create a read-only user access token from your Hugging Face account.
- Add the hf_access_token secret to your Baseten workspace.
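Before adding the token as a secret, it can help to confirm that it actually grants access to the gated model. The following is a minimal sketch, not part of the Baseten tooling: it assumes the token is exported locally as HF_TOKEN, uses google/gemma-3-4b-it purely as an example repository, and relies on the Hugging Face file URL pattern /resolve/main/config.json. The helper names are hypothetical, and the status-code interpretation reflects typical Hub behavior rather than a guarantee.

```python
import os
import urllib.error
import urllib.request

# Assumed URL pattern for fetching a file from a Hugging Face model repo.
HF_RESOLVE_URL = "https://huggingface.co/{repo}/resolve/main/config.json"


def build_check_request(repo: str, token: str) -> urllib.request.Request:
    """Build a HEAD request for the repo's config.json, authenticated with the token."""
    return urllib.request.Request(
        HF_RESOLVE_URL.format(repo=repo),
        headers={"Authorization": f"Bearer {token}"},
        method="HEAD",
    )


def describe_status(status: int) -> str:
    """Map an HTTP status code to a hint (typical Hub behavior, not guaranteed)."""
    if status == 200:
        return "ok: the token can read this repository"
    if status == 401:
        return "unauthorized: the token is missing or invalid"
    if status == 403:
        return "forbidden: accept the model license on Hugging Face first"
    if status == 404:
        return "not found: check the repository name"
    return f"unexpected status {status}"


def check_access(repo: str, token: str) -> str:
    """Send the request and describe the outcome."""
    try:
        with urllib.request.urlopen(build_check_request(repo, token), timeout=10) as resp:
            return describe_status(resp.status)
    except urllib.error.HTTPError as err:
        return describe_status(err.code)


if __name__ == "__main__":
    # HF_TOKEN is an assumed local environment variable name.
    token = os.environ.get("HF_TOKEN")
    if token is None:
        print("Set HF_TOKEN to a read-only Hugging Face access token first.")
    else:
        # Example gated repository; replace with the model you plan to deploy.
        print(check_access("google/gemma-3-4b-it", token))
```

If the check reports a forbidden status, accepting the license on the model page and rerunning the script is usually enough; the same token can then be stored as the hf_access_token secret.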
Configuration
Start with an empty configuration file.
config.yaml