- Llama 3.0 and later (including DeepSeek-R1 distills)
- Qwen 2.5 and later (including Math, Coder, and DeepSeek-R1 distills)
- Mistral (all LLMs)
The Engine Builder does not support vision-language models like Llama 3.2 11B or Pixtral. For these models, we recommend vLLM.
Example: Deploy Qwen 2.5 3B on an H100
This configuration builds an inference engine to serve Qwen 2.5 3B on an H100 GPU. Running this model is fast and cheap, which makes it a good documentation example, and the deployment process is very similar for larger models like Llama 3.3 70B.
Setup
Before you deploy a model, you’ll need to complete three quick setup steps.
1
Create an API key for your Baseten account
Create an API key and save it as an environment variable:
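A minimal sketch of this step, assuming a POSIX-style shell and assuming the variable is named BASETEN_API_KEY; the value shown is a placeholder, not a real key:

```shell
# Assumption: BASETEN_API_KEY is the variable name your tooling reads.
# Replace the placeholder with the key created in your Baseten account.
export BASETEN_API_KEY="paste-your-api-key-here"
```

The export only lasts for the current shell session. To keep the key across sessions, add the same line to your shell profile.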
2
Add an access token for Hugging Face
Some models require that you accept terms and conditions on Hugging Face before deployment. To prevent issues:
- Accept the license for any gated models you wish to access, like Llama 3.3.
- Create a read-only user access token from your Hugging Face account.
- Add the hf_access_token secret to your Baseten workspace.
3
Install Truss in your local development environment
Install the latest version of Truss, our open-source model packaging framework, as well as OpenAI’s model inference SDK, with:
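The install can be sketched as a single pip command, assuming pip targets the Python environment you plan to deploy from. Here truss and openai are the PyPI package names for Truss and OpenAI's SDK:

```shell
# Install or upgrade Truss and the OpenAI Python SDK in the active environment.
pip install --upgrade truss openai
```

The --upgrade flag makes sure an older Truss version that is already installed gets replaced with the latest release.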
Configuration
Start with an empty configuration file.
config.yaml