Example: Deploy Qwen 2.5 3B on an A10G via SGLang
This configuration serves Qwen 2.5 3B with SGLang on an A10G GPU. Running this model is fast and cheap, making it a good example for documentation, but the process of deploying it is very similar to that for larger models like Llama 3.3 70B.

Setup
Before you deploy a model, you'll need to complete three quick setup steps.

1. Create an API key for your Baseten account
Create an API key and save it as an environment variable:
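For example (assuming the conventional variable name `BASETEN_API_KEY`; replace the placeholder with the key from your Baseten dashboard):

```shell
# Store the Baseten API key in an environment variable
# so the Truss CLI and client scripts can pick it up
export BASETEN_API_KEY="paste-your-api-key-here"
```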
2. Add an access token for Hugging Face
Some models require that you accept terms and conditions on Hugging Face before deployment. To prevent issues:
- Accept the license for any gated models you wish to access, like Llama 3.3.
- Create a read-only user access token from your Hugging Face account.
- Add the `hf_access_token` secret to your Baseten workspace.
3. Install Truss in your local development environment
Install the latest version of Truss, our open-source model packaging framework, as well as OpenAI’s model inference SDK, with:
Configuration
Start with an empty configuration file named config.yaml. This file specifies everything Baseten needs to serve the model: the weights to load, the GPU resources, and the SGLang server settings.

config.yaml
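As a sketch of what the finished config.yaml might contain — this assumes Truss's custom Docker server options, and the SGLang image tag and launch flags shown here are illustrative rather than prescriptive:

```yaml
model_name: Qwen 2.5 3B SGLang
resources:
  accelerator: A10G
  use_gpu: true
secrets:
  hf_access_token: null
base_image:
  # Illustrative image tag; pin a specific SGLang version in practice
  image: lmsysorg/sglang:latest
docker_server:
  start_command: python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-3B-Instruct --host 0.0.0.0 --port 8000
  server_port: 8000
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
```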
Deployment
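From the directory containing config.yaml, the push is typically initiated with the Truss CLI, which will prompt for your Baseten API key if you are not already authenticated:

```shell
# Package the model and push it to your Baseten account
truss push
```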
Pushing the model to Baseten kicks off a multi-stage deployment process.

Inference
This model is OpenAI compatible and can be called using the OpenAI client.

call_model.py