Example: Deploy Qwen 2.5 3B on an L4
This configuration serves Qwen 2.5 3B with vLLM on an L4 GPU. The deployment process is the same for larger models like GLM-4.7. Adjust the `resources` and `start_command` to match your model's requirements.
Set up your environment
Before you deploy a model, you'll need to complete three setup steps.
Create an API key for your Baseten account
Create an API key and save it as an environment variable:
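A minimal sketch, assuming a POSIX shell; the variable name `BASETEN_API_KEY` is the one the client example at the end of this guide reads:

```sh
# Save your Baseten API key for later use by the Truss CLI and client code.
export BASETEN_API_KEY="YOUR_API_KEY"
```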
Add an access token for Hugging Face
Some models require that you accept terms and conditions on Hugging Face before deployment. To prevent issues:
- Accept the license for any gated models you wish to access, like Gemma 3.
- Create a read-only user access token from your Hugging Face account.
- Add the `hf_access_token` secret to your Baseten workspace.
Install Truss in your local development environment
Install Truss and the OpenAI SDK:
Use uv (recommended) or pip on macOS/Linux or Windows, as shown below.
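A sketch of the install commands, assuming the `truss` and `openai` packages from PyPI; on Windows, run the same pip command in PowerShell or CMD:

```sh
# With uv (recommended):
uv pip install --upgrade truss openai

# Or with pip (macOS/Linux; same command on Windows):
pip install --upgrade truss openai
```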
Configure the model
Create a directory with a `config.yaml` file:
config.yaml:
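Below is a minimal sketch of such a config, reconstructed from the description that follows. The image tag, Hugging Face repo ID, served model name, and health-check timing are assumptions; adjust them for your model:

```yaml
base_image:
  image: vllm/vllm-openai:latest  # vLLM Docker image (tag is an assumption)
model_name: Qwen 2.5 3B
resources:
  accelerator: L4
  use_gpu: true
model_cache:
  - repo_id: Qwen/Qwen2.5-3B-Instruct  # assumed Hugging Face repo
    use_volume: true
    volume_folder: qwen  # cached weights land in /app/model_cache/qwen
docker_server:
  # Load cached weights, then serve with an explicit model identifier.
  start_command: sh -c "truss-transfer-cli && vllm serve /app/model_cache/qwen --served-model-name qwen2.5-3b --port 8000"
  server_port: 8000
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
runtime:
  health_checks:
    restart_check_delay_seconds: 1800  # give vLLM time to load weights (assumed value)
```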
`base_image` specifies the vLLM Docker image. The `model_cache` pre-downloads the model from Hugging Face and stores it on a cached volume. At startup, `truss-transfer-cli` loads the cached weights into `/app/model_cache/qwen`, then vLLM serves the model with `--served-model-name` to set the model identifier for the OpenAI-compatible API. The `health_checks` give the server time to load the model before Baseten checks readiness.
Deploy the model
Push the model to Baseten to start the deployment:
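A sketch of the push step, assuming the Truss CLI; `--publish` creates a published deployment rather than a development one:

```sh
# Run from the directory that contains config.yaml.
truss push --publish
```

Call the model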
Call the deployed model with the OpenAI client:
call_model.py:
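A minimal sketch of the script, using the OpenAI SDK and the `BASETEN_API_KEY` variable from setup; the URL shown is a placeholder, and `model` must match `--served-model-name` in `config.yaml`:

```python
import os
from openai import OpenAI

# Replace model_url with the URL from your deployment output
# (the value below is a placeholder, not a real endpoint).
model_url = "https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1"

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url=model_url,
)

response = client.chat.completions.create(
    model="qwen2.5-3b",  # must match --served-model-name in config.yaml
    messages=[{"role": "user", "content": "What is an L4 GPU good for?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```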
Replace `model_url` with the URL from your deployment output.