config.yaml file. You point to a model on Hugging Face, choose a GPU, and Baseten builds a TensorRT-optimized container with an OpenAI-compatible API. No Python code, no Dockerfile, no container management.
This tutorial deploys Qwen 2.5 3B Instruct to a production-ready endpoint on an L4 GPU.
Install and sign in
Before you begin, sign up or sign in to Baseten, then install uv, a fast Python package manager. Install the Truss CLI and connect it to your Baseten account. Browser login opens a tab to approve this device, so there’s no API key to copy and paste.Install Truss
Sign in
Create the config
Create a project directory with aconfig.yaml:
config.yaml file with the following contents:
config.yaml
resources.accelerator: L4runs inference on a single L4 (24 GB VRAM).trt_llmswitches on Engine-Builder-LLM, which compiles the model with TensorRT-LLM.checkpoint_repositorypoints to weights on Hugging Face. Qwen 2.5 3B Instruct is ungated, so no token is needed.quantization_type: fp8halves weight memory by quantizing to 8-bit floats.num_builder_gpus: 1sets the GPU count for the engine-build job. Without it, the CLI warns that FP8 builds can OOM at build time.
Deploy
Push to Baseten:/models/ (for example, abc1d2ef). You’ll need this to call the model’s API. You can also find it in your Baseten dashboard.
Baseten now downloads the model weights, compiles them with TensorRT-LLM, and deploys the resulting container to an L4 GPU. You can watch progress in the logs linked above. When the deployment status shows “Active” in the dashboard, it’s ready for requests.
Call your model
Engine-based deployments serve an OpenAI-compatible API, so any code that works with the OpenAI SDK works with your model. Replace{model_id} with your model ID from the deployment output.
- Python
- cURL
Install the OpenAI SDK if you don’t have it:Create a chat completion:
What just happened
From one config file, Baseten:- Downloaded the Qwen 2.5 3B Instruct weights from Hugging Face.
- Compiled them with TensorRT-LLM and FP8 quantization.
- Packaged the engine into a container on an L4 GPU.
- Exposed an OpenAI-compatible API at the model’s URL.
model.py, no Dockerfile, no inference server configuration. The same pattern works for most popular open-source LLMs, including Llama, Qwen, Mistral, Gemma, and Phi.
Next steps
Engine configuration
Tune max sequence length, batch size, quantization, and runtime settings for your deployment.
Custom model code
Add custom Python when you need preprocessing, postprocessing, or unsupported model architectures.
Autoscaling
Configure replicas, concurrency targets, and scale-to-zero for production traffic.
Promote to production
Move from development to production with
truss push --promote.