Deploy an open-source LLM to Baseten with just a config file and get an OpenAI-compatible API endpoint.
Deploying a model to Baseten turns a Hugging Face model into a production-ready API endpoint. You write a `config.yaml` that specifies the model, the hardware, and the engine, then `truss push` builds a TensorRT-optimized container and deploys it. No Python code, no Dockerfile, no container management.

This guide walks through deploying Qwen 2.5 3B Instruct, a small but capable LLM, from a config file to a production API. You'll set up Truss, write a config, deploy to Baseten, call the model's OpenAI-compatible endpoint, and promote it to production.
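Start by creating a Truss for the model with `truss init` (the directory name `qwen-2.5-3b` is just an example):

```bash
# Install the Truss CLI if you haven't already: pip install truss
truss init qwen-2.5-3b
cd qwen-2.5-3b
```

The CLI prompts you for a model name: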
```
? 📦 Name this model: Qwen 2.5 3B
Truss Qwen 2.5 3B was created in ~/qwen-2.5-3b
```
This creates a directory with a `config.yaml`, a `model/` directory, and supporting files. For engine-based deployments like this one, you only need `config.yaml`. The `model/` directory holds custom Python code for cases like custom preprocessing, postprocessing, or unsupported model architectures.
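For reference, here is a sketch of what the engine-builder config for this model might look like. The exact schema comes from Baseten's TensorRT-LLM engine builder docs, and values like `max_seq_len` are illustrative assumptions — treat the generated `config.yaml` and Baseten's docs as authoritative:

```yaml
model_name: Qwen 2.5 3B
resources:
  accelerator: L4          # GPU the engine is built for and served on
  use_gpu: true
trt_llm:
  build:
    base_model: qwen       # model architecture family
    checkpoint_repository:
      repo: Qwen/Qwen2.5-3B-Instruct   # Hugging Face repo to pull weights from
      source: HF
    max_seq_len: 8192      # illustrative; tune to your context-length needs
```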
Push the model to Baseten. We'll start by deploying in development mode so we can iterate quickly:
```bash
truss push --watch
```
You should see:
```
✨ Model Qwen 2.5 3B was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abc1d2ef/logs/xyz123
👀 Watching for changes to truss...
```
The logs URL contains your model ID, the string after `/models/` (e.g., `abc1d2ef`). You'll need this to call the model's API. You can also find it in your Baseten dashboard.

Baseten now downloads the model weights from Hugging Face, compiles them with TensorRT-LLM, and deploys the resulting container to an L4 GPU. You can watch progress in the logs linked above.
Engine-based deployments serve an OpenAI-compatible API. Once the deployment shows “Active” in the dashboard, call it using the OpenAI SDK or cURL. Replace {model_id} with your model ID from the deployment output.
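Here's a minimal sketch using the OpenAI Python SDK. The `base_url` pattern below is an assumption pieced together from the environment paths this guide mentions — copy the exact endpoint from your model's page in the dashboard:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_BASETEN_API_KEY",  # from your Baseten workspace settings
    # Development endpoint; swap "development" for "production" after promotion.
    base_url="https://model-{model_id}.api.baseten.co/environments/development/sync/v1",
)

response = client.chat.completions.create(
    model="qwen-2.5-3b",  # informational on a single-model deployment
    messages=[{"role": "user", "content": "What is machine learning?"}],
)
print(response.choices[0].message.content)
```

You should get a response along the lines of: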
```
Machine learning is a branch of artificial intelligence where systems learn
patterns from data to make predictions or decisions without being explicitly
programmed for each task...
```
Any code that works with the OpenAI SDK works with your deployment. Just point the `base_url` at your model's endpoint.
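The same applies to raw HTTP. A sketch with cURL, again assuming the endpoint pattern above (Baseten API requests are authenticated with an `Api-Key` authorization header):

```bash
curl -s "https://model-{model_id}.api.baseten.co/environments/development/sync/v1/chat/completions" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is machine learning?"}],
    "max_tokens": 256
  }'
```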
When you change your `config.yaml` and want to test quickly, use live reload:
```bash
truss watch
```
You should see:
```
🪵 View logs for your deployment at https://app.baseten.co/models/<model_id>/logs/<deployment_id>
🚰 Attempting to sync truss with remote
No changes observed, skipping patching.
👀 Watching for changes to truss...
```
When you save changes, Truss automatically syncs them with the deployed model. This saves time by patching the running deployment instead of doing a full rebuild.

If you stopped the watch session, you can re-attach with:
```bash
truss watch
```
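Once you're happy with the model's behavior, promote the development deployment to production. One way to do this is to publish from the CLI (you can also promote a deployment from the Baseten dashboard):

```bash
truss push --publish
```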
This creates a production deployment with its own endpoint. The API URL changes from `/environments/development/` to `/environments/production/`:
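For example, the Python client from earlier would point at the production environment instead (again assuming the URL pattern sketched above):

```python
base_url = "https://model-{model_id}.api.baseten.co/environments/production/sync/v1"
```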