Model deployment overview
Package and deploy AI models on Baseten
At a high level, model deployment has three phases:
- Get model weights for an open source, fine-tuned, or custom-built AI/ML model.
- Implement a model server.
- Run that model server in a container on the cloud behind a secure API endpoint.
It’s easy to download weights for an open source model for the first step, and Baseten entirely automates the third step (cloud deployment). But that second step, implementing the model server, is more complex.
To make it easier for AI engineers to write model serving code, we built Truss.
Truss: a model server abstraction
Truss is an open source framework for writing model server code in Python.
Truss gives you:
- The ability to create a containerized model server without learning Docker.
- An enjoyable and productive dev loop where you can test changes live in a remote development environment that closely mirrors production.
- Compatibility across model frameworks like torch, transformers, and diffusers; engines like TensorRT/TensorRT-LLM, vLLM, and TGI; serving technologies like Triton; and any package you can install with pip or apt.
We built Truss because containerization technologies like Docker are incredibly powerful, but their abstractions are too general for the problems AI and ML engineers face in model serving. Building model-specific optimizations at the infrastructure layer is a distinct skill set from developing AI models, so Truss brings familiar Python-based tooling to model packaging and lets any developer build production-ready AI model servers.
Using the Truss CLI
To get started with Truss, install the Truss CLI. We recommend always using the latest version:
pip install --upgrade truss
Start by creating a new Truss for your model:
truss init
After implementing your model server in model.py and config.yaml, you can push your model to Baseten:
truss push
This creates a development deployment, which you can patch by saving changes to your Truss code while running:
truss watch
When your model is ready for production, you can promote your deployment with:
truss push --publish
See the Truss CLI reference for more commands and options.
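For orientation, here is a minimal sketch of what a model.py might contain. It assumes the class-based interface in which a Model class exposes load() and predict() methods. The uppercase "model" is a placeholder standing in for real weights, and the input and output keys ("text", "output") are hypothetical names chosen for illustration.

```python
# model.py: a minimal sketch of a Truss-style model server class.
# The "model" below is a stand-in function, not real weights.


class Model:
    def __init__(self, **kwargs):
        # Configuration arrives as keyword arguments; this sketch ignores it.
        self._model = None

    def load(self):
        # Load weights here rather than in __init__, so the server can
        # call load() on its own schedule. Placeholder for a real model:
        self._model = lambda text: text.upper()

    def predict(self, model_input):
        # Called with the parsed request body; returns a serializable result.
        return {"output": self._model(model_input["text"])}


if __name__ == "__main__":
    # Exercise the class locally, outside any server.
    model = Model()
    model.load()
    print(model.predict({"text": "hello"}))
```

Keeping weight loading in load() rather than in the constructor separates cheap object creation from the expensive step, which also makes the class easy to test locally as shown in the last lines.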
Live reload developer workflow
Waiting for your model server to build and deploy every time you make a change would be a painful developer experience. Instead, work on your model as a development deployment and your changes will be live in seconds.
When you run truss push, your model is automatically deployed as a development deployment.
When you make a change to a development deployment, your code update is patched onto the running server. This patching process skips building and deploying a new image. Instead, it applies any necessary environment updates and then runs the load() method to reload the model weights.
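The idea behind patching can be illustrated with a toy simulation. This is not how Truss is implemented internally: ToyServer, ModelV1, and ModelV2 are hypothetical names, and the counters exist only to show that a patch re-runs load() without triggering another image build.

```python
# Toy illustration of patching a development deployment.
# ToyServer, ModelV1, and ModelV2 are hypothetical names, not Truss internals.


class ModelV1:
    def load(self):
        self._suffix = "!"

    def predict(self, model_input):
        return model_input["text"] + self._suffix


class ModelV2(ModelV1):
    # An edited version of the model code, as if model.py had just been saved.
    def load(self):
        self._suffix = "?"


class ToyServer:
    def __init__(self, model_cls):
        self.image_builds = 1  # the container image is built once, on first push
        self.load_calls = 0
        self._start(model_cls)

    def _start(self, model_cls):
        # Instantiate the model code and load its weights.
        self.model = model_cls()
        self.model.load()
        self.load_calls += 1

    def patch(self, model_cls):
        # Swap in the updated code and re-run load(); image_builds is untouched.
        self._start(model_cls)


server = ToyServer(ModelV1)
print(server.model.predict({"text": "hi"}))  # hi!
server.patch(ModelV2)
print(server.model.predict({"text": "hi"}))  # hi?
print(server.image_builds, server.load_calls)  # 1 2
```

In this sketch the cost of a change is one extra load() call, which is why iteration on a development deployment takes seconds rather than a full build and deploy cycle.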
Development deployments are great for rapid iteration, but aren’t suitable for production use. When you’re ready to use your model in your application, promote your deployment to production.
Example model implementations
Tutorials
Step-by-step examples present core concepts for model packaging and deployment.
Truss examples
Source code for dozens of models with various engines, quantizations, and implementations.
Model library
Production-ready models with usage documentation, source code, and one-click deployments.
Model deployment guides
With Truss, you get all of the power and flexibility of Python. You can completely customize your model server's behavior and environment to suit your needs. To get you started, we've written guides to common steps in model server implementation:
Loading the model:
Running the model:
Setting the environment: