- Loading model weights from Hugging Face
- Running inference on a GPU
- Configuring dependencies and infrastructure
- Iterating with live reload development
- Deploying to production with autoscaling
1. Setup
Before you begin:
- Sign up or sign in to Baseten
- Generate an API key and store it securely
- Install Truss, our model packaging framework (see the command below)
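Truss is distributed on PyPI; installing with `--upgrade` brings an existing install up to date:

```sh
pip install --upgrade truss
```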
New accounts include free credits—this guide should use less than $1 in GPU
costs.
2. Create a Truss
A Truss packages your model into a deployable container with all of its dependencies and configuration. Create a new Truss named Phi 3 Mini:
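A sketch using the Truss CLI's `init` command; the directory name `phi-3-mini` is an assumption, not from the original guide. Enter `Phi 3 Mini` when prompted for a model name:

```sh
truss init phi-3-mini
cd phi-3-mini
```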
The new directory contains two files you will edit: `model/model.py` and `config.yaml`.
3. Load model weights
Phi-3-mini-4k-instruct is available on Hugging Face. We'll load its weights using transformers. Edit `model/model.py`:
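A minimal sketch of the Truss `Model` class, assuming the `microsoft/Phi-3-mini-4k-instruct` checkpoint ID; Truss calls `load()` once when the model server starts:

```python
# model/model.py
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINT = "microsoft/Phi-3-mini-4k-instruct"


class Model:
    def __init__(self, **kwargs):
        self.tokenizer = None
        self.model = None

    def load(self):
        # Runs once at server startup: download weights and tokenizer
        # from Hugging Face and place the model on the GPU
        self.tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
        self.model = AutoModelForCausalLM.from_pretrained(
            CHECKPOINT,
            device_map="auto",   # put the model on the available GPU
            torch_dtype="auto",  # use the checkpoint's native precision
        )
```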
4. Implement Model Inference
Define how the model processes incoming requests by implementing the `predict()` method in `model/model.py`:
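A hedged sketch of `predict()`, assuming the request body carries a `messages` list in the chat format shown in step 7; `apply_chat_template` formats the conversation into the model's prompt format:

```python
# model/model.py (inside the Model class)
def predict(self, model_input):
    # model_input is the parsed JSON request body, assumed here to
    # look like {"messages": [{"role": "user", "content": "..."}]}
    messages = model_input["messages"]
    input_ids = self.tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(self.model.device)
    output_ids = self.model.generate(input_ids, max_new_tokens=256)
    # Decode only the newly generated tokens, not the echoed prompt
    completion = self.tokenizer.decode(
        output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
    )
    return {"output": completion}
```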
- ✅ Accepts a list of messages
- ✅ Uses Hugging Face’s tokenizer
- ✅ Generates a response of up to 256 new tokens
5. Configure Dependencies & GPU
In `config.yaml`, define the Python environment and compute resources.
Set Dependencies
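A sketch of the `requirements` section; the exact package list is an assumption (unpinned for brevity), with `accelerate` included because the load step uses `device_map="auto"`:

```yaml
# config.yaml
requirements:
  - accelerate
  - torch
  - transformers
```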
Allocate a GPU
Phi-3-mini needs ~7.6 GB of VRAM, so a T4 GPU (16 GB VRAM) is a good choice. In `config.yaml`:
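Assuming Truss's standard `resources` block:

```yaml
# config.yaml
resources:
  accelerator: T4
  use_gpu: true
```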
6. Deploy the Model
1. Get Your API Key
Generate an API key from the Baseten UI: click the user icon at the top right, then click API keys. Save your API key, because we will use it in the next step.
2. Push Your Model to Baseten
`truss push` will ask for your API key and save it for future runs.
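Run the push from the Truss directory created in step 2; this uploads your Truss and starts a build:

```sh
truss push
```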
Monitor the deployment from your Baseten dashboard.
7. Call the Model API
After the deployment is complete, we can call the model API. First, store the Baseten API key as an environment variable, then call the endpoint using the `model_id` from your deployment.
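A hedged example, assuming Baseten's `model-{model_id}.api.baseten.co` endpoint pattern and the `messages` schema our `predict()` expects; substitute the `model_id` shown on your dashboard:

```sh
export BASETEN_API_KEY="paste-your-api-key-here"

# Replace {model_id} with the model ID from your Baseten dashboard
curl -X POST "https://model-{model_id}.api.baseten.co/development/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What is the capital of France?"}]}'
```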
8. Live Reload for Development
Avoid long deploy times when testing changes by using live reload (see the command after this list):
- Saves time by patching only the updated code
- Skips rebuilding Docker containers
- Keeps the model server running while iterating
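A sketch, assuming the Truss CLI's watch command:

```sh
# Watches the local Truss and patches code changes onto the
# running development deployment without a full rebuild
truss watch
```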
With `truss watch` running, make a change to `model.py`, save, and test the API again.
9. Promote to Production
Once you're happy with the model, deploy it to production (see the command below):
- ❌ Development: /development/predict
- ✅ Production: /production/predict
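One way to promote, assuming the Truss CLI's `--publish` flag:

```sh
# Publishes the deployment to production; the endpoint moves
# from /development/predict to /production/predict
truss push --publish
```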
Next Steps
🚀 You've successfully packaged, deployed, and invoked an AI model with Truss! Explore more:
- Learn more about model serving with Truss.
- Example implementations for dozens of open source models.
- Inference examples and Baseten integrations.
- Use autoscaling settings to spin multiple GPU replicas up and down.