Deploy your first model
From model weights to API endpoint
This guide walks through packaging and deploying Phi-3-mini-4k-instruct, a 3.8B parameter LLM, as a production-ready API endpoint.
We’ll cover:
- Loading model weights from Hugging Face
- Running inference on a GPU
- Configuring dependencies and infrastructure
- Iterating with live reload development
- Deploying to production with autoscaling
By the end, you’ll have an AI model running on scalable infrastructure, callable via an API.
1. Setup
Before you begin:
- Sign up or sign in to Baseten
- Generate an API key and store it securely
- Install Truss, our model packaging framework (installation command below)
New accounts include free credits—this guide should use less than $1 in GPU costs.
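Truss is available on PyPI; install or upgrade it with pip:

```sh
pip install --upgrade truss
```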
2. Create a Truss
A Truss packages your model into a deployable container with all dependencies and configurations.
Create a new Truss:
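The directory name phi-3-mini below is just an example; use any name you like:

```sh
truss init phi-3-mini
cd phi-3-mini
```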
When prompted, give your Truss a name like Phi 3 Mini.
You should see the following file structure:
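The exact scaffold varies by Truss version, but it should look roughly like this:

```
phi-3-mini/
├── config.yaml
└── model/
    ├── __init__.py
    └── model.py
```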
You’ll primarily edit model/model.py and config.yaml.
3. Load Model Weights
Phi-3-mini-4k-instruct is available on Hugging Face. We’ll load its weights using transformers.
Edit model/model.py:
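Here’s a minimal sketch. Truss calls load() once at server startup, so the weights are downloaded there; exact from_pretrained arguments may vary across transformers versions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINT = "microsoft/Phi-3-mini-4k-instruct"


class Model:
    def __init__(self, **kwargs):
        # Truss instantiates this once per replica; keep it lightweight.
        self._model = None
        self._tokenizer = None

    def load(self):
        # Runs at server startup: download weights and move them to the GPU.
        self._model = AutoModelForCausalLM.from_pretrained(
            CHECKPOINT,
            device_map="cuda",
            torch_dtype="auto",
        )
        self._tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
```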
4. Implement Model Inference
Define how the model processes incoming requests by implementing the predict() function:
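A sketch of the method, assuming requests carry a messages list in the chat format that apply_chat_template expects:

```python
    # Add this method to the Model class from the previous step.
    def predict(self, model_input):
        # Expected input: {"messages": [{"role": "user", "content": "..."}]}
        messages = model_input["messages"]
        input_ids = self._tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to("cuda")
        # Generate up to 256 new tokens and decode them back to text.
        output_ids = self._model.generate(input_ids, max_new_tokens=256)
        return {
            "output": self._tokenizer.decode(output_ids[0], skip_special_tokens=True)
        }
```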
This function:
- ✅ Accepts a list of messages
- ✅ Uses Hugging Face’s tokenizer
- ✅ Generates a response of up to 256 new tokens
5. Configure Dependencies & GPU
In config.yaml, define the Python environment and compute resources:
Set Dependencies
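A reasonable starting set for this model (left unpinned here for brevity; pinning exact versions is good practice in production):

```yaml
requirements:
  - accelerate
  - transformers
  - torch
```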
Allocate a GPU
Phi-3-mini needs ~7.6GB VRAM. A T4 GPU (16GB VRAM) is a good choice.
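In the resources section of config.yaml, request the GPU (T4 is the accelerator name Truss uses for the NVIDIA T4):

```yaml
resources:
  accelerator: T4
  use_gpu: true
```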
6. Deploy the Model
1. Get Your API Key
If you haven’t already, generate an API key from your Baseten account.
For security, store it as an environment variable:
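For example, in your shell (the placeholder stands in for your real key):

```sh
export BASETEN_API_KEY="<your-api-key>"
```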
2. Push Your Model to Baseten
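From the root of your Truss directory, run:

```sh
truss push
```

If Truss doesn’t already have your API key stored, it will prompt you for it.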
Monitor the deployment from your Baseten dashboard.
7. Call the Model API
Once deployed, call your model from Python:
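A sketch using the requests library; the model ID in the URL is a placeholder, so substitute the one shown on your Baseten dashboard:

```python
import os

import requests

# The model ID below is a placeholder; find yours on the Baseten dashboard.
resp = requests.post(
    "https://model-<model-id>.api.baseten.co/development/predict",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={"messages": [{"role": "user", "content": "What is Truss?"}]},
)
print(resp.json())
```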
8. Live Reload for Development
Avoid long deploy times when testing changes by using live reload (command shown after this list), which:
- Saves time by patching only the updated code
- Skips rebuilding Docker containers
- Keeps the model server running while iterating
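To start live reload, run truss watch from your Truss directory in a separate terminal:

```sh
truss watch
```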
Make changes to model.py, save, and test the API again.
9. Promote to Production
Once you’re happy with the model, deploy it to production:
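Assuming you deployed with truss push, the same command with the --publish flag promotes the deployment:

```sh
truss push --publish
```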
This updates the API endpoint from:
- ❌ Development: /development/predict
- ✅ Production: /production/predict
Next Steps
🚀 You’ve successfully packaged, deployed, and invoked an AI model with Truss!
Explore more:
- Learn more about model serving with Truss.
- Explore example implementations for dozens of open source models.
- See inference examples and Baseten integrations.
- Use autoscaling settings to spin multiple GPU replicas up and down.