Deploy your first model
From model weights to API endpoint
This guide walks through packaging and deploying Phi-3-mini-4k-instruct, a 3.8B parameter LLM, as a production-ready API endpoint.
We’ll cover:
- Loading model weights from Hugging Face
- Running inference on a GPU
- Configuring dependencies and infrastructure
- Iterating with live reload development
- Deploying to production with autoscaling
By the end, you’ll have an AI model running on scalable infrastructure, callable via an API.
1. Setup
Before you begin:
- Sign up or sign in to Baseten
- Generate an API key and store it securely
- Install Truss, our model packaging framework
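Truss is distributed on PyPI, so you can install or upgrade it with pip:

```sh
pip install --upgrade truss
```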
New accounts include free credits; this guide should use less than $1 in GPU costs.
2. Create a Truss
A Truss packages your model into a deployable container with all dependencies and configurations.
Create a new Truss:
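```sh
truss init phi-3-mini && cd phi-3-mini
```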
When prompted, give your Truss a name like Phi 3 Mini.
You should see the following file structure:
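```
phi-3-mini/
├── config.yaml
└── model/
    ├── __init__.py
    └── model.py
```

Exact scaffolding may vary slightly across Truss versions, but config.yaml and model/model.py are always generated.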
You’ll primarily edit model/model.py and config.yaml.
3. Load model weights
Phi-3-mini-4k-instruct is available on Hugging Face. We’ll load its weights using transformers.
Edit model/model.py:
4. Implement Model Inference
Define how the model processes incoming requests by implementing the predict() function:
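Add this method to the Model class. It is a sketch that assumes requests arrive as {"messages": [...]} in the chat format that transformers' apply_chat_template understands:

```python
    def predict(self, model_input):
        # Runs once per request. Expects input like
        # {"messages": [{"role": "user", "content": "..."}]}
        messages = model_input["messages"]

        # Format the chat history with Phi-3's prompt template and tokenize
        input_ids = self._tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to("cuda")

        # Generate a response, capped at 256 new tokens
        output_ids = self._model.generate(input_ids, max_new_tokens=256)

        # Decode only the newly generated tokens, skipping the echoed prompt
        completion = self._tokenizer.decode(
            output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
        )
        return {"output": completion}
```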
This function:
- ✅ Accepts a list of messages
- ✅ Uses Hugging Face’s tokenizer
- ✅ Generates a response with max 256 tokens
5. Configure Dependencies & GPU
In config.yaml, define the Python environment and compute resources:
Set Dependencies
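The model code above needs transformers plus torch and accelerate (required for device_map). A minimal, unpinned sketch; pin exact versions for reproducible builds:

```yaml
requirements:
  - accelerate
  - torch
  - transformers
```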
Allocate a GPU
Phi-3-mini needs about 7.6 GB of VRAM, so a T4 GPU (16 GB of VRAM) is a good choice.
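Set it in the resources section of config.yaml:

```yaml
resources:
  accelerator: T4
  use_gpu: true
```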
6. Deploy the Model
1. Get Your API Key
🔗 Generate an API Key
You can generate an API key from the Baseten UI: click the user icon at the top right, then click API keys. Save your API key, because you’ll use it in the next step.
2. Push Your Model to Baseten
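From the root of your Truss directory, run:

```sh
truss push
```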
Since this is a first-time deployment, truss will ask for your API key and save it for future runs.
Monitor the deployment from your Baseten dashboard.
7. Call the Model API
After the deployment is complete, we can call the model API. First, store the Baseten API key as an environment variable:
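```sh
export BASETEN_API_KEY=<your_api_key>
```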
Below is the client code. Be sure to replace model_id with the ID from your deployment.
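A sketch using requests, assuming the development-endpoint URL pattern shown on your model dashboard and the {"messages": [...]} input format from the predict() sketch above:

```python
import os

import requests

model_id = ""  # paste the model ID from your Baseten dashboard

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/development/predict",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={"messages": [{"role": "user", "content": "What is Truss?"}]},
)
print(resp.json())
```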
8. Live Reload for Development
Avoid long deploy times when testing changes by using live reload, started with the command shown after this list:
- Saves time by patching only the updated code
- Skips rebuilding Docker containers
- Keeps the model server running while iterating
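Start a live reload session from your Truss directory; it stays running and patches the deployed model server as you save files:

```sh
truss watch
```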
Make changes to model.py, save, and test the API again.
9. Promote to Production
Once you’re happy with the model, deploy it to production:
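```sh
truss push --publish
```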
This updates the API endpoint:
- ❌ Development: /development/predict
- ✅ Production: /production/predict
Next Steps
🚀 You’ve successfully packaged, deployed, and invoked an AI model with Truss!
Explore more:
- Learn more about model serving with Truss.
- Browse example implementations for dozens of open source models.
- See inference examples and Baseten integrations.
- Use autoscaling settings to spin multiple GPU replicas up and down.