Deploy your first model
From model weights to API endpoint
This guide walks through packaging and deploying Phi-3-mini-4k-instruct, a 3.8B parameter LLM, as a production-ready API endpoint.
We’ll cover:
- Loading model weights from Hugging Face
- Running inference on a GPU
- Configuring dependencies and infrastructure
- Iterating with live reload development
- Deploying to production with autoscaling
By the end, you’ll have an AI model running on scalable infrastructure, callable via an API.
1. Setup
Before you begin:
- Sign up or sign in to Baseten
- Generate an API key and store it securely
- Install Truss, our model packaging framework (installation command below)
New accounts include free credits—this guide should use less than $1 in GPU costs.
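Truss is available on PyPI; install or upgrade it with pip:

```sh
pip install --upgrade truss
```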
2. Create a Truss
A Truss packages your model into a deployable container with all dependencies and configurations.
Create a new Truss:
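The directory name phi-3-mini below is just an example; use any name you like:

```sh
truss init phi-3-mini
cd phi-3-mini
```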
When prompted, give your Truss a name like Phi 3 Mini.
You should see the following file structure:
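The exact scaffold varies by Truss version, but it should look roughly like this:

```
phi-3-mini/
├── config.yaml
└── model/
    ├── __init__.py
    └── model.py
```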
You’ll primarily edit model/model.py and config.yaml.
3. Load Model Weights
Phi-3-mini-4k-instruct is available on Hugging Face. We’ll load its weights using transformers.
Edit model/model.py:
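Here’s a minimal sketch. Truss calls load() once at server startup, so the weights are downloaded there; exact from_pretrained arguments may vary across transformers versions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINT = "microsoft/Phi-3-mini-4k-instruct"


class Model:
    def __init__(self, **kwargs):
        # Truss instantiates this once per replica; keep it lightweight.
        self._model = None
        self._tokenizer = None

    def load(self):
        # Runs at server startup: download weights and move them to the GPU.
        self._model = AutoModelForCausalLM.from_pretrained(
            CHECKPOINT,
            device_map="cuda",
            torch_dtype="auto",
        )
        self._tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
```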
4. Implement Model Inference
Define how the model processes incoming requests by implementing the predict() function:
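A sketch of the method, assuming requests carry a messages list in the chat format that apply_chat_template expects:

```python
    # Add this method to the Model class from the previous step.
    def predict(self, model_input):
        # Expected input: {"messages": [{"role": "user", "content": "..."}]}
        messages = model_input["messages"]
        input_ids = self._tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to("cuda")
        # Generate up to 256 new tokens and decode them back to text.
        output_ids = self._model.generate(input_ids, max_new_tokens=256)
        return {
            "output": self._tokenizer.decode(output_ids[0], skip_special_tokens=True)
        }
```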
This function:
- ✅ Accepts a list of messages
- ✅ Uses Hugging Face’s tokenizer
- ✅ Generates a response of up to 256 new tokens
5. Configure Dependencies & GPU
In config.yaml, define the Python environment and compute resources:
Set Dependencies
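A reasonable starting set for this model (left unpinned here for brevity; pinning exact versions is good practice in production):

```yaml
requirements:
  - accelerate
  - transformers
  - torch
```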
Allocate a GPU
Phi-3-mini needs ~7.6GB VRAM. A T4 GPU (16GB VRAM) is a good choice.
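In the resources section of config.yaml, request the GPU (T4 is the accelerator name Truss uses for the NVIDIA T4):

```yaml
resources:
  accelerator: T4
  use_gpu: true
```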
6. Deploy the Model
1. Get Your API Key
If you haven’t already, generate an API key from your Baseten account.
For security, store it as an environment variable:
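For example, in your shell (the placeholder stands in for your real key):

```sh
export BASETEN_API_KEY="<your-api-key>"
```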
2. Push Your Model to Baseten
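From the root of your Truss directory, run:

```sh
truss push
```

If Truss doesn’t already have your API key stored, it will prompt you for it.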
Monitor the deployment from your Baseten dashboard.
7. Call the Model API
Once deployed, call your model from Python:
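A sketch using the requests library; the model ID in the URL is a placeholder, so substitute the one shown on your Baseten dashboard:

```python
import os

import requests

# The model ID below is a placeholder; find yours on the Baseten dashboard.
resp = requests.post(
    "https://model-<model-id>.api.baseten.co/development/predict",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={"messages": [{"role": "user", "content": "What is Truss?"}]},
)
print(resp.json())
```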
8. Live Reload for Development
Avoid long deploy times when testing changes by using live reload (command shown after this list), which:
- Saves time by patching only the updated code
- Skips rebuilding Docker containers
- Keeps the model server running while iterating
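To start live reload, run truss watch from your Truss directory in a separate terminal:

```sh
truss watch
```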
Make changes to model.py, save, and test the API again.
9. Promote to Production
Once you’re happy with the model, deploy it to production:
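Assuming you deployed with truss push, the same command with the --publish flag promotes the deployment:

```sh
truss push --publish
```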
This updates the API endpoint from:
- ❌ Development: /development/predict
- ✅ Production: /production/predict
Next Steps
🚀 You’ve successfully packaged, deployed, and invoked an AI model with Truss!
Explore more:
- Learn more about model serving with Truss.
- Explore example implementations for dozens of open source models.
- See inference examples and Baseten integrations.
- Use autoscaling settings to spin multiple GPU replicas up and down.