Most models on Baseten deploy with just a config.yaml and an inference engine. But when you need custom preprocessing, postprocessing, or want to run a model architecture that the built-in engines don’t support, you can write Python code in a model.py file. Truss provides a Model class with three methods (__init__, load, and predict) that give you full control over how your model initializes, loads weights, and handles requests. This guide walks through deploying Phi-3-mini-4k-instruct, a 3.8B parameter LLM, using custom Python code. If you haven’t deployed a config-only model yet, start with Deploy your first model.

Set up your environment

Before you begin, sign up or sign in to Baseten.

Install Truss

Truss is Baseten’s model packaging framework. It handles containerization, dependencies, and deployment configuration.
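Install the Truss CLI and packaging library from PyPI (upgrading ensures you get the latest release):
pip install --upgrade truss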
New accounts include free credits. This guide uses less than $1 in GPU costs.

Create a Truss project

Create a new Truss:
truss init phi-3-mini && cd phi-3-mini
When prompted, give your Truss a name like Phi 3 Mini. This command scaffolds a project with the following structure:
phi-3-mini/
  model/
    __init__.py
    model.py
  config.yaml
  data/
  packages/
The key files are:
  • model/model.py: Your model code with load() and predict() methods.
  • config.yaml: Dependencies, resources, and deployment settings.
  • data/: Optional directory for data files bundled with your model.
  • packages/: Optional directory for local Python packages.
Truss uses this structure to build and deploy your model automatically. You define your model in model.py and your infrastructure in config.yaml; no Dockerfiles or container management are required.
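For example, helper code placed in packages/ can typically be imported from model.py. A hedged sketch below; the helpers module and build_prompt function are hypothetical, and this assumes Truss makes packages/ importable at runtime:
# packages/helpers.py  (hypothetical local module)
def build_prompt(messages):
    # Join the message contents into a single prompt string.
    return "\n".join(m["content"] for m in messages)

# model/model.py
from helpers import build_prompt  # assumes packages/ is on the import path at runtime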

Implement model code

Replace the contents of model/model.py with the following code. This loads Phi-3-mini-4k-instruct using the transformers library and PyTorch:
model/model.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class Model:
    def __init__(self, **kwargs):
        # Runs once when the class is created; heavy work belongs in load().
        self._model = None
        self._tokenizer = None

    def load(self):
        # Runs once at startup, before any requests: download weights and place them on the GPU.
        self._model = AutoModelForCausalLM.from_pretrained(
            "microsoft/Phi-3-mini-4k-instruct",
            device_map="cuda",
            torch_dtype="auto"
        )
        self._tokenizer = AutoTokenizer.from_pretrained(
            "microsoft/Phi-3-mini-4k-instruct"
        )

    def predict(self, request):
        # Runs on every API request; `request` is the parsed JSON body.
        messages = request.pop("messages")
        # Format the chat messages with Phi-3's chat template.
        model_inputs = self._tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self._tokenizer(model_inputs, return_tensors="pt").to("cuda")
        with torch.no_grad():  # gradients aren't needed for inference
            # max_length caps the total token count (prompt + generated tokens).
            outputs = self._model.generate(input_ids=inputs["input_ids"], max_length=256)
        return {"output": self._tokenizer.decode(outputs[0], skip_special_tokens=True)}
Truss models follow a three-method pattern that separates initialization from inference:
Method      When it's called                        What to do here
__init__    Once when the class is created          Initialize variables, store configuration, set secrets.
load        Once at startup, before any requests    Load model weights, tokenizers, and other heavy resources.
predict     On every API request                    Process input, run inference, return response.
The load method runs during the container’s cold start, before your model receives traffic. This keeps expensive operations (like downloading large model weights) out of the request path.
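As the table notes, __init__ is where you store configuration and secrets. A minimal sketch, assuming a secret named hf_access_token has been declared in config.yaml and that declared secrets arrive via kwargs (the secret name is illustrative):
class Model:
    def __init__(self, **kwargs):
        # Truss passes runtime context, including declared secrets, as keyword arguments.
        self._secrets = kwargs.get("secrets")
        # Illustrative secret name; read it here, use it in load() when downloading gated weights.
        self._hf_token = self._secrets["hf_access_token"] if self._secrets else None
        self._model = None
        self._tokenizer = None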

Understand the request/response flow

The predict method receives request, a dictionary containing the JSON body from the API call:
# API call with: {"messages": [{"role": "user", "content": "Hello"}]}
def predict(self, request):
    messages = request.pop("messages")  # Extract from request
    # ... run inference ...
    return {"output": result}  # Return dict becomes JSON response
Whatever dictionary you return becomes the API response. You control the input parameters and output format.
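For example, you could accept an optional generation parameter and return extra metadata alongside the text. A sketch of an alternative predict method for the Model class above; the max_new_tokens parameter and tokens_generated field are illustrative choices, not part of the Truss API:
def predict(self, request):
    messages = request.pop("messages")
    # Optional caller-supplied parameter with a default (illustrative).
    max_new_tokens = request.pop("max_new_tokens", 256)

    prompt = self._tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = self._tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = self._model.generate(
            input_ids=inputs["input_ids"], max_new_tokens=max_new_tokens
        )

    # Anything in this dict is serialized as the JSON response body.
    new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    return {
        "output": self._tokenizer.decode(outputs[0], skip_special_tokens=True),
        "tokens_generated": int(new_tokens),
    }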

GPU and memory patterns

A few patterns in this code are common across GPU models:
  • device_map="cuda": Loads model weights directly to GPU.
  • .to("cuda"): Moves input tensors to GPU for inference.
  • torch.no_grad(): Disables gradient tracking to save memory (gradients aren’t needed for inference).
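If you also want to run the same code on a machine without a GPU (for local debugging), one common variation selects the device at load time. A sketch under that assumption, using the standalone transformers API rather than the Truss Model class:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pick the device once and reuse it; falls back to CPU when no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct", torch_dtype="auto"
).to(device)

inputs = tokenizer("What is AGI?", return_tensors="pt").to(device)
with torch.no_grad():  # gradients aren't needed for inference
    outputs = model.generate(input_ids=inputs["input_ids"], max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))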

Configure dependencies and GPU

The config.yaml file defines your model’s environment and compute resources.

Set Python version and dependencies

config.yaml
python_version: py311
requirements:
  - six==1.17.0
  - accelerate==0.30.1
  - einops==0.8.0
  - transformers==4.41.2
  - torch==2.3.0
Key configuration options:
Field             Purpose                                     Example
python_version    Python version for your container.          py39, py310, py311, py312
requirements      Python packages to install (pip format).    torch==2.3.0
system_packages   System-level dependencies (apt packages).   ffmpeg, libsm6
For the complete list of configuration options, see the Truss reference config.
Always pin exact versions (e.g., torch==2.3.0, not torch>=2.0). Pinning ensures reproducible builds, so your model behaves the same way every time it's deployed.

Allocate a GPU

The resources section specifies what hardware your model runs on:
config.yaml
resources:
  accelerator: T4
  use_gpu: true
Match your GPU to your model’s VRAM requirements. For Phi-3-mini (approximately 7.6 GB), a T4 (16 GB) provides headroom for inference.
GPU     VRAM       Good for
T4      16 GB      Small models, embeddings, fine-tuned models.
L4      24 GB      Medium models (7B parameters).
A10G    24 GB      Medium models, image generation.
A100    40/80 GB   Large models (13B-70B parameters).
H100    80 GB      Very large models, high throughput.
A rough rule for estimating VRAM: 2 GB per billion parameters for float16 models. A 7B model needs approximately 14 GB VRAM minimum.
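As a quick worked example of that rule (the helper function below is purely illustrative, not part of Truss):
def estimate_vram_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    # float16/bfloat16 weights take 2 bytes per parameter; activations and KV cache add overhead on top.
    return params_billions * bytes_per_param

print(estimate_vram_gb(3.8))  # Phi-3-mini: ~7.6 GB, comfortable on a 16 GB T4
print(estimate_vram_gb(7.0))  # 7B model: ~14 GB, a tight fit on a T4, roomier on an L4 or A10G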

Deploy the model

Authenticate with Baseten

Generate an API key from Baseten settings, then log in:
truss login
You should see:
💻 Let's add a Baseten remote!
🤫 Quietly paste your API_KEY:
Paste your API key when prompted. Truss saves your credentials for future deployments.

Push your model to Baseten

truss push
You should see:
✨ Model Phi 3 Mini was successfully pushed ✨

🪵  View logs for your deployment at https://app.baseten.co/models/abc1d2ef/logs/xyz123
The logs URL contains your model ID, the string after /models/ (e.g., abc1d2ef). You’ll need this to call the model’s API. You can also find it in your Baseten dashboard.

Call the model API

After the deployment shows “Active” in the dashboard, call the model API:
From your Truss project directory, run:
truss predict --data '{"messages": [{"role": "user", "content": "What is AGI?"}]}'
You should see:
Calling predict on development deployment...
{
  "output": "AGI stands for Artificial General Intelligence..."
}
The Truss CLI uses your saved credentials and automatically targets the correct deployment.
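You can also call the development deployment directly over HTTP, the same way you will call the production endpoint later. A sketch using the requests library; the model ID below is a placeholder, and BASETEN_API_KEY is assumed to be set in your environment:
import os
import requests

model_id = "abc1d2ef"  # placeholder: the ID from your logs URL
resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/development/predict",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={"messages": [{"role": "user", "content": "What is AGI?"}]},
)
print(resp.json())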

Use live reload for development

To avoid long deploy times when testing changes, use live reload:
truss watch
You should see:
🪵  View logs for your deployment at https://app.baseten.co/models/<model_id>/logs/<deployment_id>
🚰 Attempting to sync truss with remote
No changes observed, skipping patching.
👀 Watching for changes to truss...
When you save changes to model.py, Truss automatically patches the deployed model:
Changes detected, creating patch...
Created patch to update model code file: model/model.py
Model Phi 3 Mini patched successfully.
This saves time by patching only the updated code without rebuilding Docker containers or restarting the model server.

Promote to production

Once you’re happy with the model, deploy it to production:
truss push --publish
This changes the API endpoint from /development/predict to /production/predict:
curl -X POST https://model-YOUR_MODEL_ID.api.baseten.co/production/predict \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What is AGI?"}]}'
Your model ID is the string after /models/ in the logs URL from truss push --publish. You can also find it in your Baseten dashboard.

Next steps