Most models on Baseten deploy with just a config.yaml and an inference engine. But when you need custom preprocessing, postprocessing, or want to run a model architecture that the built-in engines don’t support, you can write Python code in a model.py file. Truss provides a Model class with three methods (__init__, load, and predict) that give you full control over how your model initializes, loads weights, and handles requests. This guide walks through deploying Phi-3-mini-4k-instruct, a 3.8B parameter LLM, using custom Python code. If you haven’t deployed a config-only model yet, start with Deploy your first model.

Set up your environment

Before you begin, sign up or sign in to Baseten.

Install Truss

Truss is Baseten’s model packaging framework. It handles containerization, dependencies, and deployment configuration.
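Install the Truss CLI and packaging library from PyPI (upgrading ensures you get the latest release):
pip install --upgrade truss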
New accounts include free credits. This guide uses less than $1 in GPU costs.

Create a Truss project

Create a new Truss:
truss init phi-3-mini && cd phi-3-mini
When prompted, give your Truss a name like Phi 3 Mini. This command scaffolds a project with the following structure:
phi-3-mini/
  model/
    __init__.py
    model.py
  config.yaml
  data/
  packages/
The key files are:
  • model/model.py: Your model code with load() and predict() methods.
  • config.yaml: Dependencies, resources, and deployment settings.
  • data/: Optional directory for data files bundled with your model.
  • packages/: Optional directory for local Python packages.
Truss uses this structure to build and deploy your model automatically. You define your model in model.py and your infrastructure in config.yaml; no Dockerfiles or container management are required.
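For example, helper code placed in packages/ can typically be imported from model.py. A hedged sketch below; the helpers module and build_prompt function are hypothetical, and this assumes Truss makes packages/ importable at runtime:
# packages/helpers.py  (hypothetical local module)
def build_prompt(messages):
    # Join the message contents into a single prompt string.
    return "\n".join(m["content"] for m in messages)

# model/model.py
from helpers import build_prompt  # assumes packages/ is on the import path at runtime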

Implement model code

Replace the contents of model/model.py with the following code. This loads Phi-3-mini-4k-instruct using the transformers library and PyTorch:
model/model.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class Model:
    def __init__(self, **kwargs):
        # Runs once when the class is created; heavy work belongs in load().
        self._model = None
        self._tokenizer = None

    def load(self):
        # Runs once at startup, before any requests: download weights and place them on the GPU.
        self._model = AutoModelForCausalLM.from_pretrained(
            "microsoft/Phi-3-mini-4k-instruct",
            device_map="cuda",
            torch_dtype="auto"
        )
        self._tokenizer = AutoTokenizer.from_pretrained(
            "microsoft/Phi-3-mini-4k-instruct"
        )

    def predict(self, request):
        # Runs on every API request; `request` is the parsed JSON body.
        messages = request.pop("messages")
        # Format the chat messages with Phi-3's chat template.
        model_inputs = self._tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self._tokenizer(model_inputs, return_tensors="pt").to("cuda")
        with torch.no_grad():  # gradients aren't needed for inference
            # max_length caps the total token count (prompt + generated tokens).
            outputs = self._model.generate(input_ids=inputs["input_ids"], max_length=256)
        return {"output": self._tokenizer.decode(outputs[0], skip_special_tokens=True)}
Truss models follow a three-method pattern that separates initialization from inference:
Method      When it's called                        What to do here
__init__    Once when the class is created          Initialize variables, store configuration, set secrets.
load        Once at startup, before any requests    Load model weights, tokenizers, and other heavy resources.
predict     On every API request                    Process input, run inference, return response.
The load method runs during the container’s cold start, before your model receives traffic. This keeps expensive operations (like downloading large model weights) out of the request path.
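As the table notes, __init__ is where you store configuration and secrets. A minimal sketch, assuming a secret named hf_access_token has been declared in config.yaml and that declared secrets arrive via kwargs (the secret name is illustrative):
class Model:
    def __init__(self, **kwargs):
        # Truss passes runtime context, including declared secrets, as keyword arguments.
        self._secrets = kwargs.get("secrets")
        # Illustrative secret name; read it here, use it in load() when downloading gated weights.
        self._hf_token = self._secrets["hf_access_token"] if self._secrets else None
        self._model = None
        self._tokenizer = None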

Understand the request/response flow

The predict method receives request, a dictionary containing the JSON body from the API call:
# API call with: {"messages": [{"role": "user", "content": "Hello"}]}
def predict(self, request):
    messages = request.pop("messages")  # Extract from request
    # ... run inference ...
    return {"output": result}  # Return dict becomes JSON response
Whatever dictionary you return becomes the API response. You control the input parameters and output format.
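For example, you could accept an optional generation parameter and return extra metadata alongside the text. A sketch of an alternative predict method for the Model class above; the max_new_tokens parameter and tokens_generated field are illustrative choices, not part of the Truss API:
def predict(self, request):
    messages = request.pop("messages")
    # Optional caller-supplied parameter with a default (illustrative).
    max_new_tokens = request.pop("max_new_tokens", 256)

    prompt = self._tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = self._tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = self._model.generate(
            input_ids=inputs["input_ids"], max_new_tokens=max_new_tokens
        )

    # Anything in this dict is serialized as the JSON response body.
    new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    return {
        "output": self._tokenizer.decode(outputs[0], skip_special_tokens=True),
        "tokens_generated": int(new_tokens),
    }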

GPU and memory patterns

A few patterns in this code are common across GPU models:
  • device_map="cuda": Loads model weights directly to GPU.
  • .to("cuda"): Moves input tensors to GPU for inference.
  • torch.no_grad(): Disables gradient tracking to save memory (gradients aren’t needed for inference).
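If you also want to run the same code on a machine without a GPU (for local debugging), one common variation selects the device at load time. A sketch under that assumption, using the standalone transformers API rather than the Truss Model class:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pick the device once and reuse it; falls back to CPU when no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct", torch_dtype="auto"
).to(device)

inputs = tokenizer("What is AGI?", return_tensors="pt").to(device)
with torch.no_grad():  # gradients aren't needed for inference
    outputs = model.generate(input_ids=inputs["input_ids"], max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))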

Configure dependencies and GPU

The config.yaml file defines your model’s environment and compute resources.

Set Python version and dependencies

config.yaml
python_version: py311
requirements:
  - six==1.17.0
  - accelerate==0.30.1
  - einops==0.8.0
  - transformers==4.41.2
  - torch==2.3.0
Key configuration options:
Field             Purpose                                     Example
python_version    Python version for your container.          py39, py310, py311, py312
requirements      Python packages to install (pip format).    torch==2.3.0
system_packages   System-level dependencies (apt packages).   ffmpeg, libsm6
For the complete list of configuration options, see the Truss reference config.
Always pin exact versions (e.g., torch==2.3.0, not torch>=2.0). Pinning ensures reproducible builds, so your model behaves the same way every time it's deployed.

Allocate a GPU

The resources section specifies what hardware your model runs on:
config.yaml
resources:
  accelerator: T4
  use_gpu: true
Match your GPU to your model’s VRAM requirements. For Phi-3-mini (approximately 7.6 GB), a T4 (16 GB) provides headroom for inference.
GPU     VRAM       Good for
T4      16 GB      Small models, embeddings, fine-tuned models.
L4      24 GB      Medium models (7B parameters).
A10G    24 GB      Medium models, image generation.
A100    40/80 GB   Large models (13B-70B parameters).
H100    80 GB      Very large models, high throughput.
A rough rule for estimating VRAM: 2 GB per billion parameters for float16 models. A 7B model needs approximately 14 GB VRAM minimum.
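As a quick worked example of that rule (the helper function below is purely illustrative, not part of Truss):
def estimate_vram_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    # float16/bfloat16 weights take 2 bytes per parameter; activations and KV cache add overhead on top.
    return params_billions * bytes_per_param

print(estimate_vram_gb(3.8))  # Phi-3-mini: ~7.6 GB, comfortable on a 16 GB T4
print(estimate_vram_gb(7.0))  # 7B model: ~14 GB, a tight fit on a T4, roomier on an L4 or A10G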

Deploy the model

Authenticate with Baseten

Generate an API key from Baseten settings, then log in:
truss login
You should see:
💻 Let's add a Baseten remote!
🤫 Quietly paste your API_KEY:
Paste your API key when prompted. Truss saves your credentials for future deployments.

Push your model to Baseten

truss push
You should see:
✨ Model Phi 3 Mini was successfully pushed ✨

🪵  View logs for your deployment at https://app.baseten.co/models/abc1d2ef/logs/xyz123
The logs URL contains your model ID, the string after /models/ (e.g., abc1d2ef). You’ll need this to call the model’s API. You can also find it in your Baseten dashboard.

Call the model API

After the deployment shows “Active” in the dashboard, call the model API:
From your Truss project directory, run:
truss predict --data '{"messages": [{"role": "user", "content": "What is AGI?"}]}'
You should see:
Calling predict on development deployment...
{
  "output": "AGI stands for Artificial General Intelligence..."
}
The Truss CLI uses your saved credentials and automatically targets the correct deployment.
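You can also call the development deployment directly over HTTP, the same way you will call the production endpoint later. A sketch using the requests library; the model ID below is a placeholder, and BASETEN_API_KEY is assumed to be set in your environment:
import os
import requests

model_id = "abc1d2ef"  # placeholder: the ID from your logs URL
resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/development/predict",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={"messages": [{"role": "user", "content": "What is AGI?"}]},
)
print(resp.json())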

Use live reload for development

To avoid long deploy times when testing changes, use live reload:
truss watch
You should see:
🪵  View logs for your deployment at https://app.baseten.co/models/<model_id>/logs/<deployment_id>
🚰 Attempting to sync truss with remote
No changes observed, skipping patching.
👀 Watching for changes to truss...
When you save changes to model.py, Truss automatically patches the deployed model:
Changes detected, creating patch...
Created patch to update model code file: model/model.py
Model Phi 3 Mini patched successfully.
This saves time by patching only the updated code without rebuilding Docker containers or restarting the model server.

Promote to production

Once you’re happy with the model, deploy it to production:
truss push --publish
This changes the API endpoint from /development/predict to /production/predict:
curl -X POST https://model-YOUR_MODEL_ID.api.baseten.co/production/predict \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What is AGI?"}]}'
Your model ID is the string after /models/ in the logs URL from truss push --publish. You can also find it in your Baseten dashboard.

Next steps