Many models can be deployed on Baseten with just a config.yaml and an inference engine. But when you need custom preprocessing or postprocessing, or want to run a model architecture that the built-in engines don’t support, you can write Python code in a model.py file. Truss provides a Model class with three methods (__init__, load, and predict) that give you full control over how your model initializes, loads weights, and handles requests.
This guide walks through deploying Phi-3-mini-4k-instruct, a 3.8B parameter LLM, using custom Python code. If you haven’t deployed a config-only model yet, start with Deploy your first model.
Set up your environment
Before you begin, sign up or sign in to Baseten.
Install Truss
Truss is Baseten’s model packaging framework. It handles containerization, dependencies, and deployment configuration. Install it with one of the following:
- uv (recommended)
- pip (macOS/Linux)
- pip (Windows)

uv is a fast Python package manager:
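The install commands aren’t reproduced above; a minimal sketch, assuming Truss is published on PyPI under the package name truss:

```sh
# With uv (installs the Truss CLI as an isolated tool)
uv tool install truss

# Or with pip (macOS/Linux/Windows)
pip install --upgrade truss
```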
Create a Truss project
Create a new Truss. When prompted for a model name, use Phi 3 Mini.
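A sketch of the command, assuming a project directory named phi-3-mini (the directory name is arbitrary):

```sh
truss init phi-3-mini
cd phi-3-mini
```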
This command scaffolds a project with the following structure:
- model/model.py: Your model code with load() and predict() methods.
- config.yaml: Dependencies, resources, and deployment settings.
- data/: Optional directory for data files bundled with your model.
- packages/: Optional directory for local Python packages.
You define your model’s behavior in model.py and your infrastructure in config.yaml; no Dockerfiles or container management are required.
Implement model code
Replace the contents of model/model.py with the following code. This loads Phi-3-mini-4k-instruct using the transformers library and PyTorch:
model/model.py
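The full listing isn’t shown above. A minimal sketch of what it could look like, assuming the Hugging Face checkpoint microsoft/Phi-3-mini-4k-instruct and illustrative prompt/max_new_tokens request fields (exact arguments such as torch_dtype may differ in your setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINT = "microsoft/Phi-3-mini-4k-instruct"


class Model:
    def __init__(self, **kwargs):
        # Called once when the class is instantiated; keep this lightweight.
        self._model = None
        self._tokenizer = None

    def load(self):
        # Called once at startup, before any requests are served.
        # Heavy work (downloading weights, moving them to the GPU)
        # belongs here so it stays out of the request path.
        self._tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
        self._model = AutoModelForCausalLM.from_pretrained(
            CHECKPOINT,
            torch_dtype=torch.float16,
            device_map="cuda",  # load weights directly onto the GPU
        )

    def predict(self, request: dict) -> dict:
        # Called on every API request; `request` is the parsed JSON body.
        prompt = request["prompt"]
        max_new_tokens = request.get("max_new_tokens", 256)

        # Tokenize and move input tensors to the GPU.
        inputs = self._tokenizer(prompt, return_tensors="pt").to("cuda")

        # Gradients aren't needed for inference, so disable tracking.
        with torch.no_grad():
            output_ids = self._model.generate(
                **inputs, max_new_tokens=max_new_tokens
            )

        completion = self._tokenizer.decode(
            output_ids[0], skip_special_tokens=True
        )
        return {"output": completion}
```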
| Method | When it’s called | What to do here |
|---|---|---|
| __init__ | Once when the class is created | Initialize variables, store configuration, set secrets. |
| load | Once at startup, before any requests | Load model weights, tokenizers, and other heavy resources. |
| predict | On every API request | Process input, run inference, return response. |
The load method runs during the container’s cold start, before your model receives traffic. This keeps expensive operations (like downloading large model weights) out of the request path.
Understand the request/response flow
The predict method receives request, a dictionary containing the JSON body from the API call:
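For example, with the illustrative request fields from the sketch above (these names are not a fixed schema), a call with this body:

```json
{"prompt": "What is machine learning?", "max_new_tokens": 128}
```

arrives in predict as the equivalent Python dictionary, and whatever dictionary predict returns is serialized back to the caller as the JSON response.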
GPU and memory patterns
A few patterns in this code are common across GPU models:
- device_map="cuda": Loads model weights directly to GPU.
- .to("cuda"): Moves input tensors to GPU for inference.
- torch.no_grad(): Disables gradient tracking to save memory (gradients aren’t needed for inference).
Configure dependencies and GPU
The config.yaml file defines your model’s environment and compute resources.
Set Python version and dependencies
config.yaml
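The listing isn’t reproduced above; a sketch of the relevant fields, with illustrative package versions that you should replace with the ones you actually test against:

```yaml
python_version: py311
requirements:
  - torch==2.3.0
  - transformers==4.41.2
  - accelerate==0.30.1
```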
| Field | Purpose | Example |
|---|---|---|
| python_version | Python version for your container. | py39, py310, py311, py312 |
| requirements | Python packages to install (pip format). | torch==2.3.0 |
| system_packages | System-level dependencies (apt packages). | ffmpeg, libsm6 |
Always pin exact versions (e.g., torch==2.3.0, not torch>=2.0). This ensures reproducible builds, so your model behaves the same way every time it’s deployed.
Allocate a GPU
The resources section specifies what hardware your model runs on:
config.yaml
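A sketch of a resources block sized for this model; the accelerator value and the cpu/memory figures here are assumptions, so check Baseten’s supported options for exact names:

```yaml
resources:
  accelerator: T4
  use_gpu: true
  cpu: "4"
  memory: 16Gi
```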
| GPU | VRAM | Good for |
|---|---|---|
| T4 | 16 GB | Small models, embeddings, fine-tuned models. |
| L4 | 24 GB | Medium models (7B parameters). |
| A10G | 24 GB | Medium models, image generation. |
| A100 | 40/80 GB | Large models (13B-70B parameters). |
| H100 | 80 GB | Very large models, high throughput. |
Deploy the model
Authenticate with Baseten
Generate an API key from Baseten settings, then log in:
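A sketch of the login step, assuming a Truss CLI version that provides the login command:

```sh
truss login
# Paste your Baseten API key when prompted.
```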
Push your model to Baseten

From your Truss project directory, push the model. When the push completes, note the model ID that appears after /models/ in the deployment URL (e.g., abc1d2ef). You’ll need this to call the model’s API. You can also find it in your Baseten dashboard.
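A sketch of the push command; by default this creates a development deployment:

```sh
truss push
```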
Call the model API
After the deployment shows “Active” in the dashboard, call the model API. You can use the Truss CLI, cURL, or Python; the example below uses the Truss CLI.

From your Truss project directory, run the predict command shown below. You should see the model’s response. The Truss CLI uses your saved credentials and automatically targets the correct deployment.
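A sketch of the call, reusing the illustrative prompt field from the model code above. The cURL endpoint and header are assumptions based on the /development/predict and /production/predict routes mentioned later, with $BASETEN_API_KEY standing in for your API key:

```sh
# With the Truss CLI, from the project directory
truss predict -d '{"prompt": "What is machine learning?"}'

# Or with cURL, against a development deployment
curl -X POST "https://model-abc1d2ef.api.baseten.co/development/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{"prompt": "What is machine learning?"}'
```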
Use live reload for development
To avoid long deploy times when testing changes, use live reload, as shown below. When you save changes to model.py, Truss automatically patches the deployed model:
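A sketch of the live-reload workflow:

```sh
# Watches your local files and patches the running deployment on save
truss watch
```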
Promote to production
Once you’re happy with the model, deploy it to production with the command shown below. This changes the model’s API endpoint from /development/predict to /production/predict:
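A sketch of the promotion step, using the CLI’s publish flag:

```sh
truss push --publish
```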
Next steps
Model configuration
Full reference for dependencies, secrets, resources, and deployment settings.
Model implementation
Advanced patterns including streaming, async, and custom health checks.
Autoscaling
Scale GPU replicas based on demand with configurable concurrency targets.
Deploy your first model
Deploy a model with just a config file, no custom Python needed.