Deploying a model to Baseten turns your model code into a production-ready API endpoint. You package your model with Truss, push it to Baseten, and receive a URL you can call from any application. This guide walks through deploying Phi-3-mini-4k-instruct, a 3.8B parameter LLM, from local code to a production API. You’ll create a Truss project, write model code, configure dependencies and GPU resources, deploy to Baseten, and call your model’s API endpoint.

Set up your environment

Before you begin, sign up or sign in to Baseten.

Install Truss

Truss is Baseten’s model packaging framework. It handles containerization, dependencies, and deployment configuration.
Using a virtual environment is recommended to avoid dependency conflicts with other Python projects.
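Truss is distributed as a Python package; with your virtual environment active, a typical install looks like:
pip install --upgrade truss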
New accounts include free credits; this guide should use less than $1 in GPU costs.

Create a Truss

A Truss packages your model into a deployable container with all dependencies and configurations. Create a new Truss:
truss init phi-3-mini && cd phi-3-mini
When prompted, give your Truss a name like Phi 3 Mini. This command scaffolds a project with the following structure:
phi-3-mini/
  model/
    __init__.py
    model.py
  config.yaml
  data/
  packages/
The key files are:
  • model/model.py: Your model code with load() and predict() methods.
  • config.yaml: Dependencies, resources, and deployment settings.
  • data/: Optional directory for data files bundled with your model.
  • packages/: Optional directory for local Python packages.
Truss uses this structure to build and deploy your model automatically. You define your model in model.py and your infrastructure in config.yaml; no Dockerfiles or container management required.

Implement model code

In this example, you’ll implement the model code for Phi-3-mini-4k-instruct. You’ll use the transformers library to load the model and tokenizer and PyTorch to run inference. Replace the contents of model/model.py with the following code:
model/model.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class Model:
    def __init__(self, **kwargs):
        # Lightweight setup only; heavy resources are loaded in load().
        self._model = None
        self._tokenizer = None

    def load(self):
        # Runs once at startup: download the weights and place them on the GPU.
        self._model = AutoModelForCausalLM.from_pretrained(
            "microsoft/Phi-3-mini-4k-instruct",
            device_map="cuda",
            torch_dtype="auto"
        )
        self._tokenizer = AutoTokenizer.from_pretrained(
            "microsoft/Phi-3-mini-4k-instruct"
        )

    def predict(self, request):
        # Runs on every request: build the chat prompt, generate, and decode.
        messages = request.pop("messages")
        model_inputs = self._tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self._tokenizer(model_inputs, return_tensors="pt").to("cuda")
        with torch.no_grad():
            outputs = self._model.generate(input_ids=inputs["input_ids"], max_length=256)
        return {"output": self._tokenizer.decode(outputs[0], skip_special_tokens=True)}
Truss models follow a three-method pattern that separates initialization from inference:
| Method | When it's called | What to do here |
| --- | --- | --- |
| `__init__` | Once when the class is created | Initialize variables, store configuration, set secrets |
| `load` | Once at startup, before any requests | Load model weights, tokenizers, and other heavy resources |
| `predict` | On every API request | Process input, run inference, return response |
Why separate load from __init__? The load method runs during the container’s cold start, before your model receives traffic. This keeps expensive operations (like downloading large model weights) out of the request path.

Understand the request/response flow

The predict method receives request, a dictionary containing the JSON body from the API call:
# API call with: {"messages": [{"role": "user", "content": "Hello"}]}
def predict(self, request):
    messages = request.pop("messages")  # Extract from request
    # ... run inference ...
    return {"output": result}  # Return dict becomes JSON response
Whatever dictionary you return becomes the API response. You control the input parameters and output format.
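Because you define the contract, you can also accept optional fields. As a sketch, here is a variation of the predict method above that lets callers override the generation length (the max_length request field is hypothetical and not part of this guide's deployed code):
    def predict(self, request):
        messages = request.pop("messages")
        # Hypothetical optional field: use the caller's max_length if provided, else 256.
        max_length = int(request.pop("max_length", 256))
        prompt = self._tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self._tokenizer(prompt, return_tensors="pt").to("cuda")
        with torch.no_grad():
            outputs = self._model.generate(input_ids=inputs["input_ids"], max_length=max_length)
        return {"output": self._tokenizer.decode(outputs[0], skip_special_tokens=True)}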

GPU and memory patterns

A few patterns in this code are common across GPU models:
  • device_map="cuda": Loads model weights directly to GPU.
  • .to("cuda"): Moves input tensors to GPU for inference.
  • torch.no_grad(): Disables gradient tracking to save memory (gradients aren’t needed for inference).
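These patterns are plain PyTorch rather than anything Truss-specific; a tiny standalone sketch (independent of the Phi-3 model) shows the same moves:
import torch

# Resolve the device once; falls back to CPU when no GPU is available (e.g., local testing).
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(1, 16).to(device)            # same idea as moving input_ids to the GPU
with torch.no_grad():                        # no gradient bookkeeping during inference
    y = torch.nn.Linear(16, 4).to(device)(x)
print(y.shape, y.requires_grad)              # torch.Size([1, 4]) False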

Configure dependencies and GPU

The config.yaml file defines your model’s environment and compute resources. This configuration determines how your container is built and what hardware it runs on.

Set Python version and dependencies

config.yaml
python_version: py311
requirements:
  - six==1.17.0
  - accelerate==0.30.1
  - einops==0.8.0
  - transformers==4.41.2
  - torch==2.3.0
Key configuration options:
| Field | Purpose | Example |
| --- | --- | --- |
| `python_version` | Python version for your container | `py39`, `py310`, `py311`, `py312` |
| `requirements` | Python packages to install (pip format) | `torch==2.3.0` |
| `system_packages` | System-level dependencies (apt packages) | `ffmpeg`, `libsm6` |
For the complete list of configuration options, see the Truss configuration reference.
Always pin exact versions (e.g., torch==2.3.0, not torch>=2.0). This ensures reproducible builds, so your model behaves the same way every time it's deployed.
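If the model already runs in a local environment, one convenient way to find exact versions to pin is to read them back from pip (optional; any pinning method works):
pip freeze | grep -iE "torch|transformers|accelerate|einops"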

Allocate a GPU

The resources section specifies what hardware your model runs on:
config.yaml
resources:
  accelerator: T4
  use_gpu: true
Choosing the right GPU: Match your GPU to your model’s VRAM requirements. For Phi-3-mini (~7.6GB), a T4 (16GB) provides headroom for inference.
| GPU | VRAM | Good for |
| --- | --- | --- |
| T4 | 16GB | Small models, embeddings, fine-tuned models |
| L4 | 24GB | Medium models (7B parameters) |
| A10G | 24GB | Medium models, image generation |
| A100 | 40/80GB | Large models (13B-70B parameters) |
| H100 | 80GB | Very large models, high throughput |
Estimating VRAM: A rough rule is 2GB of VRAM per billion parameters for float16 models. A 7B model needs ~14GB VRAM minimum.
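The rule of thumb is simple arithmetic; a quick sketch for sanity-checking GPU choices (a rough lower bound only, since it ignores activations, the KV cache, and framework overhead):
def estimate_vram_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    # float16/bfloat16 weights take roughly 2 bytes per parameter
    return params_billion * bytes_per_param

print(estimate_vram_gb(3.8))  # ~7.6 GB: Phi-3-mini fits comfortably on a 16GB T4
print(estimate_vram_gb(7.0))  # ~14 GB: a 7B model wants an L4/A10G-class GPU or larger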

Deploy the model

Authenticate with Baseten

First, generate an API key from the Baseten settings. Then log in:
truss login
The expected output is:
💻 Let's add a Baseten remote!
🤫 Quietly paste your API_KEY:
Paste your API key when prompted. Truss saves your credentials for future deployments.
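The curl example at the end of this guide reads the key from a BASETEN_API_KEY environment variable, so it's convenient to export it now (placeholder value shown; adjust for your shell):
export BASETEN_API_KEY="<your-api-key>"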

Push your model to Baseten

truss push
The expected output is:
Deploying truss using T4x4x16 instance type.
✨ Model Phi 3 Mini was successfully pushed ✨

| --------------------------------------------------------------------------------------- |
| Your model is deploying as a development model. Development models allow you to         |
| iterate quickly during the deployment process.                                          |
|                                                                                         |
| When you are ready to publish your deployed model as a new deployment,                  |
| pass '--publish' to the 'truss push' command. To monitor changes to your model and      |
| rapidly iterate, run the 'truss watch' command.                                         |
| --------------------------------------------------------------------------------------- |

🪵  View logs for your deployment at https://app.baseten.co/models/abc1d2ef/logs/xyz123
In this example, the logs URL contains two IDs:
  • Model ID: The string after /models/ (e.g., abc1d2ef), which you'll use to call the model API.
  • Deployment ID: The string after /logs/ (e.g., xyz123), which identifies this specific deployment.
You can also find your model ID in your Baseten dashboard by clicking on your model.

Call the model API

After the deployment is complete, you can call the model API. From your Truss project directory, run:
truss predict --data '{"messages": [{"role": "user", "content": "What is AGI?"}]}'
The expected output is:
Calling predict on development deployment...
{
  "output": "AGI stands for Artificial General Intelligence..."
}
The Truss CLI uses your saved credentials and automatically targets the correct deployment.
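If you'd rather call the development endpoint from application code, here is a sketch using the requests library (it assumes the model ID from your logs URL and the BASETEN_API_KEY environment variable exported earlier; /development/predict is the pre-publish counterpart of the production path shown below):
import os
import requests

model_id = "abc1d2ef"  # replace with your model ID from the logs URL

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/development/predict",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={"messages": [{"role": "user", "content": "What is AGI?"}]},
)
print(resp.json()["output"])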

Use live reload for development

To avoid long deploy times when testing changes, use live reload:
truss watch
The expected output is:
🪵  View logs for your deployment at https://app.baseten.co/models/<model_id>/logs/<deployment_id>
🚰 Attempting to sync truss with remote
No changes observed, skipping patching.
👀 Watching for changes to truss...
When you save changes to model.py, Truss automatically patches the deployed model:
Changes detected, creating patch...
Created patch to update model code file: model/model.py
Model Phi 3 Mini patched successfully.
This saves time by patching only the updated code without rebuilding Docker containers or restarting the model server.

Promote to production

Once you’re happy with the model, deploy it to production:
truss push --publish
This changes the API endpoint from /development/predict to /production/predict:
curl -X POST https://model-YOUR_MODEL_ID.api.baseten.co/production/predict \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What is AGI?"}]}'
To call your production endpoint, you need your model ID. The output of truss push --publish includes a logs URL:
🪵  View logs for your deployment at https://app.baseten.co/models/abc1d2ef/logs/xyz123
Your model ID is the string after /models/ (e.g., abc1d2ef). You can also find it in your Baseten dashboard.

Next steps

Now that you’ve deployed your first model, continue learning: