Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.
Sign in to Baseten
uvx truss login --browser
Install the OpenAI SDK
uv pip install openai
meta-llama/Llama-3.2-3B-Instruct is a 3B-parameter dense model with up to 128K context. This preset serves Llama 3.2 3B Instruct on a single H100 40GB through the Baseten Inference Stack (TensorRT-LLM), optimized for the lowest Llama 3.2 latency on Baseten.

Hardware: H100_40GB
Engine: TRT-LLM
Context: 128K

Write the config

Create and move into the project directory:
mkdir llama-3.2-3b-instruct-latency && cd llama-3.2-3b-instruct-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_metadata:
  example_model_input:
    max_tokens: 512
    messages:
      - content: Tell me everything you know about optimized inference.
        role: user
    stream: true
    temperature: 0.5
  tags:
    - openai-compatible
model_name: "model:llama-3.2-3b-instruct preset:latency"
python_version: py39
resources:
  accelerator: H100_40GB
  cpu: "1"
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      repo: meta-llama/Llama-3.2-3B-Instruct
      revision: main
      source: HF
    max_seq_len: 131072
    quantization_type: fp8_kv
    tensor_parallel_count: 1
  runtime:
    enable_chunked_context: true

Key parameters

Baseten Inference Stack (BIS) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:
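Parameter             Config field                      Value
Max sequence length   build.max_seq_len                 131072
Chunked prefill       runtime.enable_chunked_context    enabled (true)
Quantization          build.quantization_type           fp8_kv
Base model type       build.base_model                  decoder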
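
As a quick sanity check on these values, 131072 tokens is 128 × 1024, which is where the 128K context figure comes from. The sketch below assumes that max_seq_len bounds the prompt and the generated tokens together. The helper name fits_context is hypothetical and is not part of Truss or BIS.

```python
MAX_SEQ_LEN = 131072  # build.max_seq_len from config.yaml


def fits_context(prompt_tokens: int, max_tokens: int,
                 max_seq_len: int = MAX_SEQ_LEN) -> bool:
    """Return True if the prompt plus the requested completion fits.

    Assumption: max_seq_len covers prompt and generated tokens combined.
    """
    if prompt_tokens < 0 or max_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + max_tokens <= max_seq_len


print(MAX_SEQ_LEN // 1024)                                  # 128
print(fits_context(prompt_tokens=130_000, max_tokens=512))  # True
print(fits_context(prompt_tokens=131_000, max_tokens=512))  # False
```

A check like this can run client-side before sending a long prompt, so oversized requests fail early instead of at the server.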

Deploy

Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model llama-3.2-3b-instruct-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
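
If you script the deployment, you can pull the model ID out of the logs URL instead of copying it by hand. This is a minimal sketch that assumes the URL keeps the shape shown in the example output (/models/<model_id>/logs/...). The function name extract_model_id is hypothetical.

```python
import re

# Assumed URL shape, taken from the example output above:
# https://app.baseten.co/models/<model_id>/logs/<deployment_id>
_MODEL_ID_PATTERN = re.compile(r"/models/([^/]+)/")


def extract_model_id(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    match = _MODEL_ID_PATTERN.search(logs_url)
    if match is None:
        raise ValueError(f"no model ID found in {logs_url!r}")
    return match.group(1)


url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
print(extract_model_id(url))  # abcd1234
```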

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} in the base URL with your model ID, make sure the BASETEN_API_KEY environment variable is set, then run the script to call your deployment:
main.py
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
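
The example input in config.yaml sets stream: true, so a streaming call may be closer to how you use the deployment. The sketch below is a variant of main.py that assumes the deployment accepts the OpenAI SDK's standard stream=True option. The helper names base_url_for and stream_chat are hypothetical, and the empty model string is copied from main.py.

```python
import os


def base_url_for(model_id: str) -> str:
    """Build the production endpoint URL in the shape main.py uses."""
    return (
        f"https://model-{model_id}.api.baseten.co"
        "/environments/production/sync/v1"
    )


def stream_chat(model_id: str, prompt: str) -> str:
    """Stream a chat completion, printing text as it arrives."""
    # Imported here so base_url_for stays usable without the SDK installed.
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["BASETEN_API_KEY"],
        base_url=base_url_for(model_id),
    )
    stream = client.chat.completions.create(
        model="",  # same value as in main.py
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,    # mirrors example_model_input in config.yaml
        temperature=0.5,
        stream=True,
    )
    parts = []
    for chunk in stream:
        # Some chunks carry no choices or no text; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            text = chunk.choices[0].delta.content
            print(text, end="", flush=True)
            parts.append(text)
    print()
    return "".join(parts)
```

To try it, call stream_chat with your own model ID in place of abcd1234, for example stream_chat("abcd1234", "What is machine learning?"). The function returns the full text once the stream ends.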