Llama 3.2

Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.

Sign in to Baseten

uvx truss login --browser

Install the OpenAI SDK

uv pip install openai

meta-llama/Llama-3.2-3B-Instruct is a 3B-parameter dense model with up to 128K context. This preset serves Llama 3.2 3B Instruct on a single H100 40GB through the Baseten Inference Stack (TensorRT-LLM) and is tuned for the lowest Llama 3.2 latency on Baseten.

Hardware: H100_40GB

Engine: TRT-LLM

Context: 128K

Write the config

Create and move into the project directory:

mkdir llama-3.2-3b-instruct-latency && cd llama-3.2-3b-instruct-latency

Then create a file named config.yaml and paste the following:

config.yaml

model_metadata:
  example_model_input:
    max_tokens: 512
    messages:
      - content: Tell me everything you know about optimized inference.
        role: user
    stream: true
    temperature: 0.5
  tags:
    - openai-compatible
model_name: "model:llama-3.2-3b-instruct preset:latency"
python_version: py39
resources:
  accelerator: H100_40GB
  cpu: "1"
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      repo: meta-llama/Llama-3.2-3B-Instruct
      revision: main
      source: HF
    max_seq_len: 131072
    quantization_type: fp8_kv
    tensor_parallel_count: 1
  runtime:
    enable_chunked_context: true

Key parameters

Baseten Inference Stack (BIS) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:

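Parameter	Config key	Value
Max sequence length	`trt_llm.build.max_seq_len`	`131072`
Chunked prefill	`trt_llm.runtime.enable_chunked_context`	`true`
Quantization	`trt_llm.build.quantization_type`	`fp8_kv`
Base model type	`trt_llm.build.base_model`	`decoder`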
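
As a rough sanity check on the numbers above, the sketch below relates the 128K context label to max_seq_len and tests whether a request fits. The helper names context_in_k and fits_context are hypothetical, and the check assumes max_seq_len bounds the prompt and the generated tokens together, which is how TensorRT-LLM usually treats it.

```python
# Values copied from the trt_llm block of config.yaml.
trt_llm = {
    "build": {
        "base_model": "decoder",
        "max_seq_len": 131072,
        "quantization_type": "fp8_kv",
        "tensor_parallel_count": 1,
    },
    "runtime": {"enable_chunked_context": True},
}


def context_in_k(max_seq_len: int) -> int:
    """Express a token limit in units of 1024 tokens ("K")."""
    return max_seq_len // 1024


def fits_context(prompt_tokens: int, max_tokens: int, max_seq_len: int) -> bool:
    """Assumption: prompt and generated tokens share one max_seq_len budget."""
    return prompt_tokens + max_tokens <= max_seq_len


limit = trt_llm["build"]["max_seq_len"]
print(context_in_k(limit))               # 128
print(fits_context(100_000, 512, limit))  # True
print(fits_context(131_000, 512, limit))  # False
```

Under that assumption, a request with max_tokens set to 512 leaves 130,560 tokens for the prompt.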

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model llama-3.2-3b-instruct-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678

Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
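
If you script deployments, the model ID can be pulled out of the logs URL rather than copied by hand. This is a minimal sketch: the function names are hypothetical, the URL is the placeholder from the sample output, and the endpoint pattern is the one used in the next section.

```python
import re


def model_id_from_logs_url(url: str) -> str:
    """Return the path segment that follows /models/ in a Baseten logs URL."""
    match = re.search(r"/models/([^/]+)", url)
    if match is None:
        raise ValueError(f"no /models/<id> segment found in {url!r}")
    return match.group(1)


def base_url_for(model_id: str) -> str:
    """Build the OpenAI-compatible base URL for a production deployment."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


# Placeholder URL taken from the sample `truss push` output.
logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
model_id = model_id_from_logs_url(logs_url)
print(model_id)                # abcd1234
print(base_url_for(model_id))
```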

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID, make sure BASETEN_API_KEY is set in your environment, then call the deployment to run inference:

Python

main.py

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

cURL

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
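
The example_model_input in config.yaml sets stream: true, so a streamed call is a natural variation on the request above. The sketch below is one way to do it with the OpenAI SDK's stream=True option. The stream_chat helper is a hypothetical name, and it expects a client built exactly like the one in main.py.

```python
def stream_chat(client, prompt, max_tokens=512, temperature=0.5):
    """Yield text fragments from a streamed chat completion.

    `client` is assumed to be an openai.OpenAI instance pointed at the
    deployment's base URL, as in main.py.
    """
    stream = client.chat.completions.create(
        model="",  # same empty model value as the non-streaming example
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=temperature,
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```

To print tokens as they arrive, loop over stream_chat(client, "What is machine learning?") and call print(piece, end="", flush=True) on each piece.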
