Setup
To get started, sign in to Baseten with Truss, then install the OpenAI SDK.
Sign in to Baseten:
uvx truss login --browser
meta-llama/Llama-3.2-3B-Instruct is a 3B-parameter dense model with up to 128K context (131,072 tokens).
This preset serves Llama 3.2 3B Instruct on a single H100 40GB through the Baseten Inference Stack (TensorRT-LLM) and is tuned for low latency.
Write the config
Create and move into the project directory:
mkdir llama-3.2-3b-instruct-latency && cd llama-3.2-3b-instruct-latency
Then create a file named config.yaml and paste the following:
model_metadata:
  example_model_input:
    max_tokens: 512
    messages:
      - content: Tell me everything you know about optimized inference.
        role: user
    stream: true
    temperature: 0.5
  tags:
    - openai-compatible
model_name: "model:llama-3.2-3b-instruct preset:latency"
python_version: py39
resources:
  accelerator: H100_40GB
  cpu: "1"
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      repo: meta-llama/Llama-3.2-3B-Instruct
      revision: main
      source: HF
    max_seq_len: 131072
    quantization_type: fp8_kv
    tensor_parallel_count: 1
  runtime:
    enable_chunked_context: true
Key parameters
Baseten Inference Stack (BIS) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:
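| Parameter | Config field | Value |
|---|---|---|
| Max sequence length | build.max_seq_len | 131072 |
| Chunked prefill | runtime.enable_chunked_context | enabled (true) |
| Quantization | build.quantization_type | fp8_kv |
| Base model type | build.base_model | decoder |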
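As a quick sanity check, the values in the table can be read back out of the trt_llm block with a few lines of Python. This is only an illustrative sketch: the dict is a hand-copied mirror of the trt_llm block from config.yaml, and `summarize_trt_llm` is a hypothetical helper name, not part of Truss or BIS.

```python
# Hand-copied mirror of the trt_llm block from config.yaml (an assumption for
# illustration; nothing here is loaded or validated by Truss itself).
TRT_LLM = {
    "build": {
        "base_model": "decoder",
        "checkpoint_repository": {
            "repo": "meta-llama/Llama-3.2-3B-Instruct",
            "revision": "main",
            "source": "HF",
        },
        "max_seq_len": 131072,
        "quantization_type": "fp8_kv",
        "tensor_parallel_count": 1,
    },
    "runtime": {"enable_chunked_context": True},
}


def summarize_trt_llm(trt_llm: dict) -> dict:
    """Hypothetical helper: collect the fields listed in the table above."""
    build = trt_llm["build"]
    return {
        "max_seq_len": build["max_seq_len"],
        # Express the context window in units of 1024 tokens ("K").
        "max_seq_len_k": build["max_seq_len"] // 1024,
        "chunked_prefill": trt_llm["runtime"]["enable_chunked_context"],
        "quantization": build["quantization_type"],
        "base_model": build["base_model"],
    }


for key, value in summarize_trt_llm(TRT_LLM).items():
    print(f"{key}: {value}")
```

The `max_seq_len_k` entry shows that 131072 tokens corresponds to a 128K context window.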
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model llama-3.2-3b-instruct-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
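If you want to pull the model ID out of the logs URL programmatically, a small sketch along these lines may help. It assumes the URL has the shape shown in the example output, with the model ID as the path segment directly after /models/; `model_id_from_logs_url` is a hypothetical helper name, not a Truss function.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Hypothetical helper: return the path segment that follows /models/."""
    # Split the URL path into non-empty segments,
    # e.g. ["models", "abcd1234", "logs", "wxyz5678"].
    parts = [p for p in urlparse(logs_url).path.split("/") if p]
    if "models" not in parts:
        raise ValueError(f"no /models/ segment in URL: {logs_url}")
    idx = parts.index("models")
    if idx + 1 >= len(parts):
        raise ValueError(f"no model ID after /models/ in URL: {logs_url}")
    return parts[idx + 1]


example_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
print(model_id_from_logs_url(example_url))
```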
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure the BASETEN_API_KEY environment variable is set, then call your deployment to run inference:
import os
from openai import OpenAI

# Point the OpenAI client at your Baseten deployment.
# Replace {model_id} in the URL with your model ID.
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

# The same request with curl:
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
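The example_model_input in config.yaml sets stream: true, max_tokens: 512, and temperature: 0.5. A streaming call with those same settings might look like the sketch below. It assumes the deployment accepts the OpenAI SDK's standard stream=True option; `build_base_url` and `stream_chat` are hypothetical helper names introduced for illustration, and the model field is left empty as in the example above.

```python
import os


def build_base_url(model_id: str) -> str:
    """Hypothetical helper: fill the {model_id} placeholder in the endpoint URL."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


def stream_chat(model_id: str, prompt: str) -> str:
    """Hypothetical helper: stream a chat completion and return the full text."""
    # Imported lazily so build_base_url can be used without the SDK installed.
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["BASETEN_API_KEY"],
        base_url=build_base_url(model_id),
    )
    stream = client.chat.completions.create(
        model="",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=0.5,
        stream=True,  # yields chunks as tokens are generated
    )
    pieces = []
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            text = chunk.choices[0].delta.content
            print(text, end="", flush=True)
            pieces.append(text)
    print()
    return "".join(pieces)
```

With BASETEN_API_KEY set, calling `stream_chat` with your model ID and a prompt should print tokens as they arrive rather than waiting for the full response.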