Setup
To get started, sign in to Baseten with Truss and then install the OpenAI SDK.
Sign in to Baseten:
uvx truss login --browser
Install the OpenAI SDK:
pip install openai
nvidia/Llama-3.3-70B-Instruct-FP8 is a 70B-parameter dense model with up to 128K context.
This preset serves Llama 3.3 70B Instruct on H100:4 through Baseten Inference Stack (TensorRT-LLM) with FP8 weights and tensor parallelism. It targets low time-to-first-token on the 70B chat model.
Write the config
Create and move into the project directory:
mkdir llama-3.3-70b-instruct-latency && cd llama-3.3-70b-instruct-latency
Then create a file named config.yaml and paste the following:
model_name: "model:llama-3.3-70b-instruct preset:latency"
model_metadata:
  tags:
    - openai-compatible
  example_model_input:
    stream: true
    model: nvidia/Llama-3.3-70B-Instruct-FP8
    messages:
      - role: user
        content: Tell me everything you know about optimized inference.
    max_tokens: 512
    temperature: 0.5
python_version: py313
secrets:
  hf_access_token: null
weights:
  - source: hf://nvidia/Llama-3.3-70B-Instruct-FP8@main
    allow_patterns:
      - "*.safetensors"
      - "*.json"
      - "*.model"
      - tokenizer.model
      - "*.tiktoken"
      - "*.jinja"
    mount_location: /app/model_cache/llama-3-3-70b-instruct
    ignore_patterns:
      - original/*
      - "*.pth"
    auth_secret_name: hf_access_token
resources:
  cpu: "4"
  memory: 40Gi
  use_gpu: true
  accelerator: H100:4
data_dir: data
runtime:
  predict_concurrency: 128
  streaming_read_timeout: 60
trt_llm:
  build:
    checkpoint_repository:
      repo: michaelfeil/empty-model
      source: HF
      revision: main
      runtime_secret_name: hf_access_token
  runtime:
    max_seq_len: 131072
    patch_kwargs:
      model_path: /app/model_cache/llama-3-3-70b-instruct
      model_path_for_tokenizer: /app/model_cache/llama-3-3-70b-instruct
      cuda_graph_config:
        enable_padding: true
        max_batch_size: 128
    max_batch_size: 128
    max_num_tokens: 8192
    served_model_name: nvidia/Llama-3.3-70B-Instruct-FP8
    tensor_parallel_size: 4
    enable_chunked_prefill: true
  inference_stack: v2
  version_overrides:
    v2_llm_version: null
This config tells Baseten to compile a TensorRT-LLM engine for Llama 3.3 70B Instruct on four H100 GPUs, sharding FP8 weights from nvidia/Llama-3.3-70B-Instruct-FP8 across the four ranks. The runtime targets low time-to-first-token at moderate concurrency: 128 in-flight requests, chunked prefill, and CUDA graphs sized to the batch ceiling so each new request hits a warm engine.
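The relationships this paragraph describes (four ranks on four GPUs, CUDA graphs sized to the batch ceiling, 128 in-flight requests) can be checked before pushing. The following is a minimal sketch, not part of Truss: `check_latency_preset` is a hypothetical helper, it assumes the config has already been loaded into a Python dict (for example with a YAML parser), and it assumes `cuda_graph_config` sits under `patch_kwargs`. The checks mirror the description above; they are not rules that Baseten is known to enforce.

```python
def check_latency_preset(config: dict) -> list[str]:
    """Return human-readable mismatches; an empty list means every check passed."""
    problems = []
    engine = config["trt_llm"]["runtime"]

    # "H100:4" means four GPUs; an accelerator without ":N" is treated as one GPU.
    accelerator = config["resources"]["accelerator"]
    gpu_count = int(accelerator.split(":")[1]) if ":" in accelerator else 1
    if engine["tensor_parallel_size"] != gpu_count:
        problems.append(
            f"tensor_parallel_size={engine['tensor_parallel_size']} "
            f"but accelerator {accelerator} has {gpu_count} GPU(s)"
        )

    # CUDA graphs are described as sized to the batch ceiling.
    graph_cfg = engine.get("patch_kwargs", {}).get("cuda_graph_config", {})
    graph_batch = graph_cfg.get("max_batch_size")
    if graph_batch is not None and graph_batch != engine["max_batch_size"]:
        problems.append(
            f"cuda_graph_config.max_batch_size={graph_batch} "
            f"differs from max_batch_size={engine['max_batch_size']}"
        )

    # In-flight requests are described as matching the batch ceiling.
    concurrency = config["runtime"]["predict_concurrency"]
    if concurrency > engine["max_batch_size"]:
        problems.append(
            f"predict_concurrency={concurrency} exceeds "
            f"max_batch_size={engine['max_batch_size']}"
        )
    return problems


# Only the fields the checks read, copied from the config above.
preset = {
    "resources": {"accelerator": "H100:4"},
    "runtime": {"predict_concurrency": 128},
    "trt_llm": {
        "runtime": {
            "max_batch_size": 128,
            "tensor_parallel_size": 4,
            "patch_kwargs": {
                "cuda_graph_config": {"enable_padding": True, "max_batch_size": 128}
            },
        }
    },
}

print(check_latency_preset(preset))  # prints []
```

If you later change `accelerator` or `max_batch_size`, a non-empty list points at the field that no longer matches the rest of the preset.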
Key parameters
Baseten Inference Stack (BIS) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:
| Parameter | Value |
|---|---|
| Tensor parallel size | 4 |
| Max sequence length | 131072 |
| Max batch size | 128 |
| Max batched tokens | 8192 |
| Chunked prefill | enabled |
| Inference stack | v2 |
| Served model name | nvidia/Llama-3.3-70B-Instruct-FP8 |
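
To get a feel for what these values imply, here is some back-of-the-envelope arithmetic. It is a rough sketch under stated assumptions: each engine step processes at most `max_num_tokens` tokens, the model has about 70 billion parameters, and every weight takes one byte in FP8. In practice some tensors may be stored at higher precision, and the KV cache and activations also need memory, so treat the results as lower bounds.

```python
import math

MAX_SEQ_LEN = 131072        # from the table above
MAX_NUM_TOKENS = 8192       # from the table above
TENSOR_PARALLEL_SIZE = 4    # from the table above
PARAMS = 70e9               # assumption: roughly 70 billion parameters
BYTES_PER_PARAM_FP8 = 1     # assumption: every weight stored as one FP8 byte

# With chunked prefill, a full-length prompt needs at least this many prefill steps.
prefill_chunks = math.ceil(MAX_SEQ_LEN / MAX_NUM_TOKENS)

# Approximate weight footprint, split evenly across the tensor-parallel ranks.
weights_gb_total = PARAMS * BYTES_PER_PARAM_FP8 / 1e9
weights_gb_per_gpu = weights_gb_total / TENSOR_PARALLEL_SIZE

print(prefill_chunks)      # 16
print(weights_gb_per_gpu)  # 17.5
```

A maximum-length prompt is therefore prefilled in at least 16 chunks, and each GPU holds on the order of 17.5 GB of weights.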
Deploy
Push the config to Baseten from the project directory:
uvx truss push
You should see output similar to:
✨ Model llama-3.3-70b-instruct-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
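If you script your deployments, the model ID can be pulled out of the logs URL programmatically. This is a small sketch; `model_id_from_logs_url` and `base_url_for` are hypothetical helper names, and the parsing assumes the URL path has the `/models/<model_id>/logs/...` shape shown in the example output.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a Baseten logs URL."""
    parts = urlparse(logs_url).path.strip("/").split("/")
    # Expected path shape: models/<model_id>/logs/...
    if len(parts) < 2 or parts[0] != "models" or not parts[1]:
        raise ValueError(f"Unexpected logs URL: {logs_url!r}")
    return parts[1]


def base_url_for(model_id: str) -> str:
    """Build the OpenAI-compatible base URL for the production environment."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
model_id = model_id_from_logs_url(logs_url)
print(model_id)                # abcd1234
print(base_url_for(model_id))  # https://model-abcd1234.api.baseten.co/environments/production/sync/v1
```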
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
Python:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="nvidia/Llama-3.3-70B-Instruct-FP8",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

cURL:
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "nvidia/Llama-3.3-70B-Instruct-FP8",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
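
Because this preset targets low time-to-first-token and the example input in the config sets `stream: true`, you may also want to stream the response and see how quickly the first token arrives. The sketch below assumes the deployment streams chunks in the standard OpenAI chat completions format. The helper names (`text_deltas`, `collect_with_ttft`, `stream_chat`) are hypothetical, and the timing is measured on the client, so it includes network latency.

```python
import os
import time
from typing import Iterable, Iterator, Optional, Tuple


def text_deltas(chunks: Iterable) -> Iterator[str]:
    """Yield the non-empty text fragments from a stream of chat completion chunks."""
    for chunk in chunks:
        # Some chunks may carry no choices or no text; skip them.
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            yield piece


def collect_with_ttft(
    pieces: Iterable[str], started_at: float
) -> Tuple[str, Optional[float]]:
    """Join streamed text and report seconds from started_at to the first fragment."""
    first_token_after = None
    parts = []
    for piece in pieces:
        if first_token_after is None:
            first_token_after = time.perf_counter() - started_at
        parts.append(piece)
    return "".join(parts), first_token_after


def stream_chat(model_id: str, prompt: str) -> Tuple[str, Optional[float]]:
    """Send one streaming request and return (full_text, client-side TTFT in seconds)."""
    from openai import OpenAI  # imported here so the helpers above work without the SDK

    client = OpenAI(
        api_key=os.environ["BASETEN_API_KEY"],
        base_url=f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
    )
    started_at = time.perf_counter()  # start timing before the request is sent
    stream = client.chat.completions.create(
        model="nvidia/Llama-3.3-70B-Instruct-FP8",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return collect_with_ttft(text_deltas(stream), started_at)
```

To try it, call `stream_chat` with your own model ID, for example `text, ttft = stream_chat("abcd1234", "What is machine learning?")`, with `BASETEN_API_KEY` set in the environment.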