Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.
Sign in to Baseten
uvx truss login --browser
Install the OpenAI SDK
uv pip install openai
nvidia/Llama-3.3-70B-Instruct-FP8 is a 70B-parameter dense model with up to 128K context. This preset serves Llama 3.3 70B Instruct on H100:4 through Baseten Inference Stack (TensorRT-LLM) with FP8 weights and tensor parallelism. It targets low time-to-first-token on the 70B chat model.

Hardware: H100 × 4
Engine: TRT-LLM v2
Context: 128K
Concurrency: 128
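To see why FP8 weights fit comfortably on this hardware, here is a rough back-of-the-envelope sketch of weight memory per GPU. It assumes exactly 70 billion parameters, one byte per FP8 weight, an even split across the four tensor-parallel ranks, and 80 GB of memory per H100 (the memory size is an assumption, not stated above). It ignores quantization scales, any layers kept in higher precision, activations, and the KV cache, so treat the numbers as an estimate only.

```python
# Rough weight-memory estimate under tensor parallelism (illustrative only).
PARAMS = 70e9          # "70B-parameter" from the model description (approximate)
BYTES_PER_PARAM = 1    # FP8 stores each weight in one byte
TENSOR_PARALLEL = 4    # H100 x 4
GPU_MEMORY_GB = 80     # assumption: 80 GB H100 variant

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9   # total weight footprint
per_gpu_gb = weights_gb / TENSOR_PARALLEL     # even shard per rank (assumed)
headroom_gb = GPU_MEMORY_GB - per_gpu_gb      # left for KV cache, activations, etc.

print(f"total weights: {weights_gb:.1f} GB")
print(f"per GPU:       {per_gpu_gb:.1f} GB")
print(f"headroom:      {headroom_gb:.1f} GB per GPU")
```

Under these assumptions, most of each GPU's memory stays free, and that headroom is what a long context and 128 concurrent requests draw on for KV cache.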

Write the config

Create and move into the project directory:
mkdir llama-3.3-70b-instruct-latency && cd llama-3.3-70b-instruct-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: "model:llama-3.3-70b-instruct preset:latency"

model_metadata:
  tags:
    - openai-compatible
  example_model_input:
    stream: true
    model: nvidia/Llama-3.3-70B-Instruct-FP8
    messages:
      - role: user
        content: Tell me everything you know about optimized inference.
    max_tokens: 512
    temperature: 0.5

python_version: py313

secrets:
  hf_access_token: null

weights:
  - source: hf://nvidia/Llama-3.3-70B-Instruct-FP8@main
    allow_patterns:
      - "*.safetensors"
      - "*.json"
      - "*.model"
      - tokenizer.model
      - "*.tiktoken"
      - "*.jinja"
    mount_location: /app/model_cache/llama-3-3-70b-instruct
    ignore_patterns:
      - original/*
      - "*.pth"
    auth_secret_name: hf_access_token

resources:
  cpu: "4"
  memory: 40Gi
  use_gpu: true
  accelerator: H100:4

data_dir: data

runtime:
  predict_concurrency: 128
  streaming_read_timeout: 60

trt_llm:
  build:
    checkpoint_repository:
      repo: michaelfeil/empty-model
      source: HF
      revision: main
      runtime_secret_name: hf_access_token
  runtime:
    max_seq_len: 131072
    patch_kwargs:
      model_path: /app/model_cache/llama-3-3-70b-instruct
      model_path_for_tokenizer: /app/model_cache/llama-3-3-70b-instruct
      cuda_graph_config:
        enable_padding: true
        max_batch_size: 128
    max_batch_size: 128
    max_num_tokens: 8192
    served_model_name: nvidia/Llama-3.3-70B-Instruct-FP8
    tensor_parallel_size: 4
    enable_chunked_prefill: true
  inference_stack: v2
  version_overrides:
    v2_llm_version: null
This config tells Baseten to compile a TensorRT-LLM engine for Llama 3.3 70B Instruct on four H100 GPUs, sharding the FP8 weights from nvidia/Llama-3.3-70B-Instruct-FP8 across the four tensor-parallel ranks. The runtime targets low time-to-first-token at moderate concurrency: up to 128 in-flight requests (predict_concurrency and max_batch_size), chunked prefill (enable_chunked_prefill), and CUDA graphs sized to the same 128-request batch ceiling (cuda_graph_config), so each new request hits a warm engine.
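Several values in the config repeat the same number in different places: the GPU count, the tensor parallel size, and the three batch or concurrency limits. The sketch below checks that they agree. The `check_config` helper and the invariants it enforces are illustrative assumptions inferred from the config above, not rules that Truss is documented here to enforce. The dictionary copies only the fields being compared.

```python
# Values copied from config.yaml; only the fields being compared are kept.
CONFIG = {
    "resources": {"accelerator": "H100:4"},
    "runtime": {"predict_concurrency": 128},
    "trt_llm": {
        "runtime": {
            "max_batch_size": 128,
            "tensor_parallel_size": 4,
            "patch_kwargs": {"cuda_graph_config": {"max_batch_size": 128}},
        }
    },
}


def check_config(cfg: dict) -> list:
    """Hypothetical helper: return a list of mismatches (empty if none)."""
    problems = []
    # "H100:4" -> 4 GPUs
    gpu_count = int(cfg["resources"]["accelerator"].split(":")[1])
    engine = cfg["trt_llm"]["runtime"]
    graph_batch = engine["patch_kwargs"]["cuda_graph_config"]["max_batch_size"]

    if engine["tensor_parallel_size"] != gpu_count:
        problems.append("tensor_parallel_size does not match GPU count")
    if cfg["runtime"]["predict_concurrency"] != engine["max_batch_size"]:
        problems.append("predict_concurrency does not match max_batch_size")
    if graph_batch != engine["max_batch_size"]:
        problems.append("cuda_graph_config.max_batch_size does not match max_batch_size")
    return problems


print(check_config(CONFIG) or "config values are mutually consistent")
```

If you change the accelerator or the batch size, a check like this is a cheap way to catch a value you forgot to update before pushing.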

Key parameters

Baseten Inference Stack (BIS) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:
Parameter | Value
Tensor parallel size | 4
Max sequence length | 131072
Max batch size | 128
Max batched tokens | 8192
Chunked prefill | enabled
Inference stack | v2
Served model name | nvidia/Llama-3.3-70B-Instruct-FP8
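The relationship between the maximum sequence length and the batched-token budget can be made concrete with a little arithmetic. The sketch below assumes a simplified model of chunked prefill in which one request gets the whole 8192-token budget on each engine iteration. In practice the budget is shared with other in-flight requests, so real prompts may take more iterations. The `prefill_chunks` helper is hypothetical.

```python
import math

MAX_SEQ_LEN = 131072     # max_seq_len from the config
MAX_NUM_TOKENS = 8192    # max_num_tokens from the config


def prefill_chunks(prompt_tokens: int, max_num_tokens: int = MAX_NUM_TOKENS) -> int:
    """Minimum number of prefill iterations for a prompt (simplified model)."""
    return math.ceil(prompt_tokens / max_num_tokens)


for n in (512, 8192, 20000, MAX_SEQ_LEN):
    print(f"{n:>6} prompt tokens -> {prefill_chunks(n)} chunk(s)")
```

Under this model a full 131072-token prompt needs at least 16 prefill iterations, while a prompt of 8192 tokens or fewer fits in one.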

Deploy

Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model llama-3.3-70b-instruct-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
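If you would rather not copy the ID by hand, a minimal sketch like the following extracts it from the logs URL. The function name `model_id_from_logs_url` is hypothetical, and the sketch assumes the URL always has the shape shown in the example output, with the model ID as the path segment right after `/models/`.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expected path shape: models/<model_id>/logs/<...>
    if len(parts) < 2 or parts[0] != "models" or not parts[1]:
        raise ValueError(f"unexpected logs URL: {url}")
    return parts[1]


logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
print(model_id_from_logs_url(logs_url))
```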

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} in the base URL with your model ID, make sure the BASETEN_API_KEY environment variable is set, and then call the deployment to run inference:
main.py
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="nvidia/Llama-3.3-70B-Instruct-FP8",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
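Because this preset targets time-to-first-token, and the example input in config.yaml sets stream: true, you may prefer to stream the reply rather than wait for the full response. The following is a minimal sketch using the OpenAI SDK's stream=True option. The `collect_text` and `stream_reply` helpers and the `MODEL_ID` placeholder are illustrative names, and the sketch assumes the deployment streams chunks in the standard OpenAI chat-completions format.

```python
import os

MODEL_ID = "abcd1234"  # placeholder: replace with your own model ID


def collect_text(chunks) -> str:
    """Print streamed text as it arrives and return the full reply."""
    pieces = []
    for chunk in chunks:
        # Some chunks carry no choices or no text (e.g. role-only or final chunks).
        if chunk.choices and chunk.choices[0].delta.content:
            piece = chunk.choices[0].delta.content
            print(piece, end="", flush=True)
            pieces.append(piece)
    print()
    return "".join(pieces)


def stream_reply(prompt: str, model_id: str = MODEL_ID) -> str:
    from openai import OpenAI  # imported here so the helper above has no dependency

    client = OpenAI(
        api_key=os.environ["BASETEN_API_KEY"],
        base_url=f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
    )
    stream = client.chat.completions.create(
        model="nvidia/Llama-3.3-70B-Instruct-FP8",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=0.5,
        stream=True,  # yield chunks as tokens are generated
    )
    return collect_text(stream)


# Only make the live request when run as a script with an API key available.
if __name__ == "__main__" and "BASETEN_API_KEY" in os.environ:
    stream_reply("What is machine learning?")
```

Keeping the chunk handling in its own function makes it easy to exercise without a network call, for example by passing in a list of stand-in chunk objects.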