Llama 3.3

Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.

Sign in to Baseten

uvx truss login --browser

Install the OpenAI SDK

uv pip install openai

nvidia/Llama-3.3-70B-Instruct-FP8 is a 70B-parameter dense model with a context window of up to 128K tokens. This preset serves it on four H100 GPUs (H100:4) through the Baseten Inference Stack (TensorRT-LLM), using FP8 weights and 4-way tensor parallelism. It is tuned for low time-to-first-token.

Spec	Value
Hardware	H100 × 4
Engine	TRT-LLM v2
Context	128K
Concurrency	128

Write the config

Create and move into the project directory:

mkdir llama-3.3-70b-instruct-latency && cd llama-3.3-70b-instruct-latency

Then create a file named config.yaml and paste the following:

config.yaml

model_name: "model:llama-3.3-70b-instruct preset:latency"

model_metadata:
  tags:
    - openai-compatible
  example_model_input:
    stream: true
    model: nvidia/Llama-3.3-70B-Instruct-FP8
    messages:
      - role: user
        content: Tell me everything you know about optimized inference.
    max_tokens: 512
    temperature: 0.5

python_version: py313

secrets:
  hf_access_token: null

weights:
  - source: hf://nvidia/Llama-3.3-70B-Instruct-FP8@main
    allow_patterns:
      - "*.safetensors"
      - "*.json"
      - "*.model"
      - tokenizer.model
      - "*.tiktoken"
      - "*.jinja"
    mount_location: /app/model_cache/llama-3-3-70b-instruct
    ignore_patterns:
      - original/*
      - "*.pth"
    auth_secret_name: hf_access_token

resources:
  cpu: "4"
  memory: 40Gi
  use_gpu: true
  accelerator: H100:4

data_dir: data

runtime:
  predict_concurrency: 128
  streaming_read_timeout: 60

trt_llm:
  build:
    checkpoint_repository:
      repo: michaelfeil/empty-model
      source: HF
      revision: main
      runtime_secret_name: hf_access_token
  runtime:
    max_seq_len: 131072
    patch_kwargs:
      model_path: /app/model_cache/llama-3-3-70b-instruct
      model_path_for_tokenizer: /app/model_cache/llama-3-3-70b-instruct
      cuda_graph_config:
        enable_padding: true
        max_batch_size: 128
    max_batch_size: 128
    max_num_tokens: 8192
    served_model_name: nvidia/Llama-3.3-70B-Instruct-FP8
    tensor_parallel_size: 4
    enable_chunked_prefill: true
  inference_stack: v2
  version_overrides:
    v2_llm_version: null

This config tells Baseten to compile a TensorRT-LLM engine for Llama 3.3 70B Instruct on four H100 GPUs, sharding FP8 weights from nvidia/Llama-3.3-70B-Instruct-FP8 across the four ranks. The runtime targets low time-to-first-token at moderate concurrency: 128 in-flight requests, chunked prefill, and CUDA graphs sized to the batch ceiling so each new request hits a warm engine.
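Several of these settings are meant to line up with each other, so it can help to compare them before pushing. The sketch below copies the relevant values from config.yaml into a plain dictionary (to avoid a YAML dependency) and checks three relationships that the config above satisfies. Whether Baseten enforces any of them is an assumption, and check_config is a hypothetical helper, not part of Truss.

```python
def check_config(cfg: dict) -> list[str]:
    """Return human-readable problems; an empty list means the values line up."""
    problems = []
    runtime = cfg["trt_llm"]["runtime"]

    # accelerator is written as "<GPU type>:<count>", for example "H100:4".
    parts = cfg["resources"]["accelerator"].split(":")
    gpu_count = int(parts[1]) if len(parts) > 1 else 1
    if runtime["tensor_parallel_size"] != gpu_count:
        problems.append(
            f"tensor_parallel_size={runtime['tensor_parallel_size']} "
            f"does not match {gpu_count} GPU(s)"
        )

    # Assumption: requests admitted beyond the batch ceiling would have to queue.
    if cfg["runtime"]["predict_concurrency"] > runtime["max_batch_size"]:
        problems.append("predict_concurrency exceeds max_batch_size")

    # The config above sizes CUDA graphs to the same batch ceiling.
    graph_batch = runtime["patch_kwargs"]["cuda_graph_config"]["max_batch_size"]
    if graph_batch != runtime["max_batch_size"]:
        problems.append("cuda_graph_config.max_batch_size differs from max_batch_size")

    return problems


# Values copied from config.yaml; only the fields the checks need.
config = {
    "resources": {"accelerator": "H100:4"},
    "runtime": {"predict_concurrency": 128},
    "trt_llm": {
        "runtime": {
            "max_batch_size": 128,
            "tensor_parallel_size": 4,
            "patch_kwargs": {"cuda_graph_config": {"max_batch_size": 128}},
        }
    },
}

print(check_config(config) or "config values are consistent")
```

If you change the accelerator or the batch ceiling later, rerunning a check like this is a cheap way to catch a value that was left behind.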

Key parameters

Baseten Inference Stack (BIS) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:

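Parameter	Field	Value
Tensor parallel size	`tensor_parallel_size`	`4`
Max sequence length	`max_seq_len`	`131072`
Max batch size	`max_batch_size`	`128`
Max batched tokens	`max_num_tokens`	`8192`
Chunked prefill	`enable_chunked_prefill`	`true`
Inference stack	`inference_stack`	`v2`
Served model name	`served_model_name`	`nvidia/Llama-3.3-70B-Instruct-FP8`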
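To get a feel for how the maximum sequence length and the batched-token budget interact when chunked prefill is on, a rough calculation helps. The sketch assumes the simplest case, where one request has the whole per-step token budget to itself. In practice the scheduler shares that budget across in-flight requests, so real step counts may be higher. prefill_chunks is a hypothetical helper for illustration only.

```python
import math

MAX_SEQ_LEN = 131072     # max_seq_len from config.yaml
MAX_NUM_TOKENS = 8192    # max_num_tokens from config.yaml


def prefill_chunks(prompt_tokens: int, token_budget: int = MAX_NUM_TOKENS) -> int:
    """Estimate how many prefill steps a prompt needs if it is split into
    chunks of at most token_budget tokens (idealized, single request)."""
    if not 1 <= prompt_tokens <= MAX_SEQ_LEN:
        raise ValueError(f"prompt_tokens must be between 1 and {MAX_SEQ_LEN}")
    return math.ceil(prompt_tokens / token_budget)


for n in (512, 8192, 20000, 131072):
    print(f"{n:>6} prompt tokens -> {prefill_chunks(n)} prefill step(s)")
```

Under this assumption, a prompt that fills the entire 131072-token window is processed in 16 steps of 8192 tokens each, while any prompt of 8192 tokens or fewer fits in a single step.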

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model llama-3.3-70b-instruct-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678

Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
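If you script your deployments, you may prefer to pull the model ID out of the logs URL rather than copy it by hand. The following is a minimal sketch that assumes the URL keeps the /models/{model_id}/logs/... shape shown above; model_id_from_logs_url and base_url_for are hypothetical helper names, and the base URL pattern is the one used in the next section.

```python
import re
from urllib.parse import urlparse


def model_id_from_logs_url(url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    match = re.search(r"/models/([^/]+)", urlparse(url).path)
    if match is None:
        raise ValueError(f"no /models/<id> segment in {url!r}")
    return match.group(1)


def base_url_for(model_id: str) -> str:
    """Build the OpenAI-compatible base URL for the production environment."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


# Example URL taken from the sample push output above.
logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
model_id = model_id_from_logs_url(logs_url)
print(model_id)
print(base_url_for(model_id))
```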

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID, make sure BASETEN_API_KEY is set in your environment, then call the deployment to run inference:

Python

main.py

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="nvidia/Llama-3.3-70B-Instruct-FP8",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

cURL

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "nvidia/Llama-3.3-70B-Instruct-FP8",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
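Because this preset targets low time-to-first-token, you will likely want to stream the response so tokens are shown as they arrive. The sketch below uses the OpenAI SDK's stream=True option with the same sampling values as example_model_input in config.yaml. iter_text and stream_reply are hypothetical helper names, and the SDK import sits inside stream_reply only so that iter_text can be tried without the SDK installed.

```python
import os
from typing import Iterable, Iterator


def iter_text(chunks: Iterable) -> Iterator[str]:
    """Yield the text fragment carried by each streamed chunk, skipping
    chunks that have no choices or no content."""
    for chunk in chunks:
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            yield piece


def stream_reply(model_id: str, prompt: str) -> str:
    """Stream a chat completion, print it as it arrives, and return the full text."""
    from openai import OpenAI  # local import; see the note above

    client = OpenAI(
        api_key=os.environ["BASETEN_API_KEY"],
        base_url=f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
    )
    stream = client.chat.completions.create(
        model="nvidia/Llama-3.3-70B-Instruct-FP8",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=0.5,
        stream=True,
    )
    parts = []
    for piece in iter_text(stream):
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts)


# Example call against a live deployment (abcd1234 is the sample model ID):
# stream_reply("abcd1234", "What is machine learning?")
```

Separating the chunk handling from the network call keeps the parsing logic easy to check on its own with stand-in chunk objects.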
