Llama 3.1

Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.

Sign in to Baseten

uvx truss login --browser

Install the OpenAI SDK

uv pip install openai

nvidia/Llama-3.1-8B-Instruct-FP8 is an 8B-parameter dense model with up to 128K context. This preset serves Llama 3.1 8B Instruct on a single B200 through the Baseten Inference Stack (TensorRT-LLM) with FP8 weights, an FP8 KV cache, and EAGLE3 speculative decoding. It targets high throughput under concurrent load.

Hardware	B200
Engine	TRT-LLM v2
Context	128K
Concurrency	512
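
To see why an FP8 KV cache matters at 128K context, the sketch below estimates the KV cache size per token and per full-length sequence. The architecture numbers (32 layers, 8 KV heads, head dimension 128) are assumptions taken from the published Llama 3.1 8B architecture, not from this preset, and the estimate ignores the EAGLE3 draft model and any runtime overhead.

```python
# Rough KV cache sizing for Llama 3.1 8B. Architecture values are assumptions
# from the public model architecture, not read from this preset.
NUM_LAYERS = 32        # assumed transformer layers
NUM_KV_HEADS = 8       # assumed grouped-query KV heads
HEAD_DIM = 128         # assumed per-head dimension
MAX_SEQ_LEN = 131072   # 128K context, from the preset


def kv_bytes_per_token(bytes_per_element: int) -> int:
    """Bytes of KV cache one token occupies (keys and values, all layers)."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_element


fp8_per_token = kv_bytes_per_token(1)    # FP8: 1 byte per element
fp16_per_token = kv_bytes_per_token(2)   # FP16/BF16: 2 bytes per element

fp8_full_seq_gib = fp8_per_token * MAX_SEQ_LEN / 2**30
fp16_full_seq_gib = fp16_per_token * MAX_SEQ_LEN / 2**30

print(f"FP8:  {fp8_per_token // 1024} KiB/token, {fp8_full_seq_gib:.0f} GiB per 128K sequence")
print(f"FP16: {fp16_per_token // 1024} KiB/token, {fp16_full_seq_gib:.0f} GiB per 128K sequence")
```

Under these assumptions, halving the bytes per element halves the cache each sequence needs, which leaves room for more concurrent sequences in the same GPU memory.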

Write the config

Create and move into the project directory:

mkdir llama-3.1-8b-instruct-throughput && cd llama-3.1-8b-instruct-throughput

Then create a file named config.yaml and paste the following:

config.yaml

model_name: "model:llama-3.1-8b-instruct preset:throughput"
model_metadata:
  example_model_input:
    messages:
      - role: user
        content: "Write FizzBuzz in Python"
    stream: true
    model: "nvidia/Llama-3.1-8B-Instruct-FP8"
    max_tokens: 512
    temperature: 0.5
  tags:
    - openai-compatible

resources:
  accelerator: B200
  cpu: "1"
  memory: 10Gi
  use_gpu: true

weights:
  - source: "hf://nvidia/Llama-3.1-8B-Instruct-FP8@main"
    mount_location: "/app/model_cache/trt_model"
    auth_secret_name: "hf_access_token"
  - source: "hf://yuhuili/EAGLE3-LLaMA3.1-Instruct-8B@main"
    mount_location: "/app/model_cache/eagle3_draft"
    auth_secret_name: "hf_access_token"

secrets:
  hf_access_token: null

trt_llm:
  build:
    checkpoint_repository:
      repo: michaelfeil/empty-model
      revision: main
      source: HF
  inference_stack: v2
  runtime:
    enable_chunked_prefill: true
    max_batch_size: 512
    max_num_tokens: 16384
    max_seq_len: 131072
    tensor_parallel_size: 1
    served_model_name: nvidia/Llama-3.1-8B-Instruct-FP8
    patch_kwargs:
      model_path: /app/model_cache/trt_model
      backend: pytorch
      sampler_type: TorchSampler
      guided_decoding_backend: xgrammar
      max_beam_width: 1
      max_input_len: 131072
      trust_remote_code: 1
      cuda_graph_config:
        enable_padding: true
        max_batch_size: 512
      kv_cache_config:
        dtype: fp8
        enable_block_reuse: true
        free_gpu_memory_fraction: 0.9
      speculative_config:
        decoding_type: Eagle
        max_draft_len: 3
        speculative_model_dir: /app/model_cache/eagle3_draft
        eagle3_one_model: true
  version_overrides:
    v2_llm_version: null

runtime:
  predict_concurrency: 512

This config tells Baseten to compile a TensorRT-LLM engine for Llama 3.1 8B Instruct on a single B200, pulling FP8 weights from nvidia/Llama-3.1-8B-Instruct-FP8 and an EAGLE3 draft model from yuhuili/EAGLE3-LLaMA3.1-Instruct-8B. The runtime is tuned for high concurrent throughput: up to 512 in-flight requests (predict_concurrency and max_batch_size), chunked prefill, an FP8 KV cache, and CUDA graphs with padding enabled and the same batch ceiling of 512.
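
Several values in the config are meant to line up with each other (the batch ceiling, the CUDA graph ceiling, and the request concurrency). The sketch below copies those values into a Python dict and checks them. The function name and the checks are illustrative, not something Truss or Baseten runs, and the last check assumes that prompts longer than max_num_tokens can only be scheduled when chunked prefill is on.

```python
# Values copied from config.yaml; the checks are illustrative only.
config = {
    "trt_llm": {
        "runtime": {
            "max_batch_size": 512,
            "max_num_tokens": 16384,
            "max_seq_len": 131072,
            "enable_chunked_prefill": True,
            "patch_kwargs": {
                "max_input_len": 131072,
                "cuda_graph_config": {"max_batch_size": 512},
            },
        }
    },
    "runtime": {"predict_concurrency": 512},
}


def check_throughput_config(cfg: dict) -> list[str]:
    """Return human-readable mismatches; an empty list means consistent."""
    rt = cfg["trt_llm"]["runtime"]
    patch = rt["patch_kwargs"]
    mismatches = []
    if cfg["runtime"]["predict_concurrency"] != rt["max_batch_size"]:
        mismatches.append("predict_concurrency differs from max_batch_size")
    if patch["cuda_graph_config"]["max_batch_size"] != rt["max_batch_size"]:
        mismatches.append("CUDA graph batch ceiling differs from max_batch_size")
    if patch["max_input_len"] > rt["max_seq_len"]:
        mismatches.append("max_input_len exceeds max_seq_len")
    # Assumption: without chunked prefill, a prompt must fit in one pass.
    if rt["max_num_tokens"] < patch["max_input_len"] and not rt["enable_chunked_prefill"]:
        mismatches.append("max_num_tokens < max_input_len requires chunked prefill")
    return mismatches


mismatches = check_throughput_config(config)
print(mismatches or "config is internally consistent")
```

A check like this is mainly useful when you change one value, for example lowering max_batch_size, and want a reminder of the other fields that were set to match it.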

Key parameters

Baseten Inference Stack (BIS) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:

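Parameter	Config field	Value
Tensor parallel size	`runtime.tensor_parallel_size`	`1`
Max sequence length	`runtime.max_seq_len`	`131072`
Max batch size	`runtime.max_batch_size`	`512`
Max batched tokens	`runtime.max_num_tokens`	`16384`
Chunked prefill	`runtime.enable_chunked_prefill`	`true`
Inference stack	`inference_stack`	`v2`
Served model name	`runtime.served_model_name`	`nvidia/Llama-3.1-8B-Instruct-FP8`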
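
The maximum sequence length (131072) is larger than the batched-token budget (16384), so a long prompt cannot be prefilled in one forward pass. Assuming chunked prefill splits a prompt into pieces of at most max_num_tokens, the sketch below gives a lower bound on the number of passes. The helper name is hypothetical, and the real scheduler shares the budget with other in-flight requests, so actual counts can be higher.

```python
import math

MAX_NUM_TOKENS = 16384   # max batched tokens per forward pass, from the preset
MAX_SEQ_LEN = 131072     # max sequence length, from the preset


def min_prefill_passes(prompt_tokens: int, max_num_tokens: int = MAX_NUM_TOKENS) -> int:
    """Lower bound on forward passes needed to prefill one prompt.

    Assumes the prompt has the whole token budget to itself; sharing the
    budget with other requests only increases the count.
    """
    if prompt_tokens <= 0:
        raise ValueError("prompt_tokens must be positive")
    return math.ceil(prompt_tokens / max_num_tokens)


for n in (2_000, 16_384, 20_000, MAX_SEQ_LEN):
    print(f"{n:>7} prompt tokens -> at least {min_prefill_passes(n)} prefill pass(es)")
```

Under that assumption, a prompt at the full 128K context needs at least 8 passes, while typical short prompts fit in one.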

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model llama-3.1-8b-instruct-throughput was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678

Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
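
If you script deployments, you can pull the model ID out of the logs URL instead of copying it by hand. This is a minimal sketch using only the standard library. Both function names are hypothetical, and it assumes the logs URL keeps the /models/{model_id}/logs/... shape shown in the example output.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    segments = [s for s in urlparse(logs_url).path.split("/") if s]
    try:
        return segments[segments.index("models") + 1]
    except (ValueError, IndexError):
        raise ValueError(f"no model ID found in {logs_url!r}") from None


def base_url_for(model_id: str) -> str:
    """Build the OpenAI-compatible base URL used when calling the model."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
model_id = model_id_from_logs_url(logs_url)
print(model_id)                 # abcd1234
print(base_url_for(model_id))
```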

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID, make sure the BASETEN_API_KEY environment variable is set, then call the deployment to run inference:

Python

main.py

import os
from openai import OpenAI

# Replace {model_id} in base_url with your model ID before running.
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="nvidia/Llama-3.1-8B-Instruct-FP8",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

cURL

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
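
The example input in config.yaml sets stream: true, so a streaming call is a natural next step. The sketch below assumes the deployment streams chunks in the OpenAI chat completions format and that you pass in a client built as in main.py. The helper names iter_text and stream_chat are hypothetical.

```python
from typing import Iterable, Iterator


def iter_text(chunks: Iterable) -> Iterator[str]:
    """Yield text pieces from OpenAI-style streaming chunks.

    Skips chunks with no choices or an empty delta, which can occur in a
    stream (for example a final usage-only chunk).
    """
    for chunk in chunks:
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            yield piece


def stream_chat(client, prompt: str) -> str:
    """Stream one chat completion, printing text as it arrives."""
    stream = client.chat.completions.create(
        model="nvidia/Llama-3.1-8B-Instruct-FP8",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=0.5,
        stream=True,
    )
    parts = []
    for piece in iter_text(stream):
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts)
```

With the client from main.py, a call such as stream_chat(client, "Write FizzBuzz in Python") prints the reply incrementally and returns the full text.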
