Setup
To get started, sign in to Baseten with Truss and then install the OpenAI SDK.
Sign in to Baseten:
uvx truss login --browser
Install the OpenAI SDK:
pip install openai
nvidia/Llama-3.1-8B-Instruct-FP8 is an 8B-parameter dense model with up to 128K context.
This preset serves Llama 3.1 8B Instruct on a single B200 through Baseten Inference Stack (TensorRT-LLM) with FP8 weights, an FP8 KV cache, and EAGLE3 speculative decoding. It targets high concurrent throughput.
Write the config
Create and move into the project directory:
mkdir llama-3.1-8b-instruct-throughput && cd llama-3.1-8b-instruct-throughput
Then create a file named config.yaml and paste the following:
model_name: "model:llama-3.1-8b-instruct preset:throughput"
model_metadata:
  example_model_input:
    messages:
      - role: user
        content: "Write FizzBuzz in Python"
    stream: true
    model: "nvidia/Llama-3.1-8B-Instruct-FP8"
    max_tokens: 512
    temperature: 0.5
  tags:
    - openai-compatible
resources:
  accelerator: B200
  cpu: "1"
  memory: 10Gi
  use_gpu: true
weights:
  - source: "hf://nvidia/Llama-3.1-8B-Instruct-FP8@main"
    mount_location: "/app/model_cache/trt_model"
    auth_secret_name: "hf_access_token"
  - source: "hf://yuhuili/EAGLE3-LLaMA3.1-Instruct-8B@main"
    mount_location: "/app/model_cache/eagle3_draft"
    auth_secret_name: "hf_access_token"
secrets:
  hf_access_token: null
trt_llm:
  build:
    checkpoint_repository:
      repo: michaelfeil/empty-model
      revision: main
      source: HF
  inference_stack: v2
  runtime:
    enable_chunked_prefill: true
    max_batch_size: 512
    max_num_tokens: 16384
    max_seq_len: 131072
    tensor_parallel_size: 1
    served_model_name: nvidia/Llama-3.1-8B-Instruct-FP8
    patch_kwargs:
      model_path: /app/model_cache/trt_model
      backend: pytorch
      sampler_type: TorchSampler
      guided_decoding_backend: xgrammar
      max_beam_width: 1
      max_input_len: 131072
      trust_remote_code: 1
      cuda_graph_config:
        enable_padding: true
        max_batch_size: 512
      kv_cache_config:
        dtype: fp8
        enable_block_reuse: true
        free_gpu_memory_fraction: 0.9
      speculative_config:
        decoding_type: Eagle
        max_draft_len: 3
        speculative_model_dir: /app/model_cache/eagle3_draft
        eagle3_one_model: true
  version_overrides:
    v2_llm_version: null
runtime:
  predict_concurrency: 512
This config tells Baseten to compile a TensorRT-LLM engine for Llama 3.1 8B Instruct on a single B200, pulling FP8 weights from nvidia/Llama-3.1-8B-Instruct-FP8 and an EAGLE3 draft speculator from yuhuili/EAGLE3-LLaMA3.1-Instruct-8B. The runtime is tuned for high concurrent throughput: 512 in-flight requests, chunked prefill, an FP8 KV cache, and CUDA graphs sized to the same batch ceiling so the engine stays hot under load.
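The "same batch ceiling" relationship described above can be checked mechanically before pushing. The following is a minimal sketch, not a Baseten validation rule: the dictionary is a hand-copied subset of the config values, and `check_throughput_settings` is a hypothetical helper whose checks reflect only the relationships stated in the paragraph above.

```python
# Hand-copied subset of the config values (trt_llm runtime settings).
runtime = {
    "max_batch_size": 512,
    "max_num_tokens": 16384,
    "max_seq_len": 131072,
    "patch_kwargs": {
        "max_input_len": 131072,
        "cuda_graph_config": {"max_batch_size": 512},
    },
}
predict_concurrency = 512  # from the top-level runtime block


def check_throughput_settings(runtime: dict, predict_concurrency: int) -> list[str]:
    """Return human-readable mismatches; an empty list means consistent."""
    problems = []
    patch = runtime["patch_kwargs"]
    batch = runtime["max_batch_size"]
    # CUDA graphs are described as sized to the same batch ceiling.
    if patch["cuda_graph_config"]["max_batch_size"] != batch:
        problems.append("CUDA graph batch size differs from max_batch_size")
    # The config admits as many in-flight requests as the engine batches.
    if predict_concurrency != batch:
        problems.append("predict_concurrency differs from max_batch_size")
    # An input can never be longer than the full sequence budget.
    if patch["max_input_len"] > runtime["max_seq_len"]:
        problems.append("max_input_len exceeds max_seq_len")
    return problems


print(check_throughput_settings(runtime, predict_concurrency))  # prints []
```

If you lower `max_batch_size` for a smaller GPU, a check like this flags the other two batch-related values that would need to change with it.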
Key parameters
Baseten Inference Stack (BIS) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:
| Parameter | Value |
|---|---|
| Tensor parallel size | 1 |
| Max sequence length | 131072 |
| Max batch size | 512 |
| Max batched tokens | 16384 |
| Chunked prefill | enabled |
| Inference stack | v2 |
| Served model name | nvidia/Llama-3.1-8B-Instruct-FP8 |
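To see how these limits interact, here is a short worked calculation. It assumes that each scheduler step processes at most "Max batched tokens" tokens summed across all requests in the batch, which is the usual meaning of that limit; the helper name `min_prefill_steps` is hypothetical and the numbers are lower bounds, not measured behavior.

```python
import math

MAX_SEQ_LEN = 131072     # Max sequence length
MAX_NUM_TOKENS = 16384   # Max batched tokens per step
MAX_BATCH_SIZE = 512     # Max batch size


def min_prefill_steps(prompt_tokens: int, max_num_tokens: int = MAX_NUM_TOKENS) -> int:
    """Fewest steps needed to prefill a prompt when it is split into chunks."""
    return math.ceil(prompt_tokens / max_num_tokens)


# A prompt that fills the whole context needs at least this many chunks.
print(min_prefill_steps(MAX_SEQ_LEN))    # prints 8

# With every batch slot occupied, the average per-request token budget per step.
print(MAX_NUM_TOKENS // MAX_BATCH_SIZE)  # prints 32
```

Under that assumption, chunked prefill is what lets a very long prompt share the engine with short decode requests instead of needing one oversized step.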
Deploy
Push the config to Baseten from the project directory:
uvx truss push
You should see output similar to:
✨ Model llama-3.1-8b-instruct-throughput was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
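If you script your deployments, the model ID can be pulled out of the logs URL programmatically. This is a small sketch using only the standard library; `model_id_from_logs_url` is a hypothetical helper, and it assumes the URL has the `/models/<id>/...` shape shown in the example output.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    parts = [p for p in urlparse(logs_url).path.split("/") if p]
    idx = parts.index("models")  # raises ValueError if the segment is missing
    return parts[idx + 1]


print(model_id_from_logs_url(
    "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
))  # prints abcd1234
```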
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure the BASETEN_API_KEY environment variable is set.
Then call your deployment to run inference:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="nvidia/Llama-3.1-8B-Instruct-FP8",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

The same request with curl:
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
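
The example input in the config sets `stream: true`, so a streaming call is a natural variant of the request above. The sketch below assumes the deployment streams chunks through the standard OpenAI SDK interface (`stream=True`); `build_base_url` and `stream_chat` are hypothetical helper names, and the SDK is imported lazily so the URL helper works without it installed.

```python
import os


def build_base_url(model_id: str) -> str:
    """Fill the {model_id} placeholder in the production endpoint URL."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


def stream_chat(model_id: str, prompt: str) -> str:
    """Print the reply as it arrives and return the full text."""
    from openai import OpenAI  # lazy import: only needed for the network call

    client = OpenAI(
        api_key=os.environ["BASETEN_API_KEY"],
        base_url=build_base_url(model_id),
    )
    stream = client.chat.completions.create(
        model="nvidia/Llama-3.1-8B-Instruct-FP8",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        stream=True,
    )
    pieces = []
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            piece = chunk.choices[0].delta.content
            print(piece, end="", flush=True)
            pieces.append(piece)
    print()
    return "".join(pieces)
```

Calling `stream_chat("abcd1234", "Write FizzBuzz in Python")` with your own model ID in place of `abcd1234` prints tokens as they arrive rather than waiting for the full completion.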