Setup
To get started, sign into Baseten with Truss and then install the OpenAI SDK.Sign in to Baseten
Install the OpenAI SDK
Hardware
B200
Engine
TRT-LLM v2
Context
128K
Concurrency
512
Write the config
Create and move into the project directory:config.yaml and paste the following:
config.yaml
nvidia/Llama-3.1-8B-Instruct-FP8 and an EAGLE3 draft speculator from yuhuili/EAGLE3-LLaMA3.1-Instruct-8B. The runtime is tuned for high concurrent throughput: 512 in-flight requests, chunked prefill, an FP8 KV cache, and CUDA graphs sized to the same batch ceiling so the engine stays hot under load.
Key parameters
Baseten Inference Stack (BIS) reads these fields from thetrt_llm block. Each one shapes how the engine is built and served:
| Parameter | Value |
|---|---|
| Tensor parallel size | 1 |
| Max sequence length | 131072 |
| Max batch size | 512 |
| Max batched tokens | 16384 |
| Chunked prefill | enabled |
| Inference stack | v2 |
| Served model name | nvidia/Llama-3.1-8B-Instruct-FP8 |
Deploy
Push the config to Baseten:/models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace{model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
- Python
- cURL
main.py