For the best performance, we recommend using our TensorRT-LLM Engine-Builder when deploying LLMs. Models deployed with the Engine-Builder are OpenAI compatible, support structured output and function calling, and offer deploy-time post-training quantization: FP8 on Hopper GPUs and NVFP4 on Blackwell GPUs.
The Engine-Builder supports LLMs from the following families, both foundation models and fine-tunes:
- Llama 3.0 and later (including DeepSeek-R1 distills)
- Qwen 2.5 and later (including Math, Coder, and DeepSeek-R1 distills)
- Mistral (all LLMs)
You can find preset Engine-Builder configs for common models in the Engine-Builder reference.
The Engine-Builder does not support vision-language models like Llama 3.2 11B or Pixtral. For these models, we recommend vLLM.
Example: Deploy Qwen 2.5 3B on an H100
This configuration builds an inference engine to serve Qwen 2.5 3B on an H100 GPU. This model is fast and cheap to run, making it a good example for documentation, but the deployment process is very similar for larger models like GLM-4.7.
Setup
Before you deploy a model, you’ll need three quick setup steps.
Create an API key for your Baseten account
Create an API key and save it as an environment variable:

```sh
export BASETEN_API_KEY="abcd.123456"
```
Add an access token for Hugging Face
Some models require that you accept terms and conditions on Hugging Face before deployment. To prevent issues:
- Accept the license for any gated models you wish to access, like Gemma 3.
- Create a read-only user access token from your Hugging Face account.
- Add the hf_access_token secret to your Baseten workspace.
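If you want to confirm the token before deploying, here is an optional sanity check using the huggingface_hub client (our own illustrative snippet, not part of the Baseten setup; install huggingface_hub separately):

```python
# Optional sanity check (not required by Baseten): verify that your
# Hugging Face token is valid and can see the model repo.
from huggingface_hub import HfApi  # pip install huggingface_hub

api = HfApi(token="hf_...")  # your read-only user access token
print(api.whoami()["name"])  # prints your HF username if the token is valid

# Raises an error if the token cannot access the (possibly gated) repo:
api.model_info("Qwen/Qwen2.5-3B-Instruct")
```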
Install the OpenAI SDK
This guide uses uvx to run Truss commands without a separate install step. Install the OpenAI SDK for calling the model:

```sh
uv venv && source .venv/bin/activate
uv pip install openai
```
Configuration
Start with an empty configuration file:

```sh
mkdir qwen-2-5-3b-engine
touch qwen-2-5-3b-engine/config.yaml
```
This configuration file specifies model information and Engine-Builder arguments. You can find details on each config option in the Engine-Builder reference.
Below is an example for Qwen 2.5 3B.
```yaml
model_metadata:
  tags:
  - openai-compatible
  example_model_input: # Loads sample request into Baseten playground
    messages:
    - role: system
      content: "You are a helpful assistant."
    - role: user
      content: "What does Tongyi Qianwen mean?"
    stream: true
    max_tokens: 512
    temperature: 0.6 # Check recommended temperature per model
  repo_id: Qwen/Qwen2.5-3B-Instruct
model_name: Qwen 2.5 3B Instruct
python_version: py39
resources: # Engine-Builder GPU cannot be changed post-deployment
  accelerator: H100
  use_gpu: true
secrets: {}
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      repo: Qwen/Qwen2.5-3B-Instruct
      source: HF
    num_builder_gpus: 1
    quantization_type: no_quant # `fp8_kv` often recommended for large models
    max_seq_len: 32768 # Option to vary the max sequence length, e.g. 131072 for Llama models
    tensor_parallel_count: 1 # Set equal to number of GPUs
    plugin_configuration:
      use_paged_context_fmha: true
      use_fp8_context_fmha: false # Set to true when using `fp8_kv`
      paged_kv_cache: true
  runtime:
    batch_scheduler_policy: max_utilization
    enable_chunked_context: true
    request_default_max_tokens: 32768 # 131072 for Llama models
```
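As a rough sanity check on these choices (our own back-of-envelope estimate, not from the Engine-Builder reference): a 3B-parameter model in 16-bit precision needs only about 6 GB for weights, which fits comfortably on a single 80 GB H100, so no_quant and a tensor_parallel_count of 1 are reasonable here.

```python
# Back-of-envelope weight memory estimate (illustrative assumption only).
params = 3e9          # Qwen 2.5 3B
bytes_per_param = 2   # 16-bit (bf16/fp16) weights under no_quant
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB of weights vs. 80 GB of H100 memory")  # ~6 GB
```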
Deployment
Pushing the model to Baseten kicks off a multi-stage build and deployment process.
```sh
uvx truss push qwen-2-5-3b-engine
```
Upon deployment, check your terminal logs or Baseten account to find the URL for the model server.
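The base URL follows a predictable pattern (see the comment in the inference example below). A hypothetical helper for constructing it, where the model ID comes from your dashboard or the truss push output:

```python
# Hypothetical helper: build the OpenAI-compatible base URL for a deployment.
# The URL shape matches the comment in the inference example below; the model
# ID appears in your Baseten dashboard and in the `truss push` output.
def baseten_base_url(model_id: str) -> str:
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"
```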
Inference
This model is OpenAI compatible and can be called using the OpenAI client.
```python
import os

from openai import OpenAI

# e.g. https://model-XXXXXXX.api.baseten.co/environments/production/sync/v1
model_url = ""

client = OpenAI(
    base_url=model_url,
    api_key=os.environ.get("BASETEN_API_KEY"),
)

stream = client.chat.completions.create(
    model="baseten",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What does Tongyi Qianwen mean?"},
    ],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```
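As noted at the top of this guide, Engine-Builder deployments also support function calling and structured output through the same OpenAI-compatible interface. A minimal function-calling sketch, reusing the client from above (the get_weather tool definition is a made-up example, not part of the Baseten API):

```python
# Function calling through the same OpenAI-compatible endpoint.
# The `get_weather` tool below is a hypothetical example tool definition.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="baseten",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```

If the model decides to use the tool, the response carries tool_calls instead of plain text content.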
That’s it! You have successfully deployed and called an LLM optimized with the TensorRT-LLM Engine-Builder. Check the Engine-Builder reference for details on each config option.