Fast LLMs with TensorRT-LLM

To get the best performance, we recommend using our TensorRT-LLM Engine Builder when deploying LLMs. Models deployed with the Engine Builder are OpenAI compatible, support structured output and function calling, and offer deploy-time post-training quantization to FP8 on Hopper GPUs. The Engine Builder supports LLMs from the following families, including both foundation models and fine-tunes:

Llama 3.0 and later (including DeepSeek-R1 distills)
Qwen 2.5 and later (including Math, Coder, and DeepSeek-R1 distills)
Mistral (all LLMs)

You can download preset Engine Builder configs for common models from the model library.

The Engine Builder does not support vision-language models like Llama 3.2 11B or Pixtral. For these models, we recommend vLLM.

Example: Deploy Qwen 2.5 3B on an H100

This configuration builds an inference engine to serve Qwen 2.5 3B on an H100 GPU. Running this model is fast and cheap, making it a good example for documentation, but the deployment process is very similar to that for larger models like Llama 3.3 70B.

Setup

Before you deploy a model, complete three quick setup steps.

Create an API key for your Baseten account

Create an API key and save it as an environment variable:

export BASETEN_API_KEY="abcd.123456"
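
A missing or empty key usually surfaces later as an authentication error at inference time. As a minimal sketch, a small check like the one below can fail early with a clearer message. The helper name `require_api_key` is hypothetical and not part of Truss or the OpenAI SDK; only the `BASETEN_API_KEY` variable name comes from the step above.

```python
import os


def require_api_key(env=os.environ, name="BASETEN_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; `env` is any mapping, which
    makes the check easy to exercise without touching the real environment.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before calling the model.")
    return key
```

Calling `require_api_key()` at the top of a script would stop it before any request is sent if the variable was never exported.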

Add an access token for Hugging Face

Some models require that you accept terms and conditions on Hugging Face before deployment. To prevent issues:

Accept the license for any gated models you wish to access, like Llama 3.3.
Create a read-only user access token from your Hugging Face account.
Add the hf_access_token secret to your Baseten workspace.

Install Truss in your local development environment

Install the latest version of Truss, our open-source model packaging framework, as well as OpenAI’s model inference SDK, with:

pip install --upgrade truss openai

Configuration

Start with an empty configuration file.

mkdir qwen-2-5-3b-engine
touch qwen-2-5-3b-engine/config.yaml

This configuration file specifies model information and Engine Builder arguments. You can find dozens of examples in the model library as well as details on each config option in the engine builder reference. Below is an example for Qwen 2.5 3B.

config.yaml

model_metadata:
  example_model_input: # Loads sample request into Baseten playground
    messages:
        - role: system
          content: "You are a helpful assistant."
        - role: user
          content: "What does Tongyi Qianwen mean?"
    stream: true
    max_tokens: 512
    temperature: 0.6  # Check recommended temperature per model
  repo_id: Qwen/Qwen2.5-3B-Instruct
model_name: Qwen 2.5 3B Instruct
python_version: py39
resources: # Engine Builder GPU cannot be changed post-deployment
  accelerator: H100
  use_gpu: true
secrets: {}
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      repo: Qwen/Qwen2.5-3B-Instruct
      source: HF
    num_builder_gpus: 1
    quantization_type: no_quant # `fp8_kv` often recommended for large models
    max_seq_len: 32768 # Option to vary the max sequence length, e.g. 131072 for Llama models
    tensor_parallel_count: 1 # Set equal to number of GPUs
    plugin_configuration:
      use_paged_context_fmha: true
      use_fp8_context_fmha: false # Set to true when using `fp8_kv`
      paged_kv_cache: true
  runtime:
    batch_scheduler_policy: max_utilization
    enable_chunked_context: true
    request_default_max_tokens: 32768 # 131072 for Llama models
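
The inline comments in this config imply two consistency rules: `tensor_parallel_count` should equal the number of GPUs, and `use_fp8_context_fmha` should be true when `quantization_type` is `fp8_kv`. The sketch below checks those two rules on an already-parsed `trt_llm` section. It is an illustration only: `check_engine_config` and its `gpu_count` parameter are hypothetical, the config is written as a plain dict to avoid assuming a YAML parser, and the Engine Builder performs its own validation that this does not replace.

```python
def check_engine_config(trt_llm, gpu_count):
    """Return a list of human-readable problems found in a `trt_llm` section.

    Hypothetical helper: it only encodes the two rules stated in the
    config comments, not the Engine Builder's actual validation.
    """
    build = trt_llm.get("build", {})
    plugins = build.get("plugin_configuration", {})
    problems = []

    # Rule 1: tensor_parallel_count should equal the number of GPUs.
    tp_count = build.get("tensor_parallel_count", 1)
    if tp_count != gpu_count:
        problems.append(
            f"tensor_parallel_count is {tp_count} but gpu_count is {gpu_count}"
        )

    # Rule 2: use_fp8_context_fmha should be true when using fp8_kv.
    uses_fp8_kv = build.get("quantization_type") == "fp8_kv"
    if uses_fp8_kv and not plugins.get("use_fp8_context_fmha", False):
        problems.append("fp8_kv is set but use_fp8_context_fmha is not true")

    return problems


# The relevant fields of the Qwen 2.5 3B example, as a dict.
example = {
    "build": {
        "quantization_type": "no_quant",
        "tensor_parallel_count": 1,
        "plugin_configuration": {"use_fp8_context_fmha": False},
    }
}

print(check_engine_config(example, gpu_count=1))  # prints []
```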

Deployment

Pushing the model to Baseten kicks off a multi-stage build and deployment process.

truss push qwen-2-5-3b-engine --publish

Upon deployment, check your terminal logs or Baseten account to find the URL for the model server.
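
If you only have the model ID at hand, the base URL can be assembled from it. The sketch below assumes the URL shape shown in the comment of the inference example (`https://model-XXXXXXX.api.baseten.co/environments/production/sync/v1`); the function name `production_base_url` is hypothetical, and the URL printed in your logs or account remains the authoritative value.

```python
def production_base_url(model_id):
    """Build the OpenAI-compatible base URL for a model's production environment.

    Assumes the URL pattern from the inference example; `model_id` is the
    identifier that replaces XXXXXXX in that pattern.
    """
    model_id = model_id.strip()
    if not model_id:
        raise ValueError("model_id must be a non-empty string")
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"
```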

Inference

This model is OpenAI compatible and can be called using the OpenAI client.

import os
from openai import OpenAI

# Paste your model's URL here. It has the form:
# https://model-XXXXXXX.api.baseten.co/environments/production/sync/v1
model_url = ""

client = OpenAI(
    base_url=model_url,
    api_key=os.environ.get("BASETEN_API_KEY"),
)

stream = client.chat.completions.create(
    model="baseten",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What does Tongyi Qianwen mean?"}
    ],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
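
When you need the full response as a single string rather than printed output, the same loop can accumulate the deltas. The sketch below is one way to do that; `collect_text` and `fake_chunk` are hypothetical names, and the stand-in chunks only mimic the `choices[0].delta.content` shape used in the loop above so the helper can be tried without a deployed model. It also skips chunks whose `choices` list is empty, which an OpenAI-compatible stream may emit in some configurations.

```python
from types import SimpleNamespace


def collect_text(stream):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # tolerate chunks that carry no choices
            continue
        piece = chunk.choices[0].delta.content
        if piece is not None:
            parts.append(piece)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in with the same attribute shape as a streamed chunk (illustration only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
```

With a real deployment, `collect_text(stream)` would take the object returned by `client.chat.completions.create(..., stream=True)`.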

That’s it! You have successfully deployed and called an LLM optimized with the TensorRT-LLM Engine Builder. Check the model library for more examples and the engine builder reference for details on each config option.

Examples

Model library
