Deploying a TensorRT-LLM model with the Engine Builder is a three-step process:

  1. Pick a model and GPU instance
  2. Write your engine configuration and optional model serving code
  3. Deploy your packaged model and the engine will be built automatically

In this guide, we’ll walk through the process of using the Engine Builder end-to-end. To make this tutorial as quick and cheap as possible, we’ll use a 1.1-billion-parameter TinyLlama model on an A10G GPU.

We also have production-ready examples for Llama 3, Mistral, and Whisper.

Prerequisites

Before you deploy a model, you’ll need to complete three quick setup steps.

1. Create an API key for your Baseten account

Create an API key and save it as an environment variable:

export BASETEN_API_KEY="abcd.123456"
2. Add an access token for Hugging Face

Some models require that you accept terms and conditions on Hugging Face before deployment. To prevent issues:

  1. Accept the license for any gated models you wish to access, like Llama 3.
  2. Create a read-only user access token from your Hugging Face account.
  3. Add the hf_access_token secret to your Baseten workspace.
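
When you package the model with Truss below, you’ll also typically declare this secret in config.yaml so the Engine Builder can pull gated weights at build time. A minimal sketch, assuming the standard Truss secrets field (the value stays null as a placeholder; the real token lives only in your workspace secrets):

config.yaml
secrets:
  hf_access_token: null
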
3. Install Truss in your local development environment

Install the latest version of Truss, our open-source model packaging framework, with:

pip install --upgrade truss

Configure your engine

We’ll start by creating a new Truss:

truss init tinyllama-trt
cd tinyllama-trt

In the newly created tinyllama-trt/ folder, open config.yaml. In this file, we’ll configure our model serving engine:

config.yaml
model_name: tinyllama-trt
python_version: py310
resources:
  accelerator: A10G
  use_gpu: True
trt_llm:
  build:
    max_input_len: 2048
    max_output_len: 2048
    max_batch_size: 1
    max_beam_width: 1
    base_model: llama
    quantization_type: no_quant
    checkpoint_repository:
      repo: TinyLlama/TinyLlama-1.1B-Chat-v1.0
      source: HF

This build configuration sets a number of important parameters:

  • max_input_len and max_output_len control the sequence shapes for input and output. We want to match these as closely as possible to expected real-world use to improve engine performance.
  • max_batch_size lets us trade off between latency and throughput/cost. Larger batches increase the total number of requests that can be processed at once but decrease the perceived speed of each request.
  • max_beam_width is always set to 1 as we don’t currently perform beam search.
  • base_model determines which type of supported model architecture to build the engine for.
  • quantization_type specifies whether the model should be quantized on deployment. no_quant runs the model in standard fp16 precision.
  • checkpoint_repository determines where to load the weights from, in this case a Hugging Face repository for TinyLlama.
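
As an illustration of these trade-offs, a more throughput-oriented build might raise the batch size and size the sequence shapes to match real traffic. The values below are a sketch, not a tuned recommendation for TinyLlama on an A10G:

config.yaml
trt_llm:
  build:
    max_input_len: 4096   # sized to the longest prompts you expect
    max_output_len: 512   # completions are typically much shorter than prompts
    max_batch_size: 16    # more concurrent requests, slower individual requests
    max_beam_width: 1     # beam search is not currently supported
    base_model: llama
    quantization_type: no_quant   # or one of the supported quantization options
    checkpoint_repository:
      repo: TinyLlama/TinyLlama-1.1B-Chat-v1.0
      source: HF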

The config.yaml file also contains Baseten-specific configuration for model name, GPU type, and model serving environment.

Delete or update model.py

The config.yaml file above specifies a complete TensorRT-LLM engine. However, the model/model.py file in Truss gives you further control over serving behavior.

If you do not need any custom logic in model/model.py, delete the file. If you leave the default model.py unchanged, you’ll get the following error on deployment:

truss.errors.ValidationError: Model class `__init__` method
is required to have `trt_llm` as an argument.
Please add that argument.

The model/model.py file is useful for custom behaviors like applying a prompt template.

model/model.py
from typing import Any
from transformers import AutoTokenizer

class Model:
    def __init__(self, trt_llm, **kwargs) -> None:
        # The Engine Builder injects the built TensorRT-LLM engine here
        self._engine = trt_llm["engine"]
        self._tokenizer = None

    def load(self) -> None:
        self._tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    async def predict(self, model_input: Any) -> Any:
        # Apply the chat template to the raw prompt before handing it to the engine.
        # apply_chat_template expects a list of messages, so wrap the prompt as a user turn.
        model_input["prompt"] = self._tokenizer.apply_chat_template(
            [{"role": "user", "content": model_input["prompt"]}],
            tokenize=False,
            add_generation_prompt=True,
        )
        return await self._engine.predict(model_input)

Including a model/model.py file is optional. If the file is not present, the TensorRT-LLM engine will run according to its base spec.

Deploy and build

To deploy your model and have the TensorRT-LLM engine automatically build, run:

truss push --publish

This will create a new deployment in your Baseten workspace. Navigate to the model dashboard to see engine building and model deployment logs.

The engines are stored in Baseten but owned by the user — we’re working on a mechanism for downloading them. In the meantime, reach out if you need access to an engine that you created using the Engine Builder.

Call deployed model

When your model is deployed, you can call it via its API endpoint:

call_model.py
import os

import requests

# Model ID for production deployment
model_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Call model endpoint
resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
      "messages": [{"role": "user", "content": "How awesome is TensorRT-LLM?"}],
      "max_tokens": 1024
    },
    stream=True
)

# Print the generated tokens as they get streamed
for content in resp.iter_content():
    print(content.decode("utf-8"), end="", flush=True)

Supported parameters for LLMs:

  • prompt (string): The input text prompt to guide the language model’s generation. Exactly one of prompt or messages must be provided.
  • messages (List[Dict]): A list of dictionaries representing the message history, typically used in conversational contexts. Exactly one of prompt or messages must be provided.
  • max_tokens (int): The maximum number of tokens to generate in the output. Controls the length of the generated text.
  • beam_width (int, default: 1): The number of beams used in beam search. Maximum of 1.
  • repetition_penalty (float): A penalty applied to repeated tokens to discourage the model from repeating the same words or phrases.
  • presence_penalty (float): A penalty applied to tokens already present in the prompt to encourage the generation of new topics.
  • temperature (float): Controls the randomness of the output. Lower values make the output more deterministic, while higher values increase randomness.
  • length_penalty (float): A penalty applied to the length of the generated sequence to control verbosity. Higher values make the model favor shorter outputs.
  • end_id (int): The token ID that indicates the end of the generated sequence.
  • pad_id (int): The token ID used for padding sequences to a uniform length.
  • runtime_top_k (int): Limits the sampling pool to the top k tokens, ensuring the model only considers the most likely tokens at each step.
  • runtime_top_p (float): Applies nucleus sampling to limit the sampling pool to a cumulative probability p, ensuring only the most likely tokens are considered.
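
As a usage sketch, here is a request that combines several of these parameters with the prompt input. The parameter values are illustrative, and the response handling mirrors the streaming pattern from call_model.py above:

import os

import requests

model_id = ""  # same production deployment ID as above
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "prompt": "Summarize the plot of Hamlet in two sentences.",
        "max_tokens": 256,
        "temperature": 0.7,
        "runtime_top_p": 0.9,
        "repetition_penalty": 1.1,
    },
    stream=True,
)

# Print the generated tokens as they get streamed
for content in resp.iter_content():
    print(content.decode("utf-8"), end="", flush=True)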