Build your first LLM engine
Automatically build and deploy a TensorRT-LLM model serving engine
Deploying a TensorRT-LLM model with the Engine Builder is a three-step process:
- Pick a model and GPU instance
- Write your engine configuration and optional model serving code
- Deploy your packaged model and the engine will be built automatically
In this guide, we’ll walk through using the Engine Builder end-to-end. To keep this tutorial quick and cheap, we’ll use a 1.1 billion parameter TinyLlama model on an A10G GPU.
We also have production-ready examples for Llama 3, Mistral, and Whisper.
Prerequisites
Before you deploy a model, complete three quick setup steps.
Create an API key for your Baseten account
Create an API key and save it as an environment variable:
export BASETEN_API_KEY="abcd.123456"
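As a minimal sketch, Python code that later calls the model can read this variable and fail early with a clear message when it is missing. The helper name get_api_key is hypothetical and not part of any Baseten library.

```python
import os


def get_api_key(env=None):
    """Return the Baseten API key from the environment, or raise a clear error.

    `get_api_key` is a hypothetical helper for illustration only.
    """
    # Allow a plain dict to be passed in so the function is easy to test.
    env = os.environ if env is None else env
    key = env.get("BASETEN_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "BASETEN_API_KEY is not set; export it before calling the model."
        )
    return key
```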
Add an access token for Hugging Face
Some models require that you accept terms and conditions on Hugging Face before deployment. To prevent issues:
- Accept the license for any gated models you wish to access, like Llama 3.
- Create a read-only user access token from your Hugging Face account.
- Add the hf_access_token secret to your Baseten workspace.
Install Truss in your local development environment
Install the latest version of Truss, our open-source model packaging framework, with:
pip install --upgrade truss
Configure your engine
We’ll start by creating a new Truss:
truss init tinyllama-trt
cd tinyllama-trt
In the newly created tinyllama-trt/ folder, open config.yaml. In this file, we’ll configure our model serving engine:
model_name: tinyllama-trt
python_version: py310
resources:
  accelerator: A10G
  use_gpu: True
trt_llm:
  build:
    max_input_len: 2048
    max_output_len: 2048
    max_batch_size: 1
    max_beam_width: 1
    base_model: llama
    quantization_type: no_quant
    checkpoint_repository:
      repo: TinyLlama/TinyLlama-1.1B-Chat-v1.0
      source: HF
This build configuration sets a number of important parameters:
- max_input_len and max_output_len control the sequence shapes for input and output. We want to match these as closely as possible to expected real-world use to improve engine performance.
- max_batch_size lets us trade off between latency and throughput/cost. Larger batches increase the total number of requests that can be processed at once but decrease the perceived speed of each request.
- max_beam_width is always set to 1 as we don’t currently perform beam search.
- base_model determines which type of supported model architecture to build the engine for.
- quantization_type asks if the model should be quantized on deployment. no_quant will run the model in standard fp16 precision.
- checkpoint_repository determines where to load the weights from, in this case a Hugging Face repository for TinyLlama.
The config.yaml file also contains Baseten-specific configuration for model name, GPU type, and model serving environment.
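To make the structure concrete, the sketch below mirrors the trt_llm build section as a Python dict and checks only the constraints stated in this guide. The function check_build_config is hypothetical; Truss performs its own validation on deployment, and this is not a reproduction of it.

```python
# The build section of config.yaml from this guide, written as a Python dict.
build = {
    "max_input_len": 2048,
    "max_output_len": 2048,
    "max_batch_size": 1,
    "max_beam_width": 1,
    "base_model": "llama",
    "quantization_type": "no_quant",
    "checkpoint_repository": {
        "repo": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "source": "HF",
    },
}


def check_build_config(build):
    """Return a list of problems found in a build config (empty if none).

    Hypothetical helper: it only encodes rules stated in this guide.
    """
    problems = []
    for key in ("max_input_len", "max_output_len", "max_batch_size"):
        value = build.get(key)
        if not isinstance(value, int) or value < 1:
            problems.append(f"{key} must be a positive integer")
    if build.get("max_beam_width") != 1:
        problems.append("max_beam_width must be 1 (beam search is not performed)")
    repo = build.get("checkpoint_repository", {})
    if not repo.get("repo") or not repo.get("source"):
        problems.append("checkpoint_repository needs both repo and source")
    return problems
```

Running check_build_config(build) on the configuration above returns an empty list; changing max_beam_width to any other value adds one entry describing the problem.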
Delete or update model.py
The config.yaml file above specifies a complete TensorRT-LLM engine. However, we also provide further control in the model/model.py file in Truss.
If you do not need to add any custom logic in model/model.py, delete the file. Otherwise, you’ll get the following error on deployment:
truss.errors.ValidationError: Model class `__init__` method
is required to have `trt_llm` as an argument.
Please add that argument.
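If you keep the file, one way to catch this before pushing is to inspect it locally. The sketch below uses the standard-library ast module to check whether a Model class defines an __init__ with a trt_llm parameter. It is a hypothetical pre-deployment check inferred only from the error message above, not Truss’s own validation logic.

```python
import ast


def init_accepts_trt_llm(source):
    """Return True if the `Model` class in `source` has an `__init__`
    with a parameter named `trt_llm`.

    Hypothetical check based only on the error message shown above.
    """
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.ClassDef) and node.name == "Model":
            for item in node.body:
                if isinstance(item, ast.FunctionDef) and item.name == "__init__":
                    args = item.args
                    names = [
                        a.arg
                        for a in args.posonlyargs + args.args + args.kwonlyargs
                    ]
                    return "trt_llm" in names
    # No Model class, or no __init__ defined on it.
    return False
```

To use it, pass in the file contents, for example init_accepts_trt_llm(open("model/model.py").read()).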
The model/model.py file is useful for custom behaviors like applying a prompt template.
from typing import Any

from transformers import AutoTokenizer


class Model:
    def __init__(self, trt_llm, **kwargs) -> None:
        self._engine = trt_llm["engine"]
        self._model = None
        self._tokenizer = None

    def load(self) -> None:
        self._tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    async def predict(self, model_input: Any) -> Any:
        # Apply chat template to prompt
        model_input["prompt"] = self._tokenizer.apply_chat_template(model_input["prompt"], tokenize=False)
        return await self._engine.predict(model_input)
Including a model/model.py file is optional. If the file is not present, the TensorRT-LLM engine will run according to its base spec.
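Hugging Face chat templates operate on a list of role/content messages rather than a bare string. If callers may send either form, one option is to normalize the input before templating. The helper to_messages below is hypothetical, and the assumption that a plain-string prompt should become a single user message is ours rather than something the guide specifies.

```python
def to_messages(model_input):
    """Return a list of {"role", "content"} dicts from a request payload.

    Hypothetical helper: accepts either a `messages` list, or a `prompt`
    that is a plain string or already a list of messages.
    """
    if "messages" in model_input:
        return list(model_input["messages"])
    prompt = model_input["prompt"]
    if isinstance(prompt, str):
        # Assumption: treat a bare string as one user turn.
        return [{"role": "user", "content": prompt}]
    return list(prompt)
```

Inside predict, the result could then be passed to the tokenizer, for example self._tokenizer.apply_chat_template(to_messages(model_input), tokenize=False).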
Deploy and build
To deploy your model and have the TensorRT-LLM engine automatically build, run:
truss push --publish
This will create a new deployment in your Baseten workspace. Navigate to the model dashboard to see engine building and model deployment logs.
The engines are stored in Baseten but owned by the user. We’re working on a mechanism for downloading them. In the meantime, reach out if you need access to an engine that you created using the Engine Builder.
Call deployed model
When your model is deployed, you can call it via its API endpoint:
import os

import requests

# Model ID for production deployment
model_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Call model endpoint
resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "messages": [{"role": "user", "content": "How awesome is TensorRT-LLM?"}],
        "max_tokens": 1024
    },
    stream=True
)

# Print the generated tokens as they get streamed
for content in resp.iter_content():
    print(content.decode("utf-8"), end="", flush=True)
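One caveat with decoding each streamed chunk on its own: a chunk boundary can fall in the middle of a multi-byte UTF-8 character, in which case a per-chunk decode raises an error. A minimal sketch of a more tolerant approach uses the standard-library incremental decoder. The function name decode_stream is hypothetical.

```python
import codecs


def decode_stream(chunks):
    """Yield text from an iterable of byte chunks, tolerating characters
    that are split across chunk boundaries."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        # Incomplete trailing bytes are buffered until the next chunk arrives.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush; raises UnicodeDecodeError if the stream ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

With the response above, this could be used as for text in decode_stream(resp.iter_content()): print(text, end="", flush=True).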
Supported parameters for LLMs:
- prompt: The input text prompt to guide the language model’s generation. Exactly one of prompt or messages is required.
- messages: A list of dictionaries representing the message history, typically used in conversational contexts. Exactly one of prompt or messages is required.
- max_tokens: The maximum number of tokens to generate in the output. Controls the length of the generated text.
- The number of beams used in beam search. Maximum of 1.
- A penalty applied to repeated tokens to discourage the model from repeating the same words or phrases.
- A penalty applied to tokens already present in the prompt to encourage the generation of new topics.
- Controls the randomness of the output. Lower values make the output more deterministic, while higher values increase randomness.
- A penalty applied to the length of the generated sequence to control verbosity. Higher values make the model favor shorter outputs.
- The token ID that indicates the end of the generated sequence.
- The token ID used for padding sequences to a uniform length.
- Limits the sampling pool to the top k tokens, ensuring the model only considers the most likely tokens at each step.
- Applies nucleus sampling to limit the sampling pool to a cumulative probability p, ensuring only the most likely tokens are considered.
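The prompt-or-messages rule can be enforced on the client before a request is sent. The sketch below is a hypothetical helper, build_payload, that checks only what this guide states: exactly one of prompt or messages, plus an optional max_tokens. Any other sampling parameters are passed through without validation.

```python
def build_payload(prompt=None, messages=None, max_tokens=None, **sampling):
    """Build a request body, enforcing that exactly one of `prompt` or
    `messages` is given. Hypothetical helper for illustration."""
    if (prompt is None) == (messages is None):
        raise ValueError("Provide exactly one of prompt or messages")
    payload = {"prompt": prompt} if prompt is not None else {"messages": messages}
    if max_tokens is not None:
        if max_tokens < 1:
            raise ValueError("max_tokens must be at least 1")
        payload["max_tokens"] = max_tokens
    # Extra sampling parameters are passed through unchanged and unvalidated.
    payload.update(sampling)
    return payload
```

The returned dict can be supplied as the json argument of the request shown earlier.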