Build your first LLM engine
Automatically build and deploy a TensorRT-LLM model serving engine
Deploying a TensorRT-LLM model with the Engine Builder is a three-step process:
- Pick a model and GPU instance
- Write your engine configuration and optional model serving code
- Deploy your packaged model and the engine will be built automatically
In this guide, we’ll walk through the process of using the Engine Builder end to end. To keep this tutorial as quick and cheap as possible, we’ll use a 1.1-billion-parameter TinyLlama model on an A10G GPU.
We also have production-ready examples for Llama 3 and Mistral.
Prerequisites
Before you deploy a model, you’ll need to complete three quick setup steps.
Create an API key for your Baseten account
Create an API key and save it as an environment variable:
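For example, in a bash-compatible shell (the variable name below is a convention; Truss will also prompt you for a key if none is configured):

```sh
# Paste the API key generated from your Baseten account settings.
export BASETEN_API_KEY="<your-api-key>"
```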
Add an access token for Hugging Face
Some models require that you accept terms and conditions on Hugging Face before deployment. To prevent issues:
- Accept the license for any gated models you wish to access, like Llama 3.
- Create a read-only user access token from your Hugging Face account.
- Add the `hf_access_token` secret to your Baseten workspace.
Install Truss in your local development environment
Install the latest version of Truss, our open-source model packaging framework, with:
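For example, with pip:

```sh
pip install --upgrade truss
```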
Configure your engine
We’ll start by creating a new Truss:
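For example, using the directory name assumed in the rest of this guide:

```sh
truss init tinyllama-trt
```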
In the newly created `tinyllama-trt/` folder, open `config.yaml`. In this file, we’ll configure our model serving engine:
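A minimal configuration might look like the sketch below. The fields discussed afterwards (`max_seq_len`, `base_model`, `quantization_type`, `checkpoint_repository`) come from the Engine Builder spec; the exact nesting and the remaining values here are illustrative, so check the Engine Builder reference for the authoritative schema.

```yaml
model_name: tinyllama-trt
resources:
  accelerator: A10G
  use_gpu: true
trt_llm:
  build:
    base_model: llama
    max_seq_len: 2048
    quantization_type: no_quant
    checkpoint_repository:
      source: HF
      repo: TinyLlama/TinyLlama-1.1B-Chat-v1.0
```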
This build configuration sets a number of important parameters:
- `max_seq_len` controls the maximum number of total tokens supported by the engine. Match this as closely as possible to expected real-world use to improve engine performance.
- `base_model` determines which supported model architecture to build the engine for.
- `quantization_type` specifies whether the model should be quantized on deployment. `no_quant` runs the model in standard `fp16` precision.
- `checkpoint_repository` determines where to load the weights from, in this case a Hugging Face repository for TinyLlama.
The `config.yaml` file also contains Baseten-specific configuration for model name, GPU type, and model serving environment.
Delete or update model.py
The `config.yaml` file above specifies a complete TensorRT-LLM engine. However, Truss also provides further control through the `model/model.py` file.
If you do not need to add any custom logic in `model/model.py`, delete the file. Otherwise, you’ll get an error on deployment.
The `model/model.py` file is useful for custom behaviors like applying a prompt template.
Including a `model/model.py` file is optional. If the file is not present, the TensorRT-LLM engine will run according to its base spec.
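As a purely illustrative sketch, a `model/model.py` that applies a TinyLlama-style chat prompt template before the request reaches the engine might look roughly like this. How the Engine Builder wires the engine into the `Model` class (constructor arguments, how generation is invoked) is defined by Truss and Baseten, so treat the details below as assumptions rather than the exact contract:

```python
# Hypothetical sketch only: the real Engine Builder interface for model.py
# may differ; see the Truss and Baseten docs for the exact contract.

# TinyLlama chat-style template (assumed; adjust to your model's format).
PROMPT_TEMPLATE = "<|user|>\n{prompt}</s>\n<|assistant|>\n"


class Model:
    def __init__(self, **kwargs):
        # The Engine Builder builds and loads the TensorRT-LLM engine itself;
        # anything passed in here (config, secrets) is available for custom logic.
        self._config = kwargs.get("config")

    def load(self):
        # Nothing to load here: the engine is managed by the Engine Builder.
        pass

    def predict(self, model_input):
        # Apply the prompt template before the request is handed to the engine.
        if "prompt" in model_input:
            model_input["prompt"] = PROMPT_TEMPLATE.format(prompt=model_input["prompt"])
        return model_input
```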
Deploy and build
To deploy your model and have the TensorRT-LLM engine automatically build, run:
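```sh
# Run from inside the tinyllama-trt/ directory.
# You may be prompted for your Baseten API key on first use.
truss push
```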
This will create a new deployment in your Baseten workspace. Navigate to the model dashboard to see engine building and model deployment logs.
The engines are stored in Baseten but owned by the user — we’re working on a mechanism for downloading them. In the meantime, reach out if you need access to an engine that you created using the Engine Builder.
Call deployed model
When your model is deployed, you can call it via its API endpoint:
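For example, with Python and `requests`. The endpoint format follows Baseten’s standard model API; substitute your own model ID, and note that the `max_tokens` field shown here is an assumed name for the output-length parameter described in the list below.

```python
import os

import requests

model_id = "<your-model-id>"  # find this in your Baseten model dashboard

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={
        "prompt": "What is a tensor?",
        "max_tokens": 256,  # assumed parameter name for output length
    },
)
print(resp.json())
```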
Supported parameters for LLMs:
- `prompt`: The input text prompt to guide the language model’s generation. Exactly one of `prompt` or `messages` is required.
- `messages`: A list of dictionaries representing the message history, typically used in conversational contexts. Exactly one of `prompt` or `messages` is required.
- Max output tokens: The maximum number of tokens to generate in the output. Controls the length of the generated text.
- Beam width: The number of beams used in beam search. Maximum of `1`.
- Repetition penalty: A penalty applied to repeated tokens to discourage the model from repeating the same words or phrases.
- Presence penalty: A penalty applied to tokens already present in the prompt to encourage the generation of new topics.
- Temperature: Controls the randomness of the output. Lower values make the output more deterministic, while higher values increase randomness.
- Length penalty: A penalty applied to the length of the generated sequence to control verbosity. Higher values make the model favor shorter outputs.
- End token ID: The token ID that indicates the end of the generated sequence.
- Pad token ID: The token ID used for padding sequences to a uniform length.
- Top-k: Limits the sampling pool to the top `k` tokens, ensuring the model only considers the most likely tokens at each step.
- Top-p: Applies nucleus sampling to limit the sampling pool to a cumulative probability `p`, ensuring only the most likely tokens are considered.