Engine control in Python
Use `model.py` to customize engine behavior
When you create a new Truss with `truss init`, it creates two files: `config.yaml` and `model/model.py`. While you configure the Engine Builder in `config.yaml`, you can use `model/model.py` to access and control the engine object during inference.
You have two options:

- Delete the `model/model.py` file and your TensorRT-LLM engine will run according to its base spec.
- Update the code to support TensorRT-LLM.
You must either update `model/model.py` to accept `trt_llm` as an argument to the `__init__` method or delete the file. Otherwise, you will get an error on deployment, because the default `model/model.py` file is not written for TensorRT-LLM.
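If you keep the file, a minimal sketch of the second option looks like the following: it simply accepts `trt_llm` in `__init__` and forwards each request to the engine. The `trt_llm["engine"]` access and `predict` call mirror the fuller example later in this section.

```python
from typing import Any


class Model:
    def __init__(self, trt_llm, **kwargs) -> None:
        # Keep a reference to the engine passed in via the trt_llm argument.
        self._engine = trt_llm["engine"]

    def load(self) -> None:
        # Runs once on server start-up; nothing extra to set up here.
        pass

    async def predict(self, model_input: Any) -> Any:
        # Forward each request to the TensorRT-LLM engine unchanged.
        return await self._engine.predict(model_input)
```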
The `engine` object is a property of the `trt_llm` argument and must be initialized in `__init__` to be accessed in `load()` (which runs once on server start-up) and `predict()` (which runs for each request handled by the server).
The following example uses the Llama 3.1 8B Instruct tokenizer to apply a chat template to the prompt before passing the request to the engine:
```python
from typing import Any

from transformers import AutoTokenizer


class Model:
    def __init__(self, trt_llm, **kwargs) -> None:
        self._secrets = kwargs["secrets"]
        # The engine object is passed in via the trt_llm argument.
        self._engine = trt_llm["engine"]
        self._model = None
        self._tokenizer = None

    def load(self) -> None:
        # Runs once on server start-up: download the tokenizer.
        self._tokenizer = AutoTokenizer.from_pretrained(
            "meta-llama/Llama-3.1-8B-Instruct", token=self._secrets["hf_access_token"]
        )

    async def predict(self, model_input: Any) -> Any:
        # Runs for each request: apply the chat template, then call the engine.
        model_input["prompt"] = self._tokenizer.apply_chat_template(model_input["prompt"], tokenize=False)
        return await self._engine.predict(model_input)
```
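Note that `apply_chat_template` expects the prompt to be a list of chat messages rather than a plain string, so a request to this deployment might carry a payload like the sketch below. The field values are illustrative; only the `prompt` key is assumed by the code above.

```python
# Illustrative request body for the model above: "prompt" is a list of
# role/content messages so the tokenizer can apply its chat template.
payload = {
    "prompt": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what TensorRT-LLM does."},
    ]
}
```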