Deploying a TensorRT-LLM model with the Engine Builder is a three-step process:

  1. Pick a model and GPU instance
  2. Write your engine configuration and optional model serving code
  3. Deploy your packaged model and the engine will be built automatically

In this guide, we’ll walk through the process of using the Engine Builder end-to-end. To make this tutorial as quick and cheap as possible, we’ll use a 1.1-billion-parameter TinyLlama model on an A10G GPU.

We also have production-ready examples for Llama 3, Mistral, and Whisper.

Prerequisites

Before you deploy a model, you’ll need to complete three quick setup steps.

1. Create an API key for your Baseten account

Create an API key and save it as an environment variable:

export BASETEN_API_KEY="abcd.123456"
2. Add an access token for Hugging Face

Some models require that you accept terms and conditions on Hugging Face before deployment. To prevent issues:

  1. Accept the license for any gated models you wish to access, like Llama 3.
  2. Create a read-only user access token from your Hugging Face account.
  3. Add the hf_access_token secret to your Baseten workspace.
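
When you package the model with Truss below, you’ll also typically declare this secret in config.yaml so the Engine Builder can pull gated weights at build time. A minimal sketch, assuming the standard Truss secrets field (the value stays null as a placeholder; the real token lives only in your workspace secrets):

config.yaml
secrets:
  hf_access_token: null
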
3. Install Truss in your local development environment

Install the latest version of Truss, our open-source model packaging framework, with:

pip install --upgrade truss

Configure your engine

We’ll start by creating a new Truss:

truss init tinyllama-trt
cd tinyllama-trt

In the newly created tinyllama-trt/ folder, open config.yaml. In this file, we’ll configure our model serving engine:

config.yaml
model_name: tinyllama-trt
python_version: py310
resources:
  accelerator: A10G
  use_gpu: True
trt_llm:
  build:
    max_input_len: 2048
    max_output_len: 2048
    max_batch_size: 1
    max_beam_width: 1
    base_model: llama
    quantization_type: no_quant
    checkpoint_repository:
      repo: TinyLlama/TinyLlama-1.1B-Chat-v1.0
      source: HF

This build configuration sets a number of important parameters:

  • max_input_len and max_output_len control the sequence shapes for input and output. We want to match these as closely as possible to expected real-world use to improve engine performance.
  • max_batch_size lets us trade off between latency and throughput/cost. Larger batches increase the total number of requests that can be processed at once but decrease the perceived speed of each request.
  • max_beam_width is always set to 1 as we don’t currently perform beam search.
  • base_model determines which type of supported model architecture to build the engine for.
  • quantization_type specifies whether the model should be quantized on deployment. no_quant runs the model in standard fp16 precision.
  • checkpoint_repository determines where to load the weights from, in this case a Hugging Face repository for TinyLlama.
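
As an illustration of these trade-offs, a more throughput-oriented build might raise the batch size and size the sequence shapes to match real traffic. The values below are a sketch, not a tuned recommendation for TinyLlama on an A10G:

config.yaml
trt_llm:
  build:
    max_input_len: 4096   # sized to the longest prompts you expect
    max_output_len: 512   # completions are typically much shorter than prompts
    max_batch_size: 16    # more concurrent requests, slower individual requests
    max_beam_width: 1     # beam search is not currently supported
    base_model: llama
    quantization_type: no_quant   # or one of the supported quantization options
    checkpoint_repository:
      repo: TinyLlama/TinyLlama-1.1B-Chat-v1.0
      source: HF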

The config.yaml file also contains Baseten-specific configuration for model name, GPU type, and model serving environment.

Delete or update model.py

The config.yaml file above specifies a complete TensorRT-LLM engine. However, the model/model.py file in Truss gives you further control over serving behavior.

If you do not need any custom logic in model/model.py, delete the file. If you leave the default model.py unchanged, you’ll get the following error on deployment:

truss.errors.ValidationError: Model class `__init__` method
is required to have `trt_llm` as an argument.
Please add that argument.

The model/model.py file is useful for custom behaviors like applying a prompt template.

model/model.py
from typing import Any
from transformers import AutoTokenizer

class Model:
    def __init__(self, trt_llm, **kwargs) -> None:
        # The Engine Builder injects the built TensorRT-LLM engine here
        self._engine = trt_llm["engine"]
        self._tokenizer = None

    def load(self) -> None:
        self._tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    async def predict(self, model_input: Any) -> Any:
        # Apply the chat template to the raw prompt before handing it to the engine.
        # apply_chat_template expects a list of messages, so wrap the prompt as a user turn.
        model_input["prompt"] = self._tokenizer.apply_chat_template(
            [{"role": "user", "content": model_input["prompt"]}],
            tokenize=False,
            add_generation_prompt=True,
        )
        return await self._engine.predict(model_input)

Including a model/model.py file is optional. If the file is not present, the TensorRT-LLM engine will run according to its base spec.

Deploy and build

To deploy your model and have the TensorRT-LLM engine automatically build, run:

truss push --publish

This will create a new deployment in your Baseten workspace. Navigate to the model dashboard to see engine building and model deployment logs.

The engines are stored in Baseten but owned by the user — we’re working on a mechanism for downloading them. In the meantime, reach out if you need access to an engine that you created using the Engine Builder.

Call deployed model

When your model is deployed, you can call it via its API endpoint:

call_model.py
import os

import requests

# Model ID for production deployment
model_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Call model endpoint
resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
      "messages": [{"role": "user", "content": "How awesome is TensorRT-LLM?"}],
      "max_tokens": 1024
    },
    stream=True
)

# Print the generated tokens as they get streamed
for content in resp.iter_content():
    print(content.decode("utf-8"), end="", flush=True)

Supported parameters for LLMs:

  • prompt (string): The input text prompt to guide the language model’s generation. Exactly one of prompt or messages must be provided.
  • messages (List[Dict]): A list of dictionaries representing the message history, typically used in conversational contexts. Exactly one of prompt or messages must be provided.
  • max_tokens (int): The maximum number of tokens to generate in the output. Controls the length of the generated text.
  • beam_width (int, default: 1): The number of beams used in beam search. Maximum of 1.
  • repetition_penalty (float): A penalty applied to repeated tokens to discourage the model from repeating the same words or phrases.
  • presence_penalty (float): A penalty applied to tokens already present in the prompt to encourage the generation of new topics.
  • temperature (float): Controls the randomness of the output. Lower values make the output more deterministic, while higher values increase randomness.
  • length_penalty (float): A penalty applied to the length of the generated sequence to control verbosity. Higher values make the model favor shorter outputs.
  • end_id (int): The token ID that indicates the end of the generated sequence.
  • pad_id (int): The token ID used for padding sequences to a uniform length.
  • runtime_top_k (int): Limits the sampling pool to the top k tokens, ensuring the model only considers the most likely tokens at each step.
  • runtime_top_p (float): Applies nucleus sampling to limit the sampling pool to a cumulative probability p, ensuring only the most likely tokens are considered.
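
As a usage sketch, here is a request that combines several of these parameters with the prompt input. The parameter values are illustrative, and the response handling mirrors the streaming pattern from call_model.py above:

import os

import requests

model_id = ""  # same production deployment ID as above
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "prompt": "Summarize the plot of Hamlet in two sentences.",
        "max_tokens": 256,
        "temperature": 0.7,
        "runtime_top_p": 0.9,
        "repetition_penalty": 1.1,
    },
    stream=True,
)

# Print the generated tokens as they get streamed
for content in resp.iter_content():
    print(content.decode("utf-8"), end="", flush=True)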