To get the best performance when deploying LLMs, we recommend using our TensorRT-LLM Engine Builder. Models deployed with the Engine Builder are OpenAI compatible, support structured output and function calling, and offer deploy-time post-training quantization to FP8 on Hopper GPUs.

The Engine Builder supports LLMs from the following families, both foundation models and fine-tunes:
- Llama 3.0 and later (including DeepSeek-R1 distills)
- Qwen 2.5 and later (including Math, Coder, and DeepSeek-R1 distills)
- Mistral (all LLMs)
You can download preset Engine Builder configs for common models from the model library.
The Engine Builder does not support vision-language models like Llama 3.2 11B or Pixtral. For these models, we recommend vLLM.
This configuration builds an inference engine to serve Qwen 2.5 3B on an H100 GPU. Running this model is fast and cheap, making it a good example for documentation, and the deployment process is essentially the same for larger models like Llama 3.3 70B.
This configuration file specifies model information and Engine Builder arguments. You can find dozens of examples in the model library, as well as details on each config option in the engine builder reference. Below is an example for Qwen 2.5 3B.
config.yaml
```yaml
model_metadata:
  example_model_input: # Loads sample request into Baseten playground
    messages:
      - role: system
        content: "You are a helpful assistant."
      - role: user
        content: "What does Tongyi Qianwen mean?"
    stream: true
    max_tokens: 512
    temperature: 0.6 # Check recommended temperature per model
  repo_id: Qwen/Qwen2.5-3B-Instruct
model_name: Qwen 2.5 3B Instruct
python_version: py39
resources: # Engine Builder GPU cannot be changed post-deployment
  accelerator: H100
  use_gpu: true
secrets: {}
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      repo: Qwen/Qwen2.5-3B-Instruct
      source: HF
    num_builder_gpus: 1
    quantization_type: no_quant # `fp8_kv` often recommended for large models
    max_seq_len: 32768 # option to vary the max sequence length, e.g. 131072 for Llama models
    tensor_parallel_count: 1 # Set equal to number of GPUs
    plugin_configuration:
      use_paged_context_fmha: true
      use_fp8_context_fmha: false # Set to true when using `fp8_kv`
      paged_kv_cache: true
  runtime:
    batch_scheduler_policy: max_utilization
    enable_chunked_context: true
    request_default_max_tokens: 32768 # 131072 for Llama models
```
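The example above uses `no_quant`, but as the config comments note, `fp8_kv` post-training quantization is often recommended for larger models on Hopper GPUs. Below is a minimal sketch of how the `trt_llm.build` section changes; the model choice and `num_builder_gpus` value are illustrative assumptions, so check the engine builder reference for exact requirements:

```yaml
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      repo: meta-llama/Llama-3.3-70B-Instruct # illustrative model choice
      source: HF
    num_builder_gpus: 4 # assumption: larger checkpoints may need more build GPUs
    quantization_type: fp8_kv # FP8 weights and KV cache on Hopper GPUs
    plugin_configuration:
      use_paged_context_fmha: true
      use_fp8_context_fmha: true # set to true when using `fp8_kv`
      paged_kv_cache: true
```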
Once the config is ready, deploy it with `truss push` from the directory containing your config.yaml. The deployed model is OpenAI compatible and can be called using the OpenAI client.
```python
import os

from openai import OpenAI

# https://model-XXXXXXX.api.baseten.co/environments/production/sync/v1
model_url = ""

client = OpenAI(
    base_url=model_url,
    api_key=os.environ.get("BASETEN_API_KEY"),
)

stream = client.chat.completions.create(
    model="baseten",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What does Tongyi Qianwen mean?"},
    ],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```
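Because Engine Builder deployments also support function calling, you can pass OpenAI-style `tools` through the same client. The sketch below assumes the deployed model honors the standard `tools` parameter; the `get_weather` schema is purely illustrative:

```python
import json
import os

from openai import OpenAI

# https://model-XXXXXXX.api.baseten.co/environments/production/sync/v1
model_url = ""

client = OpenAI(
    base_url=model_url,
    api_key=os.environ.get("BASETEN_API_KEY"),
)

# Illustrative tool definition; any JSON Schema function spec works here
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="baseten",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# If the model chose to call the tool, its arguments arrive as a JSON string
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    print(tool_calls[0].function.name)
    print(json.loads(tool_calls[0].function.arguments))
```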
That’s it! You have successfully deployed and called an LLM optimized with the TensorRT-LLM Engine Builder. Check the model library for more examples and the engine builder reference for details on each config option.