Qwen with TensorRT-LLM
Build an optimized inference engine for Qwen
This configuration builds an inference engine to serve Qwen 2.5 3B on an A10G GPU. It is very similar to the configuration for any other Qwen model, including fine-tuned variants.
Recommended basic GPU configurations for Qwen 2.5 sizes:
| Size and variant | FP16 unquantized | FP8 quantized |
|---|---|---|
| 3B (Instruct) | A10G | N/A |
| 7B (Instruct, Math, Code) | H100_40GB | N/A |
| 14B (Instruct) | H100 | H100_40GB |
| 32B (Instruct) | H100:2 | H100 |
| 72B (Instruct, Math) | H100:4 | H100:2 |
If you use multiple GPUs, make sure to match `num_builder_gpus` and `tensor_parallel_count` in the config. When quantizing, you may need to double the number of builder GPUs.
Setup
See the end-to-end engine builder tutorial prerequisites for full setup instructions.
Please upgrade to the latest version of Truss with `pip install --upgrade truss` before following this example.
Configuration
This configuration file specifies model information and Engine Builder arguments. To serve a different Qwen model, change the `model_name`, `accelerator`, and `repo` fields, along with any necessary changes to the `build` arguments.
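A minimal config sketch for the Qwen 2.5 3B deployment described above is shown below. Field names follow Engine Builder conventions, but the exact paths and build values here are illustrative assumptions; consult the full example config for the authoritative layout.

```yaml
# Illustrative config.yaml sketch (values are assumptions, not the
# exact published example).
model_name: Qwen 2.5 3B Instruct
resources:
  accelerator: A10G
trt_llm:
  build:
    base_model: qwen
    checkpoint_repository:
      source: HF                        # pull weights from Hugging Face
      repo: Qwen/Qwen2.5-3B-Instruct    # swap for a fine-tuned variant
```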
Deployment
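With the configuration in place, deployment follows the standard Truss flow from the directory containing `config.yaml` (you will be prompted for your Baseten API key if not already authenticated):

```shell
# Build the engine and deploy the model to a published environment
truss push --publish
```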
Usage
The input text prompt to guide the language model's generation. Exactly one of `prompt` or `messages` is required.
A list of dictionaries representing the message history, typically used in conversational contexts. Exactly one of `prompt` or `messages` is required.
The maximum number of tokens to generate in the output. Controls the length of the generated text.
The number of beams used in beam search. Maximum of `1`.
A penalty applied to repeated tokens to discourage the model from repeating the same words or phrases.
A penalty applied to tokens already present in the prompt to encourage the generation of new topics.
Controls the randomness of the output. Lower values make the output more deterministic, while higher values increase randomness.
A penalty applied to the length of the generated sequence to control verbosity. Higher values make the model favor shorter outputs.
The token ID that indicates the end of the generated sequence.
The token ID used for padding sequences to a uniform length.
Limits the sampling pool to the top `k` tokens, ensuring the model only considers the most likely tokens at each step.
Applies nucleus sampling to limit the sampling pool to a cumulative probability `p`, ensuring only the most likely tokens are considered.