This configuration builds an inference engine to serve Qwen 2.5 3B on an A10G GPU. It is very similar to the configuration for any other Qwen model, including fine-tuned variants.

Recommended basic GPU configurations for Qwen 2.5 sizes:

| Size and variant | FP16 unquantized | FP8 quantized |
| --- | --- | --- |
| 3B (Instruct) | A10G | N/A |
| 7B (Instruct, Math, Coder) | H100_40GB | N/A |
| 14B (Instruct) | H100 | H100_40GB |
| 32B (Instruct) | H100:2 | H100 |
| 72B (Instruct, Math) | H100:4 | H100:2 |

If you use multiple GPUs, make sure that num_builder_gpus matches tensor_parallel_count in the config, as sketched below. When quantizing, you may need to double the number of builder GPUs.
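
For example, a 32B FP16 deployment on two H100s pairs these fields as follows. This is a minimal sketch; the values are illustrative and mirror the table above.

resources:
  accelerator: H100:2
trt_llm:
  build:
    num_builder_gpus: 2       # matches tensor_parallel_count
    tensor_parallel_count: 2  # shards the engine across both GPUs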

Setup

See the end-to-end engine builder tutorial prerequisites for full setup instructions.

Before following this example, upgrade to the latest version of Truss and create a working directory with an empty config file:

pip install --upgrade truss
mkdir qwen-engine
touch qwen-engine/config.yaml

Configuration

This configuration file specifies model information and Engine Builder arguments. To serve a different Qwen model, update the model_name, accelerator, and repo fields, and adjust the build arguments as needed; a hypothetical quantized variant is sketched after the config.

config.yaml
model_name: Qwen 2.5 3B Instruct
resources:
  accelerator: A10G  # GPU type from the table above
  use_gpu: true
trt_llm:
  build:
    base_model: qwen
    checkpoint_repository:
      repo: Qwen/Qwen2.5-3B-Instruct  # Hugging Face repo; swap in a fine-tuned variant here
      source: HF
    max_seq_len: 8192  # maximum sequence length the engine supports
    num_builder_gpus: 1  # GPUs used while building the engine
    quantization_type: no_quant  # unquantized FP16 build
    tensor_parallel_count: 1  # GPUs the engine is sharded across when serving
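
For comparison, an FP8 build of a larger model changes the resources and quantization settings along these lines. This is a hedged sketch, not a tested config: the quantization_type value fp8 and the doubled num_builder_gpus are assumptions based on the guidance above, so check them against the Engine Builder reference.

model_name: Qwen 2.5 14B Instruct
resources:
  accelerator: H100_40GB
  use_gpu: true
trt_llm:
  build:
    base_model: qwen
    checkpoint_repository:
      repo: Qwen/Qwen2.5-14B-Instruct
      source: HF
    max_seq_len: 8192
    num_builder_gpus: 2     # quantization may need double the builder GPUs
    quantization_type: fp8  # assumed value for FP8 quantization
    tensor_parallel_count: 1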

Deployment

From inside the qwen-engine directory, deploy the model with:

truss push --publish

Baseten builds the TensorRT-LLM engine as part of the deployment, so the first deploy can take a while for larger models.

Usage

call_model.py
import os

import requests

# Model ID for production deployment
model_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Call model endpoint
resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "messages": [
            {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
            {"role": "user", "content": "What does Tongyi Qianwen mean?"},
        ],
        "max_tokens": 512,
    },
    stream=True
)

# Print the generated tokens as they get streamed
for content in resp.iter_content():
    print(content.decode("utf-8"), end="", flush=True)
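
To call the deployed model, fill in model_id with the ID from your Baseten dashboard, export your API key, and run the script:

export BASETEN_API_KEY=YOUR_API_KEY  # placeholder; use your real key
python call_model.py
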
The model endpoint accepts the following parameters in the JSON request body:

prompt (string)
The input text prompt to guide the language model's generation. Exactly one of prompt or messages must be provided.

messages (List[Dict])
A list of dictionaries representing the message history, typically used in conversational contexts. Exactly one of prompt or messages must be provided.

max_tokens (int)
The maximum number of tokens to generate in the output. Controls the length of the generated text.

beam_width (int, default: 1)
The number of beams used in beam search. The maximum supported value is 1.

repetition_penalty (float)
A penalty applied to repeated tokens to discourage the model from repeating the same words or phrases.

presence_penalty (float)
A penalty applied to tokens already present in the prompt to encourage the generation of new topics.

temperature (float)
Controls the randomness of the output. Lower values make the output more deterministic, while higher values increase randomness.

length_penalty (float)
A penalty applied to the length of the generated sequence to control verbosity. Higher values make the model favor shorter outputs.

end_id (int)
The token ID that indicates the end of the generated sequence.

pad_id (int)
The token ID used for padding sequences to a uniform length.

runtime_top_k (int)
Limits sampling to the k most likely tokens at each step.

runtime_top_p (float)
Applies nucleus sampling, limiting the sampling pool to the smallest set of tokens whose cumulative probability reaches p.
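
As an illustration, a request body that tunes sampling behavior might combine several of these parameters. The values below are arbitrary examples, not recommendations:

{
  "messages": [
    {"role": "user", "content": "Write a haiku about GPUs."}
  ],
  "max_tokens": 128,
  "temperature": 0.7,
  "runtime_top_p": 0.9,
  "runtime_top_k": 50,
  "repetition_penalty": 1.1
}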