Deploy Qwen 2.5 32B Coder

Qwen 2.5 32B Coder is a state-of-the-art language model optimized specifically for coding, math, and reasoning. Part of the Qwen 2.5 family from Alibaba Cloud, it rivals much larger models in programming proficiency while remaining small enough for efficient single-GPU deployment. On Baseten, you can deploy it with the TensorRT-LLM Engine Builder for high-performance, OpenAI-compatible inference.

Deploy Qwen 2.5 32B Coder

Deploy this model using a config.yaml that leverages Baseten's TensorRT-LLM Engine Builder. The model fits on a single H100 or A100 GPU, though we recommend an H100 for the best performance with FP8 quantization.
model_name: qwen-2-5-32b-coder
resources:
  accelerator: H100
  cpu: '1'
  memory: 24Gi
  use_gpu: true
trt_llm:
  build:
    checkpoint_repository:
      repo: Qwen/Qwen2.5-Coder-32B-Instruct
      source: HF
    num_builder_gpus: 2
    quantization_type: fp8
    max_seq_len: 32768
    tensor_parallel_count: 1
  runtime:
    enable_chunked_context: true
    request_default_max_tokens: 32768
Push your model to Baseten using the Truss CLI:
truss push

Run inference

Once deployed, the model provides an OpenAI-compatible chat completions endpoint. You can use the standard OpenAI SDK to integrate it into your coding assistants or automation workflows.
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-{MODEL_ID}.api.baseten.co/environments/production/sync/v1"
)

response = client.chat.completions.create(
    model="qwen-coder",
    messages=[
        {"role": "user", "content": "Write a Python function to calculate the Fibonacci sequence."}
    ],
    max_tokens=512
)

print(response.choices[0].message.content)
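When wiring the model into a coding assistant, you often want just the code from a response rather than the surrounding explanation. Here is a minimal sketch of one way to do that; the fence-parsing helper and the sample reply are our own illustration, not part of the OpenAI SDK or the Baseten API:

```python
import re

def extract_code_blocks(text: str) -> list[str]:
    """Return the contents of fenced code blocks in a model response."""
    # Matches ```lang\n...\n``` fences; the language tag is optional.
    return [m.group(1) for m in re.finditer(r"```[\w+-]*\n(.*?)```", text, re.DOTALL)]

# A hypothetical model reply mixing prose and code.
reply = (
    "Here you go:\n"
    "```python\n"
    "def fib(n):\n"
    "    return n if n < 2 else fib(n - 1) + fib(n - 2)\n"
    "```\n"
    "Hope that helps."
)

print(extract_code_blocks(reply)[0])
```

You could apply the same helper to `response.choices[0].message.content` from the example above before writing the result to a file or inserting it into an editor buffer.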

Configuration and tuning

For coding tasks, latency is often the most critical metric. We’ve configured this model with several optimizations to ensure fast, reliable responses.

Hardware and quantization

We use FP8 quantization to reduce the model's memory footprint and increase throughput without a significant impact on coding accuracy. An H100 provides high-speed FP8 compute, and its 80 GB of VRAM leaves ample headroom for long context windows.
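As a rough back-of-the-envelope check of why FP8 matters here (weights only; activations and the KV cache add further overhead on top of these figures):

```python
params = 32e9  # Qwen 2.5 32B Coder has roughly 32 billion parameters

bf16_gb = params * 2 / 1e9  # 2 bytes per weight in BF16
fp8_gb = params * 1 / 1e9   # 1 byte per weight in FP8

print(f"BF16 weights: ~{bf16_gb:.0f} GB, FP8 weights: ~{fp8_gb:.0f} GB")
# Halving the weight footprint frees tens of GB on an 80 GB H100
# for KV cache, which is what enables the 32K-token context window.
```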

Lookahead decoding

For even lower latency on predictable coding patterns, you can enable lookahead decoding. This technique speculates on future tokens by looking for common n-gram patterns in the generated text, which is particularly effective for the structured, repetitive nature of source code.
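To build intuition for why this helps with code, here is a toy illustration of n-gram speculation in plain Python. This is our own simplified sketch of the idea, not the TensorRT-LLM implementation, which verifies drafted tokens against the model in parallel:

```python
def speculate_from_ngrams(generated: list[str], n: int = 3, k: int = 4) -> list[str]:
    """Toy n-gram speculation: if the last n tokens already appeared earlier
    in the output, draft the k tokens that followed them last time."""
    if len(generated) < n:
        return []
    tail = generated[-n:]
    # Scan earlier positions for the same n-gram, most recent match first.
    for i in range(len(generated) - n - 1, -1, -1):
        if generated[i:i + n] == tail:
            return generated[i + n:i + n + k]
    return []

# Source code is highly repetitive: the second function signature below
# repeats the "( a , b )" pattern of the first, so the drafted tokens
# correctly anticipate the ": return a" that follows.
tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b",
          "def", "sub", "(", "a", ",", "b", ")"]
print(speculate_from_ngrams(tokens))  # → [':', 'return', 'a', '+']
```

When a draft is accepted, the engine emits several tokens for the cost of one verification step, which is why the speedup is largest on boilerplate-heavy code.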