Deploy Qwen 2.5 32B Coder
Deploy this model using a config.yaml that leverages Baseten’s Engine-Builder-LLM. The model fits on a single H100 or A100 GPU, though we recommend an H100 for the best performance with FP8 quantization.
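A rough back-of-the-envelope calculation shows why FP8 makes the single-GPU fit comfortable (a sketch only; real deployments also need memory for the KV cache, activations, and engine overhead):

```python
# Approximate weight memory for a 32B-parameter model at different precisions.
# This ignores KV cache, activations, and engine overhead, which also
# consume VRAM on the deployed GPU.

PARAMS = 32e9  # ~32 billion parameters


def weight_memory_gb(bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return PARAMS * bytes_per_param / 1e9


fp16_gb = weight_memory_gb(2.0)  # BF16/FP16: 2 bytes per parameter
fp8_gb = weight_memory_gb(1.0)   # FP8: 1 byte per parameter

print(f"FP16 weights: ~{fp16_gb:.0f} GB")  # tight on an 80 GB H100
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")   # leaves ample room for long contexts
```

At FP8, roughly half of the H100’s 80 GB remains free for the KV cache, which is what makes the long context windows mentioned below practical.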
Run inference
Once deployed, the model provides an OpenAI-compatible chat completions endpoint. You can use the standard OpenAI SDK to integrate it into your coding assistants or automation workflows.

- Python SDK
- cURL
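As a minimal sketch of what either path sends over the wire, here is the chat completions request built with only the standard library (the OpenAI SDK wraps the same HTTP call); the base URL, model name, and API key below are placeholders — substitute your deployment’s values:

```python
import json
import urllib.request

# Placeholders -- replace with your Baseten deployment's URL and API key.
BASE_URL = "https://model-XXXXXXX.api.baseten.co/environments/production/sync/v1"
API_KEY = "YOUR_API_KEY"


def build_chat_request(prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-style chat completions payload."""
    return {
        "model": "qwen2.5-32b-coder",  # placeholder model identifier
        "messages": [
            {"role": "system", "content": "You are a helpful coding assistant."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.2,
    }


def chat(prompt: str) -> str:
    """POST the payload to the chat completions endpoint and return the reply."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={
            "Authorization": f"Api-Key {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires a live deployment):
#   print(chat("Write a Python function that reverses a string."))
```

Because the endpoint is OpenAI-compatible, pointing the OpenAI SDK’s `base_url` at the deployment and passing your Baseten API key achieves the same result with less boilerplate.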
Configuration and tuning
For coding tasks, latency is often the most critical metric. We’ve configured this model with several optimizations to ensure fast, reliable responses.

Hardware and quantization
We use FP8 quantization to reduce the model’s memory footprint and increase throughput without significantly impacting its coding accuracy. Running on an H100 allows for high-speed computation, while the 80GB of VRAM provides ample space for long context windows.

Lookahead decoding
For even lower latency on predictable coding patterns, you can enable lookahead decoding. This technique speculates on future tokens by looking for common n-gram patterns in the generated text, which is particularly effective for the structured, repetitive nature of source code.

Related
- Model APIs — Instant access to Qwen models via shared endpoints.
- Engine-Builder-LLM documentation — Details on the engine used for this model.
- Truss examples — Source code for this Truss.