
Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.
Sign in to Baseten
uvx truss login --browser
Install the OpenAI SDK
uv pip install openai
RedHatAI/Qwen3-VL-32B-Instruct-NVFP4 is a 32B-parameter dense model. This preset serves the RedHatAI NVFP4 quantization of Qwen3-VL-32B-Instruct on a single RTX PRO 6000 Blackwell GPU, optimized for throughput on vision-language workloads.

Hardware: RTX_PRO_6000
Engine: vLLM (0.22.0-cu129 build)
Concurrency: 8

Write the config

Create and move into the project directory:
mkdir qwen3-vl-32b-throughput && cd qwen3-vl-32b-throughput
Then create a file named config.yaml and paste the following:
config.yaml
model_name: "model:qwen3-vl-32b preset:throughput"
model_metadata:
  description: >-
    Qwen3-VL-32B-Instruct (NVFP4), an OpenAI-compatible multimodal chat model with
    vision served via vLLM.
  repo_id: RedHatAI/Qwen3-VL-32B-Instruct-NVFP4
  example_model_input:
    model: Qwen/Qwen3-VL-32B-Instruct
    messages:
      - role: user
        content:
          - type: text
            text: "Describe this image in one sentence."
          - type: image_url
            image_url:
              url: "https://picsum.photos/id/237/200/300"
    stream: true
    max_tokens: 512
    temperature: 1.0
  tags:
    - openai-compatible
base_image:
  image: vllm/vllm-openai:v0.22.0-cu129
weights:
  - source: "hf://RedHatAI/Qwen3-VL-32B-Instruct-NVFP4@main"
    mount_location: "/app/checkpoint/model"
    auth_secret_name: "hf_access_token"
secrets:
  hf_access_token: null
docker_server:
  start_command: >-
    sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/model
    --tensor-parallel-size $GPU_COUNT
    --served-model-name Qwen/Qwen3-VL-32B-Instruct
    --max-num-seqs 16
    --max-model-len auto
    --limit-mm-per-prompt.image 2
    --gpu-memory-utilization 0.9
    --enable-prefix-caching
    --trust-remote-code
    --enable-auto-tool-choice
    --tool-call-parser hermes
    --load-format runai_streamer"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  VLLM_LOGGING_LEVEL: WARNING
  VLLM_ENGINE_READY_TIMEOUT_S: "3600"
resources:
  accelerator: RTX_PRO_6000
  use_gpu: true
runtime:
  health_checks:
    restart_check_delay_seconds: 1800
    restart_threshold_seconds: 1200
    stop_traffic_threshold_seconds: 120
  predict_concurrency: 8

Flags

The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
| --- | --- | --- |
| --tensor-parallel-size | $GPU_COUNT | Number of GPUs to shard the model across. |
| --max-num-seqs | 16 | Maximum number of concurrent sequences in the batch. |
| --max-model-len | auto | Maximum context length (tokens) the server accepts per request. |
| --limit-mm-per-prompt.image | 2 | Maximum number of image inputs per prompt. |
| --gpu-memory-utilization | 0.9 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --enable-prefix-caching | (no value) | Reuse KV cache across requests that share a prefix. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --tool-call-parser | hermes | Server-side parser that emits structured tool_calls on the response. hermes: Hermes-style function calls. |
| --load-format | runai_streamer | Weight loading backend. runai_streamer: Stream weights from object storage without materializing to disk. |
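Because --limit-mm-per-prompt.image is set to 2, it can help to check requests before sending them. The following is a minimal client-side sketch, not part of vLLM or the OpenAI SDK: count_images and check_image_limit are hypothetical helper names, and the sketch assumes the limit applies to all image_url parts in one request.

```python
MAX_IMAGES_PER_PROMPT = 2  # mirrors --limit-mm-per-prompt.image in config.yaml


def count_images(messages):
    """Count image_url content parts across all messages in one request."""
    total = 0
    for message in messages:
        content = message.get("content")
        # Plain-string content carries no images; only list content has typed parts.
        if isinstance(content, list):
            total += sum(1 for part in content if part.get("type") == "image_url")
    return total


def check_image_limit(messages, limit=MAX_IMAGES_PER_PROMPT):
    """Raise ValueError before sending if the request exceeds the image limit.

    Assumption: the server counts images over the whole request.
    """
    found = count_images(messages)
    if found > limit:
        raise ValueError(f"request has {found} images; the server allows {limit}")
    return found
```

Calling check_image_limit(messages) just before client.chat.completions.create surfaces an oversized request locally instead of as a server error.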

Deploy

Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model qwen3-vl-32b-throughput was successfully pushed ✨

   Model ID:      abc1d2ef
   Deployment ID: xyz123
   Endpoint:      model-abc1d2ef.api.baseten.co
   Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123
Your model ID is printed in the truss push output (abc1d2ef in the example). Use it wherever you see {model_id} in the next section.
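The client code in the next section expects a base URL of the form https://model-{model_id}.api.baseten.co/environments/production/sync/v1. As a small sketch, a hypothetical helper (base_url_for is an illustrative name, not part of Truss or the OpenAI SDK) can build that URL from the ID and fail early if the placeholder was left in:

```python
def base_url_for(model_id):
    """Build the OpenAI-compatible base URL for a Baseten model ID.

    Hypothetical helper: the URL pattern is copied from the client example
    in this guide and targets the production environment.
    """
    if not model_id or "{" in model_id:
        raise ValueError("replace {model_id} with the ID printed by truss push")
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


# Uses the example ID from the truss push output above.
print(base_url_for("abc1d2ef"))
```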

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID, make sure the BASETEN_API_KEY environment variable is set, and then call the deployment to run inference:
main.py
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-32B-Instruct",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
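Since this is a vision-language model, a request can also carry an image. The sketch below mirrors the example_model_input in config.yaml: the user message content becomes a list with a text part and an image_url part. build_image_messages and describe_image are hypothetical helper names, and the sample image URL is the one from the config.

```python
import os


def build_image_messages(prompt, image_url):
    """Return a one-turn chat history with a text part and an image part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]


def describe_image(model_id, image_url):
    """Send one image to the deployment and return the model's reply text."""
    # Imported here so build_image_messages has no third-party dependency.
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["BASETEN_API_KEY"],
        base_url=f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
    )
    response = client.chat.completions.create(
        model="Qwen/Qwen3-VL-32B-Instruct",
        messages=build_image_messages("Describe this image in one sentence.", image_url),
        max_tokens=512,
    )
    return response.choices[0].message.content


# Example call; substitute your own model ID before uncommenting:
# print(describe_image("abc1d2ef", "https://picsum.photos/id/237/200/300"))
```

The config allows at most two images per prompt, so a request built this way stays within the limit.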
To let the model call tools, pass a tools array on the request, reusing the client created in main.py. When the model decides to call a tool, the server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-32B-Instruct",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)