Qwen3-ASR

Alibaba’s Qwen3-ASR is a compact 1.7B-parameter speech-to-text model that transcribes multiple languages.

Setup

Sign in to Baseten

uvx truss login --browser

Install the OpenAI SDK

uv pip install openai

This preset serves Qwen3-ASR on a single H100 40GB GPU through vLLM and is tuned for low-latency multilingual transcription.

Hardware: H100_40GB × 1
Engine: vLLM (0.22.0-cu129 build)
Concurrency: 256

Write the config

Create and move into the project directory:

mkdir qwen3-asr-1.7b-latency && cd qwen3-asr-1.7b-latency

Then create a file named config.yaml and paste the following:

config.yaml

model_name: "model:qwen3-asr-1.7b preset:latency"
model_metadata:
  repo_id: Qwen/Qwen3-ASR-1.7B
  example_model_input:
    stream: false
    messages:
      - role: user
        content:
          - type: audio_url
            audio_url:
              url: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav
  tags:
    - openai-compatible
secrets:
  hf_access_token: null
weights:
  - source: "hf://Qwen/Qwen3-ASR-1.7B@main"
    mount_location: "/app/checkpoint/model"
    auth_secret_name: "hf_access_token"
base_image:
  image: vllm/vllm-openai:v0.22.0-cu129
docker_server:
  start_command: sh -c "vllm serve /app/checkpoint/model --tensor-parallel-size 1 --served-model-name Qwen/Qwen3-ASR-1.7B --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8000 --load-format runai_streamer"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
resources:
  accelerator: H100_40GB:1
  cpu: "1"
  memory: 10Gi
  use_gpu: true
requirements:
  - vllm[audio]
  - librosa
  - torch
  - torchaudio
  - pynvml
  - ffmpeg-python
system_packages:
  - python3.10-venv
  - ffmpeg
  - openmpi-bin
  - libopenmpi-dev
runtime:
  predict_concurrency: 256

Flags

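The start_command passes these flags to vllm serve. The table covers the tuning flags. Of the remaining flags, --served-model-name sets the model name that requests must pass, and --host and --port set the address the server listens on, which must match server_port.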

Flag	Value	What it does
`--tensor-parallel-size`	`1`	Number of GPUs to shard the model across.
`--gpu-memory-utilization`	`0.8`	Fraction of GPU memory vLLM may use for weights, activations, and KV cache.
`--load-format`	`runai_streamer`	Weight loading backend. `runai_streamer` streams weights from object storage without first materializing them on disk.
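To make the --gpu-memory-utilization value concrete, here is a rough back-of-the-envelope sketch of the memory budget for this preset. The 2 bytes per parameter figure assumes 16-bit weights, which the config does not state, and the remainder is only an estimate of what is left for KV cache and activations.

```python
# Rough GPU memory budget for this preset. All figures are estimates.
GPU_MEMORY_GB = 40            # H100_40GB, from the resources block
GPU_MEMORY_UTILIZATION = 0.8  # value of --gpu-memory-utilization
PARAMS_BILLIONS = 1.7         # Qwen3-ASR-1.7B
BYTES_PER_PARAM = 2           # assumption: 16-bit (bf16/fp16) weights


def memory_budget_gb(total_gb, utilization, params_billions, bytes_per_param):
    """Return (vllm_budget, weights, remainder) in GB."""
    budget = total_gb * utilization
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    weights = params_billions * bytes_per_param
    return budget, weights, budget - weights


budget, weights, remainder = memory_budget_gb(
    GPU_MEMORY_GB, GPU_MEMORY_UTILIZATION, PARAMS_BILLIONS, BYTES_PER_PARAM
)
print(f"vLLM budget: {budget:.1f} GB")
print(f"weights (estimate): {weights:.1f} GB")
print(f"left for KV cache and activations (estimate): {remainder:.1f} GB")
```

Under these assumptions the weights take a small share of the budget, so most of the memory vLLM reserves is available for serving concurrent requests.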

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model qwen3-asr-1.7b-latency was successfully pushed ✨

   Model ID:      abc1d2ef
   Deployment ID: xyz123
   Endpoint:      model-abc1d2ef.api.baseten.co
   Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123

truss push prints your model ID (abc1d2ef in the example). In the examples below, replace {model_id} with that ID and set the BASETEN_API_KEY environment variable to your API key.
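If you would rather not edit {model_id} by hand, a small helper can build the URL for you. This is a minimal sketch: the URL pattern is the one the request examples in the next section use, while the BASETEN_MODEL_ID environment variable and the base_url function are hypothetical names chosen for illustration.

```python
import os


def base_url(model_id: str) -> str:
    """Build the OpenAI-compatible base URL for a production deployment."""
    if not model_id or "{" in model_id:
        raise ValueError("replace {model_id} with the ID printed by truss push")
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


# BASETEN_MODEL_ID is a hypothetical variable name; "abc1d2ef" is the
# placeholder ID from the sample truss push output.
print(base_url(os.environ.get("BASETEN_MODEL_ID", "abc1d2ef")))
```

Pass the returned string as base_url when you construct the OpenAI client.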

Call the model

Your deployment serves an OpenAI-compatible chat completions API at /v1/chat/completions that accepts audio inputs. Send audio as an audio_url content item on a chat message. The model returns the transcription as the assistant message content.

Python

main.py

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-ASR-1.7B",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"
                    },
                }
            ],
        }
    ],
)

print(response.choices[0].message.content)
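Because the preset allows many concurrent requests, you may want to send several files at once. The following is a sketch, not a tuned client: transcribe_url and transcribe_many are hypothetical helper names, max_workers=8 is an arbitrary starting point, and the client argument is expected to be an OpenAI client configured as in main.py.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe_url(client, audio_url, model="Qwen/Qwen3-ASR-1.7B"):
    """Send one audio URL and return the transcription text."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": audio_url}}
                ],
            }
        ],
    )
    return response.choices[0].message.content


def transcribe_many(client, audio_urls, max_workers=8):
    """Transcribe several URLs in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order and re-raises the first
        # exception when the results are collected.
        return list(pool.map(lambda url: transcribe_url(client, url), audio_urls))
```

Call transcribe_many(client, urls) with a list of audio URLs. Threads suit this workload because each request spends most of its time waiting on the network.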

cURL

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3-ASR-1.7B",
    "messages": [
      {"role": "user", "content": [
        {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"}}
      ]}
    ]
  }'
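The examples above pass a public URL. For a local file, one option is to embed the audio as a base64 data URL in the same audio_url field. This is a sketch under an assumption: that the vLLM server in this deployment accepts data URLs for audio, which this guide does not confirm. The helper names and the audio/wav default are illustrative.

```python
import base64
from pathlib import Path


def audio_data_url(path, mime="audio/wav"):
    """Encode a local audio file as a base64 data URL.

    Assumption: the server accepts data URLs in audio_url.
    """
    payload = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"


def audio_message(url):
    """Build a user message with one audio_url content item."""
    return {
        "role": "user",
        "content": [{"type": "audio_url", "audio_url": {"url": url}}],
    }
```

To try it, pass messages=[audio_message(audio_data_url("sample.wav"))] to client.chat.completions.create, where sample.wav is a placeholder file name. Base64 adds roughly a third to the payload size, so a hosted URL is likely the better choice for long recordings.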
