Setup
To get started, sign into Baseten with Truss and then install the OpenAI SDK.
Sign in to Baseten:
uvx truss login --browser
Qwen/Qwen3-ASR-1.7B is a 1.7B-parameter encoder-decoder model.
This preset serves Qwen3-ASR on a single H100 40GB through vLLM, tuned for fast multilingual transcription.
Write the config
Create and move into the project directory:
mkdir qwen3-asr-1.7b-latency && cd qwen3-asr-1.7b-latency
Then create a file named config.yaml and paste the following:
model_name: "model:qwen3-asr-1.7b preset:latency"
model_metadata:
  repo_id: Qwen/Qwen3-ASR-1.7B
  example_model_input:
    stream: false
    messages:
      - role: user
        content:
          - type: audio_url
            audio_url:
              url: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav
  tags:
    - openai-compatible
secrets:
  hf_access_token: null
weights:
  - source: "hf://Qwen/Qwen3-ASR-1.7B@main"
    mount_location: "/app/models/Qwen/Qwen3-ASR-1.7B"
    auth_secret_name: "hf_access_token"
base_image:
  image: vllm/vllm-openai:v0.18.0
docker_server:
  start_command: sh -c "vllm serve /app/models/Qwen/Qwen3-ASR-1.7B --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8000"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
resources:
  accelerator: H100_40GB:1
  cpu: "1"
  memory: 10Gi
  use_gpu: true
requirements:
  - vllm[audio]
  - librosa
  - torch
  - torchaudio
  - pynvml
  - ffmpeg-python
system_packages:
  - python3.10-venv
  - ffmpeg
  - openmpi-bin
  - libopenmpi-dev
runtime:
  predict_concurrency: 256
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
| --gpu-memory-utilization | 0.8 | Fraction of GPU memory vLLM may use for weights and KV cache. |
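To get a feel for what a utilization of 0.8 means on this hardware, the following is a rough back-of-the-envelope sketch, not a measurement. It assumes the weights are stored at 2 bytes per parameter (for example bf16) and treats the card's 40 GB as 40 GiB for simplicity; the function name `vllm_memory_budget_gib` is hypothetical. Real usage also includes activations and runtime overhead, so the "remaining" figure is an upper bound on what is left for the KV cache.

```python
def vllm_memory_budget_gib(total_gib: float, utilization: float,
                           params_billion: float, bytes_per_param: int = 2) -> dict:
    """Rough split of the GPU memory budget between weights and everything else.

    Assumption: weights take bytes_per_param bytes each (2 for bf16/fp16).
    """
    # Memory vLLM is allowed to claim under --gpu-memory-utilization.
    budget = total_gib * utilization
    # Parameter count times bytes per parameter, converted to GiB.
    weights = params_billion * 1e9 * bytes_per_param / 2**30
    return {
        "budget_gib": budget,
        "weights_gib": weights,
        "remaining_gib": budget - weights,  # KV cache, activations, overhead
    }


estimate = vllm_memory_budget_gib(total_gib=40, utilization=0.8, params_billion=1.7)
print({name: round(value, 1) for name, value in estimate.items()})
```

Under these assumptions the weights occupy only a small share of the budget, which leaves most of it for the KV cache and concurrent requests.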
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model qwen3-asr-1.7b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
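If you script the deployment, the model ID can be pulled out of the logs URL instead of copied by hand. This is a minimal sketch that assumes the URL keeps the `/models/<model_id>/logs/<deployment_id>` shape shown in the example output; `model_id_from_logs_url` is a hypothetical helper, not part of Truss.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    parts = urlparse(logs_url).path.strip("/").split("/")
    # Assumed layout: models/<model_id>/logs/<deployment_id>
    if len(parts) < 2 or parts[0] != "models" or not parts[1]:
        raise ValueError(f"unexpected logs URL: {logs_url}")
    return parts[1]


model_id = model_id_from_logs_url(
    "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
)
print(model_id)  # abcd1234
```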
Call the model
Your deployment serves an OpenAI-compatible chat completions API at /v1/chat/completions that accepts audio inputs. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Send audio as an audio_url content item on a chat message. The model returns the transcription as the assistant message content.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-ASR-1.7B",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"
                    },
                }
            ],
        }
    ],
)

print(response.choices[0].message.content)
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3-ASR-1.7B",
    "messages": [
      {"role": "user", "content": [
        {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"}}
      ]}
    ]
  }'
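Because `{model_id}` has to be substituted before either request will work, it can help to build the base URL and the message list in small functions. The sketch below only assembles values and makes no network call; `baseten_base_url` and `audio_messages` are hypothetical helper names, and the URL pattern is copied from the examples above rather than from an API reference.

```python
def baseten_base_url(model_id: str) -> str:
    """Fill the model ID into the production sync URL used in this guide."""
    if not model_id or "{" in model_id:
        raise ValueError("replace {model_id} with your real model ID")
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


def audio_messages(audio_url: str) -> list:
    """Wrap one audio URL as a single user message with an audio_url content item."""
    return [
        {
            "role": "user",
            "content": [{"type": "audio_url", "audio_url": {"url": audio_url}}],
        }
    ]


base_url = baseten_base_url("abcd1234")  # example ID from the push output
messages = audio_messages(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"
)
print(base_url)
```

The resulting `base_url` and `messages` can then be passed to `OpenAI(...)` and `client.chat.completions.create(...)` in place of the literals in the Python example.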