Setup
To get started, sign into Baseten with Truss and then install the OpenAI SDK.Sign in to Baseten
Install the OpenAI SDK
Hardware
H100_40GB × 1
Engine
vLLM 0.18.0
Concurrency
256
Write the config
Create and move into the project directory:config.yaml and paste the following:
config.yaml
Flags
Thestart_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
--gpu-memory-utilization | 0.8 | Fraction of GPU memory vLLM may use for weights and KV cache. |
Deploy
Push the config to Baseten:/models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible chat completions API at/v1/chat/completions that accepts audio inputs. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Send audio as an audio_url content item on a chat message. The model returns the transcription as the assistant message content.
- Python
- cURL
main.py