Setup
To get started, sign in to Baseten with Truss, then install the websockets library.
Sign in to Baseten
Install websockets
Hardware
H100_40GB × 1
Engine
vLLM (0.22.0-cu129 build)
Write the config
Create and move into the project directory, then create config.yaml and paste the following:
config.yaml
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
| --tensor-parallel-size | 1 | Number of GPUs to shard the model across. |
| --compilation-config | {"cudagraph_mode": "PIECEWISE"} | vLLM compilation passes (op fusion, dead-code elimination). |
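
Because the --compilation-config value is JSON, it has to be quoted when it is embedded in a shell-style start_command. The following is a minimal sketch of how the two flags from the table could be assembled into one command string. It assumes the engine is launched with `vllm serve`, and the model name `your-org/your-model` is a placeholder, not the model this preset serves.

```python
import json
import shlex

# Flag value taken from the table above.
compilation_config = {"cudagraph_mode": "PIECEWISE"}

# "your-org/your-model" is a hypothetical placeholder model name.
args = [
    "vllm", "serve", "your-org/your-model",
    "--tensor-parallel-size", "1",
    "--compilation-config", json.dumps(compilation_config),
]

# shlex.join quotes the JSON so a shell passes it as a single argument.
start_command = shlex.join(args)
print(start_command)
```

Splitting the resulting string with `shlex.split` should give back the original argument list, which is a quick way to check the quoting before pasting the command into config.yaml.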
Deploy
Push the config to Baseten with truss push. The model ID appears in the truss push output (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
This preset exposes a WebSocket streaming endpoint at /v1/realtime for low-latency, incremental transcription. See the streaming transcription API reference for the message protocol, Python client example, and supported audio formats.
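
As a rough illustration of where {model_id} fits, the sketch below builds a WebSocket URL and an authorization header for the endpoint. The host template and path prefix are assumptions made for this example and are not confirmed by this page, so take the exact URL from the API reference. The helper names `realtime_url` and `auth_headers` are hypothetical. The sketch stops short of opening a connection because the message protocol is defined in the API reference.

```python
# Assumed host pattern and path prefix; verify both against the API reference.
ASSUMED_HOST_TEMPLATE = "wss://model-{model_id}.api.baseten.co"
ASSUMED_PATH_PREFIX = "/environments/production/sync"


def realtime_url(model_id: str, path_prefix: str = ASSUMED_PATH_PREFIX) -> str:
    """Return a candidate WebSocket URL ending in /v1/realtime."""
    host = ASSUMED_HOST_TEMPLATE.format(model_id=model_id)
    return f"{host}{path_prefix}/v1/realtime"


def auth_headers(api_key: str) -> dict[str, str]:
    """Return an Authorization header in the Api-Key scheme."""
    return {"Authorization": f"Api-Key {api_key}"}


# abcd1234 is the example model ID used in the Deploy step.
print(realtime_url("abcd1234"))
```

Keeping the host and prefix as named constants makes it easy to swap in the values from the API reference without touching the rest of a client.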