The streaming audio transcription endpoint is only available over WebSockets; it is not compatible with the REST API. To begin using the transcription endpoint, establish a WebSocket connection. Once connected, you must first send a metadata JSON object (as a string) over the WebSocket. This metadata tells the model what format and type of audio data to expect. After the metadata is sent, you can stream raw audio bytes directly over the same WebSocket connection.
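For illustration, a minimal sketch of this handshake using the `websockets` Python library. The WebSocket URL shape is an assumption, and whether streaming parameters sit at the top level of the metadata object or under a group key is also an assumption; substitute your model's actual endpoint and the parameters described below.

```python
import asyncio
import json

import websockets  # pip install websockets


async def main():
    # Assumed URL shape; check your model dashboard for the real endpoint.
    url = "wss://model-YOUR_MODEL_ID.api.baseten.co/v1/websocket"
    headers = {"Authorization": "Api-Key YOUR_API_KEY"}

    # `additional_headers` on websockets >= 13; older versions use `extra_headers`.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # 1. The metadata JSON object must be the first message, sent as a string.
        await ws.send(json.dumps({"encoding": "pcm_s16le", "sample_rate": 16000}))

        # 2. Then stream raw audio bytes over the same connection.
        with open("speech.raw", "rb") as f:  # 16 kHz, 16-bit, mono PCM
            while chunk := f.read(3200):  # 3200 bytes = 100 ms at 16 kHz s16le
                await ws.send(chunk)

        # 3. Read transcripts as they arrive (end-of-audio handling is covered
        #    in the FAQ below).
        async for message in ws:
            print(json.loads(message))


asyncio.run(main())
```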
Parameters

- model_id: The ID of the model you want to call.
- Authorization: Your Baseten API key, formatted with the Api-Key prefix (for example, {"Authorization": "Api-Key abcd1234.abcd1234"}).

Websocket Metadata
These parameters configure the Voice Activity Detector (VAD) and allow you to tune behavior such as speech endpointing.
- threshold (float, default=0.5): The probability threshold for detecting speech, between 0.0 and 1.0. Frames with a probability above this value are considered speech. A higher threshold makes the VAD more selective, reducing false positives from background noise.
- min_silence_duration_ms (int, default=300): The minimum duration of silence (in milliseconds) required to determine that speech has ended.
- speech_pad_ms (int, default=0): Padding (in milliseconds) added to both the start and end of detected speech segments to avoid cutting off words prematurely.
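As a sketch, a metadata fragment tuning endpointing for noisier audio. The field names are documented above, but the vad_params group key is hypothetical:

```python
# Hypothetical "vad_params" group key; field names are documented above.
vad_params = {
    "threshold": 0.6,                # be more selective in noisy environments
    "min_silence_duration_ms": 500,  # require half a second of silence to endpoint
    "speech_pad_ms": 100,            # keep 100 ms of context around each segment
}
```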
Parameters for controlling streaming ASR behavior.
- encoding (string, default="pcm_s16le"): Audio encoding format.
- sample_rate (int, default=16000): Audio sample rate in Hz. Whisper models are optimized for a sample rate of 16,000 Hz.
- enable_partial_transcripts (boolean, optional): If set to true, intermediate (partial) transcripts will be sent over the WebSocket as audio is received. For most voice AI use cases, we recommend setting this to false.
- partial_transcript_interval_s (float, default=0.5): Interval in seconds that the model waits before sending a partial transcript, if partials are enabled.
- final_transcript_max_duration_s (int, default=30): The maximum duration of buffered audio (in seconds) before a final transcript is forcibly returned. This value should not exceed 30.
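For example, a configuration sketch for a voice-agent-style session. The field names and defaults are documented above; whether they sit at the top level of the metadata object or under a group key is an assumption:

```python
# Field names and defaults are documented above; the grouping is illustrative.
streaming_params = {
    "encoding": "pcm_s16le",
    "sample_rate": 16000,                    # Whisper is optimized for 16 kHz
    "enable_partial_transcripts": False,     # recommended for most voice AI use cases
    "final_transcript_max_duration_s": 30,   # must not exceed 30
}
```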
Parameters for controlling Whisper’s behavior.
- prompt (string, optional): Optional transcription prompt.
- audio_language (string, default="en"): Language of the input audio. Set to "auto" for automatic detection.
- language_detection_only (boolean, default=false): If true, only return the automatic language detection result without transcribing.
- language_options (list[string], default=[]): List of language codes to consider for language detection, for example ["en", "zh"]. Scoping detection to the set of languages that makes sense for your use case can improve detection accuracy. By default, all languages supported by the Whisper model are considered. [Added since v0.5.0]
- use_dynamic_preprocessing (boolean, default=false): Enables dynamic range compression to process audio with variable loudness.
- show_word_timestamps (boolean, default=false): If true, include word-level timestamps in the output. [Added since v0.4.0]
- show_beam_results (boolean, default=false): If true, include transcriptions from all beams of beam search in the response. [Added since v0.7.5]
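For example, a whisper_params sketch for multilingual input. The whisper_params key appears in the deprecation note below; the values here are illustrative:

```python
whisper_params = {
    "audio_language": "auto",          # detect the language automatically
    "language_options": ["en", "zh"],  # scope detection to expected languages (v0.5.0+)
    "show_word_timestamps": True,      # word-level timestamps (v0.4.0+)
}
```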
Advanced parameters for controlling Whisper’s sampling behavior.
- beam_width (integer, optional): Beam search width for decoding. Controls the number of candidate sequences to maintain during beam search. [Added since v0.6.0]
- length_penalty (float, optional): Length penalty applied to the output. Higher values encourage longer outputs. [Added since v0.6.0]
- repetition_penalty (float, optional): Penalty for repeating tokens. Higher values discourage repetition. [Added since v0.6.0]
- beam_search_diversity_rate (float, optional): Controls diversity in beam search. Higher values increase diversity among beam candidates. [Added since v0.6.0]
- no_repeat_ngram_size (integer, optional): Prevents repetition of n-grams of the specified size. [Added since v0.6.0]
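For example, a decoding-quality sketch. Per the deprecation note below, these fields nest under whisper_params.whisper_sampling_params; the values are illustrative:

```python
whisper_params = {
    "whisper_sampling_params": {
        "beam_width": 4,            # keep 4 candidate sequences while decoding
        "length_penalty": 1.0,      # neutral preference for output length
        "repetition_penalty": 1.1,  # mildly discourage repeated tokens
        "no_repeat_ngram_size": 3,  # forbid repeating any 3-gram
    }
}
```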
Deprecated since v0.6.0. Use whisper_params.whisper_sampling_params instead. Specifically, replace beam_size with whisper_params.whisper_sampling_params.beam_width and length_penalty with whisper_params.whisper_sampling_params.length_penalty.
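A before/after sketch of that migration. The placement of the deprecated fields is shown at the top level for illustration only; adjust to wherever they live in your current metadata:

```python
# Deprecated (pre-v0.6.0): beam_size and length_penalty at their old location.
old_metadata = {"beam_size": 4, "length_penalty": 1.0}

# v0.6.0+: nested under whisper_params.whisper_sampling_params, with
# beam_size renamed to beam_width.
new_metadata = {
    "whisper_params": {
        "whisper_sampling_params": {"beam_width": 4, "length_penalty": 1.0}
    }
}
```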
FAQ

How do I handle end of audio to avoid losing the last utterance?
By default, the VAD-based endpointing only triggers a transcript when it detects a period of silence after speech. If you close the connection abruptly without signaling end-of-audio, any buffered speech that hasn't hit a silence boundary will be lost. To flush the buffer and get a final transcript for all remaining audio, send an end_audio control message before closing the connection. The server will:

- Immediately acknowledge: {"type": "end_audio", "body": {"status": "acknowledged"}}
- Finish transcribing all remaining buffered audio, sending any final transcription results
- Signal completion: {"type": "end_audio", "body": {"status": "finished"}}

Once you receive finished, it is safe to close the connection.
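A sketch of that shutdown sequence, reusing a connected `ws` from the example above. The exact shape of the outgoing end_audio message is an assumption; the response statuses are documented above.

```python
import json


async def finish_session(ws):
    # Assumed client message shape for the end_audio control message.
    await ws.send(json.dumps({"type": "end_audio"}))

    async for raw in ws:
        msg = json.loads(raw)
        if msg.get("type") == "end_audio":
            status = msg.get("body", {}).get("status")
            if status == "acknowledged":
                continue  # server is still draining buffered audio
            if status == "finished":
                break     # all transcripts delivered; safe to close
        else:
            print(msg)    # final transcription results for the remaining audio

    await ws.close()
```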
How do I process multiple audio sessions without reconnecting every time?
Each WebSocket connection is a single streaming session. The metadata (language, VAD config, encoding, etc.) is fixed at connection time and can't be changed mid-session. Once the server sends {"status": "finished"} in response to end_audio, the session is complete and the connection will close.
To process multiple files or conversation turns, open a new connection for each session. To minimize reconnection latency in high-throughput scenarios, establish the next connection before the previous one has fully closed (overlapping connections):
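A sketch of the overlap pattern. It assumes an `open_session` coroutine that connects and sends metadata (like the connection example above) and a `run_session` coroutine that streams one source and waits for finished; both names are hypothetical:

```python
import asyncio


async def transcribe_many(audio_sources, open_session, run_session):
    # Pre-open the first connection.
    next_ws = asyncio.create_task(open_session())

    for source in audio_sources:
        ws = await next_ws
        # Open the next session's connection while this one is streaming.
        next_ws = asyncio.create_task(open_session())
        await run_session(ws, source)  # sends audio + end_audio, waits for "finished"

    # Discard the spare connection opened for a session that never happened.
    spare = await next_ws
    await spare.close()
```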
Each WebSocket connection maps to a dedicated worker on the server, so keeping connections alive unnecessarily will consume server resources. Use health check messages ({"type": "health_check"}) to verify a long-lived connection is still active before sending audio.