Parameters
The ID of the model you want to call.
Your Baseten API key, formatted with the Api-Key prefix (e.g. {"Authorization": "Api-Key abcd1234.abcd1234"}).

Websocket Metadata
These parameters configure the Voice Activity Detector (VAD) and allow you to tune behavior such as speech endpointing.
- threshold (float, default=0.5): The probability threshold for detecting speech, between 0.0 and 1.0. Frames with a probability above this value are considered speech. A higher threshold makes the VAD more selective, reducing false positives from background noise.
- min_silence_duration_ms (int, default=300): The minimum duration of silence (in milliseconds) required to determine that speech has ended.
- speech_pad_ms (int, default=0): Padding (in milliseconds) added to both the start and end of detected speech segments to avoid cutting off words prematurely.
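For example, a reasonable starting point for audio with noticeable background noise might look like the fragment below. The values are illustrative, and how these fields are nested inside the connection metadata is not shown in this excerpt, so the variable name is only a local label.

```python
# VAD settings tuned for noisier audio (values are examples, not recommendations;
# field names and defaults come from the reference above).
vad_settings = {
    "threshold": 0.6,                # stricter than the 0.5 default about what counts as speech
    "min_silence_duration_ms": 500,  # require a longer pause before ending an utterance
    "speech_pad_ms": 100,            # pad segment edges so word onsets and offsets aren't clipped
}
```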
Parameters for controlling streaming ASR behavior.
- encoding (string, default="pcm_s16le"): Audio encoding format.
- sample_rate (int, default=16000): Audio sample rate in Hz. Whisper models are optimized for a sample rate of 16,000 Hz.
- enable_partial_transcripts (boolean, optional): If set to true, intermediate (partial) transcripts will be sent over the WebSocket as audio is received. For most voice AI use cases, we recommend setting this to false.
- partial_transcript_interval_s (float, default=0.5): Interval in seconds that the model waits before sending a partial transcript, if partials are enabled.
- final_transcript_max_duration_s (int, default=30): The maximum duration of buffered audio (in seconds) before a final transcript is forcibly returned. This value should not exceed 30.
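As a sketch, a latency-sensitive voice agent might configure streaming as below. The values follow the recommendations above; how these fields are nested inside the connection metadata is not shown in this excerpt.

```python
# Streaming settings for a voice AI agent (field names from the reference above).
streaming_settings = {
    "encoding": "pcm_s16le",                # 16-bit little-endian PCM
    "sample_rate": 16000,                   # Whisper models are optimized for 16 kHz
    "enable_partial_transcripts": False,    # recommended for most voice AI use cases
    "partial_transcript_interval_s": 0.5,   # only relevant if partials are enabled
    "final_transcript_max_duration_s": 30,  # must not exceed 30
}
```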
Parameters for controlling Whisper’s behavior.
- prompt (string, optional): Optional transcription prompt.
- audio_language (string, default="en"): Language of the input audio. Set to "auto" for automatic detection.
- language_detection_only (boolean, default=false): If true, only return the automatic language detection result without transcribing.
- language_options (list[string], default=[]): List of language codes to consider during language detection, for example ["en", "zh"]. Scoping detection to the languages that make sense for your use case can improve detection accuracy. By default, all languages supported by the Whisper model are considered. [Added since v0.5.0]
- use_dynamic_preprocessing (boolean, default=false): Enables dynamic range compression to process audio with variable loudness.
- show_word_timestamps (boolean, default=false): If true, include word-level timestamps in the output. [Added since v0.4.0]
- show_beam_results (boolean, default=false): If true, include transcriptions from all beams of beam search in the response. [Added since v0.7.5]
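For illustration, the sketch below scopes language detection to the languages a product actually supports, assuming these fields sit under the whisper_params group referenced in the deprecation note further down; the values themselves are examples.

```python
# Whisper behavior: detect the language automatically, but only among the
# languages this application supports (values are illustrative).
whisper_params = {
    "audio_language": "auto",          # let the model detect the language
    "language_options": ["en", "zh"],  # restrict detection to these codes (v0.5.0+)
    "show_word_timestamps": True,      # include word-level timestamps (v0.4.0+)
}
```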
Advanced parameters for controlling Whisper’s sampling behavior.
- beam_width (integer, optional): Beam search width for decoding. Controls the number of candidate sequences maintained during beam search. [Added since v0.6.0]
- length_penalty (float, optional): Length penalty applied to the output. Higher values encourage longer outputs. [Added since v0.6.0]
- repetition_penalty (float, optional): Penalty for repeating tokens. Higher values discourage repetition. [Added since v0.6.0]
- beam_search_diversity_rate (float, optional): Controls diversity in beam search. Higher values increase diversity among beam candidates. [Added since v0.6.0]
- no_repeat_ngram_size (integer, optional): Prevents repetition of n-grams of the specified size. [Added since v0.6.0]
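Per the deprecation note below, these fields are nested under whisper_params as whisper_sampling_params. A sketch with illustrative values:

```python
# Beam search configuration nested at whisper_params.whisper_sampling_params
# (path per the deprecation note below; values are illustrative).
whisper_params = {
    "whisper_sampling_params": {
        "beam_width": 4,            # keep 4 candidate sequences while decoding
        "length_penalty": 1.0,      # neutral; raise to favor longer outputs
        "repetition_penalty": 1.1,  # mildly discourage repeated tokens
        "no_repeat_ngram_size": 3,  # never repeat the same 3-gram
    }
}
```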
Deprecated since v0.6.0. Use whisper_params.whisper_sampling_params instead. Specifically, replace beam_size with whisper_params.whisper_sampling_params.beam_width and length_penalty with whisper_params.whisper_sampling_params.length_penalty.

FAQ
How do I handle end of audio to avoid losing the last utterance?
By default, VAD-based endpointing only triggers a transcript when it detects a period of silence after speech. If you close the connection abruptly without signaling end-of-audio, any speech still buffered that hasn't hit a silence boundary will be lost. To flush the buffer and get a final transcript for all remaining audio, send an end_audio control message before closing the connection. The server will then:
- Immediately acknowledge: {"type": "end_audio", "body": {"status": "acknowledged"}}
- Finish transcribing all remaining buffered audio, sending any final transcription results
- Signal completion: {"type": "end_audio", "body": {"status": "finished"}}
Once the server signals finished, it is safe to close the connection.
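A minimal sketch of this shutdown flow using the Python websockets library. The connection URL, authentication, and the way audio frames are sent are placeholders (see the connection examples in these docs); the acknowledged and finished statuses match the messages listed above, and the exact shape of the client's end_audio message is assumed to mirror them.

```python
import asyncio
import json

import websockets  # pip install websockets


async def stream_and_flush(url: str, audio_chunks) -> None:
    # url, headers, and the raw-PCM audio framing are placeholders; adapt them
    # to the connection examples elsewhere in these docs.
    async with websockets.connect(url) as ws:
        for chunk in audio_chunks:
            await ws.send(chunk)  # binary audio frame

        # Ask the server to flush buffered speech instead of dropping it.
        # (The exact client message shape is an assumption.)
        await ws.send(json.dumps({"type": "end_audio"}))

        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") == "end_audio":
                status = msg.get("body", {}).get("status")
                if status == "acknowledged":
                    continue  # final transcripts will still follow
                if status == "finished":
                    break     # everything is flushed; safe to close
            else:
                handle_transcript(msg)  # partial/final transcript messages


def handle_transcript(msg: dict) -> None:
    print(msg)  # placeholder: route transcripts into your application


# asyncio.run(stream_and_flush("wss://<your model endpoint>", audio_chunks))
```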
How do I process multiple audio sessions without reconnecting every time?
Each WebSocket connection is a single streaming session. The metadata (language, VAD config, encoding, etc.) is fixed at connection time and can't be changed mid-session. Once the server sends {"status": "finished"} in response to end_audio, the session is complete and the connection will close.
To process multiple files or conversation turns, open a new connection for each session. To minimize reconnection latency in high-throughput scenarios, establish the next connection before the previous one has fully closed (overlapping connections):
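A sketch of the overlapping-connection pattern with the Python websockets library: while the current session is still being processed, the connection for the next one is already being established. The URL and the per-session logic are placeholders.

```python
import asyncio

import websockets  # pip install websockets


async def dial(url: str):
    # Pre-dial a connection; wrapped in a coroutine so it can run as a task.
    return await websockets.connect(url)


async def run_session(ws, audio_chunks) -> None:
    # Placeholder: send metadata, stream audio, send end_audio, and read
    # transcripts until the server reports {"status": "finished"}.
    ...


async def process_sessions(url: str, sessions) -> None:
    next_conn = asyncio.create_task(dial(url))  # start dialing before it's needed
    for audio_chunks in sessions:
        ws = await next_conn
        next_conn = asyncio.create_task(dial(url))  # overlap: dial the next session now
        try:
            await run_session(ws, audio_chunks)
        finally:
            await ws.close()
    # Close the spare connection dialed after the final session.
    spare = await next_conn
    await spare.close()
```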
Each WebSocket connection maps to a dedicated worker on the server. Keeping connections alive unnecessarily will consume server resources. Use health check messages ({"type": "health_check"}) to verify a long-lived connection is still active before sending audio.