Parameters
The ID of the model you want to call.
Your Baseten API key, formatted with the prefix Api-Key (e.g. {"Authorization": "Api-Key abcd1234.abcd1234"}).
WebSocket Metadata
These parameters configure the Voice Activity Detector (VAD) and allow you to tune behavior such as speech endpointing.
- threshold (float): The probability threshold for detecting speech, between 0.0 and 1.0. Frames with a probability above this value are considered speech. A higher threshold makes the VAD more selective, reducing false positives from background noise.
- min_silence_duration_ms (int): The minimum duration of silence (in milliseconds) required to determine that speech has ended.
- speech_pad_ms (int): Padding (in milliseconds) added to both the start and end of detected speech segments to avoid cutting off words prematurely.
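To make the interaction between these three settings concrete, the sketch below applies them to a list of per-frame speech probabilities. It only illustrates the general endpointing technique and is not the service's actual VAD implementation; the function name detect_segments, the 32 ms frame length, and the default values are assumptions chosen for the example.

```python
def detect_segments(probs, frame_ms=32, threshold=0.5,
                    min_silence_duration_ms=300, speech_pad_ms=30):
    """Return (start_ms, end_ms) speech segments from per-frame probabilities.

    Illustrative only: frame_ms and the defaults are assumed values.
    """
    segments = []
    start = None          # start time of the open speech segment, if any
    last_speech_end = 0   # end time of the most recent speech frame
    for i, p in enumerate(probs):
        t0, t1 = i * frame_ms, (i + 1) * frame_ms
        if p > threshold:
            # Frame counts as speech: open a segment or extend the open one.
            if start is None:
                start = t0
            last_speech_end = t1
        elif start is not None and t1 - last_speech_end >= min_silence_duration_ms:
            # Enough continuous silence: close the segment, padded on both sides.
            segments.append((max(0, start - speech_pad_ms),
                             last_speech_end + speech_pad_ms))
            start = None
    if start is not None:
        # Audio ended while speech was still open.
        segments.append((max(0, start - speech_pad_ms),
                         last_speech_end + speech_pad_ms))
    return segments
```

In this sketch, a larger min_silence_duration_ms merges utterances separated by short pauses into a single segment, while a smaller value ends segments sooner.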
Parameters for controlling streaming ASR behavior.
- encoding (string, default="pcm_s16le"): Audio encoding format.
- sample_rate (int, default=16000): Audio sample rate in Hz. Whisper models are optimized for a sample rate of 16,000 Hz.
- enable_partial_transcripts (boolean, optional): If set to true, intermediate (partial) transcripts are sent over the WebSocket as audio is received. For most voice AI use cases, we recommend setting this to false.
- partial_transcript_interval_s (float, default=0.5): Interval in seconds that the model waits before sending a partial transcript, if partials are enabled.
- final_transcript_max_duration_s (int, default=30): The maximum duration of buffered audio (in seconds) before a final transcript is forcibly returned. This value should not exceed 30.
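As a rough illustration of what these settings imply for the audio you send, the sketch below computes buffer sizes for pcm_s16le audio and collects the streaming settings into a dictionary. The helper names audio_bytes and build_streaming_params are hypothetical, and single-channel (mono) audio is an assumption; the parameter names and defaults are the ones listed above.

```python
BYTES_PER_SAMPLE = 2  # pcm_s16le: each sample is a signed 16-bit little-endian integer


def audio_bytes(duration_s, sample_rate=16000, channels=1):
    """Size in bytes of duration_s seconds of pcm_s16le audio (mono is assumed)."""
    return int(duration_s * sample_rate) * channels * BYTES_PER_SAMPLE


def build_streaming_params(enable_partial_transcripts=False,
                           partial_transcript_interval_s=0.5,
                           final_transcript_max_duration_s=30):
    """Hypothetical helper: gather the streaming settings into one dict."""
    if final_transcript_max_duration_s > 30:
        raise ValueError("final_transcript_max_duration_s should not exceed 30")
    return {
        "encoding": "pcm_s16le",
        "sample_rate": 16000,
        "enable_partial_transcripts": enable_partial_transcripts,
        "partial_transcript_interval_s": partial_transcript_interval_s,
        "final_transcript_max_duration_s": final_transcript_max_duration_s,
    }
```

Under these assumptions, the 30-second limit on final_transcript_max_duration_s corresponds to 960,000 bytes of buffered audio, and a 0.5-second partial interval to 16,000 bytes.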
Parameters for controlling Whisper’s behavior.
- prompt (string, optional): Optional transcription prompt.
- audio_language (string, default="en"): Language of the input audio. Set to "auto" for automatic detection.
- language_detection_only (boolean, default=false): If true, only return the automatic language detection result without transcribing.
- language_options (list[string], default=[]): List of language codes to consider for language detection, for example ["en", "zh"]. This can improve language detection accuracy by restricting detection to the set of languages that makes sense for your use case. By default, all languages supported by the Whisper model are considered. [Added since v0.5.0]
- use_dynamic_preprocessing (boolean, default=false): Enables dynamic range compression to process audio with variable loudness.
- show_word_timestamps (boolean, default=false): If true, include word-level timestamps in the output. [Added since v0.4.0]
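The effect of language_options can be pictured with a small sketch: detection produces a score per language, and restricting the candidates removes unlikely languages that happen to score well on short or noisy audio. This is a conceptual illustration, not the service's implementation; the function pick_language and the scores are made up for the example.

```python
def pick_language(lang_probs, language_options=()):
    """Pick the highest-scoring language, optionally restricted to language_options.

    lang_probs maps language codes to detection scores (hypothetical values).
    An empty language_options means every language is a candidate.
    """
    candidates = {
        code: score
        for code, score in lang_probs.items()
        if not language_options or code in language_options
    }
    if not candidates:
        raise ValueError("none of the language_options has a detection score")
    return max(candidates, key=candidates.get)


# Made-up scores for a short, noisy English clip.
scores = {"en": 0.40, "nl": 0.45, "zh": 0.15}
unrestricted = pick_language(scores)             # all languages considered
restricted = pick_language(scores, ["en", "zh"])  # scoped to the expected languages
```

With these made-up scores, the unrestricted call selects "nl", while scoping detection to ["en", "zh"] selects "en".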
Advanced parameters for controlling Whisper’s sampling behavior.
- beam_width (integer, optional): Beam search width for decoding. Controls the number of candidate sequences to maintain during beam search. [Added since v0.6.0]
- length_penalty (float, optional): Length penalty applied to the output. Higher values encourage longer outputs. [Added since v0.6.0]
- repetition_penalty (float, optional): Penalty for repeating tokens. Higher values discourage repetition. [Added since v0.6.0]
- beam_search_diversity_rate (float, optional): Controls diversity in beam search. Higher values increase diversity among beam candidates. [Added since v0.6.0]
- no_repeat_ngram_size (integer, optional): Prevents repetition of n-grams of the specified size. [Added since v0.6.0]
Advanced settings for the automatic speech recognition (ASR) process.
- beam_size (integer, default=1): Beam search size for decoding. Beam sizes up to 5 are supported. [Deprecated since v0.6.0. Use whisper_input.whisper_params.whisper_sampling_params.beam_width instead.]
- length_penalty (float, default=2.0): Length penalty applied to ASR output. The length penalty only takes effect when beam_size is greater than 1. [Deprecated since v0.6.0. Use whisper_input.whisper_params.whisper_sampling_params.length_penalty instead.]
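For configurations that still set the deprecated fields, a migration along the following lines may be useful. This is a minimal sketch: the function migrate_asr_options and the asr_options argument name for the deprecated settings are assumptions, while whisper_params and whisper_sampling_params follow the paths given in the deprecation notes above.

```python
import copy


def migrate_asr_options(whisper_params, asr_options):
    """Copy deprecated beam_size / length_penalty into whisper_sampling_params.

    asr_options is a hypothetical name for the dict holding the deprecated
    settings. Values already present in whisper_sampling_params are kept.
    """
    migrated = copy.deepcopy(whisper_params)  # leave the caller's dict untouched
    sampling = migrated.setdefault("whisper_sampling_params", {})
    if "beam_size" in asr_options:
        # beam_size was replaced by beam_width in v0.6.0.
        sampling.setdefault("beam_width", asr_options["beam_size"])
    if "length_penalty" in asr_options:
        sampling.setdefault("length_penalty", asr_options["length_penalty"])
    return migrated


legacy = {"beam_size": 5, "length_penalty": 2.0}
updated = migrate_asr_options({"audio_language": "en"}, legacy)
```

The sketch lets explicitly set whisper_sampling_params values take precedence over the deprecated ones, which is a design choice for the example rather than documented service behavior.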