Parameters
The ID of the model you want to call.
Your Baseten API key, formatted with the prefix Api-Key (e.g. {"Authorization": "Api-Key abcd1234.abcd1234"}).
Websocket Metadata
These parameters configure the Voice Activity Detector (VAD) and allow you to tune behavior such as speech endpointing.
- threshold (float, default=0.5): The probability threshold for detecting speech, between 0.0 and 1.0. Frames with a probability above this value are considered speech. A higher threshold makes the VAD more selective, reducing false positives from background noise.
- min_silence_duration_ms (int, default=300): The minimum duration of silence (in milliseconds) required to determine that speech has ended.
- speech_pad_ms (int, default=0): Padding (in milliseconds) added to both the start and end of detected speech segments to avoid cutting off words prematurely.
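To make the interaction between these three settings concrete, here is a toy endpointing sketch over per-frame speech probabilities. It is only an illustration of the descriptions above, not the VAD the service actually runs: the VadConfig class, the find_speech_segments helper, and the 100 ms frame size are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class VadConfig:
    threshold: float = 0.5              # speech probability cutoff, 0.0 to 1.0
    min_silence_duration_ms: int = 300  # silence needed before a segment ends
    speech_pad_ms: int = 0              # padding added to both ends of a segment

    def __post_init__(self):
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")
        if self.min_silence_duration_ms < 0 or self.speech_pad_ms < 0:
            raise ValueError("durations must be non-negative")


def find_speech_segments(probs, cfg, frame_ms=100):
    """Return (start_ms, end_ms) speech segments for a list of frame probabilities.

    frame_ms is a hypothetical frame length chosen for readability.
    """
    segments = []
    start = None          # start time of the currently open segment, if any
    last_speech_end = 0   # end time of the most recent speech frame
    total_ms = len(probs) * frame_ms
    for i, p in enumerate(probs):
        t = i * frame_ms
        if p > cfg.threshold:
            if start is None:
                start = t
            last_speech_end = t + frame_ms
        elif start is not None:
            silence_ms = (t + frame_ms) - last_speech_end
            if silence_ms >= cfg.min_silence_duration_ms:
                segments.append((start, last_speech_end))
                start = None
    if start is not None:
        segments.append((start, last_speech_end))
    # Apply padding, clamped to the bounds of the audio.
    pad = cfg.speech_pad_ms
    return [(max(0, s - pad), min(total_ms, e + pad)) for s, e in segments]


probs = [0.1, 0.9, 0.9, 0.2, 0.8, 0.1, 0.1, 0.1, 0.9]
default_segments = find_speech_segments(probs, VadConfig())
eager_segments = find_speech_segments(probs, VadConfig(min_silence_duration_ms=100))
```

With the defaults, the single low-probability frame in the middle of the first utterance is too short to end the segment. Lowering min_silence_duration_ms to 100 splits the same audio into three segments, which is the trade-off to keep in mind when tuning endpointing.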
Parameters for controlling streaming ASR behavior.
- encoding (string, default="pcm_s16le"): Audio encoding format.
- sample_rate (int, default=16000): Audio sample rate in Hz. Whisper models are optimized for a sample rate of 16,000 Hz.
- enable_partial_transcripts (boolean, optional): If set to true, intermediate (partial) transcripts are sent over the WebSocket as audio is received. For most voice AI use cases, we recommend setting this to false.
- partial_transcript_interval_s (float, default=0.5): Interval in seconds that the model waits before sending a partial transcript, if partials are enabled.
- final_transcript_max_duration_s (int, default=30): The maximum duration of buffered audio (in seconds) before a final transcript is forcibly returned. This value should not exceed 30.
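The encoding and sample rate determine how many bytes each slice of audio occupies, which is useful when sizing the chunks you send. The following is a rough sketch of that arithmetic for pcm_s16le (2 bytes per sample). Mono audio, the 100 ms chunk length, and the helper names are assumptions made for this example, not values taken from the API.

```python
BYTES_PER_SAMPLE = 2  # pcm_s16le: signed 16-bit little-endian samples


def chunk_size_bytes(sample_rate=16000, chunk_ms=100, channels=1):
    """Bytes in one chunk of raw pcm_s16le audio (mono is an assumption)."""
    return sample_rate * chunk_ms // 1000 * BYTES_PER_SAMPLE * channels


def max_buffer_bytes(sample_rate=16000, final_transcript_max_duration_s=30, channels=1):
    """Upper bound on buffered audio before a final transcript is forced."""
    if final_transcript_max_duration_s > 30:
        raise ValueError("final_transcript_max_duration_s should not exceed 30")
    return sample_rate * final_transcript_max_duration_s * BYTES_PER_SAMPLE * channels


def split_pcm(audio, size):
    """Split raw PCM bytes into chunks of at most `size` bytes."""
    return [audio[i:i + size] for i in range(0, len(audio), size)]


one_second_of_silence = bytes(16000 * BYTES_PER_SAMPLE)
chunks = split_pcm(one_second_of_silence, chunk_size_bytes())
```

At the default 16,000 Hz, a 100 ms chunk is 3,200 bytes, and the 30-second ceiling corresponds to 960,000 bytes of buffered audio.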
Parameters for controlling Whisper’s behavior.
- prompt (string, optional): Optional transcription prompt.
- audio_language (string, default="en"): Language of the input audio. Set to "auto" for automatic detection.
- language_detection_only (boolean, default=false): If true, only return the automatic language detection result without transcribing.
- language_options (list[string], default=[]): List of language codes to consider for language detection, for example ["en", "zh"]. Restricting detection to the languages that make sense for your use case can improve detection accuracy. By default, all languages supported by the Whisper model are considered. [Added in v0.5.0]
- use_dynamic_preprocessing (boolean, default=false): Enables dynamic range compression to process audio with variable loudness.
- show_word_timestamps (boolean, default=false): If true, include word-level timestamps in the output. [Added in v0.4.0]
- show_beam_results (boolean, default=false): If true, include transcriptions from all beams of beam search in the response. [Added in v0.7.5]
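As a small illustration of how the language settings combine, the sketch below builds three variants: a fixed language, automatic detection scoped to a candidate list, and detection only. The helper name is hypothetical, and the check that rejects language_options without audio_language="auto" is an assumption of this sketch (the candidate list is described as scoping detection), not documented server behavior.

```python
def whisper_language_params(audio_language="en", language_options=None,
                            language_detection_only=False):
    """Build the language-related Whisper parameters as a plain dict."""
    options = list(language_options or [])
    if options and audio_language != "auto":
        # Assumption of this sketch: a candidate list is only meaningful
        # when the language is being detected automatically.
        raise ValueError('language_options requires audio_language="auto"')
    params = {"audio_language": audio_language}
    if options:
        params["language_options"] = options
    if language_detection_only:
        params["language_detection_only"] = True
    return params


fixed = whisper_language_params("en")
scoped = whisper_language_params("auto", language_options=["en", "zh"])
detect = whisper_language_params("auto", ["en", "zh"], language_detection_only=True)
```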
Advanced parameters for controlling Whisper’s sampling behavior.
- beam_width (integer, optional): Beam search width for decoding. Controls the number of candidate sequences to maintain during beam search. [Added in v0.6.0]
- length_penalty (float, optional): Length penalty applied to the output. Higher values encourage longer outputs. [Added in v0.6.0]
- repetition_penalty (float, optional): Penalty for repeating tokens. Higher values discourage repetition. [Added in v0.6.0]
- beam_search_diversity_rate (float, optional): Controls diversity in beam search. Higher values increase diversity among beam candidates. [Added in v0.6.0]
- no_repeat_ngram_size (integer, optional): Prevents repetition of n-grams of the specified size. [Added in v0.6.0]
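Because every sampling parameter is optional, one reasonable approach is to send only the ones you set. The sketch below nests them under whisper_params as whisper_sampling_params, following the path given in the deprecation notes on this page. The helper name, the top-level placement of whisper_params in the metadata message, and the example values are assumptions made for illustration.

```python
import json

SAMPLING_KEYS = {
    "beam_width",
    "length_penalty",
    "repetition_penalty",
    "beam_search_diversity_rate",
    "no_repeat_ngram_size",
}


def build_whisper_params(audio_language="en", **sampling):
    """Nest sampling options under whisper_params, skipping unset ones."""
    unknown = set(sampling) - SAMPLING_KEYS
    if unknown:
        raise ValueError(f"unknown sampling parameters: {sorted(unknown)}")
    params = {"audio_language": audio_language}
    chosen = {k: v for k, v in sampling.items() if v is not None}
    if chosen:
        params["whisper_sampling_params"] = chosen
    return params


whisper_params = build_whisper_params(beam_width=4, repetition_penalty=1.1,
                                      length_penalty=None)
# Serialized form, ready to embed in a metadata message (layout is assumed).
message = json.dumps({"whisper_params": whisper_params})
```

Leaving unset options out of the payload lets the service apply its own defaults instead of receiving explicit nulls.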
Advanced settings for the automatic speech recognition (ASR) process.
- beam_size (integer, default=1): Beam search size for decoding. Beam sizes up to 5 are supported. [Deprecated since v0.6.0. Use whisper_input.whisper_params.whisper_sampling_params.beam_width instead.]
- length_penalty (float, default=2.0): Length penalty applied to ASR output. The length penalty only takes effect when beam_size is greater than 1. [Deprecated since v0.6.0. Use whisper_input.whisper_params.whisper_sampling_params.length_penalty instead.]
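If you have existing configuration that still uses the deprecated fields, a small client-side migration can move them to their replacements. The sketch below is one possible way to do that. The migrate_asr_options helper and the "asr_options" group name are hypothetical (this page does not name the deprecated group), so adjust the key to match your payload.

```python
import copy

# Deprecated field name -> replacement under whisper_sampling_params.
RENAMES = (("beam_size", "beam_width"), ("length_penalty", "length_penalty"))


def migrate_asr_options(metadata, deprecated_key="asr_options"):
    """Return a copy of metadata with deprecated ASR fields moved over.

    deprecated_key is an assumed name for the deprecated settings group.
    Values already present under whisper_sampling_params are kept.
    """
    out = copy.deepcopy(metadata)
    old = out.get(deprecated_key)
    if not old:
        return out
    sampling = out.setdefault("whisper_params", {}).setdefault(
        "whisper_sampling_params", {}
    )
    for old_name, new_name in RENAMES:
        if old_name in old:
            sampling.setdefault(new_name, old.pop(old_name))
    if not old:
        # Drop the deprecated group once nothing is left in it.
        del out[deprecated_key]
    return out


legacy = {
    "asr_options": {"beam_size": 3, "length_penalty": 2.0},
    "whisper_params": {"audio_language": "en"},
}
migrated = migrate_asr_options(legacy)
```

The function works on a deep copy, so the original dictionary is left untouched, and any explicitly set beam_width or length_penalty takes precedence over the deprecated values.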