Authentication uses an `Api-Key` authorization header (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
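For illustration, a minimal request sketch using Python's `requests` library. The endpoint URL and the shape of the JSON body are placeholders, not the service's actual values; only the header format and the `url` field come from this page.

```python
import requests

API_KEY = "abcd1234.abcd1234"  # key in the prefix.secret format shown above
# Placeholder endpoint; substitute your deployment's actual transcription URL.
ENDPOINT = "https://example.com/v1/transcribe"

headers = {"Authorization": f"Api-Key {API_KEY}"}
response = requests.post(
    ENDPOINT,
    headers=headers,
    # Top-level placement of "url" is an assumption; see the audio input fields below.
    json={"url": "https://example.com/audio.wav"},
)
print(response.json())
```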
The audio input is supplied as one of `url`, `audio_b64`, or `audio_bytes`:

- `url` (string): URL of the audio file.
- `audio_b64` (string): Base64-encoded audio content.
- `audio_bytes` (bytes): Raw audio bytes.
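When the file is local rather than hosted, it can be sent via `audio_b64`. A minimal sketch, assuming the field sits at the top level of the JSON body:

```python
import base64

# Read a local file and base64-encode it for the audio_b64 field.
with open("sample.wav", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

payload = {"audio_b64": encoded}  # top-level placement is an assumption
```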
Transcription parameters:

- (string, optional): Optional transcription prompt.
- (string, default="en"): Language of the input audio. Set to `"auto"` for automatic detection.
- (boolean, default=false): If `true`, only return the automatic language detection result without transcribing.
- (list[string], default=[]): Language codes to consider during language detection, for example `["en", "zh"]`. Scoping detection to the languages relevant to your use case can improve detection accuracy. By default, all languages supported by the Whisper model are considered. [Added since v0.5.0]
- (boolean, default=false): Enables dynamic range compression to process audio with variable loudness.
- (boolean, default=false): If `true`, include word-level timestamps in the output. [Added since v0.4.0]
- (boolean, default=true): If `true`, enable audio chunking by the voice activity detection (VAD) model. If `false`, the model can only process up to 30 seconds of audio at a time (see the duration-check sketch below). [Added since v0.6.0]
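Because of that 30-second limit, it can help to check clip length client-side before disabling chunking. A minimal sketch using Python's standard `wave` module, assuming WAV input:

```python
import contextlib
import wave

def clip_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with contextlib.closing(wave.open(path, "rb")) as wf:
        return wf.getnframes() / float(wf.getframerate())

# With VAD chunking disabled, clips longer than 30 s will not be fully processed.
if clip_seconds("sample.wav") > 30:
    print("Clip exceeds 30 s; keep VAD chunking enabled or split the audio first.")
```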
Sampling parameters, nested under `whisper_input.whisper_params.whisper_sampling_params`:

- `beam_width` (integer, optional): Beam search width for decoding. Controls the number of candidate sequences maintained during beam search. [Added since v0.6.0]
- `length_penalty` (float, optional): Length penalty applied to the output. Higher values encourage longer outputs. [Added since v0.6.0]
- (float, optional): Penalty for repeating tokens. Higher values discourage repetition. [Added since v0.6.0]
- (float, optional): Controls diversity in beam search. Higher values increase diversity among beam candidates. [Added since v0.6.0]
- (integer, optional): Prevents repetition of n-grams of the specified size. [Added since v0.6.0]
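A sketch of how these sampling parameters might sit in the request body, following the `whisper_input.whisper_params.whisper_sampling_params` path referenced in the deprecation notes below; the values are illustrative only.

```python
# Placement follows the whisper_input.whisper_params.whisper_sampling_params path
# referenced in the deprecation notes; values here are illustrative only.
payload = {
    "whisper_input": {
        "whisper_params": {
            "whisper_sampling_params": {
                "beam_width": 5,        # replaces the deprecated beam_size
                "length_penalty": 2.0,  # replaces the deprecated length penalty
            }
        }
    }
}
```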
Deprecated parameters:

- `beam_size` (integer, default=1): Beam search size for decoding. We support beam sizes up to 5. [Deprecated since v0.6.0. Use `whisper_input.whisper_params.whisper_sampling_params.beam_width` instead.]
- (float, default=2.0): Length penalty applied to ASR output. The length penalty only takes effect when `beam_size` is greater than 1. [Deprecated since v0.6.0. Use `whisper_input.whisper_params.whisper_sampling_params.length_penalty` instead.]
Voice activity detection (VAD) parameters:

- `max_speech_duration_s` (integer, default=29): Maximum duration of speech, in seconds, to be considered a speech segment. `max_speech_duration_s` cannot exceed 30 because the Whisper model can only take up to 30 seconds of audio input. [Added since v0.4.0]
- `min_silence_duration_ms` (integer, default=3000): At the end of each speech chunk, wait `min_silence_duration_ms` before separating it. [Added since v0.4.0]
- (float, default=0.5): Speech threshold. The VAD outputs a speech probability for each audio chunk; probabilities above this value are considered speech. It is better to tune this parameter for each dataset separately, but a "lazy" 0.5 is pretty good for most datasets. [Added since v0.4.0]