Setup

To get started, sign in to Baseten with Truss, then install the websockets library.
Sign in to Baseten
uvx truss login --browser
Install websockets
uv pip install websockets
mistralai/Voxtral-Mini-4B-Realtime-2602 is a 4B-parameter encoder-decoder model. This preset serves Voxtral Mini Realtime on H100 40GB, tuned for low-latency streaming transcription.

Hardware

H100_40GB × 1

Engine

vLLM (latest build)

Write the config

Create and move into the project directory:
mkdir voxtral-mini-4b-latency && cd voxtral-mini-4b-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: "model:voxtral-mini-4b preset:latency"
model_metadata:
  repo_id: mistralai/Voxtral-Mini-4B-Realtime-2602
secrets:
  hf_access_token: null
weights:
  - source: "hf://mistralai/Voxtral-Mini-4B-Realtime-2602@main"
    mount_location: "/app/models/mistralai/Voxtral-Mini-4B-Realtime-2602"
    auth_secret_name: "hf_access_token"
environment_variables:
  VLLM_DISABLE_COMPILE_CACHE: "1"
base_image:
  image: vllm/vllm-openai:latest
docker_server:
  start_command: sh -c "VLLM_DISABLE_COMPILE_CACHE=1 vllm serve /app/models/mistralai/Voxtral-Mini-4B-Realtime-2602 --compilation-config '{\"cudagraph_mode\":\"PIECEWISE\"}' --host 0.0.0.0 --port 8000"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/realtime
  server_port: 8000
resources:
  accelerator: H100_40GB:1
  cpu: "1"
  memory: 10Gi
  use_gpu: true
requirements:
  - vllm[audio]
  - librosa
  - torch
  - torchaudio
  - pynvml
  - ffmpeg-python
  - websockets
system_packages:
  - python3.10-venv
  - ffmpeg
  - openmpi-bin
  - libopenmpi-dev
runtime:
  is_websocket_endpoint: true
  transport:
    kind: websocket
    ping_interval_seconds: null
    ping_timeout_seconds: null

Flags

The start_command passes the following flag to the engine:

| Flag | Value | What it does |
| --- | --- | --- |
| --compilation-config | {"cudagraph_mode":"PIECEWISE"} | Configures vLLM's compilation settings, which cover compilation passes such as op fusion and dead-code elimination. The value here sets cudagraph_mode to PIECEWISE. |
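The --compilation-config value is a JSON object embedded in a shell command, which is why the quotes in start_command are escaped. As a minimal sketch (not part of the preset), the snippet below builds the same argument list in Python and lets shlex handle the quoting. The MODEL_PATH constant and the build_start_command helper are illustrative names, not Truss or vLLM APIs.

```python
import json
import shlex

# Mount location taken from the weights section of config.yaml.
MODEL_PATH = "/app/models/mistralai/Voxtral-Mini-4B-Realtime-2602"


def build_start_command(compilation_config: dict) -> str:
    """Return a shell-safe `vllm serve` command line (hypothetical helper)."""
    # Compact JSON, matching the value shown in the flags table.
    flag_value = json.dumps(compilation_config, separators=(",", ":"))
    args = [
        "vllm", "serve", MODEL_PATH,
        "--compilation-config", flag_value,
        "--host", "0.0.0.0",
        "--port", "8000",
    ]
    # shlex.join quotes any argument that contains shell-special characters.
    return shlex.join(args)


command = build_start_command({"cudagraph_mode": "PIECEWISE"})
print(command)
```

Splitting the printed command with shlex.split and parsing the fifth token with json.loads recovers the original dictionary, which is a quick way to check that the escaping survived.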

Deploy

Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model voxtral-mini-4b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
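If you script the deploy step, the model ID can be pulled out of the logs URL programmatically. The following is a small sketch using only the standard library; model_id_from_logs_url is a hypothetical helper name, and it assumes the URL has the /models/{model_id}/... shape shown in the example output.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    segments = [s for s in urlparse(logs_url).path.split("/") if s]
    if "models" not in segments:
        raise ValueError(f"no /models/ segment in {logs_url!r}")
    index = segments.index("models") + 1
    if index >= len(segments):
        raise ValueError(f"no model ID after /models/ in {logs_url!r}")
    return segments[index]


# URL copied from the example output above.
logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
print(model_id_from_logs_url(logs_url))  # abcd1234
```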

Call the model

This preset exposes a WebSocket streaming endpoint at /v1/realtime for low-latency, incremental transcription. See the streaming transcription API reference for the message protocol, Python client example, and supported audio formats.
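Before wiring up the full message protocol, it can help to confirm that the deployment accepts a WebSocket connection. The sketch below only opens and closes a connection and sends no audio. It rests on several assumptions that should be checked against the API reference: the wss://model-{model_id}.api.baseten.co/environments/production/websocket URL pattern, the Api-Key authorization header, and the BASETEN_API_KEY environment variable name. The helper names are hypothetical.

```python
import asyncio
import os


def realtime_url(model_id: str, environment: str = "production") -> str:
    """Build the WebSocket URL (assumed pattern; confirm in the API reference)."""
    return f"wss://model-{model_id}.api.baseten.co/environments/{environment}/websocket"


def auth_headers(api_key: str) -> dict:
    """Build the authorization header (assumed Api-Key scheme)."""
    return {"Authorization": f"Api-Key {api_key}"}


async def check_connection(model_id: str, api_key: str) -> None:
    """Open a connection, report success, and close it without sending audio."""
    # Imported here so the helpers above work even without websockets installed.
    import websockets

    # `additional_headers` is the keyword in recent websockets releases;
    # older releases use `extra_headers` instead.
    async with websockets.connect(
        realtime_url(model_id),
        additional_headers=auth_headers(api_key),
    ):
        print("connected to", realtime_url(model_id))


# Example (replace abcd1234 with your model ID):
# asyncio.run(check_connection("abcd1234", os.environ["BASETEN_API_KEY"]))
```

The message payloads for streaming audio are deliberately left out; follow the streaming transcription API reference for those.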