Setup

To get started, sign in to Baseten with Truss, then install the websockets library.
Sign in to Baseten
uvx truss login --browser
Install websockets
uv pip install websockets
mistralai/Voxtral-Mini-4B-Realtime-2602 is a 4B-parameter encoder-decoder model. This preset serves Voxtral Mini Realtime on a single H100 40GB GPU, tuned for low-latency streaming transcription.

Hardware

H100_40GB × 1

Engine

vLLM (latest build)

Write the config

Create and move into the project directory:
mkdir voxtral-mini-4b-latency && cd voxtral-mini-4b-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: "model:voxtral-mini-4b preset:latency"
model_metadata:
  repo_id: mistralai/Voxtral-Mini-4B-Realtime-2602
secrets:
  hf_access_token: null
weights:
  - source: "hf://mistralai/Voxtral-Mini-4B-Realtime-2602@main"
    mount_location: "/app/models/mistralai/Voxtral-Mini-4B-Realtime-2602"
    auth_secret_name: "hf_access_token"
environment_variables:
  VLLM_DISABLE_COMPILE_CACHE: "1"
base_image:
  image: vllm/vllm-openai:latest
docker_server:
  start_command: sh -c "VLLM_DISABLE_COMPILE_CACHE=1 vllm serve /app/models/mistralai/Voxtral-Mini-4B-Realtime-2602 --compilation-config '{\"cudagraph_mode\":\"PIECEWISE\"}' --host 0.0.0.0 --port 8000"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/realtime
  server_port: 8000
resources:
  accelerator: H100_40GB:1
  cpu: "1"
  memory: 10Gi
  use_gpu: true
requirements:
  - vllm[audio]
  - librosa
  - torch
  - torchaudio
  - pynvml
  - ffmpeg-python
  - websockets
system_packages:
  - python3.10-venv
  - ffmpeg
  - openmpi-bin
  - libopenmpi-dev
runtime:
  is_websocket_endpoint: true
  transport:
    kind: websocket
    ping_interval_seconds: null
    ping_timeout_seconds: null

Flags

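The start_command passes the following tuning flag to the engine, alongside --host and --port, which bind the server to 0.0.0.0:8000 to match server_port:

| Flag | Value | What it does |
| --- | --- | --- |
| --compilation-config | {"cudagraph_mode":"PIECEWISE"} | Configures vLLM's compilation passes (op fusion, dead-code elimination). Here it sets cudagraph_mode to PIECEWISE. |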
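The JSON value for --compilation-config is quoted twice in config.yaml: its double quotes are backslash-escaped inside the sh -c "..." argument, and the whole value is wrapped in single quotes for the inner shell. As a quick sanity check, the sketch below uses Python's shlex module to approximate how a POSIX shell would split the start_command and confirms that the flag value parses as JSON. START_COMMAND is copied from the config above; the helper names parse_vllm_args and flag_value are illustrative and not part of Truss or vLLM.

```python
import json
import shlex

# Copied verbatim from docker_server.start_command in config.yaml.
START_COMMAND = r'''sh -c "VLLM_DISABLE_COMPILE_CACHE=1 vllm serve /app/models/mistralai/Voxtral-Mini-4B-Realtime-2602 --compilation-config '{\"cudagraph_mode\":\"PIECEWISE\"}' --host 0.0.0.0 --port 8000"'''


def parse_vllm_args(start_command: str) -> list[str]:
    """Approximate the two rounds of shell word splitting (hypothetical helper)."""
    outer = shlex.split(start_command)  # ['sh', '-c', '<inner command>']
    if len(outer) != 3 or outer[:2] != ["sh", "-c"]:
        raise ValueError("expected a command of the form: sh -c \"...\"")
    return shlex.split(outer[2])  # tokens roughly as the inner shell sees them


def flag_value(args: list[str], flag: str) -> str:
    """Return the token that immediately follows `flag`."""
    return args[args.index(flag) + 1]


args = parse_vllm_args(START_COMMAND)

# json.loads raises an error here if the escaping in config.yaml is broken.
compilation_config = json.loads(flag_value(args, "--compilation-config"))
port = int(flag_value(args, "--port"))

print(compilation_config)
print(port)
```

If you edit the flag value, rerun the check before pushing: a JSON parsing error here usually points to a missing backslash or an unbalanced quote in start_command.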

Deploy

Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model voxtral-mini-4b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
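If you script your deployments, you can pull the model ID out of the logs URL instead of copying it by hand. The following is a minimal sketch that assumes the URL has the shape shown in the example output; extract_model_id is a hypothetical helper, not a Truss function.

```python
from urllib.parse import urlparse


def extract_model_id(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    # Split the URL path into non-empty segments,
    # e.g. ['models', 'abcd1234', 'logs', 'wxyz5678'].
    segments = [s for s in urlparse(logs_url).path.split("/") if s]
    try:
        return segments[segments.index("models") + 1]
    except (ValueError, IndexError):
        raise ValueError(f"no model ID found in URL: {logs_url!r}") from None


model_id = extract_model_id(
    "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
)
print(model_id)
```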

Call the model

This preset exposes a WebSocket streaming endpoint at /v1/realtime for low-latency, incremental transcription. See the streaming transcription API reference for the message protocol, Python client example, and supported audio formats.
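As a starting point, here is a rough sketch of opening the connection from Python with the websockets library installed during setup. Several details are assumptions rather than facts taken from this page: the wss:// URL pattern, the production environment name, and the Api-Key authorization header. Confirm all three against the API reference. The helper names are illustrative, and the sketch deliberately sends no messages because the protocol is defined in the reference.

```python
def realtime_url(model_id: str, environment: str = "production") -> str:
    """Build the WebSocket URL for a deployed model.

    ASSUMPTION: this URL pattern is not stated on this page;
    check it against the streaming transcription API reference.
    """
    return (
        f"wss://model-{model_id}.api.baseten.co"
        f"/environments/{environment}/websocket"
    )


def auth_headers(api_key: str) -> dict[str, str]:
    """ASSUMPTION: the API key is sent as 'Authorization: Api-Key <key>'."""
    return {"Authorization": f"Api-Key {api_key}"}


async def open_connection(model_id: str, api_key: str) -> None:
    """Open the socket, then close it again without sending anything."""
    # Imported lazily so the helpers above work without the library.
    import websockets

    # `additional_headers` is the keyword in recent websockets releases;
    # older releases named it `extra_headers`.
    async with websockets.connect(
        realtime_url(model_id),
        additional_headers=auth_headers(api_key),
    ) as ws:
        # Send audio and read transcripts here, following the message
        # protocol described in the API reference.
        pass


print(realtime_url("abcd1234"))
```

To try it, replace abcd1234 with your {model_id} and run asyncio.run(open_connection(model_id, api_key)) with your Baseten API key.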