Skip to main content

Setup

To get started, sign into Baseten with Truss and then install the websockets library.
Sign in to Baseten
uvx truss login --browser
Install websockets
uv pip install websockets
mistralai/Voxtral-Mini-4B-Realtime-2602 is a 4B-parameter encoder-decoder model. This preset serves Voxtral Mini Realtime on H100 40GB, tuned for low-latency streaming transcription.

Hardware

H100_40GB × 1

Engine

vLLM (0.22.0-cu129 build)

Write the config

Create and move into the project directory:
mkdir voxtral-mini-4b-latency && cd voxtral-mini-4b-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: "model:voxtral-mini-4b preset:latency"
model_metadata:
  repo_id: mistralai/Voxtral-Mini-4B-Realtime-2602
secrets:
  hf_access_token: null
weights:
  - source: "hf://mistralai/Voxtral-Mini-4B-Realtime-2602@main"
    mount_location: "/app/checkpoint/model"
    auth_secret_name: "hf_access_token"
environment_variables:
  VLLM_DISABLE_COMPILE_CACHE: "1"
base_image:
  image: vllm/vllm-openai:v0.22.0-cu129
docker_server:
  start_command: >-
    sh -c "VLLM_DISABLE_COMPILE_CACHE=1 vllm serve /app/checkpoint/model
    --tensor-parallel-size 1
    --served-model-name mistralai/Voxtral-Mini-4B-Realtime-2602
    --host 0.0.0.0
    --port 8000
    --compilation-config '{\"cudagraph_mode\": \"PIECEWISE\"}'"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/realtime
  server_port: 8000
resources:
  accelerator: H100_40GB:1
  cpu: "1"
  memory: 10Gi
  use_gpu: true
requirements:
  - vllm[audio]
  - librosa
  - torch
  - torchaudio
  - pynvml
  - ffmpeg-python
  - websockets
system_packages:
  - python3.10-venv
  - ffmpeg
  - openmpi-bin
  - libopenmpi-dev
runtime:
  is_websocket_endpoint: true
  transport:
    kind: websocket
    ping_interval_seconds: null
    ping_timeout_seconds: null

Flags

The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
FlagValueWhat it does
--tensor-parallel-size1Number of GPUs to shard the model across.
--compilation-config{"cudagraph_mode": "PIECEWISE"}vLLM compilation passes (op fusion, dead-code elimination).

Deploy

Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model voxtral-mini-4b-latency was successfully pushed ✨

   Model ID:      abc1d2ef
   Deployment ID: xyz123
   Endpoint:      model-abc1d2ef.api.baseten.co
   Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123
Your model ID is printed in the truss push output (abcd1234 in the example). Use it wherever you see {model_id} in the next section.

Call the model

This preset exposes a WebSocket streaming endpoint at /v1/realtime for low-latency, incremental transcription. See the streaming transcription API reference for the message protocol, Python client example, and supported audio formats.