Voxtral

Mistral’s Voxtral Mini Realtime is a 4B speech-to-text model tuned for real-time streaming transcription.

Setup

Sign in to Baseten

uvx truss login --browser

Install websockets

uv pip install websockets

This preset serves Voxtral Mini Realtime on a single H100 GPU, tuned for low-latency streaming transcription.

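Property	Value
Hardware	H100
Engine	vLLM (0.22.0 custom build)
Context	10K tokens (`--max-model-len 10240`)
Concurrency	48 (`--max-num-seqs` and `predict_concurrency` in the config below)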

Write the config

Create and move into the project directory:

mkdir voxtral-mini-4b-latency && cd voxtral-mini-4b-latency

Then create a file named config.yaml and paste the following:

config.yaml

model_name: "model:voxtral-mini-4b preset:latency"
model_metadata:
  repo_id: mistralai/Voxtral-Mini-4B-Realtime-2602
secrets:
  hf_access_token: null
weights:
  - source: "hf://mistralai/Voxtral-Mini-4B-Realtime-2602@main"
    mount_location: "/app/checkpoint/model"
    auth_secret_name: "hf_access_token"
environment_variables:
  VLLM_CACHE_ROOT: /cache/org/vllm
  TORCHINDUCTOR_CACHE_DIR: /cache/org/inductor
  TRITON_CACHE_DIR: /cache/org/triton
base_image:
  image: "baseten/vllm-openai:0.22.0-voxtral-realtime-fixes"
docker_server:
  start_command: >-
    sh -c "vllm serve /app/checkpoint/model
    --tensor-parallel-size 1
    --api-server-count 8
    --enable-realtime-unbounded 
    --no-enable-prefix-caching
    --realtime-reanchor-margin-tokens 1024
    --hf-overrides '{\"text_config\": {\"sliding_window\": 4096}}'
    --served-model-name mistralai/Voxtral-Mini-4B-Realtime-2602
    --host 0.0.0.0 
    --port 8000
    --max-num-seqs 48
    --max-model-len 10240
    --compilation-config '{\"cudagraph_mode\": \"PIECEWISE\", \"cudagraph_capture_sizes\": [1, 2, 4, 8, 16, 24, 32, 48], \"max_cudagraph_capture_size\": 48}'"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/realtime
  server_port: 8000
resources:
  accelerator: H100
  cpu: "8"
  memory: 32Gi
  use_gpu: true
requirements:
  - librosa
  - pynvml
  - ffmpeg-python
  - websockets
system_packages:
  - python3.10-venv
  - ffmpeg
  - openmpi-bin
  - libopenmpi-dev
runtime:
  predict_concurrency: 48
  is_websocket_endpoint: true
  transport:
    kind: websocket
    ping_interval_seconds: null
    ping_timeout_seconds: null
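The config states a few values twice: the port appears as `--port` and `server_port`, and the concurrency limit appears as `--max-num-seqs`, `predict_concurrency`, and the largest CUDA graph capture size. The following is a minimal sketch of a pre-push check for those pairs. The helper names `parse_flags` and `check_consistency` are hypothetical, the command is an abbreviated copy of the one above, and treating a mismatch as a problem is an assumption of this sketch rather than a documented requirement.

```python
import json
import shlex

# Abbreviated copy of start_command from config.yaml, keeping only the flags checked below.
START_COMMAND = r"""
sh -c "vllm serve /app/checkpoint/model
--port 8000
--max-num-seqs 48
--compilation-config '{\"cudagraph_mode\": \"PIECEWISE\", \"cudagraph_capture_sizes\": [1, 2, 4, 8, 16, 24, 32, 48], \"max_cudagraph_capture_size\": 48}'"
"""


def parse_flags(start_command):
    """Return {flag: value or None} for the vllm command wrapped in sh -c "..."."""
    # First pass splits `sh`, `-c`, and the double-quoted command (unescaping \").
    _, _, inner = shlex.split(start_command)
    # Second pass splits the vllm command itself; single-quoted JSON stays one token.
    tokens = shlex.split(inner)
    flags = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("--"):
            has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
            flags[token] = tokens[i + 1] if has_value else None
            i += 2 if has_value else 1
        else:
            i += 1  # positional arguments such as `vllm serve <path>`
    return flags


def check_consistency(flags, server_port, predict_concurrency):
    """Return a list of human-readable mismatches (empty when everything lines up)."""
    problems = []
    if int(flags["--port"]) != server_port:
        problems.append("--port does not match server_port")
    max_seqs = int(flags["--max-num-seqs"])
    if max_seqs != predict_concurrency:
        problems.append("--max-num-seqs does not match predict_concurrency")
    compilation = json.loads(flags["--compilation-config"])
    if max(compilation["cudagraph_capture_sizes"]) != max_seqs:
        problems.append("largest cudagraph capture size does not match --max-num-seqs")
    return problems


flags = parse_flags(START_COMMAND)
print(check_consistency(flags, server_port=8000, predict_concurrency=48))
```

With the values from this preset the list is empty. Changing one side of a pair, for example lowering `predict_concurrency`, produces one entry per mismatch.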

Flags

The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

Flag	Value	What it does
`--tensor-parallel-size`	`1`	Number of GPUs to shard the model across.
`--api-server-count`	`8`	Number of API server processes vLLM runs in front of the engine, spreading HTTP and WebSocket handling across CPU cores.
`--enable-realtime-unbounded`	(no value)	Lets realtime streaming sessions run without a fixed duration cap. Available in Baseten’s patched vLLM build, not upstream vLLM.
`--no-enable-prefix-caching`	(no value)	Disables prefix caching, so repeated prompts do not reuse cached KV blocks.
`--realtime-reanchor-margin-tokens`	`1024`	Token margin the realtime endpoint keeps when it re-anchors a long stream’s context window, bounding memory growth over long sessions.
`--hf-overrides`	`{"text_config": {"sliding_window": 4096}}`	Overrides fields in the model’s Hugging Face config as a JSON object. The dotted form (`--hf-overrides.<field>`) sets the same fields one at a time.
`--max-num-seqs`	`48`	Maximum number of concurrent sequences in the batch.
`--max-model-len`	`10240`	Maximum context length (tokens) the server accepts per request.
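`--compilation-config`	`{"cudagraph_mode": "PIECEWISE", "cudagraph_capture_sizes": [1, 2, 4, 8, 16, 24, 32, 48], "max_cudagraph_capture_size": 48}`	vLLM compilation and CUDA graph settings as a JSON object. This value selects piecewise CUDA graphs and captures them at batch sizes from 1 to 48, the same ceiling as `--max-num-seqs`.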
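The JSON-valued flags appear in config.yaml with `\"` instead of `"` because the whole command sits inside a double-quoted `sh -c` string, while each JSON value is wrapped in single quotes. If you generate the command from a script, the escaping can be sketched as below. The helper name `json_flag` is hypothetical, and the sketch assumes the JSON contains no single quotes, backslashes, `$`, or backticks, which holds for the values in this preset.

```python
import json


def json_flag(name, value):
    """Render `--name '<json>'` for use inside a double-quoted sh -c string."""
    payload = json.dumps(value)
    # Assumption: these characters would need extra shell escaping, so reject them.
    if any(ch in payload for ch in ("'", "\\", "$", "`")):
        raise ValueError("value needs shell escaping this sketch does not handle")
    # Escape double quotes so they survive the outer double-quoted string.
    escaped = payload.replace('"', '\\"')
    return f"{name} '{escaped}'"


print(json_flag("--hf-overrides", {"text_config": {"sliding_window": 4096}}))
```

The printed fragment matches the `--hf-overrides` line in the config, so the same helper can render `--compilation-config` from a Python dictionary.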

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model voxtral-mini-4b-latency was successfully pushed ✨

   Model ID:      abc1d2ef
   Deployment ID: xyz123
   Endpoint:      model-abc1d2ef.api.baseten.co
   Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123

truss push prints your model ID (abc1d2ef in the example). The examples below use it wherever you see {model_id}, and read your API key from the BASETEN_API_KEY environment variable.
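If you script the deployment, you can pull the IDs out of the push output instead of copying them by hand. This is a minimal sketch that assumes the output has the same shape as the sample above; the real CLI output may differ between versions, and `parse_push_output` is a hypothetical helper.

```python
import re

# Sample output copied from the example above.
PUSH_OUTPUT = """\
✨ Model voxtral-mini-4b-latency was successfully pushed ✨

   Model ID:      abc1d2ef
   Deployment ID: xyz123
   Endpoint:      model-abc1d2ef.api.baseten.co
"""


def parse_push_output(text):
    """Collect the labeled fields from the push output into a dict."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*(Model ID|Deployment ID|Endpoint):\s+(\S+)", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


fields = parse_push_output(PUSH_OUTPUT)
print(fields["Model ID"])
```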

Call the model

This preset exposes a WebSocket streaming endpoint at /v1/realtime for low-latency, incremental transcription. See the streaming transcription API reference for the message protocol, Python client example, and supported audio formats.
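As a rough orientation before you read the API reference, a client session might look like the sketch below. Several parts are assumptions and not confirmed by this page: the WebSocket URL pattern, the `Api-Key` authorization header, the message and event names (modeled on the upstream vLLM realtime example), and 16 kHz mono 16-bit PCM input. The `additional_headers` argument requires websockets 14 or newer. Treat the API reference as authoritative wherever it differs.

```python
import asyncio
import base64
import json
import os

# Matches --served-model-name in config.yaml.
MODEL_NAME = "mistralai/Voxtral-Mini-4B-Realtime-2602"


def websocket_url(model_id, environment="production"):
    # Assumption: Baseten's WebSocket URL pattern for a deployed model.
    return f"wss://model-{model_id}.api.baseten.co/environments/{environment}/websocket"


def auth_headers(api_key):
    # Assumption: Baseten API keys are sent as "Api-Key <key>".
    return {"Authorization": f"Api-Key {api_key}"}


def audio_messages(pcm_bytes, chunk_size=3200):
    """Yield base64 append messages; 3200 bytes is 100 ms of 16 kHz 16-bit mono audio."""
    for start in range(0, len(pcm_bytes), chunk_size):
        chunk = pcm_bytes[start:start + chunk_size]
        yield {
            "type": "input_audio_buffer.append",  # assumed message name
            "audio": base64.b64encode(chunk).decode("ascii"),
        }


async def transcribe(model_id, pcm_bytes):
    """Stream audio to the endpoint and return the concatenated transcript."""
    import websockets  # imported here so the helpers above work without it installed

    headers = auth_headers(os.environ["BASETEN_API_KEY"])
    async with websockets.connect(websocket_url(model_id), additional_headers=headers) as ws:
        # Assumed handshake: select the model, then signal that audio will follow.
        await ws.send(json.dumps({"type": "session.update", "model": MODEL_NAME}))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        for message in audio_messages(pcm_bytes):
            await ws.send(json.dumps(message))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit", "final": True}))

        parts = []
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "transcription.delta":  # assumed event name
                parts.append(event.get("delta", ""))
            elif event.get("type") == "transcription.done":  # assumed event name
                break
        return "".join(parts)
```

To try it, load raw PCM bytes from your own audio source and run `asyncio.run(transcribe("{model_id}", pcm_bytes))` with your model ID substituted. For live audio you would send chunks as they are captured instead of looping over a complete buffer.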
