MiniMax M2.5

Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.

Sign in to Baseten

uvx truss login --browser

Install the OpenAI SDK

uv pip install openai

MiniMaxAI/MiniMax-M2.5 is a 229B-parameter MoE model with up to 200K context. This preset serves MiniMax M2.5 on H100:4 with expert-parallel sharding and Runai Streamer weight loading, optimized for maximum batch throughput.

Hardware

H100 × 4

Engine

vLLM (latest build)

Context

200K

Concurrency

64

Write the config

Create and move into the project directory:

mkdir minimax-m2.5-throughput && cd minimax-m2.5-throughput

Then create a file named config.yaml and paste the following:

config.yaml

model_metadata:
  example_model_input:
    messages:
      - role: system
        content: "You are a helpful assistant."
      - role: user
        content: "What is the meaning of life?"
    stream: true
    model: MiniMaxAI/MiniMax-M2.5
    max_tokens: 32768
    temperature: 0.7
  tags:
    - openai-compatible

model_name: "model:minimax-m2.5 preset:throughput"

base_image:
  image: vllm/vllm-openai:latest

docker_server:
  start_command: >
    sh -c "SAFETENSORS_FAST_GPU=1 python3 -m vllm.entrypoints.openai.api_server
    --model /models/MiniMax-M2.5
    --host 0.0.0.0 --port 8000
    --served-model-name MiniMaxAI/MiniMax-M2.5
    --tensor-parallel-size $(nvidia-smi -L | wc -l)
    --enable_expert_parallel
    --trust-remote-code
    --load-format runai_streamer
    --disable-log-stats
    --max-num-seqs 64
    --max-num-batched-tokens 8192
    --tool-call-parser minimax_m2
    --reasoning-parser minimax_m2_append_think
    --enable-auto-tool-choice"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000

weights:
  - source: "hf://MiniMaxAI/MiniMax-M2.5@main"
    mount_location: "/models/MiniMax-M2.5"
    ignore_patterns:
      - "*.md"
      - "*.txt"

resources:
  accelerator: H100:4
  use_gpu: true

runtime:
  predict_concurrency: 64
  health_checks:
    restart_check_delay_seconds: 1800
    restart_threshold_seconds: 1200
    stop_traffic_threshold_seconds: 120
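
This preset sets both `--max-num-seqs` and `predict_concurrency` to 64. Assuming you want the two to stay matched when you edit the config, a small check along these lines can catch drift. The helper `flag_value` and the abbreviated command string are illustrative and not part of Truss or vLLM:

```python
import re
from typing import Optional

def flag_value(command: str, flag: str) -> Optional[str]:
    """Return the token that follows `flag` in a shell command string, or None."""
    match = re.search(rf"{re.escape(flag)}[ =](\S+)", command)
    return match.group(1) if match else None

# Abbreviated copy of the start_command from config.yaml above.
start_command = (
    "python3 -m vllm.entrypoints.openai.api_server "
    "--max-num-seqs 64 --max-num-batched-tokens 8192"
)
predict_concurrency = 64  # runtime.predict_concurrency in config.yaml

max_num_seqs = int(flag_value(start_command, "--max-num-seqs"))
in_sync = max_num_seqs == predict_concurrency
print(f"max-num-seqs={max_num_seqs}, predict_concurrency={predict_concurrency}, in sync: {in_sync}")
```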

Flags

The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

Flag	Value	What it does
`--model`	`/models/MiniMax-M2.5`	Path (or HF repo) the engine loads the model from.
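`--tensor-parallel-size`	`$(nvidia-smi -L | wc -l)`	Number of GPUs to shard the model across. The shell expression counts the GPUs visible in the container at startup, so it resolves to 4 on H100:4.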
`--enable_expert_parallel`	(no value)	Shard MoE expert weights across tensor-parallel ranks instead of replicating them, reducing per-GPU memory for large MoE models.
`--trust-remote-code`	(no value)	Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).
`--load-format`	`runai_streamer`	Weight loading backend. runai_streamer: Stream weights from object storage without materializing to disk.
`--disable-log-stats`	(no value)	Suppress periodic engine stats logging.
`--max-num-seqs`	`64`	Maximum number of concurrent sequences in the batch.
`--max-num-batched-tokens`	`8192`	Maximum total tokens processed per scheduler step.
`--tool-call-parser`	`minimax_m2`	Server-side parser that emits structured `tool_calls` on the response. minimax_m2: MiniMax M2 tool format.
`--reasoning-parser`	`minimax_m2_append_think`	Server-side parser that separates reasoning output into `reasoning_content`. minimax_m2_append_think: MiniMax M2 append-think format.
`--enable-auto-tool-choice`	(no value)	Let the model choose when to call tools without requiring `tool_choice: "required"`.
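
The `--tensor-parallel-size` value is computed when the container starts: `nvidia-smi -L` prints one line per GPU and `wc -l` counts those lines. The sketch below mirrors that counting in Python. The sample listing is illustrative, with UUIDs elided, and was not captured from a real deployment:

```python
def count_gpus(nvidia_smi_listing: str) -> int:
    """Mirror `nvidia-smi -L | wc -l`: count one non-empty line per device."""
    return sum(1 for line in nvidia_smi_listing.splitlines() if line.strip())

# Illustrative listing for a four-GPU node; real output carries a full UUID per device.
sample_listing = """\
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-...)
GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-...)
GPU 2: NVIDIA H100 80GB HBM3 (UUID: GPU-...)
GPU 3: NVIDIA H100 80GB HBM3 (UUID: GPU-...)
"""

tensor_parallel_size = count_gpus(sample_listing)
print(tensor_parallel_size)
```

Because the size is derived from the hardware, changing `resources.accelerator` changes the tensor-parallel size without editing the start command.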

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model minimax-m2.5-throughput was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678

Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
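
If you script deployments, you can pull the model ID out of the logs URL instead of copying it by hand. This is a minimal sketch that assumes the URL keeps the `/models/<id>/logs/...` shape shown in the example output; `model_id_from_logs_url` is a hypothetical helper, not part of Truss:

```python
from urllib.parse import urlparse

def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    segments = [s for s in urlparse(logs_url).path.split("/") if s]
    try:
        return segments[segments.index("models") + 1]
    except (ValueError, IndexError):
        raise ValueError(f"no model ID found in {logs_url!r}") from None

logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
model_id = model_id_from_logs_url(logs_url)

# Fill the {model_id} placeholder used by the API base URL.
base_url = f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"
print(model_id)
print(base_url)
```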

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID, make sure the BASETEN_API_KEY environment variable is set, then call the deployment to run inference:

Python

main.py

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.5",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

cURL

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "MiniMaxAI/MiniMax-M2.5",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
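
The example input in config.yaml sets `stream: true`. A minimal streaming sketch with the OpenAI SDK might look like the following. `collect_text` is a hypothetical helper, and the request only runs when the file is executed directly with `BASETEN_API_KEY` set and `{model_id}` replaced:

```python
import os
from typing import Iterable

def collect_text(chunks: Iterable) -> str:
    """Print streamed content deltas as they arrive and return the joined text."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:      # some chunks carry no choices (e.g. usage-only)
            continue
        delta = chunk.choices[0].delta.content
        if delta:                  # role-only or tool-call deltas have no content
            print(delta, end="", flush=True)
            parts.append(delta)
    return "".join(parts)

if __name__ == "__main__" and "BASETEN_API_KEY" in os.environ:
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["BASETEN_API_KEY"],
        base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
    )
    stream = client.chat.completions.create(
        model="MiniMaxAI/MiniMax-M2.5",
        messages=[{"role": "user", "content": "What is machine learning?"}],
        stream=True,               # ask the server for incremental chunks
    )
    collect_text(stream)
```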

The server parses the model’s chain of thought into a separate reasoning_content field on the response. Read it alongside the final answer:

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.5",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)            # final answer
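
If a response arrives with `reasoning_content` empty or absent, the reasoning may still be inline in `content` inside `<think>` tags. Whether that happens depends on the parser and vLLM version in use, so treat it as an assumption and check your own responses. A defensive fallback could look like this sketch, where `split_reasoning` is a hypothetical helper and the sample message is made up for illustration:

```python
import re
from types import SimpleNamespace

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(message) -> tuple[str, str]:
    """Return (reasoning, answer) whether or not the server separated them."""
    content = message.content or ""
    reasoning = getattr(message, "reasoning_content", None)
    if reasoning:                  # server already split the two fields
        return reasoning, content
    match = THINK_BLOCK.search(content)
    if match is None:              # no inline reasoning found
        return "", content
    answer = content[:match.start()] + content[match.end():]
    return match.group(1).strip(), answer.strip()

# Stand-in for response.choices[0].message with reasoning left inline.
message = SimpleNamespace(content="<think>s-t-r-a-w-b-e-r-r-y has three r's</think>There are 3.")
reasoning, answer = split_reasoning(message)
print(reasoning)
print(answer)
```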

To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.5",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
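
The model only requests a tool call; your code still has to run the tool and send the result back. The sketch below turns `tool_calls` into `tool` messages using the standard OpenAI chat format. `get_weather` is a hypothetical stub, and `TOOL_REGISTRY` and `run_tool_calls` are illustrative names:

```python
import json
from types import SimpleNamespace

def get_weather(location: str) -> str:
    """Hypothetical stub; a real implementation would call a weather service."""
    return f"Weather lookup for {location} is not implemented in this sketch."

TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_calls(tool_calls) -> list[dict]:
    """Execute each requested tool and build the `tool` messages to send back."""
    messages = []
    for call in tool_calls or []:
        func = TOOL_REGISTRY[call.function.name]
        arguments = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,   # ties the result to the model's request
            "content": func(**arguments),
        })
    return messages

# Stand-in for response.choices[0].message.tool_calls from the request above.
example_calls = [SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
)]
tool_messages = run_tool_calls(example_calls)
print(tool_messages)
```

To get a final answer, append the assistant message that contained the tool calls, followed by these `tool` messages, to `messages` and call `client.chat.completions.create` again with the same `tools`.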
