Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.
Sign in to Baseten
uvx truss login --browser
Install the OpenAI SDK
uv pip install openai
zai-org/GLM-5 is Z.ai’s frontier MoE model, with up to 128K context. This preset serves the GLM-5 FP8 checkpoint on B200:8 and is tuned for low time-to-first-token.

Hardware: B200 × 8
Engine: vLLM (glm5 build)
Context: 128K
Concurrency: 64

Write the config

Create and move into the project directory:
mkdir glm-5-latency && cd glm-5-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_metadata:
  example_model_input:
    messages:
      - role: system
        content: "You are a helpful assistant."
      - role: user
        content: "What is the meaning of life?"
    stream: true
    model: zai-org/GLM-5
    max_tokens: 32768
    temperature: 0.7
  tags:
    - openai-compatible

model_name: "model:glm-5 preset:latency"

base_image:
  image: vllm/vllm-openai:glm5

docker_server:
  start_command: >
    sh -c "VLLM_DEEP_GEMM_WARMUP=relax python3 -m vllm.entrypoints.openai.api_server
    --model /models/GLM-5-FP8
    --chat-template /models/GLM-5-FP8/chat_template.jinja
    --host 0.0.0.0 --port 8000
    --served-model-name zai-org/GLM-5
    --tensor-parallel-size 8
    --trust-remote-code
    --load-format runai_streamer
    --disable-log-stats
    --max-num-seqs 64
    --max-num-batched-tokens 8192
    --tool-call-parser glm47
    --reasoning-parser glm45
    --enable-auto-tool-choice
    --speculative-config.method mtp
    --speculative-config.num_speculative_tokens 1"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000

weights:
  - source: "hf://zai-org/GLM-5-FP8@main"
    mount_location: "/models/GLM-5-FP8"
    ignore_patterns:
      - "*.md"
      - "*.txt"

resources:
  accelerator: B200:8
  use_gpu: true

runtime:
  predict_concurrency: 64
  health_checks:
    restart_check_delay_seconds: 1800
    restart_threshold_seconds: 1200
    stop_traffic_threshold_seconds: 120

Flags

The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
| --- | --- | --- |
| --model | /models/GLM-5-FP8 | Path (or HF repo) the engine loads the model from. |
| --chat-template | /models/GLM-5-FP8/chat_template.jinja | Path to a Jinja chat template file that overrides the checkpoint’s default. |
| --tensor-parallel-size | 8 | Number of GPUs to shard the model across. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
| --load-format | runai_streamer | Weight loading backend. runai_streamer: Stream weights from object storage without materializing to disk. |
| --disable-log-stats | (no value) | Suppress periodic engine stats logging. |
| --max-num-seqs | 64 | Maximum number of concurrent sequences in the batch. |
| --max-num-batched-tokens | 8192 | Maximum total tokens processed per scheduler step. |
| --tool-call-parser | glm47 | Server-side parser that emits structured tool_calls on the response. |
| --reasoning-parser | glm45 | Server-side parser that separates reasoning output into reasoning_content. |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --speculative-config.method | mtp | Speculative decoding method. mtp: Multi-token prediction head speculation. |
| --speculative-config.num_speculative_tokens | 1 | Number of tokens the draft speculator proposes per step. |
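In this preset, --max-num-seqs matches runtime.predict_concurrency (both 64) and --tensor-parallel-size matches the GPU count in accelerator (B200:8). If you edit the config, a quick check that these stay aligned may save a failed deploy. The sketch below is one way to do that with the standard library; parse_flags is a hypothetical helper written for this example, not part of Truss or vLLM, and the command string is a shortened copy of the start_command above.

```python
import shlex


def parse_flags(command: str) -> dict:
    """Collect `--flag value` pairs from a command string.

    Flags with no following value (for example --trust-remote-code)
    map to True. This is an illustrative helper, not a Truss or vLLM API.
    """
    tokens = shlex.split(command)
    flags = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("--"):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if nxt is not None and not nxt.startswith("--"):
                flags[tok] = nxt
                i += 2
                continue
            flags[tok] = True
        i += 1
    return flags


# Shortened copy of the start_command from config.yaml.
command = (
    "python3 -m vllm.entrypoints.openai.api_server "
    "--model /models/GLM-5-FP8 --tensor-parallel-size 8 "
    "--trust-remote-code --max-num-seqs 64 --max-num-batched-tokens 8192"
)
flags = parse_flags(command)

# Values copied by hand from the resources and runtime sections.
accelerator = "B200:8"
predict_concurrency = 64

gpu_count = int(accelerator.split(":")[1])
tp_matches_gpus = int(flags["--tensor-parallel-size"]) == gpu_count
seqs_match_concurrency = int(flags["--max-num-seqs"]) == predict_concurrency
print(tp_matches_gpus, seqs_match_concurrency)
```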

Deploy

Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model glm-5-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
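If you script your deploys, the model ID can be pulled out of the logs URL rather than copied by hand. This is a minimal sketch assuming the URL has the shape shown in the example output; model_id_from_logs_url is a hypothetical helper name, not a Truss function.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    parts = urlparse(url).path.strip("/").split("/")
    if "models" not in parts:
        raise ValueError(f"no /models/ segment in {url!r}")
    idx = parts.index("models")
    if idx + 1 >= len(parts) or not parts[idx + 1]:
        raise ValueError(f"no model ID after /models/ in {url!r}")
    return parts[idx + 1]


logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
model_id = model_id_from_logs_url(logs_url)
print(model_id)
```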

Call the model

Your deployment serves an OpenAI-compatible API. In the code below, replace {model_id} with your model ID and make sure the BASETEN_API_KEY environment variable is set:
main.py
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="zai-org/GLM-5",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
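Note that the base_url above is a plain string, so {model_id} is a literal placeholder that you must replace. A small helper can build the URL and catch a placeholder left in by mistake. This is a sketch only; baseten_base_url is a hypothetical name, and the URL pattern is the one from main.py.

```python
def baseten_base_url(model_id: str) -> str:
    """Build the OpenAI-compatible base URL for a production deployment."""
    if not model_id or "{" in model_id or "}" in model_id:
        # Guards against passing the unreplaced "{model_id}" placeholder.
        raise ValueError("model_id must be your real model ID, not a placeholder")
    return (
        f"https://model-{model_id}.api.baseten.co"
        "/environments/production/sync/v1"
    )


base_url = baseten_base_url("abcd1234")
print(base_url)
```

Pass the result as base_url when constructing the OpenAI client.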
The server parses the model’s chain of thought into a separate reasoning_content field on the response. Read it alongside the final answer:
response = client.chat.completions.create(
    model="zai-org/GLM-5",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)            # final answer
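The example_model_input in config.yaml sets stream: true. When streaming, the reasoning and the answer arrive as incremental deltas. The sketch below accumulates both, assuming each delta carries the same reasoning_content field as the non-streaming response; that field name on deltas is an assumption to verify against your engine build. collect_stream and fake_chunk are hypothetical helpers, and the chunks here are stand-ins so the example runs without a deployment.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join streamed deltas into (reasoning, answer) strings."""
    reasoning, answer = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks may carry no choices
        delta = chunk.choices[0].delta
        piece = getattr(delta, "reasoning_content", None)  # assumed field name
        if piece:
            reasoning.append(piece)
        piece = getattr(delta, "content", None)
        if piece:
            answer.append(piece)
    return "".join(reasoning), "".join(answer)


def fake_chunk(**fields):
    """Stand-in for a streamed chunk, for illustration only."""
    delta = SimpleNamespace(**fields)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_chunks = [
    fake_chunk(reasoning_content="Spell it out: "),
    fake_chunk(reasoning_content="s-t-r-a-w-b-e-r-r-y."),
    fake_chunk(content="There are 3."),
]
reasoning, answer = collect_stream(demo_chunks)
print(reasoning)
print(answer)
```

With a live deployment, you would pass the iterator returned by client.chat.completions.create(..., stream=True) in place of demo_chunks.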
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="zai-org/GLM-5",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
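The tool_calls on the response only describe what the model wants to call; running the function is up to your code. The sketch below shows one way to dispatch them, assuming the usual OpenAI response shape where each call has an id, a function.name, and function.arguments as a JSON string. The get_weather body, TOOL_REGISTRY, and run_tool_calls are hypothetical, and fake_call stands in for a real response so the example runs offline.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> str:
    """Hypothetical local implementation of the get_weather tool."""
    return f"Sunny in {location}"


# Maps tool names declared in `tools` to local Python functions.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls):
    """Execute each requested tool and return role="tool" messages."""
    messages = []
    for call in tool_calls or []:  # tool_calls is None when no tool was chosen
        func = TOOL_REGISTRY[call.function.name]
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": func(**args),
        })
    return messages


# Stand-in for response.choices[0].message.tool_calls[0].
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(
        name="get_weather",
        arguments='{"location": "Paris"}',
    ),
)
tool_messages = run_tool_calls([fake_call])
print(tool_messages)
```

To get a final answer, append the assistant message that contained the tool calls and then these tool messages to the conversation, and call client.chat.completions.create again.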