Setup
To get started, sign in to Baseten with Truss and then install the OpenAI SDK.
Sign in to Baseten:
uvx truss login --browser
Install the OpenAI SDK:
pip install openai
zai-org/GLM-5 is a MoE model with up to 128K context.
This preset serves GLM-5 FP8, Z.ai’s frontier model, on B200:8 (eight B200 GPUs) and is tuned for the lowest time-to-first-token.
Write the config
Create and move into the project directory:
mkdir glm-5-latency && cd glm-5-latency
Then create a file named config.yaml and paste the following:
model_metadata:
  example_model_input:
    messages:
      - role: system
        content: "You are a helpful assistant."
      - role: user
        content: "What is the meaning of life?"
    stream: true
    model: zai-org/GLM-5
    max_tokens: 32768
    temperature: 0.7
  tags:
    - openai-compatible
model_name: "model:glm-5 preset:latency"
base_image:
  image: vllm/vllm-openai:glm5
docker_server:
  start_command: >
    sh -c "VLLM_DEEP_GEMM_WARMUP=relax python3 -m vllm.entrypoints.openai.api_server
    --model /models/GLM-5-FP8
    --chat-template /models/GLM-5-FP8/chat_template.jinja
    --host 0.0.0.0 --port 8000
    --served-model-name zai-org/GLM-5
    --tensor-parallel-size 8
    --trust-remote-code
    --load-format runai_streamer
    --disable-log-stats
    --max-num-seqs 64
    --max-num-batched-tokens 8192
    --tool-call-parser glm47
    --reasoning-parser glm45
    --enable-auto-tool-choice
    --speculative-config.method mtp
    --speculative-config.num_speculative_tokens 1"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
weights:
  - source: "hf://zai-org/GLM-5-FP8@main"
    mount_location: "/models/GLM-5-FP8"
    ignore_patterns:
      - "*.md"
      - "*.txt"
resources:
  accelerator: B200:8
  use_gpu: true
runtime:
  predict_concurrency: 64
  health_checks:
    restart_check_delay_seconds: 1800
    restart_threshold_seconds: 1200
    stop_traffic_threshold_seconds: 120
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
| --model | /models/GLM-5-FP8 | Path (or HF repo) the engine loads the model from. |
| --chat-template | /models/GLM-5-FP8/chat_template.jinja | Path to a Jinja chat template file that overrides the checkpoint’s default. |
| --tensor-parallel-size | 8 | Number of GPUs to shard the model across. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
| --load-format | runai_streamer | Weight loading backend. runai_streamer: Stream weights from object storage without materializing to disk. |
| --disable-log-stats | (no value) | Suppress periodic engine stats logging. |
| --max-num-seqs | 64 | Maximum number of concurrent sequences in the batch. |
| --max-num-batched-tokens | 8192 | Maximum total tokens processed per scheduler step. |
| --tool-call-parser | glm47 | Server-side parser that emits structured tool_calls on the response. |
| --reasoning-parser | glm45 | Server-side parser that separates reasoning output into reasoning_content. |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --speculative-config.method | mtp | Speculative decoding method. mtp: Multi-token prediction head speculation. |
| --speculative-config.num_speculative_tokens | 1 | Number of tokens the draft speculator proposes per step. |
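Several values in this preset line up with each other: --max-num-seqs matches runtime.predict_concurrency (both 64), and --tensor-parallel-size matches the GPU count in B200:8. The sketch below parses a shortened copy of the start_command and checks those two pairings before you push. Treating the pairings as rules is an assumption drawn from this config, not a documented requirement, and parse_flags and check_consistency are hypothetical helper names.

```python
import shlex

# Shortened copy of the start_command from config.yaml (only the flags checked here).
START_COMMAND = (
    "python3 -m vllm.entrypoints.openai.api_server "
    "--model /models/GLM-5-FP8 "
    "--tensor-parallel-size 8 "
    "--trust-remote-code "
    "--max-num-seqs 64 "
    "--max-num-batched-tokens 8192 "
    "--speculative-config.method mtp "
    "--speculative-config.num_speculative_tokens 1"
)


def parse_flags(command: str) -> dict:
    """Map each --flag to its value, or to True when the flag takes no value."""
    tokens = shlex.split(command)
    flags = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("--"):
            has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("-")
            flags[token] = tokens[i + 1] if has_value else True
            i += 2 if has_value else 1
        else:
            i += 1
    return flags


def check_consistency(flags: dict, predict_concurrency: int, gpu_count: int) -> list:
    """Return human-readable mismatches (assumed pairings, see the text above)."""
    problems = []
    if int(flags["--max-num-seqs"]) != predict_concurrency:
        problems.append("--max-num-seqs differs from runtime.predict_concurrency")
    if int(flags["--tensor-parallel-size"]) != gpu_count:
        problems.append("--tensor-parallel-size differs from the accelerator GPU count")
    return problems


flags = parse_flags(START_COMMAND)
# 64 is runtime.predict_concurrency; 8 is the GPU count in "B200:8".
print(check_consistency(flags, predict_concurrency=64, gpu_count=8))
```

An empty list means the checked values agree. If you change one side of a pairing, the list names the mismatch.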
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model glm-5-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
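If you script the deployment, you can pull the model ID out of the logs URL instead of copying it by hand. This is a minimal sketch, assuming the URL keeps the /models/{model_id}/logs/... shape shown above. The helper names model_id_from_logs_url and base_url_for are hypothetical, and the base URL pattern is the one used in the next section.

```python
import re


def model_id_from_logs_url(url: str) -> str:
    """Return the path segment that follows /models/ in a deployment logs URL."""
    match = re.search(r"/models/([^/]+)", url)
    if match is None:
        raise ValueError(f"No /models/<id> segment found in: {url}")
    return match.group(1)


def base_url_for(model_id: str) -> str:
    """Build the OpenAI-compatible base URL for the production environment."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
model_id = model_id_from_logs_url(logs_url)
print(model_id)
print(base_url_for(model_id))
```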
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="zai-org/GLM-5",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)
print(response.choices[0].message.content)

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "zai-org/GLM-5",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
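The example_model_input in config.yaml sets stream: true, and this preset targets time-to-first-token, so streaming is a natural way to call it. The sketch below assumes the standard OpenAI SDK streaming interface (stream=True, with text arriving in chunk.choices[0].delta.content). collect_stream and stream_answer are hypothetical helper names, and only the final answer text is collected, not any reasoning deltas.

```python
import os


def collect_stream(chunks) -> str:
    """Concatenate the text deltas from an iterable of chat completion chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices; skip them
            continue
        text = getattr(chunk.choices[0].delta, "content", None)
        if text:
            parts.append(text)
    return "".join(parts)


def stream_answer(model_id: str, prompt: str) -> str:
    """Send one prompt with stream=True and return the assembled answer."""
    from openai import OpenAI  # imported here so the helper above has no dependency

    client = OpenAI(
        api_key=os.environ["BASETEN_API_KEY"],
        base_url=f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
    )
    stream = client.chat.completions.create(
        model="zai-org/GLM-5",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return collect_stream(stream)
```

Call stream_answer("abcd1234", "What is machine learning?") with your own model ID. To show tokens as they arrive, print each delta inside the loop instead of joining them at the end.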
The server parses the model’s chain of thought into a separate reasoning_content field on the response. Read it alongside the final answer:
response = client.chat.completions.create(
    model="zai-org/GLM-5",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
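Because reasoning_content is added by the server-side parser and is not a standard OpenAI SDK field, reading it defensively can be safer than direct attribute access. This is a minimal sketch, assuming a response might arrive without the field or with it set to None. split_reasoning is a hypothetical helper name.

```python
def split_reasoning(message) -> tuple:
    """Return (reasoning, answer), substituting "" for any missing or None field."""
    reasoning = getattr(message, "reasoning_content", None)
    answer = getattr(message, "content", None)
    return reasoning or "", answer or ""
```

Pass it response.choices[0].message from the call above, for example reasoning, answer = split_reasoning(response.choices[0].message).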
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="zai-org/GLM-5",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
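The model only requests a tool call; your code has to run the function and send the result back. The sketch below shows one way to do that, assuming tool calls follow the OpenAI SDK shape (id, function.name, and function.arguments as a JSON string). The get_weather implementation, TOOL_REGISTRY, and run_tool_calls are hypothetical stand-ins for illustration.

```python
import json


def get_weather(location: str) -> str:
    """Hypothetical stand-in; a real implementation would query a weather service."""
    return f"Sunny in {location}"


# Maps the tool names declared in `tools` to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls) -> list:
    """Execute each requested tool and build the `role: tool` reply messages."""
    results = []
    for call in tool_calls or []:  # tool_calls is None when no tool was requested
        handler = TOOL_REGISTRY.get(call.function.name)
        if handler is None:
            content = f"Unknown tool: {call.function.name}"
        else:
            arguments = json.loads(call.function.arguments)
            content = handler(**arguments)
        results.append(
            {"role": "tool", "tool_call_id": call.id, "content": content}
        )
    return results
```

To continue the conversation, append the assistant message that contained the tool calls, then the messages returned by run_tool_calls, and call client.chat.completions.create again with the extended messages list.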