Setup
To get started, sign into Baseten with Truss and then install the OpenAI SDK.Sign in to Baseten
Install the OpenAI SDK
Hardware
B200 × 8
Engine
vLLM (glm5 build)
Context
128K
Concurrency
64
Write the config
Create and move into the project directory:config.yaml and paste the following:
config.yaml
Flags
Thestart_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
--model | /models/GLM-5-FP8 | Path (or HF repo) the engine loads the model from. |
--chat-template | /models/GLM-5-FP8/chat_template.jinja | Path to a Jinja chat template file that overrides the checkpoint’s default. |
--tensor-parallel-size | 8 | Number of GPUs to shard the model across. |
--trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
--load-format | runai_streamer | Weight loading backend. runai_streamer: Stream weights from object storage without materializing to disk. |
--disable-log-stats | (no value) | Suppress periodic engine stats logging. |
--max-num-seqs | 64 | Maximum number of concurrent sequences in the batch. |
--max-num-batched-tokens | 8192 | Maximum total tokens processed per scheduler step. |
--tool-call-parser | glm47 | Server-side parser that emits structured tool_calls on the response. |
--reasoning-parser | glm45 | Server-side parser that separates reasoning output into reasoning_content. |
--enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
--speculative-config.method | mtp | Speculative decoding method. mtp: Multi-token prediction head speculation. |
--speculative-config.num_speculative_tokens | 1 | Number of tokens the draft speculator proposes per step. |
Deploy
Push the config to Baseten:/models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace{model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
- Python
- cURL
main.py
reasoning_content field on the response. Read it alongside the final answer:
tools array. The server returns structured tool_calls on the response: