Setup
To get started, sign into Baseten with Truss and then install the OpenAI SDK.Sign in to Baseten
Install the OpenAI SDK
- Standard
- Flash
zai-org/GLM-4.7 is a MoE model with up to 198K context.This preset serves GLM-4.7 from an FP4 checkpoint on B200:4, delivering frontier-class reasoning at single-node cost.Then create a file named
You should see output similar to:Your model ID is the string after
Hardware
B200 × 4
Engine
TRT-LLM v2
Context
198K
Concurrency
64
Write the config
Create and move into the project directory:config.yaml and paste the following:config.yaml
Key parameters
Baseten Inference Stack (BIS) reads these fields from thetrt_llm block. Each one shapes how the engine is built and served:| Parameter | Value |
|---|---|
| Tensor parallel size | 4 |
| Max sequence length | 202752 |
| Max batch size | 64 |
| Max batched tokens | 8192 |
| Chunked prefill | enabled |
| Inference stack | v2 |
| Served model name | glm47 |
Deploy
Push the config to Baseten:/models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.Call the model
Your deployment serves an OpenAI-compatible API. Replace{model_id} with your model ID and make sure BASETEN_API_KEY is set.Now call your deployment to run inference:- Python
- cURL
main.py