Setup
To get started, sign into Baseten with Truss and then install the OpenAI SDK.Sign in to Baseten
Install the OpenAI SDK
Hardware
H100
Engine
vLLM (0.20.2-cu129 build)
Context
256K
Concurrency
1000
Write the config
Create and move into the project directory:config.yaml and paste the following:
config.yaml
Flags
Thestart_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
--tensor-parallel-size | $GPU_COUNT | Number of GPUs to shard the model across. |
--gpu-memory-utilization | 0.95 | Fraction of GPU memory vLLM may use for weights and KV cache. |
--max-model-len | 262144 | Maximum context length (tokens) the server accepts per request. |
--max-num-batched-tokens | 32768 | Maximum total tokens processed per scheduler step. |
--dtype | auto | Weight precision loaded at runtime. auto: Match the model’s checkpoint dtype (default). |
--enable-chunked-prefill | (no value) | Process long prompts in chunks so decode requests keep running. |
--enable-prefix-caching | (no value) | Reuse KV cache across requests that share a prefix. |
--max-num-seqs | 512 | Maximum number of concurrent sequences in the batch. |
--limit-mm-per-prompt.image | 2 | Maximum number of image inputs per prompt. |
--reasoning-parser | qwen3 | Server-side parser that separates reasoning output into reasoning_content. qwen3: Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
--enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
--tool-call-parser | qwen3_coder | Server-side parser that emits structured tool_calls on the response. qwen3_coder: Qwen3-Coder tool format. |
--trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
Deploy
Push the config to Baseten:truss push output (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace{model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
- Python
- cURL
main.py
reasoning_content field on the response:
tools array. The server returns structured tool_calls on the response: