Setup
To get started, sign into Baseten with Truss and then install the OpenAI SDK.Sign in to Baseten
Install the OpenAI SDK
Hardware
H100 × 4
Engine
vLLM (latest build)
Context
128K
Concurrency
256
Write the config
Create and move into the project directory:config.yaml and paste the following:
config.yaml
Flags
Thestart_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
--max-model-len | 131072 | Maximum context length (tokens) the server accepts per request. |
--tensor-parallel-size | 4 | Number of GPUs to shard the model across. |
--distributed-executor-backend | mp | How vLLM coordinates tensor-parallel workers across processes. mp: Python multiprocessing (single-node default). |
--gpu-memory-utilization | 0.95 | Fraction of GPU memory vLLM may use for weights and KV cache. |
--kv-cache-dtype | fp8 | KV cache numeric precision. fp8: ~2× KV cache density with negligible quality impact on most models. |
--limit-mm-per-prompt | {"image": 10} | Max multimodal inputs accepted per prompt (JSON object keyed by modality). |
--override-generation-config | {"attn_temperature_tuning": true} | JSON overrides applied on top of the model’s default generation config. |
Deploy
Push the config to Baseten:/models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace{model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
- Python
- cURL
main.py