config.yaml schema for BIS-LLM (Baseten Inference Stack v2). The v2 stack simplifies the build: section and moves runtime fields out of build.
For translating an Engine-Builder-LLM (v1) configuration to BIS-LLM, see Migrate from Engine-Builder-LLM.
Configuration structure
Build configuration
The build section configures model compilation and optimization settings.
Specifies where to find the model checkpoint. Same structure as v1 with v2-specific optimizations.
For training checkpoint deployment, see Deploy with optimized inference engines. For cloud storage sources (GCS, S3, Azure), see Deploy from cloud storage.
Quantization format for model weights (simplified from v1).
Options:
- no_quant: precision of the repo (fp16 or bf16). BIS-LLM also supports quantized checkpoints from nvidia-modelopt libraries.
- fp8: FP8 weights + 16-bit KV cache
- fp8_kv: FP8 weights + FP8 KV cache
- fp4: FP4 weights + 16-bit KV cache (B200 only)
- fp4_kv: FP4 weights + FP8 KV cache (B200 only)
- fp4_mlp_only: FP4 MLP layers only + 16-bit KV cache (B200 only)
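The option list above can be read as a small lookup table. The following is a minimal sketch, not part of BIS-LLM: the `QuantOption` type, the `check_quantization` helper, and the `GPU:count` parsing are hypothetical names introduced for illustration. It only encodes what the list states (weight format, KV cache format, and the B200-only restriction).

```python
from typing import NamedTuple, Optional


class QuantOption(NamedTuple):
    weights: str
    kv_cache: Optional[str]  # None where the option list does not state it
    b200_only: bool


# Transcribed from the option list; nothing here is read from BIS-LLM itself.
QUANT_OPTIONS = {
    "no_quant": QuantOption("repo precision (fp16 or bf16)", None, False),
    "fp8": QuantOption("FP8", "16-bit", False),
    "fp8_kv": QuantOption("FP8", "FP8", False),
    "fp4": QuantOption("FP4", "16-bit", True),
    "fp4_kv": QuantOption("FP4", "FP8", True),
    "fp4_mlp_only": QuantOption("FP4 (MLP layers only)", "16-bit", True),
}


def check_quantization(option: str, accelerator: str) -> QuantOption:
    """Return the option's details, or raise if it cannot run on the accelerator.

    `accelerator` is assumed to look like "H100:2" or "B200" (hypothetical format).
    """
    if option not in QUANT_OPTIONS:
        raise ValueError(f"unknown quantization option: {option!r}")
    info = QUANT_OPTIONS[option]
    gpu = accelerator.split(":")[0].upper()
    if info.b200_only and gpu != "B200":
        raise ValueError(f"{option} is listed as B200 only, got {gpu}")
    return info
```

A check like this could run before a deploy to catch an FP4 option paired with a non-B200 accelerator, rather than waiting for the build to fail.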
Configuration for post-training quantization calibration.
Number of GPUs to use during the build process. Auto-detected from resources. Minimum: 1, with no fixed maximum.
Skip the engine build step and use a pre-built model that does not require quantization. Use when you have a pre-built engine from model cache.
Runtime configuration
The runtime section configures inference engine behavior.
Maximum sequence length (context) for single requests. Range: 1 to 1048576.
Maximum number of input sequences processed concurrently. Range: 1 to 2048.
Maximum number of batched input tokens after padding removal. Range: 65 to 131072.
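The three ranges above lend themselves to a quick pre-flight check. This is a minimal sketch under stated assumptions: the key names `max_seq_len`, `max_batch_size`, and `max_num_tokens` are placeholders chosen for illustration (this page does not show the field names), and `validate_runtime` is a hypothetical helper, not a BIS-LLM API. Only the numeric bounds come from the descriptions above.

```python
# Inclusive bounds taken from the field descriptions; key names are assumed.
RUNTIME_RANGES = {
    "max_seq_len": (1, 1_048_576),    # maximum sequence length per request
    "max_batch_size": (1, 2048),      # concurrent input sequences
    "max_num_tokens": (65, 131_072),  # batched input tokens after padding removal
}


def validate_runtime(runtime: dict) -> list[str]:
    """Return one message per out-of-range or non-integer value (empty if all pass)."""
    errors = []
    for key, (low, high) in RUNTIME_RANGES.items():
        if key not in runtime:
            continue  # unset fields are left to the stack's defaults
        value = runtime[key]
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not low <= value <= high:
            errors.append(f"{key}={value!r} is outside {low}..{high}")
    return errors
```

For example, `validate_runtime({"max_num_tokens": 64})` returns a single message, because 64 falls below the documented minimum of 65.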
Number of GPUs to use for tensor parallelism. Auto-detected from resources. Minimum: 1, with no fixed maximum (set it to the number of GPUs in your accelerator setting).
Enable chunked prefilling for long sequences.
Model name returned in API responses.
Preview. Pass-through configuration patches for the v2 inference stack. Fields under patch_kwargs may change without notice; keys that overlap standard runtime fields are rejected at build time.
Complete configuration examples
Qwen3-30B-A3B-Instruct-2507 MoE with FP4 on B200
GPT-OSS 120B on B200:1 with no_quant
This example deploys GPT-OSS with default settings. For production throughput with Eagle speculative decoding on B200, see Speculative decoding for BIS-LLM and Advanced features for BIS-LLM.
DeepSeek V3
This example deploys a pre-quantized ModelOpt checkpoint with no_quant. For higher throughput on DeepSeek V3 family models, use multi-GPU B200 layouts with MTP speculative decoding or disaggregated serving. See Speculative decoding for BIS-LLM and Disaggregated serving.
Hardware selection
GPU recommendations for v2:
- B200: Best for FP4 quantization and next-gen performance
- H100: Best for FP8 quantization and production workloads
- Multi-GPU: Required for large MoE models (>30B parameters)
| Model Size | Recommended GPU | Quantization | Tensor Parallel |
|---|---|---|---|
| <30B MoE | H100:2-4 | FP8 | 2-4 |
| 30-100B MoE | H100:4-8 | FP8 | 4-8 |
| 100B+ MoE | B200:4-8 | FP4 | 4-8 |
| Dense >30B | H100:2-4 | FP8 | 2-4 |
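The table above can be expressed as a lookup function. This is a minimal sketch: `Recommendation` and `recommend` are hypothetical names, and the boundary handling (exactly 30B falls in the 30-100B row, exactly 100B falls in the 100B+ row) is an assumption, since the table does not say which row owns the edges. Dense models at or below 30B are not covered by the table, so the function returns `None` for them.

```python
from typing import NamedTuple, Optional


class Recommendation(NamedTuple):
    gpu: str
    quantization: str
    tensor_parallel: str


def recommend(params_billions: float, is_moe: bool) -> Optional[Recommendation]:
    """Map model size and architecture to the matching row of the hardware table."""
    if is_moe:
        if params_billions < 30:
            return Recommendation("H100:2-4", "FP8", "2-4")
        if params_billions < 100:  # assumed: 30B itself belongs to this row
            return Recommendation("H100:4-8", "FP8", "4-8")
        return Recommendation("B200:4-8", "FP4", "4-8")
    if params_billions > 30:
        return Recommendation("H100:2-4", "FP8", "2-4")
    return None  # dense models of 30B or less are not listed in the table
```

The result is a starting point only; the complete configuration examples on this page show that specific models can run on layouts outside these rows.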
Related
- BIS-LLM overview: Main engine documentation.
- Migrate from Engine-Builder-LLM: Translate a v1 configuration to BIS-LLM (v2).
- Advanced features for BIS-LLM: KV-aware routing, disaggregated serving, and speculative decoding.
- Structured outputs for BIS-LLM: JSON schema validation.
- Model deployment examples: Concrete deployment examples.