Configuration structure
Build configuration
checkpoint_repository
Specifies where to find the model checkpoint. Same structure as V1 but with V2-specific optimizations.
Structure:
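A minimal sketch, assuming the V1-style `source` / `repo` fields carry over (verify the exact field names against the current schema):

```yaml
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      # Assumed V1-style fields: pull the checkpoint from Hugging Face.
      source: HF
      repo: Qwen/Qwen3-30B-A3B-Instruct-2507  # illustrative repo id
```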
quantization_type
Quantization options for V2 inference stack (simplified from V1):
Options:
- `no_quant`: Native precision of the checkpoint repository; this can be fp16 / bf16. Unique to BIS-LLM, we also support quantized checkpoints produced with the nvidia-modelopt library.
- `fp8`: FP8 weights + 16-bit KV cache
- `fp4`: FP4 weights + 16-bit KV cache (B200 only)
- `fp4_mlp_only`: FP4 MLP layers only + 16-bit KV cache
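For example, selecting FP8 weights (a sketch; the nesting follows the `trt_llm.build` path used throughout this section):

```yaml
trt_llm:
  build:
    quantization_type: fp8  # FP8 weights + 16-bit KV cache
```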
quantization_config
Configuration for post-training quantization calibration:
Structure:
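A sketch of what the calibration block could look like; the field names (`calib_dataset`, `calib_size`, `calib_max_seq_length`) are assumptions for illustration, not confirmed schema:

```yaml
trt_llm:
  build:
    quantization_type: fp8
    quantization_config:
      # Hypothetical calibration fields -- check the schema for the exact names.
      calib_dataset: cnn_dailymail   # dataset sampled for PTQ calibration
      calib_size: 512                # number of calibration samples
      calib_max_seq_length: 2048     # max sequence length during calibration
```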
num_builder_gpus
Number of GPUs to use during the build process.
Default: 1 (auto-detected from resources)
Range: 1 to 8
Example:
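An illustrative snippet:

```yaml
trt_llm:
  build:
    num_builder_gpus: 4  # use 4 GPUs for the quantization/build step
```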
skip_build_result
Skip the engine build step and use a pre-built model that does not require any quantization.
Default: false
Use case: When you have a pre-built engine from the model cache
Example:
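An illustrative snippet:

```yaml
trt_llm:
  build:
    skip_build_result: true  # reuse a pre-built engine instead of building one
```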
Engine configuration
max_seq_len
Maximum sequence length (context) for single requests.
Default: 32768 (32K)
Range: 1 to 1048576
Example:
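A sketch under `trt_llm.engine` (value illustrative):

```yaml
trt_llm:
  engine:
    max_seq_len: 131072  # allow contexts up to 128K tokens per request
```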
max_batch_size
Maximum number of input sequences processed concurrently.
Default: 256
Range: 1 to 2048
Example:
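For example:

```yaml
trt_llm:
  engine:
    max_batch_size: 64  # cap the number of concurrently processed sequences
```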
max_num_tokens
Maximum number of batched input tokens after padding removal.
Default: 8192
Range: 64 to 131072
Example:
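For example:

```yaml
trt_llm:
  engine:
    max_num_tokens: 16384  # upper bound on batched tokens per scheduler step
```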
tensor_parallel_size
Number of GPUs to use for tensor parallelism.
Default: 1 (auto-detected from resources)
Range: 1 to 8
Example:
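A sketch, assuming a Baseten-style `resources` block for the GPU count (illustrative):

```yaml
resources:
  accelerator: H100:4  # assumed resource syntax
trt_llm:
  engine:
    tensor_parallel_size: 4  # shard weights across all 4 GPUs
```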
enable_chunked_prefill
Enable chunked prefilling for long sequences.
Default: true
Example:
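For example:

```yaml
trt_llm:
  engine:
    enable_chunked_prefill: true  # split long prompts into chunks during prefill
```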
served_model_name
Model name returned in API responses.
Default: None (uses model name from config)
Example:
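For example (the name is hypothetical):

```yaml
trt_llm:
  engine:
    served_model_name: my-custom-model  # returned in API responses
```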
patch_kwargs
Advanced configuration patches for the V2 inference stack.
Structure:
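A sketch; the keys below are assumptions for illustration (patches are presumably passed through to the underlying engine configuration), not confirmed options:

```yaml
trt_llm:
  engine:
    patch_kwargs:
      # Hypothetical pass-through keys -- consult the V2 schema for what is honored.
      kv_cache_config:
        free_gpu_memory_fraction: 0.85
```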
Complete configuration examples
Qwen3-30B-A3B-Instruct-2507 MoE with FP4 on B200
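A sketch of a complete configuration, assuming a Baseten-style `resources` block and the Hugging Face repo id `Qwen/Qwen3-30B-A3B-Instruct-2507`; engine values are illustrative:

```yaml
resources:
  accelerator: B200
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: Qwen/Qwen3-30B-A3B-Instruct-2507
    quantization_type: fp4        # FP4 weights, B200 only
  engine:
    max_seq_len: 32768
    max_batch_size: 256
    tensor_parallel_size: 1
    enable_chunked_prefill: true
```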
GPT-OSS 120B on B200:1 with no_quant
Note: Our GPT-OSS deployments are much more optimized than this. The example below is functional, but you can squeeze much more performance out of B200, e.g. with Baseten's custom Eagle Heads.
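A sketch, assuming the Hugging Face repo id `openai/gpt-oss-120b`; engine values are illustrative:

```yaml
resources:
  accelerator: B200   # B200:1
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: openai/gpt-oss-120b
    quantization_type: no_quant   # keep the checkpoint's native precision
  engine:
    max_seq_len: 32768
    max_batch_size: 128
    tensor_parallel_size: 1
```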
DeepSeek V3
Note: Our DeepSeek V3 / V3.1 / V3.2 deployments are much more optimized than this. The example below is functional, but you can squeeze much more performance out of B200:4, e.g. with MTP Heads and disaggregated serving, or data-parallel attention.
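A sketch, assuming the Hugging Face repo id `deepseek-ai/DeepSeek-V3`; the quantization choice and engine values follow the hardware table below and are illustrative:

```yaml
resources:
  accelerator: B200:4
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: deepseek-ai/DeepSeek-V3
    quantization_type: fp4        # B200-only option; pick per your precision needs
  engine:
    max_seq_len: 65536
    max_batch_size: 128
    tensor_parallel_size: 4
    enable_chunked_prefill: true
```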
V2 vs V1 configuration differences
Simplified build configuration
V1 build configuration:
Key differences:
- `inference_stack`: Explicitly set to `v2`
- Simplified build options: Many V1 options moved to `engine`
- No `base_model`: Automatically detected from the checkpoint
- No `plugin_configuration`: Handled automatically
- No `speculator`: Lookahead decoding requires FDE involvement
- Tensor parallel: Moved to `engine` as `tensor_parallel_size`
Validation and troubleshooting
Common V2 configuration errors
Error: Field trt_llm.build.base_model is not allowed to be set when using v2 inference stack
- Cause: Setting `base_model` in a V2 configuration
- Fix: Remove the `base_model` field; V2 detects it automatically
Error: Field trt_llm.build.quantization_type is not allowed to be set when using v2 inference stack
- Cause: Using an unsupported quantization type
- Fix: Use a supported quantization type: `no_quant`, `fp8`, `fp4`, `fp4_mlp_only`, `fp4_kv`, `fp8_kv`
Error: Field trt_llm.build.speculator is not allowed to be set when using v2 inference stack
- Cause: Trying to use lookahead decoding in V2
- Fix: Use the V1 stack for lookahead decoding, use V2 without speculation, or reach out to us to enable speculation on V2.
Migration from V1
V1 to V2 migration
V1 configuration:
Migration steps:
- Add `inference_stack: v2`
- Remove `base_model` (auto-detected)
- Move `max_seq_len`, `max_batch_size`, `max_num_tokens` to `engine`
- Change `tensor_parallel_count` to `tensor_parallel_size`
- Remove `plugin_configuration` (handled automatically)
- Update quantization type (V2 has simplified options)
- Remove `speculator` (not supported in V2)
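Putting the steps together, a sketch of what a migrated V2 configuration could look like (model repo and values are illustrative):

```yaml
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: meta-llama/Llama-3.1-8B-Instruct  # hypothetical example model
    quantization_type: fp8
  engine:
    # Moved here from trt_llm.build in V1:
    max_seq_len: 32768
    max_batch_size: 256
    max_num_tokens: 8192
    tensor_parallel_size: 2   # was tensor_parallel_count in V1
```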
Hardware selection
GPU recommendations for V2:
- B200: Best for FP4 quantization and next-gen performance
- H100: Best for FP8 quantization and production workloads
- Multi-GPU: Required for large MoE models (>30B parameters)
| Model Size | Recommended GPU | Quantization | Tensor Parallel |
|---|---|---|---|
| <30B MoE | H100:2-4 | FP8 | 2-4 |
| 30-100B MoE | H100:4-8 | FP8 | 4-8 |
| 100B+ MoE | B200:4-8 | FP4 | 4-8 |
| Dense >30B | H100:2-4 | FP8 | 2-4 |
Further reading
- BIS-LLM overview - Main engine documentation
- Advanced features documentation - Enterprise features and capabilities
- Structured outputs for BIS-LLM - Advanced JSON schema validation
- Examples section - Concrete deployment examples