Engine-Builder-LLM is the v1 inference stack. BIS-LLM is the v2 stack. The two share much of the same trt_llm schema but differ in what counts as build configuration, what counts as runtime configuration, and how autoscaling, speculation, and routing work. This page covers the field-by-field translation and the semantic changes that aren’t just renames.
The shape change
v2 simplifies the build: section to five fields (checkpoint_repository, quantization_type, quantization_config, num_builder_gpus, skip_build_result) and moves everything else to runtime:. The build validator rejects v1-only fields under build: with explicit error messages.
v1 (Engine-Builder-LLM):
config.yaml
config.yaml
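As a rough illustration of the shape change, the sketch below shows a hypothetical v1 config.yaml followed by a v2 counterpart. The two are separate files; the `---` separator only places them side by side here. The top-level field names come from this page, but everything else is assumed for illustration: the model repository and all numeric values, the `source` and `repo` sub-keys of checkpoint_repository, the `fp8` quantization value, and the placement of enable_chunked_context under runtime: in v1. Other top-level keys a real config.yaml would carry are omitted.

```yaml
# Hypothetical v1 (Engine-Builder-LLM) config.yaml. All values are illustrative.
trt_llm:
  build:
    base_model: llama                    # removed in v2 (auto-detected)
    checkpoint_repository:
      source: HF                         # assumed sub-key
      repo: example-org/example-model    # placeholder repository
    quantization_type: fp8               # assumed value
    max_seq_len: 8192                    # moves to runtime: in v2
    max_batch_size: 64                   # moves to runtime: in v2
    max_num_tokens: 8192                 # moves to runtime: in v2
    tensor_parallel_count: 2             # renamed and moved in v2
    plugin_configuration:                # removed in v2 (handled automatically)
      paged_kv_cache: true
      use_paged_context_fmha: true
  runtime:
    enable_chunked_context: true         # assumed location; renamed in v2
---
# Hypothetical v2 (BIS-LLM) config.yaml for the same deployment.
trt_llm:
  inference_stack: v2                    # declared first under trt_llm:
  build:
    checkpoint_repository:
      source: HF
      repo: example-org/example-model
    quantization_type: fp8
  runtime:
    max_seq_len: 8192
    max_batch_size: 64
    max_num_tokens: 8192
    tensor_parallel_size: 2
    enable_chunked_prefill: true
```

In the v2 file, build: keeps only fields from the five-field list, and every sizing and parallelism setting sits under runtime:.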
Migration steps
Apply these seven changes to translate the v1 build configuration to v2. The order matters only for step 1 (the inference stack declaration must come first); the rest are independent.
1. Add inference_stack: v2 at the top of trt_llm:.
2. Remove base_model. v2 detects the architecture from the checkpoint automatically.
3. Move max_seq_len, max_batch_size, and max_num_tokens from build: to runtime:.
4. Rename tensor_parallel_count to tensor_parallel_size and move it to runtime:.
5. Remove plugin_configuration. v2 handles paged_kv_cache, use_paged_context_fmha, and use_fp8_context_fmha automatically.
6. Remove speculator. v1 lookahead decoding is not supported in v2; see Speculative decoding moves to the Management API below.
7. Replace enable_chunked_context: true with enable_chunked_prefill: true if it was set.
Semantic changes (not just renames)
The field translation above keeps your deployment running, but four behaviors change in ways that affect how you should configure and operate the v2 deployment.
Speculative decoding moves to the Management API
v1 lookahead decoding lives in config.yaml under trt_llm.build.speculator. v2 doesn’t support lookahead. Instead, BIS-LLM offers Eagle, MTP, and N-gram speculative decoding through the Management API speculative_config block, not through config.yaml. See Speculative decoding for the configuration shape. Eagle and MTP require Enterprise; contact your Baseten representative to enable.
Autoscaling switches to token-based
v1 deployments use Baseten’s standard request-concurrency autoscaler: replicas scale based on concurrency_target and target_utilization_percentage. v2 deployments use token-based autoscaling instead and scale on target_in_flight_tokens. The v2 deployment API rejects concurrency_target and target_utilization_percentage. To convert your v1 concurrency target to a token target, multiply it by the average number of tokens per request (input plus output).
For a model averaging 4K input and 1K output tokens at a v1 concurrency_target of 10, the v2 token target is roughly 50,000.
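The arithmetic behind that figure can be written out explicitly. The relation below is inferred from the worked example rather than quoted from a reference formula, and the symbols $T_{\text{target}}$ (target_in_flight_tokens), $C$ (the v1 concurrency_target), $\bar{n}_{\text{in}}$, and $\bar{n}_{\text{out}}$ (average input and output tokens per request) are labels chosen here for illustration. It takes 4K and 1K as 4,000 and 1,000:

```latex
\[
T_{\text{target}} \;\approx\; C \times \left(\bar{n}_{\text{in}} + \bar{n}_{\text{out}}\right)
\;=\; 10 \times (4{,}000 + 1{,}000) \;=\; 50{,}000
\]
```

Treat the result as a starting point to refine against observed in-flight token counts, not as an exact equivalence between the two autoscalers.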
KV-aware routing becomes available
v1 has no equivalent. Workloads with prefix-overlapping requests (long shared system prompts, multi-turn conversations, agentic loops) can enable KV-aware routing on the v2 deployment to substantially reduce time-to-first-token through cache reuse. KV-aware routing requires Enterprise.
Disaggregated serving becomes available
v1 has no equivalent. Workloads with high prefill-to-decode imbalance (long-context inference, mixed-length traffic) can use disaggregated serving to split prefill and decode onto independent replica groups. Disaggregated serving requires Enterprise.
Validation errors you might see
The v2 build validator rejects v1-only fields with explicit errors. The most common during migration:

| Error | Cause | Fix |
|---|---|---|
| Field trt_llm.build.base_model is not allowed to be set when using v2 inference stack | base_model left in build: | Remove. v2 auto-detects from the checkpoint. |
| Field trt_llm.build.<field> is not allowed to be set when using v2 inference stack | v1 build fields (max_seq_len, max_batch_size, max_num_tokens, tensor_parallel_count, plugin_configuration) still in build: | Move max_seq_len, max_batch_size, and max_num_tokens to runtime:. Rename tensor_parallel_count to tensor_parallel_size and move it to runtime:. Remove plugin_configuration. |
| Field trt_llm.build.speculator is not allowed to be set when using v2 inference stack | speculator block kept from v1 | Remove. Use the Management API speculative_config block instead. |
After migrating
Watch these metrics during and after the cutover:
- tps_per_request and concurrent_requests should stay similar or improve.
- autoscaler_in_flight_tokens is the new load signal. Tune target_in_flight_tokens based on observed values; aim for the 50,000-150,000 starting range.
- speculation_rate is available once Eagle or MTP is configured through the Management API.
Related
- BIS-LLM overview: Main engine documentation.
- BIS-LLM configuration: Complete v2 YAML reference.
- Engine-Builder-LLM configuration: v1 reference for comparison.
- Token-based autoscaling: v2 autoscaling configuration.
- Speculative decoding: v2 speculative decoding (Eagle, MTP, N-gram).