BIS-LLM (Baseten Inference Stack v2) is the engine for Mixture of Experts (MoE) models and large dense LLMs. It targets MoE families (DeepSeek V3.x, Qwen3MoE, Kimi-K2, Llama 4, GLM-4.7, GPT-OSS 120B) and the largest dense models, where the standard request-based autoscaler and a single-server inference engine both leave performance on the table. The v2 stack adds token-based autoscaling, KV-aware routing, disaggregated serving, expert parallel load balancing, and DP attention. Deployments mirror build artifacts to the Baseten Delivery Network so cold starts stay fast.

Production features

BIS-LLM ships four features that the standard inference path doesn’t include. Token-based autoscaling lives on the Autoscaling engines page; the other three are documented together in Advanced features for BIS-LLM.

Token-based autoscaling

Scales replicas on target_in_flight_tokens rather than request concurrency, so mixed-length prompt workloads scale on real compute load.

KV-aware routing

Routes each request to the worker most likely to serve it from KV cache, which lowers time-to-first-token on traffic with overlapping prefixes.

Disaggregated serving

Splits prefill and decode onto independent worker groups that scale separately.

Speculative decoding

Supports Eagle, MTP, and N-gram speculation, which produce multiple tokens per forward pass on supported architectures.
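To make the difference between request-based and token-based scaling concrete, here is a minimal sketch. It assumes the replica count is the load divided by the per-replica target, rounded up; this page does not describe the autoscaler's actual decision logic, and the target values below are hypothetical.

```python
import math

def replicas_by_requests(prompt_token_counts, target_concurrency):
    """Request-based sizing: every in-flight request counts as one unit of load."""
    return max(1, math.ceil(len(prompt_token_counts) / target_concurrency))

def replicas_by_tokens(prompt_token_counts, target_in_flight_tokens):
    """Token-based sizing: load is the total number of in-flight tokens."""
    return max(1, math.ceil(sum(prompt_token_counts) / target_in_flight_tokens))

# Two workloads with the same request count but very different compute load.
short_prompts = [200] * 32    # 6,400 in-flight tokens
long_prompts = [8000] * 32    # 256,000 in-flight tokens

for name, workload in [("short", short_prompts), ("long", long_prompts)]:
    by_requests = replicas_by_requests(workload, target_concurrency=16)            # hypothetical target
    by_tokens = replicas_by_tokens(workload, target_in_flight_tokens=64_000)       # hypothetical target
    print(name, by_requests, by_tokens)
# prints:
# short 2 1
# long 2 4
```

Under these assumptions, request-based sizing asks for the same two replicas in both cases, while token-based sizing tracks the 40x difference in prompt length.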

A canonical configuration

The trt_llm block in config.yaml configures both the build and the runtime. A pre-quantized DeepSeek V3.1 deployment on four B200 GPUs looks like:
config.yaml
model_name: deepseek-v3-1-nvfp4
resources:
  accelerator: B200:4
  use_gpu: true
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: "nvidia/DeepSeek-V3.1-NVFP4"
      runtime_secret_name: hf_access_token
    quantization_type: no_quant  # checkpoint is already ModelOpt-quantized, so the build applies no further quantization
  runtime:
    max_seq_len: 131072
    max_batch_size: 256
    tensor_parallel_size: 4
    enable_chunked_prefill: true
    served_model_name: "deepseek-v3"
After truss push, the build compiles the engine, the BDN mirrors weights to GPU-local storage, and the deployment exposes OpenAI-compatible /v1/chat/completions. The four production features above each plug in through their own configuration blocks; see BIS-LLM configuration for the complete reference and additional examples (GPT-OSS 120B, Qwen3-MoE, Llama 3.3 70B). For tuning advice on a specific or fine-tuned model, contact your Baseten representative.
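Before running truss push, it can help to sanity-check a config like the one above. The sketch below works on the config as a plain Python dict (load it with any YAML parser) and flags two things: a missing inference_stack: v2, and a tensor_parallel_size that differs from the GPU count in resources.accelerator. The second check is an assumption drawn from the example, where both are 4; this page does not state that BIS-LLM requires them to match. Both helper names are hypothetical.

```python
def parse_accelerator(spec):
    """Split 'B200:4' into ('B200', 4); a bare 'B200' is treated as one GPU."""
    name, _, count = spec.partition(":")
    return name, int(count) if count else 1

def check_config(config):
    """Return a list of human-readable warnings; an empty list means no findings."""
    warnings = []
    trt_llm = config.get("trt_llm", {})
    if trt_llm.get("inference_stack") != "v2":
        warnings.append("trt_llm.inference_stack is not 'v2'")
    _, gpu_count = parse_accelerator(config["resources"]["accelerator"])
    tp_size = trt_llm.get("runtime", {}).get("tensor_parallel_size")
    # Assumption: in the documented example the two values are equal.
    if tp_size is not None and tp_size != gpu_count:
        warnings.append(
            f"tensor_parallel_size={tp_size} differs from accelerator count={gpu_count}"
        )
    return warnings

config = {
    "resources": {"accelerator": "B200:4", "use_gpu": True},
    "trt_llm": {
        "inference_stack": "v2",
        "runtime": {"max_seq_len": 131072, "tensor_parallel_size": 4},
    },
}
print(check_config(config))  # prints: []
```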

OpenAI-compatible inference

BIS-LLM deployments expose /v1/chat/completions, /v1/completions, and /v1/embeddings (where applicable). Standard OpenAI client SDKs work without modification:
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

response = client.chat.completions.create(
    model="not-required",
    messages=[{"role": "user", "content": "Explain mixture of experts in two sentences."}],
)
# Print the assistant's reply from the first choice.
print(response.choices[0].message.content)
Structured outputs and function calling are supported through the standard OpenAI parameters and have their own reference pages.
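For long generations, clients often stream the response instead of waiting for the full completion. The sketch below assumes the deployment honors the standard OpenAI stream=True parameter, which this page does not explicitly confirm. build_base_url and collect_stream are hypothetical helpers; the URL pattern is taken from the example above.

```python
def build_base_url(model_id, environment="production"):
    """Build the OpenAI-compatible base URL for a deployment (pattern from the example above)."""
    return f"https://model-{model_id}.api.baseten.co/environments/{environment}/sync/v1"

def collect_stream(chunks):
    """Concatenate the text deltas from an OpenAI-style chat completion stream."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices (for example, usage-only chunks)
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts)

# Usage against a live deployment (requires the openai package and a real model ID):
#
#   from openai import OpenAI
#   import os
#
#   client = OpenAI(
#       api_key=os.environ["BASETEN_API_KEY"],
#       base_url=build_base_url("xxxxxx"),
#   )
#   stream = client.chat.completions.create(
#       model="not-required",
#       messages=[{"role": "user", "content": "Explain mixture of experts in two sentences."}],
#       stream=True,
#   )
#   print(collect_stream(stream))
```

Keeping the chunk handling in its own function means it can be exercised with fake chunks, without network access.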

Observability

BIS-LLM emits metrics from three components. Each has its own dashboard section:
| Domain | Metric prefix | Page |
| --- | --- | --- |
| Autoscaler decisions | autoscaler_* | Autoscaling engines |
| Router and KV cache | kv_cache_* | KV-aware routing |
| Engine and request | engine-level metrics below | This page |
Engine-level metrics, available on every BIS-LLM deployment:
| Metric | What it measures |
| --- | --- |
| tps_per_request | Tokens per second per request. |
| input_tokens / output_tokens | Total token throughput across the deployment. |
| input_tokens_per_request / output_tokens_per_request | Per-request token averages. |
| concurrent_requests | Currently in-flight request count. |
| speculation_rate | Draft-token acceptance rate when speculative decoding is active. High rates indicate the draft model is well-aligned. |
| cpu_usage / memory_usage / gpu_usage / gpu_memory_usage | Resource utilization per replica. |
| replica_count_by_status | Replica counts grouped by lifecycle status. |
Start with tps_per_request to confirm replicas handle load as expected. If you run Enterprise features, add kv_cache_hit_rate (KV-aware routing, in the router domain) or speculation_rate (Eagle/MTP) next. See Advanced features for BIS-LLM for speculative-decoding configuration that produces speculation_rate.
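To build intuition for the two metrics named above, here is a rough sketch of how they could be derived from per-request samples. The definitions are assumptions for illustration: tps_per_request is taken as output tokens divided by request duration, averaged over requests, and speculation_rate as accepted draft tokens divided by proposed draft tokens. This page does not specify the aggregation the platform actually uses, and RequestSample is a hypothetical type.

```python
from dataclasses import dataclass

@dataclass
class RequestSample:
    output_tokens: int
    duration_s: float
    draft_tokens_proposed: int = 0
    draft_tokens_accepted: int = 0

def tps_per_request(samples):
    """Mean of per-request output tokens per second (assumed definition)."""
    rates = [s.output_tokens / s.duration_s for s in samples if s.duration_s > 0]
    return sum(rates) / len(rates) if rates else 0.0

def speculation_rate(samples):
    """Accepted draft tokens over proposed draft tokens; None when nothing was drafted."""
    proposed = sum(s.draft_tokens_proposed for s in samples)
    accepted = sum(s.draft_tokens_accepted for s in samples)
    return accepted / proposed if proposed else None

samples = [
    RequestSample(output_tokens=500, duration_s=10.0, draft_tokens_proposed=400, draft_tokens_accepted=300),
    RequestSample(output_tokens=300, duration_s=5.0, draft_tokens_proposed=200, draft_tokens_accepted=150),
]
print(tps_per_request(samples))   # prints: 55.0
print(speculation_rate(samples))  # prints: 0.75
```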

Migrating from Engine-Builder-LLM

Engine-Builder-LLM is the v1 stack. Migrating to BIS-LLM is mostly moving runtime fields out of build:, renaming tensor_parallel_count to tensor_parallel_size, and removing fields v2 handles automatically (plugin_configuration, base_model). Autoscaling, speculation, and routing also change in ways that aren’t just renames. See Migrate from Engine-Builder-LLM for the field-by-field mapping, the semantic changes, and the validation errors you might see during cutover.
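The mechanical part of the migration can be sketched as a dict transformation. This is an illustration under assumptions, not the official mapping: the set of fields treated as runtime fields is taken from the v2 example on this page, the v1 input is a made-up config, and the autoscaling, speculation, and routing changes are not covered. Use the migration guide for the authoritative field-by-field mapping.

```python
import copy

# Assumption: these are the fields that belong under runtime: in v2 (from the example on this page).
RUNTIME_FIELDS = (
    "max_seq_len",
    "max_batch_size",
    "tensor_parallel_size",
    "enable_chunked_prefill",
    "served_model_name",
)
# Fields the text says v2 handles automatically.
REMOVED_FIELDS = ("plugin_configuration", "base_model")

def migrate_trt_llm(v1):
    """Return a v2-style trt_llm dict built from a v1-style one, leaving the input untouched."""
    build = copy.deepcopy(v1.get("build", {}))
    runtime = copy.deepcopy(v1.get("runtime", {}))
    # Rename before moving, so the renamed field is picked up as a runtime field.
    if "tensor_parallel_count" in build:
        build["tensor_parallel_size"] = build.pop("tensor_parallel_count")
    for field in REMOVED_FIELDS:
        build.pop(field, None)
    for field in RUNTIME_FIELDS:
        if field in build:
            runtime[field] = build.pop(field)
    return {"inference_stack": "v2", "build": build, "runtime": runtime}

# Hypothetical v1 config, for illustration only.
v1 = {
    "build": {
        "base_model": "llama",
        "checkpoint_repository": {"source": "HF", "repo": "example-org/example-model"},
        "max_seq_len": 8192,
        "tensor_parallel_count": 2,
        "plugin_configuration": {"placeholder": True},
    }
}
v2 = migrate_trt_llm(v1)
print(v2["runtime"])  # prints: {'max_seq_len': 8192, 'tensor_parallel_size': 2}
```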