BIS-LLM (Baseten Inference Stack V2) is Baseten's next-generation engine for Mixture of Experts (MoE) models and advanced text generation use cases. Built on the V2 inference stack, it provides cutting-edge optimizations including KV-aware routing, disaggregated serving, expert parallel load balancing, and DP attention. Note that only a small subset of these features is enabled for self-serve customers; the primary way to deploy these large models is through our Forward Deployed Engineers.

Overview and use cases

BIS-LLM is designed for MoE models and scenarios requiring the most advanced inference optimizations.

Ideal for:

MoE model families:
  • DeepSeek: deepseek-ai/DeepSeek-R1, deepseek-ai/DeepSeek-V3.1, deepseek-ai/DeepSeek-V3.2
  • Qwen MoE: Qwen/Qwen3-30B-A3B, Qwen/Qwen3-Coder-480B-A35B-Instruct
  • Kimi: moonshotai/Kimi-K2-Instruct
  • GLM: zai-org/GLM-4.7
  • Llama 4: meta-llama/llama-4-maverick
  • GPT-OSS: openai/gpt-oss-120b and other OpenAI open-weight variants
Advanced use cases:
  • High-performance inference: FP4 quantization on GB200/B200 GPUs
  • Complex reasoning: Advanced tool calling and structured outputs
  • Large-scale deployments: Multi-node setups and distributed inference

Forward Deployed Engineer Gated Features

Some of the more advanced features are gated behind feature flags that we toggle internally. They are not the easiest to use, and some are mutually exclusive, which makes them hard to document and maintain on this page. These features power some of the largest LLM deployments for the customer logos on our website, as well as a couple of GPU world records. For detailed information on each advanced feature, see Gated Features for BIS-LLM.

Architecture support

MoE model support

BIS-LLM specifically optimizes for Mixture of Experts architectures.

Primary MoE architectures:
  • DeepseekV32ForCausalLM - DeepSeek family
  • Qwen3MoEForCausalLM - Qwen3 MoE family
  • KimiK2ForCausalLM - Kimi K2 family
  • Glm4MoeForCausalLM - GLM MoE variants
  • GPTOSS - OpenAI GPT-OSS variants

Dense model support

While optimized for MoE, BIS-LLM also supports dense models with advanced features.

Benefits for dense models:
  • GB200/B200 optimization: Advanced GPU kernel optimization
  • FP4 quantization: Next-generation quantization support
  • Enhanced memory management: Improved KV cache handling
When to use BIS-LLM for dense models:
  • Models >30B parameters requiring maximum performance
  • Deployments on GB200/B200 GPUs with advanced quantization
  • You have tried V1 and want to compare it against V2
  • You want to try V2 features such as KV-aware routing or disaggregated serving
  • You need speculative decoding on GB200/B200

Advanced quantization

BIS-LLM supports next-generation quantization formats for maximum performance.

Quantization options:
  • no_quant: FP16/BF16 precision, or automatically uses hf_quant_config.json from modelopt if available
  • fp8: FP8 weights + 16-bit KV cache
  • fp4: FP4 weights + 16-bit KV cache
  • fp8_kv: FP8 weights + 8-bit symmetric KV cache
  • fp4_kv: FP4 weights + 8-bit symmetric KV cache
  • fp4_mlp_only: FP4 weights (MLP layers only) + 16-bit KV cache and attention computation
B200 optimization:
  • FP4 kernels: Custom B200 kernels for maximum performance
  • Memory efficiency: 75% memory reduction with FP4; some models such as DeepSeek-V3 are strongly preferred on B200 due to kernel selection
  • Speed improvement: 4x-8x faster inference with minimal accuracy loss
  • Cascading improvements: more free memory and faster inference improve overall system performance, especially under high load
Example:
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen3-30B-A3B"
    quantization_type: fp4  # B200 only
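
For a rough sense of what FP4 saves, here is a back-of-envelope sketch (weight memory only; it ignores KV cache, activations, and runtime overhead, and uses the 30B total parameter count of the Qwen3-30B-A3B example above):

# Approximate weight memory by precision; KV cache and activations are not included.
BITS_PER_PARAM = {"bf16": 16, "fp8": 8, "fp4": 4}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Weight memory in GB for a given parameter count and precision."""
    return num_params * BITS_PER_PARAM[precision] / 8 / 1e9

params = 30e9  # total parameters, experts included (e.g. Qwen/Qwen3-30B-A3B)
for precision in ("bf16", "fp8", "fp4"):
    print(f"{precision}: ~{weight_memory_gb(params, precision):.0f} GB")
# bf16: ~60 GB, fp8: ~30 GB, fp4: ~15 GB, i.e. FP4 weights are ~75% smaller than BF16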

Structured outputs and tool calling

Advanced JSON schema validation and function calling capabilities.

Features:
  • JSON schema validation: Precise structured output generation
  • Function calling: Advanced tool selection and execution
  • Multi-tool support: Complex tool chains and reasoning
  • Schema inheritance: Nested and complex schema support
Example:
from pydantic import BaseModel
from openai import OpenAI
import os

class ResearchResult(BaseModel):
    topic: str
    findings: list[str]
    confidence: float
    sources: list[str]

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

response = client.beta.chat.completions.parse(
    model="not-required",
    messages=[
        {"role": "user", "content": "Analyze the latest AI research papers"}
    ],
    response_format=ResearchResult
)

result = response.choices[0].message.parsed

Configuration examples

Note: The examples below are functional starting points; advanced features change frequently. Please reach out for guidance on the best configuration for a specific or fine-tuned model; we are happy to help.

GPT-OSS 120B deployment

model_name: gpt-oss-120b
resources:
  accelerator: H100:8  # 8 GPUs for the 120B model
  cpu: '8'
  memory: 80Gi
  use_gpu: true
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: "openai/gpt-oss-120b"
      revision: main
      runtime_secret_name: hf_access_token
    quantization_type: fp8
    num_builder_gpus: 8
  runtime:
    max_seq_len: 32768
    max_batch_size: 256
    max_num_tokens: 16384
    tensor_parallel_size: 8
    enable_chunked_prefill: true
    served_model_name: "gpt-oss-120b"

Qwen3-30B-A3B-Instruct-2507 MoE with FP4 quantization

model_name: Qwen3-30B-A3B-Instruct-2507-FP4
resources:
  accelerator: B200:2
  cpu: '4'
  memory: 40Gi
  use_gpu: true
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen3-30B-A3B-Instruct-2507"
      revision: main
    quantization_type: fp4
    num_builder_gpus: 2
  runtime:
    max_seq_len: 65536
    max_batch_size: 128
    max_num_tokens: 8192
    tensor_parallel_size: 2
    enable_chunked_prefill: true
    served_model_name: "Qwen3-30B-A3B-Instruct-2507"

Dense model with BIS-LLM V2

model_name: Llama-3.3-70B-V2
resources:
  accelerator: H100:4
  cpu: '4'
  memory: 40Gi
  use_gpu: true
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: "meta-llama/Llama-3.3-70B-Instruct"
      revision: main
      runtime_secret_name: hf_access_token
    quantization_type: fp8
    num_builder_gpus: 4
  runtime:
    max_seq_len: 131072
    max_batch_size: 256
    max_num_tokens: 8192
    tensor_parallel_size: 4
    enable_chunked_prefill: true
    served_model_name: "Llama-3.3-70B-Instruct"

Integration examples

OpenAI-compatible inference

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

# Standard chat completion 
response = client.chat.completions.create(
    model="not-required",
    messages=[
        {"role": "system", "content": "You are an advanced AI assistant."},
        {"role": "user", "content": "Explain the concept of mixture of experts in AI."}
    ],
    temperature=0.7,
    max_tokens=1000
)

print(response.choices[0].message.content)
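
Because the endpoint is OpenAI-compatible, streaming works through the same client; a minimal sketch, assuming the same deployment URL as above and that the deployment streams tokens:

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

# Stream tokens as they are generated instead of waiting for the full response
stream = client.chat.completions.create(
    model="not-required",
    messages=[
        {"role": "user", "content": "Summarize mixture of experts in two sentences."}
    ],
    stream=True,
    max_tokens=200
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()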

Advanced structured outputs

from pydantic import BaseModel
from openai import OpenAI
import os

class ExpertAnalysis(BaseModel):
    routing_decision: str
    expert_utilization: dict[str, float]
    processing_time: float
    confidence_score: float

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

response = client.beta.chat.completions.parse(
    model="not-required",
    messages=[
        {"role": "user", "content": "Analyze the expert routing for this complex query"}
    ],
    response_format=ExpertAnalysis
)

analysis = response.choices[0].message.parsed
print(f"Routing decision: {analysis.routing_decision}")
print(f"Expert utilization: {analysis.expert_utilization}")

Multi-tool function calling

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "analyze_expert_routing",
            "description": "Analyze expert routing patterns",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "expert_count": {"type": "integer"}
                }
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "optimize_performance",
            "description": "Optimize model performance",
            "parameters": {
                "type": "object",
                "properties": {
                    "target_tps": {"type": "number"},
                    "memory_budget": {"type": "integer"}
                }
            }
        }
    }
]

response = client.chat.completions.create(
    model="not-required",
    messages=[
        {"role": "user", "content": "Analyze and optimize the performance of this MoE model"}
    ],
    tools=tools
)

for tool_call in response.choices[0].message.tool_calls:
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")

Best practices

Hardware selection

GPU recommendations:
  • B200: Best for FP4 quantization and next-gen performance
  • H100: Best for FP8 quantization and production workloads
  • Multi-GPU: Required for large MoE models (>30B parameters)
  • Multi-node: For the largest MoE models and distributed inference (coordinate with Forward Deployed Engineers)
Configuration guidelines:
| Model Size  | Recommended GPU | Quantization | Tensor Parallel |
|-------------|-----------------|--------------|-----------------|
| <30B MoE    | H100:2-4        | FP8          | 2-4             |
| 30-100B MoE | H100:4-8        | FP8          | 4-8             |
| 100B+ MoE   | B200:4-8        | FP4          | 4-8             |
| Dense >30B  | H100:2-4        | FP8          | 2-4             |
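
As an illustration only, the table can be encoded as a small starting-point helper; the thresholds mirror the rows above and are not a substitute for benchmarking your workload:

def recommend_config(total_params_b: float, is_moe: bool) -> dict:
    """Starting point derived from the sizing table above (ranges, not exact values)."""
    if is_moe and total_params_b < 30:
        return {"gpu": "H100:2-4", "quantization": "fp8", "tensor_parallel": "2-4"}
    if is_moe and total_params_b <= 100:
        return {"gpu": "H100:4-8", "quantization": "fp8", "tensor_parallel": "4-8"}
    if is_moe:
        return {"gpu": "B200:4-8", "quantization": "fp4", "tensor_parallel": "4-8"}
    return {"gpu": "H100:2-4", "quantization": "fp8", "tensor_parallel": "2-4"}  # dense >30B

print(recommend_config(671, is_moe=True))   # e.g. deepseek-ai/DeepSeek-R1
print(recommend_config(70, is_moe=False))   # e.g. meta-llama/Llama-3.3-70B-Instruct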

Production best practices

V2 inference stack optimization

Configuration differences from V1

# V2 (recommended for MoE and advanced models)
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: "openai/gpt-oss-120b"
    quantization_type: fp8
  runtime:
    max_seq_len: 32768  # Set in engine for V2
    max_batch_size: 32
    tensor_parallel_size: 8  # Engine configuration

Migration guide

From Engine-Builder-LLM

V1 configuration:
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen3-32B"
    quantization_type: fp8_kv
    tensor_parallel_count: 8
V2 configuration:
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen3-32B"
    quantization_type: fp8_kv
  runtime:
    tensor_parallel_size: 8
    enable_chunked_prefill: true
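
For illustration, the same migration can be expressed as a small transformation over the config dictionaries; this is a sketch of the mapping shown above, not an official migration tool:

import copy

def migrate_v1_to_v2(v1_trt_llm: dict) -> dict:
    """Convert a V1-style trt_llm config dict into the V2 layout shown above."""
    v2 = {"inference_stack": "v2", "build": copy.deepcopy(v1_trt_llm["build"]), "runtime": {}}
    v2["build"].pop("base_model", None)                   # not used by the V2 stack
    tp = v2["build"].pop("tensor_parallel_count", None)   # parallelism moves to runtime
    if tp is not None:
        v2["runtime"]["tensor_parallel_size"] = tp
    v2["runtime"]["enable_chunked_prefill"] = True
    return v2

v1 = {
    "build": {
        "base_model": "decoder",
        "checkpoint_repository": {"source": "HF", "repo": "Qwen/Qwen3-32B"},
        "quantization_type": "fp8_kv",
        "tensor_parallel_count": 8,
    }
}
print(migrate_v1_to_v2(v1))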

Key differences

  1. inference_stack: Explicitly set to v2
  2. Build configuration: Simplified; base_model is no longer required and parallelism is no longer set at build time
  3. Runtime configuration: Parallelism and serving options move to runtime (tensor_parallel_size replaces tensor_parallel_count)
  4. Performance: Better optimization for MoE models

Further reading