BIS-LLM (Baseten Inference Stack V2) is Baseten's next-generation engine for Mixture of Experts (MoE) models and advanced text generation use cases. Built on the V2 inference stack, it provides cutting-edge optimizations including KV-aware routing, disaggregated serving, expert parallel load balancing, and DP attention. Note that only a small subset of these features is enabled for self-serve customers; the primary way to deploy these large models is through our Forward Deployed Engineers.

Overview and use cases

BIS-LLM is designed for MoE models and scenarios requiring the most advanced inference optimizations.

Ideal for:

MoE model families:
  • DeepSeek: deepseek-ai/DeepSeek-R1, deepseek-ai/DeepSeek-V3.1, deepseek-ai/DeepSeek-V3.2
  • Qwen MoE: Qwen/Qwen3-30B-A3B, Qwen/Qwen3-Coder-480B-A35B-Instruct
  • Kimi: moonshotai/Kimi-K2-Instruct
  • GLM: zai-org/GLM-4.7
  • Llama 4: meta-llama/llama-4-maverick
  • GPT-OSS: openai/gpt-oss-120b and other OpenAI open-weight variants
Advanced use cases:
  • High-performance inference: FP4 quantization on GB200/B200 GPUs
  • Complex reasoning: Advanced tool calling and structured outputs
  • Large-scale deployments: Multi-node setups and distributed inference

Forward Deployed Engineer Gated Features

Some of the more advanced features are gated behind feature flags that we toggle internally. They are not the easiest to use, and some are mutually exclusive, which makes them hard to document and maintain on this page. These features power some of the largest LLM deployments for the customer logos on our website, as well as a couple of GPU world records. For detailed information on each advanced feature, see Gated Features for BIS-LLM.

Architecture support

MoE model support

BIS-LLM specifically optimizes for Mixture of Experts architectures.

Primary MoE architectures:
  • DeepseekV32ForCausalLM - DeepSeek family
  • Qwen3MoEForCausalLM - Qwen3 MoE family
  • KimiK2ForCausalLM - Kimi K2 family
  • Glm4MoeForCausalLM - GLM MoE variants
  • GPTOSS - OpenAI GPT-OSS variants

Dense model support

While optimized for MoE, BIS-LLM also supports dense models with advanced features.

Benefits for dense models:
  • GB200/B200 optimization: Advanced GPU kernel optimization
  • FP4 quantization: Next-generation quantization support
  • Enhanced memory management: Improved KV cache handling
When to use BIS-LLM for dense models:
  • Models >30B parameters requiring maximum performance
  • Deployments on GB200/B200 GPUs with advanced quantization
  • You have tried V1 and want to compare it against V2
  • You want to try V2 features such as KV-aware routing or disaggregated serving
  • You need speculative decoding on GB200/B200

Advanced quantization

BIS-LLM supports next-generation quantization formats for maximum performance.

Quantization options:
  • no_quant: FP16/BF16 precision, or automatically uses hf_quant_config.json from modelopt if available
  • fp8: FP8 weights + 16-bit KV cache
  • fp4: FP4 weights + 16-bit KV cache
  • fp8_kv: FP8 weights + 8-bit symmetric KV cache
  • fp4_kv: FP4 weights + 8-bit symmetric KV cache
  • fp4_mlp_only: FP4 weights (MLP layers only) + 16-bit KV cache and attention computation
B200 optimization:
  • FP4 kernels: Custom B200 kernels for maximum performance
  • Memory efficiency: 75% memory reduction with FP4; some models such as DeepSeek-V3 are strongly preferred on B200 due to kernel selection
  • Speed improvement: 4x-8x faster inference with minimal accuracy loss
  • Cascading improvements: more free memory and faster inference improve overall system performance, especially under high load
Example:
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen3-30B-A3B"
    quantization_type: fp4  # B200 only
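
For a rough sense of what FP4 saves, here is a back-of-envelope sketch (weight memory only; it ignores KV cache, activations, and runtime overhead, and uses the 30B total parameter count of the Qwen3-30B-A3B example above):

# Approximate weight memory by precision; KV cache and activations are not included.
BITS_PER_PARAM = {"bf16": 16, "fp8": 8, "fp4": 4}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Weight memory in GB for a given parameter count and precision."""
    return num_params * BITS_PER_PARAM[precision] / 8 / 1e9

params = 30e9  # total parameters, experts included (e.g. Qwen/Qwen3-30B-A3B)
for precision in ("bf16", "fp8", "fp4"):
    print(f"{precision}: ~{weight_memory_gb(params, precision):.0f} GB")
# bf16: ~60 GB, fp8: ~30 GB, fp4: ~15 GB, i.e. FP4 weights are ~75% smaller than BF16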

Structured outputs and tool calling

Advanced JSON schema validation and function calling capabilities.

Features:
  • JSON schema validation: Precise structured output generation
  • Function calling: Advanced tool selection and execution
  • Multi-tool support: Complex tool chains and reasoning
  • Schema inheritance: Nested and complex schema support
Example:
from pydantic import BaseModel
from openai import OpenAI
import os

class ResearchResult(BaseModel):
    topic: str
    findings: list[str]
    confidence: float
    sources: list[str]

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

response = client.beta.chat.completions.parse(
    model="not-required",
    messages=[
        {"role": "user", "content": "Analyze the latest AI research papers"}
    ],
    response_format=ResearchResult
)

result = response.choices[0].message.parsed

Configuration examples

Note: The examples below are functional starting points; advanced features change frequently. Please reach out for guidance on the best configuration for a specific or fine-tuned model; we are happy to help.

GPT-OSS 120B deployment

model_name: gpt-oss-120b
resources:
  accelerator: H100:8  # 8 GPUs for the 120B model
  cpu: '8'
  memory: 80Gi
  use_gpu: true
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: "openai/gpt-oss-120b"
      revision: main
      runtime_secret_name: hf_access_token
    quantization_type: fp8
    num_builder_gpus: 8
  runtime:
    max_seq_len: 32768
    max_batch_size: 256
    max_num_tokens: 16384
    tensor_parallel_size: 8
    enable_chunked_prefill: true
    served_model_name: "gpt-oss-120b"

Qwen3-30B-A3B-Instruct-2507 MoE with FP4 quantization

model_name: Qwen3-30B-A3B-Instruct-2507-FP4
resources:
  accelerator: B200:2
  cpu: '4'
  memory: 40Gi
  use_gpu: true
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen3-30B-A3B-Instruct-2507"
      revision: main
    quantization_type: fp4
    num_builder_gpus: 2
  runtime:
    max_seq_len: 65536
    max_batch_size: 128
    max_num_tokens: 8192
    tensor_parallel_size: 2
    enable_chunked_prefill: true
    served_model_name: "Qwen3-30B-A3B-Instruct-2507"

Dense model with BIS-LLM V2

model_name: Llama-3.3-70B-V2
resources:
  accelerator: H100:4
  cpu: '4'
  memory: 40Gi
  use_gpu: true
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: "meta-llama/Llama-3.3-70B-Instruct"
      revision: main
      runtime_secret_name: hf_access_token
    quantization_type: fp8
    num_builder_gpus: 4
  runtime:
    max_seq_len: 131072
    max_batch_size: 256
    max_num_tokens: 8192
    tensor_parallel_size: 4
    enable_chunked_prefill: true
    served_model_name: "Llama-3.3-70B-Instruct"

Integration examples

OpenAI-compatible inference

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

# Standard chat completion 
response = client.chat.completions.create(
    model="not-required",
    messages=[
        {"role": "system", "content": "You are an advanced AI assistant."},
        {"role": "user", "content": "Explain the concept of mixture of experts in AI."}
    ],
    temperature=0.7,
    max_tokens=1000
)

print(response.choices[0].message.content)
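
Because the endpoint is OpenAI-compatible, streaming works through the same client; a minimal sketch, assuming the same deployment URL as above and that the deployment streams tokens:

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

# Stream tokens as they are generated instead of waiting for the full response
stream = client.chat.completions.create(
    model="not-required",
    messages=[
        {"role": "user", "content": "Summarize mixture of experts in two sentences."}
    ],
    stream=True,
    max_tokens=200
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()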

Advanced structured outputs

from pydantic import BaseModel
from openai import OpenAI
import os

class ExpertAnalysis(BaseModel):
    routing_decision: str
    expert_utilization: dict[str, float]
    processing_time: float
    confidence_score: float

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

response = client.beta.chat.completions.parse(
    model="not-required",
    messages=[
        {"role": "user", "content": "Analyze the expert routing for this complex query"}
    ],
    response_format=ExpertAnalysis
)

analysis = response.choices[0].message.parsed
print(f"Routing decision: {analysis.routing_decision}")
print(f"Expert utilization: {analysis.expert_utilization}")

Multi-tool function calling

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "analyze_expert_routing",
            "description": "Analyze expert routing patterns",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "expert_count": {"type": "integer"}
                }
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "optimize_performance",
            "description": "Optimize model performance",
            "parameters": {
                "type": "object",
                "properties": {
                    "target_tps": {"type": "number"},
                    "memory_budget": {"type": "integer"}
                }
            }
        }
    }
]

response = client.chat.completions.create(
    model="not-required",
    messages=[
        {"role": "user", "content": "Analyze and optimize the performance of this MoE model"}
    ],
    tools=tools
)

for tool_call in response.choices[0].message.tool_calls:
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")

Best practices

Hardware selection

GPU recommendations:
  • B200: Best for FP4 quantization and next-gen performance
  • H100: Best for FP8 quantization and production workloads
  • Multi-GPU: Required for large MoE models (>30B parameters)
  • Multi-node: For the largest MoE models and distributed inference (coordinate with Forward Deployed Engineers)
Configuration guidelines:
| Model Size  | Recommended GPU | Quantization | Tensor Parallel |
|-------------|-----------------|--------------|-----------------|
| <30B MoE    | H100:2-4        | FP8          | 2-4             |
| 30-100B MoE | H100:4-8        | FP8          | 4-8             |
| 100B+ MoE   | B200:4-8        | FP4          | 4-8             |
| Dense >30B  | H100:2-4        | FP8          | 2-4             |
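
As an illustration only, the table can be encoded as a small starting-point helper; the thresholds mirror the rows above and are not a substitute for benchmarking your workload:

def recommend_config(total_params_b: float, is_moe: bool) -> dict:
    """Starting point derived from the sizing table above (ranges, not exact values)."""
    if is_moe and total_params_b < 30:
        return {"gpu": "H100:2-4", "quantization": "fp8", "tensor_parallel": "2-4"}
    if is_moe and total_params_b <= 100:
        return {"gpu": "H100:4-8", "quantization": "fp8", "tensor_parallel": "4-8"}
    if is_moe:
        return {"gpu": "B200:4-8", "quantization": "fp4", "tensor_parallel": "4-8"}
    return {"gpu": "H100:2-4", "quantization": "fp8", "tensor_parallel": "2-4"}  # dense >30B

print(recommend_config(671, is_moe=True))   # e.g. deepseek-ai/DeepSeek-R1
print(recommend_config(70, is_moe=False))   # e.g. meta-llama/Llama-3.3-70B-Instruct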

Production best practices

V2 inference stack optimization

Configuration differences from V1

# V2 (recommended for MoE and advanced models)
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: "openai/gpt-oss-120b"
    quantization_type: fp8
  runtime:
    max_seq_len: 32768  # Set in engine for V2
    max_batch_size: 32
    tensor_parallel_size: 8  # Engine configuration

Migration guide

From Engine-Builder-LLM

V1 configuration:
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen3-32B"
    quantization_type: fp8_kv
    tensor_parallel_count: 8
V2 configuration:
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen3-32B"
    quantization_type: fp8_kv
  runtime:
    tensor_parallel_size: 8
    enable_chunked_prefill: true
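
For illustration, the same migration can be expressed as a small transformation over the config dictionaries; this is a sketch of the mapping shown above, not an official migration tool:

import copy

def migrate_v1_to_v2(v1_trt_llm: dict) -> dict:
    """Convert a V1-style trt_llm config dict into the V2 layout shown above."""
    v2 = {"inference_stack": "v2", "build": copy.deepcopy(v1_trt_llm["build"]), "runtime": {}}
    v2["build"].pop("base_model", None)                   # not used by the V2 stack
    tp = v2["build"].pop("tensor_parallel_count", None)   # parallelism moves to runtime
    if tp is not None:
        v2["runtime"]["tensor_parallel_size"] = tp
    v2["runtime"]["enable_chunked_prefill"] = True
    return v2

v1 = {
    "build": {
        "base_model": "decoder",
        "checkpoint_repository": {"source": "HF", "repo": "Qwen/Qwen3-32B"},
        "quantization_type": "fp8_kv",
        "tensor_parallel_count": 8,
    }
}
print(migrate_v1_to_v2(v1))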

Key differences

  1. inference_stack: Explicitly set to v2
  2. Build configuration: Simplified; base_model is no longer required and parallelism is no longer set at build time
  3. Runtime configuration: Parallelism and serving options move to runtime (tensor_parallel_size replaces tensor_parallel_count)
  4. Performance: Better optimization for MoE models

Further reading