# Overview

> Next-generation engine for MoE models with advanced optimizations

BIS-LLM (Baseten Inference Stack V2) is Baseten's next-generation engine for Mixture of Experts (MoE) models and advanced text generation use cases. Built on the V2 inference stack, it provides optimizations including KV-aware routing, disaggregated serving, expert-parallel load balancing, and data-parallel (DP) attention.

Before you continue reading: only a small subset of these features is enabled for customers directly. The primary way to deploy these large models is through Forward Deployed Engineers.

BIS-LLM deployments mirror build artifacts to the [Baseten Delivery Network](/development/model/bdn) automatically. No extra configuration is required.

## Overview and use cases

BIS-LLM is designed for MoE models and scenarios requiring the most advanced inference optimizations.

### Ideal for:

**MoE model families:**

* **DeepSeek**: `deepseek-ai/DeepSeek-R1`, `deepseek-ai/DeepSeek-V3.1`, `deepseek-ai/DeepSeek-V3.2`
* **Qwen MoE**: `Qwen/Qwen3-30B-A3B`, `Qwen/Qwen3-Coder-480B-A35B-Instruct`
* **Kimi**: `moonshotai/Kimi-K2-Instruct`
* **GLM**: `zai-org/GLM-4.7`
* **Llama 4**: `meta-llama/llama-4-maverick`
* **GPT-OSS**: Various open-source GPT variants

**Advanced use cases:**

* **High-performance inference**: FP4 quantization on GB200/B200 GPUs
* **Complex reasoning**: Advanced tool calling and structured outputs
* **Large-scale deployments**: Multi-node setups and distributed inference

## Forward deployed engineer gated features

Some of the more advanced features are gated behind feature flags that we toggle internally.
They are not the easiest to use, and some are mutually exclusive, which makes them hard to document and maintain on this page.

These gated features power some of the largest LLM deployments for the customer logos on our website and a couple of [world records on GPUs](https://www.baseten.co/blog/how-we-made-the-fastest-gpt-oss-on-nvidia-gpus-60-percent-faster/).

For detailed information on each advanced feature, see [Gated Features for BIS-LLM](/engines/bis-llm/advanced-features).

## Architecture support

### MoE model support

BIS-LLM specifically optimizes for Mixture of Experts architectures:

**Primary MoE architectures:**

* `DeepseekV32ForCausalLM` - DeepSeek family
* `Qwen3MoEForCausalLM` - Qwen3 MoE family
* `KimiK2ForCausalLM` - Kimi K2 family
* `Glm4MoeForCausalLM` - GLM MoE variants
* `GPTOSS` - OpenAI GPT-OSS variants
* ...

### Dense model support

While optimized for MoE, BIS-LLM also supports dense models with advanced features:

**Benefits for dense models:**

* **GB200/B200 optimization**: Advanced GPU kernel optimization
* **FP4 quantization**: Next-generation quantization support
* **Enhanced memory management**: Improved KV cache handling

**When to use BIS-LLM for dense models:**

* Models >30B parameters requiring maximum performance
* Deployments on GB200/B200 GPUs with advanced quantization
* Comparing an existing V1 deployment against V2
* Trying V2 features such as KV routing or disaggregated serving
* Speculative decoding on GB200/B200

### Advanced quantization

BIS-LLM supports next-generation quantization formats for maximum performance:

**Quantization options:**

* `no_quant`: FP16/BF16 precision, or automatically uses `hf_quant_config.json` from modelopt if available
* `fp8`: FP8 weights + 16-bit KV cache
* `fp4`: FP4 weights + 16-bit KV cache
* `fp8_kv`: FP8 weights + 8-bit symmetric KV cache
* `fp4_kv`: FP4 weights + 8-bit symmetric KV cache
* `fp4_mlp_only`: FP4 weights (MLP layers only) + 16-bit KV cache and attention computation

**B200 optimization:**

* **FP4 kernels**: Custom B200 kernels for maximum performance
* **Memory efficiency**: 75% weight-memory reduction with FP4 relative to 16-bit weights. Some models, such as DeepSeek V3, are strongly preferred on B200 due to kernel selection.
* **Speed improvement**: 4x-8x faster inference with minimal accuracy loss
* **Cascaded improvements**: More memory and faster inference leading to improved system performance, especially under high load.
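The 75% figure follows from bit widths alone: FP4 stores 4 bits per weight, against 16 bits for FP16/BF16. The sketch below is a back-of-the-envelope estimate that counts weights only and ignores quantization scale factors, the KV cache, and activations. The helper names and the 30B-parameter example are illustrative assumptions, not part of BIS-LLM.

```python
# Approximate bits per weight for each precision (scale factors ignored).
BITS_PER_WEIGHT = {"fp16": 16, "fp8": 8, "fp4": 4}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Rough weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e9


def reduction_vs_fp16(precision: str) -> float:
    """Fractional weight-memory saving relative to 16-bit weights."""
    return 1 - BITS_PER_WEIGHT[precision] / BITS_PER_WEIGHT["fp16"]


params = 30e9  # illustrative 30B-parameter model
for precision in ("fp16", "fp8", "fp4"):
    size = weight_memory_gb(params, precision)
    saved = reduction_vs_fp16(precision)
    print(f"{precision}: {size:.0f} GB of weights, {saved:.0%} saved vs 16-bit")
```

Real deployments need additional headroom for the KV cache and activations, so treat these numbers as a lower bound when sizing GPUs.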

**Example:**

```yaml theme={"system"}
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen3-30B-A3B"
    quantization_type: fp4  # B200 only
```

### Structured outputs and tool calling

Advanced JSON schema validation and function calling capabilities:

**Features:**

* **JSON schema validation**: Precise structured output generation
* **Function calling**: Advanced tool selection and execution
* **Multi-tool support**: Complex tool chains and reasoning
* **Schema inheritance**: Nested and complex schema support

**Example:**

```python theme={"system"}
import os

from pydantic import BaseModel
from openai import OpenAI

class ResearchResult(BaseModel):
    topic: str
    findings: list[str]
    confidence: float
    sources: list[str]

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

response = client.beta.chat.completions.parse(
    model="not-required",
    messages=[
        {"role": "user", "content": "Analyze the latest AI research papers"}
    ],
    response_format=ResearchResult
)

result = response.choices[0].message.parsed
```

## Configuration examples

**Note**: The examples below are functional starting points only. Advanced features change frequently. Please reach out to learn how best to configure a specific or fine-tuned model; we are happy to help.

### GPT-OSS 120B deployment

```yaml theme={"system"}
model_name: gpt-oss-120b
resources:
  accelerator: H100:8  # 8 GPUs for the large 120B MoE model
  cpu: '8'
  memory: 80Gi
  use_gpu: true
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: "openai/gpt-oss-120b"
      revision: main
      runtime_secret_name: hf_access_token
    # GPT-OSS ships in MXFP4, which is supported on H100.
    # Selecting `no_quant` applies no additional quantization.
    # MXFP4 and modelopt-style NVFP4 checkpoints are supported out of the box.
    quantization_type: no_quant
    num_builder_gpus: 8
  runtime:
    max_seq_len: 32768
    max_batch_size: 256
    max_num_tokens: 16384
    tensor_parallel_size: 8
    enable_chunked_prefill: true
    served_model_name: "gpt-oss-120b"
```

### Qwen3-30B-A3B-Instruct-2507 MoE with FP4 quantization

```yaml theme={"system"}
model_name: Qwen3-30B-A3B-Instruct-2507-FP4
resources:
  accelerator: B200:2
  cpu: '4'
  memory: 40Gi
  use_gpu: true
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen3-30B-A3B-Instruct-2507"
      revision: main
    quantization_type: fp4
    num_builder_gpus: 2
  runtime:
    max_seq_len: 65536
    max_batch_size: 128
    max_num_tokens: 8192
    tensor_parallel_size: 2
    enable_chunked_prefill: true
    served_model_name: "Qwen3-30B-A3B-Instruct-2507"
```

### Dense model with BIS-LLM V2

```yaml theme={"system"}
model_name: Llama-3.3-70B-V2
resources:
  accelerator: H100:4
  cpu: '4'
  memory: 40Gi
  use_gpu: true
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: "meta-llama/Llama-3.3-70B-Instruct"
      revision: main
      runtime_secret_name: hf_access_token
    quantization_type: fp8
    num_builder_gpus: 4
  runtime:
    max_seq_len: 131072
    max_batch_size: 256
    max_num_tokens: 8192
    tensor_parallel_size: 4
    enable_chunked_prefill: true
    served_model_name: "Llama-3.3-70B-Instruct"
```
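In all three examples above, the GPU count in `resources.accelerator`, `num_builder_gpus`, and `tensor_parallel_size` is the same number. This page does not say whether BIS-LLM enforces that, so the following is a hypothetical pre-deployment lint rather than a platform rule. It operates on a config that has already been parsed into a Python dict; the function names are illustrative assumptions, as is treating a bare accelerator name as one GPU.

```python
def parse_accelerator(spec: str) -> tuple[str, int]:
    """Split an accelerator string such as 'H100:8' into ('H100', 8).

    Assumption: a bare name like 'H100' means a single GPU.
    """
    name, _, count = spec.partition(":")
    return name, int(count) if count else 1


def check_gpu_counts(config: dict) -> list[str]:
    """Return human-readable mismatches between GPU-related fields."""
    _, gpus = parse_accelerator(config["resources"]["accelerator"])
    trt = config["trt_llm"]
    problems = []
    builder = trt.get("build", {}).get("num_builder_gpus")
    if builder is not None and builder != gpus:
        problems.append(f"num_builder_gpus={builder} but accelerator has {gpus} GPUs")
    tp = trt.get("runtime", {}).get("tensor_parallel_size")
    if tp is not None and tp != gpus:
        problems.append(f"tensor_parallel_size={tp} but accelerator has {gpus} GPUs")
    return problems


# Trimmed-down version of the Qwen3 FP4 example above.
config = {
    "resources": {"accelerator": "B200:2"},
    "trt_llm": {
        "inference_stack": "v2",
        "build": {"quantization_type": "fp4", "num_builder_gpus": 2},
        "runtime": {"tensor_parallel_size": 2},
    },
}
print(check_gpu_counts(config))  # an empty list means the counts agree
```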

## Integration examples

### OpenAI-compatible inference

```python theme={"system"}
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

# Standard chat completion 
response = client.chat.completions.create(
    model="not-required",
    messages=[
        {"role": "system", "content": "You are an advanced AI assistant."},
        {"role": "user", "content": "Explain the concept of mixture of experts in AI."}
    ],
    temperature=0.7,
    max_tokens=1000
)

print(response.choices[0].message.content)
```

### Advanced structured outputs

```python theme={"system"}
import os

from pydantic import BaseModel
from openai import OpenAI

class ExpertAnalysis(BaseModel):
    routing_decision: str
    expert_utilization: dict[str, float]
    processing_time: float
    confidence_score: float

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

response = client.beta.chat.completions.parse(
    model="not-required",
    messages=[
        {"role": "user", "content": "Analyze the expert routing for this complex query"}
    ],
    response_format=ExpertAnalysis
)

analysis = response.choices[0].message.parsed
print(f"Routing decision: {analysis.routing_decision}")
print(f"Expert utilization: {analysis.expert_utilization}")
```

### Multi-tool function calling

```python theme={"system"}
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "analyze_expert_routing",
            "description": "Analyze expert routing patterns",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "expert_count": {"type": "integer"}
                }
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "optimize_performance",
            "description": "Optimize model performance",
            "parameters": {
                "type": "object",
                "properties": {
                    "target_tps": {"type": "number"},
                    "memory_budget": {"type": "integer"}
                }
            }
        }
    }
]

response = client.chat.completions.create(
    model="not-required",
    messages=[
        {"role": "user", "content": "Analyze and optimize the performance of this MoE model"}
    ],
    tools=tools
)

# tool_calls is None when the model answers without calling a tool.
for tool_call in response.choices[0].message.tool_calls or []:
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
```
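`tool_call.function.arguments` arrives as a JSON-encoded string, so it has to be decoded before any local code can run. The sketch below shows one way to dispatch on the tool name. The two handler bodies are placeholders invented for illustration, and the `(name, arguments)` pair stands in for the fields of a real tool call.

```python
import json


# Hypothetical local implementations of the two tools declared above.
def analyze_expert_routing(query: str, expert_count: int = 8) -> dict:
    return {"query": query, "expert_count": expert_count}


def optimize_performance(target_tps: float, memory_budget: int = 0) -> dict:
    return {"target_tps": target_tps, "memory_budget": memory_budget}


HANDLERS = {
    "analyze_expert_routing": analyze_expert_routing,
    "optimize_performance": optimize_performance,
}


def dispatch(name: str, arguments: str) -> dict:
    """Decode a tool call's JSON arguments and run the matching handler."""
    handler = HANDLERS.get(name)
    if handler is None:
        raise ValueError(f"Model requested unknown tool: {name}")
    return handler(**json.loads(arguments))


# With a real response, pass tool_call.function.name and
# tool_call.function.arguments for each entry in message.tool_calls.
result = dispatch("optimize_performance", '{"target_tps": 120.5, "memory_budget": 64}')
print(result)
```

Rejecting unknown tool names explicitly keeps a malformed model response from silently doing nothing.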

## Best practices

### Hardware selection

**GPU recommendations:**

* **B200**: Best for FP4 quantization and next-gen performance
* **H100**: Best for FP8 quantization and production workloads
* **Multi-GPU**: Required for large MoE models (>30B parameters)
* **Multi-Node**:

**Configuration guidelines:**

| **Model Size** | **Recommended GPU** | **Quantization** | **Tensor Parallel** |
| -------------- | ------------------- | ---------------- | ------------------- |
| `<30B` MoE     | H100:2-4            | FP8              | 2-4                 |
| 30-100B MoE    | H100:4-8            | FP8              | 4-8                 |
| 100B+ MoE      | B200:4-8            | FP4              | 4-8                 |
| Dense >30B     | H100:2-4            | FP8              | 2-4                 |

## Production best practices

### V2 inference stack optimization

#### Configuration differences from V1

```yaml theme={"system"}
# V2 (recommended for MoE and advanced models)
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: "openai/gpt-oss-120b"
    quantization_type: fp8
  runtime:
    max_seq_len: 32768  # Set under `runtime` in V2
    max_batch_size: 32
    tensor_parallel_size: 8  # Set under `runtime` in V2 (V1 uses `tensor_parallel_count` under `build`)
```

## Migration guide

### From Engine-Builder-LLM

**V1 configuration:**

```yaml theme={"system"}
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen3-32B"
    quantization_type: fp8_kv
    tensor_parallel_count: 8
```

**V2 configuration:**

```yaml theme={"system"}
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen3-32B"
    quantization_type: fp8_kv
  runtime:
    tensor_parallel_size: 8
    enable_chunked_prefill: true
```

### Key differences

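1. **`inference_stack`**: Explicitly set to `v2`.
2. **Build configuration**: Simplified with fewer options; the V2 example no longer sets `base_model`.
3. **Runtime configuration**: Tensor parallelism moves from `build.tensor_parallel_count` to `runtime.tensor_parallel_size`, alongside runtime options such as `enable_chunked_prefill`.
4. **Performance**: Better optimization for MoE models.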
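For the fields shown in the two configurations above, the migration is mechanical. The sketch below applies it to a config parsed into a Python dict. It covers only the keys that appear in those examples; the function name is an assumption, and any other V1 build options are copied through unchanged and may need manual review against the V2 reference.

```python
import copy


def migrate_v1_to_v2(v1: dict) -> dict:
    """Translate the V1 keys shown above into their V2 layout."""
    build = copy.deepcopy(v1["trt_llm"]["build"])
    build.pop("base_model", None)  # absent from the V2 example
    runtime = {}
    tp = build.pop("tensor_parallel_count", None)
    if tp is not None:
        runtime["tensor_parallel_size"] = tp  # moved from build to runtime
    return {"trt_llm": {"inference_stack": "v2", "build": build, "runtime": runtime}}


v1_config = {
    "trt_llm": {
        "build": {
            "base_model": "decoder",
            "checkpoint_repository": {"source": "HF", "repo": "Qwen/Qwen3-32B"},
            "quantization_type": "fp8_kv",
            "tensor_parallel_count": 8,
        }
    }
}
print(migrate_v1_to_v2(v1_config))
```

The helper deliberately does not add `enable_chunked_prefill`; whether to enable it is a per-deployment decision rather than part of the key mapping.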

## Related

* [BIS-LLM reference config](/engines/bis-llm/bis-llm-config) - Complete V2 configuration options.
* [Advanced features documentation](/engines/bis-llm/advanced-features) - Enterprise features and capabilities.
* [Structured outputs](/inference/structured-outputs) - Advanced JSON schema validation.
* [Examples section](/examples/overview) - Concrete deployment examples.
