
# Overview

> Dense LLM text generation with lookahead decoding and structured outputs

Engine-Builder-LLM optimizes dense text generation models with [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), delivering up to 4000 tokens/second for code generation with [lookahead decoding](/engines/engine-builder-llm/lookahead-decoding). The engine supports [structured outputs](/inference/structured-outputs) for JSON schema validation.

Engine-Builder-LLM deployments mirror build artifacts to the [Baseten Delivery Network](/development/model/bdn) automatically. No extra configuration is required.

## Use cases

**Model families:**

* **Llama**: `meta-llama/Llama-3.3-70B-Instruct`, `meta-llama/Llama-3.2-3B-Instruct`.
* **Qwen**: `Qwen/Qwen2.5-72B-Instruct`, `Qwen/Qwen3-8B`, `Qwen/QwQ-32B-Preview`.
* **Mistral**: `mistralai/Mistral-7B-Instruct-v0.3`, `mistralai/Mistral-Small-24B-Instruct`.
* **DeepSeek**: `deepseek-ai/DeepSeek-R1-Distill-Llama-70B`.
* **Gemma 3**: `google/gemma-3-27b-it`, `google/gemma-3-12b-it`.
* **Microsoft**: `microsoft/Phi-4`.

Engine-Builder-LLM handles high-throughput dialogue systems, coding assistants with lookahead decoding, and content generation with structured outputs. The engine's speculative decoding accelerates code generation by 2-4x, making it ideal for coding agents and JSON-heavy workloads.

### LoRA support

Engine-Builder-LLM supports [multi-LoRA](/engines/engine-builder-llm/lora-support) deployments with engine adapter switching:

<CardGroup cols={2}>
  <Card title="Multi-LoRA" href="/engines/engine-builder-llm/lora-support" icon="layers" iconType="duotone">
    Multiple adapters, engine switching, parameter-efficient fine-tuning
  </Card>

  <Card title="Quick start" href="/engines/engine-builder-llm/lora-support" icon="rocket-launch" iconType="duotone">
    Deploy LoRA adapters in minutes
  </Card>
</CardGroup>

### Structured outputs

Engine-Builder-LLM supports OpenAI-compatible structured outputs with JSON schema validation:

<CardGroup cols={2}>
  <Card title="Features" href="/inference/structured-outputs#engine-builder-llm" icon="check-circle" iconType="duotone">
    Full OpenAI compatibility, JSON schema validation, complex nested schemas
  </Card>

  <Card title="Quick start" href="/inference/structured-outputs" icon="rocket-launch" iconType="duotone">
    Get started with structured outputs in minutes
  </Card>
</CardGroup>

### Key benefits

<CardGroup cols={2}>
  <Card title="Low latency" icon="lightning-bolt" iconType="duotone">
    TensorRT-LLM compilation optimizes time-to-first-token.
  </Card>

  <Card title="High throughput" icon="rocket-launch" iconType="duotone">
    Batching and kernel optimization maximize tokens per second.
  </Card>

  <Card title="Lookahead decoding" icon="eye" iconType="duotone">
    Speculative decoding accelerates coding agents and predictable content.
  </Card>

  <Card title="Structured outputs" icon="shapes" iconType="duotone">
    JSON schema validation for controlled text generation.
  </Card>
</CardGroup>

## Architecture support

### Supported model types

Engine-Builder-LLM supports all causal language model architectures that end with `ForCausalLM`:

**Primary architectures:**

* `LlamaForCausalLM`: Llama family models.
* `Qwen2ForCausalLM`: Qwen family models.
* `MistralForCausalLM`: Mistral family models.
* `Gemma2ForCausalLM`: Gemma family models.
* `Phi3ForCausalLM`: Phi family models.

**Automatic detection:**

The engine automatically detects the model architecture from the checkpoint repository and applies appropriate optimizations.
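
As a rough illustration of the `ForCausalLM` rule, the check can be approximated by reading the `architectures` list from a checkpoint's Hugging Face `config.json`. The helper name `is_supported_architecture` is hypothetical, and this sketch is not the engine's actual detection logic.

```python
import json

def is_supported_architecture(config_json: str) -> bool:
    """Return True if any declared architecture ends with `ForCausalLM`."""
    config = json.loads(config_json)
    architectures = config.get("architectures") or []
    return any(name.endswith("ForCausalLM") for name in architectures)

# Minimal stand-ins for the contents of a checkpoint's config.json
llama_config = '{"architectures": ["LlamaForCausalLM"]}'
encoder_config = '{"architectures": ["BertModel"]}'

print(is_supported_architecture(llama_config))    # causal LM checkpoint
print(is_supported_architecture(encoder_config))  # encoder-only checkpoint
```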

### Model size support

| **Model Size** | **Single GPU** | **Tensor Parallel** | **Recommended GPU** |
| -------------- | -------------- | ------------------- | ------------------- |
| `<8B`          | L4, A10G, H100 | N/A                 | L4 (cost-effective) |
| 8B-70B         | H100           | TP1-TP2             | H100 (2 GPUs)       |
| 70B+           | H100 / B200    | TP4+                | H100 (4+ GPUs)      |
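
The table can be read as a simple lookup. The sketch below encodes it in a hypothetical helper, `recommend_gpu`. Treating exactly 70B as the `70B+` row is an assumption, based on the Llama 3.3 70B example on this page, which uses four H100s.

```python
def recommend_gpu(params_billions: float) -> dict:
    """Map a parameter count (in billions) to a row of the sizing table."""
    if params_billions < 8:
        return {"tensor_parallel": "N/A", "recommended": "L4 (cost-effective)"}
    if params_billions < 70:
        return {"tensor_parallel": "TP1-TP2", "recommended": "H100 (2 GPUs)"}
    # Assumption: 70B itself is sized like the 70B+ row.
    return {"tensor_parallel": "TP4+", "recommended": "H100 (4+ GPUs)"}

for size in (3, 32, 70):
    print(size, recommend_gpu(size))
```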

## Advanced features

### Lookahead decoding

Lookahead decoding accelerates inference for code generation, JSON output, and templated content by speculating on future tokens using n-gram patterns.

**Best for:**

* **Code generation**: Highly predictable patterns in code.
* **Structured content**: Reliable JSON, YAML, XML generation.
* **Mathematical expressions**: Predictable mathematical notation.
* **Template completion**: Filling in predictable templates.

Enable lookahead decoding by adding a `speculator` section:

```yaml theme={"system"}
trt_llm:
  build:
    speculator:
      speculative_decoding_mode: LOOKAHEAD_DECODING
      lookahead_windows_size: 1
      lookahead_ngram_size: 8
      lookahead_verification_set_size: 1
      enable_b10_lookahead: true
```

**Performance impact:**

* **Speed improvement**: Up to 2x faster for code and structured content.
* **Prompt lookup**: Up to 10x faster for prompt-lookup workloads like code apply, reaching 4000 tokens/s per request on Qwen-3-8B with a single H100.
* **Optimal batch size**: Fewer than 32 requests per batch for best performance.
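
To build intuition for why repetitive text benefits, here is a toy prompt-lookup-style sketch of n-gram speculation. It is a simplified illustration, not the engine's algorithm: `propose_draft` is a hypothetical helper that finds the most recent earlier occurrence of the trailing n-gram and proposes the tokens that followed it as a draft. In a real speculative decoder, the model then verifies the draft and keeps only the tokens it agrees with.

```python
def propose_draft(tokens: list[str], ngram_size: int = 2, max_draft: int = 4) -> list[str]:
    """Propose draft tokens by matching the trailing n-gram earlier in the sequence."""
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan backwards for the most recent earlier occurrence of the trailing n-gram.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow = start + ngram_size
            return tokens[follow:follow + max_draft]
    return []

# Code repeats itself: the call `add(` reuses the pattern from the definition.
context = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "add", "("]
print(propose_draft(context))
```

The more often the output repeats earlier n-grams (code, JSON keys, templates), the more draft tokens are likely to be accepted per verification step.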

### Structured outputs

Generate text that conforms to JSON schemas for reliable data extraction and controlled generation.

**Use cases:**

* **Data extraction**: Extract structured information from unstructured text.
* **API response generation**: Generate JSON responses for APIs.
* **Configuration generation**: Create structured configuration files.
* **Content validation**: Ensure generated content meets specific criteria.

Structured outputs work out of the box with no extra configuration. Define a Pydantic schema and pass it as `response_format`:

```python theme={"system"}
import os
from pydantic import BaseModel
from openai import OpenAI

class User(BaseModel):
    name: str
    age: int
    email: str

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

response = client.beta.chat.completions.parse(
    model="not-required",
    messages=[
        {"role": "user", "content": "Extract user info from: John is 25 years old and his email is john@example.com"}
    ],
    response_format=User
)

user = response.choices[0].message.parsed
print(f"Name: {user.name}, Age: {user.age}, Email: {user.email}")
```

### Quantization options

Engine-Builder-LLM supports multiple [quantization](/engines/performance-concepts/quantization-guide) formats for different performance and accuracy trade-offs.

**Quantization types:**

* `no_quant`: `FP16`/`BF16` precision (baseline).
* `fp8`: `FP8` weights + 16-bit KV cache (2x speedup).
* `fp8_kv`: `FP8` weights + `FP8` KV cache (2.5x speedup).
* `fp4`: `FP4` weights + 16-bit KV cache (4x speedup, B200 only).
* `fp4_kv`: `FP4` weights + `FP8` KV cache (4.5x speedup, B200 only).
* `fp4_mlp_only`: `FP4` MLP only + 16-bit KV (3x speedup, B200 only).

**Hardware requirements:**

Hardware requirements vary by quantization type.

| **Quantization**                | **Minimum GPU**      | **Memory reduction** | **Speed improvement** |
| ------------------------------- | -------------------- | -------------------- | --------------------- |
| `no_quant`                      | A100                 | None                 | Baseline              |
| `fp8`                           | L4, H100, H200, B200 | 50%                  | 2x                    |
| `fp8_kv`                        | L4, H100, H200, B200 | 60%                  | 2.5x                  |
| `fp4`, `fp4_kv`, `fp4_mlp_only` | B200 only            | 75%                  | 3-4.5x                |
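
The memory-reduction column follows from bytes per weight. As a back-of-the-envelope sketch (weights only, ignoring KV cache, activations, and runtime overhead), assume 2 bytes per parameter at 16-bit precision, 1 byte at `FP8`, and 0.5 bytes at `FP4`. The helper name `weight_memory_gb` is hypothetical.

```python
# Approximate bytes stored per parameter for each weight format.
BYTES_PER_PARAM = {"no_quant": 2.0, "fp8": 1.0, "fp4": 0.5}

def weight_memory_gb(params_billions: float, quantization: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[quantization]

for quant in BYTES_PER_PARAM:
    print(f"{quant}: {weight_memory_gb(70, quant):.0f} GB")
# no_quant: 140 GB
# fp8: 70 GB
# fp4: 35 GB
```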

## Configuration examples

### Basic Llama 3.3 70B deployment

Llama 3.3 70B on four H100 GPUs with `FP8` weights and `FP8` KV cache (`fp8_kv`):

```yaml theme={"system"}
model_name: Llama-3.3-70B-Instruct
resources:
  accelerator: H100:4  # 4 GPUs for 70B model
  cpu: '4'
  memory: 40Gi
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "meta-llama/Llama-3.3-70B-Instruct"
      revision: main
      runtime_secret_name: hf_access_token
    max_seq_len: 131072
    max_batch_size: 256
    max_num_tokens: 8192
    quantization_type: fp8_kv
    tensor_parallel_count: 4
    plugin_configuration:
      paged_kv_cache: true
      use_paged_context_fmha: true
      use_fp8_context_fmha: true
    quantization_config:
      calib_size: 1024
      calib_dataset: "abisee/cnn_dailymail"
      calib_max_seq_length: 2048
  runtime:
    kv_cache_free_gpu_mem_fraction: 0.9
    enable_chunked_context: true
    batch_scheduler_policy: guaranteed_no_evict
    served_model_name: "Llama-3.3-70B-Instruct"
```
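
In the configuration above, the GPU count in `accelerator` (`H100:4`) matches `tensor_parallel_count` (4), and the other examples on this page follow the same pattern. Assuming that is the intended invariant, a small pre-deployment check could look like the sketch below. The helper names are hypothetical, and the config is represented as a plain dictionary rather than parsed from YAML.

```python
def gpu_count(accelerator: str) -> int:
    """Parse 'H100:4' into 4; a bare name such as 'L4' means one GPU."""
    _, _, count = accelerator.partition(":")
    return int(count) if count else 1

def check_tensor_parallel(config: dict) -> bool:
    """Return True when the accelerator GPU count equals tensor_parallel_count."""
    accelerator = config["resources"]["accelerator"]
    tp = config["trt_llm"]["build"].get("tensor_parallel_count", 1)
    return gpu_count(accelerator) == tp

config = {
    "resources": {"accelerator": "H100:4"},
    "trt_llm": {"build": {"tensor_parallel_count": 4}},
}
print(check_tensor_parallel(config))
```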

### Qwen 2.5 32B with lookahead decoding

Qwen 2.5 Coder 32B with lookahead decoding, a form of *speculative decoding*, for faster inference. Read more in the [lookahead decoding guide](/engines/engine-builder-llm/lookahead-decoding).

```yaml theme={"system"}
model_name: Qwen-2.5-32B-Lookahead
resources:
  accelerator: H100:1
  cpu: '2'
  memory: 20Gi
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen2.5-Coder-32B-Instruct"
      revision: main
    max_seq_len: 32768
    max_batch_size: 128
    max_num_tokens: 8192
    quantization_type: fp8 # no fp8_kv for qwen2.5 models
    tensor_parallel_count: 1
    num_builder_gpus: 2 # weights load in bf16 for quantization, which requires 2x32 GB of memory, so build on 2 H100s
    speculator:
      speculative_decoding_mode: LOOKAHEAD_DECODING
      lookahead_windows_size: 3
      lookahead_ngram_size: 8
      lookahead_verification_set_size: 3
      enable_b10_lookahead: true
    plugin_configuration:
      paged_kv_cache: true
      use_paged_context_fmha: true
      use_fp8_context_fmha: false # FP8 context FMHA pairs with an FP8 KV cache (fp8_kv); this config uses fp8
  runtime:
    kv_cache_free_gpu_mem_fraction: 0.85
    enable_chunked_context: true
    batch_scheduler_policy: guaranteed_no_evict
    served_model_name: "Qwen-2.5-Coder-32B-Instruct"
```

### Small model for cost-effective deployment

Llama 3.2 3B on an L4 GPU for cost efficiency:

```yaml theme={"system"}
model_name: Llama-3.2-3B-Instruct
resources:
  accelerator: L4
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "meta-llama/Llama-3.2-3B-Instruct"
      revision: main
    max_seq_len: 8192
    max_batch_size: 256
    max_num_tokens: 4096
    quantization_type: fp8
    tensor_parallel_count: 1
    plugin_configuration:
      paged_kv_cache: true
      use_paged_context_fmha: true
      use_fp8_context_fmha: false
  runtime:
    kv_cache_free_gpu_mem_fraction: 0.9
    enable_chunked_context: true
    batch_scheduler_policy: guaranteed_no_evict
    served_model_name: "Llama-3.2-3B-Instruct"
```

## Performance characteristics

### Latency and throughput factors

Performance depends on model size (smaller models respond faster), quantization (`FP8`/`FP4` reduces memory and improves throughput), lookahead decoding (effective for code and structured content), batch size (larger batches improve throughput at the cost of latency), and hardware (H100 and B200 GPUs deliver the best results).

### Memory usage considerations

**Memory optimization factors:**

* **Quantization**: `FP8` reduces memory by \~50%, `FP4` by \~75%.
* **Lookahead decoding**: Minimal additional memory overhead.
* **Tensor parallelism**: Distributes memory across multiple GPUs.
* **KV cache management**: Configurable memory allocation for context handling.
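
To make the KV cache and tensor parallelism points concrete, here is a rough per-token estimate. The model shape (80 layers, 8 KV heads, head dimension 128) is an assumption for a Llama-3-70B-class model, and the calculation ignores paging overhead and other runtime allocations.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Assumed shape for a Llama-3-70B-class model.
per_token_16bit = kv_cache_bytes_per_token(80, 8, 128, 2)  # 16-bit KV cache
per_token_fp8 = kv_cache_bytes_per_token(80, 8, 128, 1)    # FP8 KV cache

# One full-length sequence, with the KV cache split across 4 GPUs.
seq_len = 131072
tensor_parallel = 4
per_gpu_gib = per_token_fp8 * seq_len / tensor_parallel / 2**30

print(per_token_16bit, per_token_fp8, per_gpu_gib)
# 327680 163840 5.0
```

Under these assumptions, an `FP8` KV cache halves the per-token footprint, and tensor parallelism divides what remains across GPUs.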

## Integration examples

### OpenAI-compatible inference

Engine-Builder-LLM deployments are OpenAI compatible, enabling use of the standard OpenAI SDK.

```python theme={"system"}
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

# Standard chat completion
response = client.chat.completions.create(
    model="not-required",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)

# Streaming completion
for chunk in client.chat.completions.create(
    model="not-required",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="")
```

Point `base_url` to your model's production endpoint. Find this URL in your Baseten dashboard after deployment. The `model` parameter can be any string since Baseten routes based on the URL, not this field. Set `stream=True` to receive tokens as they're generated.

Running this returns a chat completion response with the model's answer in `response.choices[0].message.content`, or streams chunks with partial content in `delta.content`.
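
If you need the full text of a streamed response rather than printing chunks as they arrive, the pieces can be joined as in the sketch below. `collect_stream` and `fake_chunk` are hypothetical helpers. The fake chunks only mimic the `choices[0].delta.content` shape so the example runs without a deployed model.

```python
from types import SimpleNamespace
from typing import Iterable

def collect_stream(chunks: Iterable) -> str:
    """Join the `delta.content` pieces of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks can arrive without choices
            continue
        parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts)

def fake_chunk(text):
    """Build an object shaped like a streaming chunk (for illustration only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

demo = [fake_chunk("Hello"), fake_chunk(None), fake_chunk(", world")]
print(collect_stream(demo))
```

With a real deployment, pass the iterator returned by `client.chat.completions.create(..., stream=True)` instead of `demo`.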

### Performance Client usage

For high-throughput batch processing, use the [Performance Client](/inference/performance-client), which handles concurrent requests efficiently.

```python theme={"system"}
import os

from baseten_performance_client import PerformanceClient

client = PerformanceClient(
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync",
    api_key=os.environ['BASETEN_API_KEY']
)

# Batch chat completions with stream=False
payloads = [
    {
        "model": "model",
        "messages": [{"role": "user", "content": "Explain quantum computing"}],
        "stream": False,
        "max_tokens": 500
    },
    {
        "model": "model", 
        "messages": [{"role": "user", "content": "Write a poem about AI"}],
        "stream": False,
        "max_tokens": 300
    }
] * 10  # 20 total requests

response = client.batch_post(
    url_path="/v1/chat/completions",
    payloads=payloads,
)

# Access 20 responses
for i, resp in enumerate(response.data):
    print(f"Response {i+1}: {resp['choices'][0]['message']['content']}")
```

**Use cases:** Bulk content generation, batch data processing, and performance benchmarking. The client releases the Python GIL while requests are in flight.

### Structured outputs

*Structured outputs* guarantee the response matches your Pydantic schema.

```python theme={"system"}
import os
from pydantic import BaseModel
from openai import OpenAI

class Task(BaseModel):
    title: str
    priority: str
    due_date: str
    description: str

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

response = client.beta.chat.completions.parse(
    model="not-required",
    messages=[
        {"role": "user", "content": "Create a task for: Review the quarterly report by next Friday"}
    ],
    response_format=Task
)

task = response.choices[0].message.parsed
print(f"Task: {task.title}")
print(f"Priority: {task.priority}")
```

Define your schema as a Pydantic model with typed fields. Pass it to `response_format` and use `beta.chat.completions.parse` instead of the regular `create` method.

The response includes a `parsed` attribute with your data already converted to a `Task` object, so no JSON parsing is needed.
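
For clients that cannot use the Pydantic helper, the same schema can in principle be written by hand in the OpenAI `json_schema` response format. The sketch below assumes the deployment accepts that format as part of its OpenAI compatibility. `missing_fields` is a hypothetical helper for checking a raw JSON response, and `sample` is an illustrative string, not real model output.

```python
import json

# Hand-written JSON schema equivalent to the Task model above.
task_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "priority": {"type": "string"},
        "due_date": {"type": "string"},
        "description": {"type": "string"},
    },
    "required": ["title", "priority", "due_date", "description"],
    "additionalProperties": False,
}

# OpenAI-style response_format payload (assumed to be accepted by the endpoint).
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "Task", "schema": task_schema, "strict": True},
}

def missing_fields(raw_json: str) -> list[str]:
    """Return required fields absent from a raw JSON response body."""
    data = json.loads(raw_json)
    return [field for field in task_schema["required"] if field not in data]

sample = ('{"title": "Review quarterly report", "priority": "high", '
          '"due_date": "next Friday", "description": "Review the report"}')
print(missing_fields(sample))
```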

### Function calling

*Function calling* lets the model invoke your functions with structured arguments. Define available tools, and the model returns function calls when appropriate.

```python theme={"system"}
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City name, e.g., San Francisco"
                }
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="not-required",
    messages=[{"role": "user", "content": "What's the weather like in Boston?"}],
    tools=tools
)

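tool_calls = response.choices[0].message.tool_calls
if tool_calls:  # tool_calls is None when the model answers in plain text
    tool_call = tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
else:
    print(response.choices[0].message.content)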
```

Define each tool with a `name`, `description`, and JSON schema for `parameters`. The description helps the model decide when to use the tool.

When the model chooses to call a function, `tool_calls` contains the function name and JSON-encoded arguments. Your code executes the function and optionally sends the result back for a final response.
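
The execute-and-reply step can be sketched as follows. The local `get_weather` implementation, the `TOOL_REGISTRY` mapping, and `run_tool_call` are all hypothetical, and the fake tool call only mimics the shape of an SDK tool call so the example runs offline.

```python
import json
from types import SimpleNamespace

def get_weather(location: str) -> dict:
    """Hypothetical local implementation; a real one would call a weather API."""
    return {"location": location, "forecast": "sunny"}

# Map tool names from the `tools` definition to local Python functions.
TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_call(tool_call) -> dict:
    """Execute one tool call and build the `tool` message to send back."""
    function = TOOL_REGISTRY[tool_call.function.name]
    arguments = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string
    result = function(**arguments)
    return {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result)}

# Stand-in for response.choices[0].message.tool_calls[0]
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Boston"}'),
)
tool_message = run_tool_call(fake_call)
print(tool_message)
```

To get a final answer, append the assistant message that contained the tool call and this `tool` message to `messages`, then call `client.chat.completions.create` again.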

## Best practices

### Model selection

**For cost-effective deployments:**

* Use models under 8B parameters on L4, H100\_40GB, or H100 GPUs.
* Consider quantization for memory efficiency.
* Implement autoscaling for variable traffic.

**For high-performance deployments:**

* Use H100 GPUs with `FP8` quantization.
* Enable lookahead decoding for code generation.
* Use tensor parallelism for large models.

**For coding assistants:**

* Use models trained on code (Qwen-Coder, CodeLlama).
* Enable lookahead decoding with window size 1 for maximum throughput.
* Consider smaller models for faster response times.

### Hardware optimization

**GPU selection:**

* **L4 or H100\_40GB**: Best for models under 15B parameters, cost-effective.
* **H100 / H100\_80GB**: Recommended for models with 15-70B parameters, high performance.
* **B200**: Required for `FP4` quantization.

**Memory optimization:**

* Use quantization to reduce memory usage.
* Lower `max_seq_len` or enable chunked context (`enable_chunked_context`).
* Monitor memory usage during deployment.

### Performance tuning

**For lowest latency:**

* Use smaller models when possible.
* Enable lookahead decoding for code generation.

**For highest throughput:**

* Use larger batch sizes.
* Enable `FP8`/`FP4` quantization.
* Use tensor parallelism for large models.

**For cost efficiency:**

* Use L4 GPUs with quantization.
* Implement efficient autoscaling.
* Choose appropriately sized models.

## Migration guide

### From other deployment systems

Coming from vLLM? Here's how the configuration maps:

```yaml theme={"system"}
# vLLM configuration (old)
model: "meta-llama/Llama-3.3-70B-Instruct"
tensor_parallel_size: 4
quantization: "fp8"

# Engine-Builder-LLM configuration (new)
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "meta-llama/Llama-3.3-70B-Instruct"
    quantization_type: fp8_kv
    tensor_parallel_count: 4
```
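
If you have several vLLM configurations to convert, the mapping above can be scripted. This is a sketch under stated assumptions: `vllm_to_engine_builder` is a hypothetical helper, only the `fp8` to `fp8_kv` mapping comes from the example above, and any other quantization value is rejected rather than guessed.

```python
# Only this mapping is taken from the example above; extend it deliberately.
QUANTIZATION_MAP = {"fp8": "fp8_kv"}

def vllm_to_engine_builder(vllm_config: dict) -> dict:
    """Translate a minimal vLLM-style config dict into a trt_llm build section."""
    quantization = vllm_config.get("quantization")
    if quantization is None:
        quantization_type = "no_quant"
    elif quantization in QUANTIZATION_MAP:
        quantization_type = QUANTIZATION_MAP[quantization]
    else:
        raise ValueError(f"No known mapping for quantization {quantization!r}")

    build = {
        "base_model": "decoder",
        "checkpoint_repository": {"source": "HF", "repo": vllm_config["model"]},
        "quantization_type": quantization_type,
        "tensor_parallel_count": vllm_config.get("tensor_parallel_size", 1),
    }
    return {"trt_llm": {"build": build}}

converted = vllm_to_engine_builder({
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "tensor_parallel_size": 4,
    "quantization": "fp8",
})
print(converted)
```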

## Related

* [Engine-Builder-LLM reference config](/engines/engine-builder-llm/engine-builder-config): Complete configuration options.
* [Structured outputs](/inference/structured-outputs): JSON schema validation and controlled generation.
* [Lookahead decoding guide](/engines/engine-builder-llm/lookahead-decoding): Advanced speculative decoding.
* [Custom engine builder](/engines/engine-builder-llm/custom-engine-builder): Custom model.py implementation.
* [Quantization guide](/engines/performance-concepts/quantization-guide): `FP8`/`FP4` trade-offs and hardware requirements.
* [TensorRT-LLM examples](/examples/tensorrt-llm): Concrete deployment examples.
