Engine-Builder-LLM optimizes dense text generation models with TensorRT-LLM, delivering up to 4000 tokens/second for code generation with lookahead decoding. The engine supports structured outputs for JSON schema validation.

Use cases

Model families:
  • Llama: meta-llama/Llama-3.3-70B-Instruct, meta-llama/Llama-3.2-3B-Instruct.
  • Qwen: Qwen/Qwen2.5-72B-Instruct, Qwen/Qwen3-8B, Qwen/QwQ-32B-Preview.
  • Mistral: mistralai/Mistral-7B-Instruct-v0.3, mistralai/Mistral-Small-24B-Instruct.
  • DeepSeek: deepseek-ai/DeepSeek-R1-Distill-Llama-70B.
  • Gemma 3: google/gemma-3-27b-it, google/gemma-3-12b-it.
  • Microsoft: microsoft/Phi-4.
Engine-Builder-LLM handles high-throughput dialogue systems, coding assistants with lookahead decoding, and content generation with structured outputs. The engine’s speculative decoding accelerates code generation by 2-4x, making it ideal for coding agents and JSON-heavy workloads.

LoRA support

Engine-Builder-LLM supports multi-LoRA deployments with in-engine adapter switching.
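How an adapter is selected per request depends on how the deployment is configured. The snippet below is a minimal sketch that assumes the common OpenAI-compatible convention of passing the adapter name in the model field; the adapter name is a placeholder, not one defined elsewhere in this guide.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

# Assumption: the LoRA adapter is chosen per request via the `model` field, as in
# most OpenAI-compatible multi-LoRA servers. "my-lora-adapter" is a hypothetical
# adapter name; substitute the name configured for your deployment.
response = client.chat.completions.create(
    model="my-lora-adapter",
    messages=[{"role": "user", "content": "Summarize this ticket in one sentence."}],
)
print(response.choices[0].message.content)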

Structured outputs

Engine-Builder-LLM supports OpenAI-compatible structured outputs with JSON schema validation; see the Structured outputs examples below.

Key benefits

Low latency

TensorRT-LLM compilation optimizes time-to-first-token.

High throughput

Batching and kernel optimization maximize tokens per second.

Lookahead decoding

Speculative decoding accelerates coding agents and predictable content.

Structured outputs

JSON schema validation for controlled text generation.

Architecture support

Supported model types

Engine-Builder-LLM supports all causal language model architectures whose class names end in ForCausalLM. Primary architectures:
  • LlamaForCausalLM: Llama family models.
  • Qwen2ForCausalLM: Qwen family models.
  • MistralForCausalLM: Mistral family models.
  • Gemma2ForCausalLM: Gemma family models.
  • Phi3ForCausalLM: Phi family models.
Automatic detection: The engine automatically detects the model architecture from the checkpoint repository and applies appropriate optimizations.
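To confirm which architecture a checkpoint declares before building, you can read its config.json from Hugging Face. This is an optional verification step using huggingface_hub, not part of the engine itself:
import json
from huggingface_hub import hf_hub_download

# Fetch only config.json; gated repos (e.g. Llama) require HF_TOKEN to be set.
config_path = hf_hub_download(
    repo_id="meta-llama/Llama-3.2-3B-Instruct",
    filename="config.json",
)
with open(config_path) as f:
    config = json.load(f)

# Supported checkpoints declare a class name ending in ForCausalLM,
# e.g. ["LlamaForCausalLM"].
print(config["architectures"])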

Model size support

Model Size | Single GPU     | Tensor Parallel | Recommended GPU
<8B        | L4, A10G, H100 | N/A             | L4 (cost-effective)
8B-70B     | H100           | TP2             | H100 (2 GPUs)
70B+       | H100 / B200    | TP4+            | H100 (4+ GPUs)

Advanced features

Lookahead decoding

Lookahead decoding accelerates inference for code generation, JSON output, and templated content by speculating on future tokens using n-gram patterns. Best for:
  • Code generation: Highly predictable patterns in code.
  • Structured content: Reliable JSON, YAML, XML generation.
  • Mathematical expressions: Predictable mathematical notation.
  • Template completion: Filling in predictable templates.
Enable lookahead decoding by adding a speculator section:
trt_llm:
  build:
    speculator:
      speculative_decoding_mode: LOOKAHEAD_DECODING
      lookahead_windows_size: 1
      lookahead_ngram_size: 8
      lookahead_verification_set_size: 1
      enable_b10_lookahead: true
Performance impact:
  • Speed improvement: Up to 2x faster for code and structured content.
  • Prompt lookup: Up to 10x faster for prompt-lookup workloads like code apply, reaching 4000 tokens/s per request on Qwen3-8B with a single H100.
  • Optimal batch size: Less than 32 requests for best performance.

Structured outputs

Generate text that conforms to JSON schemas for reliable data extraction and controlled generation. Use cases:
  • Data extraction: Extract structured information from unstructured text.
  • API response generation: Generate JSON responses for APIs.
  • Configuration generation: Create structured configuration files.
  • Content validation: Ensure generated content meets specific criteria.
Structured outputs work out of the box with no extra configuration. Define a Pydantic schema:
import os
from pydantic import BaseModel
from openai import OpenAI

class User(BaseModel):
    name: str
    age: int
    email: str

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

response = client.beta.chat.completions.parse(
    model="not-required",
    messages=[
        {"role": "user", "content": "Extract user info from: John is 25 years old and his email is [email protected]"}
    ],
    response_format=User
)

user = response.choices[0].message.parsed
print(f"Name: {user.name}, Age: {user.age}, Email: {user.email}")

Quantization options

Engine-Builder-LLM supports multiple quantization formats for different performance and accuracy trade-offs. Quantization types:
  • no_quant: FP16/BF16 precision (baseline).
  • fp8: FP8 weights + 16-bit KV cache (2x speedup).
  • fp8_kv: FP8 weights + FP8 KV cache (2.5x speedup).
  • fp4: FP4 weights + 16-bit KV cache (4x speedup, B200 only).
  • fp4_kv: FP4 weights + FP8 KV cache (4.5x speedup, B200 only).
  • fp4_mlp_only: FP4 MLP only + 16-bit KV (3x speedup, B200 only).
Hardware requirements vary by quantization type:
Quantization              | Minimum GPU          | Memory reduction | Speed improvement
no_quant                  | A100                 | None             | Baseline
fp8                       | L4, H100, H200, B200 | 50%              | 2x
fp8_kv                    | L4, H100, H200, B200 | 60%              | 2.5x
fp4, fp4_kv, fp4_mlp_only | B200 only            | 75%              | 3-4.5x
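As a rough sanity check on the memory-reduction column, weight memory scales with bytes per parameter. The arithmetic below is an approximation that ignores the KV cache, activations, and runtime overhead:
# Approximate weight memory from bytes per parameter. Treat these as lower
# bounds: the KV cache, activations, and runtime overhead come on top.
BYTES_PER_PARAM = {
    "no_quant (FP16/BF16)": 2.0,
    "fp8": 1.0,
    "fp4": 0.5,
}

def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    # params_billion * 1e9 params * bytes per param / 1e9 bytes per GB
    return params_billion * bytes_per_param

# Example: a 70B-parameter model such as Llama 3.3 70B
# (~140 GB in BF16, ~70 GB in FP8, ~35 GB in FP4).
for name, bpp in BYTES_PER_PARAM.items():
    print(f"{name}: ~{weight_memory_gb(70, bpp):.0f} GB of weights")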

Configuration examples

Basic Llama 3.3 70B deployment

Llama 3.3 70B on H100 GPUs with FP8 weights and an FP8 KV cache (fp8_kv):
model_name: Llama-3.3-70B-Instruct
resources:
  accelerator: H100:4  # 4 GPUs for 70B model
  cpu: '4'
  memory: 40Gi
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "meta-llama/Llama-3.3-70B-Instruct"
      revision: main
      runtime_secret_name: hf_access_token
    max_seq_len: 131072
    max_batch_size: 256
    max_num_tokens: 8192
    quantization_type: fp8_kv
    tensor_parallel_count: 4
    plugin_configuration:
      paged_kv_cache: true
      use_paged_context_fmha: true
      use_fp8_context_fmha: true
    quantization_config:
      calib_size: 1024
      calib_dataset: "cnn_dailymail"
      calib_max_seq_length: 2048
  runtime:
    kv_cache_free_gpu_mem_fraction: 0.9
    enable_chunked_context: true
    batch_scheduler_policy: guaranteed_no_evict
    served_model_name: "Llama-3.3-70B-Instruct"

Qwen 2.5 32B with lookahead decoding

Qwen 2.5 32B with speculative decoding for faster inference. See the Lookahead decoding section above for details.
model_name: Qwen-2.5-32B-Lookahead
resources:
  accelerator: H100:1
  cpu: '2'
  memory: 20Gi
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen2.5-Coder-32B-Instruct"
      revision: main
    max_seq_len: 32768
    max_batch_size: 128
    max_num_tokens: 8192
    quantization_type: fp8 # fp8_kv is not supported for Qwen 2.5 models
    tensor_parallel_count: 1
    num_builder_gpus: 2 # the checkpoint is loaded in BF16 for quantization, requiring ~2x 32 GB of memory -> 2 H100s for the build
    speculator:
      speculative_decoding_mode: LOOKAHEAD_DECODING
      lookahead_windows_size: 3
      lookahead_ngram_size: 8
      lookahead_verification_set_size: 3
      enable_b10_lookahead: true
    plugin_configuration:
      paged_kv_cache: true
      use_paged_context_fmha: true
      use_fp8_context_fmha: true
  runtime:
    kv_cache_free_gpu_mem_fraction: 0.85
    enable_chunked_context: true
    batch_scheduler_policy: guaranteed_no_evict
    served_model_name: "Qwen-2.5-Coder-32B-Instruct"

Small model for cost-effective deployment

Llama 3.2 3B on an L4 GPU for cost efficiency:
model_name: Llama-3.2-3B-Instruct
resources:
  accelerator: L4
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "meta-llama/Llama-3.2-3B-Instruct"
      revision: main
    max_seq_len: 8192
    max_batch_size: 256
    max_num_tokens: 4096
    quantization_type: fp8
    tensor_parallel_count: 1
    plugin_configuration:
      paged_kv_cache: true
      use_paged_context_fmha: true
      use_fp8_context_fmha: false
  runtime:
    kv_cache_free_gpu_mem_fraction: 0.9
    enable_chunked_context: true
    batch_scheduler_policy: guaranteed_no_evict
    served_model_name: "Llama-3.2-3B-Instruct"

Performance characteristics

Latency and throughput factors

Performance depends on several factors:
  • Model size: Smaller models respond faster.
  • Quantization: FP8/FP4 reduces memory and improves throughput.
  • Lookahead decoding: Effective for code and structured content.
  • Batch size: Larger batches improve throughput at the cost of latency.
  • Hardware: H100 and B200 GPUs deliver the best results.

Memory usage considerations

Memory optimization factors:
  • Quantization: FP8 reduces memory by ~50%, FP4 by ~75%.
  • Lookahead decoding: Minimal additional memory overhead.
  • Tensor parallelism: Distributes memory across multiple GPUs.
  • KV cache management: Configurable memory allocation for context handling.
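When sizing kv_cache_free_gpu_mem_fraction, a rough per-request KV cache estimate helps. The sketch below assumes a standard transformer KV layout (2 tensors per layer, num_kv_heads x head_dim values per token); the model shape shown is illustrative, so read the real values from the checkpoint's config.json:
# Rough KV cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes per value.
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int, bytes_per_value: float = 2.0) -> float:
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * seq_len * batch_size / 1e9

# Example: a Llama-3-70B-like shape (80 layers, 8 KV heads, head_dim 128) at
# 8k context and batch size 32 with a 16-bit KV cache (~86 GB); an FP8 KV
# cache (bytes_per_value=1.0) halves this.
print(f"~{kv_cache_gb(80, 8, 128, 8192, 32):.0f} GB")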

Integration examples

OpenAI-compatible inference

Engine-Builder-LLM deployments are OpenAI compatible, enabling use of the standard OpenAI SDK.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

# Standard chat completion
response = client.chat.completions.create(
    model="not-required",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)

# Streaming completion
for chunk in client.chat.completions.create(
    model="not-required",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="")
Point base_url to your model’s production endpoint. Find this URL in your Baseten dashboard after deployment. The model parameter can be any string since Baseten routes based on the URL, not this field. Set stream=True to receive tokens as they’re generated. Running this returns a chat completion response with the model’s answer in response.choices[0].message.content, or streams chunks with partial content in delta.content.

Performant client usage

For high-throughput batch processing, use the Performance Client, which handles concurrent requests efficiently.
import os
from baseten_performance_client import PerformanceClient

client = PerformanceClient(
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync",
    api_key=os.environ['BASETEN_API_KEY']
)

# Batch chat completions with stream=False
payloads = [
    {
        "model": "model",
        "messages": [{"role": "user", "content": "Explain quantum computing"}],
        "stream": False,
        "max_tokens": 500
    },
    {
        "model": "model", 
        "messages": [{"role": "user", "content": "Write a poem about AI"}],
        "stream": False,
        "max_tokens": 300
    }
] * 10  # 20 total requests

response = client.batch_post(
    url_path="/v1/chat/completions",
    payloads=payloads,
)

# Access 20 responses
for i, resp in enumerate(response.data):
    print(f"Response {i+1}: {resp['choices'][0]['message']['content']}")
Use cases: bulk content generation, batch data processing, and performance benchmarking. The client releases the Python GIL while requests are in flight, so other Python work can run concurrently.

Structured outputs

Structured outputs guarantee the response matches your Pydantic schema.
import os
from pydantic import BaseModel
from openai import OpenAI

class Task(BaseModel):
    title: str
    priority: str
    due_date: str
    description: str

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

response = client.beta.chat.completions.parse(
    model="not-required",
    messages=[
        {"role": "user", "content": "Create a task for: Review the quarterly report by next Friday"}
    ],
    response_format=Task
)

task = response.choices[0].message.parsed
print(f"Task: {task.title}")
print(f"Priority: {task.priority}")
Define your schema as a Pydantic model with typed fields. Pass it to response_format and use beta.chat.completions.parse instead of the regular create method. The response includes a parsed attribute with your data already converted to a Task object, so no JSON parsing is needed.

Function calling

Function calling lets the model invoke your functions with structured arguments. Define available tools, and the model returns function calls when appropriate.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City name, e.g., San Francisco"
                }
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="not-required",
    messages=[{"role": "user", "content": "What's the weather like in Boston?"}],
    tools=tools
)

tool_call = response.choices[0].message.tool_calls[0]
print(f"Function: {tool_call.function.name}")
print(f"Arguments: {tool_call.function.arguments}")
Define each tool with a name, description, and JSON schema for parameters. The description helps the model decide when to use the tool. When the model chooses to call a function, tool_calls contains the function name and JSON-encoded arguments. Your code executes the function and optionally sends the result back for a final response.
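A minimal sketch of that follow-up round trip is shown below, continuing the example above (client, tools, response, and tool_call). The get_weather implementation is a stand-in, and the message flow follows the standard OpenAI tool-calling convention:
import json

def get_weather(location: str) -> str:
    # Stand-in implementation; in practice this would call a real weather API.
    return f"Sunny and 22°C in {location}"

# Execute the function the model asked for, then send the result back as a
# tool message so the model can produce a final natural-language answer.
args = json.loads(tool_call.function.arguments)
tool_result = get_weather(**args)

follow_up = client.chat.completions.create(
    model="not-required",
    messages=[
        {"role": "user", "content": "What's the weather like in Boston?"},
        response.choices[0].message,  # assistant turn containing the tool call
        {"role": "tool", "tool_call_id": tool_call.id, "content": tool_result},
    ],
    tools=tools,
)
print(follow_up.choices[0].message.content)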

Best practices

Model selection

For cost-effective deployments:
  • Use models under 8B parameters on L4, H100_40GB, or H100 GPUs.
  • Consider quantization for memory efficiency.
  • Implement autoscaling for variable traffic.
For high-performance deployments:
  • Use H100 GPUs with FP8 quantization.
  • Enable lookahead decoding for code generation.
  • Use tensor parallelism for large models.
For coding assistants:
  • Use models trained on code (Qwen-Coder, CodeLlama).
  • Enable lookahead decoding with window size 1 for maximum throughput.
  • Consider smaller models for faster response times.

Hardware optimization

GPU selection:
  • L4 or H100_40GB: Best for models under 15B parameters, cost-effective.
  • H100_80GB: Recommended for models 15-70B parameters for optimal performance.
  • H100: Best for models 15-70B parameters, high performance.
  • B200: Required for FP4 quantization.
Memory optimization:
  • Use quantization to reduce memory usage.
  • Lower max_seq_len or enable chunked prefill.
  • Monitor memory usage during deployment.

Performance tuning

For lowest latency:
  • Use smaller models when possible.
  • Enable lookahead decoding for code generation.
For highest throughput:
  • Use larger batch sizes.
  • Enable FP8/FP4 quantization.
  • Use tensor parallelism for large models.
For cost efficiency:
  • Use L4 GPUs with quantization.
  • Implement efficient autoscaling.
  • Choose appropriately sized models.

Migration guide

From other deployment systems

Coming from vLLM? Here’s how the configuration maps:
# vLLM configuration (old)
model: "meta-llama/Llama-3.3-70B-Instruct"
tensor_parallel_size: 4
quantization: "fp8"

# Engine-Builder-LLM configuration (new)
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "meta-llama/Llama-3.3-70B-Instruct"
    quantization_type: fp8_kv
    tensor_parallel_count: 4

Further reading