Engine-Builder-LLM optimizes dense text generation models with TensorRT-LLM, delivering up to 4000 tokens/second for code generation with lookahead decoding. The engine supports structured outputs for JSON schema validation. Engine-Builder-LLM deployments mirror build artifacts to the Baseten Delivery Network automatically. No extra configuration is required.

Use cases

Model families:
  • Llama: meta-llama/Llama-3.3-70B-Instruct, meta-llama/Llama-3.2-3B-Instruct. For Llama 4, use BIS-LLM.
  • Qwen: Qwen/Qwen3-235B-A22B-Instruct-2507-FP8, Qwen/Qwen2.5-72B-Instruct.
  • Mistral: mistralai/Mistral-Small-24B-Instruct-2501, mistralai/Mistral-7B-Instruct-v0.3.
  • GPT-OSS: openai/gpt-oss-20b.
  • Nemotron: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4.
  • Gemma: google/gemma-3-27b-it, google/gemma-3-12b-it.
  • Microsoft: microsoft/Phi-4.
Engine-Builder-LLM handles high-throughput dialogue systems, coding assistants with lookahead decoding, and content generation with structured outputs. The engine’s speculative decoding accelerates code generation by 2-4x, making it ideal for coding agents and JSON-heavy workloads.

LoRA support

Engine-Builder-LLM serves multiple LoRA adapters per deployment with engine-level adapter switching. Define adapters at build time and select between them per request.

Structured outputs

Engine-Builder-LLM supports OpenAI-compatible structured outputs with JSON schema validation, including nested schemas and complex types.

Key benefits

Low latency

TensorRT-LLM compilation optimizes time-to-first-token.

High throughput

Batching and kernel optimization maximize tokens per second.

Lookahead decoding

Speculative decoding accelerates coding agents and predictable content.

Structured outputs

JSON schema validation for controlled text generation.

Architecture support

Supported architectures

Engine-Builder-LLM auto-detects the Hugging Face architectures field from your checkpoint. The build maps each architecture to an optimized TensorRT-LLM backend:
Hugging Face architecture | Backend | Example models
LlamaForCausalLM, LLaMAForCausalLM | LLaMA | Llama 3.2, Llama 3.3
MistralForCausalLM | LLaMA | Mistral 7B, Mistral Small
AquilaForCausalLM, AquilaModel | LLaMA | Aquila family
InternLMForCausalLM | LLaMA | InternLM
XverseForCausalLM | LLaMA | Xverse
Qwen2ForCausalLM | Qwen | Qwen 2.5 dense
Qwen2MoeForCausalLM | Qwen | Qwen 2 MoE (prefer BIS-LLM for production MoE)
Qwen3ForCausalLM | Qwen3 | Qwen 3 dense
Qwen3MoeForCausalLM | Qwen3 | Qwen 3 MoE (for example, Qwen3-235B-A22B)
Palmyra4ForCausalLM | Qwen | Writer Palmyra
Gemma2ForCausalLM, Gemma3ForCausalLM | Gemma | Gemma 2/3 (bf16 only)
DeciLMForCausalLM | Nemotron NAS | NVIDIA Nemotron NAS
Architectures not in this table: If the checkpoint’s architectures value is not listed (including Phi3ForCausalLM and other ForCausalLM variants), the build still uses base_model: decoder and auto-detects the architecture, but logs a warning that it may miss model-specific optimizations. Prefer checkpoints with explicit architecture metadata. The legacy named base_model values (llama, qwen, mistral, deepseek) are no longer accepted and raise an error on push.

Not on Engine-Builder-LLM: Llama 4, DeepSeek MoE, Kimi, and GLM MoE use different architectures. Deploy them with BIS-LLM.
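
To check ahead of a build which backend a checkpoint would map to, the table can be turned into a small lookup. This is a minimal sketch, not the engine's own detection code: the function name lookup_backend is hypothetical, and taking the first entry of the architectures list from config.json is an assumption.

```python
import json

# Architecture-to-backend pairs copied from the table above.
BACKEND_BY_ARCHITECTURE = {
    "LlamaForCausalLM": "LLaMA",
    "LLaMAForCausalLM": "LLaMA",
    "MistralForCausalLM": "LLaMA",
    "AquilaForCausalLM": "LLaMA",
    "AquilaModel": "LLaMA",
    "InternLMForCausalLM": "LLaMA",
    "XverseForCausalLM": "LLaMA",
    "Qwen2ForCausalLM": "Qwen",
    "Qwen2MoeForCausalLM": "Qwen",
    "Qwen3ForCausalLM": "Qwen3",
    "Qwen3MoeForCausalLM": "Qwen3",
    "Palmyra4ForCausalLM": "Qwen",
    "Gemma2ForCausalLM": "Gemma",
    "Gemma3ForCausalLM": "Gemma",
    "DeciLMForCausalLM": "Nemotron NAS",
}


def lookup_backend(config_json):
    """Return (architecture, backend) for the text of a checkpoint's config.json.

    backend is None when the architecture is not in the table, which is the
    case where the build falls back to generic auto-detection with a warning.
    """
    config = json.loads(config_json)
    architectures = config.get("architectures") or []
    if not architectures:
        raise ValueError("config.json has no 'architectures' field")
    # Assumption: the first listed architecture is the one that matters.
    architecture = architectures[0]
    return architecture, BACKEND_BY_ARCHITECTURE.get(architecture)


print(lookup_backend('{"architectures": ["Qwen3ForCausalLM"]}'))
print(lookup_backend('{"architectures": ["Phi3ForCausalLM"]}'))
```

A None backend is not an error here. It only signals that the checkpoint falls outside the table, so the build may skip model-specific optimizations.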

Model size support

Model size | Single GPU | Tensor parallel | Recommended GPU
<8B | H100_40GB, H100, B200 | N/A | H100_40GB (cost-effective)
8B-30B | H100, B200 | TP1 | H100
30B-70B | H100 | TP2-TP4 | H100 (4 GPUs)
70B+ | H100, B200 | TP4-TP8 | H100 (8 GPUs) or B200 (2-4 GPUs)

Advanced features

Lookahead decoding

Lookahead decoding accelerates inference for code generation, JSON output, and templated content by speculating on future tokens using n-gram patterns. Best for:
  • Code generation: Highly predictable patterns in code.
  • Structured content: Reliable JSON, YAML, XML generation.
  • Mathematical expressions: Predictable mathematical notation.
  • Template completion: Filling in predictable templates.
Enable lookahead decoding by adding a speculator section:
trt_llm:
  build:
    speculator:
      speculative_decoding_mode: LOOKAHEAD_DECODING
      lookahead_windows_size: 1
      lookahead_ngram_size: 8
      lookahead_verification_set_size: 1
      enable_b10_lookahead: true
Performance impact:
  • Speed improvement: Up to 2x faster for code and structured content.
  • Prompt lookup: Up to 10x faster for prompt-lookup workloads like code apply, reaching 4000 tokens/s per request on Qwen-3-8B with a single H100.
  • Optimal batch size: Less than 32 requests for best performance.
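
To make the n-gram idea concrete, here is a toy sketch of prompt-lookup style drafting: find where the last few tokens appeared earlier in the sequence and propose whatever followed them. It only illustrates why repetitive text such as code edits drafts well. It is not the TensorRT-LLM lookahead algorithm or the engine's implementation, and both function names are hypothetical.

```python
def draft_from_ngram(tokens, ngram_size=3, max_draft=8):
    """Propose draft tokens by matching the trailing n-gram earlier in `tokens`.

    Returns the tokens that followed the most recent earlier occurrence of the
    last `ngram_size` tokens, or [] when there is no earlier occurrence.
    """
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan right to left so the most recent match wins; the start index
    # skips the position of the tail itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow = start + ngram_size
            return tokens[follow:follow + max_draft]
    return []


def accepted_prefix(draft, target):
    """Count leading draft tokens that agree with what the model would emit."""
    count = 0
    for guess, actual in zip(draft, target):
        if guess != actual:
            break
        count += 1
    return count


history = ["x", "=", "foo", "(", "a", ")", ";", "y", "=", "foo", "("]
draft = draft_from_ngram(history, ngram_size=3, max_draft=3)
print(draft)  # ['a', ')', ';']
print(accepted_prefix(draft, ["a", ")", "+"]))  # 2
```

Every accepted draft token is one the model did not have to generate in its own decoding step, which is where the speedup on predictable content comes from. A draft that misses costs verification work without saving a step.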

Structured outputs

Generate text that conforms to JSON schemas for reliable data extraction and controlled generation. Use cases:
  • Data extraction: Extract structured information from unstructured text.
  • API response generation: Generate JSON responses for APIs.
  • Configuration generation: Create structured configuration files.
  • Content validation: Ensure generated content meets specific criteria.
Structured outputs work out of the box with no extra configuration. Define a Pydantic schema and pass it as response_format:
import os
from pydantic import BaseModel
from openai import OpenAI

class User(BaseModel):
    name: str
    age: int
    email: str

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

response = client.beta.chat.completions.parse(
    model="not-required",
    messages=[
        {"role": "user", "content": "Extract user info from: John is 25 years old and his email is john@example.com"}
    ],
    response_format=User
)

user = response.choices[0].message.parsed
print(f"Name: {user.name}, Age: {user.age}, Email: {user.email}")
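
For clients that do not use Pydantic, the same request can be expressed with a hand-written JSON schema. The sketch below builds a nested schema in the response_format shape used by the OpenAI chat completions API, and adds a small local check for missing required keys. It assumes the deployment accepts this json_schema form; the schema, the address field, and the missing_required helper are illustrative, not part of the engine.

```python
import json

# Hand-written JSON schema with one nested object (illustrative only).
USER_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "address": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
            "additionalProperties": False,
        },
    },
    "required": ["name", "age", "address"],
    "additionalProperties": False,
}

# Shape used by the OpenAI chat completions API for schema-constrained output.
# It would be passed as response_format=... to client.chat.completions.create.
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "user", "schema": USER_SCHEMA, "strict": True},
}


def missing_required(value, schema, path=""):
    """List required keys absent from `value`, recursing into nested objects."""
    missing = []
    for key in schema.get("required", []):
        if key not in value:
            missing.append(f"{path}{key}")
    for key, sub in schema.get("properties", {}).items():
        if sub.get("type") == "object" and isinstance(value.get(key), dict):
            missing += missing_required(value[key], sub, f"{path}{key}.")
    return missing


# A sample reply string standing in for response.choices[0].message.content.
reply = json.loads('{"name": "John", "age": 25, "address": {}}')
print(missing_required(reply, USER_SCHEMA))  # ['address.city']
```

The local check covers required keys only. A full validator such as the jsonschema package would also cover types and additionalProperties.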

Quantization options

Engine-Builder-LLM supports multiple quantization formats. For the full GPU support matrix, model-specific recommendations, and calibration guidance, see the quantization guide.
Quantization | Minimum GPU | Memory reduction
no_quant | A100 | None
fp8 | L4 | ~50%
fp8_kv | L4 | ~60%
fp4 / fp4_kv / fp4_mlp_only | B200 | ~75%
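
As a rough worked example, the table's percentages can be applied to a model's unquantized weight size. This sketch assumes a 16-bit baseline of 2 bytes per parameter and ignores KV cache, activations, and runtime overhead, so treat the results as a lower bound. fp8_kv is left out because its extra savings presumably come from the KV cache rather than the weights. The function name is hypothetical.

```python
# Approximate weight-memory reductions taken from the table above.
MEMORY_REDUCTION = {"no_quant": 0.0, "fp8": 0.50, "fp4": 0.75}

# Assumption: unquantized weights are stored in a 16-bit format (2 bytes each).
BYTES_PER_PARAM_BASELINE = 2


def estimate_weight_gb(params_billion, quantization):
    """Rough weight footprint in GB for a model with `params_billion` parameters."""
    baseline_gb = params_billion * BYTES_PER_PARAM_BASELINE
    return baseline_gb * (1.0 - MEMORY_REDUCTION[quantization])


for quantization in ("no_quant", "fp8", "fp4"):
    print(f"70B {quantization}: ~{estimate_weight_gb(70, quantization):.0f} GB")
# 70B no_quant: ~140 GB
# 70B fp8: ~70 GB
# 70B fp4: ~35 GB
```

On these numbers, a 70B model at fp8 leaves little room for KV cache on a single 80 GB GPU, which is consistent with the multi-GPU recommendations in the model size table.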

Configuration examples

Basic Llama 3.3 70B deployment

Llama 3.3 70B on four H100 GPUs with FP8 weight and KV cache quantization (fp8_kv):
model_name: Llama-3.3-70B-Instruct
resources:
  accelerator: H100:4  # 4 GPUs for 70B model
  cpu: '4'
  memory: 40Gi
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "meta-llama/Llama-3.3-70B-Instruct"
      revision: main
      runtime_secret_name: hf_access_token
    max_seq_len: 131072
    max_batch_size: 256
    max_num_tokens: 8192
    quantization_type: fp8_kv
    tensor_parallel_count: 4
    plugin_configuration:
      paged_kv_cache: true
      use_paged_context_fmha: true
      use_fp8_context_fmha: true
    quantization_config:
      calib_size: 1024
      calib_dataset: "abisee/cnn_dailymail"
      calib_max_seq_length: 2048
  runtime:
    kv_cache_free_gpu_mem_fraction: 0.9
    enable_chunked_context: true
    batch_scheduler_policy: guaranteed_no_evict
    served_model_name: "Llama-3.3-70B-Instruct"

Qwen 2.5 32B with lookahead decoding

Qwen 2.5 Coder 32B with lookahead decoding, a form of speculative decoding, for faster inference. See Lookahead decoding for the full configuration reference.
model_name: Qwen-2.5-32B-Lookahead
resources:
  accelerator: H100:1
  cpu: '2'
  memory: 20Gi
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen2.5-Coder-32B-Instruct"
      revision: main
    max_seq_len: 32768
    max_batch_size: 128
    max_num_tokens: 8192
    quantization_type: fp8 # no fp8_kv for qwen2.5 models
    tensor_parallel_count: 1
    num_builder_gpus: 2 # Checkpoint is loaded in BF16 for quantization (~64 GB of weights for 32B parameters), so the build uses 2 H100s
    speculator:
      speculative_decoding_mode: LOOKAHEAD_DECODING
      lookahead_windows_size: 3
      lookahead_ngram_size: 8
      lookahead_verification_set_size: 3
      enable_b10_lookahead: true
    plugin_configuration:
      paged_kv_cache: true
      use_paged_context_fmha: true
      use_fp8_context_fmha: false # set to true only with an FP8 KV cache (fp8_kv), which this config does not use
  runtime:
    kv_cache_free_gpu_mem_fraction: 0.85
    enable_chunked_context: true
    batch_scheduler_policy: guaranteed_no_evict
    served_model_name: "Qwen-2.5-Coder-32B-Instruct"

Small model for cost-effective deployment

Llama 3.2 3B on an L4 GPU for cost efficiency:
model_name: Llama-3.2-3B-Instruct
resources:
  accelerator: L4
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "meta-llama/Llama-3.2-3B-Instruct"
      revision: main
    max_seq_len: 8192
    max_batch_size: 256
    max_num_tokens: 4096
    quantization_type: fp8
    tensor_parallel_count: 1
    plugin_configuration:
      paged_kv_cache: true
      use_paged_context_fmha: true
      use_fp8_context_fmha: false
  runtime:
    kv_cache_free_gpu_mem_fraction: 0.9
    enable_chunked_context: true
    batch_scheduler_policy: guaranteed_no_evict
    served_model_name: "Llama-3.2-3B-Instruct"
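
The three configurations above share two patterns: base_model is always decoder, and the GPU count in accelerator equals tensor_parallel_count. The sketch below checks a parsed config for both before a push. It is a hypothetical local helper, not part of the Baseten tooling. It operates on a plain dict (for example, the result of loading the YAML with a parser of your choice) and assumes tensor_parallel_count defaults to 1 when omitted.

```python
# Legacy base_model values that the text above says are rejected on push.
LEGACY_BASE_MODELS = {"llama", "qwen", "mistral", "deepseek"}


def gpu_count(accelerator):
    """Parse 'H100:4' into 4; a bare name such as 'L4' means one GPU."""
    _name, _sep, count = accelerator.partition(":")
    return int(count) if count else 1


def check_config(config):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    build = config["trt_llm"]["build"]

    base_model = build.get("base_model")
    if base_model in LEGACY_BASE_MODELS:
        problems.append(f"base_model '{base_model}' is a legacy value; use 'decoder'")

    gpus = gpu_count(config["resources"]["accelerator"])
    # Assumption: a missing tensor_parallel_count means 1.
    tensor_parallel = build.get("tensor_parallel_count", 1)
    if gpus != tensor_parallel:
        problems.append(
            f"accelerator has {gpus} GPU(s) but tensor_parallel_count is {tensor_parallel}"
        )
    return problems


llama_70b = {
    "resources": {"accelerator": "H100:4"},
    "trt_llm": {"build": {"base_model": "decoder", "tensor_parallel_count": 4}},
}
print(check_config(llama_70b))  # []
```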

Integration examples

Engine-Builder-LLM deployments are OpenAI compatible. Point base_url to your model’s production endpoint and use the standard OpenAI SDK:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

response = client.chat.completions.create(
    model="not-required",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
For high-throughput batch processing, use the Performance Client. For structured outputs and function calling, see their dedicated pages.

Sizing and tuning

Throughput, latency, and cost depend on four levers: model size, quantization (FP8 on H100 cuts memory roughly in half, FP4 on B200 by 75%), tensor parallelism, and whether lookahead decoding earns its keep for your workload. For the full GPU support matrix and calibration guidance, see the quantization guide. For per-flag detail on max_seq_len, max_batch_size, KV cache, and chunked prefill, see the Engine-Builder-LLM configuration reference.