Use cases
Model families:- Llama:
meta-llama/Llama-3.3-70B-Instruct,meta-llama/Llama-3.2-3B-Instruct. For Llama 4, use BIS-LLM. - Qwen:
Qwen/Qwen3-235B-A22B-Instruct-2507-FP8,Qwen/Qwen2.5-72B-Instruct. - Mistral:
mistralai/Mistral-Small-24B-Instruct-2501,mistralai/Mistral-7B-Instruct-v0.3. - GPT-OSS:
openai/gpt-oss-20b. - Nemotron:
nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4. - Gemma:
google/gemma-3-27b-it,google/gemma-3-12b-it. - Microsoft:
microsoft/Phi-4.
LoRA support
Engine-Builder-LLM serves multiple LoRA adapters per deployment with engine-level adapter switching. Define adapters at build time and select between them per request.Structured outputs
Engine-Builder-LLM supports OpenAI-compatible structured outputs with JSON schema validation, including nested schemas and complex types.Key benefits
Low latency
TensorRT-LLM compilation optimizes time-to-first-token.
High throughput
Batching and kernel optimization maximize tokens per second.
Lookahead decoding
Speculative decoding accelerates coding agents and predictable content.
Structured outputs
JSON schema validation for controlled text generation.
Architecture support
Supported architectures
Engine-Builder-LLM auto-detects the Hugging Facearchitectures field from your checkpoint. The build maps each architecture to an optimized TensorRT-LLM backend:
| Hugging Face architecture | Backend | Example models |
|---|---|---|
LlamaForCausalLM, LLaMAForCausalLM | LLaMA | Llama 3.2, Llama 3.3 |
MistralForCausalLM | LLaMA | Mistral 7B, Mistral Small |
AquilaForCausalLM, AquilaModel | LLaMA | Aquila family |
InternLMForCausalLM | LLaMA | InternLM |
XverseForCausalLM | LLaMA | Xverse |
Qwen2ForCausalLM | Qwen | Qwen 2.5 dense |
Qwen2MoeForCausalLM | Qwen | Qwen 2 MoE (prefer BIS-LLM for production MoE) |
Qwen3ForCausalLM | Qwen3 | Qwen 3 dense |
Qwen3MoeForCausalLM | Qwen3 | Qwen 3 MoE (for example, Qwen3-235B-A22B) |
Palmyra4ForCausalLM | Qwen | Writer Palmyra |
Gemma2ForCausalLM, Gemma3ForCausalLM | Gemma | Gemma 2/3 (bf16 only) |
DeciLMForCausalLM | Nemotron NAS | NVIDIA Nemotron NAS |
architectures value is not listed (including Phi3ForCausalLM and other ForCausalLM variants), the build still uses base_model: decoder and auto-detects the architecture, logging a warning that it may miss model-specific optimizations. The legacy named base_model values (llama, qwen, mistral, deepseek) are no longer accepted and raise an error on push. Prefer checkpoints with explicit architecture metadata.
Not on Engine-Builder-LLM: Llama 4, DeepSeek MoE, Kimi, and GLM MoE use different architectures. Deploy them with BIS-LLM.
Model size support
| Model Size | Single GPU | Tensor Parallel | Recommended GPU |
|---|---|---|---|
<8B | H100_40GB, H100, B200 | N/A | H100_40GB (cost-effective) |
| 8B-30B | H100, B200 | TP1 | H100 |
| 30B-70B | H100 | TP2-TP4 | H100 (4 GPUs) |
70B+ | H100, B200 | TP4-TP8 | H100 (8 GPUs) or B200 (2-4 GPUs) |
Advanced features
Lookahead decoding
Lookahead decoding accelerates inference for code generation, JSON output, and templated content by speculating on future tokens using n-gram patterns. Best for:- Code generation: Highly predictable patterns in code.
- Structured content: Reliable JSON, YAML, XML generation.
- Mathematical expressions: Predictable mathematical notation.
- Template completion: Filling in predictable templates.
speculator section:
- Speed improvement: Up to 2x faster for code and structured content.
- Prompt lookup: Up to 10x faster for prompt-lookup workloads like code apply, reaching 4000 tokens/s per request on Qwen-3-8B with a single H100.
- Optimal batch size: Less than 32 requests for best performance.
Structured outputs
Generate text that conforms to JSON schemas for reliable data extraction and controlled generation. Use cases:- Data extraction: Extract structured information from unstructured text.
- API response generation: Generate JSON responses for APIs.
- Configuration generation: Create structured configuration files.
- Content validation: Ensure generated content meets specific criteria.
Quantization options
Engine-Builder-LLM supports multiple quantization formats. For the full GPU support matrix, model-specific recommendations, and calibration guidance, see the quantization guide.| Quantization | Minimum GPU | Memory reduction |
|---|---|---|
no_quant | A100 | None |
fp8 | L4 | ~50% |
fp8_kv | L4 | ~60% |
fp4 / fp4_kv / fp4_mlp_only | B200 | ~75% |
Configuration examples
Basic Llama 3.3 70B deployment
Llama 3.3 70B on H100 GPUs withFP8 quantization:
Qwen 2.5 32B with lookahead decoding
Qwen 2.5 32B with speculative decoding for faster inference. See Lookahead decoding for the full configuration reference.Small model for cost-effective deployment
Llama 3.2 3B on an L4 GPU for cost efficiency:Integration examples
Engine-Builder-LLM deployments are OpenAI compatible. Pointbase_url to your model’s production endpoint and use the standard OpenAI SDK:
Sizing and tuning
Throughput, latency, and cost depend on four levers: model size, quantization (FP8 on H100 cuts memory roughly in half, FP4 on B200 by 75%), tensor parallelism, and whether lookahead decoding earns its keep for your workload. For the full GPU support matrix and calibration guidance, see the quantization guide. For per-flag detail on max_seq_len, max_batch_size, KV cache, and chunked prefill, see the Engine-Builder-LLM configuration reference.
Related
- Configure Engine-Builder-LLM deployments: Complete build and runtime options.
- Set up structured outputs: JSON schema validation and controlled generation.
- Enable lookahead decoding: Speculative decoding for coding agents.
- Build custom inference logic: Custom model.py implementation.
- Choose a quantization format: FP8/FP4 trade-offs and hardware requirements.
- Deploy LoRA adapters: Multi-LoRA with runtime switching.
- Scale Engine-Builder-LLM replicas: Autoscaling settings and concurrency targets.