Use cases
Model families:
- Llama: meta-llama/Llama-3.3-70B-Instruct, meta-llama/Llama-3.2-3B-Instruct
- Qwen: Qwen/Qwen2.5-72B-Instruct, Qwen/Qwen3-8B, Qwen/QwQ-32B-Preview
- Mistral: mistralai/Mistral-7B-Instruct-v0.3, mistralai/Mistral-Small-24B-Instruct
- DeepSeek: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
- Gemma 3: google/gemma-3-27b-it, google/gemma-3-12b-it
- Microsoft: microsoft/Phi-4
LoRA support
Engine-Builder-LLM supports multi-LoRA deployments with engine adapter switching:
- Multi-LoRA: Multiple adapters, engine switching, parameter-efficient fine-tuning.
- Quick start: Deploy LoRA adapters in minutes.
Structured outputs
Engine-Builder-LLM supports OpenAI-compatible structured outputs with JSON schema validation:
- Features: Full OpenAI compatibility, JSON schema validation, complex nested schemas.
- Quick start: Get started with structured outputs in minutes.
Key benefits
- Low latency: TensorRT-LLM compilation optimizes time-to-first-token.
- High throughput: Batching and kernel optimization maximize tokens per second.
- Lookahead decoding: Speculative decoding accelerates coding agents and predictable content.
- Structured outputs: JSON schema validation for controlled text generation.
Architecture support
Supported model types
Engine-Builder-LLM supports all causal language model architectures that end with ForCausalLM (a quick compatibility check is sketched after the list below):
Primary architectures:
- LlamaForCausalLM: Llama family models.
- Qwen2ForCausalLM: Qwen family models.
- MistralForCausalLM: Mistral family models.
- Gemma2ForCausalLM: Gemma family models.
- Phi3ForCausalLM: Phi family models.
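A minimal sketch, assuming the Hugging Face transformers library is installed, of checking whether a checkpoint declares a ForCausalLM architecture before building an engine; the repo ID is just an example from the list above.

```python
from transformers import AutoConfig

# Load only the model config (no weights) and inspect its declared architectures.
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-72B-Instruct")
architectures = config.architectures or []
print(architectures)  # e.g. ['Qwen2ForCausalLM']

if any(arch.endswith("ForCausalLM") for arch in architectures):
    print("Architecture family is supported by Engine-Builder-LLM.")
else:
    print("Architecture does not end with ForCausalLM; check engine support before deploying.")
```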
Model size support
| Model Size | Single GPU | Tensor Parallel | Recommended GPU |
|---|---|---|---|
| <8B | L4, A10G, H100 | N/A | L4 (cost-effective) |
| 8B-70B | H100 | TP2 | H100 (2 GPUs) |
| 70B+ | H100 / B200 | TP4+ | H100 (4+ GPUs) |
Advanced features
Lookahead decoding
Lookahead decoding accelerates inference for code generation, JSON output, and templated content by speculating on future tokens using n-gram patterns; a minimal sketch of the n-gram lookup idea follows the list below. Best for:
- Code generation: Highly predictable patterns in code.
- Structured content: Reliable JSON, YAML, XML generation.
- Mathematical expressions: Predictable mathematical notation.
- Template completion: Filling in predictable templates.
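To make the mechanism concrete, here is an illustrative sketch of n-gram (prompt-lookup) speculation in plain Python. It is not Engine-Builder-LLM's internal implementation; it only shows the idea of proposing the tokens that followed the most recent earlier occurrence of the trailing n-gram.

```python
from typing import List

def propose_draft(tokens: List[int], ngram_size: int = 3, num_draft: int = 8) -> List[int]:
    """Propose speculative tokens by reusing what followed a repeated n-gram."""
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan backwards for an earlier occurrence of the trailing n-gram.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            return tokens[start + ngram_size:start + ngram_size + num_draft]
    return []

# Repetitive token streams (code, templated JSON) make these guesses cheap and often correct.
print(propose_draft([5, 1, 2, 3, 9, 7, 1, 2, 3]))  # -> [9, 7, 1, 2, 3]
```

As in speculative decoding generally, the engine verifies proposed tokens in a single forward pass and keeps only the matching prefix, which is why highly repetitive content benefits the most.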
Performance when the speculator section is configured:
- Speed improvement: Up to 2x faster for code and structured content.
- Prompt lookup: Up to 10x faster for prompt-lookup workloads like code apply, reaching 4000 tokens/s per request on Qwen-3-8B with a single H100.
- Optimal batch size: Less than 32 requests for best performance.
Structured outputs
Generate text that conforms to JSON schemas for reliable data extraction and controlled generation; a request sketch follows the list below. Use cases:
- Data extraction: Extract structured information from unstructured text.
- API response generation: Generate JSON responses for APIs.
- Configuration generation: Create structured configuration files.
- Content validation: Ensure generated content meets specific criteria.
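A hedged data-extraction sketch using the OpenAI-style json_schema response format; the endpoint URL, API key, and invoice schema are placeholders, and the exact response_format shape should be checked against the structured outputs guide.

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_BASETEN_API_KEY",  # placeholder
    base_url="https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1",  # placeholder
)

# Illustrative schema for pulling invoice fields out of free-form text.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["vendor", "total", "currency"],
}

response = client.chat.completions.create(
    model="engine-builder-llm",  # routing uses the URL, so this can be any string
    messages=[{"role": "user", "content": "Extract the invoice fields: 'ACME Corp charged 1,200.00 USD.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "invoice", "schema": invoice_schema},
    },
)
print(response.choices[0].message.content)  # a JSON string conforming to invoice_schema
```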
Quantization options
Engine-Builder-LLM supports multiple quantization formats for different performance and accuracy trade-offs. Quantization types:
- no_quant: FP16/BF16 precision (baseline).
- fp8: FP8 weights + 16-bit KV cache (2x speedup).
- fp8_kv: FP8 weights + FP8 KV cache (2.5x speedup).
- fp4: FP4 weights + 16-bit KV cache (4x speedup, B200 only).
- fp4_kv: FP4 weights + FP8 KV cache (4.5x speedup, B200 only).
- fp4_mlp_only: FP4 MLP only + 16-bit KV cache (3x speedup, B200 only).
| Quantization | Minimum GPU | Memory reduction | Speed improvement |
|---|---|---|---|
| no_quant | A100 | None | Baseline |
| fp8 | L4, H100, H200, B200 | 50% | 2x |
| fp8_kv | L4, H100, H200, B200 | 60% | 2.5x |
| fp4, fp4_kv, fp4_mlp_only | B200 only | 75% | 3-4.5x |
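To see where the memory-reduction figures come from, here is a back-of-the-envelope weight-memory estimate. It counts only weight bytes per parameter and ignores KV cache, activations, and runtime overhead; fp4_mlp_only is omitted because its savings depend on the MLP share of the model.

```python
# Approximate bytes per parameter for the weight formats above.
BYTES_PER_PARAM = {"no_quant": 2.0, "fp8": 1.0, "fp8_kv": 1.0, "fp4": 0.5, "fp4_kv": 0.5}

def weight_memory_gb(num_params_billions: float, quant: str) -> float:
    # Billions of parameters * bytes per parameter ~= gigabytes of weights.
    return num_params_billions * BYTES_PER_PARAM[quant]

for quant in ("no_quant", "fp8", "fp4"):
    print(f"70B weights with {quant}: ~{weight_memory_gb(70, quant):.0f} GB")
# -> ~140 GB (no_quant), ~70 GB (fp8, a 50% reduction), ~35 GB (fp4, a 75% reduction)
```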
Configuration examples
Basic Llama 3.3 70B deployment
Llama 3.3 70B on H100 GPUs with FP8 quantization:
Qwen 2.5 32B with lookahead decoding
Qwen 2.5 32B with speculative decoding for faster inference. Read more in the lookahead decoding guide.
Small model for cost-effective deployment
Llama 3.2 3B on an L4 GPU for cost efficiency:
Performance characteristics
Latency and throughput factors
Performance depends on model size (smaller models respond faster), quantization (FP8/FP4 reduces memory and improves throughput), lookahead decoding (effective for code and structured content), batch size (larger batches improve throughput at the cost of latency), and hardware (H100 and B200 GPUs deliver the best results).
Memory usage considerations
Memory optimization factors:
- Quantization: FP8 reduces memory by ~50%, FP4 by ~75%.
- Lookahead decoding: Minimal additional memory overhead.
- Tensor parallelism: Distributes memory across multiple GPUs.
- KV cache management: Configurable memory allocation for context handling.
Integration examples
OpenAI-compatible inference
Engine-Builder-LLM deployments are OpenAI compatible, enabling use of the standard OpenAI SDK. Set base_url to your model's production endpoint; find this URL in your Baseten dashboard after deployment. The model parameter can be any string since Baseten routes based on the URL, not this field. Set stream=True to receive tokens as they're generated.
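A minimal sketch with the OpenAI Python SDK; the base_url, API key, and model string below are placeholders for the values shown in your Baseten dashboard.

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_BASETEN_API_KEY",  # placeholder
    base_url="https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1",  # placeholder
)

# Non-streaming request: the full answer arrives in one response object.
response = client.chat.completions.create(
    model="engine-builder-llm",  # any string; routing is based on the URL
    messages=[{"role": "user", "content": "Explain FP8 quantization in one sentence."}],
)
print(response.choices[0].message.content)

# Streaming request: tokens arrive as they are generated.
stream = client.chat.completions.create(
    model="engine-builder-llm",
    messages=[{"role": "user", "content": "Explain FP8 quantization in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```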
Running this returns a chat completion response with the model's answer in response.choices[0].message.content, or streams chunks with partial content in delta.content.
Performant client usage
For high-throughput batch processing, use the Performance Client, which handles concurrent requests efficiently.
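The Performance Client's own API is not shown here; as a hedged alternative sketch, concurrent requests can also be issued with the async OpenAI client and asyncio.gather (endpoint and key are placeholders).

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key="YOUR_BASETEN_API_KEY",  # placeholder
    base_url="https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1",  # placeholder
)

async def ask(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="engine-builder-llm",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def main() -> None:
    prompts = [f"Summarize document {i}." for i in range(8)]
    # Issue all requests concurrently; server-side batching keeps throughput high.
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for answer in answers:
        print(answer)

asyncio.run(main())
```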
Structured outputs
Structured outputs guarantee the response matches your Pydantic schema. Pass the schema as response_format and use beta.chat.completions.parse instead of the regular create method.
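A minimal sketch, assuming an illustrative Task Pydantic model; endpoint and key are placeholders.

```python
from openai import OpenAI
from pydantic import BaseModel

class Task(BaseModel):
    title: str
    priority: int
    completed: bool

client = OpenAI(
    api_key="YOUR_BASETEN_API_KEY",  # placeholder
    base_url="https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1",  # placeholder
)

completion = client.beta.chat.completions.parse(
    model="engine-builder-llm",
    messages=[{"role": "user", "content": "Create a task for reviewing the Q3 report."}],
    response_format=Task,  # the Pydantic schema the response must conform to
)

task = completion.choices[0].message.parsed  # already a Task instance
print(task.title, task.priority, task.completed)
```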
The response includes a parsed attribute with your data already converted to a Task object, so no JSON parsing is needed.
Function calling
Function calling lets the model invoke your functions with structured arguments. Define available tools, and the model returns function calls when appropriate. Each tool declares a name, description, and JSON schema for parameters. The description helps the model decide when to use the tool.
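A hedged sketch with an illustrative get_weather tool; the endpoint, key, and tool definition are placeholders.

```python
import json
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_BASETEN_API_KEY",  # placeholder
    base_url="https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1",  # placeholder
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="engine-builder-llm",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)  # JSON-encoded arguments from the model
    print(call.function.name, args)             # e.g. get_weather {'city': 'Paris'}
```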
When the model chooses to call a function, tool_calls contains the function name and JSON-encoded arguments. Your code executes the function and optionally sends the result back for a final response.
Best practices
Model selection
For cost-effective deployments:
- Use models under 8B parameters on L4, H100, or H100_40GB GPUs.
- Consider quantization for memory efficiency.
- Implement autoscaling for variable traffic.
For high-performance deployments:
- Use H100 GPUs with FP8 quantization.
- Enable lookahead decoding for code generation.
- Use tensor parallelism for large models.
For code generation workloads:
- Use models trained on code (Qwen-Coder, CodeLlama).
- Enable lookahead decoding with window size 1 for maximum throughput.
- Consider smaller models for faster response times.
Hardware optimization
GPU selection:
- L4 or H100_40GB: Best for models under 15B parameters, cost-effective.
- H100_80GB: Recommended for models 15-70B parameters for optimal performance.
- H100: Best for models 15-70B parameters, high performance.
- B200: Required for FP4 quantization.
Memory management:
- Use quantization to reduce memory usage.
- Lower max_seq_len or enable chunked prefill.
- Monitor memory usage during deployment.
Performance tuning
For lowest latency:
- Use smaller models when possible.
- Enable lookahead decoding for code generation.
For highest throughput:
- Use larger batch sizes.
- Enable FP8/FP4 quantization.
- Use tensor parallelism for large models.
For cost efficiency:
- Use L4 GPUs with quantization.
- Implement efficient autoscaling.
- Choose appropriately sized models.
Migration guide
From other deployment systems
Coming from vLLM? Here's how the configuration maps:
Further reading
- Engine-Builder-LLM reference config: Complete configuration options.
- Structured outputs: JSON schema validation and controlled generation.
- Lookahead decoding guide: Advanced speculative decoding.
- Custom engine builder: Custom model.py implementation.
- Quantization guide: FP8/FP4 trade-offs and hardware requirements.
- TensorRT-LLM examples: Concrete deployment examples.