The custom engine builder lets you add your own logic in model.py while maintaining TensorRT-LLM performance. It enables billing integration, request tracing, fan-out generation, and multi-response workflows.
Overview
The custom engine builder lets you (a minimal model.py sketch follows this list):
- Implement business logic: Billing, usage tracking, access control.
- Add custom logging: Request tracing, performance monitoring, audit trails.
- Create advanced inference patterns: Fan-out generation, custom chat templates.
- Integrate external services: APIs, databases, monitoring systems.
- Optimize performance: Concurrent processing, custom batching strategies.
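As a concrete starting point, here is a minimal sketch of a custom model.py that wraps the engine with usage tracking. It is a sketch under assumptions, not a confirmed interface: the trt_llm argument, its "engine" key, and the async predict coroutine mirror the fan-out example later in this page, and the customer_id field and logging destination are hypothetical.

```python
# model/model.py -- hypothetical usage-tracking wrapper
import logging
import time

logger = logging.getLogger(__name__)


class Model:
    def __init__(self, trt_llm, **kwargs):
        # Assumption: the built engine is injected via `trt_llm` and
        # exposes an async `predict` coroutine.
        self._engine = trt_llm["engine"]

    async def predict(self, model_input: dict) -> dict:
        # `customer_id` is a hypothetical request field used for billing.
        customer_id = model_input.pop("customer_id", "anonymous")
        start = time.perf_counter()

        response = await self._engine.predict(model_input)

        # Replace this log line with a call to your billing/analytics
        # system; `usage` assumes the engine returns token counts.
        logger.info(
            "customer=%s latency_s=%.3f usage=%s",
            customer_id,
            time.perf_counter() - start,
            response.get("usage"),
        )
        return response
```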
When to use custom engine builder
Ideal use cases
Business logic integration:
- Usage tracking: Monitor token usage per customer/request.
- Access control: Implement custom authentication/authorization.
- Rate limiting: Custom rate limiting based on user tiers.
- Audit logging: Compliance and security requirements.
Advanced inference patterns:
- Fan-out generation: Generate multiple responses from one request.
- Custom chat templates: Domain-specific conversation formats.
- Multi-response workflows: Parallel processing of variations.
- Conditional generation: Business rule-based output modification.
Performance and observability:
- Custom logging: Request tracing, performance metrics.
- Concurrent processing: Parallel generation for improved throughput.
- Usage analytics: Track patterns and optimize accordingly.
- Error handling: Custom error responses and fallback logic.
Implementation
Fan-out generation example
Multi-generation fan-out generates multiple texts from a single request. Running the generations sequentially ensures the KV cache is created before subsequent generations.
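A sketch of what model/model.py can look like is below. Treat it as illustrative rather than authoritative: the trt_llm engine handle, its async predict coroutine, and the num_variants request field are assumptions, not confirmed API.

```python
# model/model.py -- fan-out generation sketch
class Model:
    def __init__(self, trt_llm, **kwargs):
        # Assumption: the engine handle exposes an async `predict` coroutine.
        self._engine = trt_llm["engine"]

    async def predict(self, model_input: dict) -> dict:
        # `num_variants` is a hypothetical field controlling fan-out size.
        num_variants = model_input.pop("num_variants", 3)

        generations = []
        for i in range(num_variants):
            # Sequential awaits ensure the KV cache created by the first
            # call exists before subsequent generations run.
            # Varying temperature (assumed to be an accepted field)
            # diversifies the variants.
            request = {**model_input, "temperature": 0.7 + 0.1 * i}
            generations.append(await self._engine.predict(request))

        return {"generations": generations}
```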
Fan-out generation configuration
To deploy the above example, create a new directory, e.g. fanout, and create a fanout/model/model.py file containing the code above.
Then create the following config.yaml at fanout/config.yaml.
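The sketch below shows the general shape of such a config; the field names are written from the Engine-Builder-LLM configuration reference as an assumption, and the deployment name, resources, checkpoint, and sequence length are placeholders to replace with your own values.

```yaml
model_name: fanout-llm          # placeholder deployment name
resources:
  accelerator: H100             # pick an accelerator suited to your model
  use_gpu: true
trt_llm:
  build:
    base_model: llama           # architecture family of the checkpoint
    checkpoint_repository:
      source: HF                # pull weights from Hugging Face
      repo: meta-llama/Llama-3.1-8B-Instruct   # placeholder checkpoint
    max_seq_len: 4096
```

With model/model.py and config.yaml in place, the directory is ready to deploy.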
Finally, deploy the model with truss push --publish.
Limitations and considerations
What custom engine builder cannot do
Custom tokenization:
- Cannot modify the underlying tokenizer implementation.
- Cannot add custom vocabulary or special tokens.
- Must use the model's native tokenization.
Engine modifications:
- Cannot modify the TensorRT-LLM engine structure.
- Cannot change attention mechanisms or model layers.
- Cannot add custom model components.
When to use standard engine instead
- You only need standard chat completions without special requirements.
- You have no business logic to integrate.
Monitoring and debugging
Request tracing
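A simple way to trace requests is to tag each one with an ID and log timing around the engine call. The sketch below assumes the same hypothetical engine handle as the earlier examples, and the trace_id request field is likewise an illustrative convention, not part of any fixed schema.

```python
# model/model.py -- request-tracing sketch
import logging
import time
import uuid

logger = logging.getLogger(__name__)


class Model:
    def __init__(self, trt_llm, **kwargs):
        self._engine = trt_llm["engine"]  # assumed engine handle, as above

    async def predict(self, model_input: dict) -> dict:
        # Honor a caller-supplied trace ID (hypothetical field), else mint one.
        trace_id = model_input.pop("trace_id", None) or uuid.uuid4().hex
        start = time.perf_counter()
        logger.info("trace=%s request received", trace_id)
        try:
            response = await self._engine.predict(model_input)
            logger.info(
                "trace=%s completed in %.3fs", trace_id, time.perf_counter() - start
            )
            return response
        except Exception:
            logger.exception(
                "trace=%s failed after %.3fs", trace_id, time.perf_counter() - start
            )
            raise
```

Because the trace ID appears in every log line for a request, a single generation can be followed end to end through your log aggregator.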
Further reading
- Engine-Builder-LLM overview: Main engine documentation.
- Engine-Builder-LLM configuration: Complete configuration reference.
- Examples section: Deployment examples.
- Chains documentation: Multi-model workflows.