Overview
Deploy multiple LoRA adapters on a single base model and switch between them at inference time. The engine shares the base model weights across all adapters for memory efficiency.
Configuration
Basic LoRA configuration
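A minimal sketch of a multi-LoRA build configuration, expressed as a Python dict. The key names mirror the build options documented later on this page; the exact schema depends on your Engine-Builder-LLM version, and the repository IDs are placeholders for your own base model and adapters.

```python
# Minimal multi-LoRA build configuration (sketch).
build_config = {
    "base_model": "Qwen/Qwen2.5-Coder-7B",        # placeholder base model repo
    "lora_adapters": {
        # served adapter name -> HuggingFace adapter repository
        "lora1": "your-org/qwen2.5-coder-lora1",  # hypothetical adapter repo
    },
    "max_lora_rank": 64,  # largest rank r used by any adapter
}
```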
Limitations
- Same rank and same modules: For optimal performance and stability, the LoRA adapters in one deployment should be uniform: all adapters must use the same rank and target the same modules.
- Build-time availability: The engine relies on NumPy-style weights, which must be pre-converted during deployment and distributed to each replica. For Engine-Builder-LLM, the adapter repos must therefore be known ahead of time.
- Inference performance: If you use only one LoRA adapter, merging the adapter into the base weights gives better performance (see the sketch after this list). Additional LoRA adapters add complexity to kernel selection and fundamentally increase FLOPs.
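For the single-adapter case, the merge can be done ahead of time with Hugging Face PEFT. A minimal sketch; the base model and adapter repository IDs are placeholders:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model, then apply the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-7B")       # placeholder
model = PeftModel.from_pretrained(base, "your-org/qwen2.5-coder-lora1")    # placeholder

# Fold the adapter deltas into the base weights and drop the LoRA layers,
# so inference runs at plain base-model speed.
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")
```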
LoRA adapter configuration
Adapter repository structure
LoRA adapters must follow the standard HuggingFace repository structure.
Required files
adapter_model.safetensors (or adapter_model.bin)
- Contains the LoRA weight matrices, which are converted to NumPy arrays at build time
- Shape: (num_layers, rank, hidden_size, hidden_size)
- Must match the target modules specified in the config
- Must match the adapter_config.json specifications
adapter_config.json
- Contains the LoRA configuration
- Includes scaling factors and other parameters
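A quick way to sanity-check an adapter repository before a build is to confirm the required files exist and compare adapter_config.json against your deployment constraints. A hedged sketch; the uniform-rank and uniform-modules checks mirror the limitation described above, and the helper name is illustrative:

```python
import json
from pathlib import Path

def check_adapter(repo_dir: str, expected_rank: int, expected_modules: set[str]) -> None:
    """Verify an adapter repo has the required files and uniform settings (sketch)."""
    repo = Path(repo_dir)
    config_path = repo / "adapter_config.json"
    assert config_path.exists(), "missing adapter_config.json"
    assert (repo / "adapter_model.safetensors").exists() or (
        repo / "adapter_model.bin"
    ).exists(), "missing adapter weights file"

    config = json.loads(config_path.read_text())
    # All adapters in one deployment must share the same rank and target modules.
    assert config["r"] == expected_rank, f"rank mismatch: {config['r']}"
    assert set(config["target_modules"]) == expected_modules, "target module mismatch"
```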
Build configuration options
lora_adapters
Dictionary of LoRA adapters to load during build (see the combined sketch at the end of this section).
max_lora_rank
Maximum LoRA rank across all adapters. Set this to the largest rank r that you use for any adapter.
plugin_configuration
LoRA plugin configuration:
- float16: Reduced memory usage, slight accuracy impact.
- float32: Higher precision, much slower inference.
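Putting the three options together, a build configuration might look like the sketch below. The top-level keys mirror the options above; the adapter repository is a placeholder, and the inner plugin key name is illustrative:

```python
build_config = {
    "lora_adapters": {
        "lora1": "your-org/qwen2.5-coder-lora1",  # hypothetical adapter repo
    },
    "max_lora_rank": 64,  # largest rank r used by any adapter
    "plugin_configuration": {
        # float16 trades a little accuracy for lower memory;
        # float32 is more precise but much slower.
        "lora_plugin": "float16",  # illustrative key name
    },
}
```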
Engine inference configuration
The model parameter in OpenAI-format requests selects which adapter to use. For the above example, valid model names are Qwen2.5-Coder-base or lora1.
This lets you select different adapters at runtime through the OpenAI client.
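For example, with the OpenAI Python client (the base_url and api_key values are placeholders for your deployment):

```python
from openai import OpenAI

# Point the client at the deployed engine's OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # placeholders

# model="lora1" routes the request through that adapter;
# model="Qwen2.5-Coder-base" uses the unadapted base weights.
response = client.chat.completions.create(
    model="lora1",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
)
print(response.choices[0].message.content)
```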
Further reading
- Engine-Builder-LLM overview: Main engine documentation.
- Engine-Builder-LLM configuration: Complete reference config.
- Custom engine builder: Custom model.py implementation.
- Quantization guide: Performance optimization.