Overview
Deploy multiple LoRA adapters on a single base model and switch between them at inference time. The engine shares base model weights across all adapters for memory efficiency.Configuration
Basic LoRA configuration
Limitations
- Same rank and same modules: For optimal performance and stability, the LoRA adapters for one deployment should be uniform. All target modules must be the same.
- Build time availability: The engine relies on numpy-style weights. These need to be pre-converted during deployment and distributed to each replica. For Engine-Builder-LLM, these repos must be known ahead of time.
- Inference performance: If you’re using only one LoRA adapter, merging the adapter into the base weights provides better performance. Additional LoRA adapters add complexity to kernel selection and fundamentally increase flops.
LoRA adapter configuration
Adapter repository structure
LoRA adapters must follow the standard HuggingFace repository structure:Required files
adapter_config.json- The LoRA adapter weights in safetensors format.
You don’t create or upload any
.npy files. The engine builder converts your adapter into its internal format (model.lora_weights.npy, model.lora_config.npy) server-side at build time, deriving the rank and target modules from adapter_config.json. Supply a standard Hugging Face adapter repo (adapter_config.json plus adapter_model.safetensors).Build configuration options
lora_adapters
Dictionary of LoRA adapters to load during build. Adapter names must match the pattern ^[a-zA-Z0-9_\-\.:]+$: letters, digits, underscores, hyphens, dots, and colons only.
max_lora_rank
Maximum LoRA rank for all adapters. Default: 64. Set this to exactly the rank r you use across all adapters. A higher value wastes memory; a lower value truncates weights.
lora_configuration
LoRA-specific configuration nested under build:
max_lora_rank: Maximum LoRA rank across all adapters. Default: 64.lora_target_modules: Target modules for LoRA. Usually auto-detected from adapter config.
Engine inference configuration
The model parameter in OpenAI-format requests selects which adapter to use. For the above example, valid model names areQwen2.5-Coder-base or lora1.
This lets you select different adapters at runtime through the OpenAI client.
Related
- Engine-Builder-LLM overview: Main engine documentation.
- Engine-Builder-LLM configuration: Complete reference config.
- Custom engine builder: Custom model.py implementation.
- Quantization guide: Performance optimization.