Baseten engines optimize model inference for specific architectures using TensorRT-LLM. Select an engine based on your model type (embeddings, dense LLMs, or mixture-of-experts) to achieve the best latency and throughput.

Engine ecosystem

Engine selection

Select an engine based on your model’s architecture and expected workload.
| Model type | Architecture | Recommended engine | Key features | Hardware |
| --- | --- | --- | --- | --- |
| Dense LLM | CausalLM (text generation) | Engine-Builder-LLM | Lookahead decoding, structured outputs | H100, B200 |
| MoE models | Mixture of Experts | BIS-LLM | KV-aware routing, advanced quantization | H100, B200 |
| Large models | 700B+ parameters | BIS-LLM | Distributed inference, FP4 support | H100, B200 |
| Embeddings | BERT-based (bidirectional) | BEI-Bert | Cold-start optimization, 16-bit precision | T4, L4, A10G, H100, B200 |
| Embeddings | Causal (Llama, Mistral, Qwen) | BEI | FP8 quantization, high throughput | L4, A10G, H100, B200 |
| Reranking | Cross-encoder architectures | BEI / BEI-Bert | Low latency, batch processing | L4, A10G, H100, B200 |
| Classification | Sequence classification | BEI / BEI-Bert | High throughput, cached weights | L4, A10G, H100, B200 |
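Expressed as code, the selection logic above reduces to a small lookup. This is an illustrative sketch only, not a Baseten API; the keys are informal labels for the rows of the table.

```python
# Illustrative lookup mirroring the selection table above (not an official API).
RECOMMENDED_ENGINE = {
    "dense_llm": "Engine-Builder-LLM",    # CausalLM text generation
    "moe_llm": "BIS-LLM",                 # mixture-of-experts models
    "large_llm_700b_plus": "BIS-LLM",     # 700B+ parameters, distributed inference
    "embedding_bert": "BEI-Bert",         # bidirectional (BERT-style) encoders
    "embedding_causal": "BEI",            # Llama/Mistral/Qwen-style embedders
    "reranker": "BEI",                    # or BEI-Bert for BERT-style cross-encoders
    "classifier": "BEI",                  # or BEI-Bert for BERT-style classifiers
}

print(RECOMMENDED_ENGINE["embedding_bert"])  # -> BEI-Bert
```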

Feature availability

| Feature | Notes |
| --- | --- |
| Quantization | BEI-Bert: FP16/BF16 only |
| KV quantization | FP8_KV and FP4_KV supported |
| Speculative lookahead decoding | Gated; n-gram based speculation |
| Self-serviceable | Gated for some engines; all other engines are self-service |
| KV-routing | Gated; BIS-LLM only |
| Disaggregated serving | Gated; BIS-LLM enterprise |
| Tool calling & structured output | Function calling support |
| Classification models | Sequence classification |
| Embedding models | Embedding generation |
| Mixture-of-experts | Mixture of Experts models like DeepSeek |
| MTP and Eagle 3 speculation | Gated; model-based speculation |
| HTTP request cancellation | Engine-Builder-LLM supports cancellation within the first 10 ms |

Architecture recommendations

BEI vs BEI-Bert (embeddings)

BEI-Bert optimizes BERT-based architectures (sentence-transformers, jinaai, nomic-ai) with fast cold-start performance and 16-bit precision. Choose BEI-Bert for bidirectional models under 4B parameters where cold-start latency matters. Jina-BERT, Nomic, and ModernBERT architectures all run well on this engine.

BEI handles causal embedding architectures (Llama, Mistral, Qwen) with FP8/FP4 quantization support. Choose BEI when you need maximum throughput or want to run larger embedding models like BAAI/bge, Qwen3-Embedding, or Salesforce/SFR-Embedding with quantization.
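Once an embedding model is deployed on either engine, a minimal client call might look like the sketch below. It assumes the deployment exposes an OpenAI-compatible embeddings endpoint; the base URL, API key, and model name are placeholders, not values from this page.

```python
# Minimal sketch of calling a deployed embedding model, assuming an
# OpenAI-compatible endpoint. base_url, api_key, and model are placeholders --
# substitute your own deployment's values.
from openai import OpenAI

client = OpenAI(
    base_url="https://model-XXXXXXX.api.baseten.co/environments/production/sync/v1",  # placeholder URL
    api_key="YOUR_BASETEN_API_KEY",  # placeholder key
)

response = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-0.6B",  # example only; use the model you deployed
    input=["How do BEI and BEI-Bert differ?", "Choosing an embedding engine"],
)

vectors = [item.embedding for item in response.data]
print(len(vectors), "embeddings of dimension", len(vectors[0]))
```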

Engine-Builder-LLM vs BIS-LLM (text generation)

Engine-Builder-LLM serves dense models (non-MoE) with lookahead decoding and structured outputs. Choose it for Llama 3.3, Qwen-3, Qwen2.5, Mistral, or Gemma-3 when you need speculative decoding for coding agents or JSON schema validation. BIS-LLM serves large MoE models with KV-aware routing and advanced tool calling. Choose it for DeepSeek-R1, Qwen3MoE, Kimi-K2, Llama-4, or GLM-4.7 when you need enterprise features like disaggregated serving or H100/B200 optimization.
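As an example of the structured-output path on a dense-model deployment, the sketch below requests JSON constrained by a schema. It assumes the deployment exposes an OpenAI-compatible chat completions endpoint that accepts a `json_schema` response format; the base URL, API key, and model name are placeholders.

```python
# Minimal sketch of schema-constrained generation, assuming an OpenAI-compatible
# chat completions endpoint with json_schema support. base_url, api_key, and
# model are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://model-XXXXXXX.api.baseten.co/environments/production/sync/v1",  # placeholder URL
    api_key="YOUR_BASETEN_API_KEY",  # placeholder key
)

ticket_schema = {
    "name": "support_ticket",
    "schema": {
        "type": "object",
        "properties": {
            "category": {"type": "string"},
            "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        },
        "required": ["category", "priority"],
        "additionalProperties": False,
    },
}

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # example dense model; use your deployment's model
    messages=[{"role": "user", "content": "Classify this ticket: 'Checkout page returns a 500 error.'"}],
    response_format={"type": "json_schema", "json_schema": ticket_schema},
)

print(completion.choices[0].message.content)  # JSON that conforms to the schema
```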

Performance benchmarks

Benchmark results depend on model size, GPU type, and quantization settings. The figures below represent typical performance on H100 GPUs.

Embedding performance (BEI/BEI-Bert)

  • Throughput: Up to 1,400 embeddings per second, measured client-side (see the measurement sketch below).
  • Latency: Sub-millisecond response times.
  • Quantization: FP8/FP4 provides 2x speedup with less than 1% accuracy loss.
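These numbers vary with batch size, concurrency, and input length, so it is worth measuring from your own client. A rough sketch, reusing the hypothetical OpenAI-compatible embedding endpoint from the example above:

```python
# Rough client-side throughput measurement for an embedding deployment.
# Illustrative only: real benchmarks should also control input length and warm-up.
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(
    base_url="https://model-XXXXXXX.api.baseten.co/environments/production/sync/v1",  # placeholder URL
    api_key="YOUR_BASETEN_API_KEY",  # placeholder key
)

BATCH = ["example document text for embedding"] * 32  # 32 inputs per request

def one_request(_):
    out = client.embeddings.create(model="Qwen/Qwen3-Embedding-0.6B", input=BATCH)  # example model
    return len(out.data)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=16) as pool:
    counts = list(pool.map(one_request, range(64)))  # 64 requests issued concurrently
elapsed = time.perf_counter() - start

print(f"{sum(counts) / elapsed:.0f} embeddings/s over {elapsed:.1f}s")
```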

Text generation performance (Engine-Builder-LLM/BIS-LLM)

  • Speculative decoding: Faster inference for code and structured content through lookahead decoding.
  • Quantization: Memory reduction and speed improvements with FP8/FP4.
  • Distributed inference: Scalable deployment with tensor parallelism.

Hardware requirements and optimization

Quantization reduces memory usage and improves inference speed.
| Quantization | Minimum GPU | Recommended GPU | Memory reduction | Notes |
| --- | --- | --- | --- | --- |
| FP16/BF16 | A100 | H100 | None | Baseline precision |
| FP8 | L4 | H100 | ~50% | Good balance of performance and accuracy |
| FP8_KV | L4 | H100 | ~60% | KV cache quantization for memory efficiency |
| FP4 | B200 | B200 | ~75% | B200-only quantization |
| FP4_KV | B200 | B200 | ~80% | Maximum memory reduction |
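As a rough illustration of these reductions, weight memory scales with bytes per parameter. The sketch below estimates weight-only memory for a hypothetical 70B-parameter model; KV cache and activations come on top, which is where FP8_KV and FP4_KV help further.

```python
# Back-of-the-envelope weight memory for a hypothetical 70B-parameter model.
# Weight-only estimate: parameter count * bytes per parameter.
PARAMS = 70e9
BYTES_PER_PARAM = {"FP16/BF16": 2.0, "FP8": 1.0, "FP4": 0.5}

for precision, nbytes in BYTES_PER_PARAM.items():
    gib = PARAMS * nbytes / 2**30
    print(f"{precision:>9}: ~{gib:,.0f} GiB of weights")
# FP16/BF16: ~130 GiB, FP8: ~65 GiB, FP4: ~33 GiB (weights only)
```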
Some models require specialized engines that are not self-serviceable:
  • Whisper: Audio transcription and speech recognition.
  • Orpheus: Audio generation.
