Baseten engines optimize model inference for specific architectures using TensorRT-LLM. All engines mirror build artifacts to the Baseten Delivery Network automatically.
  • BEI: Embedding, reranking, and classification models on causal architectures with FP8 and FP4 quantization.
  • BEI-Bert: Bidirectional BEI variant tuned for BERT-family encoders and cold-start-sensitive models under 4B parameters.
  • Engine-Builder-LLM: Dense text generation for Llama, Qwen, Mistral, and Gemma with lookahead decoding and multi-LoRA support.
  • BIS-LLM: MoE and Enterprise serving with KV-aware routing, disaggregated prefill/decode, and Eagle/MTP speculation.

Choose an engine

Pick the row below that matches what you’re deploying. Cost, quality, and latency targets drive later choices (GPU, quantization, autoscaling) inside that engine.
  • Embedding, reranking, classification, or NER models: use BEI for decoder embedders (Qwen3-Embedding, BAAI/bge, LlamaForSequenceClassification) or BEI-Bert for BERT-family encoders (BERT, ModernBERT, EuroBERT, XLM-RoBERTa). NER is served by BEI-Bert through /predict_tokens.
  • Dense text-generation LLMs (Llama 3 or 4, Qwen 3 or 3.5, Mistral, Gemma, Phi, GPT-OSS-20B): use Engine-Builder-LLM, with lookahead decoding and multi-LoRA available.
  • MoE models (GLM 5.x, Kimi K2.5 or K2.6, DeepSeek V3, R1, or V4, MiniMax 2.5, Qwen3 MoE, GPT-OSS-120B) or workloads that need KV-cache-aware routing or disaggregated prefill/decode: use BIS-LLM, which is currently a co-engineering pilot.
  • Speech, image, video, or custom Python models: ship a custom Truss. Browse model examples for Whisper, Orpheus, Flux, and other pre-built deployments, or see build your first model for custom inference logic.
If your workload doesn’t fit one of the rows above (custom architectures, hybrid pipelines, BIS-LLM pilot access, sizing for unusual traffic shapes), email support@baseten.co and an engineer will route you.
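The selection rules above can be read as a small decision function. The sketch below is illustrative only: `Engine`, `choose_engine`, and the task and flag names are hypothetical helpers made up for this example, not part of any Baseten or Truss API, and it covers only the cases named in the list.

```python
from enum import Enum


class Engine(Enum):
    BEI = "BEI"
    BEI_BERT = "BEI-Bert"
    ENGINE_BUILDER_LLM = "Engine-Builder-LLM"
    BIS_LLM = "BIS-LLM"
    CUSTOM_TRUSS = "custom Truss"


# Hypothetical task labels for the first row of the list.
EMBEDDING_STYLE_TASKS = {"embedding", "reranking", "classification"}


def choose_engine(task, *, bert_family=False, moe=False,
                  kv_routing=False, disaggregated=False):
    """Map a workload description to an engine, following the list above."""
    if task == "ner":
        # NER is served by BEI-Bert through /predict_tokens.
        return Engine.BEI_BERT
    if task in EMBEDDING_STYLE_TASKS:
        # BERT-family encoders go to BEI-Bert, decoder embedders to BEI.
        return Engine.BEI_BERT if bert_family else Engine.BEI
    if task == "text-generation":
        # MoE models and workloads needing KV-aware routing or
        # disaggregated prefill/decode go to BIS-LLM.
        if moe or kv_routing or disaggregated:
            return Engine.BIS_LLM
        return Engine.ENGINE_BUILDER_LLM
    # Speech, image, video, or custom Python models: ship a custom Truss.
    return Engine.CUSTOM_TRUSS


print(choose_engine("embedding", bert_family=True).value)
print(choose_engine("text-generation", moe=True).value)
```

Anything the function cannot classify falls through to a custom Truss, which mirrors the last row of the list; workloads that fit none of the rows still go through support as described above.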

Performance and operations

Compare engines

| Feature | BIS-LLM | Engine-Builder-LLM | BEI | BEI-Bert | Notes |
| --- | --- | --- | --- | --- | --- |
| Quantization | ✅ | ✅ | ✅ | ❌ | BEI-Bert: FP16/BF16 only. |
| KV quantization | ✅ | ✅ | ⚠️ | ⚠️ | FP8_KV, FP4_KV supported. |
| Lookahead decoding | ❌ | ✅ | ❌ | ❌ | Engine-Builder-LLM (v1) only; BIS-LLM uses MTP/Eagle/N-gram speculation instead. |
| Self-serviceable | 🔒 | ✅ | ✅ | ✅ | BIS-LLM requires Enterprise; other engines are self-serve. |
| KV-routing | 🔒 | ❌ | ❌ | ❌ | BIS-LLM only. |
| Disaggregated serving | 🔒 | ❌ | ❌ | ❌ | BIS-LLM Enterprise. |
| Tool calling & structured output | ✅ | ✅ | ❌ | ❌ | Function calling support. |
| Classification models | ❌ | ❌ | ✅ | ✅ | Sequence classification. |
| Embedding models | ❌ | ❌ | ✅ | ✅ | Embedding generation. |
| Mixture-of-experts | ✅ | ⚠️ (Qwen3MoE only) | ❌ | ❌ | MoE models like DeepSeek-R1. |
| MTP / Eagle / N-gram speculation | 🔒 | ❌ | ❌ | ❌ | v2 speculative decoding via speculative_config. |
| HTTP request cancellation | ✅ | ⚠️ | ✅ | ✅ | Engine-Builder-LLM: within the first 10ms only. |
| MultiModal Inputs | 🔒 | ❌ | ⚠️ | ❌ | Selected architectures only. |
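To shortlist engines by required features, the comparison table can be encoded as data and filtered. This is a minimal sketch, not a Baseten API: the `Support` levels, the feature keys, and `engines_supporting` are hypothetical names. It also assumes that ⚠️ means partial support and 🔒 means Enterprise-gated, which the table itself does not state.

```python
from enum import Enum


class Support(Enum):
    YES = "yes"          # ✅ in the table
    PARTIAL = "partial"  # ⚠️ (assumed to mean partial or limited support)
    GATED = "gated"      # 🔒 (assumed to mean Enterprise-gated)
    NO = "no"            # ❌


ENGINES = ("BIS-LLM", "Engine-Builder-LLM", "BEI", "BEI-Bert")

Y, P, G, N = Support.YES, Support.PARTIAL, Support.GATED, Support.NO

# One tuple per table row, in the same column order as ENGINES.
FEATURES = {
    "quantization": (Y, Y, Y, N),
    "kv_quantization": (Y, Y, P, P),
    "lookahead_decoding": (N, Y, N, N),
    "self_serviceable": (G, Y, Y, Y),
    "kv_routing": (G, N, N, N),
    "disaggregated_serving": (G, N, N, N),
    "tool_calling": (Y, Y, N, N),
    "classification": (N, N, Y, Y),
    "embedding": (N, N, Y, Y),
    "mixture_of_experts": (Y, P, N, N),
    "speculation": (G, N, N, N),
    "request_cancellation": (Y, P, Y, Y),
    "multimodal_inputs": (G, N, P, N),
}


def engines_supporting(*features, allow=(Support.YES,)):
    """Return engines whose support level is in `allow` for every feature."""
    return [
        engine
        for i, engine in enumerate(ENGINES)
        if all(FEATURES[f][i] in allow for f in features)
    ]


print(engines_supporting("embedding", "quantization"))
print(engines_supporting("kv_routing", allow=(Support.YES, Support.GATED)))
```

By default only full support counts. Passing a wider `allow` tuple includes partial or gated rows, which is useful when Enterprise access is an option.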