These examples walk through common ways to deploy and serve models on Baseten. Each section below covers a different packaging approach, so pick whichever fits your model and workflow. If you’re new to Baseten, start with Deploy your first model.

Documentation Index
Fetch the complete documentation index at: https://docs.baseten.co/llms.txt
Use this file to discover all available pages before exploring further.
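To discover pages programmatically, you can fetch the index and skim the titles. A minimal sketch using the requests library; only the URL above comes from this page, the rest is illustrative:

```python
import requests

# Fetch the plain-text documentation index.
resp = requests.get("https://docs.baseten.co/llms.txt", timeout=10)
resp.raise_for_status()

# Skim the first entries to see which pages exist.
for line in resp.text.splitlines()[:20]:
    print(line)
```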
Engines
Config-only deploys on Baseten’s optimized inference engines. This is the fastest path for LLMs, embeddings, and other common architectures, with no Python or Dockerfile required. See engines for architecture support, quantization options, and performance guidance. Once deployed, you invoke the engine over HTTP; a call sketch follows the list below.

Fast LLMs with TensorRT-LLM
Speculative decoding
Embeddings with BEI
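Once an engine deployment is live, you call it over HTTP like any other Baseten model. A minimal sketch, assuming a deployed model ID and an API key in BASETEN_API_KEY; the URL follows Baseten’s standard predict pattern and the payload shape depends on the engine, so treat both as assumptions to verify against the engines docs:

```python
import os

import requests

model_id = "YOUR_MODEL_ID"  # hypothetical placeholder for your deployment
api_key = os.environ["BASETEN_API_KEY"]

# Assumed predict endpoint for the production environment of a deployment.
resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/environments/production/predict",
    headers={"Authorization": f"Api-Key {api_key}"},
    # Payload fields vary by engine; prompt/max_tokens is a common LLM shape.
    json={"prompt": "What is Baseten?", "max_tokens": 128},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```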
Custom Docker servers
Bring your own inference server, such as vLLM, SGLang, or anything that speaks HTTP. Baseten runs the container, and you own the serving stack. See Docker server for configuration. A client sketch follows the list below.

Run any LLM with vLLM
Deploy LLMs with SGLang
Dockerized model
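Servers like vLLM and SGLang speak the OpenAI-compatible API, so once your container is running on Baseten you can point the standard OpenAI client at it. A minimal sketch; the base_url shape and the model name are assumptions about your particular deployment, so confirm the exact route in the Docker server docs:

```python
import os

from openai import OpenAI

client = OpenAI(
    # Assumed URL shape for an OpenAI-compatible route on a deployment;
    # the real path depends on how your server and deployment are configured.
    base_url="https://model-YOUR_MODEL_ID.api.baseten.co/environments/production/sync/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whatever model your server launched with
    messages=[{"role": "user", "content": "Hello from a custom Docker server!"}],
)
print(completion.choices[0].message.content)
```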
Custom Python models
Write the TrussModel class for full control over load and predict. Use when no engine or open-source server fits your architecture. See custom model code for the API.
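A minimal sketch of the load/predict shape described above, assuming the conventional Truss layout of a model class in model/model.py; the example pipeline and input format are illustrative, so see the custom model code docs for the exact API:

```python
# model/model.py -- illustrative custom model following the Truss convention.
from transformers import pipeline  # assumed dependency, declared in config.yaml


class Model:
    def __init__(self, **kwargs):
        self._pipeline = None

    def load(self):
        # Runs once at deployment startup: load weights into memory here.
        self._pipeline = pipeline("text-classification")

    def predict(self, model_input: dict) -> dict:
        # Runs per request: model_input is the parsed JSON request body.
        return {"predictions": self._pipeline(model_input["text"])}
```

From the directory containing config.yaml and model/, deploying is then a single `truss push`.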