
These examples walk through common ways to deploy and serve models on Baseten. Each section below covers a different packaging approach, so pick whichever fits your model and workflow. If you’re new to Baseten, start with Deploy your first model.

Engines

Config-only deploys on Baseten’s optimized inference engines. This is the fastest path for LLMs, embeddings, and other common architectures, with no Python or Dockerfile required. See engines for architecture support, quantization options, and performance guidance; a sketch of a config-only deploy follows the list below.

Fast LLMs with TensorRT-LLM

Speculative decoding

Embeddings with BEI
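
As a rough illustration of the config-only workflow, a config.yaml for a TensorRT-LLM engine deploy might look something like the sketch below. The model name, GPU type, and quantization are illustrative assumptions, not recommendations; check the engines documentation for the authoritative schema.

```yaml
# Illustrative config-only deploy of an open-weights LLM on the
# TensorRT-LLM engine builder. All values are placeholders.
model_name: llama-3-8b-instruct
resources:
  accelerator: H100        # GPU type is an example; size to your model
  use_gpu: true
trt_llm:
  build:
    base_model: llama
    checkpoint_repository:
      source: HF           # pull weights from Hugging Face
      repo: meta-llama/Meta-Llama-3-8B-Instruct
    max_seq_len: 8192
    quantization_type: fp8 # see the engines docs for supported options
```

Because no model code is involved, deploying is a single truss push from the directory containing this config.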

Custom Docker servers

Bring your own inference server, such as vLLM, SGLang, or anything else that speaks HTTP. Baseten runs the container, and you own the serving stack. See Docker server for configuration; a sketch of this setup follows the list below.

Run any LLM with vLLM

Deploy LLMs with SGLang

Dockerized model
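
For a sense of the shape of a custom Docker server deploy, here is a hedged sketch of a config.yaml that runs vLLM’s OpenAI-compatible server. The image tag, model, and endpoint paths are assumptions for illustration; verify the exact keys against the Docker server documentation before deploying.

```yaml
# Illustrative Docker server config running vLLM's OpenAI-compatible API.
# Image, model, endpoints, and GPU type are placeholders.
model_name: vllm-openai-server
base_image:
  image: vllm/vllm-openai:latest
docker_server:
  start_command: vllm serve meta-llama/Meta-Llama-3-8B-Instruct --port 8000
  server_port: 8000
  readiness_endpoint: /health        # vLLM's built-in health check
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
resources:
  accelerator: A100
  use_gpu: true
```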

Custom Python models

Write the Truss Model class for full control over load and predict. Use this approach when no engine or open-source server fits your architecture. See custom model code for the API; a minimal sketch follows the list below.

Build and deploy an LLM

Image generation

Customize a model
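
To make the load/predict contract concrete, here is a minimal sketch of a Truss model. The class structure follows the Truss custom model API; the transformers pipeline and model choice are illustrative stand-ins.

```python
# model/model.py -- minimal sketch of a Truss custom model.
# The pipeline and model name are illustrative, not prescriptive.
from transformers import pipeline


class Model:
    def __init__(self, **kwargs):
        self._pipeline = None

    def load(self):
        # Called once at startup, before any traffic is served, so slow
        # weight downloads never block a live request.
        self._pipeline = pipeline("text-generation", model="gpt2")

    def predict(self, model_input: dict) -> dict:
        # Called per request with the deserialized JSON body.
        result = self._pipeline(model_input["prompt"], max_new_tokens=64)
        return {"completion": result[0]["generated_text"]}
```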

Chains

Compose multi-step AI workflows that combine multiple models with routing, parallelism, and post-processing. See Chains for the SDK; a short sketch follows the list below.

RAG pipeline with Chains

Transcribe audio with Chains
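
The Chains SDK expresses each step as a Chainlet that can call other Chainlets. Below is a hedged two-step sketch using the truss_chains package’s ChainletBase, depends, and mark_entrypoint; the step logic itself is a trivial stand-in.

```python
import truss_chains as chains


class Preprocess(chains.ChainletBase):
    # A worker Chainlet: one step of the workflow, deployed and
    # scaled independently of the others.
    def run_remote(self, text: str) -> str:
        return text.strip().lower()


@chains.mark_entrypoint
class Workflow(chains.ChainletBase):
    # The entrypoint Chainlet receives requests and orchestrates steps.
    def __init__(self, preprocess: Preprocess = chains.depends(Preprocess)):
        self._preprocess = preprocess

    def run_remote(self, text: str) -> str:
        cleaned = self._preprocess.run_remote(text)
        return f"processed: {cleaned}"
```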

Training

Train and fine-tune models on Baseten’s scalable training infrastructure. From fine-tuning large language models to training custom models, the platform provides the tools and compute you need. The training stack supports popular frameworks, including VERL, Megatron, and Unsloth, as well as models trained directly with Hugging Face Transformers.
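
As a point of reference for that last case, a plain Hugging Face Transformers training script needs nothing Baseten-specific. The sketch below is a generic fine-tuning loop; the dataset, model, and hyperparameters are arbitrary examples.

```python
# Generic Hugging Face Transformers fine-tuning script; nothing here is
# Baseten-specific. Dataset, model, and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Small slice of a public dataset, tokenized for the model.
dataset = load_dataset("imdb", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints", num_train_epochs=1),
    train_dataset=dataset,
)
trainer.train()
```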