Quickstart: Make your first inference call
Call a model through Model APIs in under two minutes. No deployment, no setup, just an API key and a request.
How models get deployed
The most common way to deploy a model on Baseten is with Truss, an open-source framework that packages your model into a deployable container. For supported architectures, which cover most popular open-source LLMs, embedding models, and image generators, deployment requires only a config.yaml file.
You specify the model, the hardware, and the engine, and Truss handles the rest.
You run truss push, and Baseten builds a TensorRT-optimized container, deploys it to GPU infrastructure, and gives you an endpoint.
The model serves an OpenAI-compatible API out of the box.
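To make that first call, point any OpenAI client at the endpoint. Here is a minimal sketch in Python, assuming a BASETEN_API_KEY environment variable; the base URL and model slug are illustrative placeholders, not values from this page:

```python
import os

from openai import OpenAI

# Assumptions: BASETEN_API_KEY is set, and the base URL and model slug
# are placeholders to swap for your own deployment's values.
client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",  # illustrative model slug
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(response.choices[0].message.content)
```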
When you need custom behavior like preprocessing, postprocessing, or a model architecture that the built-in engines don’t support, Truss also supports custom Python model code.
You write a Model class with load and predict methods, and Truss packages it the same way.
Most teams start with config-only deployments and add custom code only when they need it.
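When you do reach for custom code, the shape is small. A minimal model.py sketch, with a Hugging Face pipeline standing in for your own model:

```python
from transformers import pipeline


class Model:
    def __init__(self, **kwargs):
        self._pipeline = None

    def load(self):
        # Runs once when a replica starts: load weights, warm caches, etc.
        # The text-classification pipeline is an illustrative stand-in.
        self._pipeline = pipeline("text-classification")

    def predict(self, model_input: dict) -> dict:
        # Runs per request: wrap preprocessing and postprocessing
        # around the model call here.
        return {"predictions": self._pipeline(model_input["text"])}
```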
Your first model
Deploy a model to Baseten with just a config file. No custom code needed.
Inference engines
Baseten optimizes every deployment with an inference engine tuned for your model’s architecture. You select the engine that best supports your use case, and it handles the low-level performance work: quantization, tensor parallelism, KV cache management, and batching.

Engine-Builder-LLM
Dense text generation models compiled with TensorRT-LLM. Supports lookahead decoding and structured outputs.
BIS-LLM
Large mixture-of-experts models like DeepSeek R1 and Qwen3 MoE with KV-aware routing and distributed inference.
BEI
Embedding, reranking, and classification models with throughput of up to 1,400 embeddings per second.
You set the engine in config.yaml, or Baseten selects it automatically based on your model architecture.
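Whichever engine runs underneath, the endpoint speaks the same OpenAI-compatible protocol. As a sketch, a BEI embedding deployment can be called with the OpenAI client's embeddings method; the deployment URL and model name below are placeholders:

```python
import os

from openai import OpenAI

# Assumption: the base URL is a placeholder for your own embedding
# deployment's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

result = client.embeddings.create(
    model="my-embedding-model",  # illustrative name
    input=["Baseten serves embeddings, reranking, and classification via BEI."],
)
print(len(result.data[0].embedding))
```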
Multi-step workflows with Chains
Some applications need more than a single model call. A RAG pipeline retrieves documents, embeds them, and generates a response. An image generation workflow runs a diffusion model, upscales the result, and applies safety filtering. Chains is Baseten’s framework for orchestrating these multi-step pipelines. Each step runs on its own hardware with its own dependencies, and Chains manages the data flow between them. You define the pipeline in Python, and Chains deploys, scales, and monitors each step independently.
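A minimal sketch of what a Chain looks like, using the truss_chains package; the two-step pipeline is illustrative rather than a full RAG workflow:

```python
import truss_chains as chains


class Embedder(chains.ChainletBase):
    # Each Chainlet can declare its own hardware and dependencies.
    def run_remote(self, text: str) -> list[float]:
        return [0.0] * 8  # stand-in for a real embedding model


@chains.mark_entrypoint
class Pipeline(chains.ChainletBase):
    def __init__(self, embedder=chains.depends(Embedder)):
        self._embedder = embedder

    def run_remote(self, query: str) -> dict:
        # Chains manages the remote call and data flow between steps.
        vector = self._embedder.run_remote(query)
        return {"query": query, "dims": len(vector)}
```

Deploying the file with truss chains push gives each Chainlet its own independently scaled deployment.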
Training

Baseten also provides training infrastructure for fine-tuning and pre-training. You bring your training scripts (Axolotl, TRL, Megatron, or custom code) and run jobs on H100, H200, or A10G GPUs. Checkpoints sync automatically during training, and you can deploy a fine-tuned model from checkpoint to production endpoint in a single command with truss train deploy_checkpoints.
Production infrastructure
Every deployment on Baseten runs on autoscaling infrastructure that adjusts replicas based on traffic. You configure minimum and maximum replicas, concurrency targets, and scale-down delays, or use the defaults, which handle most workloads well. Models can scale to zero when idle, eliminating costs during quiet periods, and scale up within seconds when traffic arrives.

Baseten schedules workloads across multiple cloud providers and regions through Multi-Cloud Capacity Management. This means your models stay available even during provider-level disruptions, and traffic routes to the lowest-latency region automatically.

Built-in observability gives you real-time metrics, logs, and request traces for every deployment. You can export data to tools like Datadog or Prometheus, and debug behavior with full visibility into inputs, outputs, and errors.
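As a sketch of setting these knobs programmatically, the call below targets the management API's autoscaling settings; the endpoint path and field names are assumptions based on the settings named above, so confirm them against the API reference:

```python
import os

import requests

MODEL_ID = "abcd1234"         # placeholder
DEPLOYMENT_ID = "production"  # placeholder

# Assumption: path and field names mirror the settings described above
# (replica bounds, concurrency target, scale-down delay).
resp = requests.patch(
    f"https://api.baseten.co/v1/models/{MODEL_ID}/deployments/{DEPLOYMENT_ID}/autoscaling_settings",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={
        "min_replica": 0,         # scale to zero when idle
        "max_replica": 4,
        "concurrency_target": 2,
        "scale_down_delay": 300,  # seconds
    },
)
resp.raise_for_status()
```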
Resources

How Baseten works
The build pipeline, request routing, autoscaling, and deployment lifecycle under the hood.
Examples
End-to-end guides for deploying and optimizing popular models.
Model library
Ready-to-deploy configurations for models like DeepSeek, Llama, Qwen, Whisper, and Stable Diffusion.
API reference
Reference for the inference API, management API, and Truss CLI.