Quickstart: Make your first inference call
Call a model through Model APIs in under two minutes. No deployment, no setup, just an API key and a request.
How models get deployed
The most common way to deploy a model on Baseten is with Truss, an open-source framework that packages your model into a deployable container. For supported architectures, which cover most popular open-source LLMs, embedding models, and image generators, deployment requires only a config.yaml file.
Specify the model, the hardware, and the engine, and Truss handles the rest.
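A minimal config might look like the following sketch for a TensorRT-LLM deployment. The model name, repository, and accelerator values are illustrative placeholders, and the exact schema varies by engine:

```yaml
# Illustrative config.yaml -- values are placeholders, not a canonical setup.
model_name: llama-example
resources:
  accelerator: H100
  use_gpu: true
trt_llm:
  build:
    checkpoint_repository:
      source: HF
      repo: meta-llama/Llama-3.1-8B-Instruct
```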
Run truss push, and Baseten builds a TensorRT-optimized container, deploys it to GPU infrastructure, and provides an endpoint.
The model serves an OpenAI-compatible API out of the box.
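As a sketch, a chat completion request to that endpoint has the standard OpenAI shape. The URL and model name below are placeholders; substitute your deployment's actual endpoint and a real API key before sending:

```python
import json
from urllib.request import Request

# Placeholder endpoint -- substitute your model's actual Baseten URL.
BASE_URL = "https://model-abc123.api.baseten.co/environments/production/sync/v1"

# Standard OpenAI-style chat completion payload.
payload = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
}

# Build the request; sending it requires a valid Baseten API key.
req = Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Api-Key YOUR_BASETEN_API_KEY",
        "Content-Type": "application/json",
    },
)
print(req.full_url)
```

Because the API is OpenAI-compatible, existing OpenAI client libraries also work by pointing their base URL at the Baseten endpoint.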
When you need custom behavior like preprocessing, postprocessing, or a model architecture that the built-in engines don’t support, Truss also supports custom Python model code.
Write a Model class with load and predict methods and Truss packages it the same way.
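A minimal sketch of that shape (the class and method names follow Truss's Model interface; the toy "model" here is a stand-in for real weights):

```python
class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Runs once at startup: load weights onto the GPU here.
        # A real implementation would load a checkpoint; this is a stand-in.
        self._model = lambda text: text[::-1]

    def predict(self, model_input):
        # Runs per request: preprocessing, inference, postprocessing.
        text = model_input["prompt"]
        return {"output": self._model(text)}
```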
Most teams start with config-only deployments and add custom code only when they need it.
Your first model
Deploy a model to Baseten with just a config file. No custom code needed.
Inference engines
Baseten optimizes every deployment with an inference engine tuned for your model's architecture. Select the engine that best supports your use case and it handles the low-level performance work: quantization, tensor parallelism, KV cache management, and batching.
Engine-Builder-LLM
Dense text generation models compiled with TensorRT-LLM. Supports lookahead decoding and structured outputs.
BIS-LLM
Large mixture-of-experts models like DeepSeek R1 and Qwen3 MoE with KV-aware routing and distributed inference.
BEI
Embedding, reranking, and classification models with up to 1,400 embeddings per second throughput.
Set the engine in config.yaml, or let Baseten select it automatically based on your model architecture.
Multi-step workflows with Chains
Some applications need more than a single model call. A RAG pipeline retrieves documents, embeds them, and generates a response. An image generation workflow runs a diffusion model, upscales the result, and applies safety filtering. Chains is Baseten's framework for orchestrating these multi-step pipelines. Each step runs on its own hardware with its own dependencies, and Chains manages the data flow between them. Define the pipeline in Python, and Chains deploys, scales, and monitors each step independently.
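As a conceptual sketch in plain Python (not the Chains API itself), the RAG steps above might decompose like this; with Chains, each function would become its own step running on independent hardware. The function names and toy corpus are illustrative:

```python
def retrieve(query: str) -> list[str]:
    # Stand-in for a vector-store lookup over embedded documents.
    corpus = [
        "Baseten deploys models with Truss.",
        "Chains orchestrates pipelines.",
    ]
    words = query.lower().split()
    return [doc for doc in corpus if any(w in doc.lower() for w in words)]

def generate(query: str, docs: list[str]) -> str:
    # Stand-in for an LLM call that conditions on retrieved context.
    context = " ".join(docs)
    return f"Answer to '{query}' using context: {context}"

def rag_pipeline(query: str) -> str:
    # Chains would manage this data flow between independently scaled steps.
    docs = retrieve(query)
    return generate(query, docs)
```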
Training
Baseten also provides training infrastructure for fine-tuning and pre-training. Bring your training scripts (Axolotl, TRL, Megatron, or custom code) and run jobs on H200 or A10G GPUs. Checkpoints sync automatically during training, and you can deploy a fine-tuned model from checkpoint to production endpoint in a single command with truss train deploy_checkpoints.