Baseten is a training and inference platform. You bring a model, whether it’s an open-source LLM from Hugging Face, a fine-tuned checkpoint, or a custom model, and Baseten turns it into a production API endpoint with autoscaling, observability, and optimized serving infrastructure. The platform handles containerization, GPU scheduling across multiple clouds, and engine-level optimizations like TensorRT-LLM compilation, so you can focus on your model and your application. If you want to skip deployment entirely and start making inference calls right now, Model APIs give you OpenAI-compatible endpoints for models like DeepSeek, Qwen, and GLM. Point the OpenAI SDK at Baseten’s URL, and you’re running inference in seconds.

Quickstart: Make your first inference call

Call a model through Model APIs in under two minutes. No deployment, no setup, just an API key and a request.
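
The call itself is standard OpenAI client code. Here is a minimal sketch; the base URL, model slug, and API key shown are illustrative assumptions, so check the Model APIs documentation for the current values for your workspace:

from openai import OpenAI

# Point the OpenAI SDK at Baseten's Model APIs endpoint.
# The base URL and model slug below are placeholders for illustration.
client = OpenAI(
    api_key="YOUR_BASETEN_API_KEY",
    base_url="https://inference.baseten.co/v1",
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": "What is Baseten?"}],
)
print(response.choices[0].message.content)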

How models get deployed

The most common way to deploy a model on Baseten is with Truss, an open-source framework that packages your model into a deployable container. For supported architectures, which cover most popular open-source LLMs, embedding models, and image generators, deployment requires only a config.yaml file. You specify the model, the hardware, and the engine, and Truss handles the rest.
model_name: Qwen-2.5-3B
resources:
  accelerator: L4
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen2.5-3B-Instruct"
Run truss push, and Baseten builds a TensorRT-LLM-optimized container, deploys it to GPU infrastructure, and gives you an endpoint. The model serves an OpenAI-compatible API out of the box. When you need custom behavior like preprocessing, postprocessing, or a model architecture that the built-in engines don’t support, Truss also supports custom Python model code. You write a Model class with load and predict methods, and Truss packages it the same way. Most teams start with config-only deployments and add custom code only when they need it.
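
When you do need custom code, the model skeleton is small. Below is a minimal sketch of a custom model.py; the Hugging Face pipeline and the input/output keys are illustrative assumptions, not a prescribed interface for your model:

# model/model.py in a Truss -- minimal custom model sketch.
from transformers import pipeline


class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Runs once per replica at startup; load weights here.
        # Model choice is illustrative.
        self._model = pipeline(
            "text-generation", model="Qwen/Qwen2.5-3B-Instruct"
        )

    def predict(self, model_input: dict) -> dict:
        # Runs per request; add pre- and postprocessing as needed.
        prompt = model_input["prompt"]
        output = self._model(prompt, max_new_tokens=128)
        return {"completion": output[0]["generated_text"]}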

Your first model

Deploy a model to Baseten with just a config file. No custom code needed.

Inference engines

Baseten optimizes every deployment with an inference engine tuned to your model’s architecture. The engine handles the low-level performance work: quantization, tensor parallelism, KV cache management, and batching. You choose the engine that best fits your use case through a field in your config.yaml (the trt_llm block in the example above selects TensorRT-LLM), or let Baseten select one automatically based on your model architecture.

Multi-step workflows with Chains

Some applications need more than a single model call. A RAG pipeline retrieves documents, embeds them, and generates a response. An image generation workflow runs a diffusion model, upscales the result, and applies safety filtering. Chains is Baseten’s framework for orchestrating these multi-step pipelines. Each step runs on its own hardware with its own dependencies, and Chains manages the data flow between them. You define the pipeline in Python, and Chains deploys, scales, and monitors each step independently.
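
A Chain is defined as plain Python classes. Here is a minimal two-step sketch using the truss_chains SDK; the chainlet names and the placeholder embedding logic are illustrative assumptions, not a complete RAG pipeline:

import truss_chains as chains


class Embedder(chains.ChainletBase):
    # Each chainlet can declare its own hardware and dependencies.
    def run_remote(self, text: str) -> list[float]:
        # Placeholder embedding logic for illustration.
        return [float(len(text))]


@chains.mark_entrypoint
class RAGPipeline(chains.ChainletBase):
    def __init__(self, embedder: Embedder = chains.depends(Embedder)) -> None:
        self._embedder = embedder

    def run_remote(self, query: str) -> str:
        embedding = self._embedder.run_remote(query)
        # Retrieval and generation steps would follow here.
        return f"Embedded query with {len(embedding)} dimensions"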

Training

Baseten also provides training infrastructure for fine-tuning and pre-training. You bring your training scripts (Axolotl, TRL, Megatron, or custom code) and run jobs on H100, H200, or A10G GPUs. Checkpoints sync automatically during training, and you can deploy a fine-tuned model from checkpoint to production endpoint in a single command with truss train deploy_checkpoints.

Production infrastructure

Every deployment on Baseten runs on autoscaling infrastructure that adjusts replicas based on traffic. You configure minimum and maximum replicas, concurrency targets, and scale-down delays. Or use the defaults, which handle most workloads well. Models can scale to zero when idle, eliminating costs during quiet periods, and scale up within seconds when traffic arrives.

Baseten schedules workloads across multiple cloud providers and regions through Multi-Cloud Capacity Management. This means your models stay available even during provider-level disruptions, and traffic routes to the lowest-latency region automatically.

Built-in observability gives you real-time metrics, logs, and request traces for every deployment. You can export data to tools like Datadog or Prometheus, and debug behavior with full visibility into inputs, outputs, and errors.

Resources