The overview covers Baseten’s capabilities. This page covers the underlying mechanics: how a config file becomes a running endpoint, how Baseten routes requests to your model, how the autoscaler manages capacity, and how you promote a model from development to production.

Multi-cloud Capacity Management (MCM)

Behind every Baseten deployment is our Multi-cloud Capacity Management (MCM) system. MCM acts as the infrastructure control plane, unifying thousands of GPUs across 10+ cloud service providers and multiple geographic regions. When you request a resource—whether an H100 in US-East-1 or a cluster of B200s in a private region—MCM provisions the hardware, configures networking, and monitors health. It abstracts differences between cloud providers to ensure the Baseten Inference Stack runs identically on any underlying infrastructure. This system powers Baseten’s high availability by enabling active-active deployments across different clouds. If a region or provider faces a capacity crunch or outage, MCM rapidly re-routes and re-provisions workloads to maintain service continuity.

The build pipeline

1. Upload project. When you run truss push, the CLI validates your config.yaml, archives your project directory, and uploads it to cloud storage. Baseten receives the archive and starts the build.

2. Process model weights. For Engine-Builder-LLM, Baseten downloads model weights from the source repository (Hugging Face, S3, or GCS) and compiles them with TensorRT-LLM. The compilation step builds optimized CUDA kernels for the target GPU architecture, applies quantization (FP8, FP4) if configured, and sets up tensor parallelism across multiple GPUs.

3. Package and deploy. Baseten packages the compiled engine, runtime configuration, and serving infrastructure into a container, deploys it to GPU infrastructure, and exposes it as an API endpoint.

The truss push command returns once the upload finishes. For engine-based deployments, compilation can take several minutes. Watch progress in the deployment logs or check the dashboard, which shows “Active” when the endpoint is ready for requests. For custom model code deployments, the build is faster: Baseten installs your Python dependencies, packages your Model class into a container, and deploys it. You remain responsible for any inference optimization in custom builds.
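As a sketch of what the custom-code path packages, here is a minimal Model class. The Hugging Face pipeline, model name, and input schema are illustrative choices for this example, not requirements of the platform:

    # model/model.py -- sketch of a custom-code Model class (illustrative model
    # and input schema).
    from transformers import pipeline

    class Model:
        def __init__(self, **kwargs):
            self._pipeline = None

        def load(self):
            # Runs once per replica at startup, before the replica accepts traffic.
            self._pipeline = pipeline(
                "text-classification",
                model="distilbert-base-uncased-finetuned-sst-2-english",
            )

        def predict(self, model_input):
            # Receives the request body as JSON and returns JSON-serializable output.
            return self._pipeline(model_input["text"])

On push, Baseten installs the declared Python dependencies, bakes this class into a container, and wires its predict method to the deployment's predict endpoint.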

Request routing

Each deployment gets a dedicated subdomain: https://model-{model_id}.api.baseten.co/. The URL path determines which deployment handles the request. Requests to /production/predict go to the production environment, while /development/predict goes to the development deployment. You can also target a specific deployment by ID or a custom environment by name.

Once the environment is resolved, the load balancer routes the request to an active replica. If the model has scaled to zero, Baseten spins up a replica and queues the request until the model loads and becomes ready. The caller receives the response regardless of whether the model was warm or cold.

Engine-based deployments serve an OpenAI-compatible API at the /v1/chat/completions path, so any code written for the OpenAI SDK works without modification. Custom model deployments use the predict API, which accepts and returns arbitrary JSON.

For long-running workloads, async requests return a request ID immediately. The request enters a queue managed by an async request service. A background worker then calls your model and delivers the result via webhook. Sync requests take priority over async requests when competing for concurrency slots to prevent background work from starving real-time traffic.
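As a concrete sketch of the predict API, the call below targets a custom model deployment's production environment. The model ID and payload are placeholders, and the Api-Key authorization header format is an assumption; copy the exact invocation snippet from your model's dashboard.

    import os

    import requests

    MODEL_ID = "abcd1234"  # placeholder; use your deployment's model ID
    API_KEY = os.environ["BASETEN_API_KEY"]

    # Synchronous call to the production environment's predict endpoint.
    resp = requests.post(
        f"https://model-{MODEL_ID}.api.baseten.co/production/predict",
        headers={"Authorization": f"Api-Key {API_KEY}"},  # header format assumed
        json={"text": "hello"},  # arbitrary JSON, defined by your Model.predict
    )
    resp.raise_for_status()
    print(resp.json())

Swapping /production/ for /development/ in the URL sends the same request to the development deployment instead.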

Autoscaling

Baseten’s autoscaler watches in-flight request counts and adjusts replicas to maintain each one near its concurrency target.

Scaling up is immediate. When average utilization crosses the target threshold (default 70%) within the autoscaling window (default 60 seconds), the autoscaler adds replicas up to the configured maximum.

Scaling down is deliberately slow. When traffic drops, the autoscaler flags excess replicas for removal but keeps them alive for a configurable delay (default 900 seconds). It uses exponential backoff: removing half the excess replicas, waiting, and then removing half again. This prevents the cluster from thrashing during bursty traffic.

Setting min_replica to 0 enables scale-to-zero. The model stops incurring GPU cost when idle, but the next request triggers a cold start. Setting min_replica to 1 or higher keeps warm capacity ready at all times, trading cost for lower latency.
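The scale-down behavior reads naturally as a halving loop. The sketch below is only an illustration of the backoff described above, not Baseten's autoscaler code; the function name and defaults simply mirror the text.

    import math
    import time

    def scale_down(current_replicas, target_replicas, delay_seconds=900):
        # Illustrative only: wait out the scale-down delay, remove half of the
        # remaining excess replicas, and repeat until the target is reached.
        while current_replicas > target_replicas:
            time.sleep(delay_seconds)
            excess = current_replicas - target_replicas
            current_replicas -= math.ceil(excess / 2)
        return current_replicas

Starting from 10 replicas with a target of 2, the count steps down 10 → 6 → 4 → 3 → 2 rather than dropping all at once, which is what keeps bursty traffic from thrashing the cluster.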

Cold starts and the weight delivery network

The slowest part of a cold start is loading model weights, which can reach hundreds of gigabytes. Baseten addresses this with the Baseten Delivery Network (BDN), a multi-tier caching system for model weights. When you first deploy, BDN mirrors your model weights from the source repository to Baseten’s own blob storage. After that, no cold start depends on an upstream service like Hugging Face or S3.

When a new replica starts, the BDN agent on the node fetches a manifest for the weights, downloads them through an in-cluster cache (shared across all pods in the cluster), and stores them in a node-level cache (shared across all replicas on the same node). Identical files across different models are deduplicated, so a GLM fine-tune that shares most weights with the base model only downloads the delta.

Subsequent cold starts on the same node or in the same cluster are significantly faster than the first. Container images use streaming, so the model begins loading weights before the image download completes.
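The deduplication step can be pictured as a content-addressed cache. The sketch below is a simplification, not the BDN agent's implementation; the manifest format, cache path, and download callable are assumptions made for illustration.

    from pathlib import Path

    NODE_CACHE = Path("/cache/weights")  # stand-in for the node-level cache

    def fetch_weights(manifest, download_blob):
        """Fetch one model's weight files, reusing anything already cached.

        manifest: iterable of (filename, content_hash) pairs (format assumed).
        download_blob: callable returning a blob's bytes for a content hash,
        e.g. by way of the in-cluster cache.
        """
        NODE_CACHE.mkdir(parents=True, exist_ok=True)
        local_files = {}
        for filename, content_hash in manifest:
            cached = NODE_CACHE / content_hash
            if not cached.exists():
                # Only blobs the node has never seen are downloaded; files shared
                # with another model (say, a fine-tune's unchanged layers) are reused.
                cached.write_bytes(download_blob(content_hash))
            local_files[filename] = cached
        return local_files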

Environments and promotion

Every model starts with a development deployment: a single replica with scale-to-zero enabled and live reload for fast iteration. When the model is ready for production traffic, promote it to an environment.

The production environment exists by default. You can create additional environments, like staging, shadow, or canary, for testing and gradual rollouts. Each environment has a stable endpoint URL, its own autoscaling settings, and dedicated metrics. The endpoint URL remains constant when you promote new deployments, so your application code doesn’t need to change.

Promotion replaces the current deployment in an environment with the new one. The new deployment inherits the environment’s autoscaling settings. Baseten demotes the previous deployment and scales it to zero, allowing you to roll back by re-promoting it.

You can also push directly to an environment with truss push --environment staging to skip the development stage. Only one promotion can be active per environment at a time to prevent conflicting updates.
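Because environment endpoints are stable, application code can pin one URL per environment and leave it untouched across promotions. A short sketch using the same placeholder model ID as above; the environments/{name} path shown for custom environments is an assumption, so confirm the exact URL in your dashboard.

    MODEL_ID = "abcd1234"  # placeholder

    BASE_URL = f"https://model-{MODEL_ID}.api.baseten.co"

    # These URLs stay constant when a new deployment is promoted into the
    # environment, so nothing in the application changes on promotion.
    PRODUCTION_PREDICT = f"{BASE_URL}/production/predict"
    STAGING_PREDICT = f"{BASE_URL}/environments/staging/predict"  # path assumed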