Autoscaling is a control loop that adjusts the number of replicas backing a deployment based on demand. The goal is to balance performance (latency and throughput) against cost (GPU hours). Autoscaling is reactive by nature: it responds to traffic it has already observed rather than predicting future load.
Baseten provides default settings that work for most workloads. Tune your autoscaling settings based on your model and traffic.
| Parameter | Default | Range | What it controls |
| --- | --- | --- | --- |
| Min replicas | 0 | ≥ 0 | Baseline capacity (0 = scale to zero). |
| Max replicas | 1 | ≥ 1 | Cost/capacity ceiling. |
| Autoscaling window | 60s | 10–3600s | Time window for traffic analysis. |
| Scale-down delay | 900s | 0–3600s | Wait time before removing idle replicas. |
| Concurrency target | 1 | ≥ 1 | Requests per replica before scaling. |
| Target utilization | 70% | 1–100% | Headroom before scaling triggers. |
Configure autoscaling settings through the Baseten UI or API:
  1. Select your deployment.
  2. Under Replicas for your production environment, choose Configure.
  3. Configure the autoscaling settings and choose Update.

(Screenshot: UI view to configure autoscaling)

How autoscaling works

When the average number of in-flight requests per active replica exceeds the concurrency target × target utilization over the autoscaling window, replicas are added until:
  • The concurrency target is met.
  • The maximum replica count is reached.
When traffic drops below the concurrency target, excess replicas are flagged for removal. The scale-down delay ensures replicas are not removed prematurely:
  • If traffic returns before the delay ends, replicas remain active.
  • Scale-down uses exponential back-off: cut half the excess replicas, wait, then cut half again.
  • Scaling stops when the minimum replica count is reached.
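
As a rough sketch, the decision logic above can be expressed in Python. This is an illustrative model only, not Baseten's implementation: it folds target utilization into the per-replica threshold to match the trigger condition described above, and all names are invented.

```python
import math

def desired_replicas(in_flight: float, concurrency_target: int,
                     target_utilization: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed to keep each one below its effective threshold."""
    threshold = concurrency_target * target_utilization  # e.g. 10 * 0.70 = 7
    needed = math.ceil(in_flight / threshold)
    return max(min_replicas, min(needed, max_replicas))  # clamp to configured bounds

def scale_down_step(current: int, desired: int) -> int:
    """One exponential back-off step: remove half the excess, then wait."""
    excess = current - desired
    if excess <= 0:
        return current
    return current - math.ceil(excess / 2)
```

For example, 35 in-flight requests with a concurrency target of 10 at 70% utilization yields ceiling(35 / 7) = 5 replicas; scaling down from 10 replicas to a desired 2 proceeds 10 → 6 → 4 → 3 → 2, one step per delay period.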

Replicas

Replicas are individual instances of your model, each capable of serving requests independently. The autoscaler adjusts the number of replicas based on traffic, but you control the boundaries with minimum and maximum replica settings.
min_replica
integer
default:"0"
The floor for your deployment’s capacity. The autoscaler won’t scale below this number.

Range: ≥ 0

The default of 0 enables scale-to-zero: your deployment costs nothing when idle, but the first request triggers a cold start. For large models, cold starts can take minutes.
For production deployments, set min_replica to at least 2. This provides redundancy if one replica fails and eliminates cold starts.
max_replica
integer
default:"1"
The ceiling for your deployment’s capacity. The autoscaler won’t scale above this number.

Range: ≥ 1

This setting protects against runaway scaling and unexpected costs. If traffic exceeds max replica capacity, requests queue rather than triggering new replicas. The default of 1 means no autoscaling: exactly one replica regardless of load.

Estimate max replicas as:

`(peak_requests_per_second / throughput_per_replica) + buffer`
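
As a quick back-of-the-envelope example with hypothetical numbers (50 requests/s at peak, 5 requests/s sustained per replica, a buffer of 2):

```python
import math

peak_requests_per_second = 50  # hypothetical peak load
throughput_per_replica = 5     # hypothetical requests/s one replica sustains
buffer = 2                     # hypothetical spare capacity for spikes and failover

max_replica = math.ceil(peak_requests_per_second / throughput_per_replica) + buffer
print(max_replica)  # 12
```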
For high-volume workloads requiring guaranteed capacity, contact Baseten about reserved capacity options.

Scaling triggers

Scaling triggers determine when the autoscaler adds or removes capacity. The two key settings, concurrency target and target utilization, work together to define when your deployment needs more or fewer replicas.
concurrency_target
integer
default:"1"
How many requests each replica can handle simultaneously. This directly determines replica count for a given load.

Range: ≥ 1

The autoscaler calculates desired replicas as:

`ceiling(in_flight_requests / concurrency_target)`

In-flight requests are requests sent to your model that haven’t returned a response (for streaming, until the stream completes).

The default of 1 is appropriate for models that process one request at a time (like image generation consuming all GPU memory). For models with batching (LLMs, embeddings), higher values reduce cost.

Tradeoff: Higher concurrency = fewer replicas (lower cost) but more per-replica queueing (higher latency). Lower concurrency = more replicas (higher cost) but less queueing (lower latency).
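
To see how much this setting matters, here is the desired-replica formula applied to a hypothetical load of 100 in-flight requests at a few concurrency targets:

```python
import math

in_flight_requests = 100  # hypothetical steady load

for concurrency_target in (1, 32, 128):
    replicas = math.ceil(in_flight_requests / concurrency_target)
    print(f"concurrency_target={concurrency_target}: {replicas} replicas")

# concurrency_target=1: 100 replicas
# concurrency_target=32: 4 replicas
# concurrency_target=128: 1 replica
```

The same load that needs 100 single-request replicas fits on 4 replicas batching 32 requests each, which is why batched engines are far cheaper to run.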
Starting points by model type:
| Model type | Starting concurrency |
| --- | --- |
| Standard Truss model | 1 |
| vLLM / LLM inference | 32–128 |
| SGLang | 32 |
| Text embeddings (TEI) | 32 |
| BEI embeddings | 96+ (min ≥ 8) |
| Whisper (async batch) | 256 |
| Image generation (SDXL) | 1 |
For engine-specific guidance, see Autoscaling engines.
Concurrency target controls requests sent to a replica and triggers autoscaling. predict_concurrency (Truss config.yaml) controls requests processed inside the container. Concurrency target should be less than or equal to predict_concurrency. See Request concurrency for details.
target_utilization_percentage
integer
default:"70"
Headroom before scaling triggers. The autoscaler scales when utilization reaches this percentage of the concurrency target.

Range: 1–100%

The effective threshold is:

`concurrency_target × target_utilization`

With concurrency target 10 and utilization 70%, scaling triggers at 7 concurrent requests (10 × 0.70), leaving 30% headroom.

Lower values (50–60%) provide more headroom for spikes but cost more. Higher values (80%+) are cost-efficient for steady traffic but absorb spikes less effectively.
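
The worked example above as a one-line check (a sketch; the windowed average is a hypothetical measurement):

```python
concurrency_target = 10
target_utilization = 0.70
in_flight_per_replica = 7.2  # hypothetical average over the autoscaling window

scale_up = in_flight_per_replica >= concurrency_target * target_utilization
print(scale_up)  # True: 7.2 exceeds the effective threshold of 7
```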
Target utilization is not GPU utilization. It measures request slot usage relative to your concurrency target, not hardware utilization.

Scaling dynamics

Scaling dynamics control how quickly and smoothly the autoscaler responds to traffic changes. These settings help you balance responsiveness against stability.
autoscaling_window
integer
default:"60"
How far back (in seconds) the autoscaler looks when measuring traffic. Traffic is averaged over this window to make scaling decisions.

Range: 10–3600 seconds

A 60-second window considers average load over the past minute, smoothing out momentary spikes. Shorter windows (30–60s) react quickly to traffic changes. Longer windows (2–5 min) ignore short-lived fluctuations and prevent chasing noise.
scale_down_delay
integer
default:"900"
How long (in seconds) the autoscaler waits after load drops before removing replicas. Prevents premature scale-down during temporary dips.

Range: 0–3600 seconds

When load drops, the autoscaler starts a countdown. If load stays low for the full delay, it removes replicas using exponential back-off (half the excess, wait, half again).

This is your primary lever for preventing oscillation (thrashing). If replicas repeatedly scale up and down, increase this first.
A short window with a long delay gives you fast scale-up while maintaining capacity during temporary dips. This is a good starting configuration for most workloads.
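
The interplay between the window and the delay can be sketched as a small helper: the countdown starts when windowed load drops below the threshold and resets the moment traffic returns. This is an illustrative model with invented names, not Baseten's implementation.

```python
class ScaleDownTimer:
    """Tracks how long load has stayed below the scale-down threshold."""

    def __init__(self, delay_seconds: float):
        self.delay = delay_seconds
        self.low_since = None  # timestamp when load first dropped, if it has

    def should_scale_down(self, now: float, load_is_low: bool) -> bool:
        if not load_is_low:
            self.low_since = None  # traffic returned: reset the countdown
            return False
        if self.low_since is None:
            self.low_since = now   # load just dropped: start the countdown
        return now - self.low_since >= self.delay
```

With a 900-second delay, any dip shorter than 15 minutes leaves capacity untouched, because the countdown resets as soon as load rises again.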

Development deployments

Development deployments have fixed replica limits but allow modification of other autoscaling settings. The replica constraints are optimized for the development workflow of rapid iteration with live reloading via the truss watch command, rather than for handling production traffic.
| Setting | Value | Modifiable |
| --- | --- | --- |
| Min replicas | 0 | No |
| Max replicas | 1 | No |
| Autoscaling window | 60 seconds | Yes |
| Scale-down delay | 900 seconds | Yes |
| Concurrency target | 1 | Yes |
| Target utilization | 70% | Yes |
The single-replica limit means development deployments aren’t suitable for load testing or handling real traffic. To enable full autoscaling with configurable replica settings, promote the deployment to production.

Troubleshooting

Having issues with autoscaling? See Autoscaling troubleshooting for solutions to common problems like oscillation, slow scale-up, and unexpected costs.