Autoscaling is a control loop that adjusts the number of replicas backing a deployment based on demand. The goal is to balance performance (latency and throughput) against cost (GPU hours). Autoscaling is reactive by nature: it responds to traffic it has already observed rather than predicting future load.
Baseten provides default settings that work for most workloads. Tune your autoscaling settings based on your model and traffic.
| Parameter | Default | Range | What it controls |
| --- | --- | --- | --- |
| Min replicas | 0 | ≥ 0 | Baseline capacity (0 = scale to zero). |
| Max replicas | 1 | ≥ 1 | Cost/capacity ceiling. |
| Autoscaling window | 60s | 10–3600s | Time window for traffic analysis. |
| Scale-down delay | 900s | 0–3600s | Wait time before removing idle replicas. |
| Concurrency target | 1 | ≥ 1 | Requests per replica before scaling. |
| Target utilization | 70% | 1–100% | Headroom before scaling triggers. |
Configure autoscaling settings through the Baseten UI or API:
  1. Select your deployment.
  2. Under Replicas for your production environment, choose Configure.
  3. Configure the autoscaling settings and choose Update.

(Screenshot: UI view to configure autoscaling)

How autoscaling works

When the average number of in-flight requests per active replica exceeds the concurrency target × target utilization over the autoscaling window, replicas are added until:
  • The concurrency target is met.
  • The maximum replica count is reached.
When traffic drops below the concurrency target, excess replicas are flagged for removal. The scale-down delay ensures replicas are not removed prematurely:
  • If traffic returns before the delay ends, replicas remain active.
  • Scale-down uses exponential back-off: cut half the excess replicas, wait, then cut half again.
  • Scaling stops when the minimum replica count is reached.
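
As a rough sketch, the decision logic above can be expressed in Python. This is an illustrative model only, not Baseten's implementation: it folds target utilization into the per-replica threshold to match the trigger condition described above, and all names are invented.

```python
import math

def desired_replicas(in_flight: float, concurrency_target: int,
                     target_utilization: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed to keep each one below its effective threshold."""
    threshold = concurrency_target * target_utilization  # e.g. 10 * 0.70 = 7
    needed = math.ceil(in_flight / threshold)
    return max(min_replicas, min(needed, max_replicas))  # clamp to configured bounds

def scale_down_step(current: int, desired: int) -> int:
    """One exponential back-off step: remove half the excess, then wait."""
    excess = current - desired
    if excess <= 0:
        return current
    return current - math.ceil(excess / 2)
```

For example, 35 in-flight requests with a concurrency target of 10 at 70% utilization yields ceiling(35 / 7) = 5 replicas; scaling down from 10 replicas to a desired 2 proceeds 10 → 6 → 4 → 3 → 2, one step per delay period.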

Replicas

Replicas are individual instances of your model, each capable of serving requests independently. The autoscaler adjusts the number of replicas based on traffic, but you control the boundaries with minimum and maximum replica settings.
min_replica
integer
default:"0"
The floor for your deployment’s capacity. The autoscaler won’t scale below this number.

Range: ≥ 0

The default of 0 enables scale-to-zero: your deployment costs nothing when idle, but the first request triggers a cold start. For large models, cold starts can take minutes.
For production deployments, set min_replica to at least 2. This provides redundancy if one replica fails and eliminates cold starts.
max_replica
integer
default:"1"
The ceiling for your deployment’s capacity. The autoscaler won’t scale above this number.

Range: ≥ 1

This setting protects against runaway scaling and unexpected costs. If traffic exceeds max replica capacity, requests queue rather than triggering new replicas. The default of 1 means no autoscaling: exactly one replica regardless of load.

Estimate max replicas as:

`(peak_requests_per_second / throughput_per_replica) + buffer`
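
As a quick back-of-the-envelope example with hypothetical numbers (50 requests/s at peak, 5 requests/s sustained per replica, a buffer of 2):

```python
import math

peak_requests_per_second = 50  # hypothetical peak load
throughput_per_replica = 5     # hypothetical requests/s one replica sustains
buffer = 2                     # hypothetical spare capacity for spikes and failover

max_replica = math.ceil(peak_requests_per_second / throughput_per_replica) + buffer
print(max_replica)  # 12
```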
For high-volume workloads requiring guaranteed capacity, contact Baseten about reserved capacity options.

Scaling triggers

Scaling triggers determine when the autoscaler adds or removes capacity. The two key settings, concurrency target and target utilization, work together to define when your deployment needs more or fewer replicas.
concurrency_target
integer
default:"1"
How many requests each replica can handle simultaneously. This directly determines replica count for a given load.

Range: ≥ 1

The autoscaler calculates desired replicas as:

`ceiling(in_flight_requests / concurrency_target)`

In-flight requests are requests sent to your model that haven’t returned a response (for streaming, until the stream completes).

The default of 1 is appropriate for models that process one request at a time (like image generation consuming all GPU memory). For models with batching (LLMs, embeddings), higher values reduce cost.

Tradeoff: Higher concurrency = fewer replicas (lower cost) but more per-replica queueing (higher latency). Lower concurrency = more replicas (higher cost) but less queueing (lower latency).
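
To see how much this setting matters, here is the desired-replica formula applied to a hypothetical load of 100 in-flight requests at a few concurrency targets:

```python
import math

in_flight_requests = 100  # hypothetical steady load

for concurrency_target in (1, 32, 128):
    replicas = math.ceil(in_flight_requests / concurrency_target)
    print(f"concurrency_target={concurrency_target}: {replicas} replicas")

# concurrency_target=1: 100 replicas
# concurrency_target=32: 4 replicas
# concurrency_target=128: 1 replica
```

The same load that needs 100 single-request replicas fits on 4 replicas batching 32 requests each, which is why batched engines are far cheaper to run.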
Starting points by model type:
| Model type | Starting concurrency |
| --- | --- |
| Standard Truss model | 1 |
| vLLM / LLM inference | 32–128 |
| SGLang | 32 |
| Text embeddings (TEI) | 32 |
| BEI embeddings | 96+ (min ≥ 8) |
| Whisper (async batch) | 256 |
| Image generation (SDXL) | 1 |
For engine-specific guidance, see Autoscaling engines.
Concurrency target controls requests sent to a replica and triggers autoscaling. predict_concurrency (Truss config.yaml) controls requests processed inside the container. Concurrency target should be less than or equal to predict_concurrency. See Request concurrency for details.
target_utilization_percentage
integer
default:"70"
Headroom before scaling triggers. The autoscaler scales when utilization reaches this percentage of the concurrency target.

Range: 1–100%

The effective threshold is:

`concurrency_target × target_utilization`

With concurrency target 10 and utilization 70%, scaling triggers at 7 concurrent requests (10 × 0.70), leaving 30% headroom.

Lower values (50–60%) provide more headroom for spikes but cost more. Higher values (80%+) are cost-efficient for steady traffic but absorb spikes less effectively.
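
The worked example above as a one-line check (a sketch; the windowed average is a hypothetical measurement):

```python
concurrency_target = 10
target_utilization = 0.70
in_flight_per_replica = 7.2  # hypothetical average over the autoscaling window

scale_up = in_flight_per_replica >= concurrency_target * target_utilization
print(scale_up)  # True: 7.2 exceeds the effective threshold of 7
```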
Target utilization is not GPU utilization. It measures request slot usage relative to your concurrency target, not hardware utilization.

Scaling dynamics

Scaling dynamics control how quickly and smoothly the autoscaler responds to traffic changes. These settings help you balance responsiveness against stability.
autoscaling_window
integer
default:"60"
How far back (in seconds) the autoscaler looks when measuring traffic. Traffic is averaged over this window to make scaling decisions.

Range: 10–3600 seconds

A 60-second window considers average load over the past minute, smoothing out momentary spikes. Shorter windows (30–60s) react quickly to traffic changes. Longer windows (2–5 min) ignore short-lived fluctuations and prevent chasing noise.
scale_down_delay
integer
default:"900"
How long (in seconds) the autoscaler waits after load drops before removing replicas. Prevents premature scale-down during temporary dips.

Range: 0–3600 seconds

When load drops, the autoscaler starts a countdown. If load stays low for the full delay, it removes replicas using exponential back-off (half the excess, wait, half again).

This is your primary lever for preventing oscillation (thrashing). If replicas repeatedly scale up and down, increase this first.
A short window with a long delay gives you fast scale-up while maintaining capacity during temporary dips. This is a good starting configuration for most workloads.
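
The interplay between the window and the delay can be sketched as a small helper: the countdown starts when windowed load drops below the threshold and resets the moment traffic returns. This is an illustrative model with invented names, not Baseten's implementation.

```python
class ScaleDownTimer:
    """Tracks how long load has stayed below the scale-down threshold."""

    def __init__(self, delay_seconds: float):
        self.delay = delay_seconds
        self.low_since = None  # timestamp when load first dropped, if it has

    def should_scale_down(self, now: float, load_is_low: bool) -> bool:
        if not load_is_low:
            self.low_since = None  # traffic returned: reset the countdown
            return False
        if self.low_since is None:
            self.low_since = now   # load just dropped: start the countdown
        return now - self.low_since >= self.delay
```

With a 900-second delay, any dip shorter than 15 minutes leaves capacity untouched, because the countdown resets as soon as load rises again.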

Development deployments

Development deployments have fixed replica limits but allow modification of other autoscaling settings. The replica constraints are optimized for the development workflow of rapid iteration with live reloading via the truss watch command, rather than for handling production traffic.
| Setting | Value | Modifiable |
| --- | --- | --- |
| Min replicas | 0 | No |
| Max replicas | 1 | No |
| Autoscaling window | 60 seconds | Yes |
| Scale-down delay | 900 seconds | Yes |
| Concurrency target | 1 | Yes |
| Target utilization | 70% | Yes |
The single-replica limit means development deployments aren’t suitable for load testing or handling real traffic. To enable full autoscaling with configurable replica settings, promote the deployment to production.

Troubleshooting

Having issues with autoscaling? See Autoscaling troubleshooting for solutions to common problems like oscillation, slow scale-up, and unexpected costs.