Reference
Baseten provides default settings that work for most workloads.
Tune your autoscaling settings based on your model and traffic.
| Parameter | Default | Range | What it controls |
|---|---|---|---|
| Min replicas | 0 | ≥ 0 | Baseline capacity (0 = scale to zero). |
| Max replicas | 1 | ≥ 1 | Cost/capacity ceiling. |
| Autoscaling window | 60s | 10–3600s | Time window for traffic analysis. |
| Scale-down delay | 900s | 0–3600s | Wait time before removing idle replicas. |
| Concurrency target | 1 | ≥ 1 | Requests per replica before scaling. |
| Target utilization | 70% | 1–100% | Headroom before scaling triggers. |
Autoscaling settings can be configured in the UI, via cURL, or from Python.

To configure in the UI:
- Select your deployment.
- Under Replicas for your production environment, choose Configure.
- Configure the autoscaling settings and choose Update.
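For cURL or Python, the same settings can be updated through the Baseten REST API. The sketch below is illustrative only: it assumes the deployment autoscaling-settings endpoint and the field names from the parameter table above, so confirm both against the API reference linked under Next steps.

```python
import os
import requests

# Illustrative sketch: update a deployment's autoscaling settings via the
# Baseten REST API. Endpoint path and field names are assumptions based on
# the parameter table above; verify them in the API reference.
API_KEY = os.environ["BASETEN_API_KEY"]
MODEL_ID = "your-model-id"            # placeholder
DEPLOYMENT_ID = "your-deployment-id"  # placeholder

resp = requests.patch(
    f"https://api.baseten.co/v1/models/{MODEL_ID}/deployments/{DEPLOYMENT_ID}/autoscaling_settings",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={
        "min_replica": 2,           # floor: redundancy, no cold starts
        "max_replica": 8,           # ceiling: cost protection
        "autoscaling_window": 60,   # seconds of traffic to average
        "scale_down_delay": 900,    # seconds to wait before scaling down
        "concurrency_target": 32,   # in-flight requests per replica
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```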

How autoscaling works
Scale-up: when the average requests per active replica exceed concurrency target × target utilization within the autoscaling window, more replicas are created until:
- The concurrency target is met, or
- The maximum replica count is reached.

Scale-down: when traffic drops, the autoscaler waits for the scale-down delay before removing replicas:
- If traffic returns before the delay ends, replicas remain active.
- Scale-down uses exponential back-off: cut half the excess replicas, wait, then cut half again.
- Scaling stops when the minimum replica count is reached.
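In code, the scale-up decision looks roughly like this (a minimal sketch of the documented behavior, not Baseten's actual implementation):

```python
import math

def desired_replicas(in_flight: int, concurrency_target: int,
                     target_utilization: float,
                     current: int, min_replicas: int, max_replicas: int) -> int:
    """Illustrative scaling decision; not Baseten's actual implementation."""
    # Scale up when average load per replica crosses the effective threshold.
    threshold = concurrency_target * target_utilization
    if current > 0 and in_flight / current > threshold:
        # Add replicas until each one is back at (or under) the concurrency target.
        needed = math.ceil(in_flight / concurrency_target)
        return max(current, min(needed, max_replicas))
    # Scale-down happens only after the scale-down delay elapses (not shown),
    # and never drops below the minimum replica count.
    return max(current, min_replicas)

# Example: 40 in-flight requests, target 10, utilization 70%, 3 active replicas.
# 40 / 3 ≈ 13.3 > 7, so the autoscaler grows toward ceil(40 / 10) = 4 replicas.
print(desired_replicas(40, 10, 0.70, current=3, min_replicas=1, max_replicas=8))  # -> 4
```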
Replicas
Replicas are individual instances of your model, each capable of serving requests independently. The autoscaler adjusts the number of replicas based on traffic, but you control the boundaries with the minimum and maximum replica settings.

Min replicas
The floor for your deployment's capacity. The autoscaler won't scale below this number.
Range: ≥ 0
The default of 0 enables scale-to-zero: your deployment costs nothing when idle, but the first request triggers a cold start. For large models, cold starts can take minutes.
For production deployments, set min_replica to at least 2. This provides redundancy if one replica fails and eliminates cold starts.

Max replicas
The ceiling for your deployment's capacity. The autoscaler won't scale above this number.
Range: ≥ 1
This setting protects against runaway scaling and unexpected costs. If traffic exceeds max replica capacity, requests queue rather than triggering new replicas. The default of 1 means no autoscaling: exactly one replica regardless of load.
To estimate the max replicas you need, start from your peak load, as sketched below.
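One rule of thumb (an assumption here, not an official Baseten formula) applies Little's law: peak concurrent requests ≈ peak requests per second × average request latency, then divide by the concurrency target to get a replica ceiling.

```python
import math

def estimate_max_replicas(peak_rps: float, avg_latency_s: float,
                          concurrency_target: int) -> int:
    """Rule-of-thumb sizing via Little's law; not an official Baseten formula."""
    peak_in_flight = peak_rps * avg_latency_s  # expected concurrent requests at peak
    return math.ceil(peak_in_flight / concurrency_target)

# Example: 20 req/s at peak, 2 s average latency, concurrency target of 8:
# 40 concurrent requests / 8 per replica -> 5 replicas.
print(estimate_max_replicas(peak_rps=20, avg_latency_s=2.0, concurrency_target=8))  # -> 5
```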
Scaling triggers
Scaling triggers determine when the autoscaler adds or removes capacity. The two key settings, concurrency target and target utilization, work together to define when your deployment needs more or fewer replicas.

Concurrency target
How many requests each replica can handle simultaneously. This directly determines replica count for a given load.
Range: ≥ 1
The autoscaler calculates desired replicas as in-flight requests ÷ concurrency target. In-flight requests are requests sent to your model that haven't returned a response (for streaming, until the stream completes).
The default of 1 is appropriate for models that process one request at a time (like image generation consuming all GPU memory). For models with batching (LLMs, embeddings), higher values reduce cost.
Tradeoff: higher concurrency means fewer replicas (lower cost) but more per-replica queueing (higher latency); lower concurrency means more replicas (higher cost) but less queueing (lower latency). The table below gives starting points by model type, and the sketch after it puts numbers on the tradeoff.
| Model type | Starting concurrency |
|---|---|
| Standard Truss model | 1 |
| vLLM / LLM inference | 32–128 |
| SGLang | 32 |
| Text embeddings (TEI) | 32 |
| BEI embeddings | 96+ (min ≥ 8) |
| Whisper (async batch) | 256 |
| Image generation (SDXL) | 1 |
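To make the tradeoff concrete, here is a small sketch (illustrative figures, not benchmarks) showing how the same load maps to replica counts at different concurrency targets:

```python
import math

load = 64  # in-flight requests, an illustrative figure

# Same load, different concurrency targets: the replicas needed fall as each
# replica absorbs more simultaneous requests (at the price of queueing).
for concurrency_target in (1, 8, 32, 64):
    replicas = math.ceil(load / concurrency_target)
    print(f"target={concurrency_target:>2}: {replicas:>2} replicas")

# target= 1: 64 replicas   (lowest latency, highest cost)
# target= 8:  8 replicas
# target=32:  2 replicas
# target=64:  1 replicas   (lowest cost, most per-replica queueing)
```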
Note: the concurrency target controls requests sent to a replica and triggers autoscaling, while predict_concurrency (in the Truss config.yaml) controls requests processed inside the container. Concurrency target should be less than or equal to predict_concurrency. See Request concurrency for details.
Target utilization
Headroom before scaling triggers. The autoscaler scales when utilization reaches this percentage of the concurrency target.
Range: 1–100%
The effective threshold is concurrency target × target utilization. With concurrency target 10 and utilization 70%, scaling triggers at 7 concurrent requests (10 × 0.70), leaving 30% headroom.
Lower values (50–60%) provide more headroom for spikes but cost more. Higher values (80%+) are cost-efficient for steady traffic but absorb spikes less effectively.
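A quick sketch of how the trigger point moves with utilization, reusing the concurrency target of 10 from the example above:

```python
concurrency_target = 10

# Effective scale-up threshold at a few utilization settings.
for utilization in (0.50, 0.70, 0.90):
    trigger = concurrency_target * utilization
    print(f"{utilization:.0%}: scale-up triggers at {trigger:g} of "
          f"{concurrency_target} in-flight requests")

# 50%: scale-up triggers at 5 of 10 in-flight requests
# 70%: scale-up triggers at 7 of 10 in-flight requests
# 90%: scale-up triggers at 9 of 10 in-flight requests
```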
Scaling dynamics
Scaling dynamics control how quickly and smoothly the autoscaler responds to traffic changes. These settings help you balance responsiveness against stability.

Autoscaling window
How far back (in seconds) the autoscaler looks when measuring traffic. Traffic is averaged over this window to make scaling decisions.
Range: 10–3600 seconds
A 60-second window considers average load over the past minute, smoothing out momentary spikes. Shorter windows (30–60s) react quickly to traffic changes. Longer windows (2–5 min) ignore short-lived fluctuations and prevent chasing noise.

Scale-down delay
How long (in seconds) the autoscaler waits after load drops before removing replicas. Prevents premature scale-down during temporary dips.
Range: 0–3600 seconds
When load drops, the autoscaler starts a countdown. If load stays low for the full delay, it removes replicas using exponential back-off (half the excess, wait, half again), as sketched below.
This is your primary lever for preventing oscillation (thrashing). If replicas repeatedly scale up and down, increase this setting first.
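The back-off behavior can be sketched as follows (illustrative only; the real scheduler is internal to Baseten):

```python
def scale_down_steps(current: int, target: int) -> list[int]:
    """Illustrative exponential back-off: halve the excess each step."""
    steps = []
    while current > target:
        excess = current - target
        current -= max(excess // 2, 1)  # remove half the excess, at least one
        steps.append(current)
    return steps

# Example: scaling from 16 replicas down to a minimum of 2.
# The excess halves each step, with a wait between steps:
# 16 -> 9 -> 6 -> 4 -> 3 -> 2.
print(scale_down_steps(current=16, target=2))  # -> [9, 6, 4, 3, 2]
```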
Development deployments
Development deployments have fixed replica limits but allow modification of other autoscaling settings. The replica constraints are optimized for the development workflow (rapid iteration with live reloading using the truss watch command) rather than production traffic handling.
| Setting | Value | Modifiable |
|---|---|---|
| Min replicas | 0 | No |
| Max replicas | 1 | No |
| Autoscaling window | 60 seconds | Yes |
| Scale-down delay | 900 seconds | Yes |
| Concurrency target | 1 | Yes |
| Target utilization | 70% | Yes |
Next steps
- Traffic patterns: identify your traffic pattern and get recommended starting settings.
- Cold starts: understand cold starts and how to minimize their impact.
- Find your concurrency target: benchmark your model to determine optimal concurrency.
- API reference: complete autoscaling API documentation.