Reference
Baseten provides default settings that work for most workloads.
Tune your autoscaling settings based on your model and traffic.
| Parameter | Default | Range | What it controls |
|---|---|---|---|
| Min replicas | 0 | ≥ 0 | Baseline capacity (0 = scale to zero). |
| Max replicas | 1 | ≥ 1 | Cost/capacity ceiling. |
| Autoscaling window | 60s | 10–3600s | Time window for traffic analysis. |
| Scale-down delay | 900s | 0–3600s | Wait time before removing idle replicas. |
| Concurrency target | 1 | ≥ 1 | Requests per replica before scaling. |
| Target utilization | 70% | 1–100% | Headroom before scaling triggers. |
To configure autoscaling in the UI:

- Select your deployment.
- Under Replicas for your production environment, choose Configure.
- Configure the autoscaling settings and choose Update.

The same settings can also be updated via cURL or the Python client; see the API reference.
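For programmatic updates, the request body is the parameter table's fields serialized as JSON and sent to the deployment's autoscaling endpoint. As an illustrative sketch (the exact endpoint path and field names here are assumptions — confirm them against the autoscaling API reference):

```python
import json

def build_autoscaling_update(min_replica, max_replica,
                             autoscaling_window, scale_down_delay,
                             concurrency_target):
    """Build the JSON body for an autoscaling settings update.

    Field names mirror the parameter table above; verify the exact
    names and the PATCH endpoint in the API reference before use.
    """
    return {
        "min_replica": min_replica,
        "max_replica": max_replica,
        "autoscaling_window": autoscaling_window,
        "scale_down_delay": scale_down_delay,
        "concurrency_target": concurrency_target,
    }

# Example: production-ready settings for an LLM deployment.
payload = build_autoscaling_update(
    min_replica=2, max_replica=10,
    autoscaling_window=60, scale_down_delay=900,
    concurrency_target=32,
)
print(json.dumps(payload, indent=2))
```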

How autoscaling works
The autoscaler monitors in-flight requests across all active replicas. Every autoscaling window (60 seconds by default), it compares the average load per replica against your concurrency target adjusted by target utilization. When that threshold is crossed, the autoscaler adds replicas until the concurrency target is met or the maximum replica count is reached.

Consider a deployment with a concurrency target of 10 and target utilization of 70%. The autoscaler triggers at 7 concurrent requests per replica (10 x 0.70). If traffic jumps from 5 to 25 in-flight requests, the autoscaler calculates that 4 replicas are needed (ceiling of 25 / 7) and begins provisioning them.

Scaling down is deliberately slower. When traffic drops, the autoscaler doesn't remove replicas immediately. Instead, it waits for the scale-down delay (15 minutes by default), then removes half the excess replicas, waits again, and removes half of what remains. This exponential back-off prevents oscillation: if traffic briefly dips and returns, your replicas are still warm. Scaling down stops at the minimum replica count.

Replicas
Each replica is an independent instance of your model, running on its own hardware and capable of serving requests in parallel with other replicas. The autoscaler controls how many replicas are active at any given time, but you set the boundaries.

Min replicas

The floor for your deployment's capacity. The autoscaler won't scale below this number.

Range: ≥ 0

The default of 0 enables scale-to-zero: when no requests arrive for long enough, all replicas shut down and your deployment incurs no charges. The tradeoff is that the next request triggers a cold start, which can take minutes for large models. During that wake-up period, billing is per minute even though the replica isn't yet serving responses.
For production deployments, set min_replica to at least 2. This eliminates cold starts and provides redundancy if one replica fails.

Max replicas

The ceiling for your deployment's capacity. The autoscaler won't scale above this number.

Range: ≥ 1

This setting protects against runaway scaling and unexpected costs. If traffic exceeds what your maximum replicas can handle, requests queue rather than triggering new replicas. The default of 1 effectively disables autoscaling: you get exactly one replica regardless of load.

Estimate max replicas by dividing your expected peak in-flight requests by the effective per-replica threshold (concurrency target x target utilization).
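As a rough way to estimate peak in-flight requests when you only know request rate and latency, Little's law (an illustrative heuristic, not an official Baseten formula) gives in-flight ≈ request rate x average latency:

```python
import math

def estimate_max_replicas(peak_rps, avg_latency_s,
                          concurrency_target, target_utilization=0.7):
    """Back-of-envelope max replica estimate (a sketch, not exact).

    Little's law: expected in-flight requests = arrival rate x latency.
    Each replica triggers scaling at concurrency_target x utilization.
    """
    peak_in_flight = peak_rps * avg_latency_s
    per_replica_threshold = concurrency_target * target_utilization
    return math.ceil(peak_in_flight / per_replica_threshold)

# e.g. 50 req/s at 2 s average latency, concurrency target 32:
# 100 in-flight / 22.4 per replica -> 5 replicas
print(estimate_max_replicas(50, 2.0, 32))
```

Add a safety margin on top of this estimate if your traffic is bursty, since the formula assumes steady-state load.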
Scaling triggers
The autoscaler needs to know when your replicas are "full." Two settings define that threshold: concurrency target sets how many simultaneous requests each replica should handle, and target utilization adds headroom so the autoscaler acts before replicas are completely saturated.

Concurrency target

How many requests each replica can handle simultaneously. This directly determines replica count for a given load.

Range: ≥ 1

The autoscaler calculates desired replicas as the ceiling of in-flight requests divided by the effective threshold (concurrency target x target utilization).

In-flight requests are requests sent to your model that haven't returned a response (for streaming, until the stream completes). This count is exposed as baseten_concurrent_requests in the metrics dashboard and metrics export.

The right value depends on how your model uses hardware. Image generation models that consume all GPU memory per request can only process one at a time, so a concurrency target of 1 is correct. LLMs and embedding models batch requests internally and can handle dozens simultaneously, so higher targets (32 or more) reduce cost by packing more work onto each replica.

Tradeoff: Higher concurrency = fewer replicas (lower cost) but more per-replica queueing (higher latency). Lower concurrency = more replicas (higher cost) but less queueing (lower latency).

| Model type | Starting concurrency |
|---|---|
| Standard Truss model | 1 |
| vLLM / LLM inference | 32–128 |
| SGLang | 32 |
| Text embeddings (TEI) | 32 |
| BEI embeddings | 96+ (min ≥ 8) |
| Whisper (async batch) | 256 |
| Image generation (SDXL) | 1 |
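To make the cost side of that tradeoff concrete, here is the desired-replica calculation described above applied to the same load at two concurrency targets (70% utilization assumed):

```python
import math

def replicas_needed(in_flight, concurrency_target, utilization=0.7):
    # desired replicas = ceil(in-flight / (concurrency target x utilization))
    return math.ceil(in_flight / (concurrency_target * utilization))

# Same 64 in-flight requests, very different replica counts:
print(replicas_needed(64, 1))   # ceil(64 / 0.7)  -> 92 replicas
print(replicas_needed(64, 32))  # ceil(64 / 22.4) -> 3 replicas
```

The second configuration costs a fraction as much, but each replica is queueing and batching up to 32 requests, so per-request latency depends on how well the model batches.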
- Concurrency target controls requests sent to a replica and triggers autoscaling.
- predict_concurrency (Truss config.yaml) controls requests processed inside the container.
- Concurrency target should be less than or equal to predict_concurrency.
See the predict_concurrency field in the Truss configuration reference for details.

Target utilization

Headroom before scaling triggers. The autoscaler scales when utilization reaches this percentage of the concurrency target, not when replicas are fully loaded.

Range: 1–100%

The effective threshold is concurrency target x target utilization. With a concurrency target of 10 and utilization of 70%, scaling triggers at 7 concurrent requests (10 x 0.70), leaving 30% headroom for absorbing spikes while new replicas start.

Lower values (50–60%) provide more headroom for spikes but cost more. Higher values (80%+) are cost-efficient for steady traffic but absorb spikes less effectively.
Scaling dynamics
Once the autoscaler decides to scale, two settings control the pace. The autoscaling window determines how far back the autoscaler looks when measuring traffic, and the scale-down delay determines how long it waits before removing idle replicas. Together, they let you tune the tradeoff between responsiveness and stability.

Autoscaling window

How far back (in seconds) the autoscaler looks when measuring traffic. Traffic is averaged over this window to make scaling decisions.

Range: 10–3600 seconds

A 60-second window smooths out momentary spikes by averaging load over the past minute. Shorter windows (30–60s) react quickly to traffic changes, which suits bursty workloads. Longer windows (2–5 min) ignore short-lived fluctuations and prevent the autoscaler from chasing noise.
Scale-down delay

How long (in seconds) the autoscaler waits after load drops before removing replicas.

Range: 0–3600 seconds

When load drops, the autoscaler starts a countdown. If load stays low for the full delay, it removes replicas using exponential back-off (half the excess, wait, half again). If traffic returns before the countdown finishes, the replicas stay active and the countdown resets.

This is your primary lever for preventing oscillation. If replicas repeatedly scale up and down, increase this value first.
Development deployments
Development deployments are designed for iteration, not production traffic. Replicas are fixed at 0–1 to match the truss watch workflow, where you're testing changes on a single instance rather than handling concurrent users. You can still adjust timing and concurrency settings.
| Setting | Value | Modifiable |
|---|---|---|
| Min replicas | 0 | No |
| Max replicas | 1 | No |
| Autoscaling window | 60 seconds | Yes |
| Scale-down delay | 900 seconds | Yes |
| Concurrency target | 1 | Yes |
| Target utilization | 70% | Yes |
Next steps
Traffic patterns
Identify your traffic pattern and get recommended starting settings.
Cold starts
Understand cold starts and how to minimize their impact.
API reference
Complete autoscaling API documentation.