Autoscaling
Autoscaling dynamically adjusts the number of active replicas to handle variable traffic while minimizing idle compute costs.

Configuring autoscaling
Autoscaling settings are per deployment and are inherited when promoting a model to production unless overridden.
Configure autoscaling through:
- UI → Manage settings in your Baseten workspace.
- API → Use the autoscaling API.
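For instance, a minimal sketch of an API update might look like the following. The endpoint path and field names below are assumptions for illustration only; check the autoscaling API reference for the exact request shape.

```python
import os

import requests

# Placeholder IDs -- substitute your own model and deployment IDs.
MODEL_ID = "YOUR_MODEL_ID"
DEPLOYMENT_ID = "YOUR_DEPLOYMENT_ID"

# NOTE: the endpoint path and field names are assumptions for illustration;
# consult the autoscaling API reference for the exact request shape.
resp = requests.patch(
    f"https://api.baseten.co/v1/models/{MODEL_ID}"
    f"/deployments/{DEPLOYMENT_ID}/autoscaler_settings",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={
        "min_replica": 0,          # scale to zero when idle
        "max_replica": 5,
        "autoscaling_window": 60,  # seconds
        "scale_down_delay": 900,   # seconds (15 minutes)
        "concurrency_target": 2,   # requests per replica
    },
)
resp.raise_for_status()
print(resp.json())
```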
Replica scaling
Each deployment scales within a configured range of replicas:
- Minimum replicas → The lowest number of active replicas.
  - Default: 0 (scale to zero).
  - Maximum value: Cannot exceed the maximum replica count.
- Maximum replicas → The upper limit of active replicas.
  - Default: 1.
  - Max: 10 by default (contact support to increase).

When first deployed, the model starts with 1 replica (or the minimum count, if higher). As traffic increases, additional replicas scale up until the maximum count is reached. When traffic decreases, replicas scale down to match demand.
Autoscaler settings
The autoscaler logic is controlled by three key parameters:
- Autoscaling window → Time window for traffic analysis before scaling up/down. Default: 60 seconds.
- Scale down delay → Time before an unused replica is removed. Default: 900 seconds (15 minutes).
- Concurrency target → Number of concurrent requests each replica should handle; sustained traffic above this triggers scaling. Default: 1 request.
A short autoscaling window with a longer scale-down delay is recommended for fast upscaling while maintaining capacity during temporary dips.
Autoscaling behavior
Scaling up
When the average number of requests per active replica exceeds the concurrency target within the autoscaling window, more replicas are created until:
- The concurrency target is met, or
- The maximum replica count is reached.
Scaling down
When traffic drops below the concurrency target, excess replicas are flagged for removal. The scale-down delay ensures that replicas are not removed prematurely:
- If traffic spikes again before the delay ends, replicas remain active.
- If the minimum replica count is reached, no further scaling down occurs.
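Putting the scale-up and scale-down rules together: the autoscaler aims for the smallest replica count that keeps average concurrency per replica at or below the target, clamped to the configured range. The sketch below is a simplified model of that arithmetic, not Baseten's implementation; it ignores the averaging over the autoscaling window and the scale-down delay.

```python
import math

def desired_replicas(observed_concurrency: float,
                     concurrency_target: int,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Smallest replica count keeping requests-per-replica <= target,
    clamped to [min_replicas, max_replicas]. Simplified illustration only."""
    needed = math.ceil(observed_concurrency / concurrency_target)
    return max(min_replicas, min(max_replicas, needed))

# 10 concurrent requests with a target of 2 per replica -> scale up to 5 replicas
print(desired_replicas(10, 2, min_replicas=0, max_replicas=10))  # 5

# Traffic drops to 1 request -> scale down toward 1 replica
# (in practice, only after the scale-down delay elapses)
print(desired_replicas(1, 2, min_replicas=0, max_replicas=10))   # 1
```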
Scale to zero
If you’re just testing your model or anticipate light and inconsistent traffic, scale to zero can save you substantial amounts of money.
Scale to zero means that when a deployed model is not receiving traffic, it scales down to zero replicas. When the model is called, Baseten spins up a new instance to serve model requests.
To turn on scale to zero, just set a deployment’s minimum replica count to zero. Scale to zero is enabled by default in the standard autoscaling config.
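Continuing with the assumed API shape from the earlier sketch, enabling scale to zero would be a one-field update:

```python
import os

import requests

# Assumed endpoint and field name, as in the earlier sketch; check the
# autoscaling API reference for the exact request shape.
resp = requests.patch(
    "https://api.baseten.co/v1/models/YOUR_MODEL_ID"
    "/deployments/YOUR_DEPLOYMENT_ID/autoscaler_settings",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={"min_replica": 0},  # zero minimum replicas = scale to zero
)
resp.raise_for_status()
```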
Models that have not received any traffic for more than two weeks will be automatically deactivated. These models will need to be activated manually before they can serve requests again.
Cold starts
A cold start is the time required to initialize a new replica when scaling up. Cold starts impact:
- Scaled-to-zero deployments → The first request must wait for a new replica to start.
- Scaling events → When traffic spikes and a deployment requires more replicas.
Cold start optimizations
Network accelerator
Baseten speeds up model loading from Hugging Face, CloudFront, S3, and OpenAI using parallelized byte-range downloads, reducing cold start delays.
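The accelerator itself is internal to Baseten, but the underlying technique is general. As a rough illustration (not Baseten's code), a parallelized byte-range download splits a file into chunks and fetches them concurrently from any server that supports `Range` requests:

```python
import concurrent.futures

import requests

def download_parallel(url: str, num_chunks: int = 8) -> bytes:
    """Illustrative parallel byte-range download (not Baseten's code).

    The server must support byte-range requests (Accept-Ranges: bytes).
    """
    size = int(requests.head(url, allow_redirects=True).headers["Content-Length"])
    chunk = size // num_chunks

    def fetch(i: int) -> bytes:
        start = i * chunk
        end = size - 1 if i == num_chunks - 1 else start + chunk - 1
        resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
        resp.raise_for_status()
        return resp.content

    # Fetch every range concurrently, then reassemble in order.
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_chunks) as pool:
        return b"".join(pool.map(fetch, range(num_chunks)))
```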
Cold start pods
Baseten pre-warms specialized cold start pods to accelerate loading times. These pods appear in logs as `[Coldboost]`.
Autoscaling for development deployments
Development deployments have fixed autoscaling constraints to optimize for live reload workflows:
- **Min replicas:** 0
- **Max replicas:** 1
- **Autoscaling window:** 60 seconds
- **Scale down delay:** 900 seconds (15 min)
- **Concurrency target:** 1 request
To enable full autoscaling, promote the deployment to an environment such as production.