Autoscaling dynamically adjusts the number of active replicas to handle variable traffic while minimizing idle compute costs.
Autoscaling settings are per deployment and are inherited when promoting a model to production unless overridden.
Configure autoscaling through the following settings.
Each deployment scales within a configured range of replicas:

- Minimum replica count: `0` (scale to zero).
- Maximum replica count: `1` to `10` by default (contact support to increase).

When first deployed, the model starts with 1 replica (or the minimum count, if higher). As traffic increases, additional replicas scale up until the maximum count is reached. When traffic decreases, replicas scale down to match demand.
The autoscaler logic is controlled by three key parameters: the autoscaling window, the scale-down delay, and the concurrency target.
A short autoscaling window with a longer scale-down delay is recommended for fast upscaling while maintaining capacity during temporary dips.
When the average requests per active replica exceed the concurrency target within the autoscaling window, more replicas are created until the load per replica falls back to the target or the maximum replica count is reached.
When traffic drops below the concurrency target, excess replicas are flagged for removal. The scale-down delay ensures that replicas are not removed prematurely: if traffic recovers before the delay elapses, the flagged replicas remain available.
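The sizing rule above can be pictured with a short sketch. This is an illustration of the general idea, not Baseten's actual autoscaler; the function and parameter names are invented for the example:

```python
import math

def desired_replicas(avg_requests_per_replica, current_replicas,
                     concurrency_target, min_replicas, max_replicas):
    """Illustrative autoscaler decision: size the fleet so that the average
    concurrent requests per replica stay at or below the concurrency target,
    clamped to the configured replica range."""
    total_load = avg_requests_per_replica * current_replicas
    needed = math.ceil(total_load / concurrency_target)
    return max(min_replicas, min(needed, max_replicas))
```

For example, 2 replicas each averaging 8 concurrent requests against a concurrency target of 4 yields a total load of 16, so the sketch asks for 4 replicas; with a minimum of 0, zero traffic drives the count to zero (subject to the scale-down delay).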
If you’re just testing your model or anticipate light, inconsistent traffic, scale to zero can substantially reduce your compute costs.
Scale to zero means that when a deployed model is not receiving traffic, it scales down to zero replicas. When the model is called, Baseten spins up a new instance to serve model requests.
To turn on scale to zero, just set a deployment’s minimum replica count to zero. Scale to zero is enabled by default in the standard autoscaling config.
Models that have not received any traffic for more than two weeks will be automatically deactivated. These models will need to be activated manually before they can serve requests again.
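The interaction between scale to zero and the scale-down delay can be sketched as follows. This is a simplified model for intuition, not Baseten's implementation, and the names are invented for the example:

```python
def replicas_after_idle(idle_seconds, scale_down_delay, min_replicas, current_replicas):
    """Illustrative rule: a replica flagged for removal is only taken away
    once traffic has stayed below the concurrency target for the full
    scale-down delay. With min_replicas == 0, this is scale to zero."""
    if idle_seconds >= scale_down_delay:
        return min_replicas
    return current_replicas
```

With the standard 900-second delay, a deployment idle for 15 minutes drops to its minimum replica count; a brief dip shorter than the delay leaves the current replicas running.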
A cold start is the time required to initialize a new replica when scaling up. Cold starts impact the latency of requests that arrive while a new replica is initializing.
Network accelerator
Baseten speeds up model loading from Hugging Face, CloudFront, S3, and OpenAI using parallelized byte-range downloads, reducing cold start delays.
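Parallelized byte-range downloading, the general technique behind this accelerator, splits a file into ranges (as used in HTTP `Range: bytes=start-end` headers) and fetches them concurrently. The sketch below illustrates the idea only; it is not Baseten's implementation, and `fetch_range` stands in for an HTTP client issuing range requests:

```python
from concurrent.futures import ThreadPoolExecutor

def split_ranges(total_size, chunk_size):
    """Divide [0, total_size) into (start, end) byte ranges, end inclusive,
    matching the HTTP `Range: bytes=start-end` header format."""
    return [(start, min(start + chunk_size, total_size) - 1)
            for start in range(0, total_size, chunk_size)]

def parallel_fetch(fetch_range, total_size, chunk_size, workers=8):
    """Download all ranges concurrently and reassemble them in order.
    `fetch_range(start, end)` should return the bytes for that range,
    e.g. via an HTTP GET with a Range header."""
    ranges = split_ranges(total_size, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda r: fetch_range(*r), ranges))
    return b"".join(parts)
```

Because each range is independent, the download saturates available bandwidth instead of being bound by a single sequential stream, which is why this shortens cold starts for large model weights.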
Cold start pods
Baseten pre-warms specialized cold start pods to accelerate loading times. These pods appear in logs as `[Coldboost]`.
Development deployments have fixed autoscaling constraints to optimize for live reload workflows:
- Minimum replicas: 0
- Maximum replicas: 1
- Autoscaling window: 60 seconds
- Scale-down delay: 900 seconds (15 min)
- Concurrency target: 1 request
To enable full autoscaling, promote the deployment to an environment like production.