Autoscaling model replicas
Autoscaling lets you handle highly variable traffic while minimizing spend on idle compute resources.
Autoscaling configuration
Autoscaling settings are configurable for each deployment of a model. New production deployments will inherit the autoscaling settings of the previous production deployment or be set to the default configuration if no prior production deployment exists.
Min and max replicas
Every deployment can scale within a range of replicas:
- Minimum count: the deployment will not scale below this many active replicas.
  - Lowest possible value: 0.
  - Default value: 0.
  - Highest possible value: the maximum replica count.
- Maximum count: the deployment will not scale above this many active replicas.
  - Lowest possible value: 1 or the minimum replica count, whichever is greater.
  - Default value: 1.
  - Highest possible value: 10 by default. Contact us to unlock higher replica maximums.
When the model is first deployed, it will be deployed on one replica or the minimum number of replicas, whichever is greater. As it receives traffic, it will scale up to use additional replicas as necessary, up to the maximum replica count, then scale down to fewer replicas as traffic subsides.
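The range rules above can be sketched in a few lines of Python. This is only an illustration of the constraints as described here, not Baseten's implementation: the `ReplicaRange` class, its method names, and the `DEFAULT_MAX_CEILING` constant are hypothetical names chosen for this example.

```python
from dataclasses import dataclass

# Highest maximum replica count available by default, per the limits above.
DEFAULT_MAX_CEILING = 10


@dataclass(frozen=True)
class ReplicaRange:
    """Hypothetical container for a deployment's min/max replica counts."""

    min_replicas: int = 0  # default minimum
    max_replicas: int = 1  # default maximum

    def __post_init__(self):
        if self.min_replicas < 0:
            raise ValueError("minimum replica count cannot be below 0")
        if self.max_replicas < max(1, self.min_replicas):
            raise ValueError("maximum must be at least 1 and at least the minimum")
        if self.max_replicas > DEFAULT_MAX_CEILING:
            raise ValueError("maximums above the default ceiling must be unlocked")

    def initial_replicas(self) -> int:
        # A new deployment starts on one replica or the minimum, whichever is greater.
        return max(1, self.min_replicas)

    def clamp(self, desired: int) -> int:
        # Keep any desired replica count inside the configured range.
        return max(self.min_replicas, min(desired, self.max_replicas))
```

For example, `ReplicaRange(min_replicas=3, max_replicas=8)` starts on 3 replicas and never runs more than 8, while the default range starts on 1 replica and can later scale down to 0.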
Autoscaling parameters
All three configuration options apply when scaling beyond a single replica:
- Scaling delay: The time the autoscaler waits before spinning up or spinning down replicas. Default: 1200 seconds (20 minutes).
- Additional scale down delay: The additional time the autoscaler waits before spinning down a replica. Default: 0 seconds.
- Concurrency target: The number of concurrent requests you want each replica to be responsible for handling. Default: 3 requests.
Together, these three settings let you tune the autoscaler to your specific traffic patterns. The autoscaler adds replicas only when the moving average of requests per replica exceeds the concurrency target for the duration of the scaling delay. When traffic drops, the autoscaler waits for the scaling delay plus the additional scale down delay before removing replicas, which prevents unnecessary cold starts after brief lulls in traffic.
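As a rough illustration of how the three parameters interact, the sketch below computes the wait times and a replica estimate. The `AutoscalingSettings` class and the function names are hypothetical, and the ceiling-division estimate in `replicas_needed` is an assumption made for this example, not a documented formula of the autoscaler.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoscalingSettings:
    """Hypothetical holder for the three parameters, with the documented defaults."""

    scaling_delay_s: int = 1200            # 20 minutes
    additional_scale_down_delay_s: int = 0
    concurrency_target: int = 3


def wait_before_scale_up(settings: AutoscalingSettings) -> int:
    # Scaling up waits for the scaling delay only.
    return settings.scaling_delay_s


def wait_before_scale_down(settings: AutoscalingSettings) -> int:
    # Scaling down waits for the scaling delay plus the additional delay.
    return settings.scaling_delay_s + settings.additional_scale_down_delay_s


def replicas_needed(avg_concurrent_requests: float, settings: AutoscalingSettings) -> int:
    # ASSUMPTION: enough replicas that each handles at most the concurrency target.
    return math.ceil(avg_concurrent_requests / settings.concurrency_target)
```

With the defaults, a sustained average of 7 concurrent requests gives `replicas_needed(7, AutoscalingSettings()) == 3`. Setting `additional_scale_down_delay_s=300` makes the autoscaler wait 1500 seconds before removing a replica, while still waiting only 1200 seconds before adding one.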
Scale to zero
If you’re just testing your model or anticipate light and inconsistent traffic, scale to zero can save you substantial amounts of money.
Scale to zero means that when a deployed model is not receiving traffic, it scales down to zero replicas. When the model is called, Baseten spins up a new instance to serve model requests.
To turn on scale to zero, just set a deployment’s minimum replica count to zero. Scale to zero is enabled by default in the standard autoscaling config.
Cold starts
A “cold start” is the time it takes to spin up a new instance of a model server. Cold starts apply in two situations:
- When a model is scaled to zero and receives a request
- When the number of concurrent requests triggers the autoscaler to increase the number of active replicas
Cold starts are especially noticeable for scaled-to-zero models, as the time to process the first request includes the cold start time. Baseten has heavily invested in reducing cold start times for all models.
Network accelerator
Baseten uses a network accelerator to speed up model loads from common model artifact stores, including HuggingFace, CloudFront, S3, and OpenAI. Our accelerator employs byte range downloads in the background to maximize the parallelism of downloads. This improves cold start times by reducing the amount of time it takes to load model weights and other required data.
Cold start pods
To shorten cold start times, we spin up specially designated pods that accelerate model loading and are not counted toward your ordinary model resources. You may see these pods in your logs and metrics.
Coldboost logs have [Coldboost] as a prefix to signify that a cold start pod is in use:
Oct 09 9:20:25pm [Coldboost] Completed model.load() execution in 12650 ms
Further optimizations
Read our how-to guide for optimizing cold starts to learn how you can edit your Truss and application to reduce the impact of cold starts.
Autoscaling for development deployments
Most autoscaling settings are fixed for development deployments to allow for live reload workflows and a simplified testing setup. The standard configuration is:
- Minimum replicas: 0.
- Maximum replicas: 1.
- Scaling delay: 1200 seconds.
- Additional scale down delay: 0 seconds.
- Concurrency target: 3 requests.
Of these settings, only scaling delay and concurrency target are editable.
To unlock full autoscaling for a development deployment, promote it to production.
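The restriction on development deployments can be pictured as a settings dictionary in which only two keys accept changes. This is a minimal sketch under that reading. The key names, `DEV_DEFAULTS`, `EDITABLE_ON_DEV`, and `update_dev_settings` are all hypothetical and do not correspond to a real Baseten API.

```python
# Standard development-deployment configuration, as listed above.
DEV_DEFAULTS = {
    "min_replicas": 0,
    "max_replicas": 1,
    "scaling_delay_s": 1200,
    "additional_scale_down_delay_s": 0,
    "concurrency_target": 3,
}

# Only these two settings are editable on a development deployment.
EDITABLE_ON_DEV = {"scaling_delay_s", "concurrency_target"}


def update_dev_settings(current: dict, **changes) -> dict:
    """Return a new settings dict, rejecting changes to fixed settings."""
    locked = sorted(set(changes) - EDITABLE_ON_DEV)
    if locked:
        raise ValueError(f"fixed on development deployments: {locked}")
    return {**current, **changes}
```

Calling `update_dev_settings(DEV_DEFAULTS, concurrency_target=5)` returns a copy with the new target, while passing `max_replicas=2` raises a `ValueError`, mirroring the need to promote the deployment to production first.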