Autoscaling lets you handle highly variable traffic while minimizing spend on idle compute resources.

Autoscaling configuration

Autoscaling settings are configurable for each deployment of a model. New production deployments will inherit the autoscaling settings of the previous production deployment or be set to the default configuration if no prior production deployment exists.

Min and max replicas

Every deployment can scale between a range of replicas:

  • Minimum count: the deployment will not scale below this many active replicas.
    • Lowest possible value: 0.
    • Default value: 0.
    • Highest possible value: the maximum replica count.
  • Maximum count: the deployment will not scale above this many active replicas.
    • Lowest possible value: 1 or the minimum replica count, whichever is greater.
    • Default value: 1.
    • Highest possible value: 10 by default. Contact us to unlock higher replica maximums.

When the model is first deployed, it will be deployed on one replica or the minimum number of replicas, whichever is greater. As it receives traffic, it will scale up to use additional replicas as necessary, up to the maximum replica count, then scale down to fewer replicas as traffic subsides.
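
As a minimal sketch of this behavior (illustrative Python, not Baseten's implementation), the replica count starts at one or the minimum, whichever is greater, and always stays clamped within the configured range:

Example: replica bounds (illustrative sketch)
def initial_replicas(min_replicas: int) -> int:
    # A model is first deployed on one replica or its minimum, whichever is greater.
    return max(1, min_replicas)

def clamp_replicas(desired: int, min_replicas: int, max_replicas: int) -> int:
    # The autoscaler never scales below the minimum or above the maximum.
    return max(min_replicas, min(desired, max_replicas))

assert initial_replicas(0) == 1        # the default minimum of 0 still starts on one replica
assert clamp_replicas(14, 0, 10) == 10  # demand beyond the maximum is capped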

Autoscaler settings

There are three autoscaler settings:

  • Autoscaling window: The timeframe of traffic considered for scaling replicas up and down. Default: 60 seconds.
  • Scale down delay: The additional time the autoscaler waits before spinning down a replica. Default: 900 seconds (15 minutes).
  • Concurrency target: The number of concurrent requests you want each replica to be responsible for handling. Default: 1 request.

No single autoscaler configuration fits every workload, but we generally recommend a shorter autoscaling window with a longer scale down delay: the short window responds quickly to traffic spikes, while the long delay maintains capacity through variable traffic. The default values reflect this recommendation.
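
Taken together, a deployment's autoscaling configuration might be represented like this (a sketch only; the field names below are illustrative, not the exact schema used by Baseten's API or UI):

Example autoscaling configuration (illustrative field names)
autoscaling = {
    "min_replica": 0,          # allow scale to zero when idle
    "max_replica": 5,          # cap spend during traffic spikes
    "autoscaling_window": 60,  # seconds of traffic considered for scaling decisions
    "scale_down_delay": 900,   # seconds to wait before spinning down a replica
    "concurrency_target": 1,   # concurrent requests each replica should handle
}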

Autoscaling in action

Here’s how the autoscaler handles spikes in traffic without wasting money on unnecessary model resources (a simplified sketch of this logic follows the list):

  • The autoscaler analyzes incoming traffic to your model. When the average number of concurrent requests divided by the number of active replicas exceeds the concurrency target for the duration of the autoscaling window, additional replicas are created until:
    • The average concurrent requests per active replica drops below the concurrency target, or
    • The maximum count of replicas is reached.
  • When traffic dies down, fewer replicas are needed to stay below the concurrency target. When this has been true for the duration of the autoscaling window, excess replicas are marked for removal. The autoscaler waits for the scale down delay before gracefully spinning down any unneeded replicas. Replicas will not spin down if:
    • Traffic picks back up during the scale down delay, or
    • The deployment’s minimum count of replicas is reached.
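
The decision rule above reduces to a small calculation. Here's a simplified sketch (a model of the behavior, not Baseten's actual autoscaler, which also applies the autoscaling window and scale down delay before acting):

Example: simplified scaling decision
import math

def desired_replicas(avg_concurrent_requests, concurrency_target, min_replicas, max_replicas):
    # Enough replicas that concurrent requests per replica stays at or below
    # the target, clamped to the configured range.
    needed = math.ceil(avg_concurrent_requests / concurrency_target)
    return max(min_replicas, min(needed, max_replicas))

# Spike: 12 concurrent requests averaged over the window, target of 2 per replica.
assert desired_replicas(12, 2, min_replicas=0, max_replicas=10) == 6
# Traffic subsides: scale-downs only take effect after the scale down delay.
assert desired_replicas(0, 2, min_replicas=0, max_replicas=10) == 0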

Scale to zero

If you’re just testing your model or anticipate light and inconsistent traffic, scale to zero can save you substantial amounts of money.

Scale to zero means that when a deployed model is not receiving traffic, it scales down to zero replicas. When the model is called, Baseten spins up a new instance to serve model requests.

To turn on scale to zero, just set a deployment’s minimum replica count to zero. Scale to zero is enabled by default in the standard autoscaling config.
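
If you manage deployments programmatically, enabling scale to zero is just an update to the minimum replica count. The sketch below is illustrative only: the endpoint URL, path, and field name are assumptions, not Baseten's documented API; consult the API reference for the real schema.

Example: enabling scale to zero programmatically (hypothetical endpoint and payload)
import requests

resp = requests.patch(
    "https://api.example.com/v1/deployments/DEPLOYMENT_ID/autoscaling",  # placeholder URL
    headers={"Authorization": "Api-Key YOUR_API_KEY"},
    json={"min_replica": 0},  # scale to zero when the deployment is idle
)
resp.raise_for_status()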

Cold starts

A “cold start” is the time it takes to spin up a new instance of a model server. Cold starts apply in two situations:

  • When a model is scaled to zero and receives a request
  • When the number of concurrent requests triggers the autoscaler to increase the number of active replicas

Cold starts are especially noticeable for scaled-to-zero models because the time to process the first request includes the cold start time. Baseten has heavily invested in reducing cold start times for all models.
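
One practical consequence for clients: when calling a model that may be scaled to zero, budget for the cold start in your request timeout. A sketch, with a placeholder model URL and auth header:

Example: tolerating a cold start on the client side (placeholder URL and auth)
import requests

resp = requests.post(
    "https://model-abc123.example.com/predict",   # placeholder, not a real endpoint
    headers={"Authorization": "Api-Key YOUR_API_KEY"},
    json={"prompt": "hello"},
    timeout=(5, 300),  # (connect, read): the first response may include the cold start
)
print(resp.json())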

Network accelerator

Baseten uses a network accelerator to speed up model loads from common model artifact stores, including HuggingFace, CloudFront, S3, and OpenAI. Our accelerator employs byte range downloads in the background to maximize the parallelism of downloads. This improves cold start times by reducing the amount of time it takes to load model weights and other required data.
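
As a rough illustration of the byte-range technique (a generic sketch, not Baseten's accelerator), a large file can be split into non-overlapping ranges and fetched in parallel from any server that supports Range requests:

Example: parallel byte range downloads (generic sketch)
import concurrent.futures
import requests

def parallel_download(url, num_chunks=8):
    # Determine the total size, then fetch non-overlapping byte ranges concurrently.
    size = int(requests.head(url, allow_redirects=True).headers["Content-Length"])
    ranges = [(i * size // num_chunks, (i + 1) * size // num_chunks - 1)
              for i in range(num_chunks)]

    def fetch(byte_range):
        start, end = byte_range
        r = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
        r.raise_for_status()
        return r.content

    with concurrent.futures.ThreadPoolExecutor(max_workers=num_chunks) as pool:
        return b"".join(pool.map(fetch, ranges))  # chunks come back in order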

Cold start pods

To shorten cold start times, we spin up specially designated pods that accelerate model loading; these pods are not counted toward your ordinary model resources. You may see them in your logs and metrics.

Coldboost logs have [Coldboost] as a prefix to signify that a cold start pod is in use:

Example coldboost log line
Oct 09 9:20:25pm [Coldboost] Completed model.load() execution in 12650 ms

Further optimizations

Read our how-to guide for optimizing cold starts to learn how you can edit your Truss and application to reduce the impact of cold starts.

Autoscaling for development deployments

Autoscaling settings for development deployments are optimized for live reload workflows and a simplified testing setup. The standard configuration is:

  • Minimum replicas: 0.
  • Maximum replicas: 1.
  • Autoscaling window: 60 seconds.
  • Scale down delay: 900 seconds (15 minutes).
  • Concurrency target: 1 request.

Development deployments cannot scale beyond 1 replica. To unlock full autoscaling for your deployment, promote it to production.