Autoscaling

Configuring autoscaling

Autoscaling settings are per deployment and are inherited when promoting a model to production unless overridden. Configure autoscaling through:

UI → Manage settings in your Baseten workspace.
API → Use the autoscaling API.

Replica Scaling

Each deployment scales within a configured range of replicas:

Minimum replicas → The lowest number of active replicas.
- Default: 0 (scale to zero).
- Maximum value: Cannot exceed the maximum replica count.
Maximum replicas → The upper limit of active replicas.
- Default: 1.
- Max: 10 by default (contact support to increase).

When first deployed, the model starts with 1 replica (or the minimum count, if higher). As traffic increases, additional replicas scale up until the maximum count is reached. When traffic decreases, replicas scale down to match demand.

Autoscaler settings

The autoscaler logic is controlled by three key parameters:

Autoscaling window → Time window for traffic analysis before scaling up/down. Default: 60 seconds.
Scale down delay → Time before an unused replica is removed. Default: 900 seconds (15 minutes).
Concurrency target → Number of requests a replica should handle before scaling. Default: 1 request.

A short autoscaling window with a longer scale-down delay is recommended for fast upscaling while maintaining capacity during temporary dips.

Autoscaling behavior

Scaling Up

When the average requests per active replica exceed the concurrency target within the autoscaling window, more replicas are created until:

The concurrency target is met, or
The maximum replica count is reached.

Scaling Down

When traffic drops below the concurrency target, excess replicas are flagged for removal. The scale-down delay ensures that replicas are not removed prematurely:

If traffic spikes again before the delay ends, replicas remain active.
If the minimum replica count is reached, no further scaling down occurs.

Scale to zero

If you’re just testing your model or anticipate light and inconsistent traffic, scale to zero can save you substantial amounts of money. Scale to zero means that when a deployed model is not receiving traffic, it scales down to zero replicas. When the model is called, Baseten spins up a new instance to serve model requests. To turn on scale to zero, just set a deployment’s minimum replica count to zero. Scale to zero is enabled by default in the standard autoscaling config.

Models that have not received any traffic for more than two weeks will be automatically deactivated. These models will need to be activated manually before they can serve requests again. For production deployments this threshold is two months.

Cold starts

A cold start is the time required to initialize a new replica when scaling up. Cold starts impact:

Scaled-to-zero deployments → The first request must wait for a new replica to start.
Scaling events → When traffic spikes and a deployment requires more replicas.

Cold Start Optimizations

Network accelerator Baseten speeds up model loading from Hugging Face, CloudFront, S3, and OpenAI using parallelized byte-range downloads, reducing cold start delays. Cold start pods Baseten pre-warms specialized cold start pods to accelerate loading times. These pods appear in logs as [Coldboost].

Example coldboost log line

Oct 09 9:20:25pm [Coldboost] Completed model.load() execution in 12650 ms

Model Image streaming and optimization To further reduce initialization latency, Baseten uses image streaming to optimize container startup.

Initial non-optimized image: When a model is first deployed, a standard image is built without optimization. During this stage, the runtime monitors which parts of the image are accessed during startup and inference.
Call graph–based optimization: Baseten analyzes the model’s call graph to identify which layers, weights, and binaries are actually needed during initialization. This information drives creation of an optimized image.
Prefetch and lazy fetch: The optimized image is split into two content groups:
- Prefetched content: Frequently accessed layers and dependencies are loaded eagerly at container start.
- Lazy-fetched content: Less critical data is fetched on demand, reducing initial I/O overhead.
Streaming-enabled image pull: Images optimized through this process are streamed into the node filesystem during startup, allowing the model to begin loading before the entire image is downloaded. Pulling an optimized image appears in logs as:
Example streaming image pull log line
```
Successfully pulled streaming-enabled image in 15.851s. Image size: 32 GB.
```

Autoscaling for development deployments

Development deployments have fixed autoscaling constraints to optimize for live reload workflows:

Min replicas: 0
Max replicas: 1
Autoscaling window: 60 seconds
Scale down delay: 900 seconds (15 min)
Concurrency target: 1 request

To enable full autoscaling, promote the deployment and environment like production.

Get started

Concepts

Development

Deployment

Inference

Training

Observability

Troubleshooting

Configuring autoscaling

Replica Scaling

Autoscaler settings

Autoscaling behavior

Scaling Up

Scaling Down

Scale to zero

Cold starts

Cold Start Optimizations

Autoscaling for development deployments

Get started

Concepts

Development

Deployment

Inference

Training

Observability

Troubleshooting

​Configuring autoscaling

​Replica Scaling

​Autoscaler settings

​Autoscaling behavior

​Scaling Up

​Scaling Down

​Scale to zero

​Cold starts

​Cold Start Optimizations

​Autoscaling for development deployments

Configuring autoscaling

Replica Scaling

Autoscaler settings

Autoscaling behavior

Scaling Up

Scaling Down

Scale to zero

Cold starts

Cold Start Optimizations

Autoscaling for development deployments