Cold start triggers
Every new replica cold-starts before it can serve traffic, no matter why it was created. Scale-from-zero applies when a deployment’s min_replica is 0. Once traffic stays at zero for the full scale_down_delay, the autoscaler shuts down every replica. The next request finds nothing running and waits for a full startup, so users feel this cold start directly.
Scaling events happen while a deployment is already serving traffic. When load crosses the scaling threshold, the autoscaler adds replicas, and each one cold-starts before it can serve traffic. The replicas already running keep serving in the meantime, so users notice only when load grows faster than new replicas can start up.
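The scale-from-zero condition described above can be sketched as a small predicate. This is a simplified model of the behavior in the text, not the autoscaler's actual implementation; the function name is hypothetical, and the 900-second default mirrors the 15-minute scale-down delay default mentioned later on this page.

```python
def scales_to_zero(min_replica: int, idle_seconds: float,
                   scale_down_delay: float = 900.0) -> bool:
    """Return True if, in this simplified model, every replica is shut down."""
    # Scale-to-zero only applies when the replica floor is 0.
    if min_replica > 0:
        return False
    # Traffic must stay at zero for the full scale-down delay.
    return idle_seconds >= scale_down_delay


# Idle for 20 minutes with min_replica 0: the next request hits a cold start.
print(scales_to_zero(min_replica=0, idle_seconds=1200))  # True
# Same idle period with min_replica 1: a replica is still warm.
print(scales_to_zero(min_replica=1, idle_seconds=1200))  # False
```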
Contributing factors
A new replica works through these steps in order, and their durations add up to the cold-start time:
| Step | What happens |
|---|---|
| Container pull | The replica downloads your Docker image layers. |
| Weight load | Model weights (often 10s to 100s of GB) move from storage into GPU memory. |
| Engine initialization | Your model’s setup code runs. For inference engines like vLLM and SGLang, this includes capturing CUDA graphs, compiling kernels with torch.compile, and profiling the KV cache. |
torch.compile can run well over a minute, and Baseten doesn’t cache those artifacts unless you opt in. For the largest models (70B+ parameters or large mixture-of-experts), even BDN can’t make hundreds of gigabytes instant, so weight load stays the dominant step.
Cold start time isn’t a fixed number. It varies with model size and the GPU you run on, so benchmark your own model rather than relying on a single figure.
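Because the three steps run in order, a benchmark can be summarized as a simple sum, which also shows which step dominates. The sketch below is illustrative only: the class name is hypothetical and the durations are made-up placeholders, not measurements of any real model. Substitute timings from your own deployment.

```python
from dataclasses import dataclass


@dataclass
class ColdStartBenchmark:
    """Per-step durations, in seconds, measured for one replica startup."""
    container_pull_s: float
    weight_load_s: float
    engine_init_s: float

    def steps(self) -> dict:
        return {
            "container_pull": self.container_pull_s,
            "weight_load": self.weight_load_s,
            "engine_init": self.engine_init_s,
        }

    def total_s(self) -> float:
        # The steps run sequentially, so their durations add up.
        return sum(self.steps().values())

    def dominant_step(self) -> str:
        # The step with the longest duration is the one worth shrinking first.
        durations = self.steps()
        return max(durations, key=durations.get)


# Placeholder numbers for illustration only.
run = ColdStartBenchmark(container_pull_s=30.0, weight_load_s=240.0, engine_init_s=90.0)
print(run.total_s())        # 360.0
print(run.dominant_step())  # weight_load
```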
Reducing cold starts
The biggest win comes from shrinking whichever step dominates startup. When that isn’t enough, keep replicas warm so requests skip the cold start entirely.
Faster weight loading
BDN runs automatically on engine-builder deployments. On any other deployment, turn it on by adding a weights block to your config.
Compilation caching
torch.compile and CUDA graph capture rerun on every fresh replica unless their output is cached. Torch compile caching, built on b10cache, persists those artifacts so a new replica loads them instead of recompiling, which cuts compilation from minutes to roughly 5 to 20 seconds.
Warm replicas
min_replica sets a floor on running replicas. Keep it at 1 or higher so a replica stays warm to serve the first request. You pay for that replica while it’s idle, but the request no longer waits for a startup. Set it in the dashboard or through the autoscaling settings API:
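A minimal sketch of the API route, using only the Python standard library. The endpoint path, the PATCH method, and the `Api-Key` authorization scheme are assumptions about the shape of the autoscaling settings API, and `MODEL_ID`, `DEPLOYMENT_ID`, and `YOUR_API_KEY` are placeholders; confirm all of them against the API reference before use. The helper only builds the request so it can be inspected before anything is sent.

```python
import json
import urllib.request

# Assumed base URL and path shape; verify against the API reference.
API_BASE = "https://api.baseten.co/v1"


def build_autoscaling_request(model_id: str, deployment_id: str,
                              api_key: str, **settings) -> urllib.request.Request:
    """Build (but do not send) a request that updates autoscaling settings."""
    url = (f"{API_BASE}/models/{model_id}"
           f"/deployments/{deployment_id}/autoscaling_settings")
    return urllib.request.Request(
        url,
        data=json.dumps(settings).encode("utf-8"),
        method="PATCH",  # assumed HTTP method
        headers={
            "Authorization": f"Api-Key {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


# Keep one replica warm at all times.
req = build_autoscaling_request("MODEL_ID", "DEPLOYMENT_ID", "YOUR_API_KEY",
                                min_replica=1)
# To apply the change: urllib.request.urlopen(req)
```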
For redundancy, set min_replica to 2 or higher so one replica can fail during maintenance without causing cold starts.
Your replica floor trades cost against latency:
| Approach | Cost | Latency | Best for |
|---|---|---|---|
| Scale to zero (min_replica: 0) | No charge while idle; wake-up minutes are billed | First request waits for a full cold start | Batch jobs, development, and spiky low-volume traffic |
| Always on (min_replica ≥ 1) | Pay for idle replicas | No cold start from idle, though new replicas still cold-start | Latency-sensitive production traffic |
Pre-warming
For predictable traffic spikes, raise min_replica ahead of the expected load.
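Pre-warming only helps if the new floor is applied early enough for the extra replicas to finish their cold start. The sketch below works out the latest time to raise min_replica; the function names, the five-minute safety margin, and the example times are illustrative assumptions, and the cold-start estimate should come from your own benchmark.

```python
from datetime import datetime, timedelta


def prewarm_deadline(spike_at: datetime,
                     cold_start_estimate: timedelta,
                     margin: timedelta = timedelta(minutes=5)) -> datetime:
    """Latest time to raise min_replica so replicas are warm before the spike."""
    return spike_at - cold_start_estimate - margin


def prewarm_settings(expected_replicas: int) -> dict:
    """Autoscaling settings to apply at the pre-warm deadline."""
    return {"min_replica": expected_replicas}


# Example: a spike expected at 09:00 with a measured six-minute cold start.
spike = datetime(2025, 1, 1, 9, 0)
deadline = prewarm_deadline(spike, cold_start_estimate=timedelta(minutes=6))
print(deadline.isoformat())  # 2025-01-01T08:49:00
print(prewarm_settings(4))   # {'min_replica': 4}
```

Once the spike has passed, lowering min_replica back to its usual value avoids paying for idle replicas.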
Scale-down delay
A longer scale-down delay keeps replicas warm through brief traffic dips. The default is 15 minutes (900 seconds); setting scale_down_delay to 1800 seconds doubles it to 30 minutes.
Next steps
- Request lifecycle: What happens to requests during cold starts, including queuing and timeout behavior.
- Autoscaling: Configure min_replica, scale_down_delay, and the rest of the scaling settings.
- Traffic patterns: Pre-warming strategies for different traffic types.
- Billing and usage: How cold-start time is metered.
- Troubleshooting: Diagnose cold start issues.