A cold start is the time required to initialize a new replica when scaling up. Cold starts affect the latency of requests that trigger new replica creation.

When cold starts happen

Cold starts occur in two scenarios:
  1. Scale-from-zero: When a deployment with zero active replicas receives its first request.
  2. Scaling events: When traffic increases and the autoscaler adds new replicas.

What contributes to cold start time

Cold start duration depends on several factors:
Factor          Impact
Model loading   Loading model weights (tens to hundreds of GB); typically the dominant factor
Container pull  Downloading Docker image layers
Initialization  Running your model's setup code
For large models, cold starts can take minutes. Model weight downloads are usually the bottleneck — even with optimizations, the physics of moving hundreds of gigabytes of data creates inherent lag.
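As a back-of-the-envelope illustration of why weight downloads dominate, consider pure transfer time (the bandwidth figure below is an assumption for the example, not a Baseten guarantee):

```python
def download_seconds(model_gb: float, bandwidth_gbps: float) -> float:
    """Estimate the time to move model weights over the network.

    model_gb: weight size in gigabytes
    bandwidth_gbps: network throughput in gigabits per second (assumed)
    """
    model_gigabits = model_gb * 8  # bytes -> bits
    return model_gigabits / bandwidth_gbps

# A 140 GB model over a 10 Gbps link needs ~112 s of transfer time alone,
# before container pull and initialization are counted.
print(download_seconds(140, 10))  # -> 112.0
```

Even with parallelized downloads and caching, this baseline is why large-model cold starts are measured in minutes rather than seconds.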

Minimizing cold starts

Keep replicas warm

Set min_replica to 1 or higher so at least one replica is always ready to serve requests. This eliminates scale-from-zero cold starts but increases cost, since you pay for the replica even when it is idle.
{
  "min_replica": 1
}
For production redundancy, set min_replica ≥ 2 so one replica can fail during maintenance without causing cold starts.

Pre-warm before expected traffic

For predictable traffic spikes, increase min replicas before the expected load:
# 10-15 minutes before expected spike
curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"min_replica": 5}'
After traffic stabilizes, reset to your normal minimum.
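The pre-warm-then-reset pattern can be sketched as a small scheduling helper. Everything below is illustrative: the schedule format and function are hypothetical, not part of the Baseten API.

```python
from datetime import datetime, timedelta

PREWARM_LEAD = timedelta(minutes=15)  # raise the floor 15 minutes ahead of a spike

def desired_min_replica(now: datetime,
                        spikes: list[tuple[datetime, int]],
                        baseline: int = 1) -> int:
    """Return the min_replica value to request right now.

    spikes: (start_time, replicas_needed) pairs for expected traffic peaks
    (a hypothetical schedule format for this sketch).
    """
    for start, replicas in spikes:
        # Hold the higher floor from 15 minutes before the spike until it begins;
        # once real traffic arrives, the autoscaler keeps replicas up on its own.
        if start - PREWARM_LEAD <= now < start:
            return replicas
    return baseline

spikes = [(datetime(2024, 6, 1, 9, 0), 5)]
print(desired_min_replica(datetime(2024, 6, 1, 8, 50), spikes))  # -> 5
print(desired_min_replica(datetime(2024, 6, 1, 8, 0), spikes))   # -> 1
```

A cron job or scheduler could run this logic and PATCH the autoscaling settings whenever the desired floor changes.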

Use longer scale-down delay

A longer scale-down delay keeps replicas warm during temporary traffic dips:
{
  "scale_down_delay": 900
}
This prevents cold starts when traffic returns within the delay window.
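To see the effect of the delay window, here is a simplified single-replica model of scale-to-zero behavior (illustration only; the real autoscaler considers more signals than the last request time):

```python
def cold_start_requests(arrival_times_s: list[float],
                        scale_down_delay_s: float) -> int:
    """Count requests that hit a cold start, assuming one replica that
    scales to zero scale_down_delay_s seconds after the last request.
    """
    cold = 0
    last = None
    for t in arrival_times_s:
        if last is None or t - last > scale_down_delay_s:
            cold += 1  # replica already scaled down; this request waits
        last = t
    return cold

arrivals = [0, 100, 800, 2500]  # request times in seconds
print(cold_start_requests(arrivals, 300))  # 5-minute delay  -> 3 cold starts
print(cold_start_requests(arrivals, 900))  # 15-minute delay -> 2 cold starts
```

The longer delay absorbs the 700-second dip between the second and third requests, trading extra warm time for one fewer cold start.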

Platform optimizations

Baseten automatically applies several optimizations to reduce cold start times:
  - Baseten Delivery Network (recommended): The weights configuration optimizes cold starts by mirroring weights to Baseten’s infrastructure and caching them close to your model pods. See Baseten Delivery Network (BDN) for full configuration options.
  - Network accelerator (legacy): Parallelized byte-range downloads speed up model loading from Hugging Face, S3, GCS, and R2. Network acceleration is deprecated in favor of the weights configuration, which provides superior cold start performance through multi-tier caching.
  - Image streaming: Optimized images stream into nodes, allowing model loading to begin before the full download completes:
Successfully pulled streaming-enabled image in 15.851s. Image size: 32 GB.
These optimizations are applied automatically.

The tradeoff

Cold starts create a fundamental tradeoff between cost and latency:
Approach                        Cost                           Latency
Scale to zero (min_replica: 0)  Lower: no cost when idle       Higher: first request waits for cold start
Always on (min_replica: ≥ 1)    Higher: pay for idle replicas  Lower: no cold starts
For latency-sensitive production workloads, the cost of keeping replicas warm is usually justified. For batch workloads or development, scale-to-zero often makes sense.
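To put a rough number on the cost side of the tradeoff (the hourly rate below is a made-up assumption for illustration, not Baseten pricing):

```python
def monthly_idle_cost(replica_cost_per_hour: float,
                      idle_hours_per_day: float) -> float:
    """Approximate cost of keeping one replica warm through idle periods,
    assuming a 30-day month. The rate is a hypothetical example figure.
    """
    return replica_cost_per_hour * idle_hours_per_day * 30

# Example: a $4/hour GPU replica idle 16 hours a day costs $1,920/month
# to keep warm. Weigh that against cold starts measured in minutes.
print(monthly_idle_cost(4.0, 16))  # -> 1920.0
```

If that figure exceeds the business cost of occasional multi-minute first requests, scale-to-zero wins; for user-facing inference it usually does not.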

Next steps