Cold starts

A cold start is the time required to initialize a new replica when scaling up. Cold starts affect the latency of requests that trigger new replica creation.

When cold starts happen

Cold starts occur in two scenarios:

Scale-from-zero: When a deployment with zero active replicas receives its first request.
Scaling events: When traffic increases and the autoscaler adds new replicas.

What contributes to cold start time

Cold start duration depends on several factors:

Factor	Impact
Model loading	Loading model weights (tens to hundreds of GB); typically the dominant factor
Container pull	Downloading Docker image layers
Initialization	Running your model’s setup code

For large models, cold starts can take minutes. Model weight downloads are usually the bottleneck: even with optimizations, moving hundreds of gigabytes of data takes time that is bounded by network and storage throughput.
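To get a feel for why weight transfer dominates, a back-of-the-envelope estimate divides the weight size by the sustained throughput. The sketch below is illustrative only: the function name and the throughput figures are assumptions chosen for the arithmetic, not measured Baseten numbers.

```python
def transfer_seconds(size_gb: float, throughput_gbit_per_s: float) -> float:
    """Estimate seconds needed to move size_gb gigabytes at a sustained throughput.

    Uses decimal units (1 GB = 8 gigabits) and ignores protocol overhead,
    disk write speed, and the time to load weights into GPU memory.
    """
    if throughput_gbit_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb * 8 / throughput_gbit_per_s


# Hypothetical sizes and throughputs, for illustration only.
for size_gb in (10, 100, 400):
    for gbit in (1, 10, 25):
        seconds = transfer_seconds(size_gb, gbit)
        print(f"{size_gb:>3} GB at {gbit:>2} Gbit/s: {seconds:.0f} s")
```

Under these assumptions, 100 GB at 10 Gbit/s takes 80 seconds and 400 GB takes more than five minutes, before any container pull or initialization time is counted.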

Minimizing cold starts

Keep replicas warm

Set min_replica to 1 or higher so that at least one replica is always ready to serve requests. This eliminates scale-from-zero cold starts but increases cost, because you pay for the replica while it is idle.

{
  "min_replica": 1
}

For production redundancy, set min_replica ≥ 2 so one replica can fail during maintenance without causing cold starts.

Pre-warm before expected traffic

For predictable traffic spikes, increase min replicas before the expected load:

# Run 10-15 minutes before the expected spike so new replicas finish their cold starts in time
curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"min_replica": 5}'

After traffic stabilizes, reset to your normal minimum.
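If you would rather script the raise and the later reset than run curl by hand, the same PATCH request can be built in Python with the standard library. This is a minimal sketch, not an official client: the function names are hypothetical, the endpoint and Api-Key header are taken from the curl example above, and the JSON Content-Type header is an assumption.

```python
import json
import os
import urllib.request

API_BASE = "https://api.baseten.co/v1"


def build_min_replica_request(model_id, deployment_id, min_replica, api_key):
    """Build (but do not send) the PATCH request that sets min_replica."""
    if min_replica < 0:
        raise ValueError("min_replica must be non-negative")
    url = (
        f"{API_BASE}/models/{model_id}"
        f"/deployments/{deployment_id}/autoscaling_settings"
    )
    body = json.dumps({"min_replica": min_replica}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Api-Key {api_key}",
            "Content-Type": "application/json",  # assumption, see lead-in
        },
    )


def set_min_replica(model_id, deployment_id, min_replica):
    """Send the request, reading the key from the BASETEN_API_KEY variable."""
    request = build_min_replica_request(
        model_id, deployment_id, min_replica, os.environ["BASETEN_API_KEY"]
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status
```

A scheduler of your choice could call set_min_replica(model_id, deployment_id, 5) ahead of the spike and call it again with your normal minimum once traffic stabilizes.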

Use a longer scale-down delay

A longer scale-down delay keeps replicas warm during temporary traffic dips:

{
  "scale_down_delay": 900
}

This prevents cold starts when traffic returns within the delay window (15 minutes in this example, since the value is in seconds).
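The effect of the delay window can be reasoned about with a deliberately simplified model. The sketch below assumes that a replica stays warm for exactly scale_down_delay seconds after traffic stops and ignores every other autoscaler setting. The function name is hypothetical and the real autoscaler may behave differently.

```python
def finds_warm_replica(
    idle_gap_s: float, scale_down_delay_s: float, min_replica: int = 0
) -> bool:
    """Return True if traffic returning after idle_gap_s finds a warm replica.

    Simplified model: with min_replica >= 1 a replica is always warm;
    otherwise the last replica survives only while the idle gap is
    shorter than the scale-down delay.
    """
    if idle_gap_s < 0 or scale_down_delay_s < 0:
        raise ValueError("durations must be non-negative")
    if min_replica >= 1:
        return True
    return idle_gap_s < scale_down_delay_s


# Compare a few idle gaps against a 900-second delay.
for gap_s in (120, 600, 1200):
    status = "warm" if finds_warm_replica(gap_s, scale_down_delay_s=900) else "cold start"
    print(f"traffic returns after {gap_s:>4} s: {status}")
```

In this model, a 10-minute dip is absorbed by a 900-second delay, while a 20-minute dip still ends in a cold start unless min_replica is at least 1.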

Platform optimizations

Baseten automatically applies several optimizations to reduce cold start times:

Baseten Delivery Network (Recommended): The weights configuration optimizes cold starts by mirroring weights to Baseten’s infrastructure and caching them close to your model pods. See Baseten Delivery Network (BDN) for full configuration options.
Network accelerator (Legacy): Parallelized byte-range downloads speed up model loading from Hugging Face, S3, GCS, and R2.

Network Acceleration is deprecated in favor of the new weights configuration, which provides superior cold start performance through multi-tier caching. See Baseten Delivery Network (BDN) for the recommended approach.

Image streaming: Optimized images stream into nodes, allowing model loading to begin before the full download completes:

Successfully pulled streaming-enabled image in 15.851s. Image size: 32 GB.

These optimizations are applied automatically.

The tradeoff

Cold starts create a fundamental tradeoff between cost and latency:

Approach	Cost	Latency
Scale to zero (`min_replica: 0`)	Lower: no cost when idle	Higher: first request waits for cold start
Always on (`min_replica: ≥1`)	Higher: pay for idle replicas	Lower: no cold starts

For latency-sensitive production workloads, the cost of keeping replicas warm is usually justified. For batch workloads or development, scale-to-zero often makes sense.
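One way to put numbers on this tradeoff is to estimate what the warm replicas cost while they sit idle and weigh that against the cold starts they avoid. The sketch below is a rough estimate under stated assumptions: the function name, the hourly rate, and the idle fraction are all hypothetical inputs you would replace with your own figures.

```python
HOURS_PER_MONTH = 730.0  # average month length used for the estimate


def monthly_idle_cost(
    min_replica: int, hourly_rate_usd: float, idle_fraction: float
) -> float:
    """Estimate the monthly cost of keeping min_replica replicas warm while idle.

    idle_fraction is the share of the month (0.0 to 1.0) during which the
    warm replicas receive no traffic.
    """
    if min_replica < 0 or hourly_rate_usd < 0:
        raise ValueError("min_replica and hourly_rate_usd must be non-negative")
    if not 0.0 <= idle_fraction <= 1.0:
        raise ValueError("idle_fraction must be between 0 and 1")
    return min_replica * hourly_rate_usd * idle_fraction * HOURS_PER_MONTH


# Hypothetical rate of 4.00 USD per replica-hour, idle half of the time.
for replicas in (0, 1, 2):
    cost = monthly_idle_cost(replicas, hourly_rate_usd=4.0, idle_fraction=0.5)
    print(f"min_replica={replicas}: about {cost:.0f} USD per month spent idle")
```

If the idle cost is small next to the value of avoiding a multi-minute wait on the first request, keeping replicas warm is the better choice. Otherwise scale-to-zero is.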

Next steps

Autoscaling: Configure min replicas and scale-down delay.
Traffic patterns: Pre-warming strategies for different traffic types.
Troubleshooting: Diagnose cold start issues.
