Different traffic patterns require different autoscaling configurations. Identify your pattern below for recommended starting settings.
These are starting points, not final answers. Monitor your deployment’s performance and adjust based on observed behavior. See Autoscaling for parameter details.

Jittery traffic

Small, frequent spikes that quickly return to baseline.

Characteristics

  • Baseline replica count is steady, but spikes up by 2x several times per hour.
  • Spikes are short-lived and return to baseline quickly.
  • Often temporary surges rather than real load growth, which tempt the autoscaler into overreacting.

Common causes

  • Consumer products with intermittent usage bursts.
  • Traffic splitting or A/B testing with low percentages.
  • Polling clients with synchronized intervals.
Parameter | Value | Why
Autoscaling window | 2-5 minutes | Smooth out noise, avoid reacting to every spike
Scale-down delay | 300-600s | Moderate stability
Target utilization | 70% | Default is fine
Concurrency target | Benchmarked value | Start conservative
A longer autoscaling window averages out the jitter so the autoscaler doesn’t chase every small spike. You’re trading reaction speed for stability, which is acceptable when the spikes aren’t sustained load increases.
If you’re still seeing oscillation with these settings, increase the scale-down delay before lowering target utilization.
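These settings can also be applied through the autoscaling_settings endpoint used in the pre-warming examples later in this guide. A minimal sketch; note that autoscaling_window and scale_down_delay are assumed field names (only min_replica appears verbatim in this guide), so verify them against the API reference:

```shell
# Hypothetical sketch: apply the jittery-traffic settings above.
# autoscaling_window and scale_down_delay (both in seconds) are assumed
# field names; confirm them against the autoscaling API reference.
curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"autoscaling_window": 300, "scale_down_delay": 600}'
```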

Bursty traffic

Sharp surges that stay elevated for a sustained period before dropping.

Characteristics

  • Traffic jumps sharply (2x+ within 60 seconds).
  • Stays high for a sustained period before dropping.
  • The “pain” is queueing and latency spikes while new replicas start.

Common causes

  • Daily morning ramp-up (users starting their day).
  • Marketing events, product launches, viral moments.
  • Top-of-hour scheduled jobs or cron-triggered traffic.
Parameter | Value | Why
Autoscaling window | 30-60s | React quickly to genuine load increases
Scale-down delay | 900s+ | Handle back-to-back waves without thrashing
Target utilization | 50-60% | More headroom absorbs the burst while scaling
Min replicas | ≥2 | Redundancy + reduces cold start impact
Short window means fast reaction. Long delay prevents scaling down between waves. Lower utilization gives you buffer capacity while new replicas start.
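These settings can be applied in one call through the autoscaling_settings endpoint shown in the pre-warming examples. A sketch, with autoscaling_window and scale_down_delay as assumed field names (min_replica is the only field confirmed in this guide):

```shell
# Hypothetical sketch: bursty-traffic settings (60s window, 15 min delay,
# 2-replica floor). Field names other than min_replica are assumed.
curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"autoscaling_window": 60, "scale_down_delay": 900, "min_replica": 2}'
```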

Pre-warming for predictable bursts

If your bursts are predictable (morning ramp, scheduled events), pre-warm by bumping min replicas before the expected spike:
curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"min_replica": 5}'
After the burst subsides, reset to your normal minimum:
curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"min_replica": 2}'
Automate pre-warming with cron jobs or your orchestration system. Bumping min replicas 10-15 minutes before known peaks avoids cold starts for the first requests after the spike.
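As a concrete sketch of that automation, a crontab could bump min replicas 15 minutes before a known 9:00 weekday peak and reset afterwards. The times and replica counts are illustrative, and each crontab entry must stay on a single line:

```shell
# Illustrative crontab: pre-warm at 8:45 on weekdays, reset at 10:00.
# Times and replica counts are examples; adjust to your own peak schedule.
45 8 * * 1-5 curl -s -X PATCH https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings -H "Authorization: Api-Key $BASETEN_API_KEY" -H "Content-Type: application/json" -d '{"min_replica": 5}'
0 10 * * 1-5 curl -s -X PATCH https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings -H "Authorization: Api-Key $BASETEN_API_KEY" -H "Content-Type: application/json" -d '{"min_replica": 2}'
```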

Scheduled traffic

Characteristics

  • Long periods of low or zero traffic.
  • Large bursts tied to job schedules (hourly, daily, weekly).
  • Traffic patterns are predictable but infrequent.

Common causes

  • ETL pipelines and data processing jobs.
  • Embedding backfills and batch inference.
  • Periodic evaluation or testing jobs.
  • Document processing triggered by user uploads.
Parameter | Value | Why
Min replicas | 0 (if cold starts acceptable) or 1 (during job windows) | Cost savings when idle
Scale-down delay | Moderate to high | Jobs often come in waves
Autoscaling window | 60-120s | Don’t overreact to the first few requests
Target utilization | 70% | Default is fine
Scale-to-zero saves significant cost during idle periods. The moderate window prevents overreacting to the initial requests of a batch. If jobs come in waves, a longer delay keeps replicas warm between them.
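Applied via the API, the scheduled-traffic settings might look like the sketch below. autoscaling_window and scale_down_delay are assumed field names; min_replica is the only one confirmed in this guide:

```shell
# Hypothetical sketch: scale to zero when idle, 120s window, long delay
# so replicas stay warm between waves of jobs.
curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"min_replica": 0, "autoscaling_window": 120, "scale_down_delay": 900}'
```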

Scheduled pre-warming

For predictable batch jobs, use cron and the API to pre-warm; note that each crontab entry must fit on a single line. Five minutes before the top-of-hour job, scale up:
55 * * * * curl -X PATCH https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings -H "Authorization: Api-Key $BASETEN_API_KEY" -H "Content-Type: application/json" -d '{"min_replica": 3}'
At half past the hour, once the job has completed, scale back down:
30 * * * * curl -X PATCH https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings -H "Authorization: Api-Key $BASETEN_API_KEY" -H "Content-Type: application/json" -d '{"min_replica": 0}'
If you use scale-to-zero, the first request of each batch will experience a cold start. For latency-sensitive batch jobs, keep min replicas at 1 during expected job windows.

Steady traffic

Characteristics

  • Traffic rises and falls gradually over the day.
  • Classic diurnal pattern with no sharp edges.
  • Predictable, cyclical behavior.

Common causes

  • Always-on inference APIs with consistent user base.
  • B2B applications with business-hours usage.
  • Production workloads with stable, mature traffic.
Parameter | Value | Why
Target utilization | 70-80% | Can run replicas hotter safely
Autoscaling window | 60-120s | Moderate reaction speed
Scale-down delay | 300-600s | Moderate
Min replicas | ≥2 | Redundancy for production
Without sudden spikes, you don’t need as much headroom. You can run replicas at higher utilization (lower cost) because load changes are gradual and predictable. The autoscaler has time to react.
Steady traffic is the easiest to tune. Start with defaults, monitor for a week, then optimize for cost by gradually raising target utilization while watching p95 latency.
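Before raising target utilization, it can help to sanity-check capacity with a back-of-the-envelope estimate based on Little's law: replicas needed ≈ peak RPS × request latency ÷ (concurrency target × target utilization). A sketch with hypothetical numbers (40 RPS, 500 ms latency, concurrency target of 4, 75% utilization):

```shell
# Rough replica sizing via Little's law; all numbers below are hypothetical.
peak_rps=40          # peak requests per second
latency_s=0.5        # average request latency in seconds
concurrency_target=4 # concurrent requests each replica can handle
utilization=0.75     # target utilization
# in-flight requests = rps * latency; divide by effective slots per replica
# and round up to get the replica count.
awk -v r="$peak_rps" -v l="$latency_s" -v c="$concurrency_target" -v u="$utilization" \
  'BEGIN { n = (r * l) / (c * u); printf "%d\n", (n > int(n)) ? int(n) + 1 : n }'
# 20 in-flight requests / 3 effective slots per replica, rounded up: prints 7
```

Re-run the estimate whenever you change target utilization to see whether running hotter actually reduces the replica count for your traffic level.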

Identifying your pattern

Not sure which pattern you have? Check your metrics:
  1. Go to your model’s Metrics tab in the Baseten dashboard
  2. Look at Inference volume and Replicas over the past week
  3. Compare to the patterns above
You see… | Your pattern is…
Frequent small spikes that quickly return to baseline | Jittery
Sharp jumps that stay high for a while | Bursty
Long flat periods with occasional large bursts | Scheduled
Gradual rises and falls, smooth curves | Steady
Some workloads are a mix of patterns. If your traffic has both smooth diurnal patterns AND occasional bursts, optimize for the bursts (they cause the most pain) and accept slightly higher cost during steady periods.

Next steps