Issue: truss push can’t find config.yaml
[Errno 2] No such file or directory: '/Users/philipkiely/Code/demo_docs/config.yaml'
Fix: set correct target directory
The directory `truss push` is looking at is not a Truss. Make sure you’re pointing `truss push` at the correct directory by:
- Running `truss push` from the directory containing the Truss. You should see the file `config.yaml` when you run `ls` in your working directory.
- Or passing the target directory as an argument, such as `truss push /path/to/my-truss`.
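One way to catch this before pushing is to check that the target directory actually contains a `config.yaml`. A minimal pre-check sketch — the `is_truss_dir` helper here is illustrative, not part of the Truss CLI:

```python
from pathlib import Path

def is_truss_dir(path: str) -> bool:
    """Return True if the directory looks like a Truss (contains config.yaml)."""
    return (Path(path) / "config.yaml").is_file()

# Usage: verify before shelling out to `truss push`
target = "."
if is_truss_dir(target):
    print(f"OK: {target} contains config.yaml")
else:
    print(f"Not a Truss: no config.yaml in {target}")
```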
Issue: unexpected failure during model build
During the model build step, there can be unexpected failures from temporary circumstances, such as a network error while downloading model weights from Hugging Face or installing a Python package from PyPI.
Fix: restart deploy from Baseten UI
First, check your model logs to determine the exact cause of the error. If it’s an error during model download, package installation, or similar, you can try restarting the deploy from the model dashboard in your workspace.
Autoscaling issues
Before troubleshooting, review Autoscaling for parameter details and Traffic patterns for pattern-specific recommendations.
Latency spikes during scaling events
Symptoms: TTFT (time to first token) or p95/p99 latency degrades when replicas are added or removed.
Causes:
- Replicas terminated while handling in-flight requests
- Cold start delays while new replicas initialize
Solutions (in order of priority):
- Increase scale-down delay (e.g., 300s → 900s) to reduce how often replicas are removed.
- Increase min replicas to reduce cold start frequency.
- Lower target utilization to provide more headroom during scaling.
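These settings can also be changed programmatically. The sketch below assembles an update payload and sends it with a PATCH request — the endpoint path and field names are assumptions for illustration; consult the Baseten API reference for the real schema:

```python
import json
import urllib.request

# NOTE: the URL and field names below are illustrative assumptions,
# not a confirmed Baseten API schema.
API_KEY = "YOUR_API_KEY"
ENDPOINT = "https://api.baseten.co/v1/models/MODEL_ID/deployments/DEPLOYMENT_ID/autoscaling_settings"

def build_autoscaling_patch(scale_down_delay_s: int, min_replicas: int) -> dict:
    """Assemble the settings to apply: a longer scale-down delay and
    a floor of warm replicas to reduce cold starts."""
    return {
        "scale_down_delay": scale_down_delay_s,  # e.g. 900 instead of 300
        "min_replica": min_replicas,             # keep at least one replica warm
    }

payload = build_autoscaling_patch(scale_down_delay_s=900, min_replicas=1)
req = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode(),
    headers={"Authorization": f"Api-Key {API_KEY}"},
    method="PATCH",
)
# urllib.request.urlopen(req)  # uncomment to actually apply the change
print(payload)
```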
Replicas oscillating (thrash)
Symptoms: Replica count bounces repeatedly (e.g., 8↔9) even with relatively stable traffic.
Causes: Autoscaler reacting to short-term traffic noise or internal model fluctuations.
Solutions (in order of priority):
- Increase scale-down delay — this is the primary lever for oscillation.
- Increase autoscaling window to smooth out noise.
- Only then consider lowering target utilization for more headroom.
Don’t use target utilization as the primary fix for thrash. Scale-down delay is more effective and doesn’t waste capacity.
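To see why a longer autoscaling window damps oscillation, compare the desired replica count computed over a short vs. a long averaging window on noisy-but-stable traffic. This is a toy model of windowed averaging, not Baseten’s actual autoscaler:

```python
import random

random.seed(0)
# Noisy but stable traffic: ~82 concurrent requests, jittering +/- 15
traffic = [82 + random.randint(-15, 15) for _ in range(120)]

def desired_replicas(samples, window, concurrency_target=10):
    """Average concurrency over the trailing window, then divide by the
    per-replica concurrency target (toy version of an autoscaler)."""
    out = []
    for i in range(len(samples)):
        w = samples[max(0, i - window + 1): i + 1]
        out.append(round(sum(w) / len(w) / concurrency_target))
    return out

def flips(series):
    """Count how often the replica target changes (oscillation)."""
    return sum(1 for a, b in zip(series, series[1:]) if a != b)

short = desired_replicas(traffic, window=5)
long_ = desired_replicas(traffic, window=60)
print(f"window=5:  {flips(short)} replica-count changes")
print(f"window=60: {flips(long_)} replica-count changes")
```

The longer window sees through the jitter, so the replica target barely moves; the short window chases every spike.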
Slow scale-up / “Scaling up replicas” persists
Symptoms: New replicas take many minutes (or longer) to become ready. The deployment shows “Scaling up replicas” for an extended period.
Causes:
- GPU capacity not available in your region
- Slow model initialization (large weights, slow downloads)
Solutions:
- Pre-warm by bumping min replicas via API before expected load spikes.
- Contact support about capacity pool availability.
- Check if optimized images are being used (look for “streaming-enabled image” in logs).
Model scales to zero before testing
Symptoms: A newly deployed model scales down to zero before you can send your first test request.
Solution: Set min_replica = 1 during testing. After testing, you can set it back to 0 if you want scale-to-zero behavior.
Async queue growing without bound
Symptoms: The async queue size keeps increasing and requests are not being processed fast enough.
Cause: Requests are arriving faster than the deployment can process them.
Solutions:
- Increase max replicas to add more processing capacity.
- Increase concurrency target if your model can handle more concurrent requests.
- Lower target utilization to trigger scaling earlier.
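A back-of-the-envelope check for whether the queue can ever drain: each replica sustains roughly concurrency target ÷ average latency requests per second, so you need at least arrival rate ÷ that many replicas. A rough sketch (plug in your own numbers from your metrics):

```python
import math

def replicas_needed(arrival_rate_rps: float,
                    avg_latency_s: float,
                    concurrency_target: int) -> int:
    """Little's-law estimate: each replica sustains about
    concurrency_target / avg_latency_s requests per second."""
    per_replica_rps = concurrency_target / avg_latency_s
    return math.ceil(arrival_rate_rps / per_replica_rps)

# Example: 40 req/s arriving, 2 s average latency, 16 concurrent per replica
print(replicas_needed(40, 2.0, 16))  # 40 / (16 / 2) = 5 replicas
```

If the result exceeds your max replicas, the queue will grow no matter how aggressively scaling is tuned.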
Bill higher than expected
Symptoms: GPU costs are higher than anticipated, especially during low-traffic periods.
Solutions:
- Raise concurrency target to squeeze more throughput from each replica.
- Monitor p95 latency as you raise concurrency — if latency stays stable, keep raising; if it rises sharply, you’ve gone too far.
- Enable scale-to-zero (min replicas = 0) for intermittent workloads.
- Review your traffic patterns and adjust settings accordingly — see Traffic patterns.
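To sanity-check the savings from scale-to-zero, estimate GPU-hours billed with and without a warm replica floor. The hourly rate and active hours below are illustrative, not real Baseten pricing:

```python
def monthly_gpu_cost(hourly_rate: float,
                     active_hours_per_day: float,
                     min_replicas: int,
                     days: int = 30) -> float:
    """Bill = billed GPU-hours * hourly rate. With min_replicas >= 1 you
    pay for idle hours too; with scale-to-zero you pay only for active time."""
    if min_replicas >= 1:
        billed_hours = 24 * days * min_replicas
    else:
        billed_hours = active_hours_per_day * days
    return billed_hours * hourly_rate

# Illustrative: $2/hr GPU, traffic only ~4 hours per day
always_on = monthly_gpu_cost(2.0, 4, min_replicas=1)
scale_to_zero = monthly_gpu_cost(2.0, 4, min_replicas=0)
print(f"min_replicas=1: ${always_on:.0f}/mo, scale-to-zero: ${scale_to_zero:.0f}/mo")
```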
Cold starts taking too long
Symptoms: First request after scale-from-zero takes several minutes. Logs show extended time in model loading or container initialization.
Causes:
- Large model weights (10s–100s of GB)
- Slow network downloads from model registries
- Heavy initialization code in the `load()` method
Solutions:
- Look for “streaming-enabled image” in logs — this confirms image streaming is active.
- Keep `min_replica` ≥ 1 to avoid cold starts entirely.
- Pre-warm before expected traffic spikes using the autoscaling API.
See Cold starts for detailed optimization strategies.
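Much of a cold start is often just weight download time, and a quick estimate shows why large checkpoints dominate (numbers are illustrative):

```python
def download_seconds(weights_gb: float, bandwidth_gbps: float) -> float:
    """Time to pull model weights at a given effective network bandwidth.
    weights_gb is gigabytes; bandwidth_gbps is gigabits per second."""
    return weights_gb * 8 / bandwidth_gbps

# Illustrative: a 70 GB checkpoint at an effective 10 Gbit/s
print(f"{download_seconds(70, 10):.0f} s")   # 56 s
# The same checkpoint at 1 Gbit/s takes ~9.3 minutes — this is why
# streaming-enabled images and weight caching matter.
```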
Development deployment won’t scale
Symptoms: Development deployment stays at 1 replica regardless of traffic. Can’t change autoscaling settings.
Cause: Development deployments have fixed autoscaling settings that cannot be modified. Max replicas is locked at 1.
Solution: Promote to a production deployment to enable full autoscaling. Development deployments are optimized for iteration with live reload, not traffic handling.
See Development deployments for the fixed settings.
Not sure which traffic pattern I have
Symptoms: Unsure how to configure autoscaling because traffic behavior is unclear.
Solution:
- Go to your model’s Metrics tab in the Baseten dashboard.
- Look at Inference volume and Replicas over the past week.
- Identify your pattern:
| You see… | Pattern | Key settings to adjust |
|---|---|---|
| Frequent small spikes returning to baseline | Noisy/jittery | Longer autoscaling window |
| Sharp jumps that stay high | Bursty | Short window, long delay, lower utilization |
| Long flat periods with occasional bursts | Batch/scheduled | Scale-to-zero, pre-warming |
| Gradual rises and falls | Smooth/steady | Higher utilization is safe |
See Traffic patterns for detailed recommendations.
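The table above can be approximated in code: look at how variable the traffic is and whether highs persist. This is a toy heuristic for exploring your own metrics export, not a Baseten feature:

```python
from statistics import mean, pstdev

def max_run(samples, threshold):
    """Longest consecutive run of samples above the threshold."""
    best = cur = 0
    for s in samples:
        cur = cur + 1 if s > threshold else 0
        best = max(best, cur)
    return best

def classify_traffic(samples):
    """Crude heuristic mirroring the table above (toy logic)."""
    avg, sd = mean(samples), pstdev(samples)
    zero_frac = sum(1 for s in samples if s == 0) / len(samples)
    if avg == 0 or zero_frac > 0.5:
        return "batch/scheduled"     # long flat periods, occasional bursts
    if sd / avg < 0.2:
        return "smooth/steady"       # low variance: gradual rises and falls
    # High variance: bursty if highs persist, noisy if they revert quickly
    return "bursty" if max_run(samples, avg + sd) >= 5 else "noisy/jittery"

print(classify_traffic([100] * 50 + [102] * 50))          # smooth/steady
print(classify_traffic([10] * 40 + [200] * 20 + [10] * 40))  # bursty
```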
Concurrency target set incorrectly
Symptoms: Either unexpectedly high costs OR high latency despite having replicas available.
Diagnosis:
- Too low (common): Running many more replicas than needed. Default of 1 is conservative but expensive.
- Too high: Requests queue at replicas, causing latency even when replica count looks healthy.
Solutions:
- Benchmark your model to find actual throughput capacity.
- Use starting points by model type:
| Model type | Starting concurrency |
|---|---|
| Standard Truss | 1 |
| vLLM / LLM inference | 32–128 |
| Text embeddings (TEI) | 32 |
| Image generation (SDXL) | 1 |
- Gradually increase while monitoring p95 latency — stop when latency rises sharply.
See Concurrency target for full guidance.
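The "raise until p95 moves" loop can be sketched as a simple decision helper over your latency samples. The 20% tolerance is an illustrative threshold, not an official recommendation:

```python
def p95(latencies_ms):
    """95th-percentile latency (nearest-rank method)."""
    s = sorted(latencies_ms)
    return s[max(0, int(len(s) * 0.95) - 1)]

def should_raise_concurrency(baseline_ms, current_ms, tolerance=1.2):
    """Keep raising concurrency while p95 stays within tolerance
    (here: no more than 20% above the baseline measurement)."""
    return p95(current_ms) <= tolerance * p95(baseline_ms)

baseline = [200] * 95 + [400] * 5   # p95 = 200 ms
stable   = [210] * 95 + [420] * 5   # p95 = 210 ms -> keep raising
degraded = [300] * 95 + [900] * 5   # p95 = 300 ms -> stop, roll back one step
print(should_raise_concurrency(baseline, stable))
print(should_raise_concurrency(baseline, degraded))
```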