The Metrics tab in the model dashboard tracks model load and performance. Use the dropdowns at the top of the tab to scope by environment, deployment, or time range. Environment scope aggregates metrics across every deployment in that environment, which helps you watch a rollout or compare trends across the whole environment. Deployment scope restricts metrics to a single deployment ID for diagnosing one version in isolation.
Inference volume
Tracks the request rate over time, segmented by HTTP status code:
- 2xx: 🟢 Successful requests
- 4xx: 🟡 Client errors
- 5xx: 🔴 Server errors (includes model prediction exceptions)
For non-HTTP models and Chains (WebSockets and gRPC), the graph reflects the status codes of those protocols instead. For a full list of the WebSocket close codes surfaced here, see WebSocket status codes.
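As a sketch of how you might act on this graph's data, the error rate can be derived from per-status-class request counts. The counts and the 1% threshold below are hypothetical, not values from Baseten:

```python
# Compute error rates from per-status-class request counts, as shown
# in the Inference volume graph. All numbers here are hypothetical.
counts = {"2xx": 9_420, "4xx": 130, "5xx": 45}

total = sum(counts.values())
server_error_rate = counts["5xx"] / total  # fraction of 5xx responses
client_error_rate = counts["4xx"] / total  # fraction of 4xx responses

# A sustained 5xx rate above an arbitrary threshold (1% here) usually
# warrants investigation via logs and the Restarts graph.
if server_error_rate > 0.01:
    print(f"5xx rate {server_error_rate:.2%} exceeds threshold")
```

A rising 4xx band usually points at callers (bad payloads, auth), while a rising 5xx band points at the model or its infrastructure.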
Response time
Measured at different percentiles (p50, p90, p95, p99):
- End-to-end response time: Includes cold starts, queuing, and inference (excludes client-side latency). Reflects real-world performance.
- Inference time: Covers only model execution, including pre/post-processing. Useful for optimizing single-replica performance.
- Time to first byte: The time-to-first-byte distribution, including any queueing and routing time. A proxy for time to first token (TTFT).
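Percentiles summarize the latency distribution rather than the average, which hides tail behavior. A minimal sketch of how p50/p90/p95/p99 are derived from raw samples, using made-up latency values:

```python
import statistics

# Hypothetical end-to-end latency samples in milliseconds. The two
# large values stand in for cold starts, which dominate the tail.
latencies_ms = [120, 135, 140, 150, 155, 160, 180, 210, 450, 1200]

# statistics.quantiles with n=100 returns the 1st..99th percentile
# cut points; index 49 is p50, index 98 is p99.
cuts = statistics.quantiles(latencies_ms, n=100)
p50, p90, p95, p99 = cuts[49], cuts[89], cuts[94], cuts[98]

print(f"p50={p50:.0f}ms p90={p90:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
```

Note how p99 is pulled far above p50 by a handful of slow requests; this is why the dashboard reports tail percentiles rather than a mean.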
Request and response size
Measured at different percentiles (p50, p90, p95, p99):
- Request size: Tracks the request size distribution. A proxy for input tokens.
- Response size: Tracks the response size distribution. A proxy for generated tokens.
Replicas
Tracks the number of active and starting replicas:
- Starting: Waiting for resources or loading the model.
- Active: Ready to serve requests.
- For development deployments, a replica is considered active while running the live reload server.
Traffic ready replicas
Counts pods by their Kubernetes Ready condition over time. A pod is ready once it passes the readiness probe and Baseten starts routing traffic to it.
- ready: 🟢 Pods that pass the readiness probe and are receiving traffic.
- not ready: 🔴 Pods that are starting up, failing their probe, or shutting down.
- unknown: 🟡 Pods whose Ready condition can’t be determined.
The Replicas graph above only counts active replicas (ready pods); this graph adds visibility into the not ready and unknown bands. A sustained not ready band during steady state usually means readiness probes are failing or replicas are slow to terminate.
This graph rolls out behind a feature flag. Contact your account team if it’s not yet visible for your deployments.
Restarts
Tracks the cumulative number of times the model container has been restarted. Restarts are typically caused by application crashes, out-of-memory kills, or failed liveness probes. Frequent restarts usually indicate one of:
- A crash in `load()` or in your model code.
- An out-of-memory event: check the Memory usage graph.
- A liveness probe failing under load: review `restart_threshold_seconds` and any custom health check logic.
This graph rolls out behind a feature flag. Contact your account team if it’s not yet visible for your deployments.
Concurrent requests
Tracks the total number of in-progress inference requests across replicas, including both requests currently being serviced and requests waiting in the queue. This is the primary signal that drives autoscaling decisions. For the full metric definition and labels, see `baseten_concurrent_requests`.
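Conceptually, an autoscaler compares total concurrency against a per-replica concurrency target and clamps the result to the configured replica range. The sketch below is illustrative of that idea, not Baseten's exact implementation; the function and parameter names are hypothetical:

```python
import math

def desired_replicas(concurrent_requests: int,
                     concurrency_target: int,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Illustrative scaling rule: provision enough replicas so each one
    handles at most `concurrency_target` in-flight requests, clamped to
    the [min_replicas, max_replicas] range."""
    needed = math.ceil(concurrent_requests / concurrency_target)
    return max(min_replicas, min(needed, max_replicas))

# 120 in-flight requests with a target of 16 per replica -> 8 replicas.
print(desired_replicas(120, concurrency_target=16,
                       min_replicas=1, max_replicas=10))  # -> 8
```

This is why the Concurrent requests graph is the first place to look when replica counts surprise you: the other graphs explain demand, but this one is the scaling input.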
CPU usage and memory
Displays resource utilization across replicas. Metrics are averaged and may not capture short spikes.
Considerations:
- High CPU/memory usage: May degrade performance. Consider upgrading to a larger instance type.
- Low CPU/memory usage: Possible overprovisioning. Switch to a smaller instance to reduce costs.
GPU usage and memory
Shows GPU utilization across replicas.
- GPU usage: Percentage of time a kernel function occupies the GPU.
- GPU memory: Total memory used.
Considerations:
- High GPU load: Can slow inference. Check response time metrics.
- High memory usage: May cause out-of-memory failures.
- Low utilization: May indicate overprovisioning. Consider a smaller GPU.
Async queue metrics
- Time in Async Queue: Time spent in the async queue before execution (p50, p90, p95, p99).
- Async Queue Size: Number of queued async requests.
Considerations:
- Large queue size indicates requests are queued faster than they are processed.
- To improve async throughput, increase the max replicas or adjust autoscaling concurrency.
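One way to reason about a growing async queue is to compare the arrival rate against aggregate processing capacity: the queue grows exactly when arrivals outpace drain rate. A back-of-the-envelope sketch, with all numbers hypothetical:

```python
# Rough capacity check for an async queue. The queue grows whenever
# requests arrive faster than the replicas can drain them.
arrival_rate = 50.0          # requests/second entering the queue (hypothetical)
service_time = 0.8           # seconds of inference per request (hypothetical)
replicas = 6
concurrency_per_replica = 4  # in-flight requests each replica handles

# Aggregate drain rate in requests/second.
throughput = replicas * concurrency_per_replica / service_time

if arrival_rate > throughput:
    # Backlog grows by the difference: add replicas or raise concurrency.
    print(f"backlog grows at {arrival_rate - throughput:.1f} req/s")
else:
    print(f"headroom: {throughput - arrival_rate:.1f} req/s")
```

In this example capacity is 30 req/s against 50 req/s of arrivals, so the queue grows by 20 req/s until more replicas come online.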
Using metrics for autoscaling
Use these metrics to diagnose autoscaling behavior and tune your settings.
Key metrics to watch
| Metric | What it tells you |
|---|---|
| Concurrent requests | Shows total demand (queued + active). This is the signal driving autoscaling. |
| Replicas (active vs starting) | Shows scaling activity. Large gaps indicate cold start delays. |
| Inference volume | Shows traffic patterns. Use to identify if you have noisy, bursty, or steady traffic. |
| Response time (p95, p99) | Shows if scaling is keeping up. Spikes aligned with replica changes indicate thrash. |
| Async queue size | Shows backpressure. Growing queue means you need more capacity. |
Diagnosing autoscaling issues
| You see… | Likely cause | Fix |
|---|---|---|
| Latency spikes aligned with replica count changes | Oscillation (thrash) | Increase scale-down delay |
| Replicas at max, latency still degrading | Insufficient capacity | Increase max replicas or concurrency target |
| Large gap between active and starting replicas | Cold start delays | Increase min replicas, check image optimization |
| Traffic high but replicas staying low | Concurrency target too high | Lower concurrency target or target utilization |
| Replicas scaling down too quickly | Scale-down delay too short | Increase scale-down delay |
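The first symptom in the table (thrash) can be spotted programmatically by checking where replica-count changes coincide with latency spikes in aligned time series. A rough heuristic sketch; the function name and the spike threshold are arbitrary:

```python
def detect_thrash(replica_counts, p95_latencies_ms, latency_spike_ms=1000):
    """Return time indices where a replica-count change coincides with a
    p95 latency spike -- a heuristic sign of autoscaling oscillation."""
    suspects = []
    for t in range(1, len(replica_counts)):
        changed = replica_counts[t] != replica_counts[t - 1]
        spiked = p95_latencies_ms[t] >= latency_spike_ms
        if changed and spiked:
            suspects.append(t)
    return suspects

# Replicas oscillating 4 -> 2 -> 4 while p95 spikes at the same steps.
replicas = [4, 4, 2, 4, 2, 4]
p95 = [300, 320, 1400, 1250, 1500, 310]
print(detect_thrash(replicas, p95))  # -> [2, 3, 4]
```

A cluster of flagged indices during steady traffic suggests increasing the scale-down delay, per the table above; isolated hits during a traffic ramp are usually just cold starts.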