The Metrics tab in the model dashboard provides deployment-specific insights into model load and performance. Use the dropdowns to filter by environment or deployment and to select a time range.

Inference volume

Tracks the request rate over time, segmented by HTTP status codes:
  • 2xx: 🟢 Successful requests
  • 4xx: 🟡 Client errors
  • 5xx: 🔴 Server errors (includes model prediction exceptions)
Note that for non-HTTP models and Chains (WebSockets and gRPC), the chart reflects the equivalent status codes of those protocols.
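As an illustration of how these buckets combine, here is a minimal sketch that computes success and server-error rates from status-code counts. The counts themselves are hypothetical, not values from a real deployment:

```python
# Hypothetical per-minute request counts by status-code class.
counts = {"2xx": 980, "4xx": 15, "5xx": 5}

total = sum(counts.values())
# 5xx includes model prediction exceptions, so this is the server-error rate.
server_error_rate = counts["5xx"] / total
success_rate = counts["2xx"] / total

print(f"success: {success_rate:.1%}, server errors: {server_error_rate:.1%}")
# success: 98.0%, server errors: 0.5%
```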

Response time

Measured at different percentiles (p50, p90, p95, p99):
  • End-to-end response time: Includes cold starts, queuing, and inference (excludes client-side latency). Reflects real-world performance.
  • Inference time: Covers only model execution, including pre/post-processing. Useful for optimizing single-replica performance.
  • Time to first byte: Tracks the time-to-first-byte distribution, including any queueing and routing time. A proxy for time to first token (TTFT).
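The percentile cuts used throughout this tab can be reproduced from raw latency samples. A sketch using Python's standard library, with made-up sample values:

```python
import statistics

# Hypothetical end-to-end latencies in milliseconds.
latencies = [12, 15, 18, 22, 25, 30, 35, 42, 55, 70, 90, 120, 180, 250, 400, 900]

# quantiles(n=100) returns the 99 percentile cut points p1..p99.
cuts = statistics.quantiles(latencies, n=100)
p50, p90, p95, p99 = cuts[49], cuts[89], cuts[94], cuts[98]
print(p50, p90, p95, p99)
```

Tail percentiles (p95, p99) are far more sensitive to outliers like cold starts than p50, which is why the dashboard shows all four.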

Request and response size

Measured at different percentiles (p50, p90, p95, p99):
  • Request size: Tracks the request size distribution. A proxy for input tokens.
  • Response size: Tracks the response size distribution. A proxy for generated tokens.
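Since these sizes are only proxies, a rough token estimate can be derived from them. The 4-bytes-per-token ratio below is a common rule of thumb for English-like text, not a measured value for any particular model:

```python
def approx_tokens(size_bytes: int, bytes_per_token: float = 4.0) -> int:
    """Rough token count from payload size (English-ish text heuristic)."""
    return round(size_bytes / bytes_per_token)

# A 2 KiB response body is very roughly ~512 generated tokens.
print(approx_tokens(2048))  # 512
```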

Replicas

Tracks the number of active and starting replicas:
  • Starting: Waiting for resources or loading the model.
  • Active: Ready to serve requests.
  • For development deployments, a replica is considered active while running the live reload server.

Concurrent requests

Tracks the total number of in-progress inference requests across replicas, including both requests currently being serviced and requests waiting in the queue. This is the primary signal that drives autoscaling decisions. For the full metric definition and labels, see baseten_concurrent_requests.
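To see how this signal could drive scaling, here is a hedged sketch of the usual ceiling rule: replicas needed so that each handles at most the concurrency target, clamped to the min/max replica bounds. The function and its numbers are illustrative, not Baseten's actual implementation:

```python
import math

def desired_replicas(concurrent_requests: int, concurrency_target: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed so each handles at most `concurrency_target` requests."""
    needed = math.ceil(concurrent_requests / concurrency_target)
    return max(min_replicas, min(max_replicas, needed))

# 45 in-flight requests, each replica targeted at 10 concurrent requests.
print(desired_replicas(45, 10, min_replicas=1, max_replicas=8))  # 5
```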

CPU usage and memory

Displays resource utilization across replicas. Metrics are averaged and may not capture short spikes.

Considerations:

  • High CPU/memory usage: May degrade performance—consider upgrading to a larger instance type.
  • Low CPU/memory usage: Possible overprovisioning—switch to a smaller instance to reduce costs.

GPU usage and memory

Shows GPU utilization across replicas.
  • GPU usage: Percentage of time at least one kernel is executing on the GPU.
  • GPU memory: Total memory used.

Considerations:

  • High GPU load: Can slow inference—check response time metrics.
  • High memory usage: May cause out-of-memory failures.
  • Low utilization: May indicate overprovisioning—consider a smaller GPU.

Async queue metrics

  • Time in async queue: Time spent in the async queue before execution (p50, p90, p95, p99).
  • Async queue size: Number of queued async requests.

Considerations:

  • A large queue size indicates that requests arrive faster than they are processed.
  • To improve async throughput, increase the max replicas or adjust autoscaling concurrency.
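The first consideration follows from simple rate accounting: if requests arrive faster than replicas drain them, the queue grows linearly. A toy sketch, with all rates hypothetical:

```python
def queue_size_over_time(arrival_rate: float, service_rate: float,
                         seconds: int) -> list[float]:
    """Queue depth per second when arrivals exceed processing capacity."""
    size = 0.0
    history = []
    for _ in range(seconds):
        size = max(0.0, size + arrival_rate - service_rate)
        history.append(size)
    return history

# 12 requests/s arriving, 10 requests/s processed: queue grows by 2/s.
print(queue_size_over_time(12, 10, 5))  # [2.0, 4.0, 6.0, 8.0, 10.0]
```

Raising max replicas increases the service rate; lowering the autoscaling concurrency target makes scaling react to a smaller backlog.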

Using metrics for autoscaling

Use these metrics to diagnose autoscaling behavior and tune your settings.

Key metrics to watch

  • Concurrent requests: Shows total demand (queued + active). This is the signal driving autoscaling.
  • Replicas (active vs. starting): Shows scaling activity. Large gaps indicate cold start delays.
  • Inference volume: Shows traffic patterns. Use it to identify noisy, bursty, or steady traffic.
  • Response time (p95, p99): Shows whether scaling is keeping up. Spikes aligned with replica changes indicate thrash.
  • Async queue size: Shows backpressure. A growing queue means you need more capacity.

Diagnosing autoscaling issues

  • Latency spikes aligned with replica count changes: oscillation (thrash). Fix: increase the scale-down delay.
  • Replicas at max but latency still degrading: insufficient capacity. Fix: increase max replicas or the concurrency target.
  • Large gap between active and starting replicas: cold start delays. Fix: increase min replicas and check image optimization.
  • Traffic high but replicas staying low: concurrency target too high. Fix: lower the concurrency target or target utilization.
  • Replicas scaling down too quickly: scale-down delay too short. Fix: increase the scale-down delay.
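A rough way to check for thrash programmatically is to flag p95 spikes that land within a short window of a replica-count change. The threshold, window, and series below are arbitrary illustrations, not dashboard defaults:

```python
def thrash_suspects(p95_ms, replicas, spike_ms=500, window=1):
    """Return indices where a p95 spike coincides with a replica change."""
    change_points = {i for i in range(1, len(replicas))
                     if replicas[i] != replicas[i - 1]}
    suspects = []
    for i, latency in enumerate(p95_ms):
        if latency >= spike_ms and any(abs(i - c) <= window for c in change_points):
            suspects.append(i)
    return suspects

# Hypothetical per-minute series: latency spikes right as replicas flap.
p95 = [120, 130, 650, 140, 700, 125]
reps = [3, 3, 2, 3, 2, 2]
print(thrash_suspects(p95, reps))  # [2, 4]
```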
For solutions to common autoscaling problems, see Autoscaling troubleshooting.