
Inference volume
Tracks the request rate over time, segmented by HTTP status codes:
- 2xx: 🟢 Successful requests
- 4xx: 🟡 Client errors
- 5xx: 🔴 Server errors (includes model prediction exceptions)
Note that for non-HTTP models and Chains (WebSockets and gRPC), the chart shows
the status codes defined by those protocols instead of HTTP status codes.
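As a rough illustration of the segmentation described above, the following sketch groups a list of HTTP status codes into their classes and computes the share of server errors. The function names and the sample data are hypothetical and only mirror the idea; they are not part of any platform API.

```python
from collections import Counter


def status_class(code: int) -> str:
    """Map an HTTP status code to its class label, e.g. 503 -> '5xx'."""
    return f"{code // 100}xx"


def volume_by_class(status_codes):
    """Count requests per status class for one time window."""
    return Counter(status_class(code) for code in status_codes)


def class_share(counts, cls: str) -> float:
    """Fraction of all requests that fall into the given status class."""
    total = sum(counts.values())
    return counts.get(cls, 0) / total if total else 0.0


# Hypothetical status codes observed in one time window.
sample = [200, 200, 201, 404, 500, 200, 429, 503]
counts = volume_by_class(sample)
server_error_share = class_share(counts, "5xx")

print(dict(counts))
print(server_error_share)
```

A rising 5xx share with flat overall volume points at the model or server, while a rising 4xx share usually points at the callers.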
Response time
Measured at different percentiles (p50, p90, p95, p99):
- End-to-end response time: Includes cold starts, queuing, and inference (excludes client-side latency). Reflects real-world performance.
- Inference time: Covers only model execution, including pre/post-processing. Useful for optimizing single-replica performance.
- Time to first byte: Tracks the distribution of time until the first byte of the response is sent, including any queuing and routing time. A proxy for time to first token (TTFT).
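To make the percentile labels concrete, here is a minimal sketch that computes p50, p90, p95, and p99 from a list of latency samples using the nearest-rank method. The method and the sample values are assumptions chosen for illustration; the dashboard may aggregate differently.

```python
def percentile(samples, p: int):
    """Nearest-rank percentile for an integer p in 1..100.

    Returns the smallest sample such that at least p% of all
    samples are less than or equal to it.
    """
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical end-to-end response times in milliseconds.
# The 3000 ms sample stands in for a request that hit a cold start.
latencies_ms = [120, 95, 110, 3000, 105, 98, 130, 102, 115, 99]

summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 90, 95, 99)}
print(summary)
```

In this sample the median stays near 100 ms while p95 and p99 jump to the slow request, which is why the higher percentiles are the ones to watch for cold starts and queuing.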
Request and response size
Measured at different percentiles (p50, p90, p95, p99):
- Request size: Tracks the request size distribution. A proxy for input tokens.
- Response size: Tracks the response size distribution. A proxy for generated tokens.
Replicas
Tracks the number of active and starting replicas:
- Starting: Waiting for resources or loading the model.
- Active: Ready to serve requests.
- For development deployments, a replica is considered active while running the live reload server.
CPU usage and memory
Displays resource utilization across replicas. Metrics are averaged and may not capture short spikes.
Considerations:
- High CPU/memory usage: May degrade performance—consider upgrading to a larger instance type.
- Low CPU/memory usage: Possible overprovisioning—switch to a smaller instance to reduce costs.
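The caveat that averaged metrics may not capture short spikes can be seen in a small sketch. The window size and the CPU samples below are hypothetical and do not reflect how the platform actually aggregates its data.

```python
def window_average(samples, window: int):
    """Average consecutive, non-overlapping windows of samples."""
    averages = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Hypothetical per-second CPU utilization (%) with two brief spikes.
cpu_percent = [20, 22, 18, 95, 21, 19, 20, 23, 97, 20, 22, 21]

averaged = window_average(cpu_percent, window=6)
print(max(cpu_percent))  # peak in the raw samples
print(averaged)          # what an averaged chart would plot
```

The raw samples peak above 90%, but both averaged points stay below 35%, so a replica can be briefly saturated while the chart looks comfortable.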
GPU usage and memory
Shows GPU utilization across replicas.
- GPU usage: Percentage of time a kernel function occupies the GPU.
- GPU memory: Total memory used.
Considerations:
- High GPU load: Can slow inference—check response time metrics.
- High memory usage: May cause out-of-memory failures.
- Low utilization: May indicate overprovisioning—consider a smaller GPU.
Async Queue Metrics
- Time in Async Queue: Time spent in the async queue before execution (p50, p90, p95, p99).
- Async Queue Size: Number of queued async requests.
Considerations:
- A large or growing queue size indicates that requests are being queued faster than they are processed.
- To improve async throughput, increase the max replicas or adjust autoscaling concurrency.
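The relationship between queue growth, replicas, and concurrency can be sketched with a simple steady-state model. It assumes every replica processes a fixed number of requests in parallel and that each request takes the same average time; the function names and numbers are hypothetical, and the model is not the platform's autoscaling logic.

```python
import math


def queue_growth_per_second(arrival_rate, replicas, concurrency, avg_inference_s):
    """Requests added to the queue per second (negative means it drains).

    Assumes each replica completes `concurrency` requests every
    `avg_inference_s` seconds.
    """
    capacity = replicas * concurrency / avg_inference_s
    return arrival_rate - capacity


def replicas_needed(arrival_rate, concurrency, avg_inference_s):
    """Smallest replica count whose capacity covers the arrival rate."""
    return math.ceil(arrival_rate * avg_inference_s / concurrency)


# Hypothetical workload: 12 async requests/s, 2 s per request,
# 2 replicas that each handle 4 requests concurrently.
growth = queue_growth_per_second(12, replicas=2, concurrency=4, avg_inference_s=2.0)
needed = replicas_needed(12, concurrency=4, avg_inference_s=2.0)

print(growth)
print(needed)
```

Under these assumptions the two replicas clear 4 requests per second, so the queue grows by 8 requests per second until the deployment can scale to 6 replicas or each replica accepts more concurrent requests.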