The metrics tab on the model dashboard shows charts for each deployment of your model. These metrics help you understand the relationship between model load and model performance.

Metrics are shown per deployment. Use the dropdowns on the metrics page to switch between deployments and time ranges.

Inference volume

Inference volume shows the rate of requests to the model over time. It’s broken out into 2xx, 4xx, and 5xx, representing ranges of HTTP response status codes. Any exceptions raised in your model’s predict code are counted under the 5xx responses.
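
For example, here is a minimal sketch of a model/model.py in the standard Truss model format; the prompt field and echo logic are purely illustrative. An unhandled exception raised in predict surfaces on this chart as a 5xx response.

    # model/model.py (minimal illustrative Truss model)
    class Model:
        def load(self):
            # Load weights, tokenizers, etc. here (omitted in this sketch).
            self._model = None

        def predict(self, model_input: dict) -> dict:
            if "prompt" not in model_input:
                # Raising here is counted under 5xx on the inference volume chart.
                # Returning an error payload instead would count as a 2xx response.
                raise ValueError("missing required field: prompt")
            return {"output": f"echo: {model_input['prompt']}"}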

Some older models may not show the 2xx, 4xx, and 5xx breakdown. If you don’t see this breakdown:

  • Get the latest version of Truss with pip install --upgrade truss.
  • Re-deploy your model with truss push.
  • Promote the updated model to production after testing it.

Response time

  • End-to-end response time includes time for cold starts, queuing, and inference (but not client-side latency). This most closely mirrors the performance of your model as experienced by your users.
  • Inference time covers only the time spent running the model, including pre- and post-processing. This is useful for optimizing the performance of your model code at the single replica level.

Response time is broken out into p50, p90, p95, and p99, referring to the 50th, 90th, 95th, and 99th percentiles of response times.
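
To make the percentile breakdown concrete, the short sketch below computes the same statistics from a list of response times with NumPy; the numbers are made up for illustration.

    import numpy as np

    # Hypothetical end-to-end response times for one deployment, in seconds.
    response_times_s = [0.21, 0.25, 0.24, 0.30, 0.27, 0.95, 0.26, 0.23, 1.80, 0.28]

    # pN is the value below which N% of requests completed.
    for p in (50, 90, 95, 99):
        print(f"p{p}: {np.percentile(response_times_s, p):.2f}s")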

Replicas

The replicas chart shows the number of replicas in both active and starting up states:

  • The starting up count includes replicas that are waiting for resources to be available and replicas that are in the process of loading the model.
  • The active count includes replicas that are ready to serve requests.

For development deployments, the replica shows as active while loading the model and running the live reload server.

CPU usage and memory

These charts show the CPU and memory usage of your deployment. If you have multiple replicas, they show the average across all your replicas. Note that this data is not instantaneous, so sharp spikes in usage may not appear on the graph.

What to look out for:

  • When the load on the CPU or memory gets too high, the performance of your deployment may degrade. Consider updating your model’s instance type to one with more memory or CPU cores.
  • If CPU load and memory usage are consistently very low, you may be using an instance with too many vCPU cores and too much RAM. If you’re using a CPU-only instance, or a GPU instance where a smaller instance type with the same GPU is available, you may be able to save money by switching.

GPU usage and memory

These charts show the GPU usage and GPU memory usage of your deployment. If you have multiple replicas, they show the average across all your replicas. Note that this data is not instantaneous, so sharp spikes in usage may not appear on the graph.

In technical terms, the GPU usage is a measure of the fraction of time within a cycle that a kernel function is occupying GPU resources.
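
For reference, the same kind of utilization number can be sampled from inside a replica with NVIDIA’s NVML bindings. This is a minimal sketch assuming the pynvml package is installed and a GPU is visible; the platform’s own sampling may differ from what you observe this way.

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first visible GPU

    # utilization.gpu: percent of time in the last sample period that a kernel was running.
    # utilization.memory: percent of time that device memory was being read or written.
    utilization = pynvml.nvmlDeviceGetUtilizationRates(handle)
    memory = pynvml.nvmlDeviceGetMemoryInfo(handle)

    print(f"GPU usage: {utilization.gpu}%")
    print(f"GPU memory: {memory.used / 1024**3:.1f} / {memory.total / 1024**3:.1f} GiB")

    pynvml.nvmlShutdown()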

What to look out for:

  • When the load on the GPU gets too high, model inference can slow down. Look for corresponding increases in inference time.
  • When GPU memory usage gets too high, requests can fail with out-of-memory errors.
  • If GPU load and memory usage are consistently very low, you may be using an overpowered GPU and could save money with a less powerful card.

Time in async queue

The time in async queue chart shows the time in seconds that an async predict request spent in the async queue before getting processed by the model. This chart is broken out into p50, p90, p95, and p99, referring to the 50th, 90th, 95th, and 99th percentiles of time spent in the async queue.

Async queue size

The async queue size chart shows the number of async predict requests that are currently queued to be executed.

What to look out for:

  • If the queue size is large, async requests are being queued faster than they can be executed. In this case, requests may take longer to complete or expire after the user-specified max_time_in_queue_seconds (see the sketch after this list).
  • To increase the number of async requests your model can process, increase the max number of replicas or concurrency target in your autoscaling settings.
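
As a concrete illustration of the max_time_in_queue_seconds expiry mentioned above, the sketch below submits an async predict request with an explicit queue timeout. The endpoint URL, model ID, and request body fields shown here are assumptions for illustration only; check the async inference documentation for the exact request format.

    import os
    import requests

    # Hypothetical model ID and endpoint; the real URL format may differ.
    model_id = "abcd1234"
    url = f"https://model-{model_id}.api.baseten.co/production/async_predict"

    resp = requests.post(
        url,
        headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
        json={
            "model_input": {"prompt": "hello"},  # illustrative input payload
            "max_time_in_queue_seconds": 600,    # expire if still queued after 10 minutes
        },
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())  # typically includes an ID for retrieving the async result later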