https://app.baseten.co/metrics. For the endpoint URL, authentication, scrape interval, and supported integrations (Prometheus, Datadog, Grafana, New Relic), see the export overview.
How to read this page
Each metric is listed by its Prometheus name, with:
- Type: counter (a cumulative total that only increases), gauge (a point-in-time value), or histogram (a distribution you can compute percentiles from).
- Labels: the dimensions you can filter and group by. Common labels are model_id, model_name, and deployment_id; environment and rollout_phase appear only for deployments tied to an environment. Some metrics, such as the engine metrics, use a smaller label set.
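As an illustration of filtering and grouping by labels, the following minimal sketch parses a few lines of Prometheus text exposition format and sums a gauge per model_id. The sample text, the label values, and the helper names parse_samples and sum_by are made up for this example. The parser is deliberately simplified: it ignores timestamps and does not unescape label values.

```python
import re
from collections import defaultdict

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: key="value" (escapes are kept as-is).
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


def sum_by(samples, metric, label):
    """Sum the values of one metric, grouped by a single label."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name == metric:
            totals[labels.get(label, "")] += value
    return dict(totals)


# Hypothetical scrape output; IDs and values are invented for illustration.
EXAMPLE = """\
# TYPE baseten_replicas_active gauge
baseten_replicas_active{model_id="abc123",model_name="demo",deployment_id="d1"} 2
baseten_replicas_active{model_id="abc123",model_name="demo",deployment_id="d2"} 1
"""

active = sum_by(parse_samples(EXAMPLE), "baseten_replicas_active", "model_id")
print(active)
```

In practice a monitoring backend does this grouping for you; the sketch only shows what "group by a label" means for the exported samples.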
Availability
Some metrics are emitted only for certain deployments:
- vLLM and SGLang metrics appear when Baseten detects that engine on your deployment. See vLLM and SGLang metrics.
- Pod health metrics (baseten_container_restarts_total and baseten_pod_readiness) roll out behind a feature flag. Contact your account team if they aren’t yet visible for your organization.
baseten_inference_requests_total
Cumulative number of requests to the model.
Type: counter
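Because this metric is a counter, its change over time is usually more informative than its raw value. A rough sketch of computing a per-second rate from successive scraped values, treating any decrease as a counter reset, might look like the following. The helper names and sample values are hypothetical, and the calculation is a simplification of what a query engine such as Prometheus does (for example, it does not extrapolate to window boundaries).

```python
def counter_increase(values):
    """Total increase across samples, treating any drop as a counter reset."""
    increase = 0.0
    for previous, current in zip(values, values[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter restarted from zero, so everything counted
            # since the reset is the current value itself.
            increase += current
    return increase


def per_second_rate(samples):
    """samples: list of (unix_timestamp, counter_value), oldest first."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    return counter_increase([value for _, value in samples]) / elapsed


# Invented samples, one per minute; the drop at t=120 simulates a reset.
samples = [(0, 100.0), (60, 160.0), (120, 20.0), (180, 80.0)]
rate = per_second_rate(samples)
print(f"{rate:.3f} requests/second")
```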
Labels:
The ID of the model.
The name of the model.
The ID of the deployment.
The status code of the response.
Whether the request was an async inference request.
The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
The phase of the deployment in the promote to production process. Empty if the deployment is not associated with an environment. Possible values: "promoting", "stable".
baseten_end_to_end_response_time_seconds
End-to-end response time in seconds.
Type: histogram
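Because this metric is a histogram, percentiles have to be estimated from cumulative bucket counts rather than read directly. The sketch below shows one common way to do that, using linear interpolation within a bucket in the same spirit as Prometheus's histogram_quantile. The bucket boundaries and counts are invented for illustration and are not Baseten's actual bucket layout, and the lower edge of the first bucket is assumed to be 0.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The final bucket is expected to have an upper bound of +Inf.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # No upper edge to interpolate toward, so fall back to
                # the highest finite bucket boundary.
                return lower_bound
            if count == lower_count:
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Invented cumulative counts for buckets of 0.1 s, 0.5 s, 1 s, and +Inf.
buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
p50 = histogram_quantile(0.5, buckets)
p95 = histogram_quantile(0.95, buckets)
print(p50, p95)
```

The estimate is only as precise as the bucket boundaries allow; a quantile that falls in the +Inf bucket is reported as the largest finite boundary.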
Labels:
The ID of the model.
The name of the model.
The ID of the deployment.
The status code of the response.
Whether the request was an async inference request.
The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
The phase of the deployment in the promote to production process. Empty if the deployment is not associated with an environment. Possible values: "promoting", "stable".
baseten_container_cpu_usage_seconds_total
Cumulative CPU time consumed by the container in core-seconds.
Type: counter
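Since the value is cumulative core-seconds, dividing the difference between two samples by the elapsed wall-clock time gives the average number of cores in use over that interval. A minimal sketch, with a hypothetical helper name and invented sample values:

```python
def average_cores(earlier, later):
    """Each argument is (unix_timestamp, cumulative_core_seconds)."""
    elapsed = later[0] - earlier[0]
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    used = later[1] - earlier[1]
    if used < 0:
        # A drop means the counter reset (for example, after a container
        # restart), so only the post-reset value is known.
        used = later[1]
    return used / elapsed


# 90 core-seconds consumed over 60 seconds of wall-clock time.
cores = average_cores((1_000, 5_400.0), (1_060, 5_490.0))
print(cores)
```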
Labels:
The ID of the model.
The name of the model.
The ID of the deployment.
The ID of the replica.
The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
The phase of the deployment in the promote to production process. Empty if the deployment is not associated with an environment. Possible values: "promoting", "stable".
baseten_replicas_active
Number of replicas ready to serve model requests.
Type: gauge
Labels:
The ID of the model.
The name of the model.
The ID of the deployment.
The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
The phase of the deployment in the promote to production process. Empty if the deployment is not associated with an environment. Possible values: "promoting", "stable".
baseten_replicas_starting
Number of replicas starting up—that is, either waiting for resources to be available or loading the model.
Type: gauge
Labels:
The ID of the model.
The name of the model.
The ID of the deployment.
The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
The phase of the deployment in the promote to production process. Empty if the deployment is not associated with an environment. Possible values: "promoting", "stable".
baseten_container_restarts_total
Cumulative number of times the model container has been restarted. Restarts are typically caused by application crashes, out-of-memory kills, or failed liveness probes. See custom health checks for how liveness affects restart behavior.
Type: counter
This metric rolls out behind a feature flag. Contact your account team if it’s not yet visible for your organization.
Labels:
The ID of the model.
The name of the model.
The ID of the deployment.
The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
The phase of the deployment in the promote to production process. Empty if the deployment is not associated with an environment. Possible values: "promoting", "stable".
baseten_pod_readiness
Number of pods grouped by their Kubernetes Ready condition. A pod with condition="true" is serving traffic; condition="false" means the pod is starting up, failing its readiness probe, or shutting down.
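One way to use this metric is to turn the per-condition pod counts into a single readiness ratio. The sketch below assumes you have already collected the gauge values for one deployment into a dictionary keyed by the condition label; the helper name and the counts are invented for illustration.

```python
def ready_fraction(pods_by_condition):
    """Fraction of pods whose Ready condition is "true", or None if no pods."""
    total = sum(pods_by_condition.values())
    if total == 0:
        return None
    return pods_by_condition.get("true", 0) / total


# Invented counts: three ready pods and one that is not ready.
fraction = ready_fraction({"true": 3, "false": 1, "unknown": 0})
print(fraction)
```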
Type: gauge
This metric rolls out behind a feature flag. Contact your account team if it’s not yet visible for your organization.
Labels:
The ID of the model.
The name of the model.
The ID of the deployment.
The Kubernetes Ready condition for the pods in this sample. Possible values:
- "true": Pods are ready and serving traffic.
- "false": Pods are starting up, failing readiness probes, or shutting down.
- "unknown": The Ready condition can’t be determined (for example, the kubelet hasn’t reported recently).
The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
The phase of the deployment in the promote to production process. Empty if the deployment is not associated with an environment. Possible values: "promoting", "stable".
baseten_container_cpu_memory_working_set_bytes
Working set memory usage of the container in bytes.
Type: gauge
Labels:
The ID of the model.
The name of the model.
The ID of the deployment.
The ID of the replica.
The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
The phase of the deployment in the promote to production process. Empty if the deployment is not associated with an environment. Possible values: "promoting", "stable".
baseten_request_size_bytes
Request size in bytes. Proxy for input tokens.
Type: histogram
Labels:
The ID of the model.
The name of the model.
The ID of the deployment.
The status code of the response.
Whether the request was an async inference request.
The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
The phase of the deployment in the promote to production process. Empty if the deployment is not associated with an environment. Possible values: "promoting", "stable".
baseten_response_size_bytes
Response size in bytes. Proxy for generated tokens.
Type: histogram
Labels:
The ID of the model.
The name of the model.
The ID of the deployment.
The status code of the response.
Whether the request was an async inference request.
The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
The phase of the deployment in the promote to production process. Empty if the deployment is not associated with an environment. Possible values: "promoting", "stable".
baseten_time_to_first_byte_seconds
Time to first byte/write in seconds. Proxy for time-to-first-token (TTFT).
Type: histogram
Labels:
The ID of the model.
The name of the model.
The ID of the deployment.
The status code of the response.
Whether the request was an async inference request.
The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
The phase of the deployment in the promote to production process. Empty if the deployment is not associated with an environment. Possible values: "promoting", "stable".
baseten_time_in_async_queue_seconds
Time async requests spend queued before processing.
Type: histogram
Labels:
The ID of the model.
The name of the model.
The ID of the deployment.
The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
The phase of the deployment in the promote to production process. Empty if the deployment is not associated with an environment. Possible values: "promoting", "stable".
baseten_async_queue_size
Number of queued async requests over time.
Type: gauge
Labels:
The ID of the model.
The name of the model.
The ID of the deployment.
The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
The phase of the deployment in the promote to production process. Empty if the deployment is not associated with an environment. Possible values: "promoting", "stable".
baseten_async_webhook_requests_total
Cumulative number of async inference webhook delivery requests sent.
Type: counter
Labels:
The ID of the model.
The name of the model.
The ID of the deployment.
The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
The phase of the deployment in the promote to production process. Empty if the deployment is not associated with an environment. Possible values: "promoting", "stable".
baseten_async_webhook_latency_seconds
Latency of async inference webhook delivery requests in seconds.
Type: histogram
Labels:
The ID of the model.
The name of the model.
The ID of the deployment.
The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
The phase of the deployment in the promote to production process. Empty if the deployment is not associated with an environment. Possible values: "promoting", "stable".
baseten_gpu_memory_used
GPU memory used in MiB.
Type: gauge
Labels:
The ID of the model.
The name of the model.
The ID of the deployment.
The ID of the replica.
The ID of the GPU.
The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
The phase of the deployment in the promote to production process. Empty if the deployment is not associated with an environment. Possible values: "promoting", "stable".
baseten_gpu_utilization
GPU utilization as a percentage (between 0 and 100).
Type: gauge
Labels:
The ID of the model.
The name of the model.
The ID of the deployment.
The ID of the replica.
The ID of the GPU.
The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
The phase of the deployment in the promote to production process. Empty if the deployment is not associated with an environment. Possible values: "promoting", "stable".
baseten_ongoing_websocket_connections
Number of ongoing websocket connections.
Type: gauge
Labels:
The ID of the model.
The name of the model.
The ID of the deployment.
The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
The phase of the deployment in the promote to production process. Empty if the deployment is not associated with an environment. Possible values: "promoting", "stable".
baseten_concurrent_requests
Total in-flight inference requests for a deployment, including both requests currently being serviced by replicas and requests waiting to be processed. Async inference requests are not included in this metric. This is the primary signal that drives autoscaling decisions.
Type: gauge
Labels:
The ID of the model.
The name of the model.
The ID of the deployment.
The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
The phase of the deployment in the promote to production process. Empty if the deployment is not associated with an environment. Possible values: "promoting", "stable".
vLLM and SGLang metrics
When Baseten detects vLLM or SGLang on your deployment, it scrapes your container’s /metrics endpoint and exports the engine’s native metrics alongside Baseten’s own. These also appear as graphs in the Metrics tab.
The engines define these metrics, not Baseten, and they change between versions. For the complete, current list, always refer to the official vLLM and SGLang metrics documentation.
Baseten normalizes these metrics across engine versions and exports the most useful ones. Some exported metrics include tokens per second, time to first token, KV cache usage, and the number of requests running or queued.
Baseten attaches the same two labels to every exported engine metric:
The ID of the deployment.
The ID of the replica.