- Control traffic and restarts by configuring failure thresholds to suit your needs.
- Define replica health with custom logic (e.g. fail after a certain number of 500s or a specific CUDA error).
- Traffic is immediately stopped from reaching the failing replica.
- The failing replica is restarted.
Understanding readiness vs. liveness
Baseten uses two types of Kubernetes health probes that run continuously after your container starts:
Readiness probe answers “Can I handle requests right now?” When it fails, Kubernetes stops sending traffic to the container but doesn’t restart it. Use this to prevent traffic during startup or temporary unavailability. The failure threshold is controlled by stop_traffic_threshold_seconds.
Liveness probe answers “Am I healthy enough to keep running?” When it fails,
Kubernetes restarts the container. Use this to recover from deadlocks or hung
processes. The failure threshold is controlled by restart_threshold_seconds.
For most servers, using the same endpoint (like /health) for both probes is
sufficient. The key difference is the action taken: readiness controls traffic
routing, while liveness controls container lifecycle.
Both probes wait before starting checks to allow your server time to initialize.
Configure this delay with restart_check_delay_seconds.
Custom health checks can be implemented in two ways:
- Configuring thresholds for when health check failures should stop traffic to or restart a replica.
- Writing custom health check logic to define how replica health is determined.
Configuring health checks
Parameters
You can customize the behavior of health checks on your deployments by setting the following parameters:
- stop_traffic_threshold_seconds: the duration that health checks must continuously fail before traffic to the failing replica is stopped. Must be between 30 and 1800 seconds, inclusive.
- restart_check_delay_seconds: how long to wait before running health checks. Must be between 0 and 1800 seconds, inclusive.
- restart_threshold_seconds: the duration that health checks must continuously fail before triggering a restart of the failing replica. Must be between 30 and 1800 seconds, inclusive. The combined value of restart_check_delay_seconds and restart_threshold_seconds must not exceed 1800 seconds.
Model and custom server deployments
Configure health checks in your config.yaml.
config.yaml
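As a sketch, the thresholds from the parameters above could be set like this; the `runtime.health_checks` nesting is an assumption about the config layout, and the example values are illustrative:

```yaml
# Sketch: health check thresholds in config.yaml.
# The runtime.health_checks nesting is assumed; parameter names
# come from the "Parameters" section above.
runtime:
  health_checks:
    restart_check_delay_seconds: 120     # wait 2 min after startup before checking
    restart_threshold_seconds: 300       # restart after 5 min of continuous failures
    stop_traffic_threshold_seconds: 120  # stop traffic after 2 min of failures
```

Note that 120 + 300 = 420 seconds stays well under the 1800-second cap on restart_check_delay_seconds plus restart_threshold_seconds.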
Chains
Use remote_config to configure health checks for your chainlet classes.
chain.py
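For illustration, a sketch of passing health check thresholds through a chainlet's remote_config; the `health_checks` option on `ChainletOptions` and the `HealthChecks` class are assumptions about the API surface, while the threshold parameters come from this page:

```python
import truss_chains as chains
from truss.base import truss_config

class MyChainlet(chains.ChainletBase):
    # Sketch: health_checks on ChainletOptions and truss_config.HealthChecks
    # are assumed names; the three threshold parameters are from this doc.
    remote_config = chains.RemoteConfig(
        options=chains.ChainletOptions(
            health_checks=truss_config.HealthChecks(
                restart_check_delay_seconds=120,
                restart_threshold_seconds=300,
                stop_traffic_threshold_seconds=120,
            )
        )
    )

    async def run_remote(self, prompt: str) -> str:
        return prompt
```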
Writing custom health checks
You can write custom health checks in both model deployments and chain deployments. Custom health checks are currently not supported in development deployments.
Custom health checks in models
model.py
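As a minimal sketch, a model can expose an is_healthy() method that the platform's probes poll; the `_is_healthy` flag below is illustrative:

```python
# Minimal sketch of a model with a custom health check.
# is_healthy() is polled by the readiness/liveness probes; custom
# health checks only start running after load() completes.
class Model:
    def __init__(self, **kwargs):
        self._is_healthy = False

    def load(self):
        # Load weights, warm caches, etc.
        self._is_healthy = True

    def is_healthy(self) -> bool:
        # Return True if this replica can serve requests.
        return self._is_healthy

    def predict(self, inputs):
        return {"echo": inputs}
```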
Custom health checks in chains
Health checks can be customized for each chainlet in your chain.
chain.py
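The same pattern applies per chainlet. A standalone sketch (in a real chain this class would subclass truss_chains.ChainletBase; the base class is omitted here so the sketch runs on its own):

```python
# Sketch of a chainlet-style class with a custom is_healthy() method.
# The _healthy flag flips to False if work in run_remote raises,
# marking this replica unhealthy on the next probe.
class MyChainlet:
    def __init__(self):
        self._healthy = True

    def is_healthy(self) -> bool:
        return self._healthy

    async def run_remote(self, prompt: str) -> str:
        try:
            return prompt.upper()  # stand-in for real work
        except Exception:
            self._healthy = False
            raise
```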
Health checks in action
Identifying 5xx errors
You might create a custom health check to identify 5xx errors like the following:
model.py
Example health check failure log line
Example restart log line
FAQs
Is there a rule of thumb for configuring thresholds for stopping traffic and restarting?
It depends on your health check implementation. If your health check relies on conditions that only change during inference (e.g., _is_healthy is set in predict), restarting before stopping traffic is generally better, as it allows recovery without disrupting traffic.
Stopping traffic first may be preferable if a failing replica is actively degrading performance or causing inference errors, as it prevents the failing replica from affecting the overall deployment while allowing time for debugging or recovery.
When should I configure restart_check_delay_seconds?
Configure restart_check_delay_seconds to allow replicas sufficient time to initialize after deployment or a restart. This delay helps reduce unnecessary restarts, particularly for services with longer startup times.
Why am I seeing two health check failure logs in my logs?
These refer to two separate health checks we run every 10 seconds:
- One to determine when to stop traffic to a replica.
- The other to determine when to restart a replica.
Does stopped traffic or replica restarts affect autoscaling?
Yes, both can impact autoscaling. If traffic stops or replicas restart, the remaining replicas handle more load. If the load exceeds the concurrency target during the autoscaling window, additional replicas are spun up. Similarly, when traffic stabilizes, excess replicas are scaled down after the scale down delay. See here for more details on autoscaling.
How does billing get affected?
You are billed for the uptime of your deployment. This includes the time a replica is running, even if it is failing health checks, until it scales down.
Will failing health checks cause my deployment to stay up forever?
No. If your deployment is configured with a scale down delay and the minimum number of replicas is set to 0, the replicas will scale down once the model is no longer receiving traffic for the duration of the scale down delay. This applies even if the replicas are failing health checks. See here for more details on autoscaling.
What happens when my deployment is loading?
When your deployment is loading, your custom health check will not be running. Once load() is completed, we’ll start using your custom is_healthy() health check.