Custom servers let you deploy any Docker-based model server by defining it in your Truss's config.yaml file.
Custom servers also support WebSocket deployments. For WebSocket-specific configuration, see the WebSockets documentation.
1. Configuring a Custom Server in config.yaml
Define a Docker-based server by adding a docker_server section:
config.yaml
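A minimal sketch of the shape of this section (the image, start command, and routes below are placeholders for your own server, not part of the original example):

```yaml
base_image:
  image: python:3.11-slim          # any image containing your server
docker_server:
  start_command: python server.py  # launches your HTTP server
  server_port: 8000                # port the server listens on
  predict_endpoint: /predict       # your server's inference route
  readiness_endpoint: /health      # must exist on your server
  liveness_endpoint: /health       # must exist on your server
resources:
  cpu: "1"
  memory: 2Gi
```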
Key Configurations
| Field | Required | Description |
|---|---|---|
| start_command | ✅ | Command to start the server |
| predict_endpoint | ✅ | Endpoint for serving requests (only one per model). This maps your server’s inference endpoint to Baseten’s prediction endpoint |
| server_port | ✅ | Port where the server runs |
| readiness_endpoint | ✅ | Used for Kubernetes readiness probes to determine when the container is ready to accept traffic. This must match an endpoint on your server |
| liveness_endpoint | ✅ | Used for Kubernetes liveness probes to determine if the container needs to be restarted. This must match an endpoint on your server |
Understanding Readiness vs. Liveness
Both probes run continuously after your container starts, but they serve different purposes:
- Readiness probe: Answers “Can I handle requests right now?” When it fails, Kubernetes stops sending traffic to the container (but doesn’t restart it). Use this to prevent traffic during startup or temporary unavailability.
- Liveness probe: Answers “Am I healthy enough to keep running?” When it fails, Kubernetes restarts the container. Use this to recover from deadlocks or hung processes.
Using the same endpoint (such as /health) for both is sufficient, as long as it accurately reflects whether your server can handle requests. The key difference is the action taken: readiness controls traffic routing, while liveness controls container lifecycle.
Initial delays: Both probes wait before starting checks to allow your server time to start up. See Custom health checks for configuration details.
Important: Docker Image Requirements
Container file system: The /app directory is used internally by Baseten. The model container runs as a non-root user in some configurations, but the /app and /tmp directories are still writable.
Endpoint Mapping
Mapping Rules:

| Baseten Endpoint | Maps To | Description |
|---|---|---|
| environments/{production}/predict | predict_endpoint route | Default endpoint for model predictions |
| environments/{production}/sync/{any/route} | /{any/route} in your server | Access any route in your server |
For example, if you set predict_endpoint: /my/custom/route, requests map as follows:
| Baseten Endpoint | Maps To |
|---|---|
| environments/{production}/predict | /my/custom/route |
| environments/{production}/sync/my/custom/route | /my/custom/route |
| environments/{production}/sync/my/other/route | /my/other/route |
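In config.yaml terms, this mapping comes from a single field (shown here in isolation):

```yaml
docker_server:
  predict_endpoint: /my/custom/route   # environments/{production}/predict forwards here
```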
2. Example: Running a vLLM Server
This example deploys Meta-Llama-3.1-8B-Instruct using vLLM on an L4 GPU, with /health as both the readiness and liveness probe endpoint.
config.yaml
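A sketch of such a config, assuming vLLM's OpenAI-compatible server image and its default routes (the exact image tag and flags are assumptions, not the original example):

```yaml
model_name: Llama 3.1 8B Instruct (vLLM)
base_image:
  image: vllm/vllm-openai:latest   # vLLM's OpenAI-compatible server image
docker_server:
  # Gated Llama weights require a Hugging Face token; see the secrets section below.
  start_command: vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --port 8000
  server_port: 8000
  predict_endpoint: /v1/chat/completions   # vLLM's chat completions route
  readiness_endpoint: /health              # vLLM exposes /health
  liveness_endpoint: /health
resources:
  accelerator: L4
  use_gpu: true
```

With this config, Baseten's environments/{production}/predict endpoint forwards to /v1/chat/completions on the vLLM server.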
More examples are available in the Truss examples repo.
3. Installing custom Python packages
To install additional Python dependencies, add a requirements.txt file to your Truss.
Example: Infinity embedding model server
config.yaml
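A sketch, assuming the infinity-emb package and its OpenAI-compatible routes (the package extras, model id, and port below are assumptions):

```yaml
base_image:
  image: python:3.11-slim
docker_server:
  start_command: infinity_emb v2 --model-id BAAI/bge-small-en-v1.5 --port 7997
  server_port: 7997
  predict_endpoint: /embeddings   # Infinity's OpenAI-compatible embeddings route
  readiness_endpoint: /health
  liveness_endpoint: /health
requirements:
  - infinity-emb[all]   # installed on top of the base image; a requirements.txt works too
resources:
  accelerator: L4
  use_gpu: true
```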
4. Accessing secrets in custom servers
To use API keys or other secrets, first store them in Baseten. Baseten can then inject secrets into your container, where they are available at /secrets/{secret_name}.
Example: Accessing a Hugging Face token
Add secrets with placeholder values in config.yaml:
config.yaml
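A minimal sketch, using hf_access_token as the secret name (the actual value is set in your Baseten workspace, never committed to config.yaml):

```yaml
secrets:
  hf_access_token: null   # placeholder; real value stored in Baseten
```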
Then, in your start_command or application code, read secrets from the /secrets directory:
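For example, a start_command can read the token into an environment variable before launching the server (a sketch; the variable name depends on what your server expects):

```yaml
docker_server:
  start_command: sh -c "HF_TOKEN=$(cat /secrets/hf_access_token) vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct"
```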
More on secrets management is available in the secrets documentation.