If you have an existing API server packaged in a Docker image, whether an open-source server like vLLM or a custom-built image, you can deploy it on Baseten with just a config.yaml file.
1. Configuring a custom server in config.yaml
Define a Docker-based server by adding a `docker_server` section to config.yaml:

```yaml
base_image:
  image: vllm/vllm-openai:latest
docker_server:
  start_command: vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --port 8000 --max-model-len 1024
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
```
Key configurations
- `start_command` (required) – Command used to start the server.
- `predict_endpoint` (required) – Endpoint that serves requests (only one per model).
- `server_port` (required) – Port the server listens on.
- `readiness_endpoint` (required) – Used for Kubernetes readiness probes to determine when the container is ready to accept traffic.
- `liveness_endpoint` (required) – Used for Kubernetes liveness probes to determine whether the container needs to be restarted.
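If you want to sanity-check these endpoints before deploying, one option is to run the image locally and poll the readiness endpoint the same way a Kubernetes probe would. A minimal sketch, assuming the container's port 8000 is published locally (the URL and timing values are illustrative, not part of Baseten's API):

```python
import time

import requests

BASE_URL = "http://localhost:8000"  # hypothetical local address; adjust to your port mapping


def wait_until_ready(timeout_s: float = 300.0) -> None:
    """Poll the readiness endpoint, mirroring what the Kubernetes probe does."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            # A 200 from readiness_endpoint means the container can accept traffic.
            if requests.get(f"{BASE_URL}/health", timeout=2).status_code == 200:
                return
        except requests.ConnectionError:
            pass  # server process is still starting up
        time.sleep(2)
    raise TimeoutError("server never became ready")


wait_until_ready()
print("ready to accept traffic")
```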
2. Example: running a vLLM server
This example deploys Meta-Llama-3.1-8B-Instruct using vLLM on an A10G GPU, with /health as the readiness and liveness probe endpoint.
```yaml
base_image:
  image: vllm/vllm-openai:latest
docker_server:
  start_command: sh -c "HF_TOKEN=$(cat /secrets/hf_access_token) vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --port 8000 --max-model-len 1024"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
resources:
  accelerator: A10G
model_name: vllm-model-server
secrets:
  hf_access_token: null
runtime:
  predict_concurrency: 128
```
vLLM’s /health endpoint is used to determine both when the server is ready and whether it needs restarting.
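Once deployed, traffic sent to the model is routed to the predict_endpoint, which in this example speaks the standard OpenAI chat-completions schema. A hedged client sketch; the invocation URL is a placeholder (copy the real one from your Baseten dashboard) and the BASETEN_API_KEY environment variable is an assumption:

```python
import os

import requests

# Placeholder URL: copy the real invocation URL for your deployment
# from the Baseten dashboard.
BASETEN_URL = "https://model-XXXXXXX.api.baseten.co/environments/production/predict"
API_KEY = os.environ["BASETEN_API_KEY"]  # assumed to hold a Baseten API key

payload = {
    # Standard OpenAI chat-completions body, which vLLM's
    # /v1/chat/completions endpoint accepts.
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}

resp = requests.post(
    BASETEN_URL,
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```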
More examples are available in the Truss examples repo.
3. Installing custom Python packages
To install additional Python dependencies, list them under `requirements` in your Truss's config.yaml.
Example: Infinity embedding model server
```yaml
base_image:
  image: python:3.11-slim
docker_server:
  start_command: sh -c "infinity_emb v2 --model-id BAAI/bge-small-en-v1.5"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /embeddings
  server_port: 7997
resources:
  accelerator: L4
  use_gpu: true
model_name: infinity-embedding-server
requirements:
  - infinity-emb[all]
environment_variables:
  hf_access_token: null
```
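Calling the deployed /embeddings endpoint works the same way. A sketch assuming an OpenAI-style request body, which Infinity accepts; again, the URL and API key are placeholders for your own deployment:

```python
import os

import requests

# Placeholder URL: copy the real invocation URL for your deployment
# from the Baseten dashboard.
BASETEN_URL = "https://model-XXXXXXX.api.baseten.co/environments/production/predict"
API_KEY = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    BASETEN_URL,
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={
        "model": "BAAI/bge-small-en-v1.5",
        "input": ["Baseten deploys custom Docker servers."],
    },
    timeout=30,
)
resp.raise_for_status()
# Infinity returns an OpenAI-style response: a list of embedding objects.
print(len(resp.json()["data"][0]["embedding"]))
```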
4. Accessing secrets in custom servers
To use API keys or other secrets, store them in Baseten and access them from /secrets in the container.
Example: Accessing a Hugging Face token
```yaml
secrets:
  hf_access_token: null
```
Inside your server, access it like this:

```bash
HF_TOKEN=$(cat /secrets/hf_access_token)
```
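The same file mount works from application code. A minimal Python sketch, assuming the secret was stored under the name hf_access_token:

```python
from pathlib import Path

# Baseten mounts each secret as a file under /secrets, named after
# the key declared in the config's `secrets` section.
hf_token = Path("/secrets/hf_access_token").read_text().strip()
```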
More on secrets management is available in the Baseten secrets documentation.