If you have an existing API server packaged in a Docker image, whether an off-the-shelf image for an open-source server like vLLM or a custom-built image of your own, you can deploy it on Baseten with just a config.yaml file.
Custom servers also support WebSocket deployments. For WebSocket-specific configuration, see WebSockets documentation.

1. Configuring a Custom Server in config.yaml

Define a Docker-based server by adding docker_server:
config.yaml
base_image:
  image: vllm/vllm-openai:latest
docker_server:
  start_command: vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --port 8000 --max-model-len 1024
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000

Key Configurations

Field | Required | Description
start_command | Yes | Command to start the server
predict_endpoint | Yes | Endpoint for serving requests (only one per model). This maps your server's inference endpoint to Baseten's prediction endpoint
server_port | Yes | Port where the server runs
readiness_endpoint | Yes | Used for Kubernetes readiness probes to determine when the container is ready to accept traffic. This must match an endpoint on your server
liveness_endpoint | Yes | Used for Kubernetes liveness probes to determine whether the container needs to be restarted. This must match an endpoint on your server

Understanding Readiness vs. Liveness

Both probes run continuously after your container starts, but serve different purposes:
  • Readiness probe: Answers “Can I handle requests right now?” When it fails, Kubernetes stops sending traffic to the container (but doesn’t restart it). Use this to prevent traffic during startup or temporary unavailability.
  • Liveness probe: Answers “Am I healthy enough to keep running?” When it fails, Kubernetes restarts the container. Use this to recover from deadlocks or hung processes.
For most servers, using the same endpoint (like /health) for both is sufficient, as long as it accurately reflects whether your server can handle requests. The key difference is the action taken: readiness controls traffic routing, while liveness controls the container lifecycle.

Initial delays: Both probes wait before starting checks to allow your server time to start up. See Custom health checks for configuration details.
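If you are building a custom image rather than using an off-the-shelf server, your application needs to expose the routes referenced in config.yaml. Below is a minimal sketch, assuming a FastAPI app; the file name server.py and the example routes are illustrative, not required by Baseten:

# server.py: minimal server exposing the routes referenced in config.yaml
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health():
    # Used for both readiness and liveness probes. Return 200 only when the
    # server can actually handle requests (e.g., model weights are loaded).
    return {"status": "ok"}

@app.post("/predict")
def predict(payload: dict):
    # Inference route; map it to Baseten via predict_endpoint: /predict
    return {"output": f"received {len(payload)} fields"}

A matching start_command would be along the lines of uvicorn server:app --host 0.0.0.0 --port 8000, with server_port: 8000 in docker_server.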

Important: Docker Image Requirements

Container file system: The /app directory is used internally by Baseten. The model container runs as a non-root user in some configurations, but the /app and /tmp directories remain writable.
Image caching: When you reference an image by a mutable tag such as :latest, Baseten may not detect that the image has changed and may reuse a cached copy instead of pulling the update. To avoid this, reference updated images by digest rather than by tag:
base_image:
  image: your-registry/your-image@sha256:abc123def456...
This ensures Baseten always pulls the exact version you specify.

Endpoint Mapping

While predict_endpoint is required, you can still access any route in your server using the sync endpoint.
Mapping Rules:
Baseten Endpoint | Maps To | Description
environments/{production}/predict | predict_endpoint route | Default endpoint for model predictions
environments/{production}/sync/{any/route} | /{any/route} in your server | Access any route in your server
Example: If you set predict_endpoint: /my/custom/route:
Baseten Endpoint | Maps To
environments/{production}/predict | /my/custom/route
environments/{production}/sync/my/custom/route | /my/custom/route
environments/{production}/sync/my/other/route | /my/other/route
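As a concrete sketch of this mapping (the model ID, API key, and request body below are placeholders; your server defines the actual payload), both of the following calls reach the same /my/custom/route handler:

# Illustrative only: both requests hit the same route in your server.
import requests

base = "https://model-abcd1234.api.baseten.co/environments/production"
headers = {"Authorization": "Api-Key YOUR_API_KEY"}

# Via the predict endpoint (mapped to predict_endpoint: /my/custom/route)
requests.post(f"{base}/predict", headers=headers, json={"prompt": "hi"})

# Via the sync endpoint, naming the route explicitly
requests.post(f"{base}/sync/my/custom/route", headers=headers, json={"prompt": "hi"})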

2. Example: Running a vLLM Server

This example deploys Meta-Llama-3.1-8B-Instruct using vLLM on an L4 GPU, with /health as the readiness and liveness probe.
config.yaml
base_image:
  image: vllm/vllm-openai:latest
docker_server:
  start_command: sh -c "HF_TOKEN=$(cat /secrets/hf_access_token) vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --port 8000 --max-model-len 1024"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
resources:
  accelerator: L4
model_name: vllm-model-server
secrets:
  hf_access_token: null
runtime:
  predict_concurrency: 128
vLLM’s /health endpoint is used to determine when the server is ready or needs restarting.
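Because the predict endpoint maps to vLLM's OpenAI-compatible /v1/chat/completions route, you can send a standard chat completions payload once the model deploys. A sketch, with placeholder model ID and API key:

# Illustrative request to the deployed vLLM server's predict endpoint.
import requests

resp = requests.post(
    "https://model-abcd1234.api.baseten.co/environments/production/predict",
    headers={"Authorization": "Api-Key YOUR_API_KEY"},
    json={
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "What is a Docker container?"}],
        "max_tokens": 256,
    },
)
print(resp.json()["choices"][0]["message"]["content"])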
More examples are available in the Truss examples repo.

3. Installing custom Python packages

To install additional Python dependencies, add them to your Truss, either by listing them under requirements in config.yaml (as in the example below) or by including a requirements.txt file.

Example: Infinity embedding model server

config.yaml
base_image:
  image: python:3.11-slim
docker_server:
  start_command: sh -c "infinity_emb v2 --model-id BAAI/bge-small-en-v1.5"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /embeddings
  server_port: 7997
resources:
  accelerator: L4
  use_gpu: true
model_name: infinity-embedding-server
requirements:
  - infinity-emb[all]
environment_variables:
  hf_access_token: null
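Infinity serves an OpenAI-compatible embeddings API, so the predict endpoint accepts a standard embeddings payload. A sketch, with a placeholder model ID and API key and the OpenAI-style request and response shape that Infinity implements:

# Illustrative request to the deployed Infinity embedding server.
import requests

resp = requests.post(
    "https://model-abcd1234.api.baseten.co/environments/production/predict",
    headers={"Authorization": "Api-Key YOUR_API_KEY"},
    json={
        "model": "BAAI/bge-small-en-v1.5",
        "input": ["Baseten custom servers run any Docker image."],
    },
)
print(resp.json()["data"][0]["embedding"][:5])  # first few embedding values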

4. Accessing secrets in custom servers

To use API keys or other secrets, first store them in Baseten. Baseten can then inject secrets into your container. They will be available at /secrets/{secret_name}.

Example: Accessing a Hugging Face token

Add secrets with placeholder values in config.yaml:
config.yaml
secrets:
  hf_access_token: null
Never store actual secret values in config.yaml. Store secrets in the workspace settings.
Then, inside your server’s start_command or application code, read secrets from the /secrets directory:
HF_TOKEN=$(cat /secrets/hf_access_token)
Or in your application code:
# Python example
with open('/secrets/hf_access_token', 'r') as f:
    hf_token = f.read().strip()
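For instance, if your server downloads gated weights from Hugging Face, the token read above can be passed to huggingface_hub (a sketch, assuming huggingface_hub is installed; adapt this to whatever library your server actually uses):

# Authenticate with the Hugging Face Hub using the secret mounted by Baseten.
from huggingface_hub import login

with open('/secrets/hf_access_token', 'r') as f:
    login(token=f.read().strip())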
See the secrets management documentation for more details.