If you have an existing API server packaged in a Docker image—whether an open-source server like vLLM or a custom-built image—you can deploy it on Baseten with just a config.yaml file.

1. Configuring a Custom Server in config.yaml

Define a Docker-based server by adding a docker_server section to config.yaml:

config.yaml
base_image:
  image: vllm/vllm-openai:latest
docker_server:
  start_command: vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --port 8000 --max-model-len 1024
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000

Key Configurations

  • start_command (required) – Command to start the server.
  • predict_endpoint (required) – Endpoint on your server that inference requests are forwarded to (only one per model).
  • server_port (required) – Port the server listens on.
  • readiness_endpoint (required) – Used for Kubernetes readiness probes to determine when the container is ready to accept traffic.
  • liveness_endpoint (required) – Used for Kubernetes liveness probes to determine if the container needs to be restarted.
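
With the configuration in place, no model code is needed; you deploy from the directory containing config.yaml using the Truss CLI. A minimal sketch, assuming truss is installed from PyPI:

pip install truss
truss push
# Packages the directory containing config.yaml and deploys it to Baseten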

2. Example: Running a vLLM Server

This example deploys Meta-Llama-3.1-8B-Instruct using vLLM on an A10G GPU, with /health serving as both the readiness and liveness endpoint.

config.yaml
base_image:
  image: vllm/vllm-openai:latest
docker_server:
  start_command: sh -c "HF_TOKEN=$(cat /secrets/hf_access_token) vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --port 8000 --max-model-len 1024"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
resources:
  accelerator: A10G
model_name: vllm-model-server
secrets:
  hf_access_token: null
runtime:
  predict_concurrency: 128

vLLM’s /health endpoint is used to determine when the server is ready or needs restarting.
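
Once the deployment is live, requests to Baseten's predict URL are forwarded to the predict_endpoint configured above (/v1/chat/completions). A sketch, assuming a hypothetical model ID abcd1234 and an API key stored in $BASETEN_API_KEY:

curl -X POST https://model-abcd1234.api.baseten.co/production/predict \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "What is Baseten?"}],
        "max_tokens": 64
      }'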

More examples are available in the Truss examples repo.

3. Installing Custom Python Packages

To install additional Python dependencies, list them under the requirements key in config.yaml (or point the requirements_file key at a requirements.txt file in your Truss).

Example: Infinity Embedding Model Server

config.yaml
base_image:
  image: python:3.11-slim
docker_server:
  start_command: sh -c "infinity_emb v2 --model-id BAAI/bge-small-en-v1.5"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /embeddings
  server_port: 7997
resources:
  accelerator: L4
  use_gpu: true
model_name: infinity-embedding-server
requirements:
  - infinity-emb[all]
secrets:
  hf_access_token: null
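
As with the vLLM example, requests to the deployment's predict URL are forwarded to Infinity's /embeddings route. A sketch with the same hypothetical model ID and API key placeholders as above:

curl -X POST https://model-abcd1234.api.baseten.co/production/predict \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{"model": "BAAI/bge-small-en-v1.5", "input": ["Deploying a custom server on Baseten"]}'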

4. Accessing Secrets in Custom Servers

To use API keys or other secrets, store them in your Baseten workspace and declare them in config.yaml; each secret is then mounted as a file at /secrets/<secret_name> inside the container.

Example: Accessing a Hugging Face Token

config.yaml
secrets:
  hf_access_token: null

Inside your server, access it like this:

HF_TOKEN=$(cat /secrets/hf_access_token)
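
The same pattern works inside a start_command, making the token available before the server process launches. A sketch, where my-server is a hypothetical placeholder for your server's launch command:

docker_server:
  # my-server is a hypothetical placeholder; substitute your server's launch command
  start_command: sh -c "HF_TOKEN=$(cat /secrets/hf_access_token) my-server --port 8000"
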
More on secrets management here.