If you have an existing API server packaged in a Docker image (whether an open-source server like vLLM or a custom-built image), you can deploy it on Baseten with just a config.yaml file.
Custom servers also support WebSocket deployments. For WebSocket-specific configuration, see the WebSockets documentation.
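Deployment works like any other Truss: place config.yaml in its own directory and push it with the Truss CLI. A minimal sketch, assuming the CLI is installed and you are authenticated with your Baseten API key:

pip install --upgrade truss   # install the Truss CLI
truss push                    # run from the directory containing config.yaml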

1. Configuring a Custom Server in config.yaml

Define a Docker-based server by adding a docker_server section to your config.yaml:
config.yaml
base_image:
  image: vllm/vllm-openai:latest
docker_server:
  start_command: vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --port 8000 --max-model-len 1024
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000

Key Configurations

| Field | Required | Description |
| --- | --- | --- |
| start_command | ✅ | Command to start the server |
| predict_endpoint | ✅ | Endpoint for serving requests (only one per model) |
| server_port | ✅ | Port where the server runs |
| readiness_endpoint | ✅ | Used for Kubernetes readiness probes to determine when the container is ready to accept traffic |
| liveness_endpoint | ✅ | Used for Kubernetes liveness probes to determine if the container needs to be restarted |
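
With the configuration above, requests sent to the model's predict endpoint are forwarded to /v1/chat/completions. A sketch of calling it with curl; the model ID abcd1234 is a placeholder for your own, and $BASETEN_API_KEY is assumed to hold a Baseten API key:

curl -X POST "https://model-abcd1234.api.baseten.co/environments/production/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'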

Endpoint Mapping

While predict_endpoint is required, you can still access any route in your server using the sync endpoint.
Mapping Rules:
| Baseten Endpoint | Maps To | Description |
| --- | --- | --- |
| environments/{production}/predict | predict_endpoint route | Default endpoint for model predictions |
| environments/{production}/sync/{any/route} | /{any/route} in your server | Access any route in your server |
Example: If you set predict_endpoint: /my/custom/route:
| Baseten Endpoint | Maps To |
| --- | --- |
| environments/{production}/predict | /my/custom/route |
| environments/{production}/sync/my/custom/route | /my/custom/route |
| environments/{production}/sync/my/other/route | /my/other/route |
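
Because every route is reachable through sync, auxiliary routes that are not the predict_endpoint can be called too. For example, a sketch of listing the models served by the vLLM container from section 1 (same placeholder model ID and API key as above):

curl "https://model-abcd1234.api.baseten.co/environments/production/sync/v1/models" \
  -H "Authorization: Api-Key $BASETEN_API_KEY"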

2. Example: Running a vLLM Server

This example deploys Meta-Llama-3.1-8B-Instruct using vLLM on an L4 GPU, with /health as the readiness and liveness probe.
config.yaml
base_image:
  image: vllm/vllm-openai:latest
docker_server:
  start_command: sh -c "HF_TOKEN=$(cat /secrets/hf_access_token) vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --port 8000 --max-model-len 1024"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
resources:
  accelerator: L4
model_name: vllm-model-server
secrets:
  hf_access_token: null
runtime:
  predict_concurrency: 128
vLLM’s /health endpoint is used to determine when the server is ready or needs restarting.
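The OpenAI-compatible route can also be reached directly through the sync endpoint, which is convenient for clients that expect the /v1/chat/completions path. A sketch, with the same placeholder model ID and API key as earlier:

curl -X POST "https://model-abcd1234.api.baseten.co/environments/production/sync/v1/chat/completions" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "What is vLLM?"}], "max_tokens": 128}'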
More examples are available in the Truss examples repo.

3. Installing custom Python packages

To install additional Python dependencies, list them under requirements in your config.yaml (as in the example below).

Example: Infinity embedding model server

config.yaml
base_image:
  image: python:3.11-slim
docker_server:
  start_command: sh -c "infinity_emb v2 --model-id BAAI/bge-small-en-v1.5"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /embeddings
  server_port: 7997
resources:
  accelerator: L4
  use_gpu: true
model_name: infinity-embedding-server
requirements:
  - infinity-emb[all]
environment_variables:
  hf_access_token: null
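
With /embeddings as the predict_endpoint, the deployed Infinity server accepts OpenAI-style embedding requests. A sketch (placeholder model ID and API key as in the earlier curl examples):

curl -X POST "https://model-abcd1234.api.baseten.co/environments/production/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "BAAI/bge-small-en-v1.5", "input": ["Baseten runs custom Docker servers"]}'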

4. Accessing secrets in custom servers

To use API keys or other credentials, store them as secrets in Baseten. Each secret declared in config.yaml is mounted at /secrets/<secret_name> inside the container.

Example: Accessing a Hugging Face token

config.yaml
secrets:
  hf_access_token: null
Inside your server, access it like this:
HF_TOKEN=$(cat /secrets/hf_access_token)
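A common pattern is to read the mounted file in your start_command and export it before launching the server, as in the vLLM example above. A sketch of the shell you would wrap in sh -c, where my-server is a hypothetical stand-in for your own entrypoint:

# read the mounted secret and expose it to the server process
export HF_TOKEN=$(cat /secrets/hf_access_token)
exec my-server --port 8000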
See the secrets management documentation for more details.