If you have a ready-to-use API server packaged in a Docker image, whether an open source serving image like vLLM or a customized image built in house, deploying it on Baseten is straightforward: all you need is a config.yaml file.

Specifying a Docker image in config.yaml

To run a Custom Server from a Docker image, add a docker_server field to your config.yaml:

config.yaml
base_image:
  image: vllm/vllm-openai:latest
docker_server:
  start_command: vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --port 8000 --max-model-len 1024
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000

where

  • start_command (required) is the command used to start the server
  • predict_endpoint (required) is the endpoint that inference requests are forwarded to; note that deployed models currently support only a single predict endpoint
  • server_port (required) is the port the server listens on
  • readiness_endpoint (required) is used as the Kubernetes readiness probe to determine when the container is ready to start accepting traffic
  • liveness_endpoint (required) is used as the Kubernetes liveness probe to determine when the container should be restarted
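
Once your config.yaml is ready, deploying it is a single command with the Truss CLI. A minimal sketch, assuming the config.yaml lives in a directory named my-custom-server (a placeholder name):

pip install --upgrade truss
truss push my-custom-server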

Example usage: run vLLM server from Docker image

One great use case for a Custom Server is to spin up a popular open source model server like the vLLM OpenAI Compatible Server. Below is an example that deploys the Meta-Llama-3.1-8B-Instruct model with vLLM on one A10G GPU.

Note that we pass the /health endpoint provided by the vLLM server as both readiness_endpoint and liveness_endpoint. This way, vLLM's internal health probe decides both when the server is ready to accept requests and when it is unhealthy and needs to be restarted.

config.yaml
base_image:
  image: vllm/vllm-openai:latest
docker_server:
  start_command: sh -c "HF_TOKEN=$(cat /secrets/hf_access_token) vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --port 8000 --max-model-len 1024"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
resources:
  accelerator: A10G
  use_gpu: true
model_name: vllm-model-server
secrets:
  hf_access_token: null
runtime:
  predict_concurrency: 128
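
Once the model is deployed, requests sent to its Baseten predict URL are forwarded to the predict_endpoint (/v1/chat/completions) inside the container. Below is a rough sketch of a request with curl; the model ID abcd1234 is a placeholder, and the exact predict URL for your deployment is shown in the Baseten UI:

curl -X POST https://model-abcd1234.api.baseten.co/production/predict \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is a Docker container?"}],
    "max_tokens": 128
  }'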

More usage examples of Custom Server can be found here.

Installing custom Python packages

If you need to install additional Python packages, you can do so by listing them under the requirements field of your config.yaml. The following example shows how to start the Infinity Embedding Model Server from a plain Python Docker image with the infinity-emb package installed.

config.yaml
base_image:
  image: python:3.11-slim
docker_server:
  start_command: sh -c "infinity_emb v2 --model-id BAAI/bge-small-en-v1.5"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /embeddings
  server_port: 7997
resources:
  accelerator: L4
  use_gpu: true
model_name: infinity-embedding-server
requirements:
- infinity-emb[all]
environment_variables:
  hf_access_token: null
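
Infinity exposes an OpenAI-compatible /embeddings route, which is what the predict_endpoint above points to. Here is a sketch of a request once the server is deployed; the model ID wxyz0000 is a placeholder, and the predict URL shape may differ for your deployment:

curl -X POST https://model-wxyz0000.api.baseten.co/production/predict \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "BAAI/bge-small-en-v1.5",
    "input": ["Baseten custom servers are easy to deploy."]
  }'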

Accessing secrets in Custom Server

As shown in the vLLM example above, a Custom Server can access secrets stored in Baseten by reading them from the /secrets directory. This is useful if you need to pass API keys or other credentials to your server.
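
The pattern from the vLLM config generalizes to any secret: declare it under secrets in config.yaml, and the container can read it from the corresponding file in /secrets at startup. A minimal sketch with a hypothetical secret name (my_api_key), image, and start script:

config.yaml
base_image:
  image: your-org/your-server:latest  # hypothetical image
docker_server:
  # read the secret file and expose it to the server as an environment variable
  start_command: sh -c "MY_API_KEY=$(cat /secrets/my_api_key) ./start_server.sh"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /predict
  server_port: 8000
secrets:
  my_api_key: null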