When you deploy a model with Truss's Model class, Truss uses the Truss server base image
by default. However, you can also deploy pre-built containers.
In this guide, you will learn how to set up your configuration file to run a
custom Docker image and deploy it to Baseten using Truss.
Configuration
To deploy a custom Docker image, set base_image to your image
and use the docker_server argument to specify how to run it.
config.yaml
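A minimal sketch of such a file; the image name, start command, port, and routes below are placeholders, not required values:

```yaml
base_image:
  image: your-registry/your-server-image:latest   # pre-built image containing your server
docker_server:
  start_command: python3 -m my_server   # placeholder: however your server is launched
  server_port: 8000                     # port your server listens on
  predict_endpoint: /predict            # route Baseten forwards inference requests to
  readiness_endpoint: /health           # returns 200 when the server can accept traffic
  liveness_endpoint: /health            # returns 200 while the server process is healthy
```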
- image: The Docker image to use.
- start_command: The command to start the server.
- server_port: The port to listen on.
- predict_endpoint: The endpoint to forward requests to.
- readiness_endpoint: The endpoint to check if the server is ready.
- liveness_endpoint: The endpoint to check if the server is alive.
Endpoint mapping
While predict_endpoint maps your server’s inference route to Baseten’s
/predict endpoint, you can access any route in your server using the
sync endpoint.

| Baseten endpoint | Maps to |
|---|---|
| /environments/production/predict | Your predict_endpoint route |
| /environments/production/sync/{any/route} | /{any/route} in your server |

Example: If you set predict_endpoint: /v1/chat/completions:

| Baseten endpoint | Maps to |
|---|---|
| /environments/production/predict | /v1/chat/completions |
| /environments/production/sync/v1/models | /v1/models |
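For instance, with that mapping in place, a request to the sync route reaches the server’s /v1/models handler directly. A sketch with cURL, where the model ID and API key are placeholders and the host assumes Baseten’s usual model endpoint format:

```bash
# Calls /v1/models inside your container via the sync route.
curl -s "https://model-{MODEL_ID}.api.baseten.co/environments/production/sync/v1/models" \
  -H "Authorization: Api-Key $BASETEN_API_KEY"
```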
Deploy Ollama
This example deploys Ollama with the TinyLlama model using a custom Docker image. Ollama is a popular lightweight LLM inference server, similar to vLLM or SGLang. TinyLlama is small enough to run on a CPU.

1. Create the config
Create a config.yaml file with the following configuration:
config.yaml
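A sketch of what this file might look like, assuming Ollama’s standard install script (which needs curl available in the image), its default port 11434, and an illustrative start command:

```yaml
base_image:
  image: python:3.11-slim                            # lightweight Python starting point (tag is illustrative)
build_commands:
  - curl -fsSL https://ollama.com/install.sh | sh    # install Ollama into the container at build time
docker_server:
  # Start the server, give it a moment to initialize, then pull the model.
  start_command: sh -c "ollama serve & sleep 5 && ollama pull tinyllama && wait"
  server_port: 11434                                 # Ollama's default port
  predict_endpoint: /api/generate                    # Baseten's /predict forwards here
  readiness_endpoint: /api/tags                      # succeeds once Ollama is running
  liveness_endpoint: /api/tags
resources:
  cpu: "4"
  memory: 8Gi
  use_gpu: false
```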
The base_image field specifies the Docker image to use as your starting
point, in this case a lightweight Python image. The build_commands section
installs Ollama into the container at build time. You can also use this to
install model weights or other dependencies.
The start_command launches the Ollama server, waits for it to initialize, and
then pulls the TinyLlama model.
The readiness_endpoint and liveness_endpoint
both point to /api/tags, which returns successfully when Ollama is running.
The predict_endpoint maps Baseten’s /predict route to Ollama’s
/api/generate endpoint.
Finally, declare your resource requirements. This example only needs 4 CPUs and
8GB of memory. For a complete list of resource options, see the
Resources page.
2. Deploy
To deploy the model, push it with the Truss CLI, as shown below. Once the readiness_endpoint and liveness_endpoint checks are successful, the model will be ready to use.
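For example, assuming the Truss CLI is installed and you are logged in to Baseten, run this from the directory containing config.yaml:

```bash
# Pushes the Truss in the current directory to Baseten.
# --publish creates a published deployment instead of a development one.
truss push --publish
```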
3. Run inference
Ollama exposes OpenAI-compatible endpoints as well as its native API, which uses /api/generate to generate text. Since you mapped the /predict route to
Ollama’s /api/generate endpoint, you can run inference by calling the
/predict endpoint.
- Truss CLI
- cURL
- Python
To run inference with the Truss CLI, use the predict command:
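A sketch assuming the Truss CLI’s predict command and Ollama’s generate request format; the model name and prompt are placeholders:

```bash
# Sends the request body to the deployment's /predict endpoint.
truss predict -d '{"model": "tinyllama", "prompt": "Why is the sky blue?", "stream": false}'
```

With cURL, the same request can be sent to the production /predict endpoint directly (the model ID and API key are placeholders, and the host assumes Baseten’s usual model endpoint format):

```bash
curl -s "https://model-{MODEL_ID}.api.baseten.co/environments/production/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{"model": "tinyllama", "prompt": "Why is the sky blue?", "stream": false}'
```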
Next steps

- Private registries — Pull images from AWS ECR, Google Artifact Registry, or Docker Hub
- Secrets — Access API keys and tokens in your container
- WebSockets — Enable WebSocket connections
- vLLM, SGLang, TensorRT-LLM — Deploy LLMs with popular inference servers