Ollama is a popular lightweight LLM inference server, similar to vLLM or SGLang. This guide deploys an Ollama model as a custom Docker server on Baseten. This configuration serves TinyLlama with Ollama on a CPU instance. The deployment process is the same for larger Ollama models. Adjust the resources and the ollama pull target in start_command to match your model's requirements.
Set up your environment
This guide uses uvx to run Truss commands without a separate install step. Sign in to Baseten, and install requests to call the deployed model from Python. Browser login opens a tab to approve this device, so there's no API key to copy and paste.
Sign in to Baseten
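Truss includes a login command that triggers the browser flow; run it through uvx (a sketch, assuming a current Truss release):

```sh
# Opens a browser tab to approve this device; no API key to paste
uvx truss login
```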
Install requests
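requests is only needed for the Python client example at the end of this guide:

```sh
pip install requests
```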
Configure the model
Create a directory with a config.yaml file:
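A minimal config along these lines works; the docker_server keys assume current Truss syntax, and the image tag, sleep duration, and model_name here are illustrative placeholders:

```yaml
model_name: ollama-tinyllama    # illustrative name
base_image:
  image: python:3.11-slim       # lightweight Python image
build_commands:
  # Packages the Ollama install script needs; the slim image omits them
  - apt-get update && apt-get install -y curl ca-certificates zstd
  # Download and install Ollama
  - curl -fsSL https://ollama.com/install.sh | sh
docker_server:
  # Start the server, give it a moment to initialize, then pull the model
  start_command: sh -c "ollama serve & sleep 10 && ollama pull tinyllama && wait"
  server_port: 11434            # Ollama's default port
  readiness_endpoint: /api/tags # returns successfully once Ollama is up
  liveness_endpoint: /api/tags
  predict_endpoint: /api/generate
resources:
  cpu: "4"
  memory: 8Gi
  use_gpu: false
```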
base_image is a lightweight Python image. The build_commands install the system packages that the Ollama install script requires (curl, ca-certificates, and zstd), then download and install Ollama. The slim base image doesn’t include these packages by default.
The start_command launches the Ollama server, waits for it to initialize, and then pulls the TinyLlama model. The readiness_endpoint and liveness_endpoint both point to /api/tags, which returns successfully when Ollama is running. The predict_endpoint maps Baseten’s /predict route to Ollama’s /api/generate endpoint.
This example only needs 4 CPUs and 8 GB of memory. For a complete list of resource options, see the Resources page.
Deploy the model
Push the model to Baseten to start the deployment:
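Run the push from the directory containing config.yaml (a sketch; uvx truss push --help lists the available flags):

```sh
uvx truss push
```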
Call the model
Ollama's /api/generate is mapped to Baseten's /predict route, so you can call the deployed model with any HTTP client:
- Truss CLI
- cURL
- Python
To run inference with Truss, use the predict command. Replace MODEL_ID with the model ID from your deployment output:
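The snippets below are sketches: the model-MODEL_ID.api.baseten.co URL pattern and the truss predict flags should be checked against your workspace (uvx truss predict --help), and the cURL and Python calls assume a BASETEN_API_KEY environment variable.

Truss CLI:

```sh
# Run from the truss directory; targets the model from your latest push
uvx truss predict -d '{"model": "tinyllama", "prompt": "Why is the sky blue?", "stream": false}'
```

cURL:

```sh
curl -s -X POST "https://model-MODEL_ID.api.baseten.co/environments/production/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{"model": "tinyllama", "prompt": "Why is the sky blue?", "stream": false}'
```

Python:

```python
import os

import requests

resp = requests.post(
    "https://model-MODEL_ID.api.baseten.co/environments/production/predict",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={"model": "tinyllama", "prompt": "Why is the sky blue?", "stream": False},
)
resp.raise_for_status()
print(resp.json()["response"])  # Ollama puts the generated text here
```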
You should see a JSON response from Ollama's /api/generate endpoint, with the generated text in its response field.
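Abridged, and with illustrative values, the non-streaming response shape is roughly:

```json
{
  "model": "tinyllama",
  "created_at": "2024-01-01T00:00:00Z",
  "response": "The sky appears blue because of Rayleigh scattering...",
  "done": true
}
```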
Next steps
For higher-throughput serving on GPUs with OpenAI-compatible endpoints, see the vLLM and SGLang examples.
Deploy LLMs with vLLM
Serve open-source LLMs on vLLM with prefix caching and the OpenAI-compatible API.
Deploy LLMs with SGLang
Serve open-source LLMs on SGLang’s high-performance runtime with the OpenAI-compatible API.