Ollama is a popular lightweight LLM inference server, similar to vLLM or SGLang. This guide deploys Ollama as a custom Docker server on Baseten, serving TinyLlama on a CPU instance. The deployment process is the same for larger Ollama models: adjust the resources and the ollama pull target in start_command to match your model’s requirements.

Set up your environment

This guide uses uvx to run Truss commands without a separate install step. Sign in to Baseten, then install requests so you can call the deployed model from Python later in the guide. Browser login opens a tab to approve this device, so there’s no API key to copy and paste for the CLI.
Sign in to Baseten
uvx truss login --browser
Install requests
uv pip install requests

Configure the model

Create a directory with a config.yaml file:
mkdir tinyllama-ollama
touch tinyllama-ollama/config.yaml
Copy the following configuration into config.yaml:
config.yaml
model_name: ollama-tinyllama
base_image:
  image: python:3.11-slim
build_commands:
  - apt-get update && apt-get install -y curl ca-certificates zstd
  - curl -fsSL https://ollama.com/install.sh | sh
docker_server:
  start_command: sh -c "ollama serve & sleep 5 && ollama pull tinyllama && wait"
  readiness_endpoint: /api/tags
  liveness_endpoint: /api/tags
  predict_endpoint: /api/generate
  server_port: 11434
resources:
  cpu: "4"
  memory: 8Gi
The base_image is a lightweight Python image. The build_commands install the system packages that the Ollama install script requires (curl, ca-certificates, and zstd), then download and install Ollama; the slim base image doesn’t include these packages by default.

The start_command launches the Ollama server, waits for it to initialize, and then pulls the TinyLlama model. The readiness_endpoint and liveness_endpoint both point to /api/tags, which returns successfully once Ollama is running. The predict_endpoint maps Baseten’s /predict route to Ollama’s /api/generate endpoint.

This example only needs 4 CPUs and 8 GB of memory. For a complete list of resource options, see the Resources page.
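To see why /api/tags works as a health check, here is a minimal Python sketch that queries it. It assumes Ollama is installed and running locally on its default port 11434; the deployed container exposes the same endpoint internally.

import requests

# Ollama's /api/tags lists the models that have been pulled.
# It only answers once the server is up, which is why it doubles
# as both a readiness and a liveness probe in config.yaml.
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
for model in resp.json().get("models", []):
    print(model["name"])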

Deploy the model

Push the model to Baseten to start the deployment:
uvx truss push tinyllama-ollama
You should see output like:
Deploying truss using 4x16 instance type.
Model ollama-tinyllama was successfully pushed.
View logs at https://app.baseten.co/models/XXXXXXX/logs/XXXXXXX
Copy the model ID from the output for the next step. The first deploy can take several minutes while Baseten pulls the base image and Ollama downloads TinyLlama on container start. Subsequent scale-ups reuse the cached image and start much faster.

Call the model

Ollama’s /api/generate is mapped to Baseten’s /predict route, so you can call the deployed model with any HTTP client.
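For example, here is a minimal sketch using Python’s requests library. It assumes the standard Baseten model endpoint pattern https://model-MODEL_ID.api.baseten.co/environments/production/predict and API-key authentication; replace MODEL_ID with the model ID from your deployment output and YOUR_API_KEY with an API key from your Baseten workspace.

import requests

model_id = "MODEL_ID"      # from the truss push output
api_key = "YOUR_API_KEY"   # a Baseten API key

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/environments/production/predict",
    headers={"Authorization": f"Api-Key {api_key}"},
    json={
        "model": "tinyllama",
        "prompt": "Write a short story about a robot dreaming",
        "stream": False,
        "options": {"num_predict": 50},
    },
)
resp.raise_for_status()

# The body follows Ollama's /api/generate schema; the generated text is in "response".
print(resp.json()["response"])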
Alternatively, run inference with the Truss CLI using the predict command:
uvx truss predict -d '{"model": "tinyllama", "prompt": "Write a short story about a robot dreaming", "stream": false, "options": {"num_predict": 50}}'
You should see a response like:
It was a dreary, grey day when the robots started to dream.
They had been programmed to think like humans, but it wasn't until they began to dream that they realized just how far apart they actually were.
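The examples above set stream to false. Ollama’s /api/generate can also stream the generation as newline-delimited JSON; the sketch below shows how that might look from Python, assuming the same MODEL_ID and YOUR_API_KEY placeholders as before and that Baseten’s /predict route passes the streamed lines through unchanged.

import json
import requests

model_id = "MODEL_ID"      # from the truss push output
api_key = "YOUR_API_KEY"   # a Baseten API key

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/environments/production/predict",
    headers={"Authorization": f"Api-Key {api_key}"},
    json={"model": "tinyllama", "prompt": "Write a short story about a robot dreaming", "stream": True},
    stream=True,
)
resp.raise_for_status()

# Each streamed line is one JSON object in Ollama's format; "response" holds
# the next chunk of text and "done" marks the final message.
for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    print(chunk.get("response", ""), end="", flush=True)
    if chunk.get("done"):
        break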

Next steps

For higher-throughput serving on GPUs with OpenAI-compatible endpoints, see the vLLM and SGLang examples.

Deploy LLMs with vLLM

Serve open-source LLMs on vLLM with prefix caching and the OpenAI-compatible API.

Deploy LLMs with SGLang

Serve open-source LLMs on SGLang’s high-performance runtime with the OpenAI-compatible API.