Deploy open-source LLMs from Hugging Face on Baseten using vLLM and Truss. You write a config.yaml, push with the Truss CLI, and get an OpenAI-compatible API endpoint. No custom Python code or Dockerfile required.
This guide deploys Gemma 4 26B Instruct on two H100 GPUs with vLLM, using EAGLE3 speculative decoding and prefix caching. Weights mirror once through the Baseten Delivery Network (BDN), so replicas scale up without re-downloading from Hugging Face.
Before you begin, sign up or sign in to Baseten and install uv. This example provisions two H100 GPUs and takes roughly 5–10 minutes on first deploy.
New to Baseten? Start with Deploy your first model.

Set up your environment

1. Log in to Baseten with the Truss CLI

Authenticate by opening a browser. Truss caches the credentials for subsequent commands:
uvx truss login --browser
You should see:
Opening browser for authentication...
Successfully logged in.
Prefer not to install Truss? Use uvx truss … for every command in this guide, including uvx truss push in the deploy step.
2. Add a Hugging Face access token

Gemma is gated and requires a license click-through:
  1. Accept Google’s license terms on the Gemma model page. The weights in this example come from RedHatAI’s FP8 fork; your Hugging Face token grants access to both repos.
  2. Create a read-only user access token.
  3. Save the token as a secret named hf_access_token in your Baseten workspace.

Configure the model

Create a project directory and open it:
mkdir gemma-4-26b && cd gemma-4-26b
Create a config.yaml and copy the following configuration into it:
config.yaml
model_name: Gemma 4 26B Instruct
model_metadata:
  example_model_input:
    model: google/gemma-4-26B-A4B-it
    messages:
      - role: user
        content: "What does Gemma stand for?"
    stream: true
    max_tokens: 512
    temperature: 1.0
  tags:
    - openai-compatible

base_image:
  image: vllm/vllm-openai:v0.20.0

build_commands:
  - pip install --upgrade transformers==5.5.4

weights:
  - source: "hf://RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic@main"
    mount_location: "/app/checkpoint/gemma"
    auth_secret_name: "hf_access_token"

docker_server:
  start_command: >-
    sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/gemma
    --tensor-parallel-size $GPU_COUNT
    --served-model-name google/gemma-4-26B-A4B-it
    --max-num-seqs 16
    --max-model-len auto
    --gpu-memory-utilization 0.9
    --enable-prefix-caching
    --speculative-config.model RedHatAI/gemma-4-26B-A4B-it-speculator.eagle3
    --speculative-config.num_speculative_tokens 3
    --speculative-config.method eagle3
    --trust-remote-code
    --enable-auto-tool-choice
    --reasoning-parser gemma4
    --tool-call-parser gemma4"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000

environment_variables:
  VLLM_LOGGING_LEVEL: INFO

resources:
  accelerator: H100:2
  use_gpu: true

secrets:
  hf_access_token: null

runtime:
  predict_concurrency: 8
  health_checks:
    startup_threshold_seconds: 300
    restart_threshold_seconds: 300
    stop_traffic_threshold_seconds: 120
Think of this config as four connected decisions, not a flat list of fields. Weights choose which Hugging Face checkpoint BDN mirrors and where it mounts inside the container. Server settings (base_image, build_commands, docker_server) define how vLLM starts and which routes Baseten forwards to. Resources and runtime pick the GPU shape and how Baseten handles traffic while the replica warms up or fails health checks. Secrets and metadata wire in authentication and populate the dashboard Try panel.

When you port this template to another open-source LLM, change one layer at a time and redeploy. Start with weights[].source and the path in start_command so vLLM reads from the BDN mount rather than downloading from Hugging Face on every cold start. Update --served-model-name to the public model ID your clients will send, then adjust model-specific vLLM flags: reasoning parsers, tool-call parsers, --trust-remote-code, and any speculative-decoding config. Resize resources.accelerator to fit the new checkpoint’s memory footprint, and tune runtime.predict_concurrency alongside --max-num-seqs once you know your traffic pattern.

Several choices in this example are Gemma-specific defaults you can remove or swap. EAGLE3 needs a matching speculator repo; drop the --speculative-config.* flags if your target model has none published. The gemma4 parsers only apply to Gemma 4. FP8 weights cut memory use but require a compatible checkpoint. Keep auth_secret_name for gated models. The example tracks @main, which moves as the repo is updated; pin source to a commit hash instead when you need reproducible deploys. Confirm that the model value in your API requests matches the --served-model-name you set in start_command. Check deployment logs and nvidia-smi output when sizing hardware for a new model.
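The two cross-references described above (the BDN mount path appearing in start_command, and --served-model-name matching the model ID clients send) are easy to break when porting the template. The following is a minimal sketch of a pre-push sanity check, not part of Truss; check_config is a hypothetical helper, and it assumes the config has already been parsed into a Python dict.

```python
import re


def check_config(config: dict) -> list[str]:
    """Return a list of problems found; an empty list means both checks passed.

    Hypothetical helper for illustration only; Truss does not ship this.
    """
    problems = []
    start_command = config["docker_server"]["start_command"]

    # Each weights mount should be referenced in start_command so vLLM
    # reads from the BDN mount instead of downloading the checkpoint.
    for weight in config.get("weights", []):
        mount = weight["mount_location"]
        if mount not in start_command:
            problems.append(f"{mount} is not referenced in start_command")

    # --served-model-name is the value clients must send as "model".
    match = re.search(r"--served-model-name\s+(\S+)", start_command)
    if match is None:
        problems.append("start_command does not set --served-model-name")
    else:
        served = match.group(1)
        example = config.get("model_metadata", {}).get("example_model_input", {})
        if example.get("model") != served:
            problems.append(
                f"example_model_input.model does not match served name {served}"
            )
    return problems


# A trimmed-down version of the config above, as a dict.
config = {
    "model_metadata": {
        "example_model_input": {"model": "google/gemma-4-26B-A4B-it"}
    },
    "weights": [{"mount_location": "/app/checkpoint/gemma"}],
    "docker_server": {
        "start_command": (
            "vllm serve /app/checkpoint/gemma "
            "--served-model-name google/gemma-4-26B-A4B-it"
        )
    },
}

print(check_config(config))  # []
```

To run it against the real file, you could load config.yaml with a YAML parser such as PyYAML's yaml.safe_load (an assumption; any parser that returns a dict works) and pass the result to check_config.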

Deploy the model

Push the model to Baseten:
uvx truss push
You should see output like:
Deploying truss using H100:2 instance type.
Model Gemma 4 26B Instruct was successfully pushed.
View logs at https://app.baseten.co/models/abc1d2ef/logs/xyz123
The logs URL contains your model ID, the string after /models/ (for example, abc1d2ef). You’ll need it to call the model’s API. You can also find it in your Baseten dashboard.
The first deploy takes 5–10 minutes while Baseten pulls the vLLM base image and BDN mirrors the FP8 weights and the EAGLE3 speculator from Hugging Face. Subsequent scale-ups reuse the cached image and weights. Watch progress in the logs linked above.
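If you script deployments, you may want to pull the model ID out of the logs URL rather than copy it by hand. This is a small sketch that assumes the URL keeps the /models/<model_id>/logs/... shape shown in the sample output; model_id_from_logs_url is a hypothetical helper name.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    parts = urlparse(logs_url).path.strip("/").split("/")
    # Assumed path shape: models/<model_id>/logs/...
    if len(parts) < 2 or parts[0] != "models" or not parts[1]:
        raise ValueError(f"unexpected logs URL: {logs_url}")
    return parts[1]


print(model_id_from_logs_url("https://app.baseten.co/models/abc1d2ef/logs/xyz123"))
# abc1d2ef
```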

Call the model

Once the deployment shows Active in the dashboard, call it with a Baseten API key. Export your key before sending the request, replacing the placeholder with your own key:
export BASETEN_API_KEY="YOUR_API_KEY"
Replace {model_id} in the examples below with your model ID from the deploy output.
Send a streaming chat completion from the command line:
curl -N -X POST "https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [
      {"role": "user", "content": "Explain prefix caching in two sentences."}
    ],
    "max_tokens": 200,
    "stream": true
  }'
The model argument in your request must match the --served-model-name flag in start_command. On a mismatch, vLLM rejects the request with a 404 model-not-found error.
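The same streaming request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above: same URL shape, same Api-Key header, same payload. The function names are illustrative, and the parser assumes the OpenAI-style server-sent-event stream (data: lines carrying JSON chunks, ending with data: [DONE]) that vLLM's chat completions route emits.

```python
import json
import urllib.request


def chat_completions_url(model_id: str) -> str:
    # Same endpoint shape as the curl example above.
    return (
        f"https://model-{model_id}.api.baseten.co"
        "/environments/production/sync/v1/chat/completions"
    )


def build_request(model_id, api_key, prompt, model="google/gemma-4-26B-A4B-it"):
    """Build the POST request; `model` must match --served-model-name."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 200,
        "stream": True,
    }
    return urllib.request.Request(
        chat_completions_url(model_id),
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Api-Key {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def iter_stream_text(lines):
    """Yield content deltas from server-sent-event lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


def stream_completion(model_id: str, api_key: str, prompt: str) -> None:
    """Send the request and print tokens as they arrive (needs network)."""
    request = build_request(model_id, api_key, prompt)
    with urllib.request.urlopen(request) as response:
        for text in iter_stream_text(response):
            print(text, end="", flush=True)
    print()
```

Call stream_completion with your model ID, the value of BASETEN_API_KEY, and a prompt. Only content deltas are printed; any other fields in a chunk are ignored by this sketch.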
Any code that works with the OpenAI SDK works with your deployment. Point the base_url at your model’s endpoint and set the model field to match --served-model-name.
To route traffic from a third-party OpenAI-compatible gateway, see External LLM gateways. The model value the gateway sends must also match --served-model-name in start_command.

Run a production inference server

This deployment is a template for productionizing open-source LLMs from Hugging Face, not just a one-time demo. Baseten runs vLLM as a managed server with health checks, autoscaling, and BDN-cached weights, and exposes an OpenAI-compatible API your existing clients can call without changes.

Two vLLM features in the start_command speed up inference at scale. With EAGLE3 speculative decoding, a small draft model proposes a few tokens ahead (three per step in this config) and the main model verifies them together, keeping the predictions that match; this can cut decode latency by roughly 30–40% on typical LLM workloads. Prefix caching reuses the KV cache when requests share a prompt prefix, such as a system prompt, RAG context, or multi-turn history, which can cut time-to-first-token by up to an order of magnitude on chat and retrieval workloads. Actual gains depend on how often draft tokens are accepted and how much of each prompt is shared.

Point the config at any compatible Hugging Face checkpoint, adjust --served-model-name and hardware sizing, and redeploy. The same pattern works across model families: BDN handles weight delivery, vLLM serves the model, and Baseten handles replicas, routing, and monitoring in production.
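To make the two optimizations concrete, here is a toy model of each in plain Python. It is an illustration of the ideas only, not vLLM's or EAGLE3's implementation: the acceptance rule shown is the simple greedy variant, and the block size of 16 tokens is an assumption made for the example.

```python
def accept_draft_tokens(draft: list[int], target: list[int]) -> list[int]:
    """Greedy speculative-decoding step (toy version).

    `draft` holds the tokens the draft model proposed. `target` holds the
    main model's own choice at each of those positions, plus one extra
    token for the position after the last draft token.
    """
    accepted = []
    for proposed, verified in zip(draft, target):
        if proposed != verified:
            # First mismatch: keep the main model's token and stop.
            accepted.append(verified)
            return accepted
        accepted.append(proposed)
    # Every draft token matched, so the extra target token is also usable.
    if len(target) > len(draft):
        accepted.append(target[len(draft)])
    return accepted


def reusable_prefix_tokens(cached: list[int], new: list[int],
                           block_size: int = 16) -> int:
    """Count tokens of `new` covered by whole cached blocks of shared prefix."""
    shared = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        shared += 1
    # Only complete blocks count as reusable in this toy model.
    return (shared // block_size) * block_size


print(accept_draft_tokens([5, 7, 9], [5, 7, 2, 4]))  # [5, 7, 2]
print(accept_draft_tokens([5, 7, 9], [5, 7, 9, 4]))  # [5, 7, 9, 4]

system_prompt = list(range(35))  # 35 shared prompt tokens
print(reusable_prefix_tokens(system_prompt + [100], system_prompt + [200]))  # 32
```

In the first call, one verification step yields three tokens instead of one, which is where the decode speedup comes from. In the last call, 32 of the 35 shared prompt tokens fall in complete blocks, so only the remainder of the prompt would need fresh prefill.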

Next steps