Deploy a Hugging Face model

Deploy open-source LLMs from Hugging Face on Baseten using vLLM and Truss. You write a config.yaml, push with the Truss CLI, and get an OpenAI-compatible API endpoint. No custom Python code or Dockerfile required.

This guide walks through deploying Gemma 4 26B Instruct on two H100 GPUs with vLLM, using EAGLE3 speculative decoding and prefix caching. You’ll add a Hugging Face token, write a config, deploy to Baseten, and call the model’s OpenAI-compatible endpoint. Weights mirror once through the Baseten Delivery Network (BDN), so replicas scale up without re-downloading from Hugging Face.

Install and sign in

Before you begin, sign up or sign in to Baseten, then install uv, a fast Python package manager.

Install the Truss CLI and connect it to your Baseten account. Browser login opens a tab to approve this device, so there’s no API key to copy and paste.

Install Truss

uv tool install truss

Sign in

truss login --browser

Prefer not to install? Run uvx truss login --browser to use the same flow without a permanent install, and use uvx truss … for the rest of this guide.

Add a Hugging Face access token

Gemma is gated and requires a license click-through:

Accept Google’s license terms on the Gemma model page. The weights in this example come from RedHatAI’s FP8 fork; your Hugging Face token grants access to both repos.
Create a read-only user access token.
Save the token as a secret named hf_access_token in your Baseten workspace.

Create a Truss project

Create a directory for your project:

mkdir gemma-4-26b && cd gemma-4-26b

vLLM server deployments only need a config.yaml. No custom Python code is required, and the model/ directory (used for custom preprocessing or postprocessing) isn’t needed here.

Write the config

Create a config.yaml with:

config.yaml

model_name: Gemma 4 26B Instruct
model_metadata:
  example_model_input:
    model: google/gemma-4-26B-A4B-it
    messages:
      - role: user
        content: "What does Gemma stand for?"
    stream: true
    max_tokens: 512
    temperature: 1.0
  tags:
    - openai-compatible

base_image:
  image: vllm/vllm-openai:v0.21.0

weights:
  - source: "hf://RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic@main"
    mount_location: "/app/checkpoint/gemma"
    auth_secret_name: "hf_access_token"

docker_server:
  start_command: >-
    sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/gemma
    --tensor-parallel-size $GPU_COUNT
    --served-model-name google/gemma-4-26B-A4B-it
    --max-num-seqs 16
    --max-model-len auto
    --gpu-memory-utilization 0.9
    --enable-prefix-caching
    --speculative-config.model RedHatAI/gemma-4-26B-A4B-it-speculator.eagle3
    --speculative-config.num_speculative_tokens 3
    --speculative-config.method eagle3
    --trust-remote-code
    --enable-auto-tool-choice
    --reasoning-parser gemma4
    --tool-call-parser gemma4"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000

environment_variables:
  VLLM_LOGGING_LEVEL: INFO

resources:
  accelerator: H100:2
  use_gpu: true

secrets:
  hf_access_token: null

runtime:
  predict_concurrency: 8
  health_checks:
    startup_threshold_seconds: 300
    restart_threshold_seconds: 300
    stop_traffic_threshold_seconds: 120

Here’s what each setting does:

weights tells BDN which Hugging Face checkpoint to mirror and where to mount it inside the container. auth_secret_name uses your hf_access_token secret for the gated download.
base_image and docker_server run vLLM as the serving process: start_command launches the server, and the endpoint fields tell Baseten which routes to forward for predictions and health checks.
--enable-prefix-caching reuses the KV cache when requests share a prompt prefix, such as a system prompt, RAG context, or multi-turn history.
The --speculative-config.* flags enable EAGLE3 speculative decoding, which runs a small draft model alongside the main model and accepts matching token predictions to cut decode latency.
resources provisions two H100 GPUs; start_command reads the GPU count with nvidia-smi and sets vLLM’s tensor parallelism to match.
runtime.health_checks gives vLLM time to load weights before Baseten routes traffic or restarts the replica.
model_metadata supplies the example request for the dashboard Try panel, and secrets declares which workspace secrets the container can read.
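To make the prefix-caching bullet concrete, here is a toy model in Python. It is a sketch of the general idea only and not vLLM's implementation: the block size, the ToyPrefixCache class, and the hashing scheme are all illustrative assumptions. Each full block of tokens is hashed together with the hash of the block before it, so two prompts share cache entries only for as long as their leading blocks are identical.

```python
from hashlib import sha256

BLOCK_SIZE = 4  # tokens per cache block; an illustrative value, not a vLLM default


def block_hashes(tokens: list[int]) -> list[str]:
    """Hash each full block chained with the hash of the block before it."""
    hashes = []
    parent = ""
    full = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore a trailing partial block
    for start in range(0, full, BLOCK_SIZE):
        block = tokens[start:start + BLOCK_SIZE]
        parent = sha256(f"{parent}:{block}".encode()).hexdigest()
        hashes.append(parent)
    return hashes


class ToyPrefixCache:
    """Hypothetical stand-in for a KV cache: it only remembers block hashes."""

    def __init__(self) -> None:
        self.seen: set[str] = set()

    def process(self, tokens: list[int]) -> int:
        """Return how many leading tokens hit the cache, then cache this prompt."""
        hashes = block_hashes(tokens)
        hits = 0
        for h in hashes:
            if h not in self.seen:
                break  # the first miss ends the shared prefix
            hits += 1
        self.seen.update(hashes)
        return hits * BLOCK_SIZE


system_prompt = list(range(8))  # 8 shared tokens = 2 full blocks
cache = ToyPrefixCache()
first = cache.process(system_prompt + [100, 101, 102, 103])
second = cache.process(system_prompt + [200, 201, 202, 203])
print(first, second)  # 0 8
```

The second request reuses the two blocks that cover the shared system prompt and only computes its own final block, which is why a long shared prefix benefits most.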

Deploy

Push the model to Baseten:

truss push

You should see:

✨ Model Gemma 4 26B Instruct was successfully pushed ✨

   Model ID:      abc1d2ef
   Deployment ID: xyz123
   Endpoint:      model-abc1d2ef.api.baseten.co
   Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123

The first deploy takes 5-10 minutes while Baseten pulls the vLLM base image and BDN mirrors the FP8 weights and the EAGLE3 speculator from Hugging Face. Subsequent scale-ups reuse the cached image and weights. You can watch progress in the logs linked above.

Call the model

Once the deployment shows Active in the dashboard, call it with a Baseten API key. The endpoint follows this shape:

For example, in https://model-abc123.api.baseten.co/environments/production/sync/v1, abc123 is the model ID and production is the environment that serves the request.
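If you build the endpoint in code, a small helper keeps the URL shape in one place. The sketch below is a convenience function, not part of any Baseten SDK: the name base_url and its parameters are hypothetical, and it assumes other environments follow the same URL shape as production.

```python
def base_url(model_id: str, environment: str = "production") -> str:
    """Build the OpenAI-compatible base URL for a deployed model (hypothetical helper)."""
    return (
        f"https://model-{model_id}.api.baseten.co"
        f"/environments/{environment}/sync/v1"
    )


print(base_url("abc123"))
# https://model-abc123.api.baseten.co/environments/production/sync/v1
```

You can pass the result as base_url when you construct an OpenAI client.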

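Export your key before sending the request. On macOS or Linux:

export BASETEN_API_KEY="paste-your-api-key-here"

On Windows:

setx BASETEN_API_KEY "paste-your-api-key-here"

setx only affects new terminal sessions, so open a new terminal window before you continue.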

Replace {model_id} in the examples below with your model ID from the deploy output.

The examples below show the same request in Python and with cURL.

Send a streaming chat completion with the OpenAI SDK. Save the following as call_model.py:

call_model.py

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

stream = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {"role": "user", "content": "Explain prefix caching in two sentences."}
    ],
    stream=True,
)

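for chunk in stream:
    # Skip chunks that carry no choices or no text (for example, a role-only first chunk).
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)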

Run the script with uv, which pulls the OpenAI SDK on the fly:

uv run --with openai python call_model.py

Send a streaming chat completion from the command line:

curl -N -X POST "https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [
      {"role": "user", "content": "Explain prefix caching in two sentences."}
    ],
    "max_tokens": 200,
    "stream": true
  }'

Tokens stream back as Server-Sent Events, one data: chunk at a time.

You should see the response stream back token by token:

Prefix caching is an optimization technique that stores the processed computational states (KV cache) of common prompt prefixes to avoid redundant processing. By reusing these cached states for similar subsequent requests, it significantly reduces latency and computational costs during inference.
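If you read the Server-Sent Events stream yourself rather than through the OpenAI SDK, each data: line carries one JSON chunk and the stream ends with data: [DONE]. The sketch below shows one way to pull the text out of such lines using only the standard library. The function name iter_content is hypothetical, and the sample lines are hand-written and heavily trimmed, so real chunks carry more fields than shown here.

```python
import json
from typing import Iterable, Iterator


def iter_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines (hypothetical helper)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and anything that is not a data field
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample, trimmed to the fields this sketch reads.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Prefix "}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "caching"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_content(sample)))  # Prefix caching
```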

The model argument in your request must match the --served-model-name flag in start_command, or the API returns a 400.
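One way to catch a mismatch before you deploy is to read the served name out of start_command and compare it with the model your client sends. The sketch below is an illustrative check, not a Truss feature: served_model_name is a hypothetical helper, the command string is a shortened copy of the one in config.yaml, and the helper reads only the first value after the flag.

```python
import re
from typing import Optional


def served_model_name(start_command: str) -> Optional[str]:
    """Return the first value passed to --served-model-name, or None if absent."""
    match = re.search(r"--served-model-name[=\s]+(\S+)", start_command)
    if match is None:
        return None
    # Strip a trailing quote in case the flag is the last one inside sh -c "...".
    return match.group(1).strip("\"'")


# Shortened copy of the start_command from config.yaml.
start_command = (
    'sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && '
    "vllm serve /app/checkpoint/gemma "
    "--tensor-parallel-size $GPU_COUNT "
    '--served-model-name google/gemma-4-26B-A4B-it --max-num-seqs 16"'
)
request_model = "google/gemma-4-26B-A4B-it"

names_match = served_model_name(start_command) == request_model
print(names_match)  # True
```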

Any code that works with the OpenAI SDK works with your deployment: point base_url at your model’s endpoint. To route traffic through a third-party OpenAI-compatible gateway, see External LLM gateways.

Adapt to another model

The same pattern works across model families: BDN handles weight delivery, vLLM serves the model, and Baseten handles replicas, routing, and monitoring. Port the template incrementally, changing and validating one layer before moving to the next.

Weights: Point weights[].source at the new repo and update the path in start_command. Keep auth_secret_name for gated models, and pin a revision for reproducibility. A branch such as @main moves whenever the repo is updated, so use a commit hash when every deploy must get the same weights.
Served model name: Set --served-model-name to the public model ID your clients will send, and update the model field in example_model_input to match.
Model-specific vLLM flags: Swap or drop reasoning and tool-call parsers (the gemma4 parsers only apply to Gemma 4). Remove the --speculative-config.* flags if no EAGLE3 speculator is published for your target.
Hardware: Resize resources.accelerator for the new checkpoint’s memory footprint. Confirm utilization in the deployment logs and nvidia-smi.
Runtime tuning: Tune runtime.predict_concurrency alongside --max-num-seqs once you know your traffic pattern.
Rollback: Promote a working config to a separate environment and roll forward only after smoke tests pass.
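For the hardware step, a back-of-the-envelope estimate of weight memory is a reasonable starting point before you confirm real usage in the logs. The sketch below is a rough heuristic under stated assumptions: it takes the parameter count from the model name at face value, treats FP8 as about one byte per parameter, assumes 80 GiB per H100, and ignores the speculator weights, activations, and runtime overhead. Both function names are hypothetical.

```python
def weight_gib(params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory taken by the model weights, in GiB."""
    return params_billion * 1e9 * bytes_per_param / 2**30


def headroom_gib(
    params_billion: float,
    bytes_per_param: float,
    gpu_count: int,
    gpu_mem_gib: float = 80.0,   # assumed per-GPU memory for an H100
    utilization: float = 0.9,    # mirrors --gpu-memory-utilization 0.9
) -> float:
    """Memory left for the KV cache and everything else after loading weights."""
    budget = gpu_count * gpu_mem_gib * utilization
    return budget - weight_gib(params_billion, bytes_per_param)


# 26B parameters at roughly 1 byte each (FP8) on two GPUs.
print(round(weight_gib(26, 1), 1))       # 24.2
print(round(headroom_gib(26, 1, 2), 1))  # 119.8
```

A negative headroom means the weights alone do not fit. A small positive headroom leaves little room for the KV cache, which limits context length and the number of concurrent sequences.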

Next steps

Autoscaling

Configure replicas, concurrency targets, and scale-to-zero for production traffic.

Customize a model

Add custom Python when you need preprocessing, postprocessing, or unsupported architectures.
