config.yaml, push with the Truss CLI, and get an OpenAI-compatible API endpoint. No custom Python code or Dockerfile required.
Deploy Gemma 4 26B Instruct on two H100 GPUs with vLLM, using EAGLE3 speculative decoding and prefix caching. Weights mirror once through the Baseten Delivery Network (BDN), so replicas scale up without re-downloading from Hugging Face.
Set up your environment
Log in to Baseten with the Truss CLI
Authenticate by opening a browser. Truss caches the credentials for subsequent commands:You should see:
Add a Hugging Face access token
Gemma is gated and requires a license click-through:
- Accept Google’s license terms on the Gemma model page. The weights in this example come from RedHatAI’s FP8 fork; your Hugging Face token grants access to both repos.
- Create a read-only user access token.
- Save the token as a secret named
hf_access_tokenin your Baseten workspace.
Configure the model
Create a project directory and open it:config.yaml and copy the following configuration into it:
config.yaml
- Weights: which Hugging Face checkpoint BDN mirrors and where it mounts inside the container.
- Server settings (
base_image,docker_server): how vLLM starts and which routes Baseten forwards to. - Resources and runtime: the GPU shape and how Baseten handles traffic while the replica warms up or fails health checks.
- Secrets and metadata: the credentials injected into the container and what shows up in the dashboard Try panel.
Deploy the model
Push the model to Baseten:/models/ (for example, abc1d2ef). You’ll need it to call the model’s API. You can also find it in your Baseten dashboard.
The first deploy takes 5–10 minutes while Baseten pulls the vLLM base image and BDN mirrors the FP8 weights and the EAGLE3 speculator from Hugging Face. Subsequent scale-ups reuse the cached image and weights. Watch progress in the logs linked above.
Call the model
Once the deployment shows Active in the dashboard, call it with a Baseten API key. Export your key before sending the request:{model_id} in the examples below with your model ID from the deploy output.
- cURL
- Python
Send a streaming chat completion from the command line:Tokens stream back as Server-Sent Events, one
data: chunk at a time. The Python tab below shows the same call with the chunks reassembled into prose.base_url at your model’s endpoint. To route traffic through a third-party OpenAI-compatible gateway, see External LLM gateways.
Run a production inference server
This deployment is a template for productionizing open-source LLMs from Hugging Face, not just a one-time demo. Baseten runs vLLM as a managed server with health checks, autoscaling, and BDN-cached weights, and exposes an OpenAI-compatible API your existing clients can call without changes. Two vLLM features in thestart_command speed up inference at scale. EAGLE3 speculative decoding runs a small draft model alongside the main model and accepts matching token predictions, cutting decode latency by roughly 30–40% on most LLM workloads. Prefix caching reuses the KV cache when requests share a prompt prefix, such as a system prompt, RAG context, or multi-turn history, which can cut time-to-first-token by an order of magnitude on chat and retrieval workloads.
The same pattern works across model families: BDN handles weight delivery, vLLM serves the model, and Baseten handles replicas, routing, and monitoring in production.
Next steps
Adapt to another model
Port the template incrementally. Change and validate one layer before moving to the next.- Weights: Point
weights[].sourceat the new repo and update the path instart_command. Keepauth_secret_namefor gated models, and pin a revision (for example,@mainor a commit hash) for reproducibility. - Served model name: Set
--served-model-nameto the public model ID your clients will send, and update themodelfield inexample_model_inputto match. - Model-specific vLLM flags: Swap or drop reasoning and tool-call parsers (the
gemma4parsers only apply to Gemma 4). Remove the--speculative-config.*flags if no EAGLE3 speculator is published for your target. - Hardware: Resize
resources.acceleratorfor the new checkpoint’s memory footprint. Confirm utilization in the deployment logs andnvidia-smi. - Runtime tuning: Tune
runtime.predict_concurrencyalongside--max-num-seqsonce you know your traffic pattern. - Rollback: Promote a working config to a separate environment and roll forward only after smoke tests pass.
Related resources
Autoscaling
Configure replicas, concurrency targets, and scale-to-zero for production traffic.
Customize a model
Add custom Python when you need preprocessing, postprocessing, or unsupported architectures.