Deploy open-source LLMs from Hugging Face on Baseten using vLLM and Truss. You write aDocumentation Index
Fetch the complete documentation index at: https://docs.baseten.co/llms.txt
Use this file to discover all available pages before exploring further.
config.yaml, push with the Truss CLI, and get an OpenAI-compatible API endpoint. No custom Python code or Dockerfile required.
Deploy Gemma 4 26B Instruct on two H100 GPUs with vLLM, using EAGLE3 speculative decoding and prefix caching. Weights mirror once through the Baseten Delivery Network (BDN), so replicas scale up without re-downloading from Hugging Face.
Before you begin, sign up or sign in to Baseten and install uv. This example provisions two H100 GPUs and takes roughly 5–10 minutes on first deploy. New to Baseten? Start with Deploy your first model.
Set up your environment
Log in to Baseten with the Truss CLI
Authenticate by opening a browser. Truss caches the credentials for subsequent commands:You should see:
Add a Hugging Face access token
Gemma is gated and requires a license click-through:
- Accept Google’s license terms on the Gemma model page. The weights in this example come from RedHatAI’s FP8 fork; your Hugging Face token grants access to both repos.
- Create a read-only user access token.
- Save the token as a secret named
hf_access_tokenin your Baseten workspace.
Configure the model
Create a project directory and open it:config.yaml and copy the following configuration into it:
config.yaml
base_image, build_commands, docker_server) define how vLLM starts and which routes Baseten forwards to. Resources and runtime pick the GPU shape and how Baseten handles traffic while the replica warms up or fails health checks. Secrets and metadata wire in authentication and populate the dashboard Try panel.
When you port this template to another open-source LLM, change one layer at a time and redeploy. Start with weights[].source and the path in start_command so vLLM reads from the BDN mount rather than downloading from Hugging Face on every cold start. Update --served-model-name to the public model ID your clients will send, then adjust model-specific vLLM flags: reasoning parsers, tool-call parsers, --trust-remote-code, and any speculative-decoding config. Resize resources.accelerator to fit the new checkpoint’s memory footprint and tune runtime.predict_concurrency alongside --max-num-seqs once you know your traffic pattern.
Several choices in this example are Gemma-specific defaults you can remove or swap. EAGLE3 needs a matching speculator repo; drop the --speculative-config.* flags if your target model has none published. The gemma4 parsers only apply to Gemma 4. FP8 weights cut memory use but require a compatible checkpoint. Keep auth_secret_name for gated models, pin source with @main or a commit hash for reproducible deploys, and confirm the --served-model-name in your API requests matches what you set in start_command. Check deployment logs and nvidia-smi output when sizing hardware for a new model.
Deploy the model
Push the model to Baseten:/models/ (for example, abc1d2ef). You’ll need it to call the model’s API. You can also find it in your Baseten dashboard.
The first deploy takes 5–10 minutes while Baseten pulls the vLLM base image and BDN mirrors the FP8 weights and the EAGLE3 speculator from Hugging Face. Subsequent scale-ups reuse the cached image and weights. Watch progress in the logs linked above.
Call the model
Once the deployment shows Active in the dashboard, call it with a Baseten API key. Export your key before sending the request:{model_id} in the examples below with your model ID from the deploy output.
- cURL
- Python
Send a streaming chat completion from the command line:
base_url at your model’s endpoint and set the model field to match --served-model-name.
To route traffic from a third-party OpenAI-compatible gateway, see External LLM gateways. The model value the gateway sends must match --served-model-name in start_command.
Run a production inference server
This deployment is a template for productionizing open-source LLMs from Hugging Face, not just a one-time demo. Baseten runs vLLM as a managed server with health checks, autoscaling, and BDN-cached weights, and exposes an OpenAI-compatible API your existing clients can call without changes. Two vLLM features in thestart_command speed up inference at scale. EAGLE3 speculative decoding runs a small draft model alongside the main model and accepts matching token predictions, cutting decode latency by roughly 30–40% on most LLM workloads. Prefix caching reuses the KV cache when requests share a prompt prefix, such as a system prompt, RAG context, or multi-turn history, which can cut time-to-first-token by an order of magnitude on chat and retrieval workloads.
Point the config at any compatible Hugging Face checkpoint, adjust --served-model-name and hardware sizing, and redeploy. The same pattern works across model families: BDN handles weight delivery, vLLM serves the model, and Baseten handles replicas, routing, and monitoring in production.
Next steps
- Baseten Delivery Network:
weightsfields and authentication options. - Instance types and pricing: accelerator sizing and allocations.
- Performance optimization: engine selection and tuning beyond this example.
- Calling your model and Operations: client patterns, streaming, tool calls, and production error handling.
- Autoscaling: replicas, concurrency targets, and scale-to-zero.
- Customize a model: add Python when you need preprocessing or unsupported architectures.
- Deploy LLMs with vLLM: a smaller L4 example when you want a simpler starting point.