Each page in this section is a deploy-ready Truss config for an open-weight model family, with hardware, engine version, quantization, and serving flags chosen for a sensible default. If your model is a fine-tune of one of these base checkpoints, bring over your weights rather than starting from scratch. If you are new to Baseten, work through Deploy your first model first.
Browse recipes by category
Pick a category to land on a representative family. The sidebar lists every family under each group.
LLMs
Chat-completions models served with vLLM or TensorRT-LLM, including Qwen, GLM, GPT-OSS, Gemma, Llama, MiniMax, and Nemotron.
Image generation
Text-to-image and text-to-video diffusion models.
Transcription
Speech-to-text with Voxtral and Qwen3-ASR.
Embedding
Dense embeddings and cross-encoder rerankers served with BEI.
Browse by capability
Every family page tags its capabilities. Open a capability page to see every recipe that supports it.
Reasoning
Tool calling
Multimodal (image)
Long context
Agentic
Speech to text
Bring over fine-tune weights
Most recipes in this section work as-is with a fine-tuned version of the base model. If you have your own weights from fine-tuning one of these base checkpoints, the only parameter you usually change is the Hugging Face ID pointing at your weights. The rest of the config (engine version, parsers, prefix caching, health checks) stays as written. The sections below cover the additional parameters that may need adjustment in specific cases.
Interested in fine-tuning? See training on Baseten to set up a workspace and start a run.
Swap the Hugging Face repo
Three parameters in the config name the checkpoint: model_metadata.repo_id, weights.source, and --served-model-name. Change all three to point at your fine-tuned repo, and keep weights.auth_secret_name as is if the repo is gated:

| Parameter | Change to |
|---|---|
| model_metadata.repo_id | Your fine-tuned Hugging Face repo. |
| weights.source | The hf:// URI for the same repo with a branch or revision, like hf://your-org/gemma-4-31B-it-finetune@main. |
| weights.auth_secret_name | Keep hf_access_token for gated repos and add the secret in your workspace before pushing. |
| --served-model-name in start_command | The model string your clients pass in chat.completions.create(model=...). |
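
As a minimal sketch, the four rows above might sit together in config.yaml roughly as follows. The key names come from the table. The nesting (weights as a list, start_command under docker_server), the vllm serve command line, and the your-org/gemma-4-31B-it-finetune repo are assumptions for illustration, so check them against the recipe you copied.

```yaml
# Illustrative fragment only: key names come from the table above,
# while the nesting and every value are assumptions to verify in your recipe.
model_metadata:
  repo_id: your-org/gemma-4-31B-it-finetune  # your fine-tuned Hugging Face repo

weights:
  # Assumed to be a list of sources; the @main suffix pins a branch or revision.
  - source: "hf://your-org/gemma-4-31B-it-finetune@main"
    auth_secret_name: hf_access_token  # keep for gated repos

docker_server:  # assumed parent key for start_command
  # Placeholder command: only --served-model-name is the point here.
  start_command: >-
    vllm serve your-org/gemma-4-31B-it-finetune
    --served-model-name gemma-4-31B-it-finetune
```

With a config like this, clients would pass model="gemma-4-31B-it-finetune" to chat.completions.create, matching the served name.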
weights.source in config.yaml also accepts s3://, gs://, and bdn:// source URIs.
Match hardware to model size
Pick the variant tab whose base model matches the size and architecture of your fine-tune, then copy that config. The --tensor-parallel-size $GPU_COUNT flag reads the GPU count at runtime, so the start command adapts automatically when you change resources.accelerator. See resources for the accelerator list.
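
A hedged sketch of how the accelerator setting and the tensor-parallel flag relate. The H100:4 value, the docker_server parent key, and the vllm serve command line are assumptions for illustration. Only resources.accelerator and --tensor-parallel-size $GPU_COUNT come from the text above.

```yaml
# Illustrative fragment: the accelerator value and start command are assumptions.
resources:
  accelerator: H100:4  # hypothetical; use the accelerator from the matching variant tab
  use_gpu: true

docker_server:  # assumed parent key for start_command
  # $GPU_COUNT is resolved at runtime, so this line stays the same
  # when resources.accelerator changes.
  start_command: >-
    vllm serve your-org/gemma-4-31B-it-finetune
    --tensor-parallel-size $GPU_COUNT
```

Changing the accelerator in this sketch would need no matching edit to the start command.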
Drop speculative decoding unless you trained a matching speculator
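Several recipes ship with an EAGLE3 draft model trained against the base checkpoint, configured in config.yaml through --speculative-config.* flags. A draft model trained against the base checkpoint may no longer match the outputs of your fine-tune, so remove the --speculative-config.* flags unless you have trained a speculator on your fine-tune.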
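
A minimal sketch of a start command with the speculator flags removed. The commented sub-keys (method, model, num_speculative_tokens) follow vLLM's speculative-config naming but their values are illustrative, not copied from any recipe. The docker_server parent key and the repo name are also assumptions.

```yaml
docker_server:  # assumed parent key for start_command
  # A recipe with a speculator might carry flags like these
  # (illustrative sub-keys and values, not copied from any recipe):
  #   --speculative-config.method eagle3
  #   --speculative-config.model <draft-model-repo>
  #   --speculative-config.num_speculative_tokens 3
  # Without a speculator trained on the fine-tune, the command omits them all:
  start_command: >-
    vllm serve your-org/gemma-4-31B-it-finetune
    --served-model-name gemma-4-31B-it-finetune
    --tensor-parallel-size $GPU_COUNT
```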
Disable multimodal for text-only fine-tunes
Multimodal recipes (Gemma 4, Llama 4) include --limit-mm-per-prompt.image 1 to cap image inputs per prompt. If your fine-tune dropped the vision tower or you only need text serving, swap that flag for --language-model-only in config.yaml.
--language-model-only skips the multimodal preprocessing path and avoids loading the vision encoder weights.
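
The flag swap could look like the sketch below. Both flags are the ones named above. The docker_server parent key, the repo name, and the rest of the command line are assumptions for illustration.

```yaml
docker_server:  # assumed parent key for start_command
  # Multimodal recipe default (flag to remove):
  #   --limit-mm-per-prompt.image 1
  # Text-only fine-tune (flag to add in its place):
  start_command: >-
    vllm serve your-org/gemma-4-31B-it-finetune
    --served-model-name gemma-4-31B-it-finetune
    --language-model-only
```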
When a recipe is not the right starting point
The recipes assume open-weight models served over the OpenAI-compatible API. For other cases:

| Case | Start here |
|---|---|
| Custom inference logic, or pre- or post-processing in Python | Custom Python models |
| A serving stack with no recipe yet | vLLM, SGLang, Ollama, or generic Docker |
| Embeddings, rerankers, or classification beyond the listed recipes | Embeddings with BEI |
| TensorRT-LLM engines from scratch | TensorRT-LLM |