Model recipes - Baseten

Each page in this section is a deploy-ready Truss config for an open-weight model family, with hardware, engine version, quantization, and serving flags chosen for a sensible default. If your model is a fine-tune of one of these base checkpoints, bring over your weights rather than starting from scratch. If you are new to Baseten, start with Deploy your first model.

Build a recipe

Pick a category, narrow by capability, then choose a family and preset to see the matching config.yaml. Every recipe links to its full page for deploy and inference steps.

Browse recipes by category

Pick a category to land on a representative family. The sidebar lists every family under each group.

LLMs

Chat-completions models served with vLLM or TensorRT-LLM, including Qwen, GLM, GPT-OSS, Gemma, Llama, MiniMax, and Nemotron.

Image generation

Text-to-image and text-to-video diffusion models.

Transcription

Speech-to-text with Voxtral and Qwen3-ASR.

Embedding

Dense embeddings and cross-encoder rerankers served with BEI.

Browse by capability

Every family page tags its capabilities. Open a capability page to see every recipe that supports it.

Reasoning

Tool calling

Multimodal (image)

Long context

Agentic

Speech to text

Bring over fine-tune weights

Most recipes in this section work as-is with a fine-tuned version of the base model. If you have your own weights from fine-tuning one of these base checkpoints, the only parameter you usually change is the Hugging Face ID pointing at your weights. The rest of the config (engine version, parsers, prefix caching, health checks) stays as written. The sections below cover the additional parameters that may need adjustment in specific cases.

Interested in fine-tuning? See training on Baseten to set up a workspace and start a run.

Swap the Hugging Face repo

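Three parameters in the config name the checkpoint. Change all three to point at your fine-tuned repo. The fourth row in the table, `weights.auth_secret_name`, controls access to the repo and usually stays as it is: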

Parameter	Change to
`model_metadata.repo_id`	Your fine-tuned Hugging Face repo.
`weights.source`	The `hf://` URI for the same repo with a branch or revision, like `hf://your-org/gemma-4-31B-it-finetune@main`.
`weights.auth_secret_name`	Keep `hf_access_token` for gated repos and add the secret in your workspace before pushing.
`--served-model-name` in `start_command`	The `model` string your clients pass in `chat.completions.create(model=...)`.

For example, here is how you’d adapt the Gemma 4 31B recipe to point at a fine-tune:

config.yaml

 model_metadata:
-  repo_id: RedHatAI/gemma-4-31B-it-FP8-block
+  repo_id: your-org/gemma-4-31B-it-finetune
   example_model_input:
     model: google/gemma-4-31B-it
     ...
 weights:
-  - source: "hf://RedHatAI/gemma-4-31B-it-FP8-block@main"
+  - source: "hf://your-org/gemma-4-31B-it-finetune@main"
     mount_location: "/app/checkpoint/gemma"
     auth_secret_name: "hf_access_token"
 docker_server:
   start_command: >-
     sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/gemma
     --tensor-parallel-size $GPU_COUNT
-    --served-model-name google/gemma-4-31B-it
+    --served-model-name your-org/gemma-4-31B-it-finetune
     ..."
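
The same three edits can be scripted when you maintain several fine-tunes. The sketch below is illustrative only: `retarget_config` is a hypothetical helper, not part of Truss, and it assumes the config has already been loaded into a Python dict with a single `weights` entry, as in the Gemma example. The start command in the sample dict is abbreviated.

```python
import copy
import re


def retarget_config(config: dict, repo: str, revision: str = "main") -> dict:
    """Return a copy of a recipe config that points at a fine-tuned repo.

    Hypothetical helper: assumes one weights entry and a start command
    that contains a --served-model-name flag.
    """
    out = copy.deepcopy(config)
    out["model_metadata"]["repo_id"] = repo
    out["weights"][0]["source"] = f"hf://{repo}@{revision}"
    # Replace only the value that follows --served-model-name.
    out["docker_server"]["start_command"] = re.sub(
        r"(--served-model-name\s+)\S+",
        lambda m: m.group(1) + repo,
        out["docker_server"]["start_command"],
    )
    return out


# Abbreviated stand-in for the Gemma 4 31B recipe shown above.
recipe = {
    "model_metadata": {"repo_id": "RedHatAI/gemma-4-31B-it-FP8-block"},
    "weights": [
        {
            "source": "hf://RedHatAI/gemma-4-31B-it-FP8-block@main",
            "mount_location": "/app/checkpoint/gemma",
            "auth_secret_name": "hf_access_token",
        }
    ],
    "docker_server": {
        "start_command": (
            "vllm serve /app/checkpoint/gemma "
            "--served-model-name google/gemma-4-31B-it"
        )
    },
}

updated = retarget_config(recipe, "your-org/gemma-4-31B-it-finetune")
print(updated["weights"][0]["source"])
print(updated["docker_server"]["start_command"])
```

The helper leaves `mount_location` and `auth_secret_name` untouched, matching the table above, and it does not modify the original dict.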

For checkpoints in S3, GCS, or mirrored through Baseten, see Baseten Delivery Network for s3://, gs://, and bdn:// source URIs.

Match hardware to model size

Pick the variant tab whose base model matches the size and architecture of your fine-tune, then copy that config. The start command counts the GPUs with nvidia-smi at runtime and passes the result to --tensor-parallel-size as $GPU_COUNT, so it adapts automatically when you change resources.accelerator. See resources for the accelerator list.
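
As a rough sanity check when choosing a variant, you can estimate how much memory the weights alone need. This is a back-of-the-envelope sketch, not a Baseten sizing tool: the function names are hypothetical, the 80 GB GPU size is only an example, and the `headroom` fraction (the share of GPU memory left for weights after KV cache and activations) is an assumed rule of thumb that you should tune for your workload.

```python
import math


def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billions * bytes_per_param


def min_gpus_for_weights(
    params_billions: float,
    bytes_per_param: float,
    gpu_memory_gb: float,
    headroom: float = 0.7,  # assumed fraction of each GPU usable for weights
) -> int:
    """Smallest GPU count whose usable memory holds the weights."""
    usable_per_gpu = gpu_memory_gb * headroom
    needed = weight_memory_gb(params_billions, bytes_per_param)
    return math.ceil(needed / usable_per_gpu)


# A 31B-parameter model: FP8 stores about 1 byte per parameter, BF16 about 2.
print(weight_memory_gb(31, 1))          # FP8 weights, in GB
print(weight_memory_gb(31, 2))          # BF16 weights, in GB
print(min_gpus_for_weights(31, 1, 80))  # FP8 on example 80 GB GPUs
print(min_gpus_for_weights(31, 2, 80))  # BF16 on example 80 GB GPUs
```

The estimate ignores KV cache growth with context length and batch size, so treat the result as a lower bound and prefer the hardware in the matching variant tab when in doubt.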

Drop speculative decoding unless you trained a matching speculator

Several recipes ship with an EAGLE3 draft model trained against the base checkpoint:

config.yaml

--speculative-config.model RedHatAI/gemma-4-31B-it-speculator.eagle3
--speculative-config.num_speculative_tokens 3
--speculative-config.method eagle3

A fine-tune shifts the output distribution, so the base speculator’s acceptance rate drops and latency gets worse, not better. Remove all three --speculative-config.* flags unless you have trained a speculator on your fine-tune.
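
To see why a lower acceptance rate can make latency worse, consider a simplified model from the speculative decoding literature. It assumes each drafted token is accepted independently with the same probability, which real workloads only approximate, and it is not a description of how vLLM or EAGLE3 schedule work internally. The `draft_cost_ratio` value (cost of one draft token relative to one target-model step) is a hypothetical number chosen for illustration.

```python
def expected_tokens_per_step(acceptance_rate: float, num_speculative_tokens: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Simplified model: each drafted token is accepted independently with
    probability `acceptance_rate`, and drafting stops at the first rejection.
    """
    a, k = acceptance_rate, num_speculative_tokens
    if a >= 1.0:
        return float(k + 1)
    return (1 - a ** (k + 1)) / (1 - a)


def estimated_speedup(
    acceptance_rate: float,
    num_speculative_tokens: int,
    draft_cost_ratio: float,
) -> float:
    """Tokens per step divided by the relative cost of one step."""
    step_cost = 1 + num_speculative_tokens * draft_cost_ratio
    return expected_tokens_per_step(acceptance_rate, num_speculative_tokens) / step_cost


# num_speculative_tokens=3 matches the recipe; 0.2 is an assumed draft cost.
for rate in (0.8, 0.5, 0.3):
    speedup = estimated_speedup(rate, 3, 0.2)
    print(f"acceptance={rate:.1f} estimated speedup={speedup:.2f}x")
```

Under these assumptions, a high acceptance rate gives an estimated speedup well above 1x, while a low one falls below 1x because the draft work is paid for on every step regardless of how many tokens are accepted.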

Disable multimodal for text-only fine-tunes

Multimodal recipes (Gemma 4, Llama 4) include --limit-mm-per-prompt.image 1 to cap image inputs per prompt. If your fine-tune dropped the vision tower or you only need text serving, swap that flag:

config.yaml

- --limit-mm-per-prompt.image 1
+ --language-model-only

--language-model-only skips the multimodal preprocessing path and avoids loading the vision encoder weights.

When a recipe is not the right starting point

The recipes assume open-weight models served over the OpenAI-compatible API. For other cases:

Case	Start here
Custom inference logic, or pre- or post-processing in Python	Custom Python models
A serving stack with no recipe yet	vLLM, SGLang, Ollama, or generic Docker
Embeddings, rerankers, or classification beyond the listed recipes	Embeddings with BEI
TensorRT-LLM engines from scratch	TensorRT-LLM

For a fine-tune of a base model not listed here, contact support and we can suggest a starting recipe.
