Setup
To get started, sign in to Baseten with Truss and then install the OpenAI SDK.
Sign in to Baseten
uvx truss login --browser
Install the OpenAI SDK
uv pip install openai
- E2B
- E4B
- 26B A4B
- 31B
E2B
google/gemma-4-E2B-it is a 2B-parameter dense model with up to 125K context.
This preset serves Gemma 4 E2B on a single L4, the lowest-cost deployment in the Model Library.
| Hardware | Engine | Context | Concurrency |
|---|---|---|---|
| L4 | vLLM 0.20.0 | 125K | 8 |
Write the config
Create and move into the project directory:
mkdir gemma-4-E2B-it-latency && cd gemma-4-E2B-it-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: model:gemma-4-E2B-it preset:latency
base_image:
  image: vllm/vllm-openai:v0.20.0
model_metadata:
  repo_id: google/gemma-4-E2B-it
  example_model_input:
    model: google/gemma-4-E2B-it
    messages:
      - role: user
        content:
          - type: text
            text: "Describe this image in one sentence."
          - type: image_url
            image_url:
              url: "https://picsum.photos/id/237/200/300"
    stream: true
    max_tokens: 512
    temperature: 1.0
  tags:
    - openai-compatible
weights:
  - source: "hf://google/gemma-4-E2B-it@main"
    mount_location: "/app/checkpoint/gemma"
    auth_secret_name: "hf_access_token"
build_commands:
  - pip install --upgrade transformers==5.5.4
docker_server:
  start_command: >-
    sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/gemma
    --tensor-parallel-size $GPU_COUNT
    --served-model-name google/gemma-4-E2B-it
    --max-num-seqs 16
    --max-model-len auto
    --limit-mm-per-prompt.image 1
    --gpu-memory-utilization 0.9
    --async-scheduling
    --trust-remote-code
    --enable-auto-tool-choice
    --enable-prefix-caching
    --reasoning-parser gemma4
    --tool-call-parser gemma4"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  VLLM_LOGGING_LEVEL: INFO
requirements:
  - huggingface_hub
  - hf_transfer
  - datasets
resources:
  accelerator: L4
  use_gpu: true
secrets:
  hf_access_token: null
runtime:
  health_checks:
    restart_check_delay_seconds: 300
    restart_threshold_seconds: 300
    stop_traffic_threshold_seconds: 120
  predict_concurrency: 8
# Updated with nightly image and async scheduling
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

| Flag | Value | What it does |
|---|---|---|
| --tensor-parallel-size | $GPU_COUNT | Number of GPUs to shard the model across. |
| --max-num-seqs | 16 | Maximum number of concurrent sequences in the batch. |
| --max-model-len | auto | Maximum context length (tokens) the server accepts per request. |
| --limit-mm-per-prompt.image | 1 | Maximum number of image inputs per prompt. |
| --gpu-memory-utilization | 0.9 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --async-scheduling | (no value) | Overlap scheduling with GPU execution to hide scheduler latency. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --enable-prefix-caching | (no value) | Reuse KV cache across requests that share a prefix. |
| --reasoning-parser | gemma4 | Server-side parser that separates reasoning output into reasoning_content. |
| --tool-call-parser | gemma4 | Server-side parser that emits structured tool_calls on the response. |
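To make the mapping from the flag table to the command line concrete, here is a minimal sketch that renders a flag dictionary into an argument list. The helper name render_flags and the convention of using None for flags without a value are illustrative assumptions; they are not part of Truss or vLLM.

```python
import shlex
from typing import Dict, List, Optional


def render_flags(flags: Dict[str, Optional[str]]) -> List[str]:
    """Turn {flag: value} into argv tokens; None marks a flag with no value."""
    argv: List[str] = []
    for name, value in flags.items():
        argv.append(name)
        if value is not None:
            argv.append(str(value))
    return argv


# A subset of the flags from the table, for illustration only.
flags = {
    "--max-num-seqs": "16",
    "--gpu-memory-utilization": "0.9",
    "--enable-prefix-caching": None,
    "--reasoning-parser": "gemma4",
}

# shlex.join quotes tokens where needed so the result is safe to paste into a shell.
command = shlex.join(["vllm", "serve", "/app/checkpoint/gemma", *render_flags(flags)])
print(command)
```

Keeping the flags in a dictionary makes it easy to diff two presets, for example to see which flags differ between model sizes.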
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model gemma-4-E2B-it-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
Python
main.py
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)
response = client.chat.completions.create(
    model="google/gemma-4-E2B-it",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)
print(response.choices[0].message.content)
cURL
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "google/gemma-4-E2B-it",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
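The endpoint in both examples is the same URL pattern with {model_id} filled in from the logs URL. A minimal sketch of that substitution follows; the helper names model_id_from_logs_url and base_url_for are hypothetical, and the logs URL is the placeholder from the example output, not a real deployment.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    parts = urlparse(logs_url).path.strip("/").split("/")
    # Expected path shape: models/<model_id>/logs/<...>
    if len(parts) < 2 or parts[0] != "models":
        raise ValueError(f"unexpected logs URL: {logs_url}")
    return parts[1]


def base_url_for(model_id: str) -> str:
    """Fill the {model_id} placeholder in the production endpoint."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
print(base_url_for(model_id_from_logs_url(logs_url)))
```

The returned string can be passed as base_url when constructing the OpenAI client.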
The server parses the model’s chain of thought into a separate reasoning_content field on the response. Read it alongside the final answer:
response = client.chat.completions.create(
    model="google/gemma-4-E2B-it",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
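Because reasoning_content is not part of the standard OpenAI response schema, it may be absent on some responses, in which case plain attribute access would raise an error. A minimal defensive sketch follows; split_reasoning is a hypothetical helper, and the SimpleNamespace objects stand in for response.choices[0].message so the example runs offline.

```python
from types import SimpleNamespace
from typing import Optional, Tuple


def split_reasoning(message) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning, answer); reasoning is None when the field is absent."""
    reasoning = getattr(message, "reasoning_content", None)
    return reasoning, message.content


# Stand-ins for a message with and without a parsed chain of thought.
with_reasoning = SimpleNamespace(content="3", reasoning_content="Count each r in turn.")
without_reasoning = SimpleNamespace(content="Hello!")

print(split_reasoning(with_reasoning))
print(split_reasoning(without_reasoning))
```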
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]
response = client.chat.completions.create(
    model="google/gemma-4-E2B-it",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
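Printing tool_calls only shows what the model asked for; your code still has to run the function. The sketch below shows one way to dispatch those calls, assuming the OpenAI-style shape where each call has an id, a function.name, and function.arguments as a JSON string. The get_weather body and TOOL_REGISTRY are hypothetical, and the SimpleNamespace object stands in for a real tool call.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> str:
    """Hypothetical local implementation of the get_weather tool."""
    return f"Sunny in {location}"


# Maps tool names declared in the tools array to local callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls):
    """Execute each requested tool and return (call id, result) pairs."""
    results = []
    for call in tool_calls or []:  # tool_calls is None when no tool was requested
        func = TOOL_REGISTRY[call.function.name]
        kwargs = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        results.append((call.id, func(**kwargs)))
    return results


fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
)
print(run_tool_calls([fake_call]))
```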
E4B
google/gemma-4-E4B-it is a 4B-parameter dense model with up to 125K context.
This preset serves Gemma 4 E4B on a single H100.
| Hardware | Engine | Context | Concurrency |
|---|---|---|---|
| H100 | vLLM 0.20.0 | 125K | 8 |
Write the config
Create and move into the project directory:
mkdir gemma-4-E4B-it-latency && cd gemma-4-E4B-it-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: model:gemma-4-E4B-it preset:latency
base_image:
  image: vllm/vllm-openai:v0.20.0
model_metadata:
  repo_id: google/gemma-4-E4B-it
  example_model_input:
    model: google/gemma-4-E4B-it
    messages:
      - role: user
        content:
          - type: text
            text: "Describe this image in one sentence."
          - type: image_url
            image_url:
              url: "https://picsum.photos/id/237/200/300"
    stream: true
    max_tokens: 512
    temperature: 1.0
  tags:
    - openai-compatible
weights:
  - source: "hf://google/gemma-4-E4B-it@main"
    mount_location: "/app/checkpoint/gemma"
    auth_secret_name: "hf_access_token"
build_commands:
  - pip install --upgrade transformers==5.5.4
docker_server:
  start_command: >-
    sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/gemma
    --tensor-parallel-size $GPU_COUNT
    --served-model-name google/gemma-4-E4B-it
    --max-num-seqs 16
    --max-model-len auto
    --limit-mm-per-prompt.image 1
    --gpu-memory-utilization 0.9
    --async-scheduling
    --trust-remote-code
    --enable-auto-tool-choice
    --enable-prefix-caching
    --reasoning-parser gemma4
    --tool-call-parser gemma4"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  VLLM_LOGGING_LEVEL: INFO
requirements:
  - huggingface_hub
  - hf_transfer
  - datasets
resources:
  accelerator: H100
  use_gpu: true
secrets:
  hf_access_token: null
runtime:
  health_checks:
    restart_check_delay_seconds: 300
    restart_threshold_seconds: 300
    stop_traffic_threshold_seconds: 120
  predict_concurrency: 8
# Updated with nightly image and async scheduling
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

| Flag | Value | What it does |
|---|---|---|
| --tensor-parallel-size | $GPU_COUNT | Number of GPUs to shard the model across. |
| --max-num-seqs | 16 | Maximum number of concurrent sequences in the batch. |
| --max-model-len | auto | Maximum context length (tokens) the server accepts per request. |
| --limit-mm-per-prompt.image | 1 | Maximum number of image inputs per prompt. |
| --gpu-memory-utilization | 0.9 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --async-scheduling | (no value) | Overlap scheduling with GPU execution to hide scheduler latency. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --enable-prefix-caching | (no value) | Reuse KV cache across requests that share a prefix. |
| --reasoning-parser | gemma4 | Server-side parser that separates reasoning output into reasoning_content. |
| --tool-call-parser | gemma4 | Server-side parser that emits structured tool_calls on the response. |
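The config's example_model_input sends one text part and one image part, and --limit-mm-per-prompt.image caps images at one per prompt. As a rough sketch, a client could build that message and reject extra images before sending. The helper build_image_message is hypothetical, and it only checks a single message rather than the whole request.

```python
from typing import Dict, List

MAX_IMAGES_PER_PROMPT = 1  # mirrors --limit-mm-per-prompt.image 1


def build_image_message(prompt: str, image_urls: List[str]) -> Dict:
    """Build one user message with a text part followed by image_url parts."""
    if len(image_urls) > MAX_IMAGES_PER_PROMPT:
        raise ValueError(f"at most {MAX_IMAGES_PER_PROMPT} image(s) per prompt")
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}


# Same prompt and image as example_model_input in the config.
message = build_image_message(
    "Describe this image in one sentence.",
    ["https://picsum.photos/id/237/200/300"],
)
print(message)
```

The resulting dictionary can be placed in the messages list of a chat completion request.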
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model gemma-4-E4B-it-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
Python
main.py
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)
response = client.chat.completions.create(
    model="google/gemma-4-E4B-it",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)
print(response.choices[0].message.content)
cURL
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "google/gemma-4-E4B-it",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
The server parses the model’s chain of thought into a separate reasoning_content field on the response. Read it alongside the final answer:
response = client.chat.completions.create(
    model="google/gemma-4-E4B-it",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]
response = client.chat.completions.create(
    model="google/gemma-4-E4B-it",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
26B A4B
google/gemma-4-26B-A4B-it is a 26B-parameter MoE model (4B active per token) with up to 256K context.
This preset serves Gemma 4 26B A4B on H100:2 with FP8 dynamic quantization.
| Hardware | Engine | Context | Concurrency |
|---|---|---|---|
| H100 × 2 | vLLM 0.20.0 | 256K | 8 |
Write the config
Create and move into the project directory:
mkdir gemma-4-26B-A4B-it-latency && cd gemma-4-26B-A4B-it-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: model:gemma-4-26B-A4B-it preset:latency
base_image:
  image: vllm/vllm-openai:v0.20.0
model_metadata:
  repo_id: RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic
  example_model_input:
    model: google/gemma-4-26B-A4B-it
    messages:
      - role: user
        content:
          - type: text
            text: "Describe this image in one sentence."
          - type: image_url
            image_url:
              url: "https://picsum.photos/id/237/200/300"
    stream: true
    max_tokens: 512
    temperature: 1.0
  tags:
    - openai-compatible
weights:
  - source: "hf://RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic@main"
    mount_location: "/app/checkpoint/gemma"
    auth_secret_name: "hf_access_token"
build_commands:
  - pip install --upgrade transformers==5.5.4
docker_server:
  start_command: >-
    sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/gemma
    --tensor-parallel-size $GPU_COUNT
    --served-model-name google/gemma-4-26B-A4B-it
    --max-num-seqs 16
    --max-model-len auto
    --limit-mm-per-prompt.image 1
    --gpu-memory-utilization 0.9
    --enable-prefix-caching
    --speculative-config.model RedHatAI/gemma-4-26B-A4B-it-speculator.eagle3
    --speculative-config.num_speculative_tokens 3
    --speculative-config.method eagle3
    --trust-remote-code
    --enable-auto-tool-choice
    --reasoning-parser gemma4
    --tool-call-parser gemma4"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  VLLM_LOGGING_LEVEL: INFO
requirements:
  - huggingface_hub
  - hf_transfer
  - datasets
resources:
  accelerator: H100:2
  use_gpu: true
secrets:
  hf_access_token: null
runtime:
  health_checks:
    restart_check_delay_seconds: 300
    restart_threshold_seconds: 300
    stop_traffic_threshold_seconds: 120
  predict_concurrency: 8
# Updated with nightly image and restored speculative decoding for latency
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

| Flag | Value | What it does |
|---|---|---|
| --tensor-parallel-size | $GPU_COUNT | Number of GPUs to shard the model across. |
| --max-num-seqs | 16 | Maximum number of concurrent sequences in the batch. |
| --max-model-len | auto | Maximum context length (tokens) the server accepts per request. |
| --limit-mm-per-prompt.image | 1 | Maximum number of image inputs per prompt. |
| --gpu-memory-utilization | 0.9 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --enable-prefix-caching | (no value) | Reuse KV cache across requests that share a prefix. |
| --speculative-config.model | RedHatAI/gemma-4-26B-A4B-it-speculator.eagle3 | Hugging Face repo for the draft speculator checkpoint. |
| --speculative-config.num_speculative_tokens | 3 | Number of tokens the draft speculator proposes per step. |
| --speculative-config.method | eagle3 | Speculative decoding method. eagle3: EAGLE v3 speculative decoding. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --reasoning-parser | gemma4 | Server-side parser that separates reasoning output into reasoning_content. |
| --tool-call-parser | gemma4 | Server-side parser that emits structured tool_calls on the response. |
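To get a feel for what num_speculative_tokens buys, the sketch below computes the expected number of tokens produced per target-model step under an idealized model: the draft proposes a chain of k tokens and each is accepted independently with probability a, giving (1 - a^(k+1)) / (1 - a). The independence assumption and the acceptance rates used here are illustrative, not measurements of this preset.

```python
def expected_tokens_per_step(acceptance_rate: float, num_speculative_tokens: int) -> float:
    """Expected tokens emitted per verification step for a chain of draft tokens.

    Assumes each draft token is accepted independently with the same probability.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if acceptance_rate == 1.0:
        # Every draft token is accepted, plus the one token the target always adds.
        return float(num_speculative_tokens + 1)
    return (1 - acceptance_rate ** (num_speculative_tokens + 1)) / (1 - acceptance_rate)


# k = 3 matches --speculative-config.num_speculative_tokens; the rates are hypothetical.
for rate in (0.5, 0.7, 0.9):
    print(rate, round(expected_tokens_per_step(rate, 3), 3))
```

Under this model, a step never yields fewer than one token and never more than k + 1, so the benefit depends on how often the speculator's proposals are accepted.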
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model gemma-4-26B-A4B-it-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
Python
main.py
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)
response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)
print(response.choices[0].message.content)
cURL
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
The server parses the model’s chain of thought into a separate reasoning_content field on the response. Read it alongside the final answer:
response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]
response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
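After running a requested tool, the usual OpenAI-style follow-up is to send the conversation back with the assistant's tool call and a tool message carrying the result, so the model can write the final answer. The sketch below only builds that message list; with_tool_result is a hypothetical helper, the call ID and weather result are made up, and it assumes the deployment's chat template accepts the standard tool role.

```python
import json
from typing import Any, Dict, List


def with_tool_result(
    messages: List[Dict[str, Any]],
    tool_call_id: str,
    name: str,
    arguments: str,
    result: Any,
) -> List[Dict[str, Any]]:
    """Return a new message list extended with the tool call and its result."""
    assistant_turn = {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": tool_call_id,
            "type": "function",
            "function": {"name": name, "arguments": arguments},
        }],
    }
    # The tool message is linked to the call through tool_call_id.
    tool_turn = {"role": "tool", "tool_call_id": tool_call_id, "content": json.dumps(result)}
    return [*messages, assistant_turn, tool_turn]


history = [{"role": "user", "content": "What's the weather in Paris?"}]
follow_up = with_tool_result(
    history, "call_0", "get_weather", '{"location": "Paris"}', {"forecast": "sunny"}
)
print(json.dumps(follow_up, indent=2))
```

Passing follow_up as messages in a second chat completion request, along with the same tools array, lets the model turn the tool output into a user-facing reply.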
31B
google/gemma-4-31B-it is a 31B-parameter dense model with up to 256K context.
This preset serves Gemma 4 31B on H100:2 with FP8 block quantization.
| Hardware | Engine | Context | Concurrency |
|---|---|---|---|
| H100 × 2 | vLLM 0.20.0 | 256K | 8 |
Write the config
Create and move into the project directory:
mkdir gemma-4-31B-it-latency && cd gemma-4-31B-it-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: model:gemma-4-31B-it preset:latency
base_image:
  image: vllm/vllm-openai:v0.20.0
model_metadata:
  repo_id: RedHatAI/gemma-4-31B-it-FP8-block
  example_model_input:
    model: google/gemma-4-31B-it
    messages:
      - role: user
        content:
          - type: text
            text: "Describe this image in one sentence."
          - type: image_url
            image_url:
              url: "https://picsum.photos/id/237/200/300"
    stream: true
    max_tokens: 512
    temperature: 1.0
  tags:
    - openai-compatible
weights:
  - source: "hf://RedHatAI/gemma-4-31B-it-FP8-block@main"
    mount_location: "/app/checkpoint/gemma"
    auth_secret_name: "hf_access_token"
build_commands:
  - pip install --upgrade transformers==5.5.4
docker_server:
  start_command: >-
    sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/gemma
    --tensor-parallel-size $GPU_COUNT
    --served-model-name google/gemma-4-31B-it
    --max-num-seqs 16
    --max-model-len auto
    --limit-mm-per-prompt.image 1
    --gpu-memory-utilization 0.9
    --enable-prefix-caching
    --speculative-config.model RedHatAI/gemma-4-31B-it-speculator.eagle3
    --speculative-config.num_speculative_tokens 3
    --speculative-config.method eagle3
    --trust-remote-code
    --enable-auto-tool-choice
    --reasoning-parser gemma4
    --tool-call-parser gemma4"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  VLLM_LOGGING_LEVEL: INFO
requirements:
  - huggingface_hub
  - hf_transfer
  - datasets
resources:
  accelerator: H100:2
  use_gpu: true
secrets:
  hf_access_token: null
runtime:
  health_checks:
    restart_check_delay_seconds: 300
    restart_threshold_seconds: 300
    stop_traffic_threshold_seconds: 120
  predict_concurrency: 8
# Updated with nightly image and restored speculative decoding for latency
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

| Flag | Value | What it does |
|---|---|---|
| --tensor-parallel-size | $GPU_COUNT | Number of GPUs to shard the model across. |
| --max-num-seqs | 16 | Maximum number of concurrent sequences in the batch. |
| --max-model-len | auto | Maximum context length (tokens) the server accepts per request. |
| --limit-mm-per-prompt.image | 1 | Maximum number of image inputs per prompt. |
| --gpu-memory-utilization | 0.9 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --enable-prefix-caching | (no value) | Reuse KV cache across requests that share a prefix. |
| --speculative-config.model | RedHatAI/gemma-4-31B-it-speculator.eagle3 | Hugging Face repo for the draft speculator checkpoint. |
| --speculative-config.num_speculative_tokens | 3 | Number of tokens the draft speculator proposes per step. |
| --speculative-config.method | eagle3 | Speculative decoding method. eagle3: EAGLE v3 speculative decoding. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --reasoning-parser | gemma4 | Server-side parser that separates reasoning output into reasoning_content. |
| --tool-call-parser | gemma4 | Server-side parser that emits structured tool_calls on the response. |
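As a back-of-the-envelope check on --gpu-memory-utilization, the sketch below estimates how much of the vLLM memory budget the weights take on this preset. It assumes 80 GB per H100 and roughly 1 byte per parameter for FP8, and it ignores quantization scales, any layers kept in higher precision, the speculator checkpoint, and activation memory, so treat the headroom as an upper bound rather than a measured figure.

```python
H100_MEMORY_GB = 80  # assumed per-GPU memory for an 80 GB H100


def memory_budget_gb(num_gpus: int, gpu_memory_gb: float, utilization: float) -> float:
    """Total GPU memory vLLM may claim across all GPUs."""
    return num_gpus * gpu_memory_gb * utilization


def weight_size_gb(num_params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight footprint: 1e9 params times bytes, divided by 1e9 bytes per GB."""
    return num_params_billion * bytes_per_param


budget = memory_budget_gb(2, H100_MEMORY_GB, 0.9)  # H100:2 at 0.9 utilization
weights = weight_size_gb(31, 1.0)                  # 31B parameters at ~1 byte each (FP8)
headroom = budget - weights                        # rough space left for KV cache and overhead

print(round(budget), round(weights), round(headroom))
```

Most of what remains after the weights goes to the KV cache, which is what bounds how many long-context sequences can be resident at once.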
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model gemma-4-31B-it-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
Python
main.py
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)
response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)
print(response.choices[0].message.content)
cURL
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "google/gemma-4-31B-it",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
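The example_model_input in the config sets stream: true, while the calls above wait for the full response. With the OpenAI SDK, passing stream=True to client.chat.completions.create returns an iterator of chunks whose text arrives in choices[0].delta.content. The sketch below shows one way to assemble those deltas; collect_stream and fake_chunk are hypothetical helpers, and the fake chunks stand in for a live stream so the example runs offline.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def collect_stream(chunks: Iterable) -> str:
    """Concatenate delta.content across streamed chunks, skipping empty deltas."""
    pieces = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices at all
            continue
        delta = chunk.choices[0].delta
        if delta.content:  # content is None on role-only or final chunks
            pieces.append(delta.content)
    return "".join(pieces)


def fake_chunk(text: Optional[str]) -> SimpleNamespace:
    """Build an object shaped like one streamed chat completion chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


chunks = [fake_chunk("Machine "), fake_chunk(None), fake_chunk("learning")]
print(collect_stream(chunks))
```

In an interactive client you would typically print each delta as it arrives instead of joining them at the end.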
The server parses the model’s chain of thought into a separate reasoning_content field on the response. Read it alongside the final answer:
response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]
response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)