Setup
To get started, sign in to Baseten with Truss, then install the OpenAI SDK.
Sign in to Baseten
uvx truss login --browser
Install the OpenAI SDK
uv pip install openai
- E2B
- E4B
- 26B A4B
- 31B
google/gemma-4-E2B-it is a 2B-parameter dense model with up to 125K context. This preset serves Gemma 4 E2B on a single L4, the lowest-cost deployment in the Model Library.
| Hardware | Engine | Context | Concurrency |
|---|---|---|---|
| L4 | vLLM 0.20.0 | 125K | 8 |
Write the config
Create and move into the project directory:
mkdir gemma-4-E2B-it-latency && cd gemma-4-E2B-it-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: model:gemma-4-E2B-it preset:latency
base_image:
  image: vllm/vllm-openai:v0.20.0
model_metadata:
  repo_id: google/gemma-4-E2B-it
  example_model_input:
    model: google/gemma-4-E2B-it
    messages:
      - role: user
        content:
          - type: text
            text: "Describe this image in one sentence."
          - type: image_url
            image_url:
              url: "https://picsum.photos/id/237/200/300"
    stream: true
    max_tokens: 512
    temperature: 1.0
  tags:
    - openai-compatible
weights:
  - source: "hf://google/gemma-4-E2B-it@main"
    mount_location: "/app/checkpoint/gemma"
    auth_secret_name: "hf_access_token"
build_commands:
  - pip install --upgrade transformers==5.5.4
docker_server:
  start_command: >-
    sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/gemma
    --tensor-parallel-size $GPU_COUNT
    --served-model-name google/gemma-4-E2B-it
    --max-num-seqs 16
    --max-model-len auto
    --limit-mm-per-prompt.image 1
    --gpu-memory-utilization 0.9
    --async-scheduling
    --trust-remote-code
    --enable-auto-tool-choice
    --enable-prefix-caching
    --reasoning-parser gemma4
    --tool-call-parser gemma4"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  VLLM_LOGGING_LEVEL: INFO
requirements:
  - huggingface_hub
  - hf_transfer
  - datasets
resources:
  accelerator: L4
  use_gpu: true
secrets:
  hf_access_token: null
runtime:
  health_checks:
    restart_check_delay_seconds: 300
    restart_threshold_seconds: 300
    stop_traffic_threshold_seconds: 120
  predict_concurrency: 8
# Updated with nightly image and async scheduling
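The start_command in this config is written as a YAML folded block scalar (`>-`), so the flag lines are joined into one shell command before the container runs it. The following is a minimal sketch of that joining, using only a few of the flags from the config; it imitates the folding with plain string operations rather than a YAML parser, which is a simplification.

```python
import shlex

# Lines as they appear under `start_command: >-` (abbreviated to a few flags).
folded_lines = [
    'sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/gemma',
    "--tensor-parallel-size $GPU_COUNT",
    "--max-num-seqs 16",
    '--tool-call-parser gemma4"',
]

# A folded scalar joins same-indent lines with single spaces;
# the `-` chomping indicator drops the trailing newline.
command = " ".join(folded_lines)

# The outer shell sees three words: sh, -c, and one quoted script.
argv = shlex.split(command)
print(len(argv))

# The script sets GPU_COUNT first, then launches vllm serve with every flag.
print(argv[2].split(" && ")[1])
```

Because everything after `sh -c` is one quoted string, a stray double quote inside a flag value would end the script early, so it is worth keeping flag values free of double quotes.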
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

| Flag | Value | What it does |
|---|---|---|
| --tensor-parallel-size | $GPU_COUNT | Number of GPUs to shard the model across. |
| --max-num-seqs | 16 | Maximum number of concurrent sequences in the batch. |
| --max-model-len | auto | Maximum context length (tokens) the server accepts per request. |
| --limit-mm-per-prompt.image | 1 | Maximum number of image inputs per prompt. |
| --gpu-memory-utilization | 0.9 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --async-scheduling | (no value) | Overlap scheduling with GPU execution to hide scheduler latency. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --enable-prefix-caching | (no value) | Reuse KV cache across requests that share a prefix. |
| --reasoning-parser | gemma4 | Server-side parser that separates reasoning output into reasoning_content. |
| --tool-call-parser | gemma4 | Server-side parser that emits structured tool_calls on the response. |
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model gemma-4-E2B-it-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Now call your deployment to run inference:
Python
main.py
import os

from openai import OpenAI

# Replace {model_id} in base_url with your model ID before running.
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="google/gemma-4-E2B-it",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
cURL
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "google/gemma-4-E2B-it",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
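The {model_id} placeholder in both examples comes from the logs URL printed by truss push. As a small sketch, the two helpers below (hypothetical names, not part of Truss or the OpenAI SDK) pull the ID out of that URL and fill it into the production endpoint shown above.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a Baseten logs URL."""
    parts = urlparse(logs_url).path.strip("/").split("/")
    return parts[parts.index("models") + 1]


def base_url_for(model_id: str) -> str:
    """Fill the {model_id} placeholder in the production endpoint."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
model_id = model_id_from_logs_url(logs_url)
print(model_id)
print(base_url_for(model_id))
```

The returned string can be passed directly as base_url when constructing the OpenAI client.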
The server parses the model's chain of thought into a separate reasoning_content field on the response. Read it alongside the final answer:
response = client.chat.completions.create(
    model="google/gemma-4-E2B-it",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
)

print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
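reasoning_content is a server-side extension rather than a field the OpenAI SDK declares on its message type, so it may be absent on some responses (for example, if a server runs without a reasoning parser). A defensive read, sketched here with stand-in objects instead of a live response, avoids an AttributeError in that case. The helper name split_reasoning is hypothetical.

```python
from types import SimpleNamespace


def split_reasoning(message):
    """Return (reasoning, answer); reasoning is None when the field is absent."""
    reasoning = getattr(message, "reasoning_content", None)
    return reasoning, message.content


# Stand-ins for response.choices[0].message, with and without reasoning.
with_reasoning = SimpleNamespace(reasoning_content="Count each r.", content="3")
without_reasoning = SimpleNamespace(content="3")

print(split_reasoning(with_reasoning))
print(split_reasoning(without_reasoning))
```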
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="google/gemma-4-E2B-it",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)

print(response.choices[0].message.tool_calls)
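Printing tool_calls only shows what the model asked for; the application still has to run the function and send the result back. The sketch below shows one way to do that step, assuming the standard OpenAI tool-call shape (an id, a function name, and arguments as a JSON string). The get_weather implementation and the TOOLS registry are hypothetical, and stand-in objects replace a live response.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> str:
    """Hypothetical local implementation; a real one would call a weather API."""
    return f"Sunny in {location}"


# Maps tool names from the request's tools array to local callables.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(tool_calls):
    """Execute each tool call and return the tool messages to send back."""
    results = []
    for call in tool_calls or []:
        args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        output = TOOLS[call.function.name](**args)
        results.append({"role": "tool", "tool_call_id": call.id, "content": output})
    return results


# Stand-in for response.choices[0].message.tool_calls.
fake_calls = [SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
)]
print(run_tool_calls(fake_calls))
```

To get a final answer, append the assistant message and these tool messages to the conversation and call chat.completions.create again.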
google/gemma-4-E4B-it is a 4B-parameter dense model with up to 125K context. This preset serves Gemma 4 E4B on a single H100.
| Hardware | Engine | Context | Concurrency |
|---|---|---|---|
| H100 | vLLM 0.20.0 | 125K | 8 |
Write the config
Create and move into the project directory:
mkdir gemma-4-E4B-it-latency && cd gemma-4-E4B-it-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: model:gemma-4-E4B-it preset:latency
base_image:
  image: vllm/vllm-openai:v0.20.0
model_metadata:
  repo_id: google/gemma-4-E4B-it
  example_model_input:
    model: google/gemma-4-E4B-it
    messages:
      - role: user
        content:
          - type: text
            text: "Describe this image in one sentence."
          - type: image_url
            image_url:
              url: "https://picsum.photos/id/237/200/300"
    stream: true
    max_tokens: 512
    temperature: 1.0
  tags:
    - openai-compatible
weights:
  - source: "hf://google/gemma-4-E4B-it@main"
    mount_location: "/app/checkpoint/gemma"
    auth_secret_name: "hf_access_token"
build_commands:
  - pip install --upgrade transformers==5.5.4
docker_server:
  start_command: >-
    sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/gemma
    --tensor-parallel-size $GPU_COUNT
    --served-model-name google/gemma-4-E4B-it
    --max-num-seqs 16
    --max-model-len auto
    --limit-mm-per-prompt.image 1
    --gpu-memory-utilization 0.9
    --async-scheduling
    --trust-remote-code
    --enable-auto-tool-choice
    --enable-prefix-caching
    --reasoning-parser gemma4
    --tool-call-parser gemma4"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  VLLM_LOGGING_LEVEL: INFO
requirements:
  - huggingface_hub
  - hf_transfer
  - datasets
resources:
  accelerator: H100
  use_gpu: true
secrets:
  hf_access_token: null
runtime:
  health_checks:
    restart_check_delay_seconds: 300
    restart_threshold_seconds: 300
    stop_traffic_threshold_seconds: 120
  predict_concurrency: 8
# Updated with nightly image and async scheduling
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

| Flag | Value | What it does |
|---|---|---|
| --tensor-parallel-size | $GPU_COUNT | Number of GPUs to shard the model across. |
| --max-num-seqs | 16 | Maximum number of concurrent sequences in the batch. |
| --max-model-len | auto | Maximum context length (tokens) the server accepts per request. |
| --limit-mm-per-prompt.image | 1 | Maximum number of image inputs per prompt. |
| --gpu-memory-utilization | 0.9 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --async-scheduling | (no value) | Overlap scheduling with GPU execution to hide scheduler latency. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --enable-prefix-caching | (no value) | Reuse KV cache across requests that share a prefix. |
| --reasoning-parser | gemma4 | Server-side parser that separates reasoning output into reasoning_content. |
| --tool-call-parser | gemma4 | Server-side parser that emits structured tool_calls on the response. |
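To make the --gpu-memory-utilization value concrete, the arithmetic below turns the 0.9 fraction into a rough memory budget per accelerator. The device capacities (24 GB for an L4, 80 GB for an H100) are nominal figures assumed for this sketch and are not stated on this page; actual usable memory is somewhat lower.

```python
# Nominal device memory in GB; assumed values, not taken from this page.
GPU_MEMORY_GB = {"L4": 24, "H100": 80}


def vllm_memory_budget_gb(accelerator: str, utilization: float = 0.9, gpus: int = 1) -> float:
    """Memory vLLM may claim across all GPUs for weights plus KV cache."""
    return round(GPU_MEMORY_GB[accelerator] * utilization * gpus, 1)


for name, gpus in (("L4", 1), ("H100", 1), ("H100", 2)):
    print(name, gpus, vllm_memory_budget_gb(name, gpus=gpus))
```

Whatever the weights do not occupy within this budget is available to the KV cache, which is what bounds how many long sequences can be served at once.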
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model gemma-4-E4B-it-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Now call your deployment to run inference:
Python
main.py
import os

from openai import OpenAI

# Replace {model_id} in base_url with your model ID before running.
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="google/gemma-4-E4B-it",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
cURL
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "google/gemma-4-E4B-it",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
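The example_model_input in the config sends an image together with text, while the calls above are text-only. The sketch below builds the same multimodal message shape for use with the Python client. The helper image_question is a hypothetical name, and the one-image guard simply mirrors the --limit-mm-per-prompt.image 1 flag as a client-side convenience.

```python
import json

MAX_IMAGES_PER_PROMPT = 1  # matches --limit-mm-per-prompt.image in the config


def image_question(text: str, image_urls: list[str]) -> list[dict]:
    """Build one user message with a text part followed by the image parts."""
    if len(image_urls) > MAX_IMAGES_PER_PROMPT:
        raise ValueError(f"at most {MAX_IMAGES_PER_PROMPT} image(s) per prompt")
    content = [{"type": "text", "text": text}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    return [{"role": "user", "content": content}]


payload = {
    "model": "google/gemma-4-E4B-it",
    "messages": image_question(
        "Describe this image in one sentence.",
        ["https://picsum.photos/id/237/200/300"],
    ),
    "max_tokens": 512,
}
print(json.dumps(payload, indent=2))
```

Passing payload["messages"] as the messages argument of client.chat.completions.create sends the image request; the same JSON works as the body of the cURL call.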
The server parses the model's chain of thought into a separate reasoning_content field on the response. Read it alongside the final answer:
response = client.chat.completions.create(
    model="google/gemma-4-E4B-it",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
)

print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="google/gemma-4-E4B-it",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)

print(response.choices[0].message.tool_calls)
google/gemma-4-26B-A4B-it is a 26B-parameter MoE model (4B active per token) with up to 256K context. This preset serves Gemma 4 26B A4B on two H100 GPUs (H100:2) with FP8 dynamic quantization.
| Hardware | Engine | Context | Concurrency |
|---|---|---|---|
| H100 × 2 | vLLM 0.20.0 | 256K | 8 |
Write the config
Create and move into the project directory:
mkdir gemma-4-26B-A4B-it-latency && cd gemma-4-26B-A4B-it-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: model:gemma-4-26B-A4B-it preset:latency
base_image:
  image: vllm/vllm-openai:v0.20.0
model_metadata:
  repo_id: RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic
  example_model_input:
    model: google/gemma-4-26B-A4B-it
    messages:
      - role: user
        content:
          - type: text
            text: "Describe this image in one sentence."
          - type: image_url
            image_url:
              url: "https://picsum.photos/id/237/200/300"
    stream: true
    max_tokens: 512
    temperature: 1.0
  tags:
    - openai-compatible
weights:
  - source: "hf://RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic@main"
    mount_location: "/app/checkpoint/gemma"
    auth_secret_name: "hf_access_token"
build_commands:
  - pip install --upgrade transformers==5.5.4
docker_server:
  start_command: >-
    sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/gemma
    --tensor-parallel-size $GPU_COUNT
    --served-model-name google/gemma-4-26B-A4B-it
    --max-num-seqs 16
    --max-model-len auto
    --limit-mm-per-prompt.image 1
    --gpu-memory-utilization 0.9
    --enable-prefix-caching
    --speculative-config.model RedHatAI/gemma-4-26B-A4B-it-speculator.eagle3
    --speculative-config.num_speculative_tokens 3
    --speculative-config.method eagle3
    --trust-remote-code
    --enable-auto-tool-choice
    --reasoning-parser gemma4
    --tool-call-parser gemma4"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  VLLM_LOGGING_LEVEL: INFO
requirements:
  - huggingface_hub
  - hf_transfer
  - datasets
resources:
  accelerator: H100:2
  use_gpu: true
secrets:
  hf_access_token: null
runtime:
  health_checks:
    restart_check_delay_seconds: 300
    restart_threshold_seconds: 300
    stop_traffic_threshold_seconds: 120
  predict_concurrency: 8
# Updated with nightly image and restored speculative decoding for latency
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

| Flag | Value | What it does |
|---|---|---|
| --tensor-parallel-size | $GPU_COUNT | Number of GPUs to shard the model across. |
| --max-num-seqs | 16 | Maximum number of concurrent sequences in the batch. |
| --max-model-len | auto | Maximum context length (tokens) the server accepts per request. |
| --limit-mm-per-prompt.image | 1 | Maximum number of image inputs per prompt. |
| --gpu-memory-utilization | 0.9 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --enable-prefix-caching | (no value) | Reuse KV cache across requests that share a prefix. |
| --speculative-config.model | RedHatAI/gemma-4-26B-A4B-it-speculator.eagle3 | Hugging Face repo for the draft speculator checkpoint. |
| --speculative-config.num_speculative_tokens | 3 | Number of tokens the draft speculator proposes per step. |
| --speculative-config.method | eagle3 | Speculative decoding method. eagle3: EAGLE v3 speculative decoding. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --reasoning-parser | gemma4 | Server-side parser that separates reasoning output into reasoning_content. |
| --tool-call-parser | gemma4 | Server-side parser that emits structured tool_calls on the response. |
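To give a feel for what num_speculative_tokens of 3 can buy, the sketch below uses the common simplified model of speculative decoding in which each drafted token is accepted independently with the same probability. Under that assumption, one verification step of the target model emits (1 - a^(k+1)) / (1 - a) tokens on average for acceptance rate a and k drafted tokens. The acceptance rates in the loop are hypothetical, not measurements of this speculator.

```python
def expected_tokens_per_step(acceptance_rate: float, num_speculative_tokens: int) -> float:
    """Expected tokens emitted per target-model step under an i.i.d. acceptance model."""
    a, k = acceptance_rate, num_speculative_tokens
    if a >= 1.0:
        # Every draft token is accepted, plus the one token the target always adds.
        return float(k + 1)
    return (1 - a ** (k + 1)) / (1 - a)


# Hypothetical acceptance rates; real values depend on the prompt and the draft model.
for rate in (0.5, 0.7, 0.9):
    print(rate, round(expected_tokens_per_step(rate, 3), 2))
```

The returns flatten quickly at low acceptance rates, which is one reason a small draft length such as 3 is a reasonable latency-oriented choice.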
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model gemma-4-26B-A4B-it-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Now call your deployment to run inference:
Python
main.py
import os

from openai import OpenAI

# Replace {model_id} in base_url with your model ID before running.
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
cURL
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
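The preset lists a concurrency of 8 (predict_concurrency in the config). A client that fans out many requests might cap its in-flight calls at the same number; that pairing is an assumption of this sketch rather than a documented requirement. The example uses a placeholder coroutine in place of a real chat completion call so it runs without a deployment.

```python
import asyncio

MAX_IN_FLIGHT = 8  # mirrors predict_concurrency in the config (assumed pairing)


async def fake_request(i: int, sem: asyncio.Semaphore, stats: dict) -> int:
    """Placeholder for one awaited chat completion call."""
    async with sem:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # stands in for network and inference time
        stats["in_flight"] -= 1
        return i


async def run_all(n: int):
    """Launch n requests but let at most MAX_IN_FLIGHT run at once."""
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(*(fake_request(i, sem, stats) for i in range(n)))
    return results, stats["peak"]


results, peak = asyncio.run(run_all(32))
print(len(results), peak)
```

With a real deployment, the body of fake_request would await a call on the SDK's AsyncOpenAI client instead of sleeping.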
The server parses the model's chain of thought into a separate reasoning_content field on the response. Read it alongside the final answer:
response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
)

print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)

print(response.choices[0].message.tool_calls)
google/gemma-4-31B-it is a 31B-parameter dense model with up to 256K context. This preset serves Gemma 4 31B on two H100 GPUs (H100:2) with FP8 block quantization.
| Hardware | Engine | Context | Concurrency |
|---|---|---|---|
| H100 × 2 | vLLM 0.20.0 | 256K | 8 |
Write the config
Create and move into the project directory:
mkdir gemma-4-31B-it-latency && cd gemma-4-31B-it-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: model:gemma-4-31B-it preset:latency
base_image:
  image: vllm/vllm-openai:v0.20.0
model_metadata:
  repo_id: RedHatAI/gemma-4-31B-it-FP8-block
  example_model_input:
    model: google/gemma-4-31B-it
    messages:
      - role: user
        content:
          - type: text
            text: "Describe this image in one sentence."
          - type: image_url
            image_url:
              url: "https://picsum.photos/id/237/200/300"
    stream: true
    max_tokens: 512
    temperature: 1.0
  tags:
    - openai-compatible
weights:
  - source: "hf://RedHatAI/gemma-4-31B-it-FP8-block@main"
    mount_location: "/app/checkpoint/gemma"
    auth_secret_name: "hf_access_token"
build_commands:
  - pip install --upgrade transformers==5.5.4
docker_server:
  start_command: >-
    sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/gemma
    --tensor-parallel-size $GPU_COUNT
    --served-model-name google/gemma-4-31B-it
    --max-num-seqs 16
    --max-model-len auto
    --limit-mm-per-prompt.image 1
    --gpu-memory-utilization 0.9
    --enable-prefix-caching
    --speculative-config.model RedHatAI/gemma-4-31B-it-speculator.eagle3
    --speculative-config.num_speculative_tokens 3
    --speculative-config.method eagle3
    --trust-remote-code
    --enable-auto-tool-choice
    --reasoning-parser gemma4
    --tool-call-parser gemma4"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  VLLM_LOGGING_LEVEL: INFO
requirements:
  - huggingface_hub
  - hf_transfer
  - datasets
resources:
  accelerator: H100:2
  use_gpu: true
secrets:
  hf_access_token: null
runtime:
  health_checks:
    restart_check_delay_seconds: 300
    restart_threshold_seconds: 300
    stop_traffic_threshold_seconds: 120
  predict_concurrency: 8
# Updated with nightly image and restored speculative decoding for latency
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

| Flag | Value | What it does |
|---|---|---|
| --tensor-parallel-size | $GPU_COUNT | Number of GPUs to shard the model across. |
| --max-num-seqs | 16 | Maximum number of concurrent sequences in the batch. |
| --max-model-len | auto | Maximum context length (tokens) the server accepts per request. |
| --limit-mm-per-prompt.image | 1 | Maximum number of image inputs per prompt. |
| --gpu-memory-utilization | 0.9 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --enable-prefix-caching | (no value) | Reuse KV cache across requests that share a prefix. |
| --speculative-config.model | RedHatAI/gemma-4-31B-it-speculator.eagle3 | Hugging Face repo for the draft speculator checkpoint. |
| --speculative-config.num_speculative_tokens | 3 | Number of tokens the draft speculator proposes per step. |
| --speculative-config.method | eagle3 | Speculative decoding method. eagle3: EAGLE v3 speculative decoding. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --reasoning-parser | gemma4 | Server-side parser that separates reasoning output into reasoning_content. |
| --tool-call-parser | gemma4 | Server-side parser that emits structured tool_calls on the response. |
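The $GPU_COUNT value passed to --tensor-parallel-size is computed in the start_command by counting the lines that nvidia-smi --list-gpus prints, one per GPU. The sketch below reproduces that count in Python on an illustrative sample of the command's output; the sample text and UUIDs are made up for the example.

```python
def count_gpus(list_gpus_output: str) -> int:
    """Count non-empty lines, approximating `nvidia-smi --list-gpus | wc -l`."""
    return sum(1 for line in list_gpus_output.splitlines() if line.strip())


# Illustrative output for a two-GPU node (H100:2); UUIDs are placeholders.
sample_output = (
    "GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-xxxx)\n"
    "GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-yyyy)\n"
)

gpu_count = count_gpus(sample_output)
print(f"--tensor-parallel-size {gpu_count}")
```

Deriving the value at startup means the same start_command works unchanged if the accelerator line in the config is edited to a different GPU count.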
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model gemma-4-31B-it-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Now call your deployment to run inference:
Python
main.py
import os

from openai import OpenAI

# Replace {model_id} in base_url with your model ID before running.
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
cURL
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "google/gemma-4-31B-it",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
The server parses the model's chain of thought into a separate reasoning_content field on the response. Read it alongside the final answer:
response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
)

print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
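The config's example input sets stream: true, in which case the SDK yields chunks whose delta carries a piece of the output instead of a complete message. The sketch below accumulates such chunks into reasoning and answer strings. It assumes the server places streamed reasoning in delta.reasoning_content, mirroring the non-streaming field; that is an assumption worth checking against a live stream, and stand-in objects replace real chunks here.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Concatenate streamed reasoning and answer fragments from chat chunks."""
    reasoning, answer = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks (for example usage-only ones) carry no choices
        delta = chunk.choices[0].delta
        piece = getattr(delta, "reasoning_content", None)  # assumed field name
        if piece:
            reasoning.append(piece)
        piece = getattr(delta, "content", None)
        if piece:
            answer.append(piece)
    return "".join(reasoning), "".join(answer)


def fake_chunk(**delta_fields):
    """Stand-in for one chunk yielded by a stream=True request."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(**delta_fields))])


stream = [
    fake_chunk(reasoning_content="s-t-r-a-w-b-e-r-r-y: "),
    fake_chunk(reasoning_content="three r's."),
    fake_chunk(content="There are "),
    fake_chunk(content="3."),
    SimpleNamespace(choices=[]),
]
print(collect_stream(stream))
```

With a deployment, the stream argument would be the iterator returned by client.chat.completions.create(..., stream=True).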
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)

print(response.choices[0].message.tool_calls)