Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.
Sign in to Baseten
uvx truss login --browser
Install the OpenAI SDK
uv pip install openai
Pick the model you want to deploy. Each tab is a self-contained recipe.
google/gemma-4-E2B-it is a 2B-parameter dense model with up to 125K context. This preset serves Gemma 4 E2B on a single L4, the lowest-cost deployment in the Model Library.

Hardware

L4

Engine

vLLM 0.20.0

Context

125K

Concurrency

8

Write the config

Create and move into the project directory:
mkdir gemma-4-E2B-it-latency && cd gemma-4-E2B-it-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: model:gemma-4-E2B-it preset:latency
base_image:
  image: vllm/vllm-openai:v0.20.0
model_metadata:
  repo_id: google/gemma-4-E2B-it
  example_model_input:
    model: google/gemma-4-E2B-it
    messages:
      - role: user
        content:
          - type: text
            text: "Describe this image in one sentence."
          - type: image_url
            image_url:
              url: "https://picsum.photos/id/237/200/300"
    stream: true
    max_tokens: 512
    temperature: 1.0
  tags:
    - openai-compatible
weights:
  - source: "hf://google/gemma-4-E2B-it@main"
    mount_location: "/app/checkpoint/gemma"
    auth_secret_name: "hf_access_token"
build_commands:
  - pip install --upgrade transformers==5.5.4
docker_server:
  start_command: >-
    sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/gemma
    --tensor-parallel-size $GPU_COUNT
    --served-model-name google/gemma-4-E2B-it
    --max-num-seqs 16
    --max-model-len auto
    --limit-mm-per-prompt.image 1
    --gpu-memory-utilization 0.9
    --async-scheduling
    --trust-remote-code
    --enable-auto-tool-choice
    --enable-prefix-caching
    --reasoning-parser gemma4
    --tool-call-parser gemma4"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  VLLM_LOGGING_LEVEL: INFO
requirements:
  - huggingface_hub
  - hf_transfer
  - datasets
resources:
  accelerator: L4
  use_gpu: true
secrets:
  hf_access_token: null
runtime:
  health_checks:
    restart_check_delay_seconds: 300
    restart_threshold_seconds: 300
    stop_traffic_threshold_seconds: 120
  predict_concurrency: 8
# Updated with nightly image and async scheduling

Flags

The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
| --- | --- | --- |
| --tensor-parallel-size | $GPU_COUNT | Number of GPUs to shard the model across. |
| --max-num-seqs | 16 | Maximum number of concurrent sequences in the batch. |
| --max-model-len | auto | Maximum context length (tokens) the server accepts per request. |
| --limit-mm-per-prompt.image | 1 | Maximum number of image inputs per prompt. |
| --gpu-memory-utilization | 0.9 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --async-scheduling | (no value) | Overlap scheduling with GPU execution to hide scheduler latency. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --enable-prefix-caching | (no value) | Reuse KV cache across requests that share a prefix. |
| --reasoning-parser | gemma4 | Server-side parser that separates reasoning output into reasoning_content. |
| --tool-call-parser | gemma4 | Server-side parser that emits structured tool_calls on the response. |

Deploy

Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model gemma-4-E2B-it-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
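If you would rather not copy the ID by hand, the rule above can be sketched in a few lines of Python. The helper names model_id_from_logs_url and base_url_for are hypothetical (they are not part of Truss or the OpenAI SDK), and the sketch assumes the logs URL always has the /models/{model_id}/logs/... shape shown in the example output.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    segments = [s for s in urlparse(logs_url).path.split("/") if s]
    try:
        return segments[segments.index("models") + 1]
    except (ValueError, IndexError):
        raise ValueError(f"no model ID found in {logs_url!r}") from None


def base_url_for(model_id: str) -> str:
    """Fill the {model_id} placeholder in the base URL used in the next section."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


# The example URL from the push output above.
logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
model_id = model_id_from_logs_url(logs_url)
print(model_id)
print(base_url_for(model_id))
```

The second helper only fills in the URL pattern used by the client in the next section; it does not check that the deployment exists.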

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Then call your deployment to run inference:
main.py
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="google/gemma-4-E2B-it",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
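The example_model_input in config.yaml sends an image alongside the text prompt. A minimal sketch of building the same message shape for the SDK call follows; build_image_message is a hypothetical helper, and the request itself is left commented out because it needs the client from main.py and a live deployment.

```python
def build_image_message(text: str, image_url: str) -> dict:
    """Build one user message with a text part and a single image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }


# Same prompt and image URL as example_model_input in config.yaml.
messages = [
    build_image_message(
        "Describe this image in one sentence.",
        "https://picsum.photos/id/237/200/300",
    )
]
print(messages)

# With the client from main.py (requires a running deployment):
# response = client.chat.completions.create(
#     model="google/gemma-4-E2B-it",
#     messages=messages,
#     max_tokens=512,
# )
# print(response.choices[0].message.content)
```

The helper accepts exactly one image because the config passes --limit-mm-per-prompt.image 1, which caps each prompt at a single image input.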
The server parses the model’s chain of thought into a separate reasoning_content field on the response. Read it alongside the final answer:
response = client.chat.completions.create(
    model="google/gemma-4-E2B-it",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)            # final answer
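Because reasoning_content is filled in by the server-side parser rather than declared in the OpenAI SDK's message type, it is safer to assume it may be missing or None on some responses. The sketch below reads it defensively; split_reasoning is a hypothetical helper, and the SimpleNamespace object is a stand-in for response.choices[0].message so the snippet runs without a deployment.

```python
from types import SimpleNamespace


def split_reasoning(message):
    """Return (reasoning, answer); either value may be None if the field is absent."""
    reasoning = getattr(message, "reasoning_content", None)
    answer = getattr(message, "content", None)
    return reasoning, answer


# Stand-in objects with placeholder text, not real model output.
with_reasoning = SimpleNamespace(reasoning_content="(reasoning text)", content="(final answer)")
without_reasoning = SimpleNamespace(content="(final answer)")

for message in (with_reasoning, without_reasoning):
    reasoning, answer = split_reasoning(message)
    if reasoning:
        print("reasoning:", reasoning)
    print("answer:", answer)
```

In real use, pass response.choices[0].message from the call above in place of the stand-in objects.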
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="google/gemma-4-E2B-it",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
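The response only tells you which tool the model wants to call; running it is up to your code. A minimal sketch of that step follows, assuming each entry in tool_calls follows the OpenAI SDK shape (an id, plus function.name and function.arguments as a JSON string). The get_weather body, TOOL_REGISTRY, and run_tool_call are hypothetical, and the SimpleNamespace object stands in for a real tool call so the snippet runs offline.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> str:
    """Hypothetical local tool; a real one would query a weather service."""
    return f"Weather lookup for {location} is not implemented in this sketch."


# Maps the tool names declared in `tools` to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call) -> dict:
    """Execute one tool call and wrap the result as a role="tool" message."""
    func = TOOL_REGISTRY[tool_call.function.name]
    kwargs = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": func(**kwargs),
    }


# Stand-in for one entry of response.choices[0].message.tool_calls.
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
)
tool_message = run_tool_call(fake_call)
print(tool_message)
```

To get a final answer, the usual pattern is to append the assistant message that contained the tool calls, followed by each tool message, to messages and call client.chat.completions.create again.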