Gemma 4

Reasoning | Tool calling | Multimodal (image) | Long context

Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.

Sign in to Baseten

uvx truss login --browser

Install the OpenAI SDK

uv pip install openai

Pick the model you want to deploy. Each tab is a self-contained recipe.

E2B
E4B
26B A4B
31B

E2B

google/gemma-4-E2B-it is a 2B-parameter dense model with up to 125K context. This preset serves Gemma 4 E2B on a single L4, the lowest-cost deployment in the Model Library.

Hardware

L4

Engine

vLLM 0.20.0

Context

125K

Concurrency

Write the config

Create and move into the project directory:

mkdir gemma-4-E2B-it-latency && cd gemma-4-E2B-it-latency

Then create a file named config.yaml and paste the following:

config.yaml

model_name: model:gemma-4-E2B-it preset:latency
base_image:
  image: vllm/vllm-openai:v0.20.0
model_metadata:
  repo_id: google/gemma-4-E2B-it
  example_model_input:
    model: google/gemma-4-E2B-it
    messages:
      - role: user
        content:
          - type: text
            text: "Describe this image in one sentence."
          - type: image_url
            image_url:
              url: "https://picsum.photos/id/237/200/300"
    stream: true
    max_tokens: 512
    temperature: 1.0
  tags:
    - openai-compatible
weights:
  - source: "hf://google/gemma-4-E2B-it@main"
    mount_location: "/app/checkpoint/gemma"
    auth_secret_name: "hf_access_token"
build_commands:
  - pip install --upgrade transformers==5.5.4
docker_server:
  start_command: >-
    sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/gemma
    --tensor-parallel-size $GPU_COUNT
    --served-model-name google/gemma-4-E2B-it
    --max-num-seqs 16
    --max-model-len auto
    --limit-mm-per-prompt.image 1
    --gpu-memory-utilization 0.9
    --async-scheduling
    --trust-remote-code
    --enable-auto-tool-choice
    --enable-prefix-caching
    --reasoning-parser gemma4
    --tool-call-parser gemma4"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  VLLM_LOGGING_LEVEL: INFO
requirements:
  - huggingface_hub
  - hf_transfer
  - datasets
resources:
  accelerator: L4
  use_gpu: true
secrets:
  hf_access_token: null
runtime:
  health_checks:
    restart_check_delay_seconds: 300
    restart_threshold_seconds: 300
    stop_traffic_threshold_seconds: 120
  predict_concurrency: 8
# Updated with nightly image and async scheduling

Flags

The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

Flag	Value	What it does
`--tensor-parallel-size`	`$GPU_COUNT`	Number of GPUs to shard the model across.
`--max-num-seqs`	`16`	Maximum number of concurrent sequences in the batch.
`--max-model-len`	`auto`	Maximum context length (tokens) the server accepts per request.
`--limit-mm-per-prompt.image`	`1`	Maximum number of image inputs per prompt.
`--gpu-memory-utilization`	`0.9`	Fraction of GPU memory vLLM may use for weights and KV cache.
`--async-scheduling`	(no value)	Overlap scheduling with GPU execution to hide scheduler latency.
`--trust-remote-code`	(no value)	Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).
`--enable-auto-tool-choice`	(no value)	Let the model choose when to call tools without requiring `tool_choice: "required"`.
`--enable-prefix-caching`	(no value)	Reuse KV cache across requests that share a prefix.
`--reasoning-parser`	`gemma4`	Server-side parser that separates reasoning output into `reasoning_content`.
`--tool-call-parser`	`gemma4`	Server-side parser that emits structured `tool_calls` on the response.
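
To see how the table maps onto the command line, the sketch below assembles a `vllm serve` argument list from a plain dictionary. The helper name `build_vllm_args` and the convention of using `True` for value-less switches are assumptions made for illustration; they are not part of Truss or vLLM. `--tensor-parallel-size` is left out because its value comes from the shell variable `$GPU_COUNT` at start time.

```python
from typing import Dict, List, Union

FlagValue = Union[str, int, float, bool]


def build_vllm_args(model_path: str, flags: Dict[str, FlagValue]) -> List[str]:
    """Turn a flag mapping into an argv list for `vllm serve`.

    Hypothetical helper: True marks a switch that takes no value,
    False drops the flag entirely.
    """
    args = ["vllm", "serve", model_path]
    for name, value in flags.items():
        if value is True:
            args.append(name)          # switch with no value
        elif value is False:
            continue                   # flag disabled, skip it
        else:
            args.extend([name, str(value)])
    return args


# Values copied from the E2B config in this guide.
E2B_FLAGS: Dict[str, FlagValue] = {
    "--served-model-name": "google/gemma-4-E2B-it",
    "--max-num-seqs": 16,
    "--max-model-len": "auto",
    "--limit-mm-per-prompt.image": 1,
    "--gpu-memory-utilization": 0.9,
    "--async-scheduling": True,
    "--trust-remote-code": True,
    "--enable-auto-tool-choice": True,
    "--enable-prefix-caching": True,
    "--reasoning-parser": "gemma4",
    "--tool-call-parser": "gemma4",
}

args = build_vllm_args("/app/checkpoint/gemma", E2B_FLAGS)
print(" ".join(args))
```

Keeping the flags in a mapping like this can make it easier to diff presets, for example when comparing this config with the speculative-decoding flags used by the larger models.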

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model gemma-4-E2B-it-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678

Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
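
If you would rather not copy the ID by hand, the rule above can be sketched in a few lines. The function names `model_id_from_logs_url` and `base_url_for` are hypothetical helpers for this guide, and the URL is the example one shown in the output above.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    parts = [p for p in urlparse(logs_url).path.split("/") if p]
    try:
        return parts[parts.index("models") + 1]
    except (ValueError, IndexError):
        raise ValueError(f"no model ID found in {logs_url!r}") from None


def base_url_for(model_id: str) -> str:
    """Fill the {model_id} placeholder in the base URL used by this guide."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
model_id = model_id_from_logs_url(logs_url)
print(model_id)
print(base_url_for(model_id))
```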

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Then call your deployment to run inference:

Python
cURL

main.py

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="google/gemma-4-E2B-it",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "google/gemma-4-E2B-it",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'

The server parses the model’s chain of thought into a separate reasoning_content field on the response. Read it alongside the final answer:

response = client.chat.completions.create(
    model="google/gemma-4-E2B-it",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)            # final answer
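
Because `reasoning_content` is added by the server rather than being a standard chat-completion field, a defensive read can be useful. The sketch below assumes the field may be missing or `None` on some responses; `split_reasoning` is a hypothetical helper, and the `SimpleNamespace` objects are offline stand-ins shaped like `response.choices[0].message`.

```python
from types import SimpleNamespace
from typing import Any, Optional, Tuple


def split_reasoning(message: Any) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning, answer), tolerating a missing reasoning field."""
    reasoning = getattr(message, "reasoning_content", None)
    answer = getattr(message, "content", None)
    return reasoning, answer


# Stand-ins so the sketch runs without a deployment.
with_reasoning = SimpleNamespace(reasoning_content="Count each letter...", content="3")
without_reasoning = SimpleNamespace(content="3")

print(split_reasoning(with_reasoning))
print(split_reasoning(without_reasoning))
```

With a live deployment, the same helper would be called as `split_reasoning(response.choices[0].message)`.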

To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="google/gemma-4-E2B-it",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
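
The model only requests a tool call; running the tool is up to your code. A minimal sketch of that step follows, assuming the usual OpenAI-style shape in which each call carries an `id`, a `function.name`, and `function.arguments` as a JSON string. The local `get_weather` implementation and the `TOOL_REGISTRY` mapping are hypothetical, and the `SimpleNamespace` object stands in for a real entry of `tool_calls`.

```python
import json
from types import SimpleNamespace
from typing import Any, Callable, Dict, List, Optional


def get_weather(location: str) -> str:
    """Hypothetical local implementation of the get_weather tool."""
    return f"Sunny in {location}"


TOOL_REGISTRY: Dict[str, Callable[..., str]] = {"get_weather": get_weather}


def run_tool_calls(tool_calls: Optional[List[Any]]) -> List[Dict[str, str]]:
    """Execute each requested tool and build role=tool messages for the next turn."""
    messages = []
    for call in tool_calls or []:      # tool_calls is None when no tool was requested
        func = TOOL_REGISTRY[call.function.name]
        kwargs = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": func(**kwargs),
        })
    return messages


# Stand-in shaped like one entry of response.choices[0].message.tool_calls.
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
)
tool_messages = run_tool_calls([fake_call])
print(tool_messages)
```

To continue the conversation, append the assistant message that contained the tool calls, then these tool messages, and call `client.chat.completions.create` again.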

E4B

google/gemma-4-E4B-it is a 4B-parameter dense model with up to 125K context. This preset serves Gemma 4 E4B on a single H100.

Hardware

H100

Engine

vLLM 0.20.0

Context

125K

Concurrency

Write the config

Create and move into the project directory:

mkdir gemma-4-E4B-it-latency && cd gemma-4-E4B-it-latency

Then create a file named config.yaml and paste the following:

config.yaml

model_name: model:gemma-4-E4B-it preset:latency
base_image:
  image: vllm/vllm-openai:v0.20.0
model_metadata:
  repo_id: google/gemma-4-E4B-it
  example_model_input:
    model: google/gemma-4-E4B-it
    messages:
      - role: user
        content:
          - type: text
            text: "Describe this image in one sentence."
          - type: image_url
            image_url:
              url: "https://picsum.photos/id/237/200/300"
    stream: true
    max_tokens: 512
    temperature: 1.0
  tags:
    - openai-compatible
weights:
  - source: "hf://google/gemma-4-E4B-it@main"
    mount_location: "/app/checkpoint/gemma"
    auth_secret_name: "hf_access_token"
build_commands:
  - pip install --upgrade transformers==5.5.4
docker_server:
  start_command: >-
    sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/gemma
    --tensor-parallel-size $GPU_COUNT
    --served-model-name google/gemma-4-E4B-it
    --max-num-seqs 16
    --max-model-len auto
    --limit-mm-per-prompt.image 1
    --gpu-memory-utilization 0.9
    --async-scheduling
    --trust-remote-code
    --enable-auto-tool-choice
    --enable-prefix-caching
    --reasoning-parser gemma4
    --tool-call-parser gemma4"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  VLLM_LOGGING_LEVEL: INFO
requirements:
  - huggingface_hub
  - hf_transfer
  - datasets
resources:
  accelerator: H100
  use_gpu: true
secrets:
  hf_access_token: null
runtime:
  health_checks:
    restart_check_delay_seconds: 300
    restart_threshold_seconds: 300
    stop_traffic_threshold_seconds: 120
  predict_concurrency: 8
# Updated with nightly image and async scheduling

Flags

The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

Flag	Value	What it does
`--tensor-parallel-size`	`$GPU_COUNT`	Number of GPUs to shard the model across.
`--max-num-seqs`	`16`	Maximum number of concurrent sequences in the batch.
`--max-model-len`	`auto`	Maximum context length (tokens) the server accepts per request.
`--limit-mm-per-prompt.image`	`1`	Maximum number of image inputs per prompt.
`--gpu-memory-utilization`	`0.9`	Fraction of GPU memory vLLM may use for weights and KV cache.
`--async-scheduling`	(no value)	Overlap scheduling with GPU execution to hide scheduler latency.
`--trust-remote-code`	(no value)	Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).
`--enable-auto-tool-choice`	(no value)	Let the model choose when to call tools without requiring `tool_choice: "required"`.
`--enable-prefix-caching`	(no value)	Reuse KV cache across requests that share a prefix.
`--reasoning-parser`	`gemma4`	Server-side parser that separates reasoning output into `reasoning_content`.
`--tool-call-parser`	`gemma4`	Server-side parser that emits structured `tool_calls` on the response.
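
Since `--max-num-seqs` caps how many sequences the engine batches at once, a client that fans out many requests may want its own cap. The sketch below is one way to do that with a thread pool. The `send` callable is hypothetical (in practice it would wrap `client.chat.completions.create`), and the default of 8 simply mirrors `predict_concurrency` in the config; the right value for a given workload is an assumption to tune.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_bounded(
    send: Callable[[str], str],
    prompts: Sequence[str],
    max_in_flight: int = 8,
) -> List[str]:
    """Call `send` for every prompt with at most `max_in_flight` running at once.

    Results are returned in the same order as `prompts`.
    """
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(send, prompts))


def fake_send(prompt: str) -> str:
    """Offline stand-in for a function that would call the deployment."""
    return prompt.upper()


results = run_bounded(fake_send, ["a", "b", "c"], max_in_flight=2)
print(results)
```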

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model gemma-4-E4B-it-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678

Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Then call your deployment to run inference:

Python
cURL

main.py

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="google/gemma-4-E4B-it",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "google/gemma-4-E4B-it",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'

The server parses the model’s chain of thought into a separate reasoning_content field on the response. Read it alongside the final answer:

response = client.chat.completions.create(
    model="google/gemma-4-E4B-it",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)            # final answer

To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="google/gemma-4-E4B-it",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
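
The `example_model_input` in the config sends an image alongside text. A request of the same shape can be built in Python as sketched below; `image_question` is a hypothetical helper, and the prompt and image URL are the ones from the config.

```python
from typing import Any, Dict, List


def image_question(text: str, image_url: str) -> List[Dict[str, Any]]:
    """Build a single-turn messages list with one text part and one image part."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }]


messages = image_question(
    "Describe this image in one sentence.",
    "https://picsum.photos/id/237/200/300",
)
print(messages)
```

Pass the result as `messages=messages` to `client.chat.completions.create`. The server in this preset is started with `--limit-mm-per-prompt.image 1`, so each prompt can carry at most one image.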

26B A4B

google/gemma-4-26B-A4B-it is a 26B-parameter MoE model (4B active per token) with up to 256K context. This preset serves Gemma 4 26B A4B on two H100 GPUs (H100:2) with FP8 dynamic quantization.

Hardware

H100 × 2

Engine

vLLM 0.20.0

Context

256K

Concurrency

Write the config

Create and move into the project directory:

mkdir gemma-4-26B-A4B-it-latency && cd gemma-4-26B-A4B-it-latency

Then create a file named config.yaml and paste the following:

config.yaml

model_name: model:gemma-4-26B-A4B-it preset:latency
base_image:
  image: vllm/vllm-openai:v0.20.0
model_metadata:
  repo_id: RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic
  example_model_input:
    model: google/gemma-4-26B-A4B-it
    messages:
      - role: user
        content:
          - type: text
            text: "Describe this image in one sentence."
          - type: image_url
            image_url:
              url: "https://picsum.photos/id/237/200/300"
    stream: true
    max_tokens: 512
    temperature: 1.0
  tags:
    - openai-compatible
weights:
  - source: "hf://RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic@main"
    mount_location: "/app/checkpoint/gemma"
    auth_secret_name: "hf_access_token"
build_commands:
  - pip install --upgrade transformers==5.5.4
docker_server:
  start_command: >-
    sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/gemma
    --tensor-parallel-size $GPU_COUNT
    --served-model-name google/gemma-4-26B-A4B-it
    --max-num-seqs 16
    --max-model-len auto
    --limit-mm-per-prompt.image 1
    --gpu-memory-utilization 0.9
    --enable-prefix-caching
    --speculative-config.model RedHatAI/gemma-4-26B-A4B-it-speculator.eagle3
    --speculative-config.num_speculative_tokens 3
    --speculative-config.method eagle3
    --trust-remote-code
    --enable-auto-tool-choice
    --reasoning-parser gemma4
    --tool-call-parser gemma4"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  VLLM_LOGGING_LEVEL: INFO
requirements:
  - huggingface_hub
  - hf_transfer
  - datasets
resources:
  accelerator: H100:2
  use_gpu: true
secrets:
  hf_access_token: null
runtime:
  health_checks:
    restart_check_delay_seconds: 300
    restart_threshold_seconds: 300
    stop_traffic_threshold_seconds: 120
  predict_concurrency: 8
# Updated with nightly image and restored speculative decoding for latency

Flags

The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

Flag	Value	What it does
`--tensor-parallel-size`	`$GPU_COUNT`	Number of GPUs to shard the model across.
`--max-num-seqs`	`16`	Maximum number of concurrent sequences in the batch.
`--max-model-len`	`auto`	Maximum context length (tokens) the server accepts per request.
`--limit-mm-per-prompt.image`	`1`	Maximum number of image inputs per prompt.
`--gpu-memory-utilization`	`0.9`	Fraction of GPU memory vLLM may use for weights and KV cache.
`--enable-prefix-caching`	(no value)	Reuse KV cache across requests that share a prefix.
`--speculative-config.model`	`RedHatAI/gemma-4-26B-A4B-it-speculator.eagle3`	Hugging Face repo for the draft speculator checkpoint.
`--speculative-config.num_speculative_tokens`	`3`	Number of tokens the draft speculator proposes per step.
`--speculative-config.method`	`eagle3`	Speculative decoding method. eagle3: EAGLE v3 speculative decoding.
`--trust-remote-code`	(no value)	Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).
`--enable-auto-tool-choice`	(no value)	Let the model choose when to call tools without requiring `tool_choice: "required"`.
`--reasoning-parser`	`gemma4`	Server-side parser that separates reasoning output into `reasoning_content`.
`--tool-call-parser`	`gemma4`	Server-side parser that emits structured `tool_calls` on the response.
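
To make `num_speculative_tokens` concrete, here is a toy greedy version of one speculative decoding step: a cheap draft proposes `k` tokens, the target checks them, and the longest matching prefix is kept. This is a simplified illustration under stated assumptions, not vLLM's EAGLE-3 implementation; the integer "tokens" and the two next-token functions are made up for the example.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Return the tokens accepted in one greedy speculative step."""
    # 1. The draft proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        proposed.append(token)
        ctx.append(token)

    # 2. The target verifies each proposal. A real engine scores all
    #    positions in one batched pass; this toy checks them one by one.
    accepted: List[int] = []
    ctx = list(prefix)
    for token in proposed:
        expected = target_next(ctx)
        if expected != token:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. Every proposal matched, so the target adds one extra token.
    accepted.append(target_next(ctx))
    return accepted


def target_next(ctx: List[int]) -> int:
    """Toy target model: count upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[int]) -> int:
    """Toy draft model: agrees with the target except right after a 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step([0], draft_next, target_next, k=3))
print(speculative_step([2], draft_next, target_next, k=3))
```

With `k=3`, a step yields between one and four tokens depending on how often the draft agrees with the target, which is why a well-matched draft can reduce latency.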

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model gemma-4-26B-A4B-it-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678

Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Then call your deployment to run inference:

Python
cURL

main.py

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'

The server parses the model’s chain of thought into a separate reasoning_content field on the response. Read it alongside the final answer:

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)            # final answer

To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
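
The config's `example_model_input` sets `stream: true`. With the OpenAI SDK, passing `stream=True` to `client.chat.completions.create` returns an iterator of chunks whose text arrives in `choices[0].delta.content`. The sketch below shows one way to reassemble that text; `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks are offline stand-ins for what the server would send.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Optional


def collect_stream(chunks: Iterable[Any]) -> str:
    """Concatenate the delta.content pieces of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:      # some chunks carry no choices at all
            continue
        piece = chunk.choices[0].delta.content
        if piece:                  # role-only and final chunks have no text
            parts.append(piece)
    return "".join(parts)


def fake_chunk(text: Optional[str]) -> SimpleNamespace:
    """Build a stand-in shaped like one streamed chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


fake_stream = [
    fake_chunk(None),
    fake_chunk("Machine "),
    fake_chunk("learning"),
    SimpleNamespace(choices=[]),
]
print(collect_stream(fake_stream))
```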

31B

google/gemma-4-31B-it is a 31B-parameter dense model with up to 256K context. This preset serves Gemma 4 31B on two H100 GPUs (H100:2) with FP8 block quantization.

Hardware

H100 × 2

Engine

vLLM 0.20.0

Context

256K

Concurrency

Write the config

Create and move into the project directory:

mkdir gemma-4-31B-it-latency && cd gemma-4-31B-it-latency

Then create a file named config.yaml and paste the following:

config.yaml

model_name: model:gemma-4-31B-it preset:latency
base_image:
  image: vllm/vllm-openai:v0.20.0
model_metadata:
  repo_id: RedHatAI/gemma-4-31B-it-FP8-block
  example_model_input:
    model: google/gemma-4-31B-it
    messages:
      - role: user
        content:
          - type: text
            text: "Describe this image in one sentence."
          - type: image_url
            image_url:
              url: "https://picsum.photos/id/237/200/300"
    stream: true
    max_tokens: 512
    temperature: 1.0
  tags:
    - openai-compatible
weights:
  - source: "hf://RedHatAI/gemma-4-31B-it-FP8-block@main"
    mount_location: "/app/checkpoint/gemma"
    auth_secret_name: "hf_access_token"
build_commands:
  - pip install --upgrade transformers==5.5.4
docker_server:
  start_command: >-
    sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/gemma
    --tensor-parallel-size $GPU_COUNT
    --served-model-name google/gemma-4-31B-it
    --max-num-seqs 16
    --max-model-len auto
    --limit-mm-per-prompt.image 1
    --gpu-memory-utilization 0.9
    --enable-prefix-caching
    --speculative-config.model RedHatAI/gemma-4-31B-it-speculator.eagle3
    --speculative-config.num_speculative_tokens 3
    --speculative-config.method eagle3
    --trust-remote-code
    --enable-auto-tool-choice
    --reasoning-parser gemma4
    --tool-call-parser gemma4"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  VLLM_LOGGING_LEVEL: INFO
requirements:
  - huggingface_hub
  - hf_transfer
  - datasets
resources:
  accelerator: H100:2
  use_gpu: true
secrets:
  hf_access_token: null
runtime:
  health_checks:
    restart_check_delay_seconds: 300
    restart_threshold_seconds: 300
    stop_traffic_threshold_seconds: 120
  predict_concurrency: 8
# Updated with nightly image and restored speculative decoding for latency

Flags

The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

Flag	Value	What it does
`--tensor-parallel-size`	`$GPU_COUNT`	Number of GPUs to shard the model across.
`--max-num-seqs`	`16`	Maximum number of concurrent sequences in the batch.
`--max-model-len`	`auto`	Maximum context length (tokens) the server accepts per request.
`--limit-mm-per-prompt.image`	`1`	Maximum number of image inputs per prompt.
`--gpu-memory-utilization`	`0.9`	Fraction of GPU memory vLLM may use for weights and KV cache.
`--enable-prefix-caching`	(no value)	Reuse KV cache across requests that share a prefix.
`--speculative-config.model`	`RedHatAI/gemma-4-31B-it-speculator.eagle3`	Hugging Face repo for the draft speculator checkpoint.
`--speculative-config.num_speculative_tokens`	`3`	Number of tokens the draft speculator proposes per step.
`--speculative-config.method`	`eagle3`	Speculative decoding method. eagle3: EAGLE v3 speculative decoding.
`--trust-remote-code`	(no value)	Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).
`--enable-auto-tool-choice`	(no value)	Let the model choose when to call tools without requiring `tool_choice: "required"`.
`--reasoning-parser`	`gemma4`	Server-side parser that separates reasoning output into `reasoning_content`.
`--tool-call-parser`	`gemma4`	Server-side parser that emits structured `tool_calls` on the response.
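
The idea behind `--enable-prefix-caching` can be pictured with a toy model: token IDs are grouped into fixed-size blocks, each block is hashed together with the hash of the block before it, and a new request reuses every leading block whose hash has been seen. This is a simplified sketch, not vLLM's actual cache; the block size of 4, the `PrefixCache` class, and the lack of eviction are all assumptions for illustration.

```python
import hashlib
from typing import List, Set


def block_hashes(tokens: List[int], block_size: int) -> List[str]:
    """Hash each full block, chaining in the previous block's hash."""
    hashes: List[str] = []
    parent = ""
    full = len(tokens) - len(tokens) % block_size   # ignore the partial tail
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + ":" + ",".join(str(t) for t in block)
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    """Toy cache that remembers which block hashes it has seen."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        self._seen: Set[str] = set()

    def admit(self, tokens: List[int]) -> int:
        """Return how many leading tokens were already cached, then cache this request."""
        hashes = block_hashes(tokens, self.block_size)
        hits = 0
        for digest in hashes:
            if digest not in self._seen:
                break
            hits += 1
        self._seen.update(hashes)
        return hits * self.block_size


cache = PrefixCache(block_size=4)
print(cache.admit(list(range(10))))                       # first request, nothing cached yet
print(cache.admit(list(range(8)) + [99, 100, 101, 102]))  # shares the first two blocks
```

Requests that start with the same system prompt or document share leading blocks in this way, so only the differing tail needs fresh computation.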

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model gemma-4-31B-it-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678

Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Then call your deployment to run inference:

Python
cURL

main.py

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "google/gemma-4-31B-it",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
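
Before running either request, it can help to fail fast when the API key or model ID is missing. The sketch below is one way to do that. `BASETEN_MODEL_ID` is a hypothetical variable name chosen for this example (the guide itself only defines `BASETEN_API_KEY`), and `load_settings` is likewise an illustrative helper.

```python
import os
from typing import Dict, Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read the API key and model ID, raising a clear error if either is unset."""
    required = ("BASETEN_API_KEY", "BASETEN_MODEL_ID")  # second name is hypothetical
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    model_id = env["BASETEN_MODEL_ID"]
    return {
        "api_key": env["BASETEN_API_KEY"],
        "base_url": f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
    }


# Example values so the sketch runs without touching the real environment.
settings = load_settings({"BASETEN_API_KEY": "example-key", "BASETEN_MODEL_ID": "abcd1234"})
print(settings["base_url"])
```

The returned dictionary can be unpacked straight into the client, as in `OpenAI(**load_settings())`.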

The server parses the model’s chain of thought into a separate reasoning_content field on the response. Read it alongside the final answer:

response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)            # final answer

To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
