The config.yaml file defines how your model runs on Baseten: its dependencies, compute resources, secrets, and runtime behavior. You specify what your model needs; Baseten handles the infrastructure. Every Truss includes a config.yaml in its root directory. Configuration is optional; every value has a sensible default.
If you’re new to YAML, here’s a quick primer. The default config uses [] for empty lists and {} for empty dictionaries. When adding values, the syntax changes to indented lines:
# Empty
requirements: []
secrets: {}

# With values
requirements:
  - torch
  - transformers
secrets:
  hf_access_token: null

Example

The following example shows a config file for a GPU-accelerated text generation model:
config.yaml
model_name: my-llm
description: A text generation model.
requirements:
  - torch
  - transformers
  - accelerate
resources:
  cpu: "4"
  memory: 16Gi
  accelerator: L4
secrets:
  hf_access_token: null
For more examples, see the truss-examples repository.

Reference

model_name
string
The name of your model. This is displayed in the model details page in the Baseten UI.
description
string
A description of your model.
model_class_name
string
default:"Model"
The name of the class that defines your Truss model. This class must implement at least a predict method.
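For example, a minimal model class that satisfies this contract might look like the following (illustrative sketch; the kwargs and method names follow the examples elsewhere on this page):

```python
# model/model.py — minimal Truss model class (illustrative sketch)
class Model:
    def __init__(self, **kwargs):
        # Baseten passes config, secrets, data_dir, etc. via kwargs.
        self._config = kwargs.get("config", {})

    def load(self):
        # Optional: load weights here; runs once before serving traffic.
        pass

    def predict(self, model_input):
        # Required: handle one inference request.
        return {"echo": model_input}
```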
model_module_dir
string
default:"model"
The folder containing your model class.
data_dir
string
default:"data"
The folder for data files in your Truss. Access it in your model:
model/model.py
class Model:
  def __init__(self, **kwargs):
    self._data_dir = kwargs["data_dir"]

  # ...
bundled_packages_dir
string
default:"packages"
The folder for custom packages in your Truss. Place your own code here to reference in model.py. For example, with this project structure:
stable-diffusion/
    packages/
        package_1/
            subpackage/
                script.py
    model/
        model.py
        __init__.py
    config.yaml
Inside model.py, the package can be imported like this:
model/model.py
from package_1.subpackage.script import run_script

class Model:
    def __init__(self, **kwargs):
        pass

    def load(self):
        run_script()

    ...
external_package_dirs
string[]
Use external_package_dirs to access custom packages located outside your Truss. This lets multiple Trusses share the same package. The following example shows a project structure where shared_utils/ is outside the Truss:
my-model/
    model/
        model.py
    config.yaml
shared_utils/
    helpers.py
Specify the path in your config.yaml:
config.yaml
external_package_dirs:
  - ../shared_utils/
Then import the package in your model.py:
model.py
from shared_utils.helpers import process_input

class Model:
    def predict(self, model_input):
        return process_input(model_input)
environment_variables
object
Key-value pairs exposed to the environment that the model executes in. Many Python libraries can be customized with environment variables.
Do not store secret values directly in environment variables (or anywhere in the config file). See the secrets field for information on properly managing secrets.
environment_variables:
  ENVIRONMENT: Staging
  DB_URL: https://my_database.example.com/
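At runtime these values are plain strings in the process environment. For example (the get_setting helper is just an illustration, not a Truss API):

```python
import os

def get_setting(name: str, default: str = "") -> str:
    # Values from environment_variables arrive as plain strings.
    return os.environ.get(name, default)
```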
model_metadata
object
A flexible field for additional metadata. The entire config file is available to your model at runtime. Reserved keys that Baseten interprets:
  • example_model_input: Sample input that populates the Baseten playground.
For example, to configure a model with playground input and custom vLLM settings, use the following:
model_metadata:
  example_model_input: {"prompt": "What is the meaning of life?"}
  vllm_config:
    tensor_parallel_size: 1
    max_model_len: 4096
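Because the config is passed to your model's constructor (the same kwargs pattern shown for data_dir), custom keys like vllm_config can be read at runtime. A sketch:

```python
# model/model.py — read custom metadata from the config (sketch;
# follows the kwargs pattern shown for data_dir).
class Model:
    def __init__(self, **kwargs):
        metadata = kwargs["config"]["model_metadata"]
        # "vllm_config" here is the custom key from the example above.
        self._vllm_config = metadata.get("vllm_config", {})
```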
requirements_file
string
Path to a dependency file. Supports requirements.txt, pyproject.toml, and uv.lock; Truss detects the format by filename. Pin versions for reproducibility. When set to a pyproject.toml, Truss installs packages from [project.dependencies]. When set to a uv.lock, a sibling pyproject.toml must exist in the same directory.
requirements_file: ./requirements.txt
requirements_file: ./pyproject.toml
requirements_file: ./uv.lock
requirements
string[]
A list of Python dependencies in pip requirements file format. Mutually exclusive with requirements_file; only one can be specified. For example, to install pinned versions of the dependencies, use the following:
requirements:
  - scikit-learn==1.0.2
  - threadpoolctl==3.0.0
  - joblib==1.1.0
  - numpy==1.20.3
  - scipy==1.7.3
system_packages
string[]
System packages that you would typically install using apt on a Debian operating system.
system_packages:
  - ffmpeg
  - libsm6
  - libxext6
python_version
string
default:"py39"
The Python version to use. Supported versions:
  • py39
  • py310
  • py311
  • py312
  • py313
  • py314
secrets
object
Declare secrets your model needs at runtime, such as API keys or access tokens. Store the actual values in your organization settings.
Never store actual secret values in config. Use null as a placeholder. The key name must match the secret name in your organization.
secrets:
  hf_access_token: null
For more information, see Secrets.
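At runtime, declared secrets resolve to the actual values stored in your organization settings and are passed to your model. A sketch, following the kwargs pattern shown for data_dir:

```python
# model/model.py — read a declared secret at runtime (sketch)
class Model:
    def __init__(self, **kwargs):
        # The key must match the secret name declared in config.yaml
        # and stored in your organization settings.
        self._hf_token = kwargs["secrets"]["hf_access_token"]
```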
examples_filename
string
default:"examples.yaml"
The path to a file containing example inputs for your model.
live_reload
boolean
default:"false"
If true, changes to your model code are automatically reloaded without restarting the server. Useful for development.
apply_library_patches
boolean
default:"true"
Whether to apply library patches for improved compatibility.

resources

The resources section specifies the compute resources that your model needs, including CPU, memory, and GPU resources. You can configure resources in two ways.
Option 1: Specify individual resource fields
resources:
  accelerator: A10G
  cpu: "4"
  memory: 20Gi
Baseten provisions the smallest instance that meets the specified constraints.
Option 2: Specify an exact instance type
resources:
  instance_type: "A10G:4x16"
Using instance_type lets you select an exact SKU from the instance type reference. When instance_type is specified, other resource fields are ignored.
cpu
string
default:"1"
CPU resources needed, expressed as either a raw number or “millicpus”. For example, 1000m and 1 are equivalent. Fractional CPU amounts can be requested using millicpus. For example, 500m is half of a CPU core.
memory
string
default:"2Gi"
CPU RAM needed, expressed as a number with units. Units include “Gi” (Gibibytes), “G” (Gigabytes), “Mi” (Mebibytes), and “M” (Megabytes). For example, 1Gi and 1024Mi are equivalent.
Gi in resources.memory refers to Gibibytes, which are slightly larger than Gigabytes.
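The unit rules for cpu and memory can be sketched with a small converter (illustrative only; the parse_cpu and parse_memory helpers are not part of Truss):

```python
def parse_cpu(value: str) -> float:
    # "1000m" (millicpus) and "1" are equivalent.
    if value.endswith("m"):
        return int(value[:-1]) / 1000
    return float(value)

def parse_memory(value: str) -> int:
    # Binary units (Gi, Mi) vs decimal units (G, M), in bytes.
    units = {"Gi": 1024**3, "Mi": 1024**2, "G": 1000**3, "M": 1000**2}
    for suffix, factor in units.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)
```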
accelerator
string
The GPU type for your instance. To request multiple GPUs (for example, if the weights don’t fit in a single GPU), use the : operator:
resources:
  accelerator: L4:4 # Requests 4 L4s
For more information, see how to Manage resources.
instance_type
string
The full SKU name for the instance type. When specified, cpu, memory, and accelerator fields are ignored. Use this field to select an exact instance type from the instance type reference. The format is <GPU_TYPE>:<vCPU>x<MEMORY> for GPU instances or CPU:<vCPU>x<MEMORY> for CPU-only instances.
resources:
  instance_type: "L4:4x16"
Examples:
  • L4:4x16: L4 GPU with 4 vCPUs and 16 GiB RAM.
  • H100:8x80: H100 GPU with 8 vCPUs and 80 GiB RAM (the exact specs vary by GPU type).
  • CPU:4x16: CPU-only instance with 4 vCPUs and 16 GiB RAM.
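The SKU format can be sketched with a small parser (parse_instance_type is illustrative, not a Truss API):

```python
def parse_instance_type(sku: str) -> tuple[str, int, int]:
    # "<GPU_TYPE>:<vCPU>x<MEMORY>", e.g. "L4:4x16" or "CPU:4x16"
    gpu_type, spec = sku.split(":")
    vcpu, memory_gib = spec.split("x")
    return gpu_type, int(vcpu), int(memory_gib)
```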
node_count
number
The number of nodes for multi-node deployments. Each node gets the specified resources.

runtime

Runtime settings for your model instance. For example, to configure a high-throughput inference server with concurrency and health checks, use the following:
runtime:
  predict_concurrency: 256
  streaming_read_timeout: 120
  health_checks:
    restart_threshold_seconds: 600
    stop_traffic_threshold_seconds: 300
predict_concurrency
number
default:"1"
The number of concurrent requests that can run in your model’s predict method. Defaults to 1, meaning predict runs one request at a time. Increase this if your model supports parallelism. See Autoscaling for more detail.
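Concurrency helps most when predict is I/O-bound. A minimal sketch of an async predict that can overlap requests, assuming your model awaits I/O:

```python
import asyncio

class Model:
    # With predict_concurrency > 1, several requests can be inside
    # predict at once. Concurrency only pays off when predict awaits
    # I/O (or releases the GIL), as in this stand-in downstream call.
    async def predict(self, model_input):
        await asyncio.sleep(0.01)  # stand-in for an async model/API call
        return {"echo": model_input}
```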
streaming_read_timeout
number
default:"60"
The timeout in seconds for streaming read operations.
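This applies to streamed responses, such as a predict method that returns a generator. A minimal sketch:

```python
from typing import Iterator

class Model:
    # Returning a generator from predict streams the response chunk by
    # chunk; each chunk must be produced within streaming_read_timeout.
    def predict(self, model_input) -> Iterator[str]:
        for token in ["Hello", ", ", "world"]:
            yield token
```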
enable_tracing_data
boolean
default:"false"
If true, enables trace data export with built-in OTEL instrumentation. By default, data is collected internally by Baseten for troubleshooting. You can also export to your own systems. See the tracing guide. May add performance overhead.
enable_debug_logs
boolean
default:"false"
If true, sets the Truss server log level to DEBUG instead of INFO.
transport
object
The transport protocol for your model. Supports http (default), websocket, and grpc.
runtime:
  transport:
    kind: websocket
    ping_interval_seconds: 30
    ping_timeout_seconds: 10
health_checks
object
Custom health check configuration for your deployments. For details, see health check configuration.
runtime:
  health_checks:
    startup_threshold_seconds: 2400
    restart_threshold_seconds: 600
    stop_traffic_threshold_seconds: 300
startup_threshold_seconds
number
How long the startup phase runs before marking the replica as unhealthy. During startup, readiness and liveness probes don’t run. Values must be between 10 and 3000 seconds. Defaults to 30 minutes (1800 seconds). See health checks for details.
stop_traffic_threshold_seconds
number
How long health checks must continuously fail before Baseten stops traffic to the replica. Defaults to 30 minutes (1800 seconds).
restart_threshold_seconds
number
How long health checks must continuously fail before Baseten restarts the replica. Defaults to 30 minutes (1800 seconds).
restart_check_delay_seconds
number
deprecated
How long to wait before running health checks. Deprecated. Use startup_threshold_seconds instead.

base_image

Use base_image to deploy a custom Docker image. This is useful for running scripts at build time or installing complex dependencies. For more information, see Deploy custom Docker images. For example, to use the vLLM Docker image as your base, use the following:
base_image:
  image: vllm/vllm-openai:v0.7.3
  python_executable_path: /usr/bin/python
# ...
image
string
The path to the Docker image, for example:
  • vllm/vllm-openai
  • lmsysorg/sglang
  • nvcr.io/nvidia/nemo:23.03
When using image tags like :latest, Baseten uses a cached copy and may not reflect updates to the image. To pull a specific version, use image digests like your-image@sha256:abc123....
python_executable_path
string
A path to the Python executable on the image, for example /usr/bin/python.
base_image:
  image: vllm/vllm-openai:latest
  python_executable_path: /usr/bin/python
docker_auth
object
Authentication configuration for a private Docker registry.
base_image:
  docker_auth:
    auth_method: GCP_SERVICE_ACCOUNT_JSON
    secret_name: gcp-service-account
    registry: us-west2-docker.pkg.dev
For more information, see Private Docker registries.
auth_method
string
The authentication method for the private registry. Supported values:
  • GCP_SERVICE_ACCOUNT_JSON - authenticate with a GCP service account. Add your service account JSON blob as a Truss secret.
  • AWS_IAM - authenticate with an AWS IAM service account. Add aws_access_key_id and aws_secret_access_key to your Baseten secrets.
  • AWS_OIDC - authenticate using AWS OIDC federation. Requires aws_oidc_role_arn and aws_oidc_region.
  • GCP_OIDC - authenticate using GCP Workload Identity Federation. Requires gcp_oidc_service_account and gcp_oidc_workload_id_provider.
For GCP_SERVICE_ACCOUNT_JSON:
base_image:
  docker_auth:
    auth_method: GCP_SERVICE_ACCOUNT_JSON
    secret_name: gcp-service-account
    registry: us-east4-docker.pkg.dev
For AWS_IAM:
base_image:
  docker_auth:
    auth_method: AWS_IAM
    registry: <aws account id>.dkr.ecr.<region>.amazonaws.com
secrets:
  aws_access_key_id: null
  aws_secret_access_key: null
For AWS_OIDC:
base_image:
  docker_auth:
    auth_method: AWS_OIDC
    registry: <aws account id>.dkr.ecr.<region>.amazonaws.com
    aws_oidc_role_arn: arn:aws:iam::123456789012:role/my-role
    aws_oidc_region: us-east-1
For GCP_OIDC:
base_image:
  docker_auth:
    auth_method: GCP_OIDC
    registry: us-east4-docker.pkg.dev
    gcp_oidc_service_account: my-sa@my-project.iam.gserviceaccount.com
    gcp_oidc_workload_id_provider: projects/123/locations/global/workloadIdentityPools/my-pool/providers/my-provider
secret_name
string
The Truss secret that stores the credential for authentication. Required for GCP_SERVICE_ACCOUNT_JSON. Ensure this secret is added to the secrets section.
registry
string
The registry to authenticate to, for example us-east4-docker.pkg.dev.
aws_access_key_id_secret_name
string
default:"aws_access_key_id"
The secret name for the AWS access key ID. Only used with AWS_IAM auth method.
aws_secret_access_key_secret_name
string
default:"aws_secret_access_key"
The secret name for the AWS secret access key. Only used with AWS_IAM auth method.

docker_server

Use docker_server to deploy a custom Docker image that has its own HTTP server, without writing a Model class. This is useful for deploying inference servers like vLLM or SGLang that provide their own endpoints. See Deploy custom Docker images for usage details. For example, to deploy vLLM serving Qwen 2.5 3B, use the following:
base_image:
  image: vllm/vllm-openai:v0.7.3
docker_server:
  start_command: vllm serve Qwen/Qwen2.5-3B-Instruct --enable-prefix-caching
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/completions
  server_port: 8000
# ...
start_command
string
The command to start the server. Required when no_build is not set or is false. When no_build is true, start_command is optional; if omitted, the image’s original ENTRYPOINT runs.
server_port
number
required
The port where the server runs. Port 8080 is reserved by Baseten’s internal reverse proxy and cannot be used.
predict_endpoint
string
required
The endpoint for inference requests. This is mapped to Baseten’s /predict route.
readiness_endpoint
string
required
The endpoint for readiness probes. Determines when the container can accept traffic.
liveness_endpoint
string
required
The endpoint for liveness probes. Determines if the container needs to be restarted.
run_as_user_id
number
The Linux UID to run the server process as inside the container. Use this when your base image expects a specific non-root user (for example, NVIDIA NIM containers). The specified UID must already exist in the base image. Values 0 (root) and 60000 (platform default) are not allowed. Baseten automatically sets ownership of /app, /workspace, the packages directory, and $HOME to this UID. If your server writes to other directories, ensure they are writable by this UID in your base image or via build_commands.
no_build
boolean
Skip the build step and deploy the base image as-is. Baseten copies the image to its container registry without running docker build or modifying the image in any way. Only available for custom server deployments that use docker_server. When no_build is true:
  • start_command is optional. If omitted, the image’s original ENTRYPOINT runs.
  • Environment variables and secrets are available.
  • Development mode is not supported. Deploy with truss push (published deployments are the default).
Use this for security-hardened images (for example, Chainguard) that must remain unmodified. Contact support to enable no-build deployments for your organization.
config.yaml
base_image:
  image: your-registry/your-hardened-image:latest
docker_server:
  no_build: true
  server_port: 8000
  predict_endpoint: /predict
  readiness_endpoint: /health
  liveness_endpoint: /health
See No-build deployment for usage details.
The /app directory is reserved by Baseten. By default, /app, /workspace, and /tmp are writable in the container. If you need other directories to be writable, use run_as_user_id or build_commands to set permissions.

external_data

Use external_data to download remote files into your image at build time. This reduces cold-start time by making data available without downloading it at runtime. Each entry specifies a URL to fetch and a path relative to the data directory where the file is stored.
external_data:
  - url: https://my-bucket.s3.amazonaws.com/my-data.tar.gz
    local_data_path: my-data.tar.gz
url
string
required
The URL to download data from.
local_data_path
string
required
Path relative to the data directory where the downloaded file is stored. For example, my-data.tar.gz is stored at /app/data/my-data.tar.gz.
name
string
An optional name for the download entry.
backend
string
default:"http_public"
The download backend to use.

build_commands

build_commands
string[]
A list of shell commands to run during Docker build. These commands execute after system packages and Python requirements are installed. Use them for any setup that can’t be handled by requirements or system_packages alone. For example, to clone a GitHub repository into the container, use the following:
build_commands:
  - git clone https://github.com/comfyanonymous/ComfyUI.git
You can also combine build_commands with docker_server to deploy third-party inference servers. The following example installs Ollama at build time and runs it as a Docker server:
model_name: ollama-tinyllama
base_image:
  image: python:3.11-slim
build_commands:
  - curl -fsSL https://ollama.com/install.sh | sh
docker_server:
  start_command: sh -c "ollama serve & sleep 5 && ollama pull tinyllama && wait"
  readiness_endpoint: /api/tags
  liveness_endpoint: /api/tags
  predict_endpoint: /api/generate
  server_port: 11434
resources:
  cpu: "4"
  memory: 8Gi
For more information, see Build commands.

build

The build section handles secret access during Docker builds. It supports the following fields:
secret_to_path_mapping
object
Grants access to secrets during the build. Provide a mapping between a secret and a path on the image. You can then access the secret in commands specified in build_commands by running cat on the file. For example, to install a pip package from a private GitHub repository, use the following:
build_commands:
  - pip install git+https://$(cat /root/my-github-access-token)@github.com/path/to-private-repo.git
build:
  secret_to_path_mapping:
    my-github-access-token: /root/my-github-access-token
secrets:
  my-github-access-token: null
Under the hood, this option mounts your secret as a Docker build secret; the value is not exposed in your image history or logs.

weights Preview

Use weights to configure Baseten Delivery Network (BDN) for model weight delivery with multi-tier caching. This is the recommended approach for optimizing cold starts.
weights:
  - source: "hf://meta-llama/Llama-3.1-8B@main"
    mount_location: "/models/llama"
    allow_patterns: ["*.safetensors", "config.json"]
weights replaces the deprecated model_cache configuration. Use truss migrate to automatically convert your configuration.
source
string
required
URI specifying where to fetch weights from. Supported schemes:
  • hf://: Hugging Face Hub, for example hf://meta-llama/Llama-3.1-8B@main
  • s3://: AWS S3, for example s3://my-bucket/models/weights
  • gs://: Google Cloud Storage, for example gs://my-bucket/models/weights
  • r2://: Cloudflare R2, for example r2://account_id.bucket/path
mount_location
string
required
Absolute path where weights will be mounted in your container. Must start with /.
auth_secret_name
string
Name of a Baseten secret containing credentials for private weight sources.
auth
object
Authentication configuration for accessing private weight sources. Required for OIDC-based authentication. Supported auth_method values:
  • CUSTOM_SECRET: use a Baseten secret (specify auth_secret_name).
  • AWS_OIDC: use AWS OIDC federation (requires aws_oidc_role_arn and aws_oidc_region).
  • GCP_OIDC: use GCP Workload Identity Federation (requires gcp_oidc_service_account and gcp_oidc_workload_id_provider).
For AWS OIDC:
weights:
  - source: "s3://my-bucket/models/weights"
    mount_location: "/models/weights"
    auth:
      auth_method: AWS_OIDC
      aws_oidc_role_arn: arn:aws:iam::123456789012:role/my-role
      aws_oidc_region: us-east-1
For GCP OIDC:
weights:
  - source: "gs://my-bucket/models/weights"
    mount_location: "/models/weights"
    auth:
      auth_method: GCP_OIDC
      gcp_oidc_service_account: my-sa@my-project.iam.gserviceaccount.com
      gcp_oidc_workload_id_provider: projects/123/locations/global/workloadIdentityPools/my-pool/providers/my-provider
allow_patterns
string[]
File patterns to include. Uses fnmatch-style wildcards. Patterns like *.safetensors only match at the root level; use **/*.safetensors for recursive matching across subdirectories.
ignore_patterns
string[]
File patterns to exclude. Uses fnmatch-style wildcards. Patterns like *.bin only match at the root level; use **/*.bin for recursive matching across subdirectories.
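The pattern semantics for allow_patterns and ignore_patterns can be illustrated with a toy matcher for single-segment patterns (toy_match is an illustration of the described behavior, not BDN's implementation):

```python
import fnmatch

def toy_match(path: str, pattern: str) -> bool:
    # Illustrates the described semantics for simple patterns:
    # '*' does not cross '/', while a '**/' prefix matches any depth.
    if pattern.startswith("**/"):
        return fnmatch.fnmatch(path.rsplit("/", 1)[-1], pattern[3:])
    return "/" not in path and fnmatch.fnmatch(path, pattern)
```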
For full documentation, see Baseten Delivery Network (BDN).

model_cache Deprecated

model_cache is deprecated. Use weights instead for faster cold starts through multi-tier caching.
Use model_cache to bundle model weights into your image at build time, reducing cold start latency. For example, to cache Llama 2 7B weights from Hugging Face, use the following:
model_cache:
  - repo_id: NousResearch/Llama-2-7b-chat-hf
    revision: main
    ignore_patterns:
      - "*.bin"
    use_volume: true
    volume_folder: llama-2-7b-chat-hf
Despite the name, model_cache supports multiple backends, not just Hugging Face. You can also cache weights stored on GCS, S3, or Azure.
repo_id
string
required
The source path for your model weights. For example, to cache weights from a Hugging Face repo, use the following:
model_cache:
  - repo_id: madebyollin/sdxl-vae-fp16-fix
Or you can cache weights from buckets like GCS or S3, using the following options:
model_cache:
  - repo_id: gcs://path-to-my-bucket
    kind: gcs
  - repo_id: s3://path-to-my-bucket
    kind: s3
kind
string
default:"hf"
The source kind for the model cache. Supported values: hf (Hugging Face), gcs, s3, azure.
revision
string
The revision of your Hugging Face repo. Required when use_volume is true for Hugging Face repos.
use_volume
boolean
required
If true, caches model artifacts outside the container image. Recommended: true.
volume_folder
string
The location of the mounted folder. Required when use_volume is true. For example, volume_folder: myrepo makes the model available under /app/model_cache/myrepo at runtime.
allow_patterns
string[]
File patterns to include in the cache. Uses Unix shell-style wildcards. By default, all paths are included.
ignore_patterns
string[]
File patterns to ignore, streamlining the caching process. Use Unix shell-style wildcards. Example: ["*.onnx", "Readme.md"]. By default, nothing is ignored.
runtime_secret_name
string
default:"hf_access_token"
The secret name to use for runtime authentication, for example when accessing private Hugging Face repos.

trt_llm

Configure TensorRT-LLM for optimized LLM inference on Baseten. TRT-LLM supports two inference stacks:
  • v1: Best for dense models, small models, and embedding models. Supports lookahead speculative decoding and LoRA adapters.
  • v2: Best for MoE models (Qwen3-MoE, DeepSeek, Kimi) and multi-node setups.
config.yaml
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      source: HF
      repo: meta-llama/Llama-3.1-8B-Instruct
    quantization_type: fp8
  runtime:
    max_batch_size: 256
    max_num_tokens: 8192
    tensor_parallel_size: 1
resources:
  accelerator: H100
inference_stack
string
default:"v1"
The inference stack version to use. Supported values:
  • v1: Use for dense models, small models, and embedding/reranking models. Supports lookahead speculative decoding and LoRA adapters.
  • v2: Use for MoE models and multi-node setups. The v2 runtime manages build parameters automatically; only checkpoint_repository, quantization_type, and num_builder_gpus can be set under build.

build

Build-time configuration for TRT-LLM engine compilation.
base_model
string
default:"decoder"
The model architecture type. Supported values:
  • decoder: For generative causal LLMs (Llama, Qwen, Mistral, DeepSeek). Auto-detects architecture from the checkpoint.
  • encoder: For causal embedding models. Optimized for throughput with models like Qwen3-8B for embeddings.
  • encoder_bert: For BERT-based models (classification, reranking, embeddings). Optimized for throughput and cold-start latency of models under 4B parameters.
checkpoint_repository
object
required
The model checkpoint to compile. See checkpoint_repository for sub-fields.
quantization_type
string
default:"no_quant"
The quantization method for the model weights. Use no_quant for fp16/bf16 (uses the precision from the model’s config.json). Supported values:
  • no_quant: No quantization (fp16 or bf16).
  • fp8: FP8 weights with 16-bit KV cache.
  • fp8_kv: FP8 weights with FP8 KV cache. Faster attention with FP8 context FMHA. Not compatible with models that use bias=True (for example, Qwen 2.5).
  • fp4: FP4 weights with 16-bit KV cache. Requires B200 or newer GPUs.
  • fp4_kv: FP4 weights with FP8 KV cache. Requires B200 or newer GPUs.
  • fp4_mlp_only: FP4 quantization applied only to MLP layers, with 16-bit KV cache. Requires B200 or newer GPUs.
tensor_parallel_count
number
default:"1"
Number of GPUs for tensor parallelism. Must equal the number of GPUs in your resources.accelerator setting for v1.
max_seq_len
number
Maximum sequence length the engine supports. Automatically inferred from the model checkpoint when not set. For encoder models, this is inferred from max_position_embeddings in the model’s config.
max_batch_size
number
default:"256"
Maximum number of requests batched together in one forward pass. Range: 1 to 2048.
max_num_tokens
number
default:"8192"
Maximum number of tokens batched together in one forward pass. For encoder models and generative models without chunked prefill, this limits the max context length. Range: 65 to 1048576.
num_builder_gpus
number
Number of GPUs to use during engine compilation. Set this higher than the deployment GPU count if quantization causes out-of-memory errors during the build step. If you run out of CPU memory, add more memory in the resources section instead.
lora_adapters
object
A mapping of LoRA adapter names to checkpoint repositories. Each key becomes the model name in OpenAI-compatible API requests. Only supported on the v1 inference stack.
trt_llm:
  build:
    lora_adapters:
      my-adapter:
        source: HF
        repo: my-org/my-lora-adapter
    lora_configuration:
      max_lora_rank: 64
lora_configuration
object
LoRA configuration. See lora_configuration for sub-fields. Only supported on the v1 inference stack.
speculator
object
Speculative decoding configuration. See speculator for sub-fields. Only supported on the v1 inference stack.
moe_expert_parallel_option
number
default:"-1"
Expert parallelism setting for MoE models. Set to -1 to let the runtime decide. When set explicitly, must be a positive number less than or equal to tensor_parallel_count, and tensor_parallel_count should be divisible by this value for optimal performance.

checkpoint_repository

The model checkpoint to compile. Specifies the source, repository path, and optional credentials.
source
string
required
Where to fetch the checkpoint from. Supported values:
  • HF: Hugging Face Hub.
  • S3: AWS S3 bucket (for example, s3://my-bucket/path/to/checkpoint).
  • GCS: Google Cloud Storage bucket (for example, gcs://my-bucket/path/to/checkpoint).
  • AZURE: Azure Blob Storage.
  • REMOTE_URL: HTTP URL to a tar.gzip archive (for example, a presigned URL).
  • BASETEN_TRAINING: Deploy from a Baseten training job. Use the training job ID as repo and the run revision as revision.
repo
string
required
The repository path. For HF, this is the Hugging Face repo ID (for example, meta-llama/Llama-3.1-8B-Instruct). For S3/GCS/AZURE, this is the bucket path. The checkpoint must contain config.json and model files in safetensors format.
revision
string
The revision or version of the checkpoint. For HF sources, this is the branch, tag, or commit hash. Required for BASETEN_TRAINING sources.
runtime_secret_name
string
default:"hf_access_token"
The name of the Baseten secret that stores the access credential. Must match a key in your organization’s secret settings.

quantization_config

Calibration settings for quantized models. Only relevant when quantization_type is not no_quant.
calib_size
number
default:"1024"
Size of the calibration dataset. Must be a multiple of 64, between 64 and 16384. Increase for production runs (for example, 1536) or decrease for quick testing (for example, 256).
calib_dataset
string
default:"abisee/cnn_dailymail"
Hugging Face dataset to use for calibration. Uses the train split and quantizes based on the text column.
calib_max_seq_length
number
default:"1536"
Maximum sequence length for calibration samples. Must be a multiple of 64, between 64 and 16384.

runtime (v1)

Runtime configuration for the v1 inference stack.
trt_llm:
  inference_stack: v1
  runtime:
    kv_cache_free_gpu_mem_fraction: 0.9
    enable_chunked_context: true
    batch_scheduler_policy: guaranteed_no_evict
    total_token_limit: 500000
  # ...
kv_cache_free_gpu_mem_fraction
number
default:"0.9"
Fraction of free GPU memory to allocate for the KV cache. Higher values allow more context but leave less room for other operations.
kv_cache_host_memory_bytes
number
Bytes of host (CPU) memory to reserve for KV cache offloading. Set to a high value to enable KV cache offload to host memory when GPU memory is constrained.
enable_chunked_context
boolean
default:"true"
Whether to process long contexts in chunks. Requires paged_kv_cache and use_paged_context_fmha to be enabled in the build plugin configuration.
batch_scheduler_policy
string
default:"guaranteed_no_evict"
The batch scheduling strategy. Supported values:
  • guaranteed_no_evict: Guarantees scheduling with the requested number of tokens. May queue requests if memory is insufficient. Recommended for most use cases.
  • max_utilization: Schedules requests without checking available memory. May need to pause requests if memory fills up.
request_default_max_tokens
number
Default maximum number of tokens per request when not specified by the client.
served_model_name
string
The model name returned in OpenAI-compatible API responses. Only for generative (decoder) models.
total_token_limit
number
default:"500000"
Maximum number of tokens scheduled at once to the C++ engine. Only for generative (decoder) models.
webserver_default_route
string
Default API route for the model. Auto-detected from the model architecture for encoder models. Supported values:
  • /v1/embeddings: For embedding models.
  • /rerank: For reranking models.
  • /predict: For sequence classification models.

runtime (v2)

Runtime configuration for the v2 inference stack.
trt_llm:
  inference_stack: v2
  runtime:
    max_batch_size: 256
    max_num_tokens: 8192
    tensor_parallel_size: 1
  # ...
max_seq_len
number
Maximum sequence length. Range: 1 to 1048576.
max_batch_size
number
default:"256"
Maximum number of requests batched together in one forward pass. Range: 1 to 2048.
max_num_tokens
number
default:"8192"
Maximum number of tokens batched together in one forward pass. Range: 65 to 131072.
tensor_parallel_size
number
default:"1"
Number of GPUs for tensor parallelism.
enable_chunked_prefill
boolean
default:"true"
Whether to enable chunked prefill for generative (decoder) models.
served_model_name
string
The model name returned in OpenAI-compatible API responses. Only for generative (decoder) models.

speculator

Configure speculative decoding to speed up inference by predicting multiple tokens per step. Only supported on the v1 inference stack.
trt_llm:
  build:
    speculator:
      speculative_decoding_mode: LOOKAHEAD_DECODING
      lookahead_windows_size: 7
      lookahead_ngram_size: 5
      lookahead_verification_set_size: 3
    max_batch_size: 64
    # ...
Speculative decoding works best at lower batch sizes (under 64). For high-throughput use cases, tune concurrency settings for more aggressive autoscaling instead.
speculative_decoding_mode
string
The speculative decoding strategy. Supported values:
  • LOOKAHEAD_DECODING: N-gram based speculation built into the runtime. Recommended for most use cases, especially code editing workloads where n-gram patterns are common.
lookahead_windows_size
number
Lookahead window size for the LOOKAHEAD_DECODING mode. Required when using lookahead decoding. Recommended values: 5 to 8.
lookahead_ngram_size
number
N-gram size for the LOOKAHEAD_DECODING mode. Required when using lookahead decoding. Recommended values: 3 to 5.
lookahead_verification_set_size
number
Verification set size for the LOOKAHEAD_DECODING mode. Required when using lookahead decoding. Recommended values: 3 to 5.
num_draft_tokens
number
Maximum number of speculative tokens per step. Auto-calculated from the lookahead parameters when using LOOKAHEAD_DECODING. Maximum: 2048.
enable_b10_lookahead
boolean
default:"false"
Enable the Baseten-optimized lookahead algorithm. Requires speculative_decoding_mode to be LOOKAHEAD_DECODING. Settings of the form (window_size, 1, 1), for example (8, 1, 1) or (32, 1, 1), enable dynamic speculation.

lora_configuration

LoRA adapter settings for the v1 inference stack. Use with lora_adapters to serve multiple fine-tuned models from a single deployment.
max_lora_rank
number
default:"64"
Maximum LoRA rank across all adapters.
lora_target_modules
string[]
List of model modules to apply LoRA to.

training_checkpoints

Configuration for deploying models from training checkpoints. For example, to deploy a model using checkpoints from a training job, use the following:
training_checkpoints:
  download_folder: /tmp/training_checkpoints
  artifact_references:
    - training_job_id: tr_abc123
      paths:
        - "checkpoint-*"
download_folder
string
default:"/tmp/training_checkpoints"
The folder to download the checkpoints to.
artifact_references
object[]
A list of artifact references to download.
training_job_id
string
required
The training job ID that the artifact reference belongs to.
paths
string[]
The paths of the files to download, which can contain * or ? wildcards.
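A quick way to sanity-check these patterns locally is Python's fnmatch module, which implements the same shell-style * and ? wildcards (shown here purely for illustration):

```python
import fnmatch

# Shell-style matching: "checkpoint-*" selects every checkpoint folder.
paths = ["checkpoint-100", "checkpoint-200", "logs/train.log"]
selected = [p for p in paths if fnmatch.fnmatch(p, "checkpoint-*")]
```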