Out of the box, Truss limits the number of concurrent predictions that happen on a single container. This ensures that the CPU, and for many models the GPU, do not get overloaded, and that the model can continue to respond to requests during periods of high load.

However, many models, in addition to having compute components, also have IO requirements. For example, a model that classifies images may need to download the image from a URL before it can classify it.

Truss provides a way to separate the IO component from the compute component, ensuring that IO does not block utilization of the compute on your pod.

To do this, you can use the preprocess and postprocess methods on a Truss model. These methods can be defined like this:

class Model:
    def __init__(self, **kwargs): ...

    def load(self, **kwargs) -> None: ...

    def preprocess(self, request):
        # Include any IO logic that happens _before_ predict here
        ...

    def predict(self, request):
        # Include the actual prediction logic here
        ...

    def postprocess(self, response):
        # Include any IO logic that happens _after_ predict here
        ...
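
As a concrete illustration, here is a rough sketch of the image-classification example from earlier, with the URL download in preprocess and only the model call in predict. The image_url field and the stubbed classifier are assumptions for illustration, not part of the Truss API:

import io

import requests
from PIL import Image


class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self, **kwargs) -> None:
        # Load real model weights here; stubbed for illustration.
        self._model = lambda image: {"label": "cat", "confidence": 0.98}

    def preprocess(self, request):
        # IO-bound work: download the image before predict runs.
        # Assumes the request body contains an "image_url" field.
        response = requests.get(request["image_url"], timeout=30)
        response.raise_for_status()
        image = Image.open(io.BytesIO(response.content))
        return {"image": image}

    def predict(self, request):
        # Compute-bound work only; the image is already in memory.
        return self._model(request["image"])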

When the model is invoked, any logic defined in the preprocess or postprocess methods runs on a separate thread, so it is not subject to the same concurrency limits as predict. So, let's say you have a model that can handle 5 concurrent requests:

...
runtime:
    predict_concurrency: 5
...

If you hit it with 10 requests, they will all begin preprocessing, but when the 6th request is ready to enter the predict method, it will have to wait for one of the first 5 requests to finish. This ensures that the GPU is not overloaded, while the compute logic is never blocked on IO, so you can achieve maximum throughput.
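
To see this in action, you could fire 10 concurrent requests at the model with a client sketch like the one below. The endpoint URL and payload are placeholders; substitute the URL for your own deployment:

import asyncio

import httpx


async def call_model(client: httpx.AsyncClient, i: int):
    # Placeholder endpoint and payload for illustration.
    resp = await client.post(
        "http://localhost:8080/v1/models/model:predict",
        json={"image_url": f"https://example.com/image-{i}.jpg"},
        timeout=None,
    )
    return resp.json()


async def main():
    async with httpx.AsyncClient() as client:
        # All 10 requests begin preprocessing on the server right away,
        # but only 5 at a time are admitted into predict.
        results = await asyncio.gather(*(call_model(client, i) for i in range(10)))
        print(results)


asyncio.run(main())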

When predict returns a generator (e.g. for streaming LLM outputs), the model must not define a postprocess method. Postprocessing can only be used when the prediction result is available as a whole. In the streaming case, move any postprocessing logic into predict or apply it client-side.
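
For reference, a minimal streaming sketch might look like the following; the hard-coded chunks and the uppercase transformation are placeholders standing in for real token streaming and for postprocessing logic folded into predict:

class Model:
    def load(self, **kwargs) -> None: ...

    def predict(self, request):
        # Streaming: return a generator instead of a complete response.
        # There is no postprocess method; any per-chunk transformation
        # (here, a placeholder uppercase step) lives inside predict.
        def stream():
            for chunk in ["hello", " ", "world"]:
                yield chunk.upper()

        return stream()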