Chains is a framework for building robust, performant multi-step and multi-model
inference pipelines and deploying them to production. It addresses the common
challenges of managing latency, cost and dependencies for complex workflows,
while leveraging Truss’ existing battle-tested performance, reliability and
developer toolkit.
Some models are actually pipelines (e.g., invoking an LLM involves sequentially
tokenizing the input, predicting the next token, and then decoding the predicted
tokens). These pipelines generally make sense to bundle together in a monolithic
deployment because they have the same dependencies, require the same compute
resources, and have a robust ecosystem of tooling to improve efficiency and
performance in a single deployment.
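To make that concrete, here is a minimal sketch of those three steps bundled in one process, using Hugging Face transformers (the model name is just an illustrative choice):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# All three steps share the same dependencies and the same compute,
# so a single monolithic deployment makes sense here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Hello, my name is", return_tensors="pt")  # 1. tokenize the input
output_ids = model.generate(**inputs, max_new_tokens=20)  # 2. predict next tokens
text = tokenizer.decode(output_ids[0], skip_special_tokens=True)  # 3. decode tokens
print(text)
```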
Many other pipelines and systems do not share these properties. Some examples
include:
- Running multiple different models in sequence.
- Chunking/partitioning a set of files and concatenating/organizing results.
- Pulling inputs from or saving outputs to a database or vector store.
Each step in these workflows has different hardware requirements, software
dependencies, and scaling needs so it doesn’t make sense to bundle them in a
monolithic deployment. That’s where Chains comes in!
Chains exists to help you build multi-step, multi-model pipelines. The
abstractions that Chains introduces are based on six opinionated principles:
three for architecture and three for developer experience.

Architecture principles

1. Atomic components: Each step in the pipeline can set its own hardware requirements and software dependencies, separating GPU and CPU workloads (see the sketch after this list).
2. Modular scaling: Each component has independent autoscaling parameters for targeted resource allocation, removing bottlenecks from your pipelines.
3. Maximum composability: Components specify a single public interface for flexible-but-safe composition and are reusable between projects.
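For example, a GPU-bound step can declare its own image and compute while the rest of the Chain stays lightweight. A minimal sketch; the exact `RemoteConfig`/`DockerImage`/`Compute` fields are assumptions based on the Chains configuration API and may differ between versions:

```python
import truss_chains as chains


class EmbedChunks(chains.ChainletBase):
    # Hardware and dependencies are declared per Chainlet, not per pipeline.
    remote_config = chains.RemoteConfig(
        docker_image=chains.DockerImage(
            pip_requirements=["sentence-transformers"],
        ),
        compute=chains.Compute(cpu_count=4, memory="16Gi", gpu="A10G"),
    )

    async def run_remote(self, chunks: list[str]) -> list[list[float]]:
        ...  # GPU embedding work goes here; other Chainlets can stay CPU-only.
```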
Developer experience principles

4. Type safety and validation: Eliminate entire taxonomies of bugs by writing typed Python code and validating inputs, outputs, module initializations, function signatures, and even remote server configurations.
5. Local debugging: Seamless local testing and cloud deployments: test Chains locally with support for mocking the output of any step (see the mocking sketch after this list), and simplify your cloud deployment loop by separating large model deployments from quick updates to glue code.
6. Incremental adoption: Use Chains to orchestrate existing model deployments, like pre-packaged models from Baseten's model library, alongside new model pipelines built entirely within Chains.
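As an illustration of local debugging, here is a minimal sketch that swaps a mock in for a dependency; it uses the `HelloAll` entrypoint from the example below, assumes the `chains.run_local` context manager, and the `MockSayHello` class and import path are hypothetical:

```python
import asyncio

import truss_chains as chains
from hello import HelloAll  # hypothetical import of the example Chain below


class MockSayHello:
    # Stands in for the real Chainlet; only the public interface matters.
    async def run_remote(self, name: str) -> str:
        return f"Hello, {name} (mocked)"


if __name__ == "__main__":
    with chains.run_local():
        chain = HelloAll(say_hello_chainlet=MockSayHello())
        print(asyncio.run(chain.run_remote(["Alice", "Bob"])))
```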
Here’s a simple Chain that says “hello” to each person in a list of provided
names:
hello_chain/hello.py
```python
import asyncio

import truss_chains as chains


# This Chainlet does the work.
class SayHello(chains.ChainletBase):
    async def run_remote(self, name: str) -> str:
        return f"Hello, {name}"


# This Chainlet orchestrates the work.
@chains.mark_entrypoint
class HelloAll(chains.ChainletBase):
    def __init__(self, say_hello_chainlet=chains.depends(SayHello)) -> None:
        self._say_hello = say_hello_chainlet

    async def run_remote(self, names: list[str]) -> str:
        tasks = []
        for name in names:
            tasks.append(asyncio.ensure_future(self._say_hello.run_remote(name)))
        return "\n".join(await asyncio.gather(*tasks))
```
This is a toy example, but it shows how Chains can be used to separate
preprocessing steps like chunking from workload execution steps. If SayHello
were an LLM instead of a simple string template, we could do a much more complex
action for each person on the list.
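For instance, SayHello could be replaced by a Chainlet that calls an LLM behind the same interface. A hedged sketch using the OpenAI client; the model name and prompt are illustrative, not part of the example above:

```python
import truss_chains as chains
from openai import AsyncOpenAI


class SayHelloLLM(chains.ChainletBase):
    def __init__(self) -> None:
        self._client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

    async def run_remote(self, name: str) -> str:
        # Same public interface as SayHello, so HelloAll only needs to change
        # its dependency to chains.depends(SayHelloLLM).
        response = await self._client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[{"role": "user", "content": f"Write a greeting for {name}."}],
        )
        return response.choices[0].message.content
```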
Retrieval-augmented generation (RAG)
Connect to a vector database and augment LLM results with additional
context information without introducing overhead to the model inference
step. Try it yourself: RAG Chain.
Chunked audio transcription and high-throughput pipelines
Transcribe large audio files by splitting them into smaller chunks and
processing them in parallel — we’ve used this approach to process 10-hour
files in minutes. Try it yourself: Audio Transcription Chain.
Efficient multi-model pipelines
Build powerful experiences with optimal scaling at each step, like:
AI phone calling (transcription + LLM + speech synthesis)