Chains is a framework for building robust, performant multi-step and multi-model
inference pipelines and deploying them to production. It addresses the common
challenges of managing latency, cost and dependencies for complex workflows,
while leveraging Truss’ existing battle-tested performance, reliability and
developer toolkit.
Some models are actually pipelines (e.g., invoking an LLM involves sequentially
tokenizing the input, predicting the next token, and then decoding the predicted
tokens). These pipelines generally make sense to bundle together in a monolithic
deployment because they have the same dependencies, require the same compute
resources, and have a robust ecosystem of tooling to improve efficiency and
performance in a single deployment.
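To make that concrete, here is a minimal sketch of those three steps bundled in one process, using Hugging Face transformers (the model name is just an illustrative choice):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# All three steps share the same dependencies and the same compute,
# so a single monolithic deployment makes sense here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Hello, my name is", return_tensors="pt")  # 1. tokenize the input
output_ids = model.generate(**inputs, max_new_tokens=20)  # 2. predict next tokens
text = tokenizer.decode(output_ids[0], skip_special_tokens=True)  # 3. decode tokens
print(text)
```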
Many other pipelines and systems do not share these properties. Some examples
include:
- Running multiple different models in sequence.
- Chunking/partitioning a set of files and concatenating/organizing results.
- Pulling inputs from or saving outputs to a database or vector store.
Each step in these workflows has different hardware requirements, software
dependencies, and scaling needs so it doesn’t make sense to bundle them in a
monolithic deployment. That’s where Chains comes in!
Chains exists to help you build multi-step, multi-model pipelines. The
abstractions that Chains introduces are based on six opinionated principles:
three for architecture and three for developer experience.

Architecture principles

1. Atomic components: Each step in the pipeline can set its own hardware requirements and software dependencies, separating GPU and CPU workloads (see the sketch after this list).
2. Modular scaling: Each component has independent autoscaling parameters for targeted resource allocation, removing bottlenecks from your pipelines.
3. Maximum composability: Components specify a single public interface for flexible-but-safe composition and are reusable between projects.
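For example, a GPU-bound step can declare its own image and compute while the rest of the Chain stays lightweight. A minimal sketch; the exact `RemoteConfig`/`DockerImage`/`Compute` fields are assumptions based on the Chains configuration API and may differ between versions:

```python
import truss_chains as chains


class EmbedChunks(chains.ChainletBase):
    # Hardware and dependencies are declared per Chainlet, not per pipeline.
    remote_config = chains.RemoteConfig(
        docker_image=chains.DockerImage(
            pip_requirements=["sentence-transformers"],
        ),
        compute=chains.Compute(cpu_count=4, memory="16Gi", gpu="A10G"),
    )

    async def run_remote(self, chunks: list[str]) -> list[list[float]]:
        ...  # GPU embedding work goes here; other Chainlets can stay CPU-only.
```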
Developer experience principles

4. Type safety and validation: Eliminate entire taxonomies of bugs by writing typed Python code and validating inputs, outputs, module initializations, function signatures, and even remote server configurations.
5. Local debugging: Seamless local testing and cloud deployments: test Chains locally with support for mocking the output of any step (see the mocking sketch after this list), and simplify your cloud deployment loop by separating large model deployments from quick updates to glue code.
6. Incremental adoption: Use Chains to orchestrate existing model deployments, like pre-packaged models from Baseten's model library, alongside new model pipelines built entirely within Chains.
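As an illustration of local debugging, here is a minimal sketch that swaps a mock in for a dependency; it uses the `HelloAll` entrypoint from the example below, assumes the `chains.run_local` context manager, and the `MockSayHello` class and import path are hypothetical:

```python
import asyncio

import truss_chains as chains
from hello import HelloAll  # hypothetical import of the example Chain below


class MockSayHello:
    # Stands in for the real Chainlet; only the public interface matters.
    async def run_remote(self, name: str) -> str:
        return f"Hello, {name} (mocked)"


if __name__ == "__main__":
    with chains.run_local():
        chain = HelloAll(say_hello_chainlet=MockSayHello())
        print(asyncio.run(chain.run_remote(["Alice", "Bob"])))
```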
Here’s a simple Chain that says “hello” to each person in a list of provided
names:
hello_chain/hello.py
```python
import asyncio

import truss_chains as chains


# This Chainlet does the work.
class SayHello(chains.ChainletBase):
    async def run_remote(self, name: str) -> str:
        return f"Hello, {name}"


# This Chainlet orchestrates the work.
@chains.mark_entrypoint
class HelloAll(chains.ChainletBase):
    def __init__(self, say_hello_chainlet=chains.depends(SayHello)) -> None:
        self._say_hello = say_hello_chainlet

    async def run_remote(self, names: list[str]) -> str:
        tasks = []
        for name in names:
            tasks.append(asyncio.ensure_future(self._say_hello.run_remote(name)))
        return "\n".join(await asyncio.gather(*tasks))
```
This is a toy example, but it shows how Chains can be used to separate
preprocessing steps like chunking from workload execution steps. If SayHello
were an LLM instead of a simple string template, we could do a much more complex
action for each person on the list.
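For instance, SayHello could be replaced by a Chainlet that calls an LLM behind the same interface. A hedged sketch using the OpenAI client; the model name and prompt are illustrative, not part of the example above:

```python
import truss_chains as chains
from openai import AsyncOpenAI


class SayHelloLLM(chains.ChainletBase):
    def __init__(self) -> None:
        self._client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

    async def run_remote(self, name: str) -> str:
        # Same public interface as SayHello, so HelloAll only needs to change
        # its dependency to chains.depends(SayHelloLLM).
        response = await self._client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[{"role": "user", "content": f"Write a greeting for {name}."}],
        )
        return response.choices[0].message.content
```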
Retrieval-augmented generation (RAG)
Connect to a vector database and augment LLM results with additional
context information without introducing overhead to the model inference
step. Try it yourself: RAG Chain.
Chunked audio transcription and high-throughput pipelines
Transcribe large audio files by splitting them into smaller chunks and
processing them in parallel — we’ve used this approach to process 10-hour
files in minutes. Try it yourself: Audio Transcription Chain.
Efficient multi-model pipelines
Build powerful experiences with optimal scaling at each step, like:
AI phone calling (transcription + LLM + speech synthesis)