Prerequisites
To use Chains, install a recent Truss version and ensure pydantic is v2:
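The install command itself isn't reproduced here; a typical invocation (assuming pip) is:

```bash
pip install --upgrade truss 'pydantic>=2.0.0'
```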
Help for setting up a clean development environment

Truss requires Python >=3.8,<3.13. To set up a fresh development environment, you can use the following commands, creating an environment named chains_env using pyenv:
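The original snippet (including the lines added to ~/.bashrc to initialize pyenv) isn't reproduced here. A sketch of the setup, assuming pyenv and the pyenv-virtualenv plugin are already installed, is:

```bash
# Install a compatible Python version and create an isolated environment.
pyenv install 3.11
pyenv virtualenv 3.11 chains_env
pyenv activate chains_env

# Install Chains dependencies inside the environment.
pip install --upgrade truss 'pydantic>=2.0.0'
```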
Overview
Retrieval-augmented generation (RAG) is a multi-model pipeline for generating context-aware answers from LLMs. There are a number of ways to build a RAG system. This tutorial shows a minimum viable implementation with a basic vector store and retrieval function. It’s intended as a starting point to show how Chains helps you flexibly combine model inference and business logic.

In this tutorial, we’ll build a simple RAG pipeline for a hypothetical alumni matching service for a university. The system:

- Takes a bio with information about a new graduate
- Uses a vector database to retrieve semantically similar bios of other alums
- Uses an LLM to explain why the new graduate should meet the selected alums
- Returns the writeup from the LLM
Building the Chain
Create a file rag.py in a new directory with:

- VectorStore, a Chainlet that implements a vector database with a retrieval function.
- LLMClient, a Stub for connecting to a deployed LLM.
- RAG, the entrypoint Chainlet that orchestrates the RAG pipeline and has VectorStore and LLMClient as dependencies.
Vector store Chainlet
A real production RAG system would use a hosted vector database with a massive number of stored embeddings. For this example, we’re using a small local vector store built with chromadb to stand in for a more complex system.
The Chainlet has three parts:

- remote_config, which configures a Docker image on deployment with dependencies.
- __init__(), which runs once when the Chainlet is spun up and creates the vector database with ten sample bios.
- run_remote(), which runs each time the Chainlet is called and is the sole public interface for the Chainlet.
rag/rag.py
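The file contents aren't reproduced here. A minimal sketch of the VectorStore Chainlet, assuming chromadb's in-memory EphemeralClient, placeholder sample bios, and a hypothetical results parameter, could look like this:

```python
# rag/rag.py (start of file)
import chromadb  # also needed locally if you test with chains.run_local()
import truss_chains as chains

# Placeholder sample data; the tutorial uses ten short alumni bios.
DOCUMENTS = [
    "Aaron studied data science and now works on ML infrastructure.",
    "Bella majored in biology and joined a genomics startup.",
    # ... more bios ...
]


class VectorStore(chains.ChainletBase):
    # Installs chromadb into this Chainlet's Docker image on deployment.
    remote_config = chains.RemoteConfig(
        docker_image=chains.DockerImage(pip_requirements=["chromadb"])
    )

    def __init__(self) -> None:
        # Runs once on startup: build a small in-memory collection of bios.
        self._client = chromadb.EphemeralClient()
        self._collection = self._client.create_collection(name="bios")
        self._collection.add(
            documents=DOCUMENTS,
            ids=[str(i) for i in range(len(DOCUMENTS))],
        )

    async def run_remote(self, new_bio: str, results: int = 2) -> list[str]:
        # Sole public interface: return the most similar stored bios.
        query = self._collection.query(query_texts=[new_bio], n_results=results)
        return query["documents"][0]
```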
LLM inference stub
Now that we can retrieve relevant bios from the vector database, we need to pass that information to an LLM to generate our final output. Chains can integrate previously deployed models using a Stub. Like Chainlets, Stubs implement run_remote(), but as a call to the deployed model.
For our LLM, we’ll use Phi-3 Mini Instruct, a small-but-mighty open source LLM.
Deploy Phi-3 Mini Instruct 4k with one-click model deployment from Baseten’s model library.
rag/rag.py
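The Stub code isn't reproduced here. A sketch, continuing the same file and assuming the deployed Phi-3 model accepts an OpenAI-style messages payload and returns its text under an "output" key, could look like this:

```python
# rag/rag.py (continued)
class LLMClient(chains.StubBase):
    async def run_remote(self, new_bio: str, bios: list[str]) -> str:
        # Assemble a prompt from the new graduate's bio and the retrieved bios.
        bios_text = "\n".join(f"- {bio}" for bio in bios)
        prompt = (
            "You are matching a new graduate with alumni mentors.\n\n"
            f"New graduate bio:\n{new_bio}\n\n"
            f"Retrieved alumni bios:\n{bios_text}\n\n"
            "Explain why the new graduate should meet each of these alumni."
        )
        # Call the previously deployed Phi-3 model. The payload and response
        # shapes are assumptions and depend on how the model was packaged.
        response = await self._remote.predict_async(
            json_payload={
                "messages": [{"role": "user", "content": prompt}],
                "stream": False,
                "max_tokens": 512,
            }
        )
        return response["output"]
```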
RAG entrypoint Chainlet
The entrypoint to a Chain is the Chainlet that specifies the public-facing input and output of the Chain and orchestrates calls to dependencies. The __init__() function in this Chainlet takes two new arguments:

- Add dependencies to any Chainlet with chains.depends(). Only Chainlets, not Stubs, need to be added in this fashion.
- Use chains.depends_context() to inject a context object at runtime. This context object is required to initialize the LLMClient stub.
- Visit your Baseten workspace to find the URL of the previously deployed Phi-3 model and insert it as the value for LLM_URL.
rag/rag.py
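The entrypoint code isn't reproduced here. A sketch, continuing the same file and using a placeholder LLM_URL, could look like this:

```python
# rag/rag.py (continued)
# Placeholder: paste the URL of your deployed Phi-3 model from your workspace.
LLM_URL = "https://model-<model-id>.api.baseten.co/production/predict"


@chains.mark_entrypoint
class RAG(chains.ChainletBase):
    def __init__(
        self,
        # Declares VectorStore as a dependency; Chains deploys and wires it up.
        vector_store: VectorStore = chains.depends(VectorStore),
        # Injects the deployment context at runtime, needed to create the Stub.
        context: chains.DeploymentContext = chains.depends_context(),
    ) -> None:
        self._vector_store = vector_store
        self._llm = LLMClient.from_url(LLM_URL, context)

    async def run_remote(self, new_bio: str) -> str:
        # Retrieve similar bios, then ask the LLM to write the matchmaking note.
        bios = await self._vector_store.run_remote(new_bio)
        return await self._llm.run_remote(new_bio, bios)
```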
Testing locally
Because our Chain uses a Stub for the LLM call, we can run the whole Chain locally without any GPU resources. Before running the Chainlet, make sure to set your Baseten API key as an environment variable named BASETEN_API_KEY.
rag/rag.py
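The test harness isn't reproduced here. A sketch that runs the Chain locally from the bottom of rag.py, assuming BASETEN_API_KEY is set in your shell and a sample bio of our own invention, could look like this:

```python
# rag/rag.py (end of file)
if __name__ == "__main__":
    import asyncio

    # Runs all Chainlets in-process; only the Stub makes a network call,
    # authenticated via the BASETEN_API_KEY environment variable.
    with chains.run_local():
        rag = RAG()
        result = asyncio.run(
            rag.run_remote(new_bio="Sam studied computer science and loves robotics.")
        )
        print(result)
```

You can then exercise the Chain locally with `python rag.py`.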
Deploying to production
Once we’re satisfied with our Chain’s local behavior, we can deploy it to production on Baseten. To deploy the Chain, run:
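The deploy command isn't reproduced here; with the Truss CLI it is typically:

```bash
truss chains push rag.py
```

Once the Chain is deployed, it can be called over HTTP. A sketch of call_chain.py, assuming a placeholder Chain URL (use the one printed by the push command) and your API key in BASETEN_API_KEY, could look like this:

```python
# call_chain.py
import os

import requests

# Placeholder: replace with the run_remote URL printed by `truss chains push`.
CHAIN_URL = "https://chain-<chain-id>.api.baseten.co/production/run_remote"

response = requests.post(
    CHAIN_URL,
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={"new_bio": "Sam studied computer science and loves robotics."},
)
print(response.json())
```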