RAG Chain
Build a RAG (retrieval-augmented generation) pipeline with Chains
Prerequisites
To use Chains, install a recent Truss version and ensure pydantic is v2:
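A typical install looks like this (a sketch; exact version pins may differ in your environment):

```bash
# Install or upgrade Truss and require pydantic v2.
pip install --upgrade truss "pydantic>=2.0.0"
```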
To deploy Chains remotely, you also need a
Baseten account.
It is handy to export your API key to the current shell session or permanently in your .bashrc:
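For example, in a bash shell (replace the placeholder with your actual key):

```bash
# Makes the key available to Truss CLI commands and local Chain runs.
export BASETEN_API_KEY="YOUR_API_KEY"
```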
If you want to run this example in local debugging mode, you’ll also need to install chromadb:
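For example:

```bash
pip install chromadb
```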
The complete code used in this tutorial can also be found in the Chains examples repo.
Overview
Retrieval-augmented generation (RAG) is a multi-model pipeline for generating context-aware answers from LLMs.
There are a number of ways to build a RAG system. This tutorial shows a minimum viable implementation with a basic vector store and retrieval function. It’s intended as a starting point to show how Chains helps you flexibly combine model inference and business logic.
In this tutorial, we’ll build a simple RAG pipeline for a hypothetical alumni matching service for a university. The system:
- Takes a bio with information about a new graduate
- Uses a vector database to retrieve semantically similar bios of other alums
- Uses an LLM to explain why the new graduate should meet the selected alums
- Returns the writeup from the LLM
Let’s dive in!
Building the Chain
Create a file rag.py
in a new directory with:
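The sections below build up the file piece by piece; as a rough skeleton (not the complete code), rag.py has this shape:

```python
import truss_chains as chains

class VectorStore(chains.ChainletBase):
    ...  # vector database Chainlet, covered below

class LLMClient(chains.StubBase):
    ...  # Stub for the deployed LLM, covered below

@chains.mark_entrypoint
class RAG(chains.ChainletBase):
    ...  # entrypoint Chainlet, covered below
```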
Our RAG Chain is composed of three parts:
- `VectorStore`, a Chainlet that implements a vector database with a retrieval function.
- `LLMClient`, a Stub for connecting to a deployed LLM.
- `RAG`, the entrypoint Chainlet that orchestrates the RAG pipeline and has `VectorStore` and `LLMClient` as dependencies.
We’ll examine these components one by one and then see how they all work together.
Vector store Chainlet
A real production RAG system would use a hosted vector database with a massive
number of stored embeddings. For this example, we’re using a small local vector
store built with chromadb
to stand in for a more complex system.
The Chainlet has three parts (see the sketch below):
- `remote_config`, which configures a Docker image with the required dependencies on deployment.
- `__init__()`, which runs once when the Chainlet is spun up and creates the vector database with ten sample bios.
- `run_remote()`, which runs each time the Chainlet is called and is the sole public interface of the Chainlet.
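A minimal sketch of the Chainlet, assuming chromadb's default embedding function and a few made-up bios in place of the tutorial's ten:

```python
import truss_chains as chains

# Made-up sample bios standing in for the tutorial's ten alumni entries.
DOCUMENTS = [
    "Priya studied computer science and now works on ML infrastructure.",
    "Leo majored in economics and founded a fintech startup.",
    "Maya was a biology major and is now in pharma research.",
]

class VectorStore(chains.ChainletBase):
    # Installs chromadb into the Chainlet's Docker image on deployment.
    remote_config = chains.RemoteConfig(
        docker_image=chains.DockerImage(pip_requirements=["chromadb"])
    )

    def __init__(self):
        # Runs once when the Chainlet spins up: build an in-memory
        # collection holding the sample bios.
        import chromadb

        self._client = chromadb.EphemeralClient()
        self._collection = self._client.create_collection(name="bios")
        self._collection.add(
            documents=DOCUMENTS,
            ids=[f"id{n}" for n in range(len(DOCUMENTS))],
        )

    async def run_remote(self, query: str) -> list[str]:
        # Sole public interface: return the bios most similar to the query.
        results = self._collection.query(query_texts=[query], n_results=2)
        return results["documents"][0]
```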
LLM inference stub
Now that we can retrieve relevant bios from the vector database, we need to pass that information to an LLM to generate our final output.
Chains can integrate previously deployed models using a Stub. Like Chainlets, Stubs implement `run_remote()`, but as a call to the deployed model.
For our LLM, we’ll use Phi-3 Mini Instruct, a small-but-mighty open source LLM.
Deploy Phi-3 Mini Instruct 4k
One-click model deployment from Baseten’s model library.
While the model is deploying, be sure to note down the model's invocation URL from the model dashboard for use in the next step.
To use our deployed LLM in the RAG Chain, we define a Stub:
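A sketch of the Stub; the prompt text and the request/response shapes are assumptions and depend on how your Phi-3 deployment expects input:

```python
import truss_chains as chains

class LLMClient(chains.StubBase):
    async def run_remote(self, new_bio: str, bios: list[str]) -> str:
        # Ask the LLM to explain why the new graduate should meet these alums.
        prompt = (
            "You are matching a new graduate with alumni of their university.\n"
            f"New graduate bio:\n{new_bio}\n\n"
            "Selected alumni bios:\n" + "\n".join(bios) + "\n\n"
            "Write a short note on why the graduate should meet these alumni."
        )
        # Call the deployed model. The payload assumes a messages-style chat
        # API; adjust it to match your deployment.
        response = await self._remote.predict_async(
            json_payload={"messages": [{"role": "user", "content": prompt}]}
        )
        # How to extract the generated text depends on the model's output
        # schema; returning the raw response keeps the sketch generic.
        return str(response)
```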
RAG entrypoint Chainlet
The entrypoint to a Chain is the Chainlet that specifies the public-facing input and output of the Chain and orchestrates calls to dependencies.
The `__init__` function in this Chainlet takes two new arguments (see the sketch after the list):
- Add dependencies to any Chainlet with `chains.depends()`. Only Chainlets, not Stubs, need to be added in this fashion.
- Use `chains.depends_context()` to inject a context object at runtime. This context object is required to initialize the `LLMClient` stub.
- Visit your Baseten workspace to find the URL of the previously deployed Phi-3 model and insert it as the value for `LLM_URL`.
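A sketch of the entrypoint, assuming the VectorStore and LLMClient definitions above; `LLM_URL` is a placeholder for your model's invocation URL:

```python
import truss_chains as chains

# Placeholder: paste the invocation URL of your deployed Phi-3 model here.
LLM_URL = "https://model-XXXXXXX.api.baseten.co/production/predict"

@chains.mark_entrypoint
class RAG(chains.ChainletBase):
    def __init__(
        self,
        # Declare the VectorStore Chainlet as a dependency.
        vector_store: VectorStore = chains.depends(VectorStore),
        # Inject the runtime context, which carries secrets like the API key.
        context: chains.DeploymentContext = chains.depends_context(),
    ):
        self._vector_store = vector_store
        self._llm = LLMClient.from_url(LLM_URL, context)

    async def run_remote(self, new_bio: str) -> str:
        # Retrieve similar bios, then have the LLM write the recommendation.
        bios = await self._vector_store.run_remote(new_bio)
        return await self._llm.run_remote(new_bio, bios)
```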
Testing locally
Because our Chain uses a Stub for the LLM call, we can run the whole Chain locally without any GPU resources.
Before running the Chainlet, make sure to set your Baseten API key as an environment variable `BASETEN_API_KEY`.
We can run our Chain locally:
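One way to do this is to add a small test harness to the bottom of rag.py and run `python rag.py`; the bio text here is made up:

```python
if __name__ == "__main__":
    import asyncio

    # run_local() executes the Chainlets in-process instead of as deployed
    # services (the LLM Stub still calls the remote model).
    with chains.run_local():
        rag = RAG()
        result = asyncio.run(
            rag.run_remote(
                new_bio=(
                    "Sam just graduated with a computer science degree and is "
                    "interested in machine learning infrastructure."
                )
            )
        )
        print(result)
```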
After a few moments, we should get a recommendation for why Sam should meet the alumni selected from the database.
Deploying to production
Once we’re satisfied with our Chain’s local behavior, we can deploy it to production on Baseten. To deploy the Chain, run:
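Assuming the file is named rag.py, the push command looks like this:

```bash
truss chains push rag.py
```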
This will deploy our Chain as a development deployment. Once the Chain is deployed, we can call it from its API endpoint.
You can do this in the console with cURL:
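A sketch, with the Chain's endpoint URL as a placeholder (copy the real URL from the Chain's dashboard):

```bash
curl -X POST "https://chain-XXXXXXX.api.baseten.co/development/run_remote" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{"new_bio": "Sam just graduated with a computer science degree and is interested in machine learning infrastructure."}'
```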
Alternatively, you can also integrate this in a Python application:
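For example, with the requests library (the endpoint URL is again a placeholder):

```python
import os

import requests

resp = requests.post(
    "https://chain-XXXXXXX.api.baseten.co/development/run_remote",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={
        "new_bio": (
            "Sam just graduated with a computer science degree and is "
            "interested in machine learning infrastructure."
        )
    },
)
print(resp.json())
```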
When we’re happy with the deployed Chain, we can promote it to production via the UI or by running:
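From the CLI, promotion looks roughly like this (the exact flag may differ across Truss versions, so check `truss chains push --help`):

```bash
truss chains push rag.py --promote
```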
Once in production, the Chain will have access to full autoscaling settings. Both the development and production deployments will scale to zero when not in use.