# How Baseten works Source: https://docs.baseten.co/concepts/howbasetenworks Baseten is a platform designed to make deploying, serving, and scaling AI models seamless. Whether your models are custom-built, open-source, or fine-tuned, Baseten provides the tools and resources to turn them into production-ready APIs. Instead of managing infrastructure, scaling policies, and performance optimization, you can focus on building and iterating on your AI-powered applications. Baseten's inference platform is built around four core concepts: * **[Development](/development/concepts)** → Package and optimize models for deployment * **[Deployment](/deployment/concepts)** → Serve models with robust autoscaling and resource management * **[Inference](/inference/concepts)** → Run real-time or batch predictions with powerful execution controls * **[Observability](/observability/metrics)** → Monitor performance, optimize latency, and debug issues These work together to streamline the entire lifecycle of AI models, from packaging and deployment to execution and performance tracking. ## Development AI development starts with turning a trained model into a deployable artifact. Baseten makes this process easy with [Truss](/development/model/overview), an open-source model packaging framework. With Truss, you can define dependencies, resource requirements, and custom logic, ensuring that your model runs consistently across local environments and production deployments. Whether you're using a Hugging Face transformer, a TensorFlow model, or a custom PyTorch implementation, Truss provides a standardized way to containerize and deploy it. For more complex use cases you can use [Chains](/development/chain/overview), our SDK for compound AI systems, to orchestrate multiple models and processing steps into a single, cohesive workflow. You can combine model inference with pre-processing logic, post-processing steps, and external API calls, enabling powerful AI pipelines that go beyond simple predictions. With Truss and Chains, you have full control over how your models are structured, executed, and optimized for production use. Package and deploy any AI/ML model as an API with Truss or a Custom Server. Build multi-model workflows by chaining models, pre/post-processing, and business logic. ## Deployment Once a model is packaged, it needs to be served efficiently. [Deployments](/deployment/concepts) on Baseten provide the flexibility and performance required for real-world AI applications. Every model runs within a dedicated deployment, which [manages resources](/deployment/resources), versioning, and [scaling](/deployment/autoscaling). Models can be served with autoscaling to handle spikes in traffic, and they can scale to zero when idle to minimize costs. For production use cases, [Environments](/deployment/environments) provide structured model management, ensuring that updates and changes follow a controlled process. You can maintain separate environments for staging, testing, and production, each with its own scaling policies and performance optimizations. [Canary deployments](/deployment/deployments#canary-deployments) allow for gradual traffic shifting, so new model versions can be rolled out safely without disrupting existing users. With built-in infrastructure management, Baseten ensures that every model runs efficiently and reliably, regardless of demand. ## Inference Serving a model isn’t just about hosting it; it’s about delivering fast, reliable predictions. 
Baseten’s [inference engine](/inference/concepts) is built to maximize performance, supporting [synchronous](/inference/calling-your-model), [asynchronous](/inference/async), and [streaming](/inference/streaming) inference. For LLMs and generative AI, streamed responses provide sub-second latency, allowing tokens to be returned as they are generated. This makes AI-powered chatbots, content generation tools, and interactive applications feel instantaneous. For workloads that require efficiency at scale, inference requests can be optimized with concurrency settings and batched execution. [Asynchronous](/inference/async) inference enables large-scale processing without blocking application threads, allowing you to queue and process thousands of requests without latency bottlenecks. Whether your application needs high-speed responses or large-scale processing, Baseten gives you full control over how inference is handled, ensuring every request is processed with minimal delay. ## Observability Running AI models in production requires visibility into performance and reliability. Baseten provides built-in [monitoring](/observability/metrics) tools to track model health, execution times, and resource usage. With real-time metrics, you can analyze inference times, identify bottlenecks, and optimize performance based on actual usage patterns. Beyond performance tracking, detailed request and response logs allow for easier debugging and observability. If a model produces unexpected results or fails under certain conditions, you can inspect exact inputs, outputs, and error states to diagnose issues quickly. For deeper insights, Baseten supports [exporting metrics](/observability/export-metrics/overview) to external observability tools like Datadog and Prometheus. With a complete view into model execution, monitoring ensures that AI applications remain performant, cost-effective, and reliable at scale. # Why Baseten Source: https://docs.baseten.co/concepts/whybaseten Baseten delivers fast, scalable AI/ML inference with enterprise-grade security and reliability—whether in our cloud or yours. ## Mission-critical inference Built for high-performance workloads, our platform optimizes inference performance across modalities, from state-of-the-art transcription to blazing-fast LLMs. Built-in autoscaling, model performance optimizations, and deep observability tools ensure efficiency without complexity. Trusted by top ML teams serving their products to millions of users, Baseten accelerates time to market for AI-driven products by building on four key pillars of inference: performance, infrastructure, tooling, and expertise. #### Model performance Baseten’s model performance engineers apply the latest research and custom engine optimizations in production, so you get low latency and high throughput out of the box. Production-grade support for critical features, like speculative decoding and LoRA swapping, is baked into our platform. #### Cloud-native infrastructure [Deploy](/deployment/concepts) and [scale models](/deployment/autoscaling) across clusters, regions, and clouds with five nines reliability. We built all the orchestration and optimized the network routing to ensure global scalability without the operational complexity. #### Model management tooling Love your development ecosystem, with deep [observability](/observability/metrics) and easy-to-use tools for deploying, managing, and iterating on models in production. 
Quickly serve open-source and custom models, ultra-low-latency compound AI systems, and custom Docker servers in our cloud or yours. #### Embedded engineering Baseten’s expert engineers work as an extension of your team, customizing deployments for your target performance, quality, and cost-efficiency metrics. Get hands-on support with deep inference-specific expertise and 24/7 on-call availability. # Autoscaling Source: https://docs.baseten.co/deployment/autoscaling Autoscaling dynamically adjusts the number of active replicas to **handle variable traffic** while minimizing idle compute costs. ## Configuring autoscaling Autoscaling settings are **per deployment** and are inherited when promoting a model to production unless overridden. Configure autoscaling through: * **UI** → Manage settings in your Baseten workspace. * **API** → Use the **[autoscaling API](/reference/management-api/deployments/autoscaling)**. ### Replica Scaling Each deployment scales within a configured range of replicas: * **Minimum replicas** → The lowest number of active replicas. * Default: `0` (scale to zero). * Maximum value: Cannot exceed the **maximum replica count**. * **Maximum replicas** → The upper limit of active replicas. * Default: `1`. * Max: `10` by default (contact support to increase). When first deployed, the model starts with `1` replica (or the **minimum count**, if higher). As traffic increases, additional replicas **scale up** until the **maximum count** is reached. When traffic decreases, replicas **scale down** to match demand. *** ## Autoscaler settings The **autoscaler logic** is controlled by three key parameters: * **Autoscaling window** → Time window for traffic analysis before scaling up/down. Default: 60 seconds. * **Scale down delay** → Time before an unused replica is removed. Default: 900 seconds (15 minutes). * **Concurrency target** → Number of requests a replica should handle before scaling. Default: 1 request. A **short autoscaling window** with a **longer scale-down delay** is recommended for **fast upscaling** while maintaining capacity during temporary dips. *** ## Autoscaling behavior ### Scaling Up When the **average requests per active replica** exceed the **concurrency target** within the **autoscaling window**, more replicas are created until: * The **concurrency target is met**, or * The **maximum replica count** is reached. ### Scaling Down When traffic drops below the **concurrency target**, excess replicas are flagged for removal. The **scale-down delay** ensures that replicas are not removed prematurely: * If traffic **spikes again before the delay ends**, replicas remain active. * If the **minimum replica count** is reached, no further scaling down occurs. *** ## Scale to zero If you're just testing your model or anticipate light and inconsistent traffic, scale to zero can save you substantial amounts of money. Scale to zero means that when a deployed model is not receiving traffic, it scales down to zero replicas. When the model is called, Baseten spins up a new instance to serve model requests. To turn on scale to zero, just set a deployment's minimum replica count to zero. Scale to zero is enabled by default in the standard autoscaling config. Models that have not received any traffic for more than two weeks will be automatically deactivated. These models will need to be activated manually before they can serve requests again. *** ## Cold starts A **cold start** is the time required to **initialize a new replica** when scaling up. 
Cold starts impact:

* **Scaled-to-zero deployments** → The first request must wait for a new replica to start.
* **Scaling events** → When traffic spikes and a deployment requires more replicas.

### Cold Start Optimizations

**Network accelerator**

Baseten speeds up model loading from **Hugging Face, CloudFront, S3, and OpenAI** using parallelized **byte-range downloads**, reducing cold start delays.

**Cold start pods**

Baseten pre-warms specialized **cold start pods** to accelerate loading times. These pods appear in logs as `[Coldboost]`.

```md Example coldboost log line
Oct 09 9:20:25pm [Coldboost] Completed model.load() execution in 12650 ms
```

***

## Autoscaling for development deployments

Development deployments have **fixed autoscaling constraints** to optimize for **live reload workflows**:

* **Min replicas:** `0`
* **Max replicas:** `1`
* **Autoscaling window:** `60 seconds`
* **Scale down delay:** `900 seconds (15 min)`
* **Concurrency target:** `1 request`

To enable full autoscaling, **promote the deployment** to an environment such as production.

# Concepts

Source: https://docs.baseten.co/deployment/concepts

Baseten provides a flexible and scalable infrastructure for deploying and managing machine learning models. This page introduces the key concepts that shape how models are served, tested, and optimized for performance and cost efficiency: [deployments](/deployment/deployments), [environments](/deployment/environments), [resources](/deployment/resources), and [autoscaling](/deployment/autoscaling).

## Deployments

[Deployments](/deployment/deployments) define how models are served, scaled, and updated. They optimize resource use with autoscaling, scaling to zero, and controlled traffic shifts while ensuring minimal downtime. Deployments can be deactivated to pause resource usage or deleted permanently when no longer needed.

## Environments

[Environments](/deployment/environments) group deployments, providing stable endpoints and autoscaling to manage model release cycles. They enable structured testing, controlled rollouts, and seamless transitions between staging and production. Each environment maintains its own settings and metrics, ensuring reliable and scalable deployments.

## Resources

[Resources](/deployment/resources) define the hardware allocated to a model server, balancing performance and cost. Choosing the right instance type ensures efficient inference without unnecessary overhead. Resources can be set before deployment in Truss or adjusted later in the model dashboard to match workload demands.

## Autoscaling

[Autoscaling](/deployment/autoscaling) dynamically adjusts model resources to handle traffic fluctuations efficiently while minimizing costs. Deployments scale between a defined range of replicas based on demand, with settings for concurrency, scaling speed, and scale-to-zero for low-traffic models. Optimizations like network acceleration and cold start pods ensure fast response times even when scaling up from zero.

# Deployments

Source: https://docs.baseten.co/deployment/deployments

Deploy, manage, and scale machine learning models with Baseten

A **deployment** in Baseten is a **containerized instance of a model** that serves inference requests via an API endpoint. Deployments exist independently but can be **promoted to an environment** for structured access and scaling.

Every deployment is **automatically wrapped in a REST API**.
Once deployed, models can be queried with a simple HTTP request: ```python import requests resp = requests.post( "https://model-{modelID}.api.baseten.co/deployment/[{deploymentID}]/predict", headers={"Authorization": "Api-Key YOUR_API_KEY"}, json={'text': 'Hello my name is {MASK}'}, ) print(resp.json()) ``` [Learn more about running inference on your deployment](/inference/calling-your-model) *** # Development deployment A **development deployment** is a mutable instance designed for rapid iteration. It is always in the **development state** and cannot be renamed or detached from it. Key characteristics: * **Live reload** enables direct updates without redeployment. * **Single replica, scales to zero** when idle to conserve compute resources. * **No autoscaling or zero-downtime updates.** * **Can be promoted** to create a persistent deployment. Once promoted, the development deployment transitions to a **deployment** and can optionally be promoted to an environment. *** # Environments & Promotion Environments provide **logical isolation** for managing deployments but are **not required** for a deployment to function. A deployment can be executed independently or promoted to an environment for controlled traffic allocation and scaling. * The **production environment** exists by default. * **Custom environments** (e.g., staging) can be created for specific workflows. * **Promoting a deployment does not modify its behavior**, only its routing and lifecycle management. ## Canary deployments Canary deployments support **incremental traffic shifting** to a new deployment, mitigating risk during rollouts. * Traffic is routed in **10 evenly distributed stages** over a configurable time window. * Traffic only begins to shift once the new deployment reaches the min replica count of the current production model. * Autoscaling dynamically adjusts to real-time demand. * Canary rollouts can be enabled or canceled via the UI or [REST API](/reference/management-api/environments/update-an-environments-settings). *** # Managing Deployments ## Naming deployments By default, deployments of a model are named `deployment-1`, `deployment-2`, and so forth sequentially. You can instead give deployments custom names via two methods: 1. While creating the deployment, using a [command line argument in truss push](/reference/sdk/truss#deploying-a-model). 2. After creating the deployment, in the model management page within your Baseten dashboard. Renaming deployments is purely aesthetic and does not affect model management API paths, which work via model and deployment IDs. ## Deactivating a deployment A deployment can be deactivated to suspend inference execution while preserving configuration. * **Remains visible in the dashboard.** * **Consumes no compute resources** but can be reactivated anytime. * **API requests return a 404 error while deactivated.** For demand-driven deployments, consider [autoscaling with scale to zero](/reference/management-api/deployments/autoscaling/updates-a-production-deployments-autoscaling-settings). ## Deleting deployments Deployments can be **permanently deleted**, but production deployments must be replaced before deletion. * **Deleted deployments are purged from the dashboard** but retained in usage logs. * **All associated compute resources are released.** * **API requests return a 404 error post-deletion.** Deletion is irreversible — use deactivation if retention is required. 
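Because promotion, deactivation, and deletion all change which deployment IDs are live, client code is usually more robust when it targets the production endpoint rather than a specific deployment. A minimal sketch, assuming the production predict URL follows the same pattern as the deployment-specific endpoint shown above (the model ID and input are placeholders):

```python
import os

import requests

model_id = "YOUR_MODEL_ID"  # placeholder
api_key = os.environ["BASETEN_API_KEY"]

# Routes to whichever deployment is currently promoted to production,
# so promoting or deleting individual deployments doesn't break callers.
production_url = f"https://model-{model_id}.api.baseten.co/production/predict"

resp = requests.post(
    production_url,
    headers={"Authorization": f"Api-Key {api_key}"},
    json={"text": "Hello my name is {MASK}"},
)
print(resp.json())
```

A deployment-specific URL, as in the example at the top of this page, is still useful for testing a particular deployment before promoting it.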
# Environments

Source: https://docs.baseten.co/deployment/environments

Manage your model’s release cycles with environments.

Environments provide structured management for deployments, ensuring controlled rollouts, stable endpoints, and autoscaling. They help teams stage, test, and release models without affecting production traffic.

Deployments can be promoted to an environment (e.g., "staging") to validate outputs before moving to production, allowing for safer model iteration and evaluation.

***

## Using Environments to manage deployments

Environments support **structured validation** before promoting a deployment, including:

* **Automated tests & evaluations**
* **Manual testing in pre-production**
* **Gradual traffic shifts with canary deployments**
* **Shadow serving for real-world analysis**

Promoting a deployment ensures it inherits **environment-specific scaling and monitoring settings**, such as:

* **Dedicated API endpoint** → [Predict API Reference](/reference/inference-api/overview#predict-endpoints)
* **Autoscaling controls** → Scale behavior is managed per environment.
* **Traffic ramp-up** → Enable [canary rollouts](/deployment/deployments#canary-deployments).
* **Monitoring & Metrics** → [Export environment metrics](/observability/export-metrics/overview).

A **production environment** operates like any other environment but has restrictions:

* **It cannot be deleted** unless the entire model is removed.
* **You cannot create additional environments named "production."**

***

## Creating custom environments

In addition to the standard **production** environment, you can create as many custom environments as needed.

There are two ways to create a custom environment:

1. In the model management page on the Baseten dashboard.
2. Via the [create environment endpoint](/reference/management-api/environments/create-an-environment) in the model management API.

***

## Promoting deployments to environments

When a deployment is promoted, Baseten follows a **three-step process**:

1. A **new deployment** is created with a unique deployment ID.
2. The deployment **initializes resources** and becomes active.
3. The new deployment **replaces the existing deployment** in that environment.
   * If there was **no previous deployment, default autoscaling settings** are applied.
   * If a **previous deployment existed**, the new one **inherits autoscaling settings**, and the old deployment is **demoted and scales to zero**.

### Promoting a Published Deployment

If a **published deployment** (not a development deployment) is promoted:

* Its **autoscaling settings are updated** to match the environment.
* If **inactive**, it must be **activated** before promotion.

Previous deployments are **demoted but remain in the system**, retaining their **deployment ID and scaling behavior**.

***

## Deploying directly to an environment

You can **skip the development stage** and deploy directly to an environment by specifying `--environment` in `truss push`:

```sh
cd my_model/
truss push --environment {environment_name}
```

Only one active promotion per environment is allowed at a time.

***

## Accessing environments in your code

The **environment name** is available in `model.py` via the `environment` keyword argument:

```python
def __init__(self, **kwargs):
    self._environment = kwargs["environment"]
```

To ensure the **environment variable remains updated**, enable **"Re-deploy when promoting"** in the UI or via the [REST API](/reference/management-api/environments/update-an-environments-settings).
This guarantees the environment is fully initialized after a promotion.

***

## Deleting environments

Environments can be deleted, **except for production**. To remove a **production deployment**, first **promote another deployment to production** or delete the entire model.

* **Deleted environments are removed from the overview** but remain in billing history.
* **They do not consume resources** after deletion.
* **API requests to a deleted environment return a 404 error.**

Deletion is permanent; consider deactivation instead.

# Resources

Source: https://docs.baseten.co/deployment/resources

Manage and configure model resources

Every AI/ML model on Baseten runs on an **instance**, a dedicated set of hardware allocated to the model server. Selecting the right instance type ensures **optimal performance** while controlling **compute costs**.

* **Insufficient resources** → Slow inference or failures.
* **Excess resources** → Higher costs without added benefit.

## Instance type resource components

* **Instance** → The allocated hardware for inference.
* **vCPU** → Virtual CPU cores for general computing.
* **RAM** → Memory available to the CPU.
* **GPU** → Specialized hardware for accelerated ML workloads.
* **VRAM** → Dedicated GPU memory for model execution.

***

# Configuring model resources

Resources can be defined **before deployment** in Truss or **adjusted later** via the Baseten UI.

### Defining resources in Truss

Define resource requirements in `config.yaml` before running `truss push`. Any changes after deployment will not impact previous deployments. Running `truss push` again will create a new deployment using the resources specified in the `config.yaml`. The only exception is the **development** deployment. It will be redeployed with the newly specified resources.

**Example (Stable Diffusion XL):**

```yaml config.yaml
resources:
  accelerator: A10G
  cpu: "4"
  memory: 16Gi
  use_gpu: true
```

Baseten provisions the **smallest instance that meets the specified constraints**:

* **cpu: "3" or "4"** → Maps to a 4-core instance.
* **cpu: "5" to "8"** → Maps to an 8-core instance.

`Gi` in `resources.memory` refers to **Gibibytes**, which are slightly larger than **Gigabytes**.

### Updating resources in the Baseten UI

Once deployed, resource configurations can only be updated **through the Baseten UI**. Changing the instance type will deploy a new copy of the deployment using the newly specified instance type. Like when running `truss push`, the **development** deployment will be redeployed with the newly specified instance type.

For a list of available instance types, see the [instance type reference](/deployment/resources#instance-type-reference).

***

# Instance Type Reference

Specs and benchmarks for every Baseten instance type.

Choosing the right instance for model inference means balancing performance and cost. This page lists all available instance types on Baseten to help you deploy and serve models effectively.

## CPU-only Instances

Cost-effective options for lighter workloads. No GPU.
* **Starts at**: \$0.00058/min
* **Best for**: Transformers pipelines, small QA models, text embeddings

| Instance | \$/min    | vCPU | RAM    |
| -------- | --------- | ---- | ------ |
| 1×2      | \$0.00058 | 1    | 2 GiB  |
| 1×4      | \$0.00086 | 1    | 4 GiB  |
| 2×8      | \$0.00173 | 2    | 8 GiB  |
| 4×16     | \$0.00346 | 4    | 16 GiB |
| 8×32     | \$0.00691 | 8    | 32 GiB |
| 16×64    | \$0.01382 | 16   | 64 GiB |

**Example workloads:**

* `1x2`: Text classification (e.g., Truss quickstart)
* `4x16`: LayoutLM Document QA
* `4x16+`: Sentence Transformers embeddings on larger corpora

## GPU Instances

Accelerated inference for LLMs, diffusion models, and Whisper.

| Instance          | \$/min    | vCPU | RAM      | GPU                    | VRAM    |
| ----------------- | --------- | ---- | -------- | ---------------------- | ------- |
| T4x4x16           | \$0.01052 | 4    | 16 GiB   | NVIDIA T4              | 16 GiB  |
| T4x8x32           | \$0.01504 | 8    | 32 GiB   | NVIDIA T4              | 16 GiB  |
| T4x16x64          | \$0.02408 | 16   | 64 GiB   | NVIDIA T4              | 16 GiB  |
| L4x4x16           | \$0.01414 | 4    | 16 GiB   | NVIDIA L4              | 24 GiB  |
| L4:2x24x96        | \$0.04002 | 24   | 96 GiB   | 2 NVIDIA L4s           | 48 GiB  |
| L4:4x48x192       | \$0.08003 | 48   | 192 GiB  | 4 NVIDIA L4s           | 96 GiB  |
| A10Gx4x16         | \$0.02012 | 4    | 16 GiB   | NVIDIA A10G            | 24 GiB  |
| A10Gx8x32         | \$0.02424 | 8    | 32 GiB   | NVIDIA A10G            | 24 GiB  |
| A10Gx16x64        | \$0.03248 | 16   | 64 GiB   | NVIDIA A10G            | 24 GiB  |
| A10G:2x24x96      | \$0.05672 | 24   | 96 GiB   | 2 NVIDIA A10Gs         | 48 GiB  |
| A10G:4x48x192     | \$0.11344 | 48   | 192 GiB  | 4 NVIDIA A10Gs         | 96 GiB  |
| A10G:8x192x768    | \$0.32576 | 192  | 768 GiB  | 8 NVIDIA A10Gs         | 192 GiB |
| V100x8x61         | \$0.06120 | 8    | 61 GiB   | NVIDIA V100            | 16 GiB  |
| A100x12x144       | \$0.10240 | 12   | 144 GiB  | 1 NVIDIA A100          | 80 GiB  |
| A100:2x24x288     | \$0.20480 | 24   | 288 GiB  | 2 NVIDIA A100s         | 160 GiB |
| A100:3x36x432     | \$0.30720 | 36   | 432 GiB  | 3 NVIDIA A100s         | 240 GiB |
| A100:4x48x576     | \$0.40960 | 48   | 576 GiB  | 4 NVIDIA A100s         | 320 GiB |
| A100:5x60x720     | \$0.51200 | 60   | 720 GiB  | 5 NVIDIA A100s         | 400 GiB |
| A100:6x72x864     | \$0.61440 | 72   | 864 GiB  | 6 NVIDIA A100s         | 480 GiB |
| A100:7x84x1008    | \$0.71680 | 84   | 1008 GiB | 7 NVIDIA A100s         | 560 GiB |
| A100:8x96x1152    | \$0.81920 | 96   | 1152 GiB | 8 NVIDIA A100s         | 640 GiB |
| H100x26x234       | \$0.16640 | 26   | 234 GiB  | 1 NVIDIA H100          | 80 GiB  |
| H100:2x52x468     | \$0.33280 | 52   | 468 GiB  | 2 NVIDIA H100s         | 160 GiB |
| H100:4x104x936    | \$0.66560 | 104  | 936 GiB  | 4 NVIDIA H100s         | 320 GiB |
| H100:8x208x1872   | \$1.33120 | 208  | 1872 GiB | 8 NVIDIA H100s         | 640 GiB |
| H100MIG:3gx13x117 | \$0.08250 | 13   | 117 GiB  | Fractional NVIDIA H100 | 40 GiB  |

## GPU Details & Workloads

### T4

Turing-series GPU

* 2,560 CUDA / 320 Tensor cores
* 16 GiB VRAM
* **Best for**: Whisper, small LLMs like StableLM 3B

### L4

Ada Lovelace-series GPU

* 7,680 CUDA / 240 Tensor cores
* 24 GiB VRAM, 300 GiB/s
* 121 TFLOPS (fp16)
* **Best for**: Stable Diffusion XL
* **Limit**: Not suitable for LLMs due to bandwidth

### A10G

Ampere-series GPU

* 9,216 CUDA / 288 Tensor cores
* 24 GiB VRAM, 600 GiB/s
* 70 TFLOPS (fp16)
* **Best for**: Mistral 7B, Whisper, Stable Diffusion/SDXL

### V100

Volta-series GPU

* 16 GiB VRAM
* **Best for**: Legacy workloads needing V100-specific support

### A100

Ampere-series GPU

* 6,912 CUDA / 432 Tensor cores
* 80 GiB VRAM, 1.94 TB/s
* 312 TFLOPS (fp16)
* **Best for**: Mixtral, Llama 2 70B (2 A100s), Falcon 180B (5 A100s), SDXL

### H100

Hopper-series GPU

* 16,896 CUDA / 640 Tensor cores
* 80 GiB VRAM, 3.35 TB/s
* 990 TFLOPS (fp16)
* **Best for**: Mixtral 8x7B, Llama 2 70B (2×H100), SDXL

### H100MIG

Fractional H100
(3/7 compute, ½ memory)

* 7,242 CUDA cores, 40 GiB VRAM
* 1.675 TB/s bandwidth
* **Best for**: Efficient LLM inference at lower cost than A100

# Binary IO

Source: https://docs.baseten.co/development/chain/binaryio

Performant serialization of numeric data

Numeric data or audio/video are most efficiently transmitted as bytes. Other representations such as JSON or base64 encoding lose precision, add significant parsing overhead, and increase message sizes (e.g. \~33% increase for base64 encoding). Chains extends the JSON-centred pydantic ecosystem with two ways to include binary data: numpy array support and raw bytes.

## Numpy `ndarray` support

Once you have your data represented as a numpy array, you can easily (and often without copying) convert it to `torch`, `tensorflow`, or other common numeric libraries' objects.

To include numpy arrays in a pydantic model, chains has a special field type implementation `NumpyArrayField`. For example:

```python
import numpy as np
import pydantic
from truss_chains import pydantic_numpy


class DataModel(pydantic.BaseModel):
    some_numbers: pydantic_numpy.NumpyArrayField
    other_field: str


...

numbers = np.random.random((3, 2))
data = DataModel(some_numbers=numbers, other_field="Example")
print(data)
# some_numbers=NumpyArrayField(shape=(3, 2), dtype=float64, data=[
# [0.39595027 0.23837526]
# [0.56714894 0.61244946]
# [0.45821942 0.42464844]])
# other_field='Example'
```

`NumpyArrayField` is a wrapper around the actual numpy array. Inside your python code, you can work with its `array` attribute:

```python
data.some_numbers.array += 10
# some_numbers=NumpyArrayField(shape=(3, 2), dtype=float64, data=[
# [10.39595027 10.23837526]
# [10.56714894 10.61244946]
# [10.45821942 10.42464844]])
# other_field='Example'
```

The interesting part is how it serializes when communicating between Chainlets or with a client. It can work in two modes: JSON and binary.

### Binary

As a JSON alternative that supports byte data, Chains uses `msgpack` (with `msgpack_numpy`) to serialize the dict representation. For Chainlet-Chainlet RPCs this is done automatically for you by enabling binary mode of the dependency Chainlets, see [all options](/reference/sdk/chains#truss-chains-depends):

```python
import truss_chains as chains


class Worker(chains.ChainletBase):
    async def run_remote(self, data: DataModel) -> DataModel:
        data.some_numbers.array += 10
        return data


class Consumer(chains.ChainletBase):
    def __init__(self, worker=chains.depends(Worker, use_binary=True)):
        self._worker = worker

    async def run_remote(self):
        numbers = np.random.random((3, 2))
        data = DataModel(some_numbers=numbers, other_field="Example")
        result = await self._worker.run_remote(data)
```

Now the data is transmitted in a fast and compact way between Chainlets, which often improves performance.

### Binary client

If you want to send such data as input to a chain or parse binary output from a chain, you have to add the `msgpack` serialization client-side:

```python
import requests
import msgpack
import msgpack_numpy

msgpack_numpy.patch()  # Register hook for numpy.

# Dump to "python" dict and then to binary.
data_dict = data.model_dump(mode="python")
data_bytes = msgpack.dumps(data_dict)

# Set binary content type in request header.
headers = {
    "Content-Type": "application/octet-stream",
    "Authorization": ...
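    # Same "Api-Key <YOUR_API_KEY>" format as in the other request examples.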
}
response = requests.post(url, data=data_bytes, headers=headers)
response_dict = msgpack.loads(response.content)
response_model = ResponseModel.model_validate(response_dict)
```

The steps of dumping from a pydantic model and validating the response dict into a pydantic model can be skipped if you prefer working with raw dicts on the client.

The implementation of `NumpyArrayField` only needs `pydantic`, no other Chains dependencies. So you can take that implementation code in isolation and integrate it into your client code.

Some version combinations of `msgpack` and `msgpack_numpy` give errors; `msgpack = ">=1.0.2"` together with `msgpack-numpy = ">=0.4.8"` is known to work.

### JSON

The JSON schema to represent the array is a dict of `shape (tuple[int])`, `dtype (str)`, and `data_b64 (str)`. For example:

```python
print(data.model_dump_json())
'{"some_numbers":{"shape":[3,2],"dtype":"float64", "data_b64":"30d4/rnKJEAsvm...'
```

The base64 data corresponds to `np.ndarray.tobytes()`. To get back to the array from the JSON string, use the model's `model_validate_json` method.

As discussed in the beginning, this schema is not performant for numeric data and is only offered as a compatibility layer (JSON does not allow bytes); generally prefer the binary format.

# Simple `bytes` fields

It is possible to add a `bytes` field to a pydantic model used in a chain, or as a plain argument to `run_remote`. This can be useful to include non-numpy data formats such as images or audio/video snippets.

In this case, the "normal" JSON representation does not work and all involved requests or Chainlet-Chainlet invocations must use binary mode. The same steps as for arrays [above](#binary-client) apply: construct dicts with `bytes` values and keys corresponding to the `run_remote` argument names or the field names in the pydantic model. Then use `msgpack` to serialize and deserialize those dicts. Don't forget to set the `Content-Type` header, and note that `response.json()` will not work.

# Concepts

Source: https://docs.baseten.co/development/chain/concepts

Glossary of Chains concepts and terminology

## Chainlet

A Chainlet is the basic building block of Chains. A Chainlet is a Python class that specifies:

* A set of compute resources.
* A Python environment with software dependencies.
* A typed interface [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets) for other Chainlets to call.

This is the simplest possible Chainlet — only the [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets) method is required — and we can layer in other concepts to create a more capable Chainlet.

```python
import truss_chains as chains


class SayHello(chains.ChainletBase):
    async def run_remote(self, name: str) -> str:
        return f"Hello, {name}"
```

You can modularize your code by creating your own chainlet sub-classes; refer to our [subclassing guide](/development/chain/subclassing).

### Remote configuration

Chainlets are meant for deployment as remote services. Each Chainlet specifies its own requirements for compute hardware (CPU count, GPU type and count, etc.) and software dependencies (Python libraries or system packages). This configuration is built into a Docker image automatically as part of the deployment process.

When no configuration is provided, the Chainlet will be deployed on a basic instance with one vCPU, 2GB of RAM, no GPU, and a standard set of Python and system packages.
Configuration is set using the [`remote_config`](/reference/sdk/chains#remote-configuration) class variable within the Chainlet:

```python
import truss_chains as chains


class MyChainlet(chains.ChainletBase):
    remote_config = chains.RemoteConfig(
        docker_image=chains.DockerImage(
            pip_requirements=["torch==2.3.0", ...]
        ),
        compute=chains.Compute(gpu="H100", ...),
        assets=chains.Assets(secret_keys=["hf_access_token"], ...),
    )
```

See the [remote configuration reference](/reference/sdk/chains#remote-configuration) for a complete list of options.

### Initialization

Chainlets are implemented as classes because we often want to set up expensive static resources once at startup and then re-use them with each invocation of the Chainlet. For example, we only want to initialize an AI model and download its weights once, then re-use it every time we run inference.

We do this setup in `__init__()`, which is run exactly once when the Chainlet is deployed or scaled up.

```python
import truss_chains as chains


class PhiLLM(chains.ChainletBase):
    def __init__(self) -> None:
        import torch
        import transformers

        self._model = transformers.AutoModelForCausalLM.from_pretrained(
            PHI_HF_MODEL,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        self._tokenizer = transformers.AutoTokenizer.from_pretrained(
            PHI_HF_MODEL,
        )
```

Chainlet initialization also has two important features: context and dependency injection of other Chainlets, explained below.

#### Context (access information)

You can add a [`DeploymentContext`](/reference/sdk/chains#class-truss-chains-deploymentcontext) object as an optional argument to the `__init__` method of a Chainlet. This allows you to use secrets within your Chainlet, such as using a `hf_access_token` to access a gated model on Hugging Face (note that when using secrets, they also need to be added to the `assets`).

```python
import truss_chains as chains


class MistralLLM(chains.ChainletBase):
    remote_config = chains.RemoteConfig(
        ...
        assets=chains.Assets(secret_keys=["hf_access_token"], ...),
    )

    def __init__(
        self,
        # Adding the `context` argument allows us to access secrets.
        context: chains.DeploymentContext = chains.depends_context(),
    ) -> None:
        import transformers

        # Using the secret from context to access a gated model on HF.
        self._model = transformers.AutoModelForCausalLM.from_pretrained(
            "mistralai/Mistral-7B-Instruct-v0.2",
            use_auth_token=context.secrets["hf_access_token"],
        )
```

#### Depends (call other Chainlets)

The Chains framework uses the [`chains.depends()`](/reference/sdk/chains#truss-chains-depends) function in Chainlets' `__init__()` method to track the dependency relationship between different Chainlets within a Chain.

This syntax, inspired by dependency injection, is used to translate local Python function calls into calls to the remote Chainlets in production.

Once a dependency Chainlet is added with [`chains.depends()`](/reference/sdk/chains#truss-chains-depends), the depending Chainlet's [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets) method can call it; e.g. below, `HelloAll` makes calls to `SayHello`:

```python
import truss_chains as chains


class HelloAll(chains.ChainletBase):
    def __init__(self, say_hello_chainlet=chains.depends(SayHello)) -> None:
        self._say_hello = say_hello_chainlet

    async def run_remote(self, names: list[str]) -> str:
        output = []
        for name in names:
            output.append(await self._say_hello.run_remote(name))
        return "\n".join(output)
```

## Run remote (chaining Chainlets)

The `run_remote()` method is run each time the Chainlet is called.
It is the sole public interface for the Chainlet (though you can have as many private helper functions as you want) and its inputs and outputs must have type annotations.

In `run_remote()` you implement the actual work of the Chainlet, such as model inference or data chunking:

```python
import truss_chains as chains


class PhiLLM(chains.ChainletBase):
    async def run_remote(self, messages: Messages) -> str:
        import torch

        model_inputs = self._tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self._tokenizer(model_inputs, return_tensors="pt")
        input_ids = inputs["input_ids"].to("cuda")
        with torch.no_grad():
            outputs = self._model.generate(
                input_ids=input_ids, **self._generate_args)
        output_text = self._tokenizer.decode(
            outputs[0], skip_special_tokens=True)
        return output_text
```

We recommend implementing this as an `async` method and using async APIs for doing all the work (e.g. downloads, vLLM or TRT inference). It is possible to stream results back; see our [streaming guide](/development/chain/streaming).

If `run_remote()` makes calls to other Chainlets, e.g. invoking a dependency Chainlet for each element in a list, you can benefit from concurrent execution by making `run_remote()` an `async` method and starting the calls as concurrent tasks with `asyncio.ensure_future(self._dep_chainlet.run_remote(...))`.

## Entrypoint

The entrypoint is called directly from the deployed Chain's API endpoint and kicks off the entire chain. The entrypoint is also responsible for returning the final result back to the client.

Using the [`@chains.mark_entrypoint`](/reference/sdk/chains#truss-chains-mark-entrypoint) decorator, one Chainlet within a file is set as the entrypoint to the chain.

```python
@chains.mark_entrypoint
class HelloAll(chains.ChainletBase):
```

Optionally you can also set a Chain display name (not to be confused with Chainlet display name) with this decorator:

```python
@chains.mark_entrypoint("My Awesome Chain")
class HelloAll(chains.ChainletBase):
```

## I/O and `pydantic` data types

To make orchestrating multiple remotely deployed services possible, Chains relies heavily on typed inputs and outputs. Values must be serialized to a safe exchange format to be sent over the network.

The Chains framework uses the type annotations to infer how data should be serialized and currently is restricted to types that are JSON compatible. Types can be:

* Direct type annotations for simple types such as `int`, `float`, or `list[str]`.
* Pydantic models to define a schema for nested data structures or multiple arguments.

An example of pydantic input and output types for a Chainlet is given below:

```python
import enum
import pydantic


class Modes(enum.Enum):
    MODE_0 = "MODE_0"
    MODE_1 = "MODE_1"


class SplitTextInput(pydantic.BaseModel):
    data: str
    num_partitions: int
    mode: Modes


class SplitTextOutput(pydantic.BaseModel):
    parts: list[str]
    part_lens: list[int]
```

Refer to the [pydantic docs](https://docs.pydantic.dev/latest/) for more details on how to define custom pydantic data models. Also refer to the [guide](/development/chain/binaryio) about efficient integration of binary and numeric data.

## Chains compared to Truss

Chains is an alternate SDK for packaging and deploying AI models. It carries over many features and concepts from Truss and gives you access to the benefits of Baseten (resource provisioning, autoscaling, fast cold starts, etc.), but it is not a 1-1 replacement for Truss.
Here are some key differences:

* Rather than running `truss init` and creating a Truss in a directory, a Chain is a single file, giving you more flexibility for implementing multi-step model inference. Create an example with `truss chains init`.
* Configuration is done inline in typed Python code rather than in a `config.yaml` file.
* While Chainlets are converted to Truss models when run on Baseten, `Chainlet != TrussModel`. Chains is designed for compatibility and incremental adoption, with a stub function for wrapping existing deployed models.

# Deploy

Source: https://docs.baseten.co/development/chain/deploy

Deploy your Chain on Baseten

Deploying a Chain is an atomic action that deploys every Chainlet within the Chain. Each Chainlet specifies its own remote environment — hardware resources, Python and system dependencies, autoscaling settings.

### Development

The default behavior for pushing a chain is to create a development deployment:

```sh
truss chains push ./my_chain.py
```

Where `my_chain.py` contains the entrypoint Chainlet for your Chain.

Development deployments are intended for testing and can't scale past one replica. Each time you make a development deployment, it overwrites the existing development deployment.

Development deployments support rapid iteration with `watch`; see the [watch guide](/development/chain/watch).

### 🆕 Environments

To deploy a Chain to an environment, run:

```sh
truss chains push ./my_chain.py --environment {env_name}
```

Environments are intended for live traffic and have access to full autoscaling settings.

Each time you deploy to an environment, a new deployment is created. Once the new deployment is live, it replaces the previous deployment, which is relegated to the published deployments list.

[Learn more](/deployment/environments) about environments.

# Architecture & Design

Source: https://docs.baseten.co/development/chain/design

How to structure your Chainlets

A Chain is composed of multiple connected Chainlets working together to perform a task. For example, consider a Chain that takes a large audio file as input. It splits the file into smaller chunks, transcribes each chunk in parallel (reducing the end-to-end latency), and finally aggregates and returns the results.

To build an efficient Chain, we recommend drafting your high level structure as a flowchart or diagram. This can help you identify parallelizable units of work and steps that need different (model/hardware) resources.

If one Chainlet creates many "sub-tasks" by calling other dependency Chainlets (e.g. in a loop over partial work items), these calls should be done as `asyncio` tasks that run concurrently. That way you get the most out of the parallelism that Chains offers. This design pattern is extensively used in the [audio transcription example](/examples/chains-audio-transcription).

While using `asyncio` is essential for performance, it can also be tricky. Here are a few caveats to look out for:

* Executing operations in an async function that block the event loop for more than a fraction of a second. This hinders the "flow" of processing requests concurrently and starting RPCs to other Chainlets. Ideally, use native async APIs. Frameworks like vLLM or Triton server offer such APIs; similarly, file downloads can be made async, and you might find [`AsyncBatcher`](https://github.com/hussein-awala/async-batcher) useful. If there is no async support, consider running blocking code in a thread/process pool (as an attribute of a Chainlet).
* Creating async tasks (e.g.
with `asyncio.ensure_future`) does not start the task *immediately*. In particular, when starting several tasks in a loop, `ensure_future` must be alternated with operations that yield to the event loop, so the tasks can be started. If the loop is not an `async for` loop and does not contain other `await` statements, a "dummy" await can be added, for example `await asyncio.sleep(0)`. This allows the tasks to be started concurrently.

# Engine Builder Models

Source: https://docs.baseten.co/development/chain/engine-builder-models

Engine Builder models are pre-trained models that are optimized for specific inference tasks.

Baseten's [Engine Builder](/development/model/performance/engine-builder-overview) enables the deployment of optimized model inference engines. Currently, it supports TensorRT-LLM. Truss Chains allows seamless integration of these engines into structured workflows. This guide provides a quick entry point for Chains users.

## Llama 7B Example

Use the `EngineBuilderLLMChainlet` base class to configure an LLM engine. The additional `engine_builder_config` field specifies the model architecture, checkpoint repository, runtime parameters, and more; the full options are detailed in the [Engine Builder configuration guide](/development/model/performance/engine-builder-config).

```python
import truss_chains as chains
from truss.base import trt_llm_config, truss_config


class Llama7BChainlet(chains.EngineBuilderLLMChainlet):
    remote_config = chains.RemoteConfig(
        compute=chains.Compute(gpu=truss_config.Accelerator.H100),
        assets=chains.Assets(secret_keys=["hf_access_token"]),
    )
    engine_builder_config = truss_config.TRTLLMConfiguration(
        build=trt_llm_config.TrussTRTLLMBuildConfiguration(
            base_model=trt_llm_config.TrussTRTLLMModel.LLAMA,
            checkpoint_repository=trt_llm_config.CheckpointRepository(
                source=trt_llm_config.CheckpointSource.HF,
                repo="meta-llama/Llama-3.1-8B-Instruct",
            ),
            max_batch_size=8,
            max_seq_len=4096,
            tensor_parallel_count=1,
        )
    )
```

## Differences from Standard Chainlets

* No `run_remote` implementation: Unlike regular Chainlets, `EngineBuilderLLMChainlet` does not require users to implement `run_remote()`. Instead, it automatically wires into the deployed engine's API. All LLM Chainlets have the same function signature: `chains.EngineBuilderLLMInput` as input and a stream (`AsyncIterator`) of strings as output. Likewise, `EngineBuilderLLMChainlet`s can only be used as dependencies, but cannot have dependencies themselves.
* No `run_local` ([guide](/development/chain/localdev)) or `watch` ([guide](/development/chain/watch)): Standard Chains support a local debugging mode and watch. However, when using `EngineBuilderLLMChainlet`, local execution is not available, and testing must be done after deployment. For a faster dev loop on the rest of your chain (everything except the engine builder Chainlet), you can substitute those Chainlets with stubs, as you can for an already deployed Truss model ([guide](/development/chain/stub)); see the sketch below.
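As a rough illustration of that substitution, here is a sketch of a stand-in for the engine builder Chainlet that calls an already deployed model instead. The specific names (`chains.StubBase`, `from_url`, `predict_async`) are assumptions based on the stub guide linked above; verify them against that guide and the SDK reference rather than treating this as a definitive API:

```python
from typing import AsyncIterator

import truss_chains as chains


class DeployedLLMStub(chains.StubBase):
    """Stand-in for Llama7BChainlet that calls an already deployed model.

    NOTE: `StubBase`, `from_url`, and `predict_async` are taken from the stub
    guide referenced above; check the exact signatures there.
    """

    async def run_remote(
        self, llm_input: chains.EngineBuilderLLMInput
    ) -> AsyncIterator[str]:
        # One non-streaming call to the deployed model, yielded as a single
        # chunk so the signature matches the engine builder Chainlet it replaces.
        resp = await self.predict_async(json_payload={"messages": llm_input.messages})
        yield resp["output"]


# In the dependent Chainlet, construct the stub from the deployed model's
# predict URL (hypothetical placeholder) instead of using chains.depends():
#   self._llm = DeployedLLMStub.from_url(DEPLOYED_LLM_URL, context)
```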
## Integrate the Engine Builder Chainlet

After defining an `EngineBuilderLLMChainlet` like `Llama7BChainlet` above, you can use it as a dependency in other conventional Chainlets:

```python
from typing import AsyncIterator

import truss_chains as chains


@chains.mark_entrypoint
class TestController(chains.ChainletBase):
    """Example using the Engine Builder Chainlet in another Chainlet."""

    def __init__(self, llm=chains.depends(Llama7BChainlet)) -> None:
        self._llm = llm

    async def run_remote(self, prompt: str) -> AsyncIterator[str]:
        messages = [{"role": "user", "content": prompt}]
        llm_input = chains.EngineBuilderLLMInput(messages=messages)
        async for chunk in self._llm.run_remote(llm_input):
            yield chunk
```

# Error Handling

Source: https://docs.baseten.co/development/chain/errorhandling

Understanding and handling Chains errors

Error handling in Chains follows the principle that the root cause "bubbles up" to the entrypoint, which returns an error response, similar to how a Python stack trace contains all the layers from where an exception was raised up to the main function.

Consider the case of a Chain where the entrypoint calls `run_remote` of a Chainlet named `TextToNum` and this in turn invokes `TextReplicator`. The respective `run_remote` methods might also use other helper functions that appear in the call stack. Below is an example stack trace that shows how the root cause (a `ValueError`) is propagated up to the entrypoint's `run_remote` method (this is what you would see as an error log):

```
Chainlet-Traceback (most recent call last):
  File "/packages/itest_chain.py", line 132, in run_remote
    value = self._accumulate_parts(text_parts.parts)
  File "/packages/itest_chain.py", line 144, in _accumulate_parts
    value += self._text_to_num.run_remote(part)
ValueError: (showing chained remote errors, root error at the bottom)
├─ Error in dependency Chainlet `TextToNum`:
│   Chainlet-Traceback (most recent call last):
│     File "/packages/itest_chain.py", line 87, in run_remote
│       generated_text = self._replicator.run_remote(data)
│   ValueError: (showing chained remote errors, root error at the bottom)
│   ├─ Error in dependency Chainlet `TextReplicator`:
│   │   Chainlet-Traceback (most recent call last):
│   │     File "/packages/itest_chain.py", line 52, in run_remote
│   │       validate_data(data)
│   │     File "/packages/itest_chain.py", line 36, in validate_data
│   │       raise ValueError(f"This input is too long: {len(data)}.")
╰   ╰   ValueError: This input is too long: 100.
```

## Exception handling and retries

The stack trace above is what you see if you don't catch the exception. It is possible to add error handling around each remote Chainlet invocation. Chains tries to raise the same exception class on the *caller* Chainlet as was raised in the *dependency* Chainlet.

* Builtin exceptions (e.g. `ValueError`) always work.
* Custom or third-party exceptions (e.g. from `torch`) can only be raised in the caller if they are included in the dependencies of the caller as well. If the exception class cannot be resolved, a `GenericRemoteException` is raised instead.

Note that the *message* of re-raised exceptions is the concatenation of the original message and the formatted stack trace of the dependency Chainlet.

In some cases it might make sense to simply retry a remote invocation (e.g. if it failed due to some transient problems like networking or other "flaky" parts). `depends` can be configured with additional [options](/reference/sdk/chains#truss-chains-depends) for that.
The example below shows how you can add automatic retries and error handling for the call to `TextReplicator` in `TextToNum`:

```python
import truss_chains as chains


class TextToNum(chains.ChainletBase):
    def __init__(
        self,
        replicator: TextReplicator = chains.depends(TextReplicator, retries=3),
    ) -> None:
        self._replicator = replicator

    async def run_remote(self, data: ...):
        try:
            generated_text = await self._replicator.run_remote(data)
        except ValueError:
            ...  # Handle error.
```

## Stack filtering

The stack trace is intended to show the user-implemented code in `run_remote` (and user-implemented helper functions). Under the hood, the calls from one Chainlet to another go through an HTTP connection managed by the Chains framework, and each Chainlet itself runs as a FastAPI server with several layers of request handling code "above". In order to provide concise, readable stacks, all of this non-user code is filtered out.

# Your first Chain

Source: https://docs.baseten.co/development/chain/getting-started

Build and deploy two example Chains

This quickstart guide contains instructions for creating two Chains:

1. A simple CPU-only “hello world” Chain.
2. A Chain that implements Phi-3 Mini and uses it to write poems.

## Prerequisites

To use Chains, install a recent Truss version and ensure pydantic is v2:

```bash
pip install --upgrade truss 'pydantic>=2.0.0'
```

Truss requires Python `>=3.8,<3.13`. To set up a fresh development environment, you can use the following commands, creating an environment named `chains_env` using `pyenv`:

```bash
curl https://pyenv.run | bash
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo '[[ -d $PYENV_ROOT/bin ]] && export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
source ~/.bashrc
pyenv install 3.11.0
ENV_NAME="chains_env"
pyenv virtualenv 3.11.0 $ENV_NAME
pyenv activate $ENV_NAME
pip install --upgrade truss 'pydantic>=2.0.0'
```

To deploy Chains remotely, you also need a [Baseten account](https://app.baseten.co/signup). It is handy to export your API key to the current shell session or permanently in your `.bashrc`:

```bash ~/.bashrc
export BASETEN_API_KEY="nPh8..."
```

## Example: Hello World

Chains are written in Python files. In your working directory, create `hello_chain/hello.py`:

```sh
mkdir hello_chain
cd hello_chain
touch hello.py
```

In the file, we'll specify a basic Chain. It has two Chainlets:

* `HelloWorld`, the entrypoint, which handles the input and output.
* `RandInt`, which generates a random integer. It is used as a dependency by `HelloWorld`.

Via the entrypoint, the Chain takes a maximum value and returns the string "Hello World! " repeated a variable number of times.

```python hello.py
import random
import truss_chains as chains


class RandInt(chains.ChainletBase):
    async def run_remote(self, max_value: int) -> int:
        return random.randint(1, max_value)


@chains.mark_entrypoint
class HelloWorld(chains.ChainletBase):
    def __init__(self, rand_int=chains.depends(RandInt, retries=3)) -> None:
        self._rand_int = rand_int

    async def run_remote(self, max_value: int) -> str:
        num_repetitions = await self._rand_int.run_remote(max_value)
        return "Hello World! " * num_repetitions
```

### The Chainlet class-contract

Exactly one Chainlet must be marked as the entrypoint with the [`@chains.mark_entrypoint`](/reference/sdk/chains#truss-chains-mark-entrypoint) decorator. This Chainlet is responsible for handling public-facing input and output for the whole Chain in response to an API call.
A Chainlet class has a single public method, [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets), which is the API endpoint for the entrypoint Chainlet and the function that other Chainlets can use as a dependency. The [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets) method must be fully type-annotated with primitive python types or [pydantic models](https://docs.pydantic.dev/latest/).

Chainlets cannot be naively instantiated. The only correct usages are:

1. Make one Chainlet depend on another one via the [`chains.depends()`](/reference/sdk/chains#truss-chains-depends) directive as an `__init__` argument, as shown above for the `RandInt` Chainlet.
2. In the [local debugging mode](/development/chain/localdev#test-a-chain-locally).

Beyond that, you can structure your code as you like, with private methods, imports from other files, and so forth.

Keep in mind that Chainlets are intended for distributed, replicated, remote execution, so using global variables, global state, and certain Python features like importing modules dynamically at runtime should be avoided as they may not work as intended.

### Deploy your Chain to Baseten

To deploy your Chain to Baseten, run:

```bash
truss chains push hello.py
```

The deploy command results in an output like this:

```
                  ⛓️   HelloWorld - Chainlets  ⛓️
╭──────────────────────┬─────────────────────────┬─────────────╮
│ Status               │ Name                    │ Logs URL    │
├──────────────────────┼─────────────────────────┼─────────────┤
│  💚 ACTIVE           │ HelloWorld (entrypoint) │ https://... │
├──────────────────────┼─────────────────────────┼─────────────┤
│  💚 ACTIVE           │ RandInt (dep)           │ https://... │
╰──────────────────────┴─────────────────────────┴─────────────╯

Deployment succeeded.
You can run the chain with:
curl -X POST 'https://chain-.../run_remote' \
    -H "Authorization: Api-Key $BASETEN_API_KEY" \
    -d ''
```

Wait for the status to turn to `ACTIVE` and test invoking your Chain (replace `$INVOCATION_URL` in the command below):

```bash
curl -X POST $INVOCATION_URL \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{"max_value": 10}'
# "Hello World! Hello World! Hello World! "
```

## Example: Poetry with LLMs

Our second example also has two Chainlets, but is somewhat more complex and realistic. The Chainlets are:

* `PoemGenerator`, the entrypoint, which handles the input and output and orchestrates calls to the LLM.
* `PhiLLM`, which runs inference on Phi-3 Mini.

This Chain takes a list of words and returns a poem about each word, written by Phi-3.

We build this Chain in a new working directory (if you are still inside `hello_chain/`, go up one level with `cd ..` first):

```sh
mkdir poetry_chain
cd poetry_chain
touch poems.py
```

A similar end-to-end code example, using Mistral as an LLM, is available in the [examples repo](https://github.com/basetenlabs/model/tree/main/truss-chains/examples/mistral).

### Building the LLM Chainlet

The main difference between this Chain and the previous one is that we now have an LLM that needs a GPU and more complex dependencies.

Copy the following code into `poems.py`:

```python poems.py
import asyncio
from typing import List

import pydantic
import truss_chains as chains
from truss import truss_config

PHI_HF_MODEL = "microsoft/Phi-3-mini-4k-instruct"
# This configures caching of the model weights from the Hugging Face repo
# in the docker image that is used for deploying the Chainlet.
PHI_CACHE = truss_config.ModelRepo(
    repo_id=PHI_HF_MODEL,
    allow_patterns=["*.json", "*.safetensors", ".model"],
)


class Messages(pydantic.BaseModel):
    messages: List[dict[str, str]]


class PhiLLM(chains.ChainletBase):
    # `remote_config` defines the resources required for this Chainlet.
    remote_config = chains.RemoteConfig(
        docker_image=chains.DockerImage(
            # The phi model needs some extra python packages.
            pip_requirements=[
                "accelerate==0.30.1",
                "einops==0.8.0",
                "transformers==4.41.2",
                "torch==2.3.0",
            ]
        ),
        # The phi model needs a GPU and more CPUs.
        compute=chains.Compute(cpu_count=2, gpu="T4"),
        # Cache the model weights in the image.
        assets=chains.Assets(cached=[PHI_CACHE]),
    )

    def __init__(self) -> None:
        # Note that the imports of the *specific* python requirements are
        # pushed down to here. This code will only be executed on the
        # remotely deployed Chainlet, not in the local environment,
        # so we don't need to install these packages in the local
        # dev environment.
        import torch
        import transformers

        self._model = transformers.AutoModelForCausalLM.from_pretrained(
            PHI_HF_MODEL,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        self._tokenizer = transformers.AutoTokenizer.from_pretrained(
            PHI_HF_MODEL,
        )
        self._generate_args = {
            "max_new_tokens": 512,
            "temperature": 1.0,
            "top_p": 0.95,
            "top_k": 50,
            "repetition_penalty": 1.0,
            "no_repeat_ngram_size": 0,
            "use_cache": True,
            "do_sample": True,
            "eos_token_id": self._tokenizer.eos_token_id,
            "pad_token_id": self._tokenizer.pad_token_id,
        }

    async def run_remote(self, messages: Messages) -> str:
        import torch

        model_inputs = self._tokenizer.apply_chat_template(
            messages.messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self._tokenizer(model_inputs, return_tensors="pt")
        input_ids = inputs["input_ids"].to("cuda")
        with torch.no_grad():
            outputs = self._model.generate(
                input_ids=input_ids, **self._generate_args
            )
        output_text = self._tokenizer.decode(outputs[0], skip_special_tokens=True)
        return output_text
```

### Building the entrypoint

Now that we have an LLM, we can use it in a poem generator Chainlet. Add the following code to `poems.py`:

```python poems.py
@chains.mark_entrypoint
class PoemGenerator(chains.ChainletBase):
    def __init__(self, phi_llm: PhiLLM = chains.depends(PhiLLM)) -> None:
        self._phi_llm = phi_llm

    async def run_remote(self, words: list[str]) -> list[str]:
        tasks = []
        for word in words:
            messages = Messages(
                messages=[
                    {
                        "role": "system",
                        "content": (
                            "You are poet who writes short, "
                            "lighthearted, amusing poetry."
                        ),
                    },
                    {"role": "user", "content": f"Write a poem about {word}"},
                ]
            )
            tasks.append(
                asyncio.ensure_future(self._phi_llm.run_remote(messages))
            )
        await asyncio.sleep(0)  # Yield to the event loop, to allow starting the tasks.
        return list(await asyncio.gather(*tasks))
```

Note that we use `asyncio.ensure_future` around each RPC to the LLM Chainlet. This makes the current Python process start these remote calls concurrently, i.e. each call is started before the previous one has finished, which minimizes the overall runtime. To await the results of all calls, `asyncio.gather` is used, which gives us back normal Python objects.

If the LLM is hit with many concurrent requests, it can scale up automatically (if autoscaling is configured). More advanced LLMs have batching capabilities, so even a single instance can serve concurrent requests.
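If you want to bound the client-side fan-out, for example while the LLM deployment is still scaling up, you can wrap the RPCs in a semaphore. The snippet below is an illustrative sketch rather than part of the example above; `MAX_CONCURRENCY` and `generate_poems_bounded` are hypothetical names, and the code assumes it lives in `poems.py` alongside `Messages` and `PhiLLM`:

```python
import asyncio

MAX_CONCURRENCY = 8  # Hypothetical cap on in-flight RPCs to the LLM Chainlet.


async def generate_poems_bounded(phi_llm: PhiLLM, words: list[str]) -> list[str]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

    async def bounded_call(word: str) -> str:
        # Each task waits for a free slot before issuing its remote call.
        async with semaphore:
            messages = Messages(
                messages=[{"role": "user", "content": f"Write a poem about {word}"}]
            )
            return await phi_llm.run_remote(messages)

    # All tasks are started, but at most MAX_CONCURRENCY RPCs run at any one time.
    return list(await asyncio.gather(*(bounded_call(word) for word in words)))
```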
### Deploy your Chain to Baseten

To deploy your Chain to Baseten, run:

```bash
truss chains push poems.py
```

Wait for the status to turn to `ACTIVE` and test invoking your Chain (replace `$INVOCATION_URL` in the command below):

```bash
curl -X POST $INVOCATION_URL \
    -H "Authorization: Api-Key $BASETEN_API_KEY" \
    -d '{"words": ["bird", "plane", "superman"]}'
#[[
#" [INST] Generate a poem about: bird [/INST] In the quiet hush of...",
#" [INST] Generate a poem about: plane [/INST] In the vast, boundless...",
#" [INST] Generate a poem about: superman [/INST] In the realm where..."
#]]
```

# Invocation

Source: https://docs.baseten.co/development/chain/invocation

Call your deployed Chain

Once your Chain is deployed, you can call it via its API endpoint. Chains use the same inference API as models:

* [Environment endpoint](/reference/inference-api/predict-endpoints/environments-run-remote)
* [Development endpoint](/reference/inference-api/predict-endpoints/development-run-remote)
* [Endpoint by ID](/reference/inference-api/predict-endpoints/deployment-run-remote)

Here's an example which calls the development deployment:

```python call_chain.py
import os

import requests

# From the Chain overview page on Baseten,
# e.g. "https://chain-.api.baseten.co/development/run_remote".
CHAIN_URL = ""
baseten_api_key = os.environ["BASETEN_API_KEY"]
# JSON keys and types match the `run_remote` method signature.
data = {...}

resp = requests.post(
    CHAIN_URL,
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json=data,
)

print(resp.json())
```

### How to pass chain input

The data schema of the inference request corresponds to the function signature of [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets) in your entrypoint Chainlet.

For example, for the Hello Chain, `HelloAll.run_remote()`:

```python
async def run_remote(self, names: list[str]) -> str:
```

You'd pass the following JSON payload:

```json
{
  "names": ["Marius", "Sid", "Bola"]
}
```

That is, the keys in the JSON record match the argument names of `run_remote`, and the values match its argument types.

### Async chain inference

Like Truss models, Chains support async invocation. The [guide for models](/inference/async) largely applies, in particular for how to wrap the input and set up the webhook to process results; a request sketch follows the list below. The following additional points are Chains-specific:

* Use Chain-based URLs:
  * `https://chain-{chain}.api.baseten.co/production/async_run_remote`
  * `https://chain-{chain}.api.baseten.co/development/async_run_remote`
  * `https://chain-{chain}.api.baseten.co/deployment/{deployment}/async_run_remote`
  * `https://chain-{chain}.api.baseten.co/environments/{env_name}/async_run_remote`
* Only the entrypoint is invoked asynchronously. Internal Chainlet-to-Chainlet calls run synchronously.
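To make this concrete, here is an illustrative sketch of starting an async invocation of the Hello Chain above. The request shape (a `model_input` wrapper plus a `webhook_endpoint`) is assumed to follow the async model-inference guide linked above; treat that guide as the source of truth for the exact payload, and replace the URL and webhook with your own values:

```python
import os

import requests

# Illustrative values: substitute your Chain's async endpoint and a webhook
# service you control that will receive the result.
CHAIN_ASYNC_URL = "https://chain-{chain}.api.baseten.co/development/async_run_remote"
WEBHOOK_ENDPOINT = "https://example.com/webhook"
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    CHAIN_ASYNC_URL,
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        # Arguments for the entrypoint's `run_remote`, wrapped as described
        # in the async inference guide.
        "model_input": {"names": ["Marius", "Sid", "Bola"]},
        "webhook_endpoint": WEBHOOK_ENDPOINT,
    },
)
# The response acknowledges the queued request; results arrive at the webhook.
print(resp.json())
```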
# Local Development

Source: https://docs.baseten.co/development/chain/localdev

Iterating, Debugging, Testing, Mocking

Chains are designed for production in replicated remote deployments. But alongside that production-ready power, we offer great local development and deployment experiences.

Chains exists to help you build multi-step, multi-model pipelines. The abstractions that Chains introduces are based on six opinionated principles: three for architecture and three for developer experience.

**Architecture principles**

* Each step in the pipeline can set its own hardware requirements and software dependencies, separating GPU and CPU workloads.
* Each component has independent autoscaling parameters for targeted resource allocation, removing bottlenecks from your pipelines.
* Components specify a single public interface for flexible-but-safe composition and are reusable between projects.

**Developer experience principles**

* Eliminate entire taxonomies of bugs by writing typed Python code and validating inputs, outputs, module initializations, function signatures, and even remote server configurations.
* Seamless local testing and cloud deployments: test Chains locally with support for mocking the output of any step, and simplify your cloud deployment loops by separating large model deployments from quick updates to glue code.
* Use Chains to orchestrate existing model deployments, like pre-packaged models from Baseten's model library, alongside new model pipelines built entirely within Chains.

Locally, a Chain is just Python files in a source tree. While that gives you a lot of flexibility in how you structure your code, there are some constraints and rules to follow to ensure successful distributed, remote execution in production.

The best thing you can do while developing locally with Chains is to run your code frequently, even if you do not have a `__main__` section: the Chains framework runs various validations at module initialization to help you catch issues early. Additionally, running `mypy` and fixing reported type errors can help you find problems early, in a rapid feedback loop, before attempting a (much slower) deployment.

Complementary to purely local development, Chains also has a "watch" mode, like Truss; see the [watch guide](/development/chain/watch).

## Test a Chain locally

Let's revisit our "Hello World" Chain:

```python hello_chain/hello.py
import asyncio

import truss_chains as chains


# This Chainlet does the work.
class SayHello(chains.ChainletBase):
    async def run_remote(self, name: str) -> str:
        return f"Hello, {name}"


# This Chainlet orchestrates the work.
@chains.mark_entrypoint
class HelloAll(chains.ChainletBase):
    def __init__(self, say_hello_chainlet=chains.depends(SayHello)) -> None:
        self._say_hello = say_hello_chainlet

    async def run_remote(self, names: list[str]) -> str:
        tasks = []
        for name in names:
            tasks.append(asyncio.ensure_future(self._say_hello.run_remote(name)))
        return "\n".join(await asyncio.gather(*tasks))


# Test the Chain locally.
if __name__ == "__main__":
    with chains.run_local():
        hello_chain = HelloAll()
        result = asyncio.get_event_loop().run_until_complete(
            hello_chain.run_remote(["Marius", "Sid", "Bola"])
        )
        print(result)
```

When the `__main__` module is run, local instances of the Chainlets are created, allowing you to test the functionality of your Chain just by executing the Python file:

```bash
cd hello_chain
python hello.py

# Hello, Marius
# Hello, Sid
# Hello, Bola
```
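The same pattern also works inside a test runner, if you prefer that over a `__main__` block. The sketch below is illustrative and not part of the Baseten docs; it assumes a hypothetical `test_hello.py` next to `hello.py` and that `pytest` is installed:

```python
# test_hello.py (hypothetical file name)
import asyncio

import truss_chains as chains

from hello import HelloAll


def test_hello_all():
    # run_local() creates local instances of the Chainlets, as above.
    with chains.run_local():
        chain = HelloAll()
        result = asyncio.run(chain.run_remote(["Marius", "Sid", "Bola"]))
    assert result == "Hello, Marius\nHello, Sid\nHello, Bola"
```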
## Mock execution of GPU Chainlets

Using `run_local()` to run your code locally requires that your development environment have the compute resources and dependencies that each Chainlet needs. But that often isn't possible when building with AI models.

Chains offers a workaround, mocking, to let you test the coordination and business logic of your multi-step inference pipeline without worrying about running the model locally.

The second example in the [getting started guide](/development/chain/getting-started) implements a Truss Chain for generating poems with Phi-3. This Chain has two Chainlets:

1. The `PhiLLM` Chainlet, which requires an NVIDIA T4 GPU.
2. The `PoemGenerator` Chainlet, which easily runs on a CPU.

If you have an NVIDIA T4 under your desk, good for you. For the rest of us, we can mock the `PhiLLM` Chainlet that is infeasible to run locally, so that we can quickly test the `PoemGenerator` Chainlet.

To do this, we define a mock Phi-3 model in our `__main__` module and give it a [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets) method that produces a test output matching the output type we expect from the real Chainlet. Then, we inject an instance of this mock Chainlet into our Chain:

```python poems.py
if __name__ == "__main__":

    class FakePhiLLM:
        async def run_remote(self, messages: Messages) -> str:
            # Derive the word from the last user message, e.g.
            # "Write a poem about bird" -> "bird".
            word = messages.messages[-1]["content"].split(" ")[-1]
            return f"Here's a poem about {word}"

    with chains.run_local():
        poem_generator = PoemGenerator(phi_llm=FakePhiLLM())
        result = asyncio.get_event_loop().run_until_complete(
            poem_generator.run_remote(words=["bird", "plane", "superman"])
        )
        print(result)
```

And run your Python file:

```bash
python poems.py

# ["Here's a poem about bird", "Here's a poem about plane", "Here's a poem about superman"]
```

### Typing of mocks

You may notice that the argument `phi_llm` expects a type `PhiLLM`, while we pass an instance of `FakePhiLLM`. These aren't the same, which is formally a type error.

However, this works at runtime because we constructed `FakePhiLLM` to implement the same *protocol* as the real thing. We can make this explicit by defining a `Protocol` as a type annotation:

```python
from typing import Protocol


class PhiProtocol(Protocol):
    async def run_remote(self, messages: Messages) -> str: ...
```

and changing the argument type in `PoemGenerator`:

```python
@chains.mark_entrypoint
class PoemGenerator(chains.ChainletBase):
    def __init__(self, phi_llm: PhiProtocol = chains.depends(PhiLLM)) -> None:
        self._phi_llm = phi_llm
```

This is a bit more work and not needed to execute the code, but it shows how typing consistency can be achieved, if desired.

# Overview

Source: https://docs.baseten.co/development/chain/overview

Chains is a framework for building robust, performant multi-step and multi-model inference pipelines and deploying them to production. It addresses the common challenges of managing latency, cost, and dependencies for complex workflows, while leveraging Truss' existing battle-tested performance, reliability, and developer toolkit.
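To make the shape of a Chain concrete, here is a minimal, illustrative sketch using the building blocks shown above (`chains.ChainletBase`, `chains.depends`, `chains.mark_entrypoint`); the Chainlet names and logic are placeholders rather than a prescribed structure:

```python
import truss_chains as chains


class Preprocess(chains.ChainletBase):
    # One step of the pipeline; each Chainlet can declare its own
    # dependencies and compute resources via `remote_config`.
    async def run_remote(self, text: str) -> str:
        return text.strip().lower()


@chains.mark_entrypoint
class Pipeline(chains.ChainletBase):
    # The entrypoint orchestrates the other steps and exposes the public API.
    def __init__(self, preprocess=chains.depends(Preprocess)) -> None:
        self._preprocess = preprocess

    async def run_remote(self, text: str) -> str:
        cleaned = await self._preprocess.run_remote(text)
        return f"processed: {cleaned}"
```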