# AI tools

Source: https://docs.baseten.co/ai-tools

Connect AI tools to Baseten documentation for context-aware assistance with deploying and serving models.

Baseten docs are optimized for AI tools. Connect your assistants, coding tools, and agents directly to the docs so they have up-to-date context when helping you build on Baseten. Every page includes a contextual menu (the icon in the top-right corner of any page) with shortcuts to copy content and connect your MCP server.

## MCP server

The Model Context Protocol (MCP) connects AI tools directly to Baseten documentation. When connected, your AI tool searches the docs in real time while generating responses, so you get answers grounded in current documentation rather than stale training data. The Baseten docs MCP server is available at:

```
https://docs.baseten.co/mcp
```

### Claude Code

Add the MCP server to Claude Code:

```bash theme={"system"}
claude mcp add --transport http baseten-docs https://docs.baseten.co/mcp
```

Claude Code searches Baseten docs automatically when relevant to your prompts.

### Claude

Navigate to the **Connectors** page in Claude settings. Select **Add custom connector**, then enter:

* **Name:** Baseten Docs
* **URL:** `https://docs.baseten.co/mcp`

Select **Add**. When starting a conversation, select the attachments button (the plus icon) and choose the Baseten Docs connector. Claude searches the docs as needed while responding.

### Cursor

Use `Cmd + Shift + P` (macOS) or `Ctrl + Shift + P` (Windows/Linux) to open the command palette. Search for **"Open MCP settings"**. Select **Add custom MCP**. This opens your `mcp.json` file.
Add the Baseten docs server:

```json mcp.json theme={"system"}
{
  "mcpServers": {
    "baseten-docs": {
      "type": "http",
      "url": "https://docs.baseten.co/mcp"
    }
  }
}
```

### VS Code

Create or update `.vscode/mcp.json` in your project:

```json .vscode/mcp.json theme={"system"}
{
  "servers": {
    "baseten-docs": {
      "type": "http",
      "url": "https://docs.baseten.co/mcp"
    }
  }
}
```

### Other MCP clients

Any MCP-compatible tool (Goose, ChatGPT, Windsurf, and others) can connect using the server URL `https://docs.baseten.co/mcp`. Refer to your tool's documentation for how to add an MCP server. You can also use `npx add-mcp` to auto-detect supported AI tools on your system and configure them:

```bash theme={"system"}
npx add-mcp https://docs.baseten.co
```

## Skills

The skills file describes what AI agents can accomplish with Baseten, including required inputs and constraints. AI coding tools use this file to understand Baseten capabilities without reading every documentation page. Install the Baseten docs skill into your AI coding tool:

```bash theme={"system"}
npx skills add https://docs.baseten.co
```

This gives your AI tool structured knowledge of Baseten's capabilities so it can help you deploy models, configure autoscaling, set up inference endpoints, and more with product-aware guidance. View the skill file directly at [docs.baseten.co/skill.md](https://docs.baseten.co/skill.md).

Skills and MCP serve complementary purposes. **Skills** tell an AI tool *what Baseten can do* and how to do it. **MCP** lets the tool *search current documentation* for specific details. For the best results, install both.

## llms.txt

The `llms.txt` file is an industry-standard directory that helps LLMs index documentation efficiently, similar to how `sitemap.xml` helps search engines. Baseten docs automatically host two versions:

* [docs.baseten.co/llms.txt](https://docs.baseten.co/llms.txt): a structured list of all pages with descriptions.
* [docs.baseten.co/llms-full.txt](https://docs.baseten.co/llms-full.txt): the full text content of all pages. These files stay up to date automatically and require no configuration. AI tools and search engines like ChatGPT, Perplexity, and Google AI Overviews use them to understand and cite Baseten documentation. ## Markdown access Every documentation page is available as Markdown by appending `.md` to the URL. For example: ``` https://docs.baseten.co/quickstart.md ``` AI agents receive page content as Markdown instead of HTML, which reduces token usage and improves processing speed. You can use this to quickly copy any page's content into an AI conversation. ## Contextual menu reference The contextual menu on each page provides one-click access to these integrations. Select the menu icon in the top-right corner of any page. | Option | Description | | ------------------- | --------------------------------------------------------- | | Copy page | Copies the page as Markdown for pasting into any AI tool. | | View as Markdown | Opens the page as raw Markdown in a new tab. | | Copy MCP server URL | Copies the MCP server URL to your clipboard. | | Connect to Cursor | Installs the MCP server in Cursor. | | Connect to VS Code | Installs the MCP server in VS Code. | # How Baseten works Source: https://docs.baseten.co/concepts/howbasetenworks Follow a model from truss push to a running endpoint: the build pipeline, request routing, autoscaling, and deployment lifecycle. The [overview](/overview) covers Baseten's capabilities. This page covers the underlying mechanics: how a config file becomes a running endpoint, how Baseten routes requests to your model, how the autoscaler manages capacity, and how you promote a model from development to production. ## Multi-cloud Capacity Management (MCM) Behind every Baseten deployment is our Multi-cloud Capacity Management (MCM) system. 
MCM acts as the infrastructure control plane, unifying thousands of GPUs across 10+ cloud service providers and multiple geographic regions. When you request a resource (an H100 in US-East-1 or a cluster of B200s in a private region), MCM provisions the hardware, configures networking, and monitors health. It abstracts differences between cloud providers to ensure the Baseten Inference Stack runs identically on any underlying infrastructure. This system powers Baseten's high availability by enabling active-active deployments across different clouds. If a region or provider faces a capacity crunch or outage, MCM rapidly re-routes and re-provisions workloads to maintain service continuity. ## The build pipeline When you run `truss push`, the CLI validates your `config.yaml`, archives your project directory, and uploads it to cloud storage. Baseten receives the archive and starts the build. For [Engine-Builder-LLM](/engines/engine-builder-llm/overview), Baseten downloads model weights from the source repository (Hugging Face, S3, or GCS) and compiles them with TensorRT-LLM. The compilation step builds optimized CUDA kernels for the target GPU architecture, applies quantization (FP8, FP4) if configured, and sets up tensor parallelism across multiple GPUs. Baseten packages the compiled engine, runtime configuration, and serving infrastructure into a container, deploys it to GPU infrastructure, and exposes it as an API endpoint. The `truss push` command returns once the upload finishes. For engine-based deployments, compilation can take several minutes. Watch progress in the deployment logs or check the dashboard, which shows "Active" when the endpoint is ready for requests. For [custom model code](/development/model/custom-model-code) deployments, the build is faster: Baseten installs your Python dependencies, packages your `Model` class into a container, and deploys it. You remain responsible for any inference optimization in custom builds. 
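For custom model code, the `Model` class packaged in that build follows Truss's load/predict contract. A minimal sketch, with a trivial stand-in for real inference logic:

```python theme={"system"}
# model/model.py -- minimal sketch of a Truss custom model.
# The uppercasing "model" below is a placeholder for real weights and inference.

class Model:
    def __init__(self, **kwargs):
        # Truss passes config and secrets through kwargs; store what you need.
        self._model = None

    def load(self):
        # Runs once when a replica starts: load weights, warm caches.
        self._model = lambda text: text.upper()

    def predict(self, model_input: dict) -> dict:
        # Called for each request with the parsed JSON body.
        return {"output": self._model(model_input["text"])}
```

Baseten calls `load()` once per replica at startup and `predict()` for every request, so expensive initialization belongs in `load()`.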
## Request routing Each deployment gets a dedicated subdomain: `https://model-{model_id}.api.baseten.co/`. The URL path determines which deployment handles the request. Requests to `/production/predict` go to the production environment, while `/development/predict` goes to the development deployment. You can also target a specific deployment by ID or a custom environment by name. Once the environment is resolved, the load balancer routes the request to an active replica. If the model has scaled to zero, Baseten spins up a replica and queues the request until the model loads and becomes ready. The caller receives the response regardless of whether the model was warm or cold. Engine-based deployments serve an [OpenAI-compatible API](/reference/inference-api/chat-completions) at the `/v1/chat/completions` path, so any code written for the OpenAI SDK works without modification. Custom model deployments use the [predict API](/reference/inference-api/overview), which accepts and returns arbitrary JSON. For long-running workloads, [async requests](/inference/async) return a request ID immediately. The request enters a queue managed by an async request service. A background worker then calls your model and delivers the result via webhook. Sync requests take priority over async requests when competing for concurrency slots to prevent background work from starving real-time traffic. ## Autoscaling Baseten's autoscaler watches in-flight request counts and adjusts replicas to maintain each one near its [concurrency target](/deployment/autoscaling/overview). Scaling up is immediate. When average utilization crosses the target threshold (default 70%) within the autoscaling window (default 60 seconds), the autoscaler adds replicas up to the configured maximum. Scaling down is deliberately slow. When traffic drops, the autoscaler flags excess replicas for removal but keeps them alive for a configurable delay (default 900 seconds). 
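To make the path shapes concrete, here is a sketch of how a deployment target maps to a predict URL (illustrative only; `abc123` is a placeholder model ID, and deployment-by-ID and custom environment targets use paths not shown here):

```python theme={"system"}
def predict_url(model_id: str, target: str) -> str:
    # target is "production" or "development", per the paths described above.
    assert target in ("production", "development")
    return f"https://model-{model_id}.api.baseten.co/{target}/predict"

print(predict_url("abc123", "production"))
# https://model-abc123.api.baseten.co/production/predict
```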
It uses exponential backoff: removing half the excess replicas, waiting, and then removing half again. This prevents the cluster from thrashing during bursty traffic. Setting `min_replica` to 0 enables scale-to-zero. The model stops incurring GPU cost when idle, but the next request triggers a cold start. Setting `min_replica` to 1 or higher keeps warm capacity ready at all times, trading cost for lower latency. ## Cold starts and the weight delivery network The slowest part of a cold start is loading model weights, which can reach hundreds of gigabytes. Baseten addresses this with the [Baseten Delivery Network (BDN)](/development/model/bdn), a multi-tier caching system for model weights. When you first deploy, BDN mirrors your model weights from the source repository to Baseten's own blob storage. After that, no cold start depends on an upstream service like Hugging Face or S3. When a new replica starts, the BDN agent on the node fetches a manifest for the weights, downloads them through an in-cluster cache (shared across all pods in the cluster), and stores them in a node-level cache (shared across all replicas on the same node). Identical files across different models are deduplicated, so a GLM fine-tune that shares most weights with the base model only downloads the delta. Subsequent cold starts on the same node or in the same cluster are significantly faster than the first. Container images use streaming, so the model begins loading weights before the image download completes. ## Environments and promotion Every model starts with a development deployment: a single replica with scale-to-zero enabled and live reload for fast iteration. When the model is ready for production traffic, promote it to an environment. The [production environment](/deployment/environments) exists by default. You can create additional environments (staging, shadow, or canary) for testing and gradual rollouts. 
Each environment has a stable endpoint URL, its own autoscaling settings, and dedicated metrics. The endpoint URL remains constant when you promote new deployments, so your application code doesn't need to change. Promotion replaces the current deployment in an environment with the new one. The new deployment inherits the environment's autoscaling settings. Baseten demotes the previous deployment and scales it to zero, allowing you to roll back by re-promoting it. You can also push directly to an environment with `truss push --environment staging` to skip the development stage. Only one promotion can be active per environment at a time to prevent conflicting updates. # Why Baseten Source: https://docs.baseten.co/concepts/whybaseten Mission-critical inference with dedicated infrastructure, global scale, and full control. Baseten provides high-performance inference for teams that have outgrown shared API endpoints. We deliver the performance of custom-built infrastructure with the ease of a managed platform, allowing you to deploy and scale any model behind a production-grade API. ## Mission-critical inference Inference is the core of your application. When it fails, your product stops working. We built Baseten to handle mission-critical workloads, offering 99.99% uptime and low-latency performance at any scale. Operating thousands of GPUs across multiple regions and cloud providers exposes the limits of traditional deployment. Single points of failure, regional capacity constraints, and the overhead of managing heterogeneous clouds create significant operational risk. We solved these problems with our Multi-cloud Capacity Management (MCM) system. ## Multi-cloud Capacity Management (MCM) MCM is a unified control layer that provisions and scales resources across 10+ clouds and regions. It handles the complexity of cloud-agnostic orchestration, giving you a single pane of glass for your entire inference fleet. 
Whether you run in our cloud, yours, or both, the experience is identical. MCM enables three deployment modes, all sharing the same high-performance inference stack: ### Baseten Cloud Fully managed, multi-cloud inference. This is the fastest path to production, offering limitless scale and global latency optimization. We manage the infrastructure so you can focus on your models. ### Baseten Self-hosted The full Baseten stack inside your own VPC. Use this when you have strict data security, privacy, or sovereignty requirements. You maintain complete control over your data and networking while benefiting from Baseten’s autoscaling and performance optimizations. ### Baseten Hybrid The best of both worlds. Run core workloads in your VPC for maximum control and burst to Baseten Cloud on demand. This approach eliminates the trade-off between strict compliance and the need for elastic flex capacity. ## The Baseten advantage ML teams at Abridge, Writer, and Patreon use Baseten to serve millions of users. Our platform is built on four pillars that ensure your success in production: * **Model performance:** Our engineers apply the latest research in custom kernels and runtimes, delivering low latency and high throughput out of the box. * **Reliable infrastructure:** Deploy across clusters and clouds with active-active reliability and built-in redundancy. * **Operational control:** Use deep observability, secret management, and fine-grained autoscaling to maintain your SLAs. * **Compliance by design:** SOC 2 Type II, HIPAA, and GDPR compliance ensure that your deployments meet the highest standards for data security. 
## Comparison of deployment options | Feature | Baseten Cloud | Self-hosted | Hybrid | | :----------------- | :--------------------- | :----------------- | :----------------------- | | **Scaling** | Unlimited, multi-cloud | Within your VPC | VPC with Cloud spillover | | **Data Residency** | Region-locked options | Full local control | Local with Cloud options | | **Compliance** | SOC 2, HIPAA, GDPR | Your compliance | Hybrid compliance | | **Time to Market** | Hours | Days | Days | Baseten gives you the visibility and control of your own infrastructure without the operational burden. Whether you're deploying a single LLM or an entire library of models, you can start with a managed solution and transition to self-hosted or hybrid modes as your requirements evolve. # Cold starts Source: https://docs.baseten.co/deployment/autoscaling/cold-starts Understand cold starts and how to minimize their impact on your deployments. A *cold start* is the time required to initialize a new replica when scaling up. Cold starts affect the latency of requests that trigger new replica creation. *** ## When cold starts happen Cold starts occur in two scenarios: 1. **Scale-from-zero**: When a deployment with zero active replicas receives its first request. 2. **Scaling events**: When traffic increases and the autoscaler adds new replicas. *** ## What contributes to cold start time Cold start duration depends on several factors: | Factor | Impact | | -------------- | ---------------------------------------------------------------------- | | Model loading | Loading model weights (10s–100s of GBs), typically the dominant factor | | Container pull | Downloading Docker image layers | | Initialization | Running your model's setup code | For large models, cold starts can take minutes. Model weight downloads are usually the bottleneck. Even with optimizations, the physics of moving hundreds of gigabytes of data creates inherent lag. 
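A back-of-the-envelope calculation shows why. The sizes and bandwidth below are illustrative, not Baseten measurements:

```python theme={"system"}
# Lower bound on weight-download time: total bits / link bandwidth.
# Real cold starts add container pull and model initialization on top.

def download_seconds(weights_gb: float, bandwidth_gbps: float) -> float:
    return (weights_gb * 8) / bandwidth_gbps  # GB -> Gbit, then divide by Gbit/s

# A 140 GB checkpoint over a 10 Gbit/s link:
print(download_seconds(140, 10))  # 112.0 seconds before any setup code runs
```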
*** ## Minimizing cold starts ### Keep replicas warm Set [`min_replica`](/reference/management-api/deployments/autoscaling/updates-a-deployments-autoscaling-settings) to always have at least one replica ready to serve requests. This eliminates cold starts for the first request but increases cost. ```json theme={"system"} { "min_replica": 1 } ``` For production redundancy, set `min_replica ≥ 2` so one replica can fail during maintenance without causing cold starts. ### Pre-warm before expected traffic For predictable traffic spikes, increase min replicas before the expected load: ```bash theme={"system"} # 10-15 minutes before expected spike curl -X PATCH \ https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -d '{"min_replica": 5}' ``` After traffic stabilizes, reset to your normal minimum. ### Use longer scale-down delay A longer scale-down delay keeps replicas warm during temporary traffic dips: ```json theme={"system"} { "scale_down_delay": 900 } ``` This prevents cold starts when traffic returns within the delay window. *** ## Platform optimizations Baseten automatically applies several optimizations to reduce cold start times: **Baseten Delivery Network (Recommended)**: The [`weights`](/development/model/bdn) configuration optimizes cold starts by mirroring weights to Baseten's infrastructure and caching them close to your model pods. See [Baseten Delivery Network (BDN)](/development/model/bdn) for full configuration options. **Network accelerator (Legacy)**: Parallelized byte-range downloads speed up model loading from Hugging Face, S3, GCS, and R2. Network Acceleration is deprecated in favor of the new `weights` configuration, which provides superior cold start performance through multi-tier caching. See [Baseten Delivery Network (BDN)](/development/model/bdn) for the recommended approach. 
**Image streaming**: Optimized images stream into nodes, allowing model loading to begin before the full download completes: ``` Successfully pulled streaming-enabled image in 15.851s. Image size: 32 GB. ``` These optimizations are applied automatically. *** ## The tradeoff Cold starts create a fundamental tradeoff between **cost** and **latency**: | Approach | Cost | Latency | | -------------------------------- | ----------------------------- | ------------------------------------------ | | Scale to zero (`min_replica: 0`) | Lower: no cost when idle | Higher: first request waits for cold start | | Always on (`min_replica: ≥1`) | Higher: pay for idle replicas | Lower: no cold starts | For latency-sensitive production workloads, the cost of keeping replicas warm is usually justified. For batch workloads or development, scale-to-zero often makes sense. *** ## Next steps * [Autoscaling](/deployment/autoscaling/overview): Configure min replicas and scale-down delay. * [Traffic patterns](/deployment/autoscaling/traffic-patterns): Pre-warming strategies for different traffic types. * [Troubleshooting](/troubleshooting/deployments#autoscaling-issues): Diagnose cold start issues. # Autoscaling Source: https://docs.baseten.co/deployment/autoscaling/overview Configure autoscaling to dynamically adjust replicas based on traffic while minimizing idle compute costs. Autoscaling is a control loop that adjusts the number of **replicas** backing a deployment based on demand. The goal is to balance **performance** (latency and throughput) against **cost** (GPU hours). Autoscaling is reactive by nature. Baseten provides default settings that work for most workloads. Tune your autoscaling settings based on your model and traffic. | Parameter | Default | Range | What it controls | | ------------------ | ------- | -------- | ---------------------------------------- | | Min replicas | 0 | ≥ 0 | Baseline capacity (0 = scale to zero). | | Max replicas | 1 | ≥ 1 | Cost/capacity ceiling. 
| | Autoscaling window | 60s | 10–3600s | Time window for traffic analysis. | | Scale-down delay | 900s | 0–3600s | Wait time before removing idle replicas. | | Concurrency target | 1 | ≥ 1 | Requests per replica before scaling. | | Target utilization | 70% | 1–100% | Headroom before scaling triggers. |

Configure autoscaling settings through the Baseten UI or API:

1. Select your deployment.
2. Under **Replicas** for your production environment, choose **Configure**.
3. Configure the autoscaling settings and choose **Update**.

```bash theme={"system"}
curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "min_replica": 2,
    "max_replica": 10,
    "concurrency_target": 32,
    "target_utilization_percentage": 70,
    "autoscaling_window": 60,
    "scale_down_delay": 900
  }'
```

```python theme={"system"}
import requests
import os

API_KEY = os.environ.get("BASETEN_API_KEY")

response = requests.patch(
    "https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={
        "min_replica": 2,
        "max_replica": 10,
        "concurrency_target": 32,
        "target_utilization_percentage": 70,
        "autoscaling_window": 60,
        "scale_down_delay": 900
    }
)
print(response.json())
```

For more information, see the [API reference](/reference/management-api/deployments/autoscaling/updates-a-deployments-autoscaling-settings).

***

## How autoscaling works

When the **average requests per active replica** exceeds the **concurrency target × target utilization** within the **autoscaling window**, more replicas are created until:

* The concurrency target is met.
* The maximum replica count is reached.
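The scale-up rule above can be sketched as a small calculation (a sketch of the documented formula, not Baseten's implementation; the numbers are illustrative):

```python theme={"system"}
import math

def desired_replicas(in_flight: int, concurrency_target: int,
                     target_utilization: float, max_replica: int) -> int:
    # Replicas needed to keep average in-flight requests per replica
    # below concurrency_target * target_utilization, capped at max_replica.
    needed = math.ceil(in_flight / (concurrency_target * target_utilization))
    return min(max(needed, 1), max_replica)

# 100 in-flight requests against a concurrency target of 32 at 70% utilization:
print(desired_replicas(100, 32, 0.70, max_replica=10))  # 5
```

Each replica is treated as handling 32 × 0.70 ≈ 22.4 requests, so 100 in-flight requests call for 5 replicas.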
When traffic drops below the concurrency target, excess replicas are flagged for removal. The **scale-down delay** ensures replicas are not removed prematurely:

* If traffic returns before the delay ends, replicas remain active.
* Scale-down uses exponential back-off: cut half the excess replicas, wait, then cut half again.
* Scaling stops when the minimum replica count is reached.

***

## Replicas

Replicas are individual instances of your model, each capable of serving requests independently. The autoscaler adjusts the number of replicas based on traffic, but you control the boundaries with minimum and maximum replica settings.

### Min replicas

The floor for your deployment's capacity. The autoscaler won't scale below this number.

**Range:** ≥ 0

The default of 0 enables *scale-to-zero*: your deployment costs nothing when idle, but the first request triggers a [cold start](/deployment/autoscaling/cold-starts). For large models, cold starts can take minutes. For production deployments, set `min_replica` to at least 2. This provides redundancy if one replica fails and eliminates cold starts.

### Max replicas

The ceiling for your deployment's capacity. The autoscaler won't scale above this number.

**Range:** ≥ 1

This setting protects against runaway scaling and unexpected costs. If traffic exceeds max replica capacity, requests queue rather than triggering new replicas. The default of 1 means no autoscaling: exactly one replica regardless of load. Estimate max replicas:

$$
(peak\_requests\_per\_second / throughput\_per\_replica) + buffer
$$

For high-volume workloads requiring guaranteed capacity, [contact Baseten](mailto:support@baseten.co) about reserved capacity options.

***

## Scaling triggers

Scaling triggers determine when the autoscaler adds or removes capacity. The two key settings, **concurrency target** and **target utilization**, work together to define when your deployment needs more or fewer replicas.

### Concurrency target

How many requests each replica can handle simultaneously.
This directly determines replica count for a given load.

**Range:** ≥ 1

The autoscaler calculates desired replicas:

$$
ceiling(in\_flight\_requests / (concurrency\_target \times target\_utilization))
$$

*In-flight requests* are requests sent to your model that haven't returned a response (for streaming, until the stream completes). This count is exposed as [`baseten_concurrent_requests`](/observability/export-metrics/supported-metrics#baseten_concurrent_requests) in the metrics dashboard and metrics export.

The default of 1 is appropriate for models that process one request at a time (like image generation consuming all GPU memory). For models with batching (LLMs, embeddings), higher values reduce cost.

**Tradeoff:** Higher concurrency = fewer replicas (lower cost) but more per-replica queueing (higher latency). Lower concurrency = more replicas (higher cost) but less queueing (lower latency).

**Starting points by model type:**

| Model type | Starting concurrency |
| ----------------------- | -------------------- |
| Standard Truss model | 1 |
| vLLM / LLM inference | 32–128 |
| SGLang | 32 |
| Text embeddings (TEI) | 32 |
| BEI embeddings | 96+ (min ≥ 8) |
| Whisper (async batch) | 256 |
| Image generation (SDXL) | 1 |

For engine-specific guidance, see [Autoscaling engines](/engines/performance-concepts/autoscaling-engines).

**Concurrency target** controls requests sent *to* a replica and triggers autoscaling. **predict\_concurrency** (Truss config.yaml) controls requests processed *inside* the container. Concurrency target should be less than or equal to predict\_concurrency. See the `predict_concurrency` field in the [Truss configuration reference](/reference/truss-configuration) for details.

### Target utilization

Headroom before scaling triggers. The autoscaler scales when utilization reaches this percentage of the concurrency target.
**Range:** 1–100%

The effective threshold is:

$$
concurrency\_target × target\_utilization
$$

With concurrency target 10 and utilization 70%, scaling triggers at 7 concurrent requests (10 × 0.70), leaving 30% headroom. Lower values (50–60%) provide more headroom for spikes but cost more. Higher values (80%+) are cost-efficient for steady traffic but absorb spikes less effectively.

Target utilization is **not** GPU utilization. It measures request slot usage relative to your concurrency target, not hardware utilization.

***

## Scaling dynamics

Scaling dynamics control how quickly and smoothly the autoscaler responds to traffic changes. These settings help you balance responsiveness against stability.

### Autoscaling window

How far back (in seconds) the autoscaler looks when measuring traffic. Traffic is averaged over this window to make scaling decisions.

**Range:** 10–3600 seconds

A 60-second window considers average load over the past minute, smoothing out momentary spikes. Shorter windows (30–60s) react quickly to traffic changes. Longer windows (2–5 min) ignore short-lived fluctuations and prevent chasing noise.

### Scale-down delay

How long (in seconds) the autoscaler waits after load drops before removing replicas. Prevents premature scale-down during temporary dips.

**Range:** 0–3600 seconds

When load drops, the autoscaler starts a countdown. If load stays low for the full delay, it removes replicas using exponential back-off (half the excess, wait, half again). This is your primary lever for preventing *oscillation* (thrashing). If replicas repeatedly scale up and down, increase this first.

A **short window** with a **long delay** gives you fast scale-up while maintaining capacity during temporary dips. This is a good starting configuration for most workloads.

***

## Development deployments

Development deployments have fixed replica limits but allow modification of other autoscaling settings.
The replica constraints are optimized for the development workflow: rapid iteration with live reloading using the [`truss watch`](/reference/cli/truss/watch) command, rather than production traffic handling.

| Setting | Value | Modifiable |
| ------------------ | ----------- | ---------- |
| Min replicas | 0 | No |
| Max replicas | 1 | No |
| Autoscaling window | 60 seconds | Yes |
| Scale-down delay | 900 seconds | Yes |
| Concurrency target | 1 | Yes |
| Target utilization | 70% | Yes |

The single-replica limit means development deployments aren't suitable for load testing or handling real traffic. To enable full autoscaling with configurable replica settings, [promote the deployment to production](/deployment/deployments).

***

## Next steps

* Identify your traffic pattern and get recommended starting settings.
* Understand cold starts and how to minimize their impact.
* Tune autoscaling settings for your traffic pattern.
* Complete autoscaling API documentation.

***

## Troubleshooting

Having issues with autoscaling? See [Autoscaling troubleshooting](/troubleshooting/deployments#autoscaling-issues) for solutions to common problems like oscillation, slow scale-up, and unexpected costs.

# Traffic patterns

Source: https://docs.baseten.co/deployment/autoscaling/traffic-patterns

Identify your traffic pattern and configure autoscaling settings to match.

Different traffic patterns require different autoscaling configurations. Identify your pattern below for recommended starting settings. These are **starting points**, not final answers. Monitor your deployment's performance and adjust based on observed behavior. See [Autoscaling](/deployment/autoscaling/overview) for parameter details.

***

## Jittery traffic

Small, frequent spikes that quickly return to baseline.

### Characteristics

* Baseline replica count is steady, but **spikes up by 2x several times per hour**.
* Spikes are short-lived and return to baseline quickly.
* Often not real load growth, just temporary surges causing overreaction. ### Common causes * Consumer products with intermittent usage bursts. * Traffic splitting or A/B testing with low percentages. * Polling clients with synchronized intervals. ### Recommended settings | Parameter | Value | Why | | ------------------ | ----------------- | ----------------------------------------------- | | Autoscaling window | **2-5 minutes** | Smooth out noise, avoid reacting to every spike | | Scale-down delay | **300-600s** | Moderate stability | | Target utilization | **70%** | Default is fine | | Concurrency target | Benchmarked value | Start conservative | A longer autoscaling window averages out the jitter so the autoscaler doesn't chase every small spike. You're trading reaction speed for stability, which is acceptable when the spikes aren't sustained load increases. If you're still seeing oscillation with these settings, increase the scale-down delay before lowering target utilization. *** ## Bursty traffic ### Characteristics * Traffic **jumps sharply** (2x+ within 60 seconds). * Stays high for a sustained period before dropping. * The "pain" is queueing and latency spikes while new replicas start. ### Common causes * Daily morning ramp-up (users starting their day). * Marketing events, product launches, viral moments. * Top-of-hour scheduled jobs or cron-triggered traffic. ### Recommended settings | Parameter | Value | Why | | ------------------ | ---------- | --------------------------------------------- | | Autoscaling window | **30-60s** | React quickly to genuine load increases | | Scale-down delay | **900s+** | Handle back-to-back waves without thrashing | | Target utilization | **50-60%** | More headroom absorbs the burst while scaling | | Min replicas | **≥2** | Redundancy + reduces cold start impact | Short window means fast reaction. Long delay prevents scaling down between waves. Lower utilization gives you buffer capacity while new replicas start. 
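Both recommendations hinge on what the averaging window lets the autoscaler see. A toy rolling-average comparison (illustrative only, not the actual autoscaler logic):

```python theme={"system"}
# Toy illustration of how the autoscaling window smooths traffic.
# Real scaling decisions also involve concurrency target, utilization,
# and scale-down delay.

def rolling_avg(series, window):
    # Average over each full window of `window` consecutive samples.
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# In-flight request samples with one short-lived spike.
traffic = [4, 4, 4, 12, 4, 4, 4, 4, 4, 4]

print(max(rolling_avg(traffic, 1)))  # 12.0 -> a short window sees the spike
print(max(rolling_avg(traffic, 5)))  # 5.6  -> a long window smooths it away
```

A 1-sample window reacts to the jittery spike while a 5-sample window barely registers it, which is why jittery traffic wants longer windows and bursty traffic, with sustained load changes, benefits from short ones.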
### Pre-warming for predictable bursts

If your bursts are predictable (morning ramp, scheduled events), pre-warm by bumping min replicas before the expected spike:

```bash theme={"system"}
curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{"min_replica": 5}'
```

After the burst subsides, reset to your normal minimum:

```bash theme={"system"}
curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{"min_replica": 2}'
```

Automate pre-warming with cron jobs or your orchestration system. Bumping min replicas 10-15 minutes before known peaks avoids cold starts for the first requests after the spike.

***

## Scheduled traffic

### Characteristics

* **Long periods of low or zero traffic**.
* Large bursts tied to job schedules (hourly, daily, weekly).
* Traffic patterns are predictable but infrequent.

### Common causes

* ETL pipelines and data processing jobs.
* Embedding backfills and batch inference.
* Periodic evaluation or testing jobs.
* Document processing triggered by user uploads.

### Recommended settings

| Parameter          | Value                                                           | Why                                       |
| ------------------ | --------------------------------------------------------------- | ----------------------------------------- |
| Min replicas       | **0** (if cold starts acceptable) or **1** (during job windows) | Cost savings when idle                    |
| Scale-down delay   | **Moderate to high**                                            | Jobs often come in waves                  |
| Autoscaling window | **60-120s**                                                     | Don't overreact to the first few requests |
| Target utilization | **70%**                                                         | Default is fine                           |

Scale-to-zero saves significant cost during idle periods. The moderate window prevents overreacting to the initial requests of a batch. If jobs come in waves, a longer delay keeps replicas warm between them.

### Scheduled pre-warming

For predictable batch jobs, use cron + API to pre-warm.
5 minutes before the hourly job, scale up:

```bash theme={"system"}
0 * * * * curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{"min_replica": 3}'
```

30 minutes after the job completes, scale back down:

```bash theme={"system"}
30 * * * * curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{"min_replica": 0}'
```

If you use scale-to-zero, the first request of each batch will experience a [cold start](/deployment/autoscaling/cold-starts). For latency-sensitive batch jobs, keep min replicas at 1 during expected job windows.

***

## Steady traffic

### Characteristics

* Traffic **rises and falls gradually** over the day.
* Classic diurnal pattern with no sharp edges.
* Predictable, cyclical behavior.

### Common causes

* Always-on inference APIs with consistent user base.
* B2B applications with business-hours usage.
* Production workloads with stable, mature traffic.

### Recommended settings

| Parameter          | Value        | Why                            |
| ------------------ | ------------ | ------------------------------ |
| Target utilization | **70-80%**   | Can run replicas hotter safely |
| Autoscaling window | **60-120s**  | Moderate reaction speed        |
| Scale-down delay   | **300-600s** | Moderate                       |
| Min replicas       | **≥2**       | Redundancy for production      |

Without sudden spikes, you don't need as much headroom. You can run replicas at higher utilization (lower cost) because load changes are gradual and predictable. The autoscaler has time to react.

Smooth traffic is the easiest to tune. Start with defaults, monitor for a week, then optimize for cost by gradually raising target utilization while watching p95 latency.

***

## Identifying your pattern

Not sure which pattern you have? Check your metrics:

1. Go to your model's **Metrics** tab in the Baseten dashboard
2.
Look at **Inference volume** and **Replicas** over the past week
3. Compare to the patterns above

| You see...                                            | Your pattern is... |
| ----------------------------------------------------- | ------------------ |
| Frequent small spikes that quickly return to baseline | Jittery            |
| Sharp jumps that stay high for a while                | Bursty             |
| Long flat periods with occasional large bursts        | Scheduled          |
| Gradual rises and falls, smooth curves                | Steady             |

Some workloads are a mix of patterns. If your traffic has both smooth diurnal patterns AND occasional bursts, optimize for the bursts (they cause the most pain) and accept slightly higher cost during steady periods.

***

## Next steps

* [Autoscaling](/deployment/autoscaling/overview): Full parameter documentation.
* [Troubleshooting autoscaling](/troubleshooting/deployments#autoscaling-issues): Diagnose and fix common problems.
* [Truss configuration reference](/reference/truss-configuration): Configure predict\_concurrency in your model.

# Concepts

Source: https://docs.baseten.co/deployment/concepts

Deployments, environments, resources, and autoscaling on Baseten.

When you run `truss push`, Baseten creates a [deployment](/deployment/deployments): a running instance of your model on GPU infrastructure with an API endpoint. This page explains how deployments are managed, versioned, and scaled.

## Deployments

A [deployment](/deployment/deployments) is a single version of your model running on specific hardware. Every `truss push` creates a new deployment. You can have multiple deployments of the same model running simultaneously, which is how you test new versions without affecting production traffic. Deployments can be deactivated to stop serving (and stop incurring cost) or deleted permanently when no longer needed.

## Environments

As your model matures, you need a way to manage releases. [Environments](/deployment/environments) provide stable endpoints that persist across deployments.
A typical setup has a development environment for testing and a production environment for live traffic. Each environment maintains its own autoscaling settings, metrics, and endpoint URL. When a new deployment is ready, you promote it to an environment, and traffic shifts to the new version without changing the endpoint your application calls.

## Resources

Every deployment runs on a specific [instance type](/deployment/resources) that defines its GPU, CPU, and memory allocation. Choosing the right instance balances inference speed against cost. You set the instance type in your `config.yaml` before deployment, or adjust it later in the dashboard. Smaller models run well on an L4 (24 GB VRAM), while large LLMs may need A100s or H100s with tensor parallelism across multiple GPUs.

## Autoscaling

You don't manage replicas manually. [Autoscaling](/deployment/autoscaling/overview) adjusts the number of running instances based on incoming traffic. You configure a minimum and maximum replica count, a concurrency target, and a scale-down delay. When traffic drops, replicas scale down (optionally to zero, eliminating all cost). When traffic spikes, new replicas come up within seconds. [Cold start optimization](/deployment/autoscaling/cold-starts) and network acceleration keep response times fast even when scaling from zero.

# Deployments

Source: https://docs.baseten.co/deployment/deployments

Deploy, manage, and scale machine learning models with Baseten

A **deployment** in Baseten is a **containerized instance of a model** that serves inference requests via an API endpoint. Deployments exist independently but can be **promoted to an environment** for structured access and scaling.

Baseten **automatically wraps every deployment in a REST API**.
Once deployed, models can be queried with a simple HTTP request:

```python theme={"system"}
import requests

resp = requests.post(
    "https://model-{modelID}.api.baseten.co/deployment/{deploymentID}/predict",
    headers={"Authorization": "Api-Key YOUR_API_KEY"},
    json={'text': 'Hello my name is {MASK}'},
)
print(resp.json())
```

[Learn more about running inference on your deployment](/inference/calling-your-model)

***

# Development deployment

A **development deployment** is a mutable instance designed for rapid iteration. Create one with `truss push --watch` (for models) or `truss chains push --watch` (for Chains). It is always in the **development state** and cannot be renamed or detached from it.

Key characteristics:

* **Live reload** enables direct updates without redeployment.
* **Single replica, scales to zero** when idle to conserve compute resources.
* **No autoscaling or zero-downtime updates.**
* **Can be promoted** to create a persistent deployment.

Once promoted, the development deployment transitions to a **published deployment** and can optionally be promoted to an environment.

***

# Environments and promotion

Environments provide **logical isolation** for managing deployments but are **not required** for a deployment to function. You can run a deployment independently or promote it to an environment for controlled traffic allocation and scaling.

* The **production environment** exists by default.
* **Custom environments** (e.g., staging) can be created for specific workflows.
* **Promoting a deployment doesn't modify its behavior**, only its routing and lifecycle management.

## Rolling deployments

Rolling deployments replace replicas incrementally when promoting a deployment to an environment. Instead of swapping all traffic at once, rolling deployments scale up the candidate, shift traffic proportionally, and scale down the previous deployment in controlled steps. You can pause, resume, cancel, or force-complete a rolling deployment at any point.
See [Rolling deployments](/deployment/rolling-deployments) for configuration, control actions, and status reference.

## Canary deployments (deprecated)

Canary deployments are deprecated. Use [rolling deployments](/deployment/rolling-deployments) for incremental traffic shifting with finer control over replica provisioning and rollback.

Canary deployments support incremental traffic shifting to a new deployment in 10 evenly distributed stages over a configurable time window. Canary rollouts can be enabled or canceled via the UI or [REST API](/reference/management-api/environments/update-an-environments-settings).

***

# Managing deployments

## Naming deployments

By default, deployments of a model are named `deployment-1`, `deployment-2`, and so forth sequentially. You can instead give deployments custom names via two methods:

1. While creating the deployment, using a [command line argument in truss push](/reference/sdk/truss#deploying-a-model).
2. After creating the deployment, in the model management page within your Baseten dashboard.

Renaming deployments is purely aesthetic and does not affect model management API paths, which work via model and deployment IDs.

## Deactivating a deployment

Deactivate a deployment to suspend inference execution while preserving configuration.

* **Remains visible in the dashboard.**
* **Consumes no compute resources** but can be reactivated anytime.
* **API requests return a 404 error while deactivated.**

For demand-driven deployments, consider [autoscaling with scale to zero](/reference/management-api/deployments/autoscaling/updates-a-deployments-autoscaling-settings).

## Deleting deployments

You can **permanently delete** deployments, but production deployments must be replaced before deletion.

* **Deleted deployments are purged from the dashboard** but retained in usage logs.
* **All associated compute resources are released.**
* **API requests return a 404 error post-deletion.**

Deletion is irreversible.
Use deactivation if retention is required.

# Environments

Source: https://docs.baseten.co/deployment/environments

Manage your model's release cycles with environments.

Environments provide structured management for deployments, ensuring controlled rollouts, stable endpoints, and autoscaling. They help teams stage, test, and release models without affecting production traffic. Deployments can be promoted to an environment (e.g., "staging") to validate outputs before moving to production, allowing for safer model iteration and evaluation.

***

## Using environments to manage deployments

Environments support **structured validation** before promoting a deployment, including:

* **Automated tests and evaluations**
* **Manual testing in pre-production**
* **Gradual traffic shifts with canary deployments**
* **Shadow serving for real-world analysis**

Promoting a deployment ensures it inherits **environment-specific scaling and monitoring settings**, such as:

* **Dedicated API endpoint** → [Predict API Reference](/reference/inference-api/overview#predict-endpoints)
* **Autoscaling controls** → Scale behavior is managed per environment.
* **Traffic ramp-up** → Enable [canary rollouts](/deployment/deployments#canary-deployments) or [rolling deployments](/deployment/rolling-deployments).
* **Monitoring and metrics** → [Export environment metrics](/observability/export-metrics/overview).

A **production environment** operates like any other environment but has restrictions:

* **It can't be deleted** unless the entire model is removed.
* **You can't create additional environments named "production."**

***

## Creating custom environments

In addition to the standard **production** environment, you can create as many custom environments as needed. There are two ways to create a custom environment:

1. In the model management page on the Baseten dashboard.
2.
Via the [create environment endpoint](/reference/management-api/environments/create-an-environment) in the model management API.

***

## Promoting deployments to environments

When you promote a deployment, Baseten follows a **three-step process**:

1. A **new deployment** is created with a unique deployment ID.
2. The deployment **initializes resources** and becomes active.
3. The new deployment **replaces the existing deployment** in that environment.
   * If there was **no previous deployment**, default autoscaling settings are applied.
   * If a **previous deployment existed**, the new one **inherits autoscaling settings**, and the old deployment is **demoted and scales to zero**.

### Promoting a published deployment

If a **published deployment** (not a development deployment) is promoted:

* Its **autoscaling settings are updated** to match the environment.
* If **inactive**, it must be **activated** before promotion.

Previous deployments are **demoted but remain in the system**, retaining their **deployment ID and scaling behavior**.

***

## Deploying directly to an environment

You can deploy directly to a named environment by specifying `--environment` in `truss push`:

```sh theme={"system"}
cd my_model/
truss push --environment {environment_name}
```

Only one active promotion per environment is allowed at a time.

***

## Accessing environments in your code

The **environment name** is available in `model.py` via the `environment` keyword argument:

```python theme={"system"}
def __init__(self, **kwargs):
    self._environment = kwargs["environment"]
```

To ensure the environment variable remains updated, enable **"Re-deploy when promoting"** in the UI or via the [REST API](/reference/management-api/environments/update-an-environments-settings). This guarantees the environment is fully initialized after a promotion.

***

## Regional environments

Regional environments restrict inference traffic to a specific geographic region for data residency compliance.
When your organization enables regional environments, each environment gets a dedicated regional endpoint that routes directly to infrastructure in the designated region. Regional environments are configured at the organization level. Contact your Baseten account team to enable regional environments.

### Regional endpoint format

Regional endpoints embed the environment name in the hostname instead of the URL path.

Call a model's regional endpoint with `/predict` or `/async_predict`:

```
https://model-{model_id}-{env_name}.api.baseten.co/predict
```

For example, a model with ID `abc123` in the `prod-us` environment:

```
https://model-abc123-prod-us.api.baseten.co/predict
```

Call a chain's regional endpoint with `/run_remote` or `/async_run_remote`:

```
https://chain-{chain_id}-{env_name}.api.baseten.co/run_remote
```

Connect to a regional WebSocket endpoint for models or chains:

```
wss://model-{model_id}-{env_name}.api.baseten.co/websocket
wss://chain-{chain_id}-{env_name}.api.baseten.co/websocket
```

Connect to a regional gRPC endpoint using the `grpc.api.baseten.co` subdomain:

```
model-{model_id}-{env_name}.grpc.api.baseten.co:443
```

The regional endpoint URL appears in your model's API endpoint section in the Baseten dashboard once regional environments are configured for your organization.

### API restrictions on regional endpoints

Regional endpoints derive the environment exclusively from the hostname. Path-based routing (`/environments/`, `/production/`, `/deployment/`) is rejected. For gRPC, do not set `x-baseten-environment` or `x-baseten-deployment` metadata headers.

***

## Deleting environments

You can delete environments, **except for production**. To remove a **production deployment**, first **promote another deployment to production** or delete the entire model.

* **Deleted environments are removed from the overview** but remain in billing history.
* **They do not consume resources** after deletion.
* **API requests to a deleted environment return a 404 error.**

Deletion is permanent - consider deactivation instead.

# Resources

Source: https://docs.baseten.co/deployment/resources

Manage and configure model resources

Every AI/ML model on Baseten runs on an **instance**, a dedicated set of hardware allocated to the model server. Selecting the right instance type ensures **optimal performance** while controlling **compute costs**.

* **Insufficient resources**: Slow inference or failures.
* **Excess resources**: Higher costs without added benefit.

## Instance type resource components

* **Instance**: The allocated hardware for inference.
* **Node**: The compute unit within an instance, comprising 8 GPUs with associated vCPU, RAM, and VRAM.
* **vCPU**: Virtual CPU cores for general computing.
* **RAM**: Memory available to the CPU.
* **GPU**: Specialized hardware for accelerated ML workloads.
* **VRAM**: Dedicated GPU memory for model execution.

***

# Configuring model resources

Define resources **before deployment** in Truss or **adjust them later** via the Baseten UI.

### Defining resources in Truss

Define resource requirements in [`config.yaml`](/development/model/configuration) before running `truss push`.

* **Published deployment** (`truss push`): Creates a new deployment (named sequentially: `deployment-1`, `deployment-2`, etc.) using the resources in [`config.yaml`](/development/model/configuration).
* **Development deployment** (`truss push --watch`): Overwrites the existing development deployment with the specified resource configuration and starts watching for changes. Use [`truss watch`](/development/model/deploy-and-iterate) to resume watching an existing development deployment.
* **Production deployment** (`truss push --promote`): Creates a new deployment and promotes it to production, replacing the active deployment.
* **Environment deployment** (`truss push --environment {environment_name}`): Deploys directly to a [custom environment](/deployment/environments) like staging.

Changes to `config.yaml` only affect new deployments. To update resources on an existing published deployment, edit resources in the [Baseten UI](#updating-resources-in-the-baseten-ui).

You can configure resources in two ways:

**Option 1: Specify individual resource fields**

```yaml config.yaml theme={"system"}
resources:
  accelerator: L4
  cpu: "4"
  memory: 16Gi
```

Baseten provisions the **smallest instance that meets the specified constraints**:

* `cpu: "3"` or `"4"` → Maps to a 4-core instance.
* `cpu: "5"` to `"8"` → Maps to an 8-core instance.

`Gi` in `resources.memory` refers to **Gibibytes**, which are slightly larger than **Gigabytes**.

**Option 2: Specify an exact instance type**

An instance type is the full SKU name that uniquely identifies a specific hardware configuration. When you specify individual resource fields like `cpu` and `accelerator`, Baseten selects the smallest instance that meets your requirements. With `instance_type`, you specify exactly which instance you want, no guessing required.

Use `instance_type` when you:

* Know the exact hardware configuration you need.
* Want to ensure consistent instance selection across deployments.
* Are following a recommendation for a specific model (for example, "use an L4 with 4 vCPUs and 16 GiB RAM").

```yaml config.yaml theme={"system"}
resources:
  instance_type: "L4:4x16"
```

The format encodes the hardware specs: `{gpu}:{vcpu}x{ram}`. For example, `L4:4x16` means an L4 GPU with 4 vCPUs and 16 GiB of RAM. When `instance_type` is specified, other resource fields (`cpu`, `memory`, `accelerator`, `use_gpu`) are ignored.

### Updating resources in the Baseten UI

Once deployed, you can only update resource configurations **through the Baseten UI**. Changing the instance type deploys a copy of the deployment using the specified instance type.
For a list of available instance types, see the [instance type reference](/deployment/resources#instance-type-reference).

***

# Instance type reference

Specs and benchmarks for every Baseten instance type.

Choosing the right instance for model inference means balancing performance and cost. This page lists all available instance types on Baseten to help you deploy and serve models effectively.

## CPU-only instances

Cost-effective options for lighter workloads. No GPU.

* **Starts at**: \$0.00058/min
* **Best for**: Transformers pipelines, small QA models, text embeddings

| Instance | \$/min    | vCPU | RAM    |
| -------- | --------- | ---- | ------ |
| 1x2      | \$0.00058 | 1    | 2 GiB  |
| 1x4      | \$0.00086 | 1    | 4 GiB  |
| 2x8      | \$0.00173 | 2    | 8 GiB  |
| 4x16     | \$0.00346 | 4    | 16 GiB |
| 8x32     | \$0.00691 | 8    | 32 GiB |
| 16x64    | \$0.01382 | 16   | 64 GiB |

To select a CPU-only instance, use the format `CPU:{vcpu}x{ram}` (e.g., `instance_type: "CPU:4x16"`).

**Example workloads:**

* `1x2`: Text classification (e.g., Truss quickstart)
* `4x16`: LayoutLM Document QA
* `4x16+`: Sentence Transformers embeddings on larger corpora

## GPU instances

Accelerated inference for LLMs, diffusion models, and Whisper.
| Instance       | \$/min    | vCPU | RAM      | GPU                    | VRAM    |
| -------------- | --------- | ---- | -------- | ---------------------- | ------- |
| T4x4x16        | \$0.01052 | 4    | 16 GiB   | NVIDIA T4              | 16 GiB  |
| T4x8x32        | \$0.01504 | 8    | 32 GiB   | NVIDIA T4              | 16 GiB  |
| T4x16x64       | \$0.02408 | 16   | 64 GiB   | NVIDIA T4              | 16 GiB  |
| L4x4x16        | \$0.01414 | 4    | 16 GiB   | NVIDIA L4              | 24 GiB  |
| L4:2x24x96     | \$0.04002 | 24   | 96 GiB   | 2 NVIDIA L4s           | 48 GiB  |
| L4:4x48x192    | \$0.08003 | 48   | 192 GiB  | 4 NVIDIA L4s           | 96 GiB  |
| A10Gx4x16      | \$0.02012 | 4    | 16 GiB   | NVIDIA A10G            | 24 GiB  |
| A10Gx8x32      | \$0.02424 | 8    | 32 GiB   | NVIDIA A10G            | 24 GiB  |
| A10Gx16x64     | \$0.03248 | 16   | 64 GiB   | NVIDIA A10G            | 24 GiB  |
| A10G:2x24x96   | \$0.05672 | 24   | 96 GiB   | 2 NVIDIA A10Gs         | 48 GiB  |
| A10G:4x48x192  | \$0.11344 | 48   | 192 GiB  | 4 NVIDIA A10Gs         | 96 GiB  |
| A10G:8x192x768 | \$0.32576 | 192  | 768 GiB  | 8 NVIDIA A10Gs         | 192 GiB |
| A100x12x144    | \$0.10240 | 12   | 144 GiB  | 1 NVIDIA A100          | 80 GiB  |
| A100:2x24x288  | \$0.20480 | 24   | 288 GiB  | 2 NVIDIA A100s         | 160 GiB |
| A100:3x36x432  | \$0.30720 | 36   | 432 GiB  | 3 NVIDIA A100s         | 240 GiB |
| A100:4x48x576  | \$0.40960 | 48   | 576 GiB  | 4 NVIDIA A100s         | 320 GiB |
| A100:5x60x720  | \$0.51200 | 60   | 720 GiB  | 5 NVIDIA A100s         | 400 GiB |
| A100:6x72x864  | \$0.61440 | 72   | 864 GiB  | 6 NVIDIA A100s         | 480 GiB |
| A100:7x84x1008 | \$0.71680 | 84   | 1008 GiB | 7 NVIDIA A100s         | 560 GiB |
| A100:8x96x1152 | \$0.81920 | 96   | 1152 GiB | 8 NVIDIA A100s         | 640 GiB |
| H100           | \$0.10833 | -    | -        | 1 NVIDIA H100          | 80 GiB  |
| H100:2         | \$0.21667 | -    | -        | 2 NVIDIA H100s         | 160 GiB |
| H100:4         | \$0.43333 | -    | -        | 4 NVIDIA H100s         | 320 GiB |
| H100:8         | \$0.86667 | -    | -        | 8 NVIDIA H100s         | 640 GiB |
| H100MIG        | \$0.06250 | -    | -        | Fractional NVIDIA H100 | 40 GiB  |

To select a GPU instance with `instance_type`:

* **Single GPU**: `{gpu}:{vcpu}x{ram}` (e.g., `"L4:4x16"`).
* **Multi-GPU**: `{gpu}:{count}x{vcpu}x{ram}` (e.g., `"A100:2x24x288"`).
* **H100**: `H100` or `H100:{count}` (e.g., `"H100:2"`).
* **Fractional H100**: `"H100_40GB"`.
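As a worked illustration of this naming scheme, here is a small parser for the SKU formats listed above, including the CPU-only format from the previous section. This is a sketch of the naming convention, not an official Baseten client; the fractional `"H100_40GB"` SKU is omitted for brevity.

```python
# Illustrative parser for the instance type SKU formats described above.
# A sketch of the naming convention only, not an official Baseten client;
# the fractional "H100_40GB" SKU is omitted for brevity.
import re


def parse_instance_type(sku: str) -> dict:
    # H100 SKUs omit vCPU/RAM: "H100" or "H100:{count}"
    m = re.fullmatch(r"H100(?::(\d+))?", sku)
    if m:
        return {"gpu": "H100", "gpu_count": int(m.group(1) or 1)}
    # Multi-GPU: {gpu}:{count}x{vcpu}x{ram}, e.g. "A100:2x24x288"
    m = re.fullmatch(r"([A-Za-z0-9]+):(\d+)x(\d+)x(\d+)", sku)
    if m:
        gpu, count, vcpu, ram = m.groups()
        return {"gpu": gpu, "gpu_count": int(count),
                "vcpu": int(vcpu), "ram_gib": int(ram)}
    # Single GPU "L4:4x16" or CPU-only "CPU:4x16": {name}:{vcpu}x{ram}
    m = re.fullmatch(r"([A-Za-z0-9]+):(\d+)x(\d+)", sku)
    if m:
        name, vcpu, ram = m.groups()
        return {"gpu": None if name == "CPU" else name,
                "gpu_count": 0 if name == "CPU" else 1,
                "vcpu": int(vcpu), "ram_gib": int(ram)}
    raise ValueError(f"unrecognized instance type: {sku}")
```

For example, `parse_instance_type("A100:2x24x288")` reads as two A100s with 24 vCPUs and 288 GiB RAM, matching the table row above.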
## GPU details and workloads

### T4

Turing-series GPU

* 2,560 CUDA / 320 Tensor cores
* 16 GiB VRAM
* **Best for:** Whisper, small LLMs like StableLM 3B

### L4

Ada Lovelace-series GPU

* 7,680 CUDA / 240 Tensor cores
* 24 GiB VRAM, 300 GiB/s
* 121 TFLOPS (fp16)
* **Best for**: Stable Diffusion XL
* **Limit**: Not suitable for LLMs due to bandwidth

### A10G

Ampere-series GPU

* 9,216 CUDA / 288 Tensor cores
* 24 GiB VRAM, 600 GiB/s
* 70 TFLOPS (fp16)
* **Best for**: Mistral 7B, Whisper, Stable Diffusion/SDXL

### A100

Ampere-series GPU

* 6,912 CUDA / 432 Tensor cores
* 80 GiB VRAM, 1.94 TB/s
* 312 TFLOPS (fp16)
* **Best for**: Mixtral, Llama 2 70B (2 A100s), Falcon 180B (5 A100s), SDXL

### H100

Hopper-series GPU

* 16,896 CUDA / 640 Tensor cores
* 80 GiB VRAM, 3.35 TB/s
* 990 TFLOPS (fp16)
* **Best for**: Mixtral 8x7B, Llama 2 70B (2xH100), SDXL

### H100MIG

Fractional H100 (3/7 compute, ½ memory)

* 7,242 CUDA cores, 40 GiB VRAM
* 1.675 TB/s bandwidth
* **Best for**: Efficient LLM inference at lower cost than A100

# Rolling deployments

Source: https://docs.baseten.co/deployment/rolling-deployments

Gradually shift traffic to a new deployment with replica-based rolling deployments.

Rolling deployments replace replicas incrementally when promoting a deployment to an environment. Instead of swapping all traffic at once, rolling deployments scale up the candidate deployment, shift traffic proportionally, and scale down the previous deployment in controlled steps. Use rolling deployments when you need zero-downtime updates with the ability to pause, cancel, or force-complete the deployment at any point.

Autoscaling is disabled for the entire duration of a rolling deployment. Replica counts don't adjust automatically until the deployment reaches a terminal status (SUCCEEDED, FAILED, or CANCELED). Use the `replica_overhead_percent` setting to pre-provision additional capacity before the deployment starts.
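Since `replica_overhead_percent` is plain percentage headroom on the current replica count, the pre-provisioned capacity is easy to reason about. This sketch assumes Baseten rounds the resulting replica count up, which is an assumption for illustration, not documented behavior.

```python
# Sketch of the replica_overhead_percent arithmetic: percentage headroom
# added to the current deployment before a rolling deployment starts.
# Rounding up is an assumption for this illustration; the exact rounding
# Baseten applies is not documented here.
import math


def preprovisioned_replicas(current: int, overhead_percent: int) -> int:
    """E.g. overhead_percent=50 turns 10 current replicas into 15."""
    if not 0 <= overhead_percent <= 500:
        raise ValueError("replica_overhead_percent must be in 0-500")
    return math.ceil(current * (1 + overhead_percent / 100))
```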
## How rolling deployments work

A rolling deployment follows a repeating three-step cycle:

1. **Scale up** candidate deployment replicas by the configured percentage.
2. **Shift traffic** proportionally to match the new replica ratio.
3. **Scale down** the previous deployment replicas by the same percentage.

This cycle repeats until all traffic and replicas run on the candidate deployment, at which point it becomes the active deployment in the environment.

### Provisioning modes

Rolling deployments support two mutually exclusive provisioning modes. You must configure exactly one:

* `max_surge_percent`: Scales up candidate replicas before scaling down previous replicas.
* `max_unavailable_percent`: Scales down previous replicas before scaling up candidate replicas.

Set exactly one mode to a non-zero value: the two settings can't both be zero, and can't both be non-zero.

## Enabling rolling deployments

Enable rolling deployments on any environment by updating the environment's promotion settings. Rolling deployments are disabled by default.
```bash theme={"system"}
curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/environments/production \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "promotion_settings": {
      "rolling_deploy": true,
      "rolling_deploy_config": {
        "max_surge_percent": 10,
        "max_unavailable_percent": 0,
        "stabilization_time_seconds": 60,
        "replica_overhead_percent": 0
      }
    }
  }'
```

```python theme={"system"}
import requests
import os

API_KEY = os.environ.get("BASETEN_API_KEY")

response = requests.patch(
    "https://api.baseten.co/v1/models/{model_id}/environments/production",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={
        "promotion_settings": {
            "rolling_deploy": True,
            "rolling_deploy_config": {
                "max_surge_percent": 10,
                "max_unavailable_percent": 0,
                "stabilization_time_seconds": 60,
                "replica_overhead_percent": 0,
            },
        }
    },
)
print(response.json())
```

Once rolling deployments are enabled, any subsequent [promotion to the environment](/reference/management-api/deployments/promote/promotes-a-deployment-to-an-environment) uses the rolling deployment workflow.

## Configuration reference

Configure rolling deployments through the `rolling_deploy_config` object in the environment's `promotion_settings`:

* `max_surge_percent`: Percentage of additional replicas to provision during each step. Set to `0` to use max unavailable mode instead. **Range:** 0–50
* `max_unavailable_percent`: Percentage of replicas that can be unavailable during each step. Set to `0` to use max surge mode instead. **Range:** 0–50
* `stabilization_time_seconds`: Seconds to wait after each traffic shift before proceeding to the next step. Use this to monitor metrics between steps. **Range:** 0–3600
* `replica_overhead_percent`: Percentage of additional replicas to pre-provision on the current deployment before the rolling deployment starts. Compensates for autoscaling being disabled. **Range:** 0–500

Additional promotion settings configured at the `promotion_settings` level:

* `rolling_deploy`: Enables rolling deployments for the environment.
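To build intuition for how `max_surge_percent` plays out, here is a small simulation of the scale-up / shift / scale-down cycle in max-surge mode. The step sizing and rounding here are assumptions for illustration; Baseten's actual scheduler may step differently.

```python
# Illustrative simulation of the rolling deployment cycle in max-surge
# mode: scale the candidate up by max_surge_percent of the target replica
# count, shift traffic proportionally, scale the previous deployment
# down. Step sizing and rounding are assumptions for illustration only.
import math


def surge_steps(target_replicas: int, max_surge_percent: int) -> list[dict]:
    step = max(1, math.floor(target_replicas * max_surge_percent / 100))
    candidate, states = 0, []
    while candidate < target_replicas:
        candidate = min(target_replicas, candidate + step)  # 1. scale up
        traffic = round(100 * candidate / target_replicas)  # 2. shift traffic
        previous = target_replicas - candidate              # 3. scale down
        states.append({"candidate": candidate, "previous": previous,
                       "traffic_percent": traffic})
    return states
```

With 10 target replicas and `max_surge_percent: 25`, this yields steps of two replicas each, shifting roughly 20% of traffic per step until the candidate serves 100%.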
## Deployment statuses

The `in_progress_promotion` field on the [environment detail endpoint](/reference/management-api/environments/get-an-environments-details) tracks the current state of a rolling deployment.

| Status         | Description                                                                                        |
| -------------- | -------------------------------------------------------------------------------------------------- |
| `RELEASING`    | Candidate deployment is building and initializing replicas.                                        |
| `RAMPING_UP`   | Scaling up candidate replicas and shifting traffic.                                                |
| `PAUSED`       | Rolling deployment is paused at its current traffic split. Replicas stay at their current count.   |
| `RAMPING_DOWN` | Graceful cancel in progress. Traffic is shifting back to the previous deployment.                  |
| `SUCCEEDED`    | Rolling deployment completed. The candidate is now the active deployment. Autoscaling resumes.     |
| `FAILED`       | Rolling deployment failed. Traffic remains on the previous deployment. Autoscaling resumes.        |
| `CANCELED`     | Rolling deployment was canceled. Traffic returned to the previous deployment. Autoscaling resumes. |

The `in_progress_promotion` object also includes `percent_traffic_to_new_version`, which reports the current percentage of traffic routed to the candidate deployment.

## Deployment control actions

### Pause

Pauses the rolling deployment after the current step completes. Use this to inspect metrics or logs before proceeding.

```bash theme={"system"}
curl -X POST \
  https://api.baseten.co/v1/models/{model_id}/environments/production/pause_promotion \
  -H "Authorization: Api-Key $BASETEN_API_KEY"
```

```python theme={"system"}
response = requests.post(
    "https://api.baseten.co/v1/models/{model_id}/environments/production/pause_promotion",
    headers={"Authorization": f"Api-Key {API_KEY}"},
)
print(response.json())
```

### Resume

Resumes a paused rolling deployment from where it left off.
```bash theme={"system"}
curl -X POST \
  https://api.baseten.co/v1/models/{model_id}/environments/production/resume_promotion \
  -H "Authorization: Api-Key $BASETEN_API_KEY"
```

```python theme={"system"}
response = requests.post(
    "https://api.baseten.co/v1/models/{model_id}/environments/production/resume_promotion",
    headers={"Authorization": f"Api-Key {API_KEY}"},
)
print(response.json())
```

### Cancel

Gracefully cancels the rolling deployment. Traffic ramps back to the previous deployment and candidate replicas scale down.

```bash theme={"system"}
curl -X POST \
  https://api.baseten.co/v1/models/{model_id}/environments/production/cancel_promotion \
  -H "Authorization: Api-Key $BASETEN_API_KEY"
```

```python theme={"system"}
response = requests.post(
    "https://api.baseten.co/v1/models/{model_id}/environments/production/cancel_promotion",
    headers={"Authorization": f"Api-Key {API_KEY}"},
)
print(response.json())
```

Returns a `status` of `CANCELED` (instant cancel for non-rolling deployments) or `RAMPING_DOWN` (graceful rollback for rolling deployments).

### Force cancel

Immediately cancels the rolling deployment and returns all traffic to the previous deployment. Use this when you need to roll back without waiting for the graceful ramp-down. Force canceling may cause brief service disruption if the previous deployment is under-provisioned.

```bash theme={"system"}
curl -X POST \
  https://api.baseten.co/v1/models/{model_id}/environments/production/force_cancel_promotion \
  -H "Authorization: Api-Key $BASETEN_API_KEY"
```

```python theme={"system"}
response = requests.post(
    "https://api.baseten.co/v1/models/{model_id}/environments/production/force_cancel_promotion",
    headers={"Authorization": f"Api-Key {API_KEY}"},
)
print(response.json())
```

### Force roll forward

Immediately completes the rolling deployment, shifting all traffic to the candidate deployment. This works even if the deployment is in the process of rolling back.
Force rolling forward may promote an under-provisioned deployment if the candidate has not finished scaling up. ```bash theme={"system"} curl -X POST \ https://api.baseten.co/v1/models/{model_id}/environments/production/force_roll_forward_promotion \ -H "Authorization: Api-Key $BASETEN_API_KEY" ``` ```python theme={"system"} response = requests.post( "https://api.baseten.co/v1/models/{model_id}/environments/production/force_roll_forward_promotion", headers={"Authorization": f"Api-Key {API_KEY}"}, ) print(response.json()) ``` ## Autoscaling during rolling deployments To compensate for autoscaling being disabled during rolling deployments: * Set `replica_overhead_percent` to pre-provision the current deployment before the rolling deployment starts. For example, a value of `50` adds 50% more replicas to the current deployment before any traffic shifts. * Set `stabilization_time_seconds` to add a wait period between steps, giving you time to monitor metrics before the next traffic shift. * Factor in expected traffic when setting your environment's `min_replica` and `max_replica` before starting the rolling deployment. Autoscaling resumes automatically when the rolling deployment reaches a terminal status: `SUCCEEDED`, `FAILED`, or `CANCELED`. ## Deployment cleanup After a rolling deployment completes, the `promotion_cleanup_strategy` setting controls what happens to the previous deployment. * `SCALE_TO_ZERO`: Scales the previous deployment to zero replicas. It remains available for reactivation. This is the default. * `KEEP`: Leaves the previous deployment running at its current replica count. * `DEACTIVATE`: Deactivates the previous deployment. It stops serving traffic and releases all resources. 
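Because autoscaling stays disabled until the promotion reaches a terminal status, automation that drives rolling deployments typically polls until the promotion finishes. A minimal client-side sketch, assuming a `fetch_status` helper you implement yourself (for example, a `GET` against the environment detail endpoint that extracts the `in_progress_promotion` status string; the exact response field names are in the management API reference):

```python
import time

TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "CANCELED"}

def wait_for_promotion(fetch_status, poll_seconds=10, timeout=600):
    """Poll until the rolling deployment reaches a terminal status.

    `fetch_status` is a hypothetical callable that returns the current
    promotion status string for the environment.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("rolling deployment did not reach a terminal status")
```

With a real `fetch_status`, a `poll_seconds` of 10-30 seconds is a reasonable starting point.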
Set it alongside your other promotion settings: ```bash theme={"system"} curl -X PATCH \ https://api.baseten.co/v1/models/{model_id}/environments/production \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "promotion_settings": { "promotion_cleanup_strategy": "DEACTIVATE" } }' ``` ```python theme={"system"} response = requests.patch( "https://api.baseten.co/v1/models/{model_id}/environments/production", headers={"Authorization": f"Api-Key {API_KEY}"}, json={ "promotion_settings": { "promotion_cleanup_strategy": "DEACTIVATE" } }, ) print(response.json()) ``` # Binary IO Source: https://docs.baseten.co/development/chain/binaryio Performant serialization of numeric data Numeric data and audio/video are most efficiently transmitted as bytes. Other representations such as JSON or base64 encoding lose precision, add significant parsing overhead, and increase message sizes (for example, a \~33% increase for base64 encoding). Chains extends the JSON-centred pydantic ecosystem with two ways to include binary data: numpy array support and raw bytes. ## Numpy `ndarray` support Once you have your data represented as a numpy array, you can easily (and often without copying) convert it to `torch`, `tensorflow`, or other common numeric libraries' objects. To include numpy arrays in a pydantic model, Chains provides a special field type, `NumpyArrayField`. For example: ```python theme={"system"} import numpy as np import pydantic from truss_chains import pydantic_numpy class DataModel(pydantic.BaseModel): some_numbers: pydantic_numpy.NumpyArrayField other_field: str ... numbers = np.random.random((3, 2)) data = DataModel(some_numbers=numbers, other_field="Example") print(data) # some_numbers=NumpyArrayField(shape=(3, 2), dtype=float64, data=[ # [0.39595027 0.23837526] # [0.56714894 0.61244946] # [0.45821942 0.42464844]]) # other_field='Example' ``` `NumpyArrayField` is a wrapper around the actual numpy array.
Inside your python code, you can work with its `array` attribute: ```python theme={"system"} data.some_numbers.array += 10 # some_numbers=NumpyArrayField(shape=(3, 2), dtype=float64, data=[ # [10.39595027 10.23837526] # [10.56714894 10.61244946] # [10.45821942 10.42464844]]) # other_field='Example' ``` The interesting part is how it serializes when communicating between Chainlets or with a client. It can work in two modes: JSON and binary. ### Binary As a JSON alternative that supports byte data, Chains uses `msgpack` (with `msgpack_numpy`) to serialize the dict representation. For Chainlet-Chainlet RPCs this is done automatically for you by enabling binary mode of the dependency Chainlets; see [all options](/reference/sdk/chains#truss-chains-depends): ```python theme={"system"} import truss_chains as chains class Worker(chains.ChainletBase): async def run_remote(self, data: DataModel) -> DataModel: data.some_numbers.array += 10 return data class Consumer(chains.ChainletBase): def __init__(self, worker=chains.depends(Worker, use_binary=True)): self._worker = worker async def run_remote(self): numbers = np.random.random((3, 2)) data = DataModel(some_numbers=numbers, other_field="Example") result = await self._worker.run_remote(data) ``` Now the data is transmitted in a fast and compact way between Chainlets, which often improves performance. ### Binary client If you want to send such data as input to a chain or parse binary output from a chain, you have to add the `msgpack` serialization client-side: ```python theme={"system"} import requests import msgpack import msgpack_numpy msgpack_numpy.patch() # Register hook for numpy. # Dump to "python" dict and then to binary. data_dict = data.model_dump(mode="python") data_bytes = msgpack.dumps(data_dict) # Set binary content type in request header. headers = { "Content-Type": "application/octet-stream", "Authorization": ...
} response = requests.post(url, data=data_bytes, headers=headers) response_dict = msgpack.loads(response.content) response_model = ResponseModel.model_validate(response_dict) ``` The steps of dumping from a pydantic model and validating the response dict into a pydantic model can be skipped if you prefer working with raw dicts on the client. The implementation of `NumpyArrayField` only needs `pydantic`, no other Chains dependencies. So you can take that implementation code in isolation and integrate it in your client code. Some version combinations of `msgpack` and `msgpack_numpy` give errors; we know that `msgpack = ">=1.0.2"` and `msgpack-numpy = ">=0.4.8"` work. ### JSON The JSON schema to represent the array is a dict of `shape (tuple[int]), dtype (str), data_b64 (str)`. For example, ```python theme={"system"} print(data.model_dump_json()) '{"some_numbers":{"shape":[3,2],"dtype":"float64", "data_b64":"30d4/rnKJEAsvm...' ``` The base64 data corresponds to `np.ndarray.tobytes()`. To get back to the array from the JSON string, use the model's `model_validate_json` method. As discussed in the beginning, this schema is not performant for numeric data (JSON does not allow bytes) and is only offered as a compatibility layer; generally prefer the binary format. ## Simple `bytes` fields It is possible to add a `bytes` field to a pydantic model used in a chain, or as a plain argument to `run_remote`. This can be useful to include non-numpy data formats such as images or audio/video snippets. In this case, the "normal" JSON representation does not work and all involved requests or Chainlet-Chainlet invocations must use binary mode. The same steps as for arrays [above](#binary-client) apply: construct dicts with `bytes` values and keys corresponding to the `run_remote` argument names or the field names in the pydantic model. Then use `msgpack` to serialize and deserialize those dicts. Don't forget to add the `Content-Type` header, and note that `response.json()` will not work.
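The \~33% base64 overhead mentioned at the top of this page is easy to verify with only the standard library (a quick sanity check, not Chains-specific):

```python
import base64

raw = bytes(range(256)) * 4096   # 1 MiB of raw binary data
encoded = base64.b64encode(raw)  # what a JSON payload would have to carry
print(len(encoded) / len(raw))   # ~1.33
```

The same ratio applies regardless of payload size, since base64 maps every 3 input bytes to 4 output characters.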
# Concepts Source: https://docs.baseten.co/development/chain/concepts Glossary of Chains concepts and terminology ## Chainlet A Chainlet is the basic building block of Chains. A Chainlet is a Python class that specifies: * A set of compute resources. * A Python environment with software dependencies. * A typed interface [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets) for other Chainlets to call. This is the simplest possible Chainlet. Only the [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets) method is required, and we can layer in other concepts to create a more capable Chainlet. ```python theme={"system"} import truss_chains as chains class SayHello(chains.ChainletBase): async def run_remote(self, name: str) -> str: return f"Hello, {name}" ``` You can modularize your code by creating your own Chainlet sub-classes; refer to our [subclassing guide](/development/chain/subclassing). ### Remote configuration Chainlets are meant for deployment as remote services. Each Chainlet specifies its own requirements for compute hardware (CPU count, GPU type and count, etc.) and software dependencies (Python libraries or system packages). This configuration is built into a Docker image automatically as part of the deployment process. When no configuration is provided, the Chainlet will be deployed on a basic instance with one vCPU, 2GB of RAM, no GPU, and a standard set of Python and system packages. Configuration is set using the [`remote_config`](/reference/sdk/chains#remote-configuration) class variable within the Chainlet: ```python theme={"system"} import truss_chains as chains class MyChainlet(chains.ChainletBase): remote_config = chains.RemoteConfig( docker_image=chains.DockerImage( pip_requirements=["torch==2.3.0", ...]
), compute=chains.Compute(gpu="H100", ...), assets=chains.Assets(secret_keys=["hf_access_token"], ...), ) ``` To select an exact instance type instead of specifying individual resource fields, use `instance_type`: ```python theme={"system"} compute=chains.Compute(instance_type="H100:8x80") ``` When `instance_type` is specified, the `cpu_count`, `memory`, and `gpu` fields are ignored. See the [remote configuration reference](/reference/sdk/chains#remote-configuration) for a complete list of options. ### Initialization Chainlets are implemented as classes because we often want to set up expensive static resources once at startup and then re-use them with each invocation of the Chainlet. For example, we only want to initialize an AI model and download its weights once, then re-use it every time we run inference. We do this setup in `__init__()`, which is run exactly once when the Chainlet is deployed or scaled up. ```python theme={"system"} import truss_chains as chains class PhiLLM(chains.ChainletBase): def __init__(self) -> None: import torch import transformers self._model = transformers.AutoModelForCausalLM.from_pretrained( PHI_HF_MODEL, torch_dtype=torch.float16, device_map="auto", ) self._tokenizer = transformers.AutoTokenizer.from_pretrained( PHI_HF_MODEL, ) ``` Chainlet initialization also has two important features: context and dependency injection of other Chainlets, explained below. #### Context (access information) You can add a [`DeploymentContext`](/reference/sdk/chains#class-truss-chains-deploymentcontext) object as an optional argument to the `__init__`-method of a Chainlet. This allows you to use secrets within your Chainlet, such as using a `hf_access_token` to access a gated model on Hugging Face (note that when using secrets, they also need to be added to the `assets`). ```python theme={"system"} import truss_chains as chains class MistralLLM(chains.ChainletBase): remote_config = chains.RemoteConfig( ...
assets = chains.Assets(secret_keys=["hf_access_token"], ...), ) def __init__( self, # Adding the `context` argument allows us to access secrets context: chains.DeploymentContext = chains.depends_context(), ) -> None: import transformers # Using the secret from context to access a gated model on HF self._model = transformers.AutoModelForCausalLM.from_pretrained( "mistralai/Mistral-7B-Instruct-v0.2", use_auth_token=context.secrets["hf_access_token"], ) ``` #### Depends (call other Chainlets) The Chains framework uses the [`chains.depends()`](/reference/sdk/chains#truss-chains-depends) function in Chainlets' `__init__()` method to track the dependency relationship between different Chainlets within a Chain. This syntax, inspired by dependency injection, is used to translate local Python function calls into calls to the remote Chainlets in production. Once a dependency Chainlet is added with [`chains.depends()`](/reference/sdk/chains#truss-chains-depends), the depending Chainlet's [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets) method can call it. For example, below, `HelloAll` makes calls to `SayHello`: ```python theme={"system"} import truss_chains as chains class HelloAll(chains.ChainletBase): def __init__(self, say_hello_chainlet=chains.depends(SayHello)) -> None: self._say_hello = say_hello_chainlet async def run_remote(self, names: list[str]) -> str: output = [] for name in names: output.append(await self._say_hello.run_remote(name)) return "\n".join(output) ``` ## Run remote (chaining Chainlets) The `run_remote()` method is run each time the Chainlet is called. It is the sole public interface for the Chainlet (though you can have as many private helper functions as you want) and its inputs and outputs must have type annotations.
In `run_remote()` you implement the actual work of the Chainlet, such as model inference or data chunking: ```python theme={"system"} import truss_chains as chains class PhiLLM(chains.ChainletBase): async def run_remote(self, messages: Messages) -> str: import torch model_inputs = self._tokenizer.apply_chat_template( messages.messages, tokenize=False, add_generation_prompt=True ) inputs = self._tokenizer(model_inputs, return_tensors="pt") input_ids = inputs["input_ids"].to("cuda") with torch.no_grad(): outputs = self._model.generate( input_ids=input_ids, **self._generate_args) output_text = self._tokenizer.decode( outputs[0], skip_special_tokens=True) return output_text ``` We recommend implementing this as an `async` method and using async APIs for doing all the work (for example, downloads, vLLM or TRT inference). It is possible to stream results back; see our [streaming guide](/development/chain/streaming). If `run_remote()` makes calls to other Chainlets, for example, invoking a dependency Chainlet for each element in a list, you can benefit from concurrent execution by making `run_remote()` an `async` method and starting the calls as concurrent tasks with `asyncio.ensure_future(self._dep_chainlet.run_remote(...))`.
```python theme={"system"} @chains.mark_entrypoint class HelloAll(chains.ChainletBase): ``` Optionally you can also set a Chain display name (not to be confused with Chainlet display name) with this decorator: ```python theme={"system"} @chains.mark_entrypoint("My Awesome Chain") class HelloAll(chains.ChainletBase): ``` ## I/O and `pydantic` data types To make orchestrating multiple remotely deployed services possible, Chains relies heavily on typed inputs and outputs. Values must be serialized to a safe exchange format to be sent over the network. The Chains framework uses the type annotations to infer how data should be serialized and currently is restricted to types that are JSON compatible. Types can be: * Direct type annotations for simple types such as `int`, `float`, or `list[str]`. * Pydantic models to define a schema for nested data structures or multiple arguments. An example of pydantic input and output types for a Chainlet is given below: ```python theme={"system"} import enum import pydantic class Modes(enum.Enum): MODE_0 = "MODE_0" MODE_1 = "MODE_1" class SplitTextInput(pydantic.BaseModel): data: str num_partitions: int mode: Modes class SplitTextOutput(pydantic.BaseModel): parts: list[str] part_lens: list[int] ``` Refer to the [pydantic docs](https://docs.pydantic.dev/latest/) for more details on how to define custom pydantic data models. Also refer to the [guide](/development/chain/binaryio) about efficient integration of binary and numeric data. ## Chains compared to Truss Chains is an alternate SDK for packaging and deploying AI models. It carries over many features and concepts from Truss and gives you access to the benefits of Baseten (resource provisioning, autoscaling, fast cold starts, etc), but it is not a 1-1 replacement for Truss. Here are some key differences: * Rather than running `truss init` and creating a Truss in a directory, a Chain is a single file, giving you more flexibility for implementing multi-step model inference. 
Create an example with `truss chains init`. * Configuration is done inline in typed Python code rather than in a `config.yaml` file. * While Chainlets are converted to Truss models when run on Baseten, `Chainlet != TrussModel`. Chains is designed for compatibility and incremental adoption, with a stub function for wrapping existing deployed models. # Deploy Source: https://docs.baseten.co/development/chain/deploy Deploy your Chain on Baseten Deploying a Chain is an atomic action that deploys every Chainlet within the Chain. Each Chainlet specifies its own remote environment: hardware resources, Python and system dependencies, autoscaling settings. ### Published deployment By default, pushing a Chain creates a published deployment: ```sh theme={"system"} truss chains push ./my_chain.py ``` Where `my_chain.py` contains the entrypoint Chainlet for your Chain. Published deployments have access to full autoscaling settings. Each time you push, a new deployment is created. ### Development To create a development deployment for rapid iteration, use `--watch`: ```sh theme={"system"} truss chains push ./my_chain.py --watch ``` Development deployments are intended for testing and can't scale past one replica. Each time you make a development deployment, it overwrites the existing development deployment. Development deployments support rapid iteration with live code patching. See the [watch guide](/development/chain/watch). ### Environments To deploy a Chain to an environment, run: ```sh theme={"system"} truss chains push ./my_chain.py --environment {env_name} ``` Environments are intended for live traffic and have access to full autoscaling settings. Each time you deploy to an environment, a new deployment is created. Once the new deployment is live, it replaces the previous deployment, which is relegated to the published deployments list. [Learn more](/deployment/environments) about environments. 
# Architecture and design Source: https://docs.baseten.co/development/chain/design How to structure your Chainlets A Chain is composed of multiple connected Chainlets working together to perform a task. For example, the Chain in the diagram below takes a large audio file as input. Then it splits it into smaller chunks, transcribes each chunk in parallel (reducing the end-to-end latency), and finally aggregates and returns the results. To build an efficient Chain, we recommend drafting your high-level structure as a flowchart or diagram. This can help you identify parallelizable units of work and steps that need different (model/hardware) resources. If one Chainlet creates many "sub-tasks" by calling other dependency Chainlets (for example, in a loop over partial work items), these calls should be done as `asyncio` tasks that run concurrently. That way you get the most out of the parallelism that Chains offers. This design pattern is used extensively in the [audio transcription example](/examples/chains-audio-transcription). While using `asyncio` is essential for performance, it can also be tricky. Here are a few caveats to look out for: * Executing operations in an async function that block the event loop for more than a fraction of a second. This hinders the "flow" of processing requests concurrently and starting RPCs to other Chainlets. Ideally use native async APIs. Frameworks like vLLM or Triton server offer such APIs; similarly, file downloads can be made async, and you might find [`AsyncBatcher`](https://github.com/hussein-awala/async-batcher) useful. If there is no async support, consider running blocking code in a thread/process pool (as an attribute of a Chainlet). * Creating async tasks (e.g. with `asyncio.ensure_future`) does not start the task *immediately*. In particular, when starting several tasks in a loop, `ensure_future` must be alternated with operations that yield to the event loop, so the tasks can be started.
If the loop is not an `async for` loop and does not contain other `await` statements, a "dummy" await can be added, for example `await asyncio.sleep(0)`. This allows the tasks to be started concurrently. # Engine-Builder LLM Models Source: https://docs.baseten.co/development/chain/engine-builder-models Engine-Builder LLM models are pre-trained models that are optimized for specific inference tasks. Baseten's [Engine-Builder](/engines/engine-builder-llm/overview) enables the deployment of optimized model inference engines. Currently, it supports TensorRT-LLM. Truss Chains allows seamless integration of these engines into structured workflows. This guide provides a quick entry point for Chains users. ## Llama example Use the `EngineBuilderLLMChainlet` base class to configure an LLM engine. The additional `engine_builder_config` field specifies the model architecture, checkpoint repository, engine parameters, and more; the full options are detailed in the [Engine-Builder configuration guide](/engines/engine-builder-llm/engine-builder-config). ```python theme={"system"} import truss_chains as chains from truss.base import trt_llm_config, truss_config class Llama7BChainlet(chains.EngineBuilderLLMChainlet): remote_config = chains.RemoteConfig( compute=chains.Compute(gpu=truss_config.Accelerator.H100), assets=chains.Assets(secret_keys=["hf_access_token"]), ) engine_builder_config = truss_config.TRTLLMConfiguration( build=trt_llm_config.TrussTRTLLMBuildConfiguration( base_model=trt_llm_config.TrussTRTLLMModel.LLAMA, checkpoint_repository=trt_llm_config.CheckpointRepository( source=trt_llm_config.CheckpointSource.HF, repo="meta-llama/Llama-3.1-8B-Instruct", ), max_batch_size=8, max_seq_len=4096, tensor_parallel_count=1, ) ) ``` ## Differences from standard Chainlets * No `run_remote` implementation: Unlike regular Chainlets, `EngineBuilderLLMChainlet` doesn't require users to implement `run_remote()`. Instead, it automatically wires into the deployed engine's API.
All LLM Chainlets have the same function signature: `chains.EngineBuilderLLMInput` as input and a stream (`AsyncIterator`) of strings as output. Likewise, `EngineBuilderLLMChainlet`s can only be used as dependencies; they cannot have dependencies themselves. * No `run_local` ([guide](/development/chain/localdev)) or `watch` ([guide](/development/chain/watch)): Standard Chains support a local debugging mode and watch. However, when using `EngineBuilderLLMChainlet`, local execution isn't available, and testing must be done after deployment. For a faster dev loop on the rest of your Chain (everything except the Engine-Builder Chainlet), you can substitute those Chainlets with stubs, as you can for an already deployed Truss model ([guide](/development/chain/stub)). ## Integrate the Engine-Builder chainlet After defining an `EngineBuilderLLMChainlet` like `Llama7BChainlet` above, you can use it as a dependency in other conventional chainlets: ```python theme={"system"} from typing import AsyncIterator import truss_chains as chains @chains.mark_entrypoint class TestController(chains.ChainletBase): """Example using the Engine-Builder Chainlet in another Chainlet.""" def __init__(self, llm=chains.depends(Llama7BChainlet)) -> None: self._llm = llm async def run_remote(self, prompt: str) -> AsyncIterator[str]: messages = [{"role": "user", "content": prompt}] llm_input = chains.EngineBuilderLLMInput(messages=messages) async for chunk in self._llm.run_remote(llm_input): yield chunk ``` # Error Handling Source: https://docs.baseten.co/development/chain/errorhandling Understanding and handling Chains errors Error handling in Chains follows the principle that the root cause "bubbles up" to the entrypoint, which returns an error response, similar to how Python stack traces contain all the layers from where an exception was raised up to the main function.
Consider the case of a Chain where the entrypoint calls `run_remote` of a Chainlet named `TextToNum` and this in turn invokes `TextReplicator`. The respective `run_remote` methods might also use other helper functions that appear in the call stack. Below is an example stack trace that shows how the root cause (a `ValueError`) is propagated up to the entrypoint's `run_remote` method (this is what you would see as an error log): ``` Chainlet-Traceback (most recent call last): File "/packages/itest_chain.py", line 132, in run_remote value = self._accumulate_parts(text_parts.parts) File "/packages/itest_chain.py", line 144, in _accumulate_parts value += self._text_to_num.run_remote(part) ValueError: (showing chained remote errors, root error at the bottom) ├─ Error in dependency Chainlet `TextToNum`: │ Chainlet-Traceback (most recent call last): │ File "/packages/itest_chain.py", line 87, in run_remote │ generated_text = self._replicator.run_remote(data) │ ValueError: (showing chained remote errors, root error at the bottom) │ ├─ Error in dependency Chainlet `TextReplicator`: │ │ Chainlet-Traceback (most recent call last): │ │ File "/packages/itest_chain.py", line 52, in run_remote │ │ validate_data(data) │ │ File "/packages/itest_chain.py", line 36, in validate_data │ │ raise ValueError(f"This input is too long: {len(data)}.") ╰ ╰ ValueError: This input is too long: 100. ``` ## Exception handling and retries The stack trace above is what you see if you don't catch the exception. It is possible to add error handling around each remote Chainlet invocation. Chains tries to raise the same exception class on the *caller* Chainlet as was raised in the *dependency* Chainlet. * Built-in exceptions (for example, `ValueError`) always work. * Custom or third-party exceptions (for example, from `torch`) can only be re-raised in the caller if they are included in the caller's dependencies as well.
If the exception class cannot be resolved, a `GenericRemoteException` is raised instead. Note that the *message* of re-raised exceptions is the concatenation of the original message and the formatted stack trace of the dependency Chainlet. In some cases it might make sense to simply retry a remote invocation (for example, if it failed due to transient problems such as networking issues or other "flaky" behavior). `depends` can be configured with additional [options](/reference/sdk/chains#truss-chains-depends) for that. The example below shows how you can add automatic retries and error handling for the call to `TextReplicator` in `TextToNum`: ```python theme={"system"} import truss_chains as chains class TextToNum(chains.ChainletBase): def __init__( self, replicator: TextReplicator = chains.depends(TextReplicator, retries=3), ) -> None: self._replicator = replicator async def run_remote(self, data: ...): try: generated_text = await self._replicator.run_remote(data) except ValueError: ... # Handle error. ``` ## Stack filtering The stack trace is intended to show the user-implemented code in `run_remote` (and user-implemented helper functions). Under the hood, the calls from one Chainlet to another go through an HTTP connection managed by the Chains framework, and each Chainlet itself runs as a FastAPI server with several layers of request-handling code "above". To provide concise, readable stacks, all of this non-user code is filtered out. # Your first Chain Source: https://docs.baseten.co/development/chain/getting-started Build and deploy two example Chains This quickstart guide contains instructions for creating two Chains: 1. A simple CPU-only "hello world" Chain. 2. A Chain that implements Phi-3 Mini and uses it to write poems.
## Prerequisites Install [Truss](https://pypi.org/project/truss/): ```bash theme={"system"} uv venv && source .venv/bin/activate uv pip install --upgrade truss ``` ```bash theme={"system"} python3 -m venv .venv && source .venv/bin/activate pip install --upgrade truss ``` ```bash theme={"system"} python3 -m venv .venv && .venv\Scripts\activate pip install --upgrade truss ``` You also need a [Baseten account](https://app.baseten.co/signup) with an [API key](https://app.baseten.co/settings/account/api_keys). ## Example: Hello World Chains are written in Python files. In your working directory, create `hello_chain/hello.py`: ```sh theme={"system"} mkdir hello_chain cd hello_chain touch hello.py ``` In the file, we'll specify a basic Chain. It has two Chainlets: * `HelloWorld`, the entrypoint, which handles the input and output. * `RandInt`, which generates a random integer. It is used as a dependency by `HelloWorld`. Via the entrypoint, the Chain takes a maximum value and returns the string "Hello World! " repeated a variable number of times. ```python hello.py theme={"system"} import random import truss_chains as chains class RandInt(chains.ChainletBase): async def run_remote(self, max_value: int) -> int: return random.randint(1, max_value) @chains.mark_entrypoint class HelloWorld(chains.ChainletBase): def __init__(self, rand_int=chains.depends(RandInt, retries=3)) -> None: self._rand_int = rand_int async def run_remote(self, max_value: int) -> str: num_repetitions = await self._rand_int.run_remote(max_value) return "Hello World! " * num_repetitions ``` ### The Chainlet class-contract Exactly one Chainlet must be marked as the entrypoint with the [`@chains.mark_entrypoint`](/reference/sdk/chains#truss-chains-mark-entrypoint) decorator. This Chainlet is responsible for handling public-facing input and output for the whole Chain in response to an API call.
A Chainlet class has a single public method, [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets), which is the API endpoint for the entrypoint Chainlet and the function that other Chainlets can use as a dependency. The [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets) method must be fully type-annotated with primitive python types or [pydantic models](https://docs.pydantic.dev/latest/). Chainlets cannot be naively instantiated. The only correct usages are: 1. Make one Chainlet depend on another one via the [`chains.depends()`](/reference/sdk/chains#truss-chains-depends) directive as an `__init__`-argument as shown above for the `RandInt` Chainlet. 2. In the [local debugging mode](/development/chain/localdev#test-a-chain-locally). Beyond that, you can structure your code as you like, with private methods, imports from other files, and so forth. Keep in mind that Chainlets are intended for distributed, replicated, remote execution, so using global variables, global state, and certain Python features like importing modules dynamically at runtime should be avoided as they may not work as intended. ### Deploy your Chain to Baseten To deploy your Chain to Baseten, run: ```bash theme={"system"} truss chains push --watch hello.py ``` The deploy command results in an output like this: ``` ⛓️ HelloWorld - Chainlets ⛓️ ╭──────────────────────┬─────────────────────────┬─────────────╮ │ Status │ Name │ Logs URL │ ├──────────────────────┼─────────────────────────┼─────────────┤ │ 💚 ACTIVE │ HelloWorld (entrypoint) │ https://... │ ├──────────────────────┼─────────────────────────┼─────────────┤ │ 💚 ACTIVE │ RandInt (dep) │ https://... │ ╰──────────────────────┴─────────────────────────┴─────────────╯ Deployment succeeded. 
You can run the chain with: curl -X POST 'https://chain-.../run_remote' \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -d '' ``` Wait for the status to turn to `ACTIVE` and test invoking your Chain (replace `$INVOCATION_URL` in the command below): ```bash theme={"system"} curl -X POST $INVOCATION_URL \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -d '{"max_value": 10}' # "Hello World! Hello World! Hello World! " ``` ## Example: Poetry with LLMs Our second example also has two Chainlets, but is somewhat more complex and realistic. The Chainlets are: * `PoemGenerator`, the entrypoint, which handles the input and output and orchestrates calls to the LLM. * `PhiLLM`, which runs inference on Phi-3 Mini. This Chain takes a list of words and returns a poem about each word, written by Phi-3. Here's the architecture: We build this Chain in a new working directory (if you are still inside `hello_chain/`, go up one level with `cd ..` first): ```sh theme={"system"} mkdir poetry_chain cd poetry_chain touch poems.py ``` A similar end-to-end code example, using Mistral as an LLM, is available in the [examples repo](https://github.com/basetenlabs/model/tree/main/truss-chains/examples/mistral). ### Building the LLM Chainlet The main difference between this Chain and the previous one is that we now have an LLM that needs a GPU and more complex dependencies. Copy the following code into `poems.py`: ```python poems.py theme={"system"} import asyncio from typing import List import pydantic import truss_chains as chains from truss import truss_config PHI_HF_MODEL = "microsoft/Phi-3-mini-4k-instruct" # This configures caching of the model weights from the Hugging Face repo # in the docker image that is used for deploying the Chainlet.
PHI_CACHE = truss_config.ModelRepo(
    repo_id=PHI_HF_MODEL, allow_patterns=["*.json", "*.safetensors", ".model"]
)


class Messages(pydantic.BaseModel):
    messages: List[dict[str, str]]


class PhiLLM(chains.ChainletBase):
    # `remote_config` defines the resources required for this chainlet.
    remote_config = chains.RemoteConfig(
        docker_image=chains.DockerImage(
            # The phi model needs some extra Python packages.
            pip_requirements=[
                "accelerate==0.30.1",
                "einops==0.8.0",
                "transformers==4.41.2",
                "torch==2.3.0",
            ]
        ),
        # The phi model needs a GPU and more CPUs.
        compute=chains.Compute(cpu_count=2, gpu="T4"),
        # Cache the model weights in the image.
        assets=chains.Assets(cached=[PHI_CACHE]),
    )

    def __init__(self) -> None:
        # Note that the imports of the *specific* Python requirements are
        # pushed down to here. This code will only be executed on the
        # remotely deployed Chainlet, not in the local environment,
        # so we don't need to install these packages in the local
        # dev environment.
        import torch
        import transformers

        self._model = transformers.AutoModelForCausalLM.from_pretrained(
            PHI_HF_MODEL,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        self._tokenizer = transformers.AutoTokenizer.from_pretrained(
            PHI_HF_MODEL,
        )
        self._generate_args = {
            "max_new_tokens": 512,
            "temperature": 1.0,
            "top_p": 0.95,
            "top_k": 50,
            "repetition_penalty": 1.0,
            "no_repeat_ngram_size": 0,
            "use_cache": True,
            "do_sample": True,
            "eos_token_id": self._tokenizer.eos_token_id,
            "pad_token_id": self._tokenizer.pad_token_id,
        }

    async def run_remote(self, messages: Messages) -> str:
        import torch

        # `apply_chat_template` expects the plain list of message dicts,
        # not the pydantic wrapper.
        model_inputs = self._tokenizer.apply_chat_template(
            messages.messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self._tokenizer(model_inputs, return_tensors="pt")
        input_ids = inputs["input_ids"].to("cuda")
        with torch.no_grad():
            outputs = self._model.generate(input_ids=input_ids, **self._generate_args)
        output_text = self._tokenizer.decode(outputs[0], skip_special_tokens=True)
        return output_text
```

### Building the entrypoint
Now that we have an LLM, we can use it in a poem generator Chainlet. Add the following code to `poems.py`:

```python poems.py theme={"system"}
import asyncio


@chains.mark_entrypoint
class PoemGenerator(chains.ChainletBase):
    def __init__(self, phi_llm: PhiLLM = chains.depends(PhiLLM)) -> None:
        self._phi_llm = phi_llm

    async def run_remote(self, words: list[str]) -> list[str]:
        tasks = []
        for word in words:
            messages = Messages(
                messages=[
                    {
                        "role": "system",
                        "content": (
                            "You are a poet who writes short, "
                            "lighthearted, amusing poetry."
                        ),
                    },
                    {"role": "user", "content": f"Write a poem about {word}"},
                ]
            )
            tasks.append(asyncio.ensure_future(self._phi_llm.run_remote(messages)))
        await asyncio.sleep(0)  # Yield to the event loop, to allow tasks to start.
        return list(await asyncio.gather(*tasks))
```

Note that we wrap each RPC to the LLM Chainlet in `asyncio.ensure_future`. This makes the current Python process start the remote calls concurrently, i.e. each call is dispatched before the previous one has finished, minimizing overall runtime. To await the results of all calls, `asyncio.gather` is used, which returns normal Python objects.

If the LLM is hit with many concurrent requests, it can scale up automatically (if autoscaling is configured). More advanced LLMs have batching capabilities, so even a single instance can serve concurrent requests.
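The fan-out pattern above can be sketched in plain asyncio, independent of any Chains machinery. In this sketch, `slow_echo` is an invented stand-in for the remote LLM call:

```python theme={"system"}
import asyncio


async def slow_echo(word: str) -> str:
    # Stand-in for a remote LLM call with some latency.
    await asyncio.sleep(0.01)
    return f"poem about {word}"


async def fan_out(words: list[str]) -> list[str]:
    # Start all calls before awaiting any of them, as in PoemGenerator:
    # each task is dispatched while the others are still in flight.
    tasks = [asyncio.ensure_future(slow_echo(w)) for w in words]
    return list(await asyncio.gather(*tasks))


print(asyncio.run(fan_out(["bird", "plane"])))
# ['poem about bird', 'poem about plane']
```

Because all tasks sleep concurrently, the total runtime stays close to the latency of a single call rather than growing with the number of words.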
### Deploy your Chain to Baseten

To deploy your Chain to Baseten, run:

```bash theme={"system"}
truss chains push --watch poems.py
```

Wait for the status to turn to `ACTIVE` and test invoking your Chain (replace `$INVOCATION_URL` in the command below):

```bash theme={"system"}
curl -X POST $INVOCATION_URL \
    -H "Authorization: Api-Key $BASETEN_API_KEY" \
    -d '{"words": ["bird", "plane", "superman"]}'
#[[
#" [INST] Generate a poem about: bird [/INST] In the quiet hush of...",
#" [INST] Generate a poem about: plane [/INST] In the vast, boundless...",
#" [INST] Generate a poem about: superman [/INST] In the realm where..."
#]]
```

# Invocation
Source: https://docs.baseten.co/development/chain/invocation

Call your deployed Chain

Once your Chain is deployed, you can call it via its API endpoint. Chains use the same inference API as models:

* [Environment endpoint](/reference/inference-api/predict-endpoints/environments-run-remote)
* [Development endpoint](/reference/inference-api/predict-endpoints/development-run-remote)
* [Endpoint by ID](/reference/inference-api/predict-endpoints/deployment-run-remote)

Here's an example that calls the development deployment:

```python call_chain.py theme={"system"}
import os

import requests

# From the Chain overview page on Baseten
# E.g. "https://chain-.api.baseten.co/development/run_remote"
CHAIN_URL = ""
baseten_api_key = os.environ["BASETEN_API_KEY"]
# JSON keys and types match the `run_remote` method signature.
data = {...}

resp = requests.post(
    CHAIN_URL,
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json=data,
)
print(resp.json())
```

### How to pass chain input

The data schema of the inference request corresponds to the function signature of [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets) in your entrypoint Chainlet.
For example, for the Hello Chain, `HelloAll.run_remote()`:

```python theme={"system"}
async def run_remote(self, names: list[str]) -> str:
```

You'd pass the following JSON payload:

```json theme={"system"}
{
  "names": ["Marius", "Sid", "Bola"]
}
```

That is, the keys in the JSON record match the argument names, and the values must match the argument types of `run_remote()`.

### Async chain inference

Like Truss models, Chains support async invocation. The [guide for models](/inference/async) largely applies, in particular for how to wrap the input and set up the webhook to process results. The following additional points are Chains-specific:

* Use Chain-based URLs:
  * `https://chain-{chain}.api.baseten.co/production/async_run_remote`
  * `https://chain-{chain}.api.baseten.co/development/async_run_remote`
  * `https://chain-{chain}.api.baseten.co/deployment/{deployment}/async_run_remote`
  * `https://chain-{chain}.api.baseten.co/environments/{env_name}/async_run_remote`
* Only the entrypoint is invoked asynchronously. Internal Chainlet-Chainlet calls run synchronously.

# Local Development
Source: https://docs.baseten.co/development/chain/localdev

Iterating, Debugging, Testing, Mocking

Chains are designed for production in replicated remote deployments. But alongside that production-ready power, we offer great local development and deployment experiences.

Chains exists to help you build multi-step, multi-model pipelines. The abstractions that Chains introduces are based on six opinionated principles: three for architecture and three for developer experience.

**Architecture principles**

Each step in the pipeline can set its own hardware requirements and software dependencies, separating GPU and CPU workloads.

Each component has independent autoscaling parameters for targeted resource allocation, removing bottlenecks from your pipelines.
Components specify a single public interface for flexible-but-safe composition and are reusable between projects.

**Developer experience principles**

Eliminate entire taxonomies of bugs by writing typed Python code and validating inputs, outputs, module initializations, function signatures, and even remote server configurations.

Seamless local testing and cloud deployments: test Chains locally with support for mocking the output of any step, and simplify your cloud deployment loops by separating large model deployments from quick updates to glue code.

Use Chains to orchestrate existing model deployments, like pre-packaged models from Baseten's model library, alongside new model pipelines built entirely within Chains.

Locally, a Chain is just Python files in a source tree. While that gives you a lot of flexibility in how you structure your code, there are some constraints and rules to follow to ensure successful distributed, remote execution in production.

The best thing you can do while developing locally with Chains is to run your code frequently, even if you do not have a `__main__` section: the Chains framework runs various validations at module initialization to help you catch issues early. Additionally, running `mypy` and fixing reported type errors can help you find problems early, in a rapid feedback loop, before attempting a (much slower) deployment.

Complementary to purely local development, Chains also has a "watch" mode, like Truss; see the [watch guide](/development/chain/watch).
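To illustrate why merely running your module catches errors early, here is a minimal, framework-free sketch of validation at class-definition (i.e. import) time. `ValidatedBase` is an invented stand-in; Chains' real checks are far richer:

```python theme={"system"}
import inspect


class ValidatedBase:
    # Toy stand-in for ChainletBase: subclasses are validated when the
    # module defining them is imported, not when they are first run.
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        run = getattr(cls, "run_remote", None)
        if run is None:
            raise TypeError(f"{cls.__name__} must define run_remote()")
        for name, param in inspect.signature(run).parameters.items():
            if name != "self" and param.annotation is inspect.Parameter.empty:
                raise TypeError(
                    f"{cls.__name__}.run_remote: parameter {name!r} "
                    "must be type-annotated"
                )


class Good(ValidatedBase):
    async def run_remote(self, name: str) -> str:
        return f"Hello, {name}"


# A violation surfaces as soon as the class statement executes:
try:
    class Bad(ValidatedBase):
        async def run_remote(self, name):
            return name
except TypeError as exc:
    print(exc)  # Bad.run_remote: parameter 'name' must be type-annotated
```

This is why simply executing the file, or importing it, gives you a fast feedback loop before any deployment.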
## Test a Chain locally

Let's revisit our "Hello World" Chain:

```python hello_chain/hello.py theme={"system"}
import asyncio

import truss_chains as chains


# This Chainlet does the work
class SayHello(chains.ChainletBase):
    async def run_remote(self, name: str) -> str:
        return f"Hello, {name}"


# This Chainlet orchestrates the work
@chains.mark_entrypoint
class HelloAll(chains.ChainletBase):
    def __init__(self, say_hello_chainlet=chains.depends(SayHello)) -> None:
        self._say_hello = say_hello_chainlet

    async def run_remote(self, names: list[str]) -> str:
        tasks = []
        for name in names:
            tasks.append(asyncio.ensure_future(self._say_hello.run_remote(name)))
        return "\n".join(await asyncio.gather(*tasks))


# Test the Chain locally
if __name__ == "__main__":
    with chains.run_local():
        hello_chain = HelloAll()
        result = asyncio.run(hello_chain.run_remote(["Marius", "Sid", "Bola"]))
        print(result)
```

When the module is run as `__main__`, local instances of the Chainlets are created, allowing you to test the functionality of your Chain just by executing the Python file:

```bash theme={"system"}
cd hello_chain
python hello.py
# Hello, Marius
# Hello, Sid
# Hello, Bola
```

## Mock execution of GPU Chainlets

Using `run_local()` to run your code locally requires that your development environment have the compute resources and dependencies that each Chainlet needs. But that often isn't possible when building with AI models.

Chains offers a workaround, mocking, to let you test the coordination and business logic of your multi-step inference pipeline without worrying about running the model locally.

The second example in the [getting started guide](/development/chain/getting-started) implements a Truss Chain for generating poems with Phi-3. This Chain has two Chainlets:

1. The `PhiLLM` Chainlet, which needs an NVIDIA GPU such as the T4.
2. The `PoemGenerator` Chainlet, which easily runs on a CPU.

If you have an NVIDIA T4 under your desk, good for you.
For the rest of us, we can mock the `PhiLLM` Chainlet that is infeasible to run locally, so that we can quickly test the `PoemGenerator` Chainlet.

To do this, we define a mock Phi-3 model in our `__main__` module and give it a [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets) method that accepts the same input and produces a test output matching the output type we expect from the real Chainlet. Then, we inject an instance of this mock Chainlet into our Chain:

```python poems.py theme={"system"}
if __name__ == "__main__":

    class FakePhiLLM:
        async def run_remote(self, messages: Messages) -> str:
            # The last user message is "Write a poem about {word}".
            word = messages.messages[-1]["content"].split(" ")[-1]
            return f"Here's a poem about {word}"

    with chains.run_local():
        poem_generator = PoemGenerator(phi_llm=FakePhiLLM())
        result = asyncio.run(
            poem_generator.run_remote(words=["bird", "plane", "superman"])
        )
        print(result)
```

And run your Python file:

```bash theme={"system"}
python poems.py
# ["Here's a poem about bird", "Here's a poem about plane", "Here's a poem about superman"]
```

### Typing of mocks

You may notice that the argument `phi_llm` expects a type `PhiLLM`, while we pass an instance of `FakePhiLLM`. These aren't the same, which is formally a type error.

However, this works at runtime because we constructed `FakePhiLLM` to implement the same *protocol* as the real thing. We can make this explicit by defining a `Protocol` as a type annotation:

```python theme={"system"}
from typing import Protocol


class PhiProtocol(Protocol):
    async def run_remote(self, messages: Messages) -> str: ...
```

and changing the argument type in `PoemGenerator`:

```python theme={"system"}
@chains.mark_entrypoint
class PoemGenerator(chains.ChainletBase):
    def __init__(self, phi_llm: PhiProtocol = chains.depends(PhiLLM)) -> None:
        self._phi_llm = phi_llm
```

This is a bit more work and not needed to execute the code, but it shows how typing consistency can be achieved if desired.
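As a side note, if you also want a cheap runtime sanity check on a mock, not just static typing, `typing.runtime_checkable` lets `isinstance` verify that the mock provides the method names a protocol requires (it checks only attribute presence, not signatures). A small self-contained sketch with invented names:

```python theme={"system"}
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMProtocol(Protocol):
    async def run_remote(self, prompt: str) -> str: ...


class FakeLLM:
    async def run_remote(self, prompt: str) -> str:
        return f"Here's a poem about {prompt}"


# isinstance works against @runtime_checkable protocols, but only
# verifies that the named methods exist on the instance.
assert isinstance(FakeLLM(), LLMProtocol)
print(asyncio.run(FakeLLM().run_remote("bird")))
# Here's a poem about bird
```

This can catch a misspelled or missing method on a mock before the Chain is even invoked, complementing the static `Protocol` annotation shown above.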
# Overview Source: https://docs.baseten.co/development/chain/overview Chains is a framework for building robust, performant multi-step and multi-model inference pipelines and deploying them to production. It addresses the common challenges of managing latency, cost and dependencies for complex workflows, while leveraging Truss’ existing battle-tested performance, reliability and developer toolkit.