# AI tools

Source: https://docs.baseten.co/ai-tools

Connect AI tools to Baseten documentation for context-aware assistance with deploying and serving models.

Baseten docs are optimized for AI tools. Connect your assistants, coding tools, and agents directly to the docs so they have up-to-date context when helping you build on Baseten. Every page includes a contextual menu (the icon in the top-right corner of any page) with shortcuts to copy content and connect your MCP server.

## MCP server

The Model Context Protocol (MCP) connects AI tools directly to Baseten documentation. When connected, your AI tool searches the docs in real time while generating responses, so you get answers grounded in current documentation rather than stale training data.

The Baseten docs MCP server is available at:

```
https://docs.baseten.co/mcp
```

Add the MCP server to Claude Code:

```bash theme={"system"}
claude mcp add --transport http baseten-docs https://docs.baseten.co/mcp
```

Claude Code searches Baseten docs automatically when relevant to your prompts.

In the Claude app, navigate to the **Connectors** page in Claude settings. Select **Add custom connector**, then enter:

* **Name:** Baseten Docs
* **URL:** `https://docs.baseten.co/mcp`

Select **Add**. When starting a conversation, select the attachments button (the plus icon) and choose the Baseten Docs connector. Claude searches the docs as needed while responding.

In Cursor, use `Cmd + Shift + P` (macOS) or `Ctrl + Shift + P` (Windows/Linux) to open the command palette. Search for **"Open MCP settings"**. Select **Add custom MCP**. This opens your `mcp.json` file.
Add the Baseten docs server:

```json mcp.json theme={"system"}
{
  "mcpServers": {
    "baseten-docs": {
      "type": "http",
      "url": "https://docs.baseten.co/mcp"
    }
  }
}
```

Create or update `.vscode/mcp.json` in your project:

```json .vscode/mcp.json theme={"system"}
{
  "servers": {
    "baseten-docs": {
      "type": "http",
      "url": "https://docs.baseten.co/mcp"
    }
  }
}
```

### Other MCP clients

Any MCP-compatible tool (Goose, ChatGPT, Windsurf, and others) can connect using the server URL `https://docs.baseten.co/mcp`. Refer to your tool's documentation for how to add an MCP server.

You can also use `npx add-mcp` to auto-detect supported AI tools on your system and configure them:

```bash theme={"system"}
npx add-mcp https://docs.baseten.co
```

## Skills

The skills file describes what AI agents can accomplish with Baseten, including required inputs and constraints. AI coding tools use this file to understand Baseten's capabilities without reading every documentation page.

Install the Baseten docs skill into your AI coding tool:

```bash theme={"system"}
npx skills add https://docs.baseten.co
```

This gives your AI tool structured knowledge of Baseten's capabilities so it can help you deploy models, configure autoscaling, set up inference endpoints, and more with product-aware guidance. View the skill file directly at [docs.baseten.co/skill.md](https://docs.baseten.co/skill.md).

Skills and MCP serve complementary purposes. **Skills** tell an AI tool *what Baseten can do* and how to do it. **MCP** lets the tool *search current documentation* for specific details. For the best results, install both.

## llms.txt

The `llms.txt` file is an industry-standard directory that helps LLMs index documentation efficiently, similar to how `sitemap.xml` helps search engines. Baseten docs automatically host two versions:

* [docs.baseten.co/llms.txt](https://docs.baseten.co/llms.txt): a structured list of all pages with descriptions.
* [docs.baseten.co/llms-full.txt](https://docs.baseten.co/llms-full.txt): the full text content of all pages.

These files stay up to date automatically and require no configuration. AI tools and search engines like ChatGPT, Perplexity, and Google AI Overviews use them to understand and cite Baseten documentation.

## Markdown access

Every documentation page is available as Markdown by appending `.md` to the URL. For example:

```
https://docs.baseten.co/quickstart.md
```

AI agents receive page content as Markdown instead of HTML, which reduces token usage and improves processing speed. You can use this to quickly copy any page's content into an AI conversation.

## Contextual menu reference

The contextual menu on each page provides one-click access to these integrations. Select the menu icon in the top-right corner of any page.

| Option              | Description                                               |
| ------------------- | --------------------------------------------------------- |
| Copy page           | Copies the page as Markdown for pasting into any AI tool. |
| View as Markdown    | Opens the page as raw Markdown in a new tab.              |
| Copy MCP server URL | Copies the MCP server URL to your clipboard.              |
| Connect to Cursor   | Installs the MCP server in Cursor.                        |
| Connect to VS Code  | Installs the MCP server in VS Code.                       |

# Cancel a queued async request.

Source: https://docs.baseten.co/api-reference/cancel-a-queued-async-request

/reference/inference-api/inference-api-spec.json delete /async_request/{request_id}

Cancels an async request. Only requests with `QUEUED` status may be canceled. Rate limited to 20 requests per second.

# Get the status of an async request.

Source: https://docs.baseten.co/api-reference/get-the-status-of-an-async-request

/reference/inference-api/inference-api-spec.json get /async_request/{request_id}

Returns the current status of an async model or chain request. Rate limited to 20 requests per second.

# Asynchronously call a named environment of a chain.
Source: https://docs.baseten.co/api-reference/non-regional/asynchronously-call-a-named-environment-of-a-chain

/reference/inference-api/inference-api-spec.json post /environments/{env_name}/async_run_remote

# Asynchronously call a named environment of a model.

Source: https://docs.baseten.co/api-reference/non-regional/asynchronously-call-a-named-environment-of-a-model

/reference/inference-api/inference-api-spec.json post /environments/{env_name}/async_predict

# Asynchronously call a specific deployment of a chain.

Source: https://docs.baseten.co/api-reference/non-regional/asynchronously-call-a-specific-deployment-of-a-chain

/reference/inference-api/inference-api-spec.json post /deployment/{deployment_id}/async_run_remote

# Asynchronously call a specific deployment of a model.

Source: https://docs.baseten.co/api-reference/non-regional/asynchronously-call-a-specific-deployment-of-a-model

/reference/inference-api/inference-api-spec.json post /deployment/{deployment_id}/async_predict

# Asynchronously call the development deployment of a chain.

Source: https://docs.baseten.co/api-reference/non-regional/asynchronously-call-the-development-deployment-of-a-chain

/reference/inference-api/inference-api-spec.json post /development/async_run_remote

# Asynchronously call the development deployment of a model.

Source: https://docs.baseten.co/api-reference/non-regional/asynchronously-call-the-development-deployment-of-a-model

/reference/inference-api/inference-api-spec.json post /development/async_predict

# Asynchronously call the production environment of a chain.

Source: https://docs.baseten.co/api-reference/non-regional/asynchronously-call-the-production-environment-of-a-chain

/reference/inference-api/inference-api-spec.json post /production/async_run_remote

Enqueues an asynchronous request for the chain deployment promoted to the production environment.

# Asynchronously call the production environment of a model.
Source: https://docs.baseten.co/api-reference/non-regional/asynchronously-call-the-production-environment-of-a-model

/reference/inference-api/inference-api-spec.json post /production/async_predict

Enqueues an asynchronous predict request for the deployment promoted to the production environment. Returns a request ID that can be used to poll for status or cancel the request.

# Call a specific chain deployment by deployment ID.

Source: https://docs.baseten.co/api-reference/non-regional/call-a-specific-chain-deployment-by-deployment-id

/reference/inference-api/inference-api-spec.json post /deployment/{deployment_id}/run_remote

# Call a specific deployment of a model by deployment ID.

Source: https://docs.baseten.co/api-reference/non-regional/call-a-specific-deployment-of-a-model-by-deployment-id

/reference/inference-api/inference-api-spec.json post /deployment/{deployment_id}/predict

Sends a synchronous predict request to the specified deployment.

# Call the chain deployment associated with a specified environment.

Source: https://docs.baseten.co/api-reference/non-regional/call-the-chain-deployment-associated-with-a-specified-environment

/reference/inference-api/inference-api-spec.json post /environments/{env_name}/run_remote

# Call the development deployment of a chain.

Source: https://docs.baseten.co/api-reference/non-regional/call-the-development-deployment-of-a-chain

/reference/inference-api/inference-api-spec.json post /development/run_remote

# Call the development deployment of a model.

Source: https://docs.baseten.co/api-reference/non-regional/call-the-development-deployment-of-a-model

/reference/inference-api/inference-api-spec.json post /development/predict

Sends a synchronous predict request to the development deployment.

# Call the model deployment associated with a specified environment.
Source: https://docs.baseten.co/api-reference/non-regional/call-the-model-deployment-associated-with-a-specified-environment

/reference/inference-api/inference-api-spec.json post /environments/{env_name}/predict

Sends a synchronous predict request to the deployment promoted to the specified environment.

# Call the production environment of a chain.

Source: https://docs.baseten.co/api-reference/non-regional/call-the-production-environment-of-a-chain

/reference/inference-api/inference-api-spec.json post /production/run_remote

Sends a synchronous request to the chain deployment promoted to the production environment. The request body is forwarded to the chain's `run_remote` entrypoint.

# Call the production environment of a model.

Source: https://docs.baseten.co/api-reference/non-regional/call-the-production-environment-of-a-model

/reference/inference-api/inference-api-spec.json post /production/predict

Sends a synchronous predict request to the deployment promoted to the production environment. The request body is forwarded directly to the model's `predict` function.

# Get async queue status for a named environment.

Source: https://docs.baseten.co/api-reference/non-regional/get-async-queue-status-for-a-named-environment

/reference/inference-api/inference-api-spec.json get /environments/{env_name}/async_queue_status

# Get async queue status for a specific deployment.

Source: https://docs.baseten.co/api-reference/non-regional/get-async-queue-status-for-a-specific-deployment

/reference/inference-api/inference-api-spec.json get /deployment/{deployment_id}/async_queue_status

# Get async queue status for the development deployment.

Source: https://docs.baseten.co/api-reference/non-regional/get-async-queue-status-for-the-development-deployment

/reference/inference-api/inference-api-spec.json get /development/async_queue_status

# Get async queue status for the production environment.
Source: https://docs.baseten.co/api-reference/non-regional/get-async-queue-status-for-the-production-environment

/reference/inference-api/inference-api-spec.json get /production/async_queue_status

Returns the number of queued and in-progress async requests for the deployment promoted to the production environment. Rate limited to 20 requests per second.

# Wake a named environment of a model.

Source: https://docs.baseten.co/api-reference/non-regional/wake-a-named-environment-of-a-model

/reference/inference-api/inference-api-spec.json post /environments/{env_name}/wake

# Wake a specific deployment of a model by deployment ID.

Source: https://docs.baseten.co/api-reference/non-regional/wake-a-specific-deployment-of-a-model-by-deployment-id

/reference/inference-api/inference-api-spec.json post /deployment/{deployment_id}/wake

# Wake the development deployment of a model.

Source: https://docs.baseten.co/api-reference/non-regional/wake-the-development-deployment-of-a-model

/reference/inference-api/inference-api-spec.json post /development/wake

# Wake the production environment of a model.

Source: https://docs.baseten.co/api-reference/non-regional/wake-the-production-environment-of-a-model

/reference/inference-api/inference-api-spec.json post /production/wake

Triggers a wake for the deployment promoted to the production environment. Returns immediately with 202 Accepted.

# Asynchronously call a regional environment of a chain.

Source: https://docs.baseten.co/api-reference/regional/asynchronously-call-a-regional-environment-of-a-chain

/reference/inference-api/inference-api-spec.json post /async_run_remote

Enqueues an asynchronous run_remote request via a regional hostname. The environment is determined by the hostname, not the path.

# Asynchronously call a regional environment of a model.
Source: https://docs.baseten.co/api-reference/regional/asynchronously-call-a-regional-environment-of-a-model

/reference/inference-api/inference-api-spec.json post /async_predict

Enqueues an asynchronous predict request via a regional hostname. The environment is determined by the hostname, not the path.

# Call a regional environment of a chain.

Source: https://docs.baseten.co/api-reference/regional/call-a-regional-environment-of-a-chain

/reference/inference-api/inference-api-spec.json post /run_remote

Sends a synchronous run_remote request via a regional hostname. The environment is determined by the hostname, not the path.

# Call a regional environment of a model.

Source: https://docs.baseten.co/api-reference/regional/call-a-regional-environment-of-a-model

/reference/inference-api/inference-api-spec.json post /predict

Sends a synchronous predict request via a regional hostname. The environment is determined by the hostname, not the path.

# Get async queue status for a regional environment.

Source: https://docs.baseten.co/api-reference/regional/get-async-queue-status-for-a-regional-environment

/reference/inference-api/inference-api-spec.json get /async_queue_status

# Wake a regional environment of a model.

Source: https://docs.baseten.co/api-reference/regional/wake-a-regional-environment-of-a-model

/reference/inference-api/inference-api-spec.json post /wake

# How Baseten works

Source: https://docs.baseten.co/concepts/howbasetenworks

Follow a model from truss push to a running endpoint: the build pipeline, request routing, autoscaling, and deployment lifecycle.

The [overview](/overview) covers Baseten's capabilities. This page covers the underlying mechanics: how a config file becomes a running endpoint, how Baseten routes requests to your model, how the autoscaler manages capacity, and how you promote a model from development to production.

## Multi-cloud capacity management (MCM)

Behind every Baseten deployment is the multi-cloud capacity management (MCM) system.
MCM acts as the infrastructure control plane, unifying thousands of GPUs across multiple cloud service providers and geographic regions. When you request a resource (an H100 in US-East-1 or a cluster of B200s in a private region), MCM provisions the hardware, configures networking, and monitors health. It abstracts differences between cloud providers to ensure the Baseten Inference Stack runs identically on any underlying infrastructure.

This system powers Baseten's high availability by enabling active-active deployments across different clouds. If a region or provider faces a capacity crunch or outage, MCM rapidly re-routes and re-provisions workloads to maintain service continuity.

## The build pipeline

With MCM handling infrastructure provisioning, here's what happens when you deploy a model.

When you run `truss push`, the CLI validates your `config.yaml`, archives your project directory, and uploads it to cloud storage. Baseten receives the archive and starts the build.

For [Engine-Builder-LLM](/engines/engine-builder-llm/overview), Baseten downloads model weights from the source repository (Hugging Face, S3, or GCS) and compiles them with TensorRT-LLM. The compilation step builds optimized CUDA kernels for the target GPU architecture, applies quantization if configured, and sets up tensor parallelism across multiple GPUs. Baseten packages the compiled engine, runtime configuration, and serving infrastructure into a container, deploys it to GPU infrastructure, and exposes it as an API endpoint.

The `truss push` command returns once the upload finishes. For engine-based deployments, compilation can take several minutes. Watch progress in the deployment logs or check the dashboard, which shows "Active" when the endpoint is ready for requests.

For [custom model code](/development/model/custom-model-code) deployments, the build is faster: Baseten installs your Python dependencies, packages your `Model` class into a container, and deploys it.
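As a rough sketch of what gets packaged, a minimal custom `Model` class follows the Truss interface of a `load` hook and a `predict` method; the uppercasing "model" here is a stand-in for real weight loading and inference, not a prescribed implementation:

```python
# Minimal sketch of a Truss custom model class (conventionally model/model.py).
# The echo "model" is a placeholder; a real load() would deserialize weights.
class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Runs once when a replica starts, before any requests are served.
        self._model = lambda text: text.upper()

    def predict(self, model_input):
        # Receives the parsed JSON request body; the return value is
        # serialized back to the caller as JSON.
        return {"output": self._model(model_input["text"])}
```

Because `load` runs once per replica at startup, expensive work (downloading and deserializing weights) belongs there rather than in `predict`.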
You remain responsible for any inference optimization in custom builds.

## Request routing

Each deployment gets a dedicated subdomain: `https://model-{model_id}.api.baseten.co/`. The URL path determines which deployment handles the request. Requests to `/production/predict` go to the production environment, while `/development/predict` goes to the development deployment. You can also target a specific deployment by ID or a custom environment by name.

Once the environment is resolved, the load balancer routes the request to an active replica. If the model has scaled to zero, Baseten spins up a replica and queues the request until the model loads and becomes ready. The caller receives the response regardless of whether the model was warm or cold-started.

Engine-based deployments serve an [OpenAI-compatible API](/reference/inference-api/chat-completions) at the `/v1/chat/completions` path, so any code written for the OpenAI SDK works without modification. Custom model deployments use the [predict API](/reference/inference-api/overview), which accepts and returns arbitrary JSON.

For long-running workloads, [async requests](/inference/async) return a request ID immediately. The request enters a queue managed by an async request service. A background worker then calls your model and delivers the result via webhook. Sync requests are prioritized over async requests when competing for concurrency slots, which helps prevent background work from starving real-time traffic.

## Autoscaling

Baseten's autoscaler watches in-flight request counts and adjusts replicas to maintain each one near its [concurrency target](/deployment/autoscaling/overview).

Scaling up is immediate. When average utilization crosses the target threshold (default 70%) within the autoscaling window (default 60 seconds), the autoscaler adds replicas up to the configured maximum.

Scaling down is deliberately slow.
When traffic drops, the autoscaler flags excess replicas for removal but keeps them alive for a configurable delay (default 900 seconds). Replicas are removed gradually rather than all at once, which prevents the cluster from thrashing during bursty traffic.

Setting [`min_replica`](/deployment/autoscaling/overview) to 0 enables scale-to-zero. The model won't incur GPU cost when idle, but the next request triggers a cold start. Setting `min_replica` to 1 or higher keeps warm capacity ready at all times, trading cost for lower latency.

## Cold starts and the weight delivery network

The slowest part of a cold start is loading model weights, which can reach hundreds of gigabytes. Baseten addresses this with the [Baseten Delivery Network (BDN)](/development/model/bdn), a multi-tier caching system for model weights.

When you first deploy, BDN mirrors your model weights from the source repository to Baseten's own blob storage. After that, no cold start depends on an upstream service like Hugging Face or S3. When a new replica starts, the BDN agent on the node fetches a manifest for the weights, downloads them through an in-cluster cache (shared across all pods in the cluster), and stores them in a node-level cache (shared across all replicas on the same node). Identical files across different models are deduplicated, so a GLM fine-tune that shares most weights with the base model only downloads the delta.

Subsequent cold starts on the same node or in the same cluster are significantly faster than the first. Container images use streaming, so the model begins loading weights before the image download completes.

## Environments and promotion

Every model starts with a development deployment: a single replica with scale-to-zero enabled and live reload for fast iteration. When the model is ready for production traffic, promote it to an environment.

The [production environment](/deployment/environments) exists by default.
You can create additional environments (staging, shadow, or canary) for testing and gradual rollouts. Each environment has a stable endpoint URL, its own autoscaling settings, and dedicated metrics. The endpoint URL stays constant when you promote new deployments, so your application code doesn't need to change.

Promotion replaces the current deployment in an environment with the new one. The new deployment inherits the environment's autoscaling settings. Baseten demotes the previous deployment and scales it to zero, allowing you to roll back by re-promoting it. You can also push directly to an environment with `truss push --environment staging` to skip the development stage. Only one promotion can be active per environment at a time to prevent conflicting updates.

For configuration details on deployments, environments, resources, and CI/CD, see [Deployment concepts](/deployment/concepts).

# Why Baseten

Source: https://docs.baseten.co/concepts/whybaseten

Mission-critical inference with dedicated infrastructure, global scale, and full control.

Baseten provides high-performance inference for teams that have outgrown shared API endpoints. We deliver the performance of custom-built infrastructure with the ease of a managed platform, allowing you to deploy and scale any model behind a production-grade API.

## Mission-critical inference

Inference is the core of your application. When it fails, your product stops working. We built Baseten to handle mission-critical workloads, offering 99.99% uptime and low-latency performance at any scale.

Operating thousands of GPUs across multiple regions and cloud providers exposes the limits of traditional deployment. Single points of failure, regional capacity constraints, and the overhead of managing heterogeneous clouds create significant operational risk. We solved these problems with our Multi-cloud Capacity Management (MCM) system.
## Multi-cloud Capacity Management (MCM)

MCM is a unified control layer that provisions and scales resources across 10+ clouds and regions. It handles the complexity of cloud-agnostic orchestration, giving you a single pane of glass for your entire inference fleet. Whether you run in our cloud, yours, or both, the experience is identical.

MCM enables three deployment modes, all sharing the same high-performance inference stack:

### Baseten Cloud

Fully managed, multi-cloud inference. This is the fastest path to production, offering limitless scale and global latency optimization. We manage the infrastructure so you can focus on your models.

### Baseten Self-hosted

The full Baseten stack inside your own VPC. Use this when you have strict data security, privacy, or sovereignty requirements. You maintain complete control over your data and networking while benefiting from Baseten’s autoscaling and performance optimizations.

### Baseten Hybrid

The best of both worlds. Run core workloads in your VPC for maximum control and burst to Baseten Cloud on demand. This approach eliminates the trade-off between strict compliance and the need for elastic flex capacity.

## The Baseten advantage

ML teams at Abridge, Writer, and Patreon use Baseten to serve millions of users. Our platform is built on four pillars that ensure your success in production:

* **Model performance:** Our engineers apply the latest research in custom kernels and runtimes, delivering low latency and high throughput out of the box.
* **Reliable infrastructure:** Deploy across clusters and clouds with active-active reliability and built-in redundancy.
* **Operational control:** Use deep observability, secret management, and fine-grained autoscaling to maintain your SLAs.
* **Compliance by design:** SOC 2 Type II, HIPAA, and GDPR compliance ensure that your deployments meet the highest standards for data security.
## Comparison of deployment options

| Feature            | Baseten Cloud          | Self-hosted        | Hybrid                   |
| :----------------- | :--------------------- | :----------------- | :----------------------- |
| **Scaling**        | Unlimited, multi-cloud | Within your VPC    | VPC with Cloud spillover |
| **Data Residency** | Region-locked options  | Full local control | Local with Cloud options |
| **Compliance**     | SOC 2, HIPAA, GDPR     | Your compliance    | Hybrid compliance        |
| **Time to Market** | Hours                  | Days               | Days                     |

Baseten gives you the visibility and control of your own infrastructure without the operational burden. Whether you're deploying a single LLM or an entire library of models, you can start with a managed solution and transition to self-hosted or hybrid modes as your requirements evolve.

# Cold starts

Source: https://docs.baseten.co/deployment/autoscaling/cold-starts

Understand cold starts and how to minimize their impact on your deployments.

A *cold start* is the time required to initialize a new replica when scaling up. Cold starts affect the latency of requests that trigger new replica creation.

***

## When cold starts happen

Cold starts occur in two scenarios:

1. **Scale-from-zero**: When a deployment with zero active replicas receives its first request.
2. **Scaling events**: When traffic increases and the autoscaler adds new replicas.

***

## What contributes to cold start time

Cold start duration depends on several factors:

| Factor         | Impact                                                                 |
| -------------- | ---------------------------------------------------------------------- |
| Model loading  | Loading model weights (10s–100s of GBs), typically the dominant factor |
| Container pull | Downloading Docker image layers                                        |
| Initialization | Running your model's setup code                                        |

For large models, cold starts can take minutes. Model weight downloads are usually the bottleneck. Even with optimizations, the physics of moving hundreds of gigabytes of data creates inherent lag.
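To make that lag concrete, here is a back-of-the-envelope sketch; the weight size and link speed are illustrative assumptions, not measured Baseten figures:

```python
# Rough lower bound on weight-loading time: bytes moved over link speed.
def min_load_seconds(weights_gb: float, link_gbit_per_s: float) -> float:
    # 1 GB = 8 Gbit; ignores decompression, disk I/O, and startup overhead.
    return weights_gb * 8 / link_gbit_per_s

# e.g. ~140 GB of fp16 weights (roughly a 70B-parameter model) on a 10 Gbit/s link:
print(min_load_seconds(140, 10))  # 112.0 seconds, before any other startup work
```

Even this idealized floor is close to two minutes, which is why caching weights near the replica matters more than any single-download optimization.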
***

## Minimizing cold starts

### Keep replicas warm

Set [`min_replica`](/reference/management-api/deployments/autoscaling/updates-a-deployments-autoscaling-settings) to always have at least one replica ready to serve requests. This eliminates cold starts for the first request but increases cost.

```json theme={"system"}
{
  "min_replica": 1
}
```

For production redundancy, set `min_replica ≥ 2` so one replica can fail during maintenance without causing cold starts.

### Pre-warm before expected traffic

For predictable traffic spikes, increase min replicas before the expected load:

```bash theme={"system"}
# 10-15 minutes before expected spike
curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{"min_replica": 5}'
```

After traffic stabilizes, reset to your normal minimum.

### Use longer scale-down delay

A longer scale-down delay keeps replicas warm during temporary traffic dips:

```json theme={"system"}
{
  "scale_down_delay": 900
}
```

This prevents cold starts when traffic returns within the delay window.

***

## Platform optimizations

Baseten automatically applies several optimizations to reduce cold start times:

**Baseten Delivery Network (Recommended)**: The [`weights`](/development/model/bdn) configuration optimizes cold starts by mirroring weights to Baseten's infrastructure and caching them close to your model pods. See [Baseten Delivery Network (BDN)](/development/model/bdn) for full configuration options.

**Network accelerator (Legacy)**: Parallelized byte-range downloads speed up model loading from Hugging Face, S3, GCS, and R2. Network Acceleration is deprecated in favor of the new `weights` configuration, which provides superior cold start performance through multi-tier caching. See [Baseten Delivery Network (BDN)](/development/model/bdn) for the recommended approach.
**Image streaming**: Optimized images stream into nodes, allowing model loading to begin before the full download completes:

```
Successfully pulled streaming-enabled image in 15.851s. Image size: 32 GB.
```

These optimizations are applied automatically.

***

## The tradeoff

Cold starts create a fundamental tradeoff between **cost** and **latency**:

| Approach                         | Cost                          | Latency                                    |
| -------------------------------- | ----------------------------- | ------------------------------------------ |
| Scale to zero (`min_replica: 0`) | Lower: no cost when idle      | Higher: first request waits for cold start |
| Always on (`min_replica: ≥1`)    | Higher: pay for idle replicas | Lower: no cold starts                      |

For latency-sensitive production workloads, the cost of keeping replicas warm is usually justified. For batch workloads or development, scale-to-zero often makes sense.

***

## Next steps

* [Request lifecycle](/deployment/autoscaling/request-lifecycle): What happens to requests during cold starts, including queuing and timeout behavior.
* [Autoscaling](/deployment/autoscaling/overview): Configure min replicas and scale-down delay.
* [Traffic patterns](/deployment/autoscaling/traffic-patterns): Pre-warming strategies for different traffic types.
* [Troubleshooting](/troubleshooting/deployments#autoscaling-issues): Diagnose cold start issues.

# Autoscaling

Source: https://docs.baseten.co/deployment/autoscaling/overview

Configure autoscaling to dynamically adjust replicas based on traffic while minimizing idle compute costs.

Without autoscaling, you'd choose between two bad options: pay for enough GPUs to handle your peak traffic 24/7, or accept that requests fail when load exceeds your fixed capacity. Autoscaling eliminates this tradeoff by adjusting the number of **replicas** backing a deployment based on demand. When traffic rises, the autoscaler adds replicas. When it falls, it removes them. The goal is to match capacity to load so you pay for what you use without sacrificing latency.
Baseten [bills per minute](/observability/usage) while a replica is deploying, scaling, or serving requests. A deployment scaled to zero replicas incurs no charges, but the wake-up period when a new request arrives is billable. For details on minimizing that startup cost, see [Cold starts](/deployment/autoscaling/cold-starts).

Baseten provides default settings that work for most workloads. Tune your autoscaling settings based on your model and traffic.

| Parameter          | Default | Range    | What it controls                         |
| ------------------ | ------- | -------- | ---------------------------------------- |
| Min replicas       | 0       | ≥ 0      | Baseline capacity (0 = scale to zero).   |
| Max replicas       | 1       | ≥ 1      | Cost/capacity ceiling.                   |
| Autoscaling window | 60s     | 10–3600s | Time window for traffic analysis.        |
| Scale-down delay   | 900s    | 0–3600s  | Wait time before removing idle replicas. |
| Concurrency target | 1       | ≥ 1      | Requests per replica before scaling.     |
| Target utilization | 70%     | 1–100%   | Headroom before scaling triggers.        |

You can configure autoscaling settings through the Baseten UI or API.

1. Select your deployment.
2. Under **Replicas** for your production environment, choose **Configure**.
3. Configure the autoscaling settings and choose **Update**.

*(Screenshot: UI view to configure autoscaling.)*

```bash theme={"system"}
curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "min_replica": 2,
    "max_replica": 10,
    "concurrency_target": 32,
    "target_utilization_percentage": 70,
    "autoscaling_window": 60,
    "scale_down_delay": 900
  }'
```

For more information, see the [API reference](/reference/management-api/deployments/autoscaling/updates-a-deployments-autoscaling-settings).
```python theme={"system"} import requests import os API_KEY = os.environ.get("BASETEN_API_KEY") response = requests.patch( "https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings", headers={"Authorization": f"Api-Key {API_KEY}"}, json={ "min_replica": 2, "max_replica": 10, "concurrency_target": 32, "target_utilization_percentage": 70, "autoscaling_window": 60, "scale_down_delay": 900 } ) print(response.json()) ``` For more information, see the [API reference](/reference/management-api/deployments/autoscaling/updates-a-deployments-autoscaling-settings). *** ## How autoscaling works The autoscaler monitors in-flight requests across all active replicas. Every **autoscaling window** (60 seconds by default), it compares the average load per replica against your **concurrency target** adjusted by **target utilization**. When that threshold is crossed, the autoscaler adds replicas until the concurrency target is met or the maximum replica count is reached. Consider a deployment with a concurrency target of 10 and target utilization of 70%. The autoscaler triggers at 7 concurrent requests per replica (10 x 0.70). If traffic jumps from 5 to 25 in-flight requests, the autoscaler calculates that 4 replicas are needed (ceiling of 25 / 7) and begins provisioning them. Scaling down is deliberately slower. When traffic drops, the autoscaler doesn't remove replicas immediately. Instead, it waits for the **scale-down delay** (15 minutes by default), then removes half the excess replicas, waits again, and removes half of what remains. This exponential back-off prevents oscillation: if traffic briefly dips and returns, your replicas are still warm. Scaling down stops at the minimum replica count. *** ## Replicas Each replica is an independent instance of your model, running on its own hardware and capable of serving requests in parallel with other replicas. 
The autoscaler controls how many replicas are active at any given time, but you set the boundaries. The floor for your deployment's capacity. The autoscaler won't scale below this number. **Range:** ≥ 0 The default of 0 enables *scale-to-zero*: when no requests arrive for long enough, all replicas shut down and your deployment incurs no charges. The tradeoff is that the next request triggers a [cold start](/deployment/autoscaling/cold-starts), which can take minutes for large models. During that wake-up period, [billing is per minute](/observability/usage) even though the replica isn't yet serving responses. For production deployments, set `min_replica` to at least 2. This eliminates cold starts and provides redundancy if one replica fails. The ceiling for your deployment's capacity. The autoscaler won't scale above this number. **Range:** ≥ 1 This setting protects against runaway scaling and unexpected costs. If traffic exceeds what your maximum replicas can handle, requests queue rather than triggering new replicas. See [Request lifecycle](/deployment/autoscaling/request-lifecycle) for details on queuing and load shedding behavior. The default of 1 effectively disables autoscaling: you get exactly one replica regardless of load. Estimate max replicas: $$ (peak\_requests\_per\_second / throughput\_per\_replica) + buffer $$ For high-volume workloads requiring guaranteed capacity, [contact Baseten](mailto:support@baseten.co) about reserved capacity options. *** ## Scaling triggers The autoscaler needs to know when your replicas are "full." Two settings define that threshold: **concurrency target** sets how many simultaneous requests each replica should handle, and **target utilization** adds headroom so the autoscaler acts before replicas are completely saturated. How many requests each replica can handle simultaneously. This directly determines replica count for a given load. 
**Range:** ≥ 1 The autoscaler calculates desired replicas: $$ ceiling(in\_flight\_requests / (concurrency\_target \times target\_utilization)) $$ *In-flight requests* are requests sent to your model that haven't returned a response (for streaming, until the stream completes). This count is exposed as [`baseten_concurrent_requests`](/observability/export-metrics/supported-metrics#baseten_concurrent_requests) in the metrics dashboard and metrics export. The right value depends on how your model uses hardware. Image generation models that consume all GPU memory per request can only process one at a time, so a concurrency target of 1 is correct. LLMs and embedding models batch requests internally and can handle dozens simultaneously, so higher targets (32 or more) reduce cost by packing more work onto each replica. **Tradeoff:** Higher concurrency = fewer replicas (lower cost) but more per-replica queueing (higher latency). Lower concurrency = more replicas (higher cost) but less queueing (lower latency). **Starting points by model type:** | Model type | Starting concurrency | | ----------------------- | -------------------- | | Standard Truss model | 1 | | vLLM / LLM inference | 32–128 | | SGLang | 32 | | Text embeddings (TEI) | 32 | | BEI embeddings | 96+ (min ≥ 8) | | Whisper (async batch) | 256 | | Image generation (SDXL) | 1 | For engine-specific guidance, see [Autoscaling engines](/engines/performance-concepts/autoscaling-engines). **Concurrency target** controls requests sent *to* a replica and triggers autoscaling. **predict\_concurrency** (Truss config.yaml) controls requests processed *inside* the container. Concurrency target should be less than or equal to predict\_concurrency. See the `predict_concurrency` field in the [Truss configuration reference](/reference/truss-configuration) for details. Headroom before scaling triggers. The autoscaler scales when utilization reaches this percentage of the concurrency target, not when replicas are fully loaded. 
**Range:** 1–100% The effective threshold is: $$ concurrency\_target × target\_utilization $$ With a concurrency target of 10 and utilization of 70%, scaling triggers at 7 concurrent requests (10 x 0.70), leaving 30% headroom for absorbing spikes while new replicas start. Lower values (50-60%) provide more headroom for spikes but cost more. Higher values (80%+) are cost-efficient for steady traffic but absorb spikes less effectively. Target utilization is **not** GPU utilization. It measures request slot usage relative to your concurrency target, not hardware utilization. *** ## Scaling dynamics Once the autoscaler decides to scale, two settings control the pace. The **autoscaling window** determines how far back the autoscaler looks when measuring traffic, and the **scale-down delay** determines how long it waits before removing idle replicas. Together, they let you tune the tradeoff between responsiveness and stability. How far back (in seconds) the autoscaler looks when measuring traffic. Traffic is averaged over this window to make scaling decisions. **Range:** 10–3600 seconds A 60-second window smooths out momentary spikes by averaging load over the past minute. Shorter windows (30-60s) react quickly to traffic changes, which suits bursty workloads. Longer windows (2-5 min) ignore short-lived fluctuations and prevent the autoscaler from chasing noise. How long (in seconds) the autoscaler waits after load drops before removing replicas. **Range:** 0–3600 seconds When load drops, the autoscaler starts a countdown. If load stays low for the full delay, it removes replicas using exponential back-off (half the excess, wait, half again). If traffic returns before the countdown finishes, the replicas stay active and the countdown resets. This is your primary lever for preventing *oscillation*. If replicas repeatedly scale up and down, increase this value first. 
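The scaling arithmetic described above can be sketched in a few lines of Python. This is an illustration of the documented formulas, not Baseten's implementation, and the function names are invented for the example:

```python
import math

def desired_replicas(in_flight, concurrency_target, target_utilization):
    """Replicas needed so each stays at or below
    concurrency_target * target_utilization in-flight requests."""
    threshold = concurrency_target * target_utilization
    return math.ceil(in_flight / threshold)

# 25 in-flight requests against a concurrency target of 10 at 70%
# utilization: the effective threshold is 7, so 4 replicas are needed.
print(desired_replicas(25, 10, 0.70))  # 4

def scale_down_step(current, minimum):
    """One scale-down pass: remove half the excess replicas,
    never dropping below the minimum."""
    excess = current - minimum
    return current - math.ceil(excess / 2) if excess > 0 else current

# Scaling 8 replicas down toward a minimum of 2: 8 -> 5 -> 3 -> 2.
print(scale_down_step(8, 2))  # 5
```

Each scale-down pass happens only after the scale-down delay elapses, which is why the 8 → 5 → 3 → 2 sequence plays out over several delay periods rather than all at once.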
A **short window** with a **long delay** gives you fast scale-up while maintaining capacity during temporary dips. This is a good starting configuration for most workloads. *** ## Development deployments Development deployments are designed for iteration, not production traffic. Replicas are fixed at 0-1 to match the [`truss watch`](/reference/cli/truss/watch) workflow, where you're testing changes on a single instance rather than handling concurrent users. You can still adjust timing and concurrency settings. | Setting | Value | Modifiable | | ------------------ | ----------- | ---------- | | Min replicas | 0 | No | | Max replicas | 1 | No | | Autoscaling window | 60 seconds | Yes | | Scale-down delay | 900 seconds | Yes | | Concurrency target | 1 | Yes | | Target utilization | 70% | Yes | To enable full autoscaling with configurable replica settings, [promote the deployment to production](/deployment/deployments). *** ## Next steps Identify your traffic pattern and get recommended starting settings. Understand cold starts and how to minimize their impact. Complete autoscaling API documentation. Recommended settings for BEI and Engine-Builder-LLM with dynamic batching. *** ## Troubleshooting Having issues with autoscaling? See [Autoscaling troubleshooting](/troubleshooting/deployments#autoscaling-issues) for solutions to common problems like oscillation, slow scale-up, and unexpected costs. # Request lifecycle Source: https://docs.baseten.co/deployment/autoscaling/request-lifecycle What happens to a request from submission to response, including routing, queuing, timeouts, and error handling. When you send an inference request, it doesn't go straight to model code. Whether you use [Model APIs](/inference/model-apis/overview), an OpenAI-compatible endpoint for a deployment you manage, or the [predict API](/inference/calling-your-model), the request passes through authentication, routing, and replica selection first. 
For Truss deployments with custom model code, your `predict` function runs only after those steps. These layers exist so that Baseten can manage replicas on your behalf: scaling them up when traffic spikes, scaling them down when it drops, and distributing requests across them without any load-balancing code on your side. Understanding what each layer does helps you reason about latency, interpret status codes, and debug production issues. ## How a request reaches your model Your request first hits Baseten's inference gateway, which authenticates it against your [API key](/organization/api-keys). If authentication fails, the gateway returns a `401 Unauthorized` before the request reaches any model infrastructure. Once authenticated, the request moves to the routing layer, which decides which replica should handle it. Baseten routes requests to the least-utilized replica based on how full each one is relative to its [concurrency target](/deployment/autoscaling/overview#concurrency-target). Rather than spreading requests evenly across all replicas, the router prefers replicas that already have headroom, which keeps the total number of active replicas low. This matters because you're [billed per minute](/observability/usage) for each running replica. When the router finds a replica with available capacity, it forwards the request. The replica runs inference. For deployments that use the predict API, your `predict` function executes here. The response flows back through the same path to the client. For most requests, the routing overhead is negligible compared to your model's inference time. The sections below cover what happens when this straightforward path breaks down: when no replica is available, when replicas are overloaded, and when requests fail partway through. 
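The routing decision described above can be sketched as follows. This is a simplified illustration, not Baseten's actual router; the replica list and its field names are hypothetical:

```python
def pick_replica(replicas, concurrency_target):
    """Pick the least-utilized replica (fewest in-flight requests
    relative to the concurrency target). Returns None when every
    replica is full and the request must wait at the routing layer."""
    available = [r for r in replicas if r["in_flight"] < concurrency_target]
    if not available:
        return None  # the request parks until a slot opens
    return min(available, key=lambda r: r["in_flight"])

replicas = [
    {"name": "replica-a", "in_flight": 9},
    {"name": "replica-b", "in_flight": 3},
]
print(pick_replica(replicas, 10)["name"])  # replica-b
print(pick_replica(replicas, 3))           # None: both at or over capacity
```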
## What happens when no replica is available If your deployment has scaled to zero, or all existing replicas are at capacity and the autoscaler is still bringing up new ones, incoming requests have nowhere to go. Rather than rejecting them immediately, Baseten parks the request at the routing layer and waits for a replica to become available. Once one is ready, the parked request is forwarded and processed normally. From the client's perspective, the response simply takes longer: the wait time is added on top of the normal inference time. This parking behavior is what makes [scale-to-zero](/deployment/autoscaling/overview#min_replica) practical. You don't need to build retry logic into your client just because your deployment was idle; the request waits for you. But the wait isn't indefinite. If the server-side timeout (currently 600 seconds) expires before a replica becomes available, the parked request receives a `429`. For large models that take several minutes to load weights, you may want to keep [minimum replicas](/deployment/autoscaling/overview#min_replica) above zero so requests always have somewhere to go. [Async requests](/inference/async) follow a different pattern. The first async request parks and waits, just like a sync request. But subsequent async requests that arrive while there's still no capacity receive an immediate `429` with a `CAPACITY_EXCEEDED` error instead of the `202 Accepted` they'd normally get. This prevents a situation where your client thinks a request was accepted and starts polling for results, when it's actually still waiting for a replica to start. For strategies to reduce cold start latency, including warm replicas, pre-warming, and the Baseten Delivery Network, see [Cold starts](/deployment/autoscaling/cold-starts). ## Request queuing and load shedding Even when replicas are running, they can fill up. 
When all replicas are at their [concurrency target](/deployment/autoscaling/overview#concurrency-target) and the autoscaler hasn't yet finished adding new ones, incoming requests queue at the routing layer. This queuing is automatic: you don't configure it and your client doesn't see it. The request simply waits until a slot opens up on a replica. Baseten has a **load shedding** safety valve that rejects new requests with a `429` if queued payloads exceed a memory threshold, but this threshold is high enough that it rarely triggers under normal conditions. The more likely issue you'll encounter is requests waiting a long time during traffic spikes, not requests being rejected. Because your client has no visibility into the queue, a request that's waiting for capacity looks the same as a request that's taking a long time to run inference. If you don't want requests to hang indefinitely in this situation, set a client-side timeout so your application can fail fast and either retry or surface an error to the user. To reduce queuing overall, increase your [max replicas](/deployment/autoscaling/overview#max_replica) so the autoscaler can add capacity faster. Adjusting your [concurrency target](/deployment/autoscaling/overview#concurrency-target) also helps, since a higher target means each replica absorbs more requests before the queue starts filling. ## Internal retries When a request reaches a replica but the replica returns a `502`, `503`, or `504`, the routing layer doesn't surface the error to your client immediately. Instead, it retries the request automatically using exponential backoff, starting at 100 milliseconds and doubling up to 30 seconds between attempts. For status code errors like these, retries continue until the request's context timeout expires or 15 minutes of total elapsed time, whichever comes first. Connection-level failures, where the replica is completely unreachable, are capped at 16 attempts instead. 
[Async requests](/inference/async) are not retried. From your client's perspective, retries show up as added latency rather than errors. A request that would have failed on the first attempt may succeed on the second or third, but take noticeably longer than usual. If you're investigating occasional latency spikes where requests take much longer than expected but eventually succeed, you can check the `X-BASETEN-MODEL-PREDICTION-ATTEMPTS` response header: a value greater than 1 confirms that at least one retry happened. Under memory pressure (above 80% utilization on the routing layer), a circuit breaker disables retries entirely to protect stability, resuming them after a 30-second cooldown once memory drops. If a request was pinned to a specific replica via sticky session and that replica returns a `503`, the retry routes to a different replica rather than trying the same one again. ## Timeouts The **predict timeout** controls how long a sync request can take from the moment it's forwarded to a replica until a response must be returned. If your model's inference exceeds this window, the request is cancelled and the client receives a `504`. The server-side default is 600 seconds (10 minutes), and it isn't currently user-configurable. If you need requests to fail faster than that, set a client-side timeout in your HTTP client. The **async predict timeout** works the same way for [async requests](/inference/async), except that instead of returning a `504` to the caller, the request is marked as failed with a `MODEL_PREDICT_TIMEOUT` error status and your webhook receives the error payload. The **parking timeout**, which governs how long a request waits in the queue when no replica is available, is set equal to the predict timeout. The logic behind this is that if a request wouldn't have time to complete inference even if a replica appeared right now, there's no benefit to holding it in the queue any longer. 
One practical consequence is that the predict timeout also determines how long your deployment can take to cold-start before parked requests begin failing. For **streaming responses**, timeouts behave differently because the HTTP headers, including the `200` status code, are sent when the stream begins. If the timeout expires mid-stream, the stream stops and the connection closes without an error code, since the status was already written. Most HTTP clients surface this as a connection reset or incomplete response rather than a timeout error. ## HTTP status codes The inference API returns a specific set of status codes, and the sections above explain the conditions that produce each one. This table is a reference for quick lookup. | Code | Meaning | When it occurs | What to do | | ----- | --------------------- | --------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `200` | Success | Normal predict response. | None. | | `202` | Accepted | Async predict request queued successfully. | Poll for results or wait for your [webhook](/inference/async). | | `401` | Unauthorized | Invalid or missing API key. | Check your [API key](/organization/api-keys). | | `429` | Too Many Requests | Load shedding triggered, no capacity available, or parking timeout expired during a cold start. | Retry with exponential backoff. If persistent, increase [max replicas](/deployment/autoscaling/overview#max_replica) or [concurrency target](/deployment/autoscaling/overview#concurrency-target). | | `499` | Client Closed Request | Client disconnected before the response was written. | No server-side action needed. Review client-side timeout configuration if unexpected. 
| | `502` | Bad Gateway | The request context was cancelled, or the model became unavailable during inference. | Retry. If persistent, check model logs for crashes or errors in your `predict` function. | | `503` | Service Unavailable | The routing layer couldn't find a replica endpoint, typically during a deployment rollout or immediately after a replica failure. | Retry. If persistent, check deployment status in the Baseten dashboard. | | `504` | Gateway Timeout | The request exceeded the server-side predict timeout (600 seconds). | Optimize your model's inference speed. If you're seeing this consistently, contact support about adjusting the timeout. | A `429` during a cold start doesn't mean the deployment is permanently overloaded. It means the parking timeout expired before a replica finished starting. Retrying after a brief wait (30 seconds to a minute) often succeeds once the replica is ready. ## Request cancellation When a client disconnects before the response is written, the routing layer detects the closed connection and cancels the in-flight work. The server logs this as a `499`. In the common case, such as a user closing a browser tab or a client-side timeout firing, this is harmless and the `499` is informational rather than an error. The more important question is whether cancellation propagates all the way to the GPU. If a client disconnects during a long generation and the model keeps running, you're paying for GPU time that produces tokens nobody will read. Baseten cancels in-flight work automatically so this doesn't happen. When the routing layer detects a disconnect, it signals the inference engine, which aborts the running request and frees GPU resources. This works across engines including TRT-LLM and vLLM. If you're using a custom model server, you can implement cancellation yourself using Truss request objects. See [Request handling](/development/model/requests) for code examples. 
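For the `429` guidance above ("retry with exponential backoff"), a client-side retry helper might look like the following sketch. The backoff schedule and the `send` callable are illustrative choices, not an official client; you supply your own HTTP call:

```python
import time

def backoff_delays(initial=1.0, cap=60.0, attempts=5):
    """Exponential backoff schedule: initial, 2x, 4x, ... capped at `cap`."""
    delay = initial
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= 2

def call_with_retry(send, retryable=(429,), attempts=5, initial=1.0):
    """`send()` performs one request and returns (status_code, body).
    Retryable statuses are backed off and retried; anything else
    (success or a non-retryable error) is returned immediately."""
    result = (None, None)
    for delay in backoff_delays(initial=initial, attempts=attempts):
        result = send()
        if result[0] not in retryable:
            return result
        time.sleep(delay)
    return result  # still failing after all attempts
```

For example, `send` could wrap a POST to your deployment's predict endpoint: a `200` returns immediately, while repeated `429`s during a cold start are retried until a replica is ready.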
## Next steps Reduce cold start latency with warm replicas and pre-warming strategies. Configure concurrency targets, replica counts, and scaling dynamics. Fire-and-forget inference with webhook delivery. Diagnose common deployment issues including autoscaling problems. # Traffic patterns Source: https://docs.baseten.co/deployment/autoscaling/traffic-patterns Identify your traffic pattern and configure autoscaling settings to match. Different traffic patterns require different autoscaling configurations. Identify your pattern below for recommended starting settings. These are **starting points**, not final answers. Monitor your deployment's performance and adjust based on observed behavior. See [Autoscaling](/deployment/autoscaling/overview) for parameter details. *** ## Jittery traffic Small, frequent spikes that quickly return to baseline. ### Characteristics * Baseline replica count is steady, but **spikes up by 2x several times per hour**. * Spikes are short-lived and return to baseline quickly. * Often not real load growth, just temporary surges causing overreaction. ### Common causes * Consumer products with intermittent usage bursts. * Traffic splitting or A/B testing with low percentages. * Polling clients with synchronized intervals. ### Recommended settings | Parameter | Value | Why | | ------------------ | ----------------- | ----------------------------------------------- | | Autoscaling window | **2-5 minutes** | Smooth out noise, avoid reacting to every spike | | Scale-down delay | **300-600s** | Moderate stability | | Target utilization | **70%** | Default is fine | | Concurrency target | Benchmarked value | Start conservative | A longer autoscaling window averages out the jitter so the autoscaler doesn't chase every small spike. You're trading reaction speed for stability, which is acceptable when the spikes aren't sustained load increases. 
If you're still seeing oscillation with these settings, increase the scale-down delay before lowering target utilization. *** ## Bursty traffic ### Characteristics * Traffic **jumps sharply** (2x+ within 60 seconds). * Stays high for a sustained period before dropping. * The "pain" is queueing and latency spikes while new replicas start. ### Common causes * Daily morning ramp-up (users starting their day). * Marketing events, product launches, viral moments. * Top-of-hour scheduled jobs or cron-triggered traffic. ### Recommended settings | Parameter | Value | Why | | ------------------ | ---------- | --------------------------------------------- | | Autoscaling window | **30-60s** | React quickly to genuine load increases | | Scale-down delay | **900s+** | Handle back-to-back waves without thrashing | | Target utilization | **50-60%** | More headroom absorbs the burst while scaling | | Min replicas | **≥2** | Redundancy + reduces cold start impact | Short window means fast reaction. Long delay prevents scaling down between waves. Lower utilization gives you buffer capacity while new replicas start. ### Pre-warming for predictable bursts If your bursts are predictable (morning ramp, scheduled events), pre-warm by bumping min replicas before the expected spike: ```bash theme={"system"} curl -X PATCH \ https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -d '{"min_replica": 5}' ``` After the burst subsides, reset to your normal minimum: ```bash theme={"system"} curl -X PATCH \ https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -d '{"min_replica": 2}' ``` Automate pre-warming with cron jobs or your orchestration system. Bumping min replicas 10-15 minutes before known peaks avoids cold starts for the first requests after the spike. 
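The pre-warming calls above can also be scripted from your scheduler or orchestration system. This is a sketch using the `requests` library against the autoscaling endpoint shown earlier; `model_id` and `deployment_id` are placeholders for your own deployment:

```python
import os
import requests

def autoscaling_url(model_id, deployment_id):
    """Build the autoscaling settings endpoint for a deployment."""
    return (
        f"https://api.baseten.co/v1/models/{model_id}"
        f"/deployments/{deployment_id}/autoscaling_settings"
    )

def set_min_replicas(model_id, deployment_id, min_replica):
    """PATCH min_replica, as in the curl examples above."""
    resp = requests.patch(
        autoscaling_url(model_id, deployment_id),
        headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
        json={"min_replica": min_replica},
    )
    resp.raise_for_status()
    return resp.json()

# Before an expected burst (e.g. from cron or an orchestrator):
#   set_min_replicas(model_id, deployment_id, 5)
# After the burst subsides:
#   set_min_replicas(model_id, deployment_id, 2)
```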
***

## Scheduled traffic

### Characteristics

* **Long periods of low or zero traffic**.
* Large bursts tied to job schedules (hourly, daily, weekly).
* Traffic patterns are predictable but infrequent.

### Common causes

* ETL pipelines and data processing jobs.
* Embedding backfills and batch inference.
* Periodic evaluation or testing jobs.
* Document processing triggered by user uploads.

### Recommended settings

| Parameter | Value | Why |
| ------------------ | --------------------------------------------------------------- | ----------------------------------------- |
| Min replicas | **0** (if cold starts acceptable) or **1** (during job windows) | Cost savings when idle |
| Scale-down delay | **Moderate to high** | Jobs often come in waves |
| Autoscaling window | **60-120s** | Don't overreact to the first few requests |
| Target utilization | **70%** | Default is fine |

Scale-to-zero saves significant cost during idle periods. The moderate window prevents overreacting to the initial requests of a batch. If jobs come in waves, a longer delay keeps replicas warm between them.

### Scheduled pre-warming

For predictable batch jobs, use cron + API to pre-warm. 5 minutes before the hourly job, scale up:

```bash theme={"system"}
55 * * * * curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{"min_replica": 3}'
```

30 minutes after the job completes, scale back down:

```bash theme={"system"}
30 * * * * curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{"min_replica": 0}'
```

If you use scale-to-zero, the first request of each batch will experience a [cold start](/deployment/autoscaling/cold-starts). For latency-sensitive batch jobs, keep min replicas at 1 during expected job windows.
*** ## Steady traffic ### Characteristics * Traffic **rises and falls gradually** over the day. * Classic diurnal pattern with no sharp edges. * Predictable, cyclical behavior. ### Common causes * Always-on inference APIs with consistent user base. * B2B applications with business-hours usage. * Production workloads with stable, mature traffic. ### Recommended settings | Parameter | Value | Why | | ------------------ | ------------ | ------------------------------ | | Target utilization | **70-80%** | Can run replicas hotter safely | | Autoscaling window | **60-120s** | Moderate reaction speed | | Scale-down delay | **300-600s** | Moderate | | Min replicas | **≥2** | Redundancy for production | Without sudden spikes, you don't need as much headroom. You can run replicas at higher utilization (lower cost) because load changes are gradual and predictable. The autoscaler has time to react. Smooth traffic is the easiest to tune. Start with defaults, monitor for a week, then optimize for cost by gradually raising target utilization while watching p95 latency. *** ## Identifying your pattern Not sure which pattern you have? Check your metrics: 1. Go to your model's **Metrics** tab in the Baseten dashboard 2. Look at **Inference volume** and **Replicas** over the past week 3. Compare to the patterns above | You see... | Your pattern is... | | ----------------------------------------------------- | ------------------ | | Frequent small spikes that quickly return to baseline | Jittery | | Sharp jumps that stay high for a while | Bursty | | Long flat periods with occasional large bursts | Scheduled | | Gradual rises and falls, smooth curves | Steady | Some workloads are a mix of patterns. If your traffic has both smooth diurnal patterns AND occasional bursts, optimize for the bursts (they cause the most pain) and accept slightly higher cost during steady periods. *** ## Next steps * [Autoscaling](/deployment/autoscaling/overview): Full parameter documentation. 
* [Troubleshooting autoscaling](/troubleshooting/deployments#autoscaling-issues): Diagnose and fix common problems. * [Truss configuration reference](/reference/truss-configuration): Configure predict\_concurrency in your model. # CI/CD Source: https://docs.baseten.co/deployment/ci-cd Automate Truss deployments with GitHub Actions. Manual `truss push` works when one person deploys one model. When your model code lives in a shared repository with multiple contributors, deploys drift out of sync: someone pushes from a stale branch, a config change skips review, a broken model reaches production because nobody ran a predict check first. The [Truss Push GitHub Action](https://github.com/marketplace/actions/truss-push) ties deployment to your Git workflow. Every push or pull request can trigger a deploy, validate the model with a predict request, and clean up automatically. The action supports both Truss models and [chains](/development/chain/deploy). ## What happens during a run The action runs through four phases, each in a collapsible log group in the GitHub Actions UI: 1. **Load config**: For models, reads `config.yaml` from the Truss directory and extracts `model_metadata.example_model_input` for the predict step (unless you override it with `predict-payload`). For chains, detects the entrypoint class from the `.py` file. 2. **Deploy**: Pushes the model or chain to Baseten and streams deployment logs directly into the GitHub Actions output. You don't need to open the Baseten dashboard to watch the build. The action names each deployment from git context: `PR-42_abc1234` for pull requests, `abc1234` for direct pushes (customizable with `deployment-name`). 3. **Predict**: Sends a predict request and reports latency. For streaming models (when the payload includes `"stream": true`), reports time-to-first-byte, token count, and tokens per second. 4. **Cleanup**: Deactivates the newly created deployment if `cleanup: true`. 
Set `cleanup: false` when deploying to an environment or when you want to inspect the deployment manually. After every run, the action writes a summary table to the GitHub Actions job summary with deploy time, predict metrics, and a direct link to the deployment logs on Baseten. ## Prerequisites Store your Baseten API key as an [encrypted secret](https://docs.github.com/en/actions/security-guides/encrypted-secrets) named `BASETEN_API_KEY` in your repository or organization settings. See [API keys](/organization/api-keys) for how to generate one. ## Deploy to an environment on merge Deploy a validated model to a specific environment every time code merges to `main`. Create `.github/workflows/deploy.yml` and add the following: ```yaml theme={"system"} name: Deploy to production on: push: branches: [main] jobs: deploy: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: basetenlabs/action-truss-push@v0.1 with: truss-directory: "./my-model" baseten-api-key: ${{ secrets.BASETEN_API_KEY }} environment: "production" cleanup: false ``` Setting `environment` publishes the deployment to the specified environment. Setting `cleanup: false` keeps the deployment active so it can serve traffic. ## Validate on pull request Catch model regressions before they reach production. The action deploys, runs a predict request, and tears down the deployment inside the PR check. Create `.github/workflows/validate-model.yml` and add the following: ```yaml theme={"system"} name: Validate model on: pull_request: branches: [main] jobs: validate: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: basetenlabs/action-truss-push@v0.1 with: truss-directory: "./my-model" baseten-api-key: ${{ secrets.BASETEN_API_KEY }} ``` The action reads `model_metadata.example_model_input` from your `config.yaml` to build the predict request. With the default (`cleanup: true`), the deployment is deactivated after validation, so no resources are left running. 
## Deploy a chain Deploy a Baseten chain from a Python source file. The action auto-detects chains when `truss-directory` points to a `.py` file. ```yaml theme={"system"} - uses: basetenlabs/action-truss-push@v0.1 with: truss-directory: "./chains/my_chain.py" baseten-api-key: ${{ secrets.BASETEN_API_KEY }} model-name: "my-rag-chain" cleanup: false predict-payload: '{"query": "What is Baseten?"}' ``` For chains, the predict payload must be provided explicitly with `predict-payload` because there's no `config.yaml` to read example input from. ## Deploy multiple models Use a matrix strategy to deploy each model in your repository as a separate job. Create `.github/workflows/deploy-all.yml` and add the following: ```yaml theme={"system"} name: Deploy models on: push: branches: [main] jobs: deploy: runs-on: ubuntu-latest strategy: matrix: model: - path: models/text-classifier - path: models/image-generator - path: models/embeddings steps: - uses: actions/checkout@v4 - uses: basetenlabs/action-truss-push@v0.1 with: truss-directory: ${{ matrix.model.path }} baseten-api-key: ${{ secrets.BASETEN_API_KEY }} environment: "production" cleanup: false ``` Each matrix entry runs as a separate job. If one model fails, the others still deploy. ## Custom predict validation Override the default predict payload when your model needs a specific input shape that differs from `model_metadata.example_model_input`. ```yaml theme={"system"} - uses: basetenlabs/action-truss-push@v0.1 with: truss-directory: "./my-model" baseten-api-key: ${{ secrets.BASETEN_API_KEY }} predict-payload: '{"prompt": "Hello, world!", "max_new_tokens": 128}' predict-timeout: 60 ``` If neither `predict-payload` nor `model_metadata.example_model_input` is set, the action skips the predict step entirely and the deployment isn't validated. ## Deploy with labels Attach metadata labels to track deployments in your CI pipeline. 
```yaml theme={"system"} - uses: basetenlabs/action-truss-push@v0.1 with: truss-directory: "./my-model" baseten-api-key: ${{ secrets.BASETEN_API_KEY }} labels: '{"team": "ml-platform", "triggered-by": "ci"}' ``` ## Override model name Set a custom model name instead of using the name from `config.yaml`. ```yaml theme={"system"} - uses: basetenlabs/action-truss-push@v0.1 with: truss-directory: "./my-model" baseten-api-key: ${{ secrets.BASETEN_API_KEY }} model-name: "my-custom-name" ``` ## Use action outputs The action exposes outputs you can reference in downstream steps. This example posts the deploy time as a PR comment. ```yaml theme={"system"} steps: - uses: actions/checkout@v4 - uses: basetenlabs/action-truss-push@v0.1 id: deploy with: truss-directory: "./my-model" baseten-api-key: ${{ secrets.BASETEN_API_KEY }} - name: Comment on PR if: github.event_name == 'pull_request' uses: actions/github-script@v7 with: script: | github.rest.issues.createComment({ owner: context.repo.owner, repo: context.repo.repo, issue_number: context.issue.number, body: `Model deployed in ${{ steps.deploy.outputs.deploy-time-seconds }}s. Status: ${{ steps.deploy.outputs.status }}` }) ``` See the full list of inputs and outputs in the [Truss Push GitHub Action reference](/reference/ci/github-action). ## Troubleshooting **`deploy_timeout`:** The default timeout is 45 minutes, which accommodates large builds like TRT-LLM. For smaller models, reduce `deploy-timeout-minutes` to fail faster. If your model legitimately needs more time, increase the value. **`deploy_failed`:** Check your `config.yaml` for syntax errors and verify the `BASETEN_API_KEY` secret is set correctly. The action logs the full build output in collapsible sections. Expand them in the GitHub Actions UI to see the exact error. **`predict_failed`:** Verify the predict payload shape matches what your model expects. Check `model_metadata.example_model_input` in `config.yaml`, or override it with `predict-payload`. 
For chains, the predict payload must be provided explicitly.

**`cleanup_failed`:** The deployment may still be running. Deactivate it manually from the [Baseten dashboard](https://app.baseten.co).

**`429 Too Many Requests`:** The action calls management API endpoints that are rate limited per API key. Matrix jobs that fan out across many models can exceed the per-endpoint limits. See [management API rate limits](/reference/management-api/rate-limits) for thresholds and backoff guidance.

**No predict output:** If neither `predict-payload` nor `model_metadata.example_model_input` is configured, the action skips prediction entirely. The deployment runs but isn't validated. Add an example input to your `config.yaml` (models) or set `predict-payload` (chains) to enable validation.

# Concepts

Source: https://docs.baseten.co/deployment/concepts

Deployments, environments, resources, autoscaling, and CI/CD on Baseten.

When you run `truss push`, Baseten creates a [deployment](/deployment/deployments): a running instance of your model on GPU infrastructure with an API endpoint. This page explains how deployments are managed, versioned, and scaled.

## Deployments

A [deployment](/deployment/deployments) is a single version of your model running on specific hardware. Every `truss push` creates a new deployment. You can have multiple deployments of the same model running simultaneously, which is how you test new versions without affecting production traffic. Deployments can be deactivated to stop serving (and stop incurring cost) or deleted permanently when they're no longer needed.

For rapid iteration, use `truss push --watch` to create a **development deployment**: a mutable instance that live-reloads as you edit your model code. Development deployments can't be promoted to an environment.

*Baseten deployments dashboard showing multiple model versions*

## Environments

As your model matures, you'll want a way to manage releases.
[Environments](/deployment/environments) provide stable endpoints that persist across deployments. A typical setup has a development environment for testing and a production environment for live traffic. Each environment maintains its own autoscaling settings, metrics, and endpoint URL. When a new deployment is ready, you promote it to an environment, and traffic shifts to the new version without changing the endpoint your application calls.

*Deployment environments with development and production endpoints*

## Resources

Every deployment runs on a specific [instance type](/deployment/resources) that defines its GPU, CPU, and memory allocation. Choosing the right instance balances inference speed against cost. You'll set the instance type in your `config.yaml` before deployment, or adjust it later through the Baseten UI. Smaller models run well on an L4 (24 GB VRAM), while large LLMs may need A100s or H100s with tensor parallelism across multiple GPUs.

*Resource configuration showing GPU instance type selection*

## Autoscaling

You don't manage replicas manually. [Autoscaling](/deployment/autoscaling/overview) adjusts the number of running instances based on incoming traffic. You'll configure a minimum and maximum replica count, a concurrency target, and a scale-down delay. When traffic drops, replicas scale down (optionally to zero, eliminating all cost). When traffic spikes, new replicas spin up automatically. [Cold start optimization](/deployment/autoscaling/cold-starts) and network acceleration keep response times fast even when scaling from zero.

*Autoscaling configuration with replica count and concurrency settings*

For the mechanics of how the autoscaler tracks in-flight requests and adjusts replicas, see [How Baseten works](/concepts/howbasetenworks#autoscaling). For engine-specific autoscaling settings (BEI and Engine-Builder-LLM), see [Autoscaling engines](/engines/performance-concepts/autoscaling-engines).
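The autoscaling parameters described above (minimum and maximum replicas, concurrency target, scale-down delay) can also be set programmatically. The sketch below only constructs a request for the management API; the URL path and field names are assumptions, so verify them against the management API reference before use:

```python
import json

# Builds a management API request for updating a deployment's autoscaling
# settings. The URL path and field names are assumptions, not verified
# against the Baseten API reference.
def build_autoscaling_update(model_id, deployment_id, *, min_replica,
                             max_replica, concurrency_target, scale_down_delay):
    url = (
        "https://api.baseten.co/v1/models/"
        f"{model_id}/deployments/{deployment_id}/autoscaling_settings"
    )
    payload = {
        "min_replica": min_replica,                # 0 enables scale to zero
        "max_replica": max_replica,                # upper bound under load
        "concurrency_target": concurrency_target,  # in-flight requests per replica
        "scale_down_delay": scale_down_delay,      # seconds of low traffic before scaling down
    }
    return url, json.dumps(payload)

url, body = build_autoscaling_update(
    "abc123", "xyz789",
    min_replica=0, max_replica=4, concurrency_target=2, scale_down_delay=900,
)
print(url)
```

Once you've confirmed the endpoint shape, send the body with your API key in the `Authorization` header.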
## Request lifecycle

When a request reaches your deployment, it passes through authentication, routing, and replica selection before your model code executes. Understanding this path helps you diagnose errors and configure timeouts. See [Request lifecycle](/deployment/autoscaling/request-lifecycle) for the full journey of a request, including queuing, load shedding, and HTTP status codes.

## CI/CD

When your model code lives in a Git repository, you can automate deployments with CI/CD. The [Truss Push GitHub Action](/deployment/ci-cd) deploys your model, validates it with a predict request, and optionally promotes it to production. You'll configure the trigger (such as pushes or pull requests to specific branches) in your GitHub Actions workflow file.

# Deployments

Source: https://docs.baseten.co/deployment/deployments

Deploy, manage, and scale machine learning models with Baseten.

A **deployment** in Baseten is a **containerized instance of a model** that serves inference requests via an API endpoint. Deployments exist independently but can be **promoted to an environment** for structured access and scaling.

Baseten **automatically wraps every deployment in a REST API**. Once deployed, models can be queried with a simple HTTP request:

```python theme={"system"}
import requests

resp = requests.post(
    "https://model-{modelID}.api.baseten.co/deployment/{deploymentID}/predict",
    headers={"Authorization": "Api-Key YOUR_API_KEY"},
    json={'text': 'Hello my name is {MASK}'},
)
print(resp.json())
```

[Learn more about running inference on your deployment](/inference/calling-your-model)

***

# Development deployment

A **development deployment** is a mutable instance designed for rapid iteration. Create one with `truss push --watch` (for models) or `truss chains push --watch` (for Chains). It is always in the **development state** and cannot be renamed or detached from it.

Key characteristics:

* **Live reload** enables direct updates without redeployment.
* **Single replica, scales to zero** when idle to conserve compute resources.
* **No autoscaling or zero-downtime updates.**
* **Can be promoted** to create a persistent deployment.

Once promoted, the development deployment transitions to a **deployment** and can optionally be promoted to an environment.

***

# Environments and promotion

Environments provide **logical isolation** for managing deployments but are **not required** for a deployment to function. You can run a deployment independently or promote it to an environment for controlled traffic allocation and scaling.

* The **production environment** exists by default.
* **Custom environments** (for example, staging) can be created for specific workflows.
* **Promoting a deployment doesn't modify its behavior**, only its routing and lifecycle management.

## Rolling deployments

Rolling deployments replace replicas incrementally when promoting a deployment to an environment. Instead of swapping all traffic at once, rolling deployments scale up the candidate, shift traffic proportionally, and scale down the previous deployment in controlled steps. You can pause, resume, cancel, or force-complete a rolling deployment at any point. See [Rolling deployments](/deployment/rolling-deployments) for configuration, control actions, and status reference.

## Canary deployments (deprecated)

Canary deployments are deprecated. Use [rolling deployments](/deployment/rolling-deployments) for incremental traffic shifting with finer control over replica provisioning and rollback.

Canary deployments support incremental traffic shifting to a new deployment in 10 evenly distributed stages over a configurable time window. Enable or cancel canary rollouts via the UI or [REST API](/reference/management-api/environments/update-an-environments-settings).

***

# Managing deployments

## Naming deployments

By default, deployments of a model are named sequentially: `deployment-1`, `deployment-2`, and so on.
You can instead give deployments custom names via two methods: 1. While creating the deployment, using a [command line argument in truss push](/reference/sdk/truss#deploying-a-model). 2. After creating the deployment, in the model management page within your Baseten dashboard. Renaming deployments is purely aesthetic and does not affect model management API paths, which work via model and deployment IDs. ## Deactivating a deployment Deactivate a deployment to suspend inference execution while preserving configuration. * **Remains visible in the dashboard.** * **Consumes no compute resources** but can be reactivated anytime. * **API requests return a 404 error while deactivated.** For demand-driven deployments, consider [autoscaling with scale to zero](/reference/management-api/deployments/autoscaling/updates-a-deployments-autoscaling-settings). ## Deleting deployments You can **permanently delete** deployments, but production deployments must be replaced before deletion. * **Deleted deployments are purged from the dashboard** but retained in usage logs. * **All associated compute resources are released.** * **API requests return a 404 error post-deletion.** Deletion is irreversible. Use deactivation if retention is required. # Environments Source: https://docs.baseten.co/deployment/environments Manage your model's release cycles with environments. Environments provide structured management for deployments, ensuring controlled rollouts, stable endpoints, and autoscaling. They help teams stage, test, and release models without affecting production traffic. Deployments can be promoted to an environment (for example, "staging") to validate outputs before moving to production, allowing for safer model iteration and evaluation. *** ## Deployment management Environments support structured validation before promoting a deployment, including: * Automated tests and evaluations. * Manual testing in pre-production. * Gradual traffic shifts with canary deployments. 
* Shadow serving for real-world analysis. Promoting a deployment ensures it inherits environment-specific scaling and monitoring settings: * Dedicated API endpoint: See the [Predict API reference](/reference/inference-api/overview#predict-endpoints). * Autoscaling controls: Scale behavior is managed per environment. * Traffic ramp-up: Supports [canary rollouts](/deployment/deployments#canary-deployments) and [rolling deployments](/deployment/rolling-deployments). * Monitoring and metrics: [Export environment metrics](/observability/export-metrics/overview). The production environment operates like any other environment but has restrictions: * It can't be deleted unless the entire model is removed. * You can't create additional environments named "production." *** ## Custom environments In addition to the standard production environment, you can create as many custom environments as needed: 1. In the model management page on the Baseten dashboard. 2. Via the [create environment endpoint](/reference/management-api/environments/create-an-environment) in the management API. *** ## Deployment promotion When you promote a deployment to an environment, Baseten associates the deployment with that environment and applies the environment's autoscaling settings. If the deployment can be reused directly, promotion completes without creating new resources. Otherwise, Baseten creates a new deployment with a unique ID, initializes its resources, and replaces the existing deployment in that environment. A new deployment is created when: * The deployment is already associated with another environment. * The environment has a different instance type or resource profile. * [Re-deploy on promotion](#re-deploy-on-promotion) is enabled. If a previous deployment existed in the environment, the new one inherits its autoscaling settings and the old deployment is demoted. 
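The conditions above can be summed up as a small predicate. This is an illustrative sketch only; the dataclasses and field names are hypothetical, not Baseten internals:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical models of a deployment and a target environment,
# used only to illustrate the promotion rules described above.
@dataclass
class Deployment:
    environment: Optional[str]  # environment this deployment is associated with, if any
    instance_type: str

@dataclass
class Environment:
    name: str
    instance_type: str
    redeploy_on_promotion: bool = False

def promotion_creates_new_deployment(d: Deployment, env: Environment) -> bool:
    """True when promoting d to env creates a new deployment instead of reusing d."""
    if d.environment is not None and d.environment != env.name:
        return True  # already associated with another environment
    if d.instance_type != env.instance_type:
        return True  # different instance type or resource profile
    return env.redeploy_on_promotion  # forced fresh deployment

# A deployment matching the environment's resources is reused directly.
print(promotion_creates_new_deployment(
    Deployment(environment=None, instance_type="L4:4x16"),
    Environment(name="production", instance_type="L4:4x16"),
))  # -> False
```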
### Published deployment promotion If a published deployment (not a development deployment) is promoted, its autoscaling settings are updated to match the environment. Previous deployments are demoted but remain in the system. *** ## Direct deployment to an environment You can deploy directly to a named environment by specifying `--environment` in `truss push`: ```sh theme={"system"} cd my_model/ truss push --environment {environment_name} ``` Only one active promotion per environment is allowed at a time. *** ## Environment access in code The environment name is available in `model.py` via the `environment` keyword argument: ```python theme={"system"} def __init__(self, **kwargs): self._environment = kwargs["environment"] ``` You can use the environment in your `load()` method to configure per-environment behavior: ```python theme={"system"} def load(self): if self._environment.get("name") == "production": self.setup_sentry() self.model = self.load_production_weights() else: self.model = self.load_default_weights() ``` If you use environment-specific configuration in `load()`, you'll need to enable re-deploy on promotion to ensure the environment is correctly initialized after each promotion. See [Re-deploy on promotion](#re-deploy-on-promotion) for details. *** ## Re-deploy on promotion By default, promoting a deployment reuses the existing deployment when possible. This is the fastest promotion path, but it means `load()` doesn't re-run. Any environment-specific configuration set during the original `load()` call persists, even if the deployment moves to a different environment. You can configure an environment to create a fresh deployment every time you promote to it. The new deployment runs `load()` with the target environment's context, so environment-specific configuration takes effect. 
Enable this if your `load()` method uses `kwargs["environment"]` to configure per-environment behavior, or if you promote the same source deployment to multiple environments and want each to get a fresh deployment. Toggle **Re-deploy when promoting** in the environment settings on your model's page in the Baseten dashboard, or set it via the [update environment settings endpoint](/reference/management-api/environments/update-an-environments-settings). If you promote a deployment that's already associated with an environment, Baseten creates a new deployment regardless of this setting. *** ## Regional environments Regional environments restrict inference traffic to a specific geographic region for data residency compliance. When your organization enables regional environments, each environment gets a dedicated regional endpoint that routes directly to infrastructure in the designated region. Your Baseten account team configures regional environments at the organization level. Contact them to enable regional environments. ### Regional endpoint format Regional endpoints embed the environment name in the hostname instead of the URL path: Call a model's regional endpoint with `/predict` or `/async_predict`. ``` https://model-{model_id}-{env_name}.api.baseten.co/predict ``` For example, a model with ID `abc123` in the `prod-us` environment: ``` https://model-abc123-prod-us.api.baseten.co/predict ``` Call a chain's regional endpoint with `/run_remote` or `/async_run_remote`. ``` https://chain-{chain_id}-{env_name}.api.baseten.co/run_remote ``` Connect to a regional WebSocket endpoint for models or chains. ``` wss://model-{model_id}-{env_name}.api.baseten.co/websocket wss://chain-{chain_id}-{env_name}.api.baseten.co/websocket ``` Connect to a regional gRPC endpoint using the `grpc.api.baseten.co` subdomain. 
``` model-{model_id}-{env_name}.grpc.api.baseten.co:443 ``` The regional endpoint URL appears in your model's API endpoint section in the Baseten dashboard once your organization has regional environments enabled. ### API restrictions on regional endpoints Regional endpoints derive the environment exclusively from the hostname. Path-based routing (`/environments/`, `/production/`, `/deployment/`) is rejected. For gRPC, don't set `x-baseten-environment` or `x-baseten-deployment` metadata headers. *** ## Environment deletion You can delete environments, except for production. To remove a production deployment, first promote another deployment to production or delete the entire model. * Deleted environments are removed from the overview but remain in billing history. * They don't consume resources after deletion. * API requests to a deleted environment return a 404 error. Deletion is permanent. Consider deactivation instead. # Regional environments Source: https://docs.baseten.co/deployment/regional-environments Guarantee inference data stays in a specific geographic region with regional environments. Regional environments route inference traffic for a deployment exclusively to workload planes within a designated geographic region. Use regional environments to meet data residency and compliance requirements, such as GDPR, without managing separate models per region. Regional environments require initial configuration by Baseten. [Contact support](mailto:support@baseten.co) to set up regional restrictions for your environments. ## How regional environments work Regional environments build on [environments](/deployment/environments) and [restricted environments](/organization/restricted-environments) to add region-level routing guarantees. When Baseten configures regional restrictions for an environment, two things happen: 1. 
**Replicas are constrained** to workload planes within the designated region — deployments promoted to that environment only run in the allowed region. 2. **A regional inference endpoint** becomes available that routes traffic directly to the region-specific workload plane, guaranteeing data stays in the designated region. ### Comparing regional and standard endpoints Standard environment endpoints don't guarantee regional routing. Traffic may pass through a workload plane outside the intended region depending on DNS resolution. Regional endpoints use a different URL format that maps directly to a region-specific workload plane: | Endpoint type | URL format | Regional guarantee | | :------------ | :------------------------------------------------------------------------ | :----------------- | | Standard | `https://model-{model_id}.api.baseten.co/environments/{env_name}/predict` | No | | Regional | `https://model-{model_id}-{env_name}.api.baseten.co/predict` | Yes | The standard endpoint continues to function after you enable regional environments. However, it doesn't guarantee that traffic stays within the restricted region. If you use regional environments, migrate your calling code to the regional endpoint to maintain compliance. The standard endpoint routes traffic through the original CNAME, which may point to a workload plane outside the restricted region. ### Calling a regional endpoint Regional endpoints accept the same request format as standard predict endpoints: Create an `httpx.Client` with the regional endpoint as the `base_url`. Reuse the client across requests for connection pooling. See [Configure HTTP clients](/inference/http-client-configuration) for recommended timeout and pool settings. 
```python theme={"system"} import httpx import os model_id = "" env_name = "prod-us" client = httpx.Client( base_url=f"https://model-{model_id}-{env_name}.api.baseten.co", headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"}, ) response = client.post("/predict", json={"prompt": "Hello, world!"}) print(response.json()) ``` Send a POST request to the regional endpoint with your API key in the `Authorization` header. ```sh theme={"system"} curl -X POST https://model-{model_id}-{env_name}.api.baseten.co/predict \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -H 'Content-Type: application/json' \ -d '{"prompt": "Hello, world!"}' ``` Use the built-in `fetch` API to call the regional endpoint. Replace `modelId` and `envName` with your model ID and environment name. ```javascript theme={"system"} const modelId = ""; const envName = "prod-us"; const resp = await fetch( `https://model-${modelId}-${envName}.api.baseten.co/predict`, { method: "POST", headers: { Authorization: `Api-Key ${process.env.BASETEN_API_KEY}`, "Content-Type": "application/json", }, body: JSON.stringify({ prompt: "Hello, world!" }), } ); const data = await resp.json(); console.log(data); ``` ## Setting up regional environments 1. **Create environments** with region-specific names (for example, `prod-us`, `prod-eu`, `staging-eu`). Use [restricted environments](/organization/restricted-environments) to control access. 2. **[Contact Baseten support](mailto:support@baseten.co)** to configure regional restrictions for your environments. We'll work with you to set them up per your required specs. 3. **Update your calling code** to use the regional endpoint format: `https://model-{model_id}-{env_name}.api.baseten.co/predict`. ### Environment naming requirements Environment names used with regional environments must be valid DNS subdomain labels: * Lowercase alphanumeric characters and hyphens only. * Can't start or end with a hyphen. * Maximum 40 characters. 
* `development` is a reserved name and can't be used.

Regional environments apply across all models in a team. If you name an environment `prod-us` on one model, creating `prod-us` on another model in the same team applies the same regional restrictions.

## Deploying to regional environments

Deploy and promote to regional environments the same way as standard environments:

```sh theme={"system"}
truss push --environment prod-us
```

Replicas spin up only in workload planes within the allowed region.

### Promotion behavior

When you promote a deployment to a regional environment, Baseten ensures regional restrictions are enforced. If the deployment was previously running without regional restrictions, a forced redeploy occurs to ensure compliance. This happens even when re-deploy on promotion is turned off for the model.

## Supported regions

Baseten can configure regional restrictions for a variety of geographic regions, including US, EU, UK, and Australia. [Contact support](mailto:support@baseten.co) to discuss your specific regional requirements.

# Resources

Source: https://docs.baseten.co/deployment/resources

Manage and configure model resources.

Every AI/ML model on Baseten runs on an **instance**, a dedicated set of hardware allocated to the model server. Selecting the right instance type ensures **optimal performance** while controlling **compute costs**.

* **Insufficient resources**: Slow inference or failures.
* **Excess resources**: Higher costs without added benefit.

## Instance type resource components

* **Instance**: The allocated hardware for inference.
* **Node**: The compute unit within an instance, comprising 8 GPUs with associated vCPU, RAM, and VRAM.
* **vCPU**: Virtual CPU cores for general computing.
* **RAM**: Memory available to the CPU.
* **GPU**: Specialized hardware for accelerated ML workloads.
* **VRAM**: Dedicated GPU memory for model execution.
***

# Configuring model resources

Define resources **before deployment** in Truss or **adjust them later** via the Baseten UI.

### Defining resources in Truss

Define resource requirements in [`config.yaml`](/development/model/configuration) before running `truss push`.

* **Published deployment** (`truss push`): Creates a new deployment (named sequentially: `deployment-1`, `deployment-2`, etc.) using the resources in [`config.yaml`](/development/model/configuration).
* **Development deployment** (`truss push --watch`): Overwrites the existing development deployment with the specified resource configuration and starts watching for changes. Use [`truss watch`](/development/model/deploy-and-iterate) to resume watching an existing development deployment.
* **Production deployment** (`truss push --promote`): Creates a new deployment and promotes it to production, replacing the active deployment.
* **Environment deployment** (`truss push --environment {environment_name}`): Deploys directly to a [custom environment](/deployment/environments) like staging.

Changes to `config.yaml` only affect new deployments. To update resources on an existing published deployment, edit resources in the [Baseten UI](#updating-resources-in-the-baseten-ui).

You can configure resources in two ways:

**Option 1: Specify individual resource fields**

```yaml config.yaml theme={"system"}
resources:
  accelerator: L4
  cpu: "4"
  memory: 16Gi
```

Baseten provisions the **smallest instance that meets the specified constraints**:

* cpu: "3" or "4" → Maps to a 4-core instance.
* cpu: "5" to "8" → Maps to an 8-core instance.

`Gi` in `resources.memory` refers to **Gibibytes**, which are slightly larger than **Gigabytes**.

**Option 2: Specify an exact instance type**

An instance type is the full SKU name that uniquely identifies a specific hardware configuration. When you specify individual resource fields like `cpu` and `accelerator`, Baseten selects the smallest instance that meets your requirements.
With `instance_type`, you specify exactly which instance you want, no guessing required.

Use `instance_type` when you:

* Know the exact hardware configuration you need.
* Want to ensure consistent instance selection across deployments.
* Are following a recommendation for a specific model (for example, "use an L4 with 4 vCPUs and 16 GiB RAM").

```yaml config.yaml theme={"system"}
resources:
  instance_type: "L4:4x16"
```

The format encodes the hardware specs: `{gpu}:{vCPU}x{RAM}`. For example, `L4:4x16` means an L4 GPU with 4 vCPUs and 16 GiB of RAM. When `instance_type` is specified, other resource fields (`cpu`, `memory`, `accelerator`, `use_gpu`) are ignored.

### Updating resources in the Baseten UI

Once deployed, you can only update resource configurations **through the Baseten UI**. Changing the instance type deploys a copy of the deployment using the specified instance type.

For a list of available instance types, see the [instance type reference](/deployment/resources#instance-type-reference).

***

# Instance type reference

Specs and benchmarks for every Baseten instance type.

Choosing the right instance for model inference means balancing performance and cost. This page lists all available instance types on Baseten to help you deploy and serve models effectively.

## CPU-only instances

Cost-effective options for lighter workloads. No GPU.

* **Starts at**: \$0.00058/min
* **Best for**: Transformers pipelines, small QA models, text embeddings

| Instance | \$/min | vCPU | RAM |
| -------- | --------- | ---- | ------ |
| 1x2 | \$0.00058 | 1 | 2 GiB |
| 1x4 | \$0.00086 | 1 | 4 GiB |
| 2x8 | \$0.00173 | 2 | 8 GiB |
| 4x16 | \$0.00346 | 4 | 16 GiB |
| 8x32 | \$0.00691 | 8 | 32 GiB |
| 16x64 | \$0.01382 | 16 | 64 GiB |

To select a CPU-only instance, use the format `CPU:{vCPU}x{RAM}` (for example, `instance_type: "CPU:4x16"`).
**Example workloads:**

* `1x2`: Text classification (for example, Truss quickstart)
* `4x16`: LayoutLM Document QA
* `4x16+`: Sentence Transformers embeddings on larger corpora

## GPU instances

Accelerated inference for LLMs, diffusion models, and Whisper.

| Instance       | \$/min    | vCPU | RAM      | GPU                    | VRAM     |
| -------------- | --------- | ---- | -------- | ---------------------- | -------- |
| T4x4x16        | \$0.01052 | 4    | 16 GiB   | NVIDIA T4              | 16 GiB   |
| T4x8x32        | \$0.01504 | 8    | 32 GiB   | NVIDIA T4              | 16 GiB   |
| T4x16x64       | \$0.02408 | 16   | 64 GiB   | NVIDIA T4              | 16 GiB   |
| L4x4x16        | \$0.01414 | 4    | 16 GiB   | NVIDIA L4              | 24 GiB   |
| L4:2x24x96     | \$0.04002 | 24   | 96 GiB   | 2 NVIDIA L4s           | 48 GiB   |
| L4:4x48x192    | \$0.08003 | 48   | 192 GiB  | 4 NVIDIA L4s           | 96 GiB   |
| A10Gx4x16      | \$0.02012 | 4    | 16 GiB   | NVIDIA A10G            | 24 GiB   |
| A10Gx8x32      | \$0.02424 | 8    | 32 GiB   | NVIDIA A10G            | 24 GiB   |
| A10Gx16x64     | \$0.03248 | 16   | 64 GiB   | NVIDIA A10G            | 24 GiB   |
| A10G:2x24x96   | \$0.05672 | 24   | 96 GiB   | 2 NVIDIA A10Gs         | 48 GiB   |
| A10G:4x48x192  | \$0.11344 | 48   | 192 GiB  | 4 NVIDIA A10Gs         | 96 GiB   |
| A10G:8x192x768 | \$0.32576 | 192  | 768 GiB  | 8 NVIDIA A10Gs         | 188 GiB  |
| A100x12x144    | \$0.06667 | 12   | 144 GiB  | 1 NVIDIA A100          | 80 GiB   |
| A100:2x24x288  | \$0.13333 | 24   | 288 GiB  | 2 NVIDIA A100s         | 160 GiB  |
| A100:3x36x432  | \$0.20000 | 36   | 432 GiB  | 3 NVIDIA A100s         | 240 GiB  |
| A100:4x48x576  | \$0.26667 | 48   | 576 GiB  | 4 NVIDIA A100s         | 320 GiB  |
| A100:5x60x720  | \$0.33333 | 60   | 720 GiB  | 5 NVIDIA A100s         | 400 GiB  |
| A100:6x72x864  | \$0.40000 | 72   | 864 GiB  | 6 NVIDIA A100s         | 480 GiB  |
| A100:7x84x1008 | \$0.46667 | 84   | 1008 GiB | 7 NVIDIA A100s         | 560 GiB  |
| A100:8x96x1152 | \$0.53333 | 96   | 1152 GiB | 8 NVIDIA A100s         | 640 GiB  |
| H100           | \$0.10833 | -    | -        | 1 NVIDIA H100          | 80 GiB   |
| H100:2         | \$0.21667 | -    | -        | 2 NVIDIA H100s         | 160 GiB  |
| H100:4         | \$0.43333 | -    | -        | 4 NVIDIA H100s         | 320 GiB  |
| H100:8         | \$0.86667 | -    | -        | 8 NVIDIA H100s         | 640 GiB  |
| H100MIG        | \$0.06250 | -    | -        | Fractional NVIDIA H100 | 40 GiB   |
| H200           | \$0.20800 | 28   | 384 GiB  | 1 NVIDIA H200          | 141 GiB  |
| H200:2         | \$0.41600 | 58   | 768 GiB  | 2 NVIDIA H200s         | 282 GiB  |
| H200:4         | \$0.83200 | 112  | 1536 GiB | 4 NVIDIA H200s         | 564 GiB  |
| H200:8         | \$1.66400 | 224  | 3072 GiB | 8 NVIDIA H200s         | 1128 GiB |
| B200           | \$0.16633 | 28   | 384 GiB  | 1 NVIDIA B200          | 180 GiB  |
| B200:2         | \$0.33267 | 58   | 768 GiB  | 2 NVIDIA B200s         | 360 GiB  |
| B200:4         | \$0.66533 | 112  | 1536 GiB | 4 NVIDIA B200s         | 720 GiB  |
| B200:8         | \$1.33067 | 224  | 3072 GiB | 8 NVIDIA B200s         | 1440 GiB |

H200 and B200 instances are available on request. [Contact us](mailto:support@baseten.co) to get access.

To select a GPU instance with `instance_type`:

* **Single GPU**: `<gpu>:<vcpu>x<ram>` (for example, `"L4:4x16"`).
* **Multi-GPU**: `<gpu>:<count>x<vcpu>x<ram>` (for example, `"A100:2x24x288"`).
* **H100/H200/B200**: `<gpu>` or `<gpu>:<count>` (for example, `"H100:2"`, `"B200:4"`).
* **Fractional H100**: `"H100_40GB"`.

## GPU details and workloads

### T4

Turing-series GPU

* 2,560 CUDA / 320 Tensor cores
* 16 GiB VRAM
* **Best for:** Whisper, small LLMs like StableLM 3B

### L4

Ada Lovelace-series GPU

* 7,680 CUDA / 240 Tensor cores
* 24 GiB VRAM, 300 GiB/s
* 121 TFLOPS (fp16)
* **Best for**: Stable Diffusion XL
* **Limit**: Not suitable for LLMs due to bandwidth

### A10G

Ampere-series GPU

* 9,216 CUDA / 288 Tensor cores
* 24 GiB VRAM, 600 GiB/s
* 70 TFLOPS (fp16)
* **Best for**: Mistral 7B, Whisper, Stable Diffusion/SDXL

### A100

Ampere-series GPU

* 6,912 CUDA / 432 Tensor cores
* 80 GiB VRAM, 1.94 TB/s
* 312 TFLOPS (fp16)
* **Best for**: Mixtral, Llama 2 70B (2 A100s), Falcon 180B (5 A100s), SDXL

### H100

Hopper-series GPU

* 16,896 CUDA / 640 Tensor cores
* 80 GiB VRAM, 3.35 TB/s
* 990 TFLOPS (fp16)
* **Best for**: Mixtral 8x7B, Llama 2 70B (2xH100), SDXL

### H100MIG

Fractional H100 (3/7 compute, ½ memory)

* 7,242 CUDA cores, 40 GiB VRAM
* 1.675 TB/s bandwidth
* **Best for**: Efficient LLM inference at lower cost than A100

# Rolling deployments

Source:
https://docs.baseten.co/deployment/rolling-deployments

Gradually shift traffic to a new deployment with replica-based rolling deployments.

Rolling deployments replace replicas incrementally when promoting a deployment to an environment. Instead of swapping all traffic at once, rolling deployments scale up the candidate deployment, shift traffic proportionally, and scale down the previous deployment in controlled steps. Use rolling deployments when you need zero-downtime updates with the ability to pause, cancel, or force-complete the deployment at any point.

Rolling deployments are not supported for [Chains](/chains/overview). This feature is available for individual model deployments only.

Autoscaling is disabled for the entire duration of a rolling deployment. Replica counts don't adjust automatically until the deployment reaches a terminal status (`SUCCEEDED`, `FAILED`, or `CANCELED`). Use the `replica_overhead_percent` setting to pre-provision additional capacity before the deployment starts.

## How rolling deployments work

A rolling deployment follows a repeating three-step cycle:

1. **Scale up** candidate deployment replicas by the configured percentage.
2. **Shift traffic** proportionally to match the new replica ratio.
3. **Scale down** the previous deployment replicas by the same percentage.

This cycle repeats until all traffic and replicas run on the candidate deployment, at which point it becomes the active deployment in the environment.

### Provisioning modes

Rolling deployments support two mutually exclusive provisioning modes. You must configure exactly one:

* `max_surge_percent`: Scales up candidate replicas before scaling down previous replicas.
* `max_unavailable_percent`: Scales down previous replicas before scaling up candidate replicas.

Exactly one of the two must be set to a non-zero value.

## Enabling rolling deployments

Enable rolling deployments on any environment by updating the environment's promotion settings. Rolling deployments are disabled by default.

```bash theme={"system"}
curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/environments/production \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "promotion_settings": {
      "rolling_deploy": true,
      "rolling_deploy_config": {
        "max_surge_percent": 10,
        "max_unavailable_percent": 0,
        "stabilization_time_seconds": 60,
        "replica_overhead_percent": 0
      }
    }
  }'
```

```python theme={"system"}
import requests
import os

API_KEY = os.environ.get("BASETEN_API_KEY")
response = requests.patch(
    "https://api.baseten.co/v1/models/{model_id}/environments/production",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={
        "promotion_settings": {
            "rolling_deploy": True,
            "rolling_deploy_config": {
                "max_surge_percent": 10,
                "max_unavailable_percent": 0,
                "stabilization_time_seconds": 60,
                "replica_overhead_percent": 0,
            },
        }
    },
)
print(response.json())
```

Once rolling deployments are enabled, any subsequent [promotion to the environment](/reference/management-api/deployments/promote/promotes-a-deployment-to-an-environment) uses the rolling deployment workflow.

## Configuration reference

Configure rolling deployments through the `rolling_deploy_config` object in the environment's `promotion_settings`.

* `max_surge_percent`: Percentage of additional replicas to provision during each step. Set to `0` to use max unavailable mode instead. **Range:** 0–50
* `max_unavailable_percent`: Percentage of replicas that can be unavailable during each step. Set to `0` to use max surge mode instead. **Range:** 0–50
* `stabilization_time_seconds`: Seconds to wait after each traffic shift before proceeding to the next step. Use this to monitor metrics between steps.
**Range:** 0–3600

* `replica_overhead_percent`: Percentage of additional replicas to pre-provision on the current deployment before the rolling deployment starts. Compensates for autoscaling being disabled. **Range:** 0–500

Additional promotion settings configured at the `promotion_settings` level:

* `rolling_deploy`: Enables rolling deployments for the environment.

## Deployment statuses

The `in_progress_promotion` field on the [environment detail endpoint](/reference/management-api/environments/get-an-environments-details) tracks the current state of a rolling deployment.

| Status         | Description                                                                                        |
| -------------- | -------------------------------------------------------------------------------------------------- |
| `RELEASING`    | Candidate deployment is building and initializing replicas.                                        |
| `RAMPING_UP`   | Scaling up candidate replicas and shifting traffic.                                                |
| `PAUSED`       | Rolling deployment is paused at its current traffic split. Replicas stay at their current count.   |
| `RAMPING_DOWN` | Graceful cancel in progress. Traffic is shifting back to the previous deployment.                  |
| `SUCCEEDED`    | Rolling deployment completed. The candidate is now the active deployment. Autoscaling resumes.     |
| `FAILED`       | Rolling deployment failed. Traffic remains on the previous deployment. Autoscaling resumes.        |
| `CANCELED`     | Rolling deployment was canceled. Traffic returned to the previous deployment. Autoscaling resumes. |

The `in_progress_promotion` object also includes `percent_traffic_to_new_version`, which reports the current percentage of traffic routed to the candidate deployment.

## Deployment control actions

### Pause

Pauses the rolling deployment after the current step completes. Use this to inspect metrics or logs before proceeding.
```bash theme={"system"} curl -X POST \ https://api.baseten.co/v1/models/{model_id}/environments/production/pause_promotion \ -H "Authorization: Api-Key $BASETEN_API_KEY" ``` ```python theme={"system"} response = requests.post( "https://api.baseten.co/v1/models/{model_id}/environments/production/pause_promotion", headers={"Authorization": f"Api-Key {API_KEY}"}, ) print(response.json()) ``` ### Resume Resumes a paused rolling deployment from where it left off. ```bash theme={"system"} curl -X POST \ https://api.baseten.co/v1/models/{model_id}/environments/production/resume_promotion \ -H "Authorization: Api-Key $BASETEN_API_KEY" ``` ```python theme={"system"} response = requests.post( "https://api.baseten.co/v1/models/{model_id}/environments/production/resume_promotion", headers={"Authorization": f"Api-Key {API_KEY}"}, ) print(response.json()) ``` ### Cancel Gracefully cancels the rolling deployment. Traffic ramps back to the previous deployment and candidate replicas scale down. ```bash theme={"system"} curl -X POST \ https://api.baseten.co/v1/models/{model_id}/environments/production/cancel_promotion \ -H "Authorization: Api-Key $BASETEN_API_KEY" ``` ```python theme={"system"} response = requests.post( "https://api.baseten.co/v1/models/{model_id}/environments/production/cancel_promotion", headers={"Authorization": f"Api-Key {API_KEY}"}, ) print(response.json()) ``` Returns a `status` of `CANCELED` (instant cancel for non-rolling deployments) or `RAMPING_DOWN` (graceful rollback for rolling deployments). ### Force cancel Immediately cancels the rolling deployment and returns all traffic to the previous deployment. Use this when you need to roll back without waiting for the graceful ramp-down. Force canceling may cause brief service disruption if the previous deployment is under-provisioned. 
```bash theme={"system"} curl -X POST \ https://api.baseten.co/v1/models/{model_id}/environments/production/force_cancel_promotion \ -H "Authorization: Api-Key $BASETEN_API_KEY" ``` ```python theme={"system"} response = requests.post( "https://api.baseten.co/v1/models/{model_id}/environments/production/force_cancel_promotion", headers={"Authorization": f"Api-Key {API_KEY}"}, ) print(response.json()) ``` ### Force roll forward Immediately completes the rolling deployment, shifting all traffic to the candidate deployment. This works even if the deployment is in the process of rolling back. Force rolling forward may promote an under-provisioned deployment if the candidate has not finished scaling up. ```bash theme={"system"} curl -X POST \ https://api.baseten.co/v1/models/{model_id}/environments/production/force_roll_forward_promotion \ -H "Authorization: Api-Key $BASETEN_API_KEY" ``` ```python theme={"system"} response = requests.post( "https://api.baseten.co/v1/models/{model_id}/environments/production/force_roll_forward_promotion", headers={"Authorization": f"Api-Key {API_KEY}"}, ) print(response.json()) ``` ## Autoscaling during rolling deployments To compensate for autoscaling being disabled during rolling deployments: * Set `replica_overhead_percent` to pre-provision the current deployment before the rolling deployment starts. For example, a value of `50` adds 50% more replicas to the current deployment before any traffic shifts. * Set `stabilization_time_seconds` to add a wait period between steps, giving you time to monitor metrics before the next traffic shift. * Factor in expected traffic when setting your environment's `min_replica` and `max_replica` before starting the rolling deployment. Autoscaling resumes automatically when the rolling deployment reaches a terminal status: `SUCCEEDED`, `FAILED`, or `CANCELED`. 
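To build intuition for how these settings interact, here is a rough, stdlib-only simulation of a max-surge rollout. The actual step arithmetic is internal to Baseten, so treat this as an illustration of the scale-up / shift / scale-down cycle, not a specification:

```python
import math

def simulate_max_surge(replicas: int, max_surge_percent: int,
                       replica_overhead_percent: int = 0):
    """Illustrative simulation of the rolling-deployment cycle.

    Returns a list of (candidate, previous, traffic_percent) tuples,
    one per step. Not Baseten's actual algorithm.
    """
    # Pre-provision the current deployment to compensate for paused autoscaling.
    previous = replicas + math.ceil(replicas * replica_overhead_percent / 100)
    step = max(1, math.ceil(replicas * max_surge_percent / 100))
    candidate, traffic = 0, 0
    steps = []
    while traffic < 100:
        candidate = min(replicas, candidate + step)   # 1. scale up candidate
        traffic = round(100 * candidate / replicas)   # 2. shift traffic
        previous = max(0, previous - step)            # 3. scale down previous
        steps.append((candidate, previous, traffic))
    return steps

# 10 replicas with max_surge_percent=10: one replica and ~10% traffic per step.
for s in simulate_max_surge(10, 10):
    print(s)
```

With `max_surge_percent: 10` on a 10-replica deployment, the simulation takes ten steps, moving one replica and roughly 10% of traffic at a time, which is why larger surge percentages complete faster but need more headroom.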
## Deployment cleanup

After a rolling deployment completes, the `promotion_cleanup_strategy` setting controls what happens to the previous deployment.

* `SCALE_TO_ZERO`: Scales the previous deployment to zero replicas. It remains available for reactivation. This is the default.
* `KEEP`: Leaves the previous deployment running at its current replica count.
* `DEACTIVATE`: Deactivates the previous deployment. It stops serving traffic and releases all resources.

Set it alongside your other promotion settings:

```bash theme={"system"}
curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/environments/production \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "promotion_settings": {
      "promotion_cleanup_strategy": "DEACTIVATE"
    }
  }'
```

```python theme={"system"}
response = requests.patch(
    "https://api.baseten.co/v1/models/{model_id}/environments/production",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={
        "promotion_settings": {
            "promotion_cleanup_strategy": "DEACTIVATE"
        }
    },
)
print(response.json())
```

# Binary IO

Source: https://docs.baseten.co/development/chain/binaryio

Performant serialization of numeric data

Numeric data and audio/video are most efficiently transmitted as bytes. Other representations such as JSON or base64 encoding lose precision, add significant parsing overhead, and increase message sizes (for example, \~33% increase for base64 encoding). Chains extends the JSON-centered pydantic ecosystem with two ways to include binary data: numpy array support and raw bytes.

## Numpy `ndarray` support

Once you have your data represented as a numpy array, you can easily (and often without copying) convert it to `torch`, `tensorflow`, or other common numeric libraries' objects. To include numpy arrays in a pydantic model, Chains has a special field type implementation, `NumpyArrayField`.
For example: ```python theme={"system"} import numpy as np import pydantic from truss_chains import pydantic_numpy class DataModel(pydantic.BaseModel): some_numbers: pydantic_numpy.NumpyArrayField other_field: str ... numbers = np.random.random((3, 2)) data = DataModel(some_numbers=numbers, other_field="Example") print(data) # some_numbers=NumpyArrayField(shape=(3, 2), dtype=float64, data=[ # [0.39595027 0.23837526] # [0.56714894 0.61244946] # [0.45821942 0.42464844]]) # other_field='Example' ``` `NumpyArrayField` is a wrapper around the actual numpy array. Inside your python code, you can work with its `array` attribute: ```python theme={"system"} data.some_numbers.array += 10 # some_numbers=NumpyArrayField(shape=(3, 2), dtype=float64, data=[ # [10.39595027 10.23837526] # [10.56714894 10.61244946] # [10.45821942 10.42464844]]) # other_field='Example' ``` The interesting part is how it serializes when communicating between Chainlets or with a client. It can work in two modes: JSON and binary. ### Binary As a JSON alternative that supports byte data, Chains uses `msgpack` (with `msgpack_numpy`) to serialize the dict representation. For Chainlet-Chainlet RPCs this is done automatically for you by enabling binary mode of the dependency Chainlets, see [all options](/reference/sdk/chains#truss-chains-depends): ```python theme={"system"} import truss_chains as chains class Worker(chains.ChainletBase): async def run_remote(self, data: DataModel) -> DataModel: data.some_numbers.array += 10 return data class Consumer(chains.ChainletBase): def __init__(self, worker=chains.depends(Worker, use_binary=True)): self._worker = worker async def run_remote(self): numbers = np.random.random((3, 2)) data = DataModel(some_numbers=numbers, other_field="Example") result = await self._worker.run_remote(data) ``` Now the data is transmitted in a fast and compact way between Chainlets which often gives performance increases. 
### Binary client

If you want to send such data as input to a chain or parse binary output from a chain, you have to add the `msgpack` serialization client-side:

```python theme={"system"}
import requests
import msgpack
import msgpack_numpy

msgpack_numpy.patch()  # Register hook for numpy.

# Dump to "python" dict and then to binary.
data_dict = data.model_dump(mode="python")
data_bytes = msgpack.dumps(data_dict)

# Set binary content type in request header.
headers = {
    "Content-Type": "application/octet-stream",
    "Authorization": ...
}
response = requests.post(url, data=data_bytes, headers=headers)

response_dict = msgpack.loads(response.content)
response_model = ResponseModel.model_validate(response_dict)
```

The steps of dumping from a pydantic model and validating the response dict into a pydantic model can be skipped if you prefer working with raw dicts on the client.

The implementation of `NumpyArrayField` only needs `pydantic`, no other Chains dependencies, so you can take that implementation code in isolation and integrate it into your client code.

Some version combinations of `msgpack` and `msgpack_numpy` give errors; `msgpack >= 1.0.2` together with `msgpack-numpy >= 0.4.8` is known to work.

### JSON

The JSON schema representing the array is a dict with the fields `shape` (`tuple[int]`), `dtype` (`str`), and `data_b64` (`str`). For example:

```python theme={"system"}
print(data.model_dump_json())
# '{"some_numbers":{"shape":[3,2],"dtype":"float64", "data_b64":"30d4/rnKJEAsvm...'
```

The base64 data corresponds to `np.ndarray.tobytes()`. To get back to the array from the JSON string, use the model's `model_validate_json` method. As discussed at the beginning, this schema is not performant for numeric data and is only offered as a compatibility layer (JSON does not allow bytes); generally prefer the binary format.

## Simple `bytes` fields

It is possible to add a `bytes` field to a pydantic model used in a chain, or as a plain argument to `run_remote`.
This can be useful to include non-numpy data formats such as images or audio/video snippets. In this case, the "normal" JSON representation does not work, and all involved requests and Chainlet-Chainlet invocations must use binary mode.

The same steps as for arrays [above](#binary-client) apply: construct dicts with `bytes` values and keys corresponding to the `run_remote` argument names or the field names in the pydantic model. Then use `msgpack` to serialize and deserialize those dicts. Don't forget to set the `Content-Type` header, and note that `response.json()` will not work on binary responses.

# Concepts

Source: https://docs.baseten.co/development/chain/concepts

Glossary of Chains concepts and terminology

## Chainlet

A Chainlet is the basic building block of Chains. A Chainlet is a Python class that specifies:

* A set of compute resources.
* A Python environment with software dependencies.
* A typed interface [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets) for other Chainlets to call.

This is the simplest possible Chainlet. Only the [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets) method is required, and we can layer in other concepts to create a more capable Chainlet.

```python theme={"system"}
import truss_chains as chains


class SayHello(chains.ChainletBase):
    async def run_remote(self, name: str) -> str:
        return f"Hello, {name}"
```

You can modularize your code by creating your own chainlet sub-classes; refer to our [subclassing guide](/development/chain/subclassing).

### Remote configuration

Chainlets are meant for deployment as remote services. Each Chainlet specifies its own requirements for compute hardware (CPU count, GPU type and count, etc.) and software dependencies (Python libraries or system packages). This configuration is built into a Docker image automatically as part of the deployment process.
When no configuration is provided, the Chainlet will be deployed on a basic instance with one vCPU, 2 GB of RAM, no GPU, and a standard set of Python and system packages. Configuration is set using the [`remote_config`](/reference/sdk/chains#remote-configuration) class variable within the Chainlet:

```python theme={"system"}
import truss_chains as chains


class MyChainlet(chains.ChainletBase):
    remote_config = chains.RemoteConfig(
        docker_image=chains.DockerImage(
            pip_requirements=["torch==2.3.0", ...]
        ),
        compute=chains.Compute(gpu="H100", ...),
        assets=chains.Assets(secret_keys=["hf_access_token"], ...),
    )
```

To select an exact instance type instead of specifying individual resource fields, use `instance_type`:

```python theme={"system"}
compute=chains.Compute(instance_type="H100:8x80")
```

When `instance_type` is specified, the `cpu_count`, `memory`, and `gpu` fields are ignored. See the [remote configuration reference](/reference/sdk/chains#remote-configuration) for a complete list of options.

### Initialization

Chainlets are implemented as classes because we often want to set up expensive static resources once at startup and then re-use them with each invocation of the Chainlet. For example, we only want to initialize an AI model and download its weights once, then re-use it every time we run inference. We do this setup in `__init__()`, which is run exactly once when the Chainlet is deployed or scaled up.

```python theme={"system"}
import truss_chains as chains


class PhiLLM(chains.ChainletBase):
    def __init__(self) -> None:
        import torch
        import transformers

        self._model = transformers.AutoModelForCausalLM.from_pretrained(
            PHI_HF_MODEL,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        self._tokenizer = transformers.AutoTokenizer.from_pretrained(
            PHI_HF_MODEL,
        )
```

Chainlet initialization also has two important features: context and dependency injection of other Chainlets, explained below.
#### Context (access information)

You can add a [`DeploymentContext`](/reference/sdk/chains#class-truss-chains-deploymentcontext) object as an optional argument to the `__init__` method of a Chainlet. This allows you to use secrets within your Chainlet, such as a `hf_access_token` to access a gated model on Hugging Face (note that when using secrets, they also need to be added to the `assets`).

```python theme={"system"}
import truss_chains as chains


class MistralLLM(chains.ChainletBase):
    remote_config = chains.RemoteConfig(
        ...
        assets=chains.Assets(secret_keys=["hf_access_token"], ...),
    )

    def __init__(
        self,
        # Adding the `context` argument allows us to access secrets.
        context: chains.DeploymentContext = chains.depends_context(),
    ) -> None:
        import transformers

        # Using the secret from context to access a gated model on HF.
        self._model = transformers.AutoModelForCausalLM.from_pretrained(
            "mistralai/Mistral-7B-Instruct-v0.2",
            use_auth_token=context.secrets["hf_access_token"],
        )
```

#### Depends (call other Chainlets)

The Chains framework uses the [`chains.depends()`](/reference/sdk/chains#truss-chains-depends) function in Chainlets' `__init__()` method to track the dependency relationship between different Chainlets within a Chain. This syntax, inspired by dependency injection, is used to translate local Python function calls into calls to the remote Chainlets in production.
Once a dependency Chainlet is added with [`chains.depends()`](/reference/sdk/chains#truss-chains-depends), its [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets) method can call this dependency Chainlet. For example, below, `HelloAll` makes calls to `SayHello`:

```python theme={"system"}
import truss_chains as chains


class HelloAll(chains.ChainletBase):
    def __init__(self, say_hello_chainlet=chains.depends(SayHello)) -> None:
        self._say_hello = say_hello_chainlet

    async def run_remote(self, names: list[str]) -> str:
        output = []
        for name in names:
            output.append(await self._say_hello.run_remote(name))
        return "\n".join(output)
```

## Run remote (chaining Chainlets)

The `run_remote()` method is run each time the Chainlet is called. It is the sole public interface for the Chainlet (though you can have as many private helper functions as you want), and its inputs and outputs must have type annotations. In `run_remote()` you implement the actual work of the Chainlet, such as model inference or data chunking:

```python theme={"system"}
import truss_chains as chains


class PhiLLM(chains.ChainletBase):
    async def run_remote(self, messages: Messages) -> str:
        import torch

        model_inputs = await self._tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = await self._tokenizer(model_inputs, return_tensors="pt")
        input_ids = inputs["input_ids"].to("cuda")
        with torch.no_grad():
            outputs = await self._model.generate(
                input_ids=input_ids, **self._generate_args
            )
        output_text = await self._tokenizer.decode(
            outputs[0], skip_special_tokens=True
        )
        return output_text
```

We recommend implementing this as an `async` method and using async APIs for doing all the work (for example, downloads, vLLM or TRT inference). It is possible to stream results back; see our [streaming guide](/development/chain/streaming).
If `run_remote()` makes calls to other Chainlets, for example, invoking a dependency Chainlet for each element in a list, you can benefit from concurrent execution by making `run_remote()` an `async` method and starting the calls as concurrent tasks with `asyncio.create_task(self._dep_chainlet.run_remote(...))`.

## Entrypoint

The entrypoint is called directly from the deployed Chain's API endpoint and kicks off the entire chain. The entrypoint is also responsible for returning the final result back to the client.

Using the [`@chains.mark_entrypoint`](/reference/sdk/chains#truss-chains-mark-entrypoint) decorator, one Chainlet within a file is set as the entrypoint to the chain.

```python theme={"system"}
@chains.mark_entrypoint
class HelloAll(chains.ChainletBase):
```

Optionally, you can also set a Chain display name (not to be confused with the Chainlet display name) with this decorator:

```python theme={"system"}
@chains.mark_entrypoint("My Awesome Chain")
class HelloAll(chains.ChainletBase):
```

## I/O and `pydantic` data types

To make orchestrating multiple remotely deployed services possible, Chains relies heavily on typed inputs and outputs. Values must be serialized to a safe exchange format to be sent over the network. The Chains framework uses the type annotations to infer how data should be serialized and is currently restricted to types that are JSON compatible. Types can be:

* Direct type annotations for simple types such as `int`, `float`, or `list[str]`.
* Pydantic models to define a schema for nested data structures or multiple arguments.
An example of pydantic input and output types for a Chainlet is given below: ```python theme={"system"} import enum import pydantic class Modes(enum.Enum): MODE_0 = "MODE_0" MODE_1 = "MODE_1" class SplitTextInput(pydantic.BaseModel): data: str num_partitions: int mode: Modes class SplitTextOutput(pydantic.BaseModel): parts: list[str] part_lens: list[int] ``` Refer to the [pydantic docs](https://docs.pydantic.dev/latest/) for more details on how to define custom pydantic data models. Also refer to the [guide](/development/chain/binaryio) about efficient integration of binary and numeric data. ## Chains compared to Truss Chains is an alternate SDK for packaging and deploying AI models. It carries over many features and concepts from Truss and gives you access to the benefits of Baseten (resource provisioning, autoscaling, fast cold starts, etc), but it is not a 1-1 replacement for Truss. Here are some key differences: * Rather than running `truss init` and creating a Truss in a directory, a Chain is a single file, giving you more flexibility for implementing multi-step model inference. Create an example with `truss chains init`. * Configuration is done inline in typed Python code rather than in a `config.yaml` file. * While Chainlets are converted to Truss models when run on Baseten, `Chainlet != TrussModel`. Chains is designed for compatibility and incremental adoption, with a stub function for wrapping existing deployed models. # Deploy Source: https://docs.baseten.co/development/chain/deploy Deploy your Chain on Baseten Deploying a Chain is an atomic action that deploys every Chainlet within the Chain. Each Chainlet specifies its own remote environment: hardware resources, Python and system dependencies, autoscaling settings. ### Published deployment By default, pushing a Chain creates a published deployment: ```sh theme={"system"} truss chains push ./my_chain.py ``` Where `my_chain.py` contains the entrypoint Chainlet for your Chain. 
Published deployments have access to full autoscaling settings. Each time you push, a new deployment is created.

### Development

To create a development deployment for rapid iteration, use `--watch`:

```sh theme={"system"}
truss chains push ./my_chain.py --watch
```

Development deployments are intended for testing and can't scale past one replica. Each time you make a development deployment, it overwrites the existing development deployment. Development deployments support rapid iteration with live code patching. See the [watch guide](/development/chain/watch).

### Environments

To deploy a Chain to an environment, run:

```sh theme={"system"}
truss chains push ./my_chain.py --environment {env_name}
```

Environments are intended for live traffic and have access to full autoscaling settings. Each time you deploy to an environment, a new deployment is created. Once the new deployment is live, it replaces the previous deployment, which is relegated to the published deployments list. [Learn more](/deployment/environments) about environments.

# Architecture and design

Source: https://docs.baseten.co/development/chain/design

How to structure your Chainlets

A Chain is composed of multiple connected Chainlets working together to perform a task. For example, the Chain in the diagram below takes a large audio file as input. Then it splits it into smaller chunks, transcribes each chunk in parallel (reducing the end-to-end latency), and finally aggregates and returns the results.

To build an efficient Chain, we recommend drafting your high-level structure as a flowchart or diagram. This can help you identify parallelizable units of work and steps that need different (model/hardware) resources. If one Chainlet creates many "sub-tasks" by calling other dependency Chainlets (for example, in a loop over partial work items), these calls should be done as `asyncio` tasks that run concurrently. That way you get the most out of the parallelism that Chains offers.
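The chunk-and-fan-out pattern described above can be sketched with plain `asyncio`. The `transcribe_chunk` and `transcribe_all` names are illustrative stand-ins for a dependency Chainlet call and an entrypoint, not Chains APIs:

```python
import asyncio

async def transcribe_chunk(chunk: str) -> str:
    # Stand-in for a dependency Chainlet call, e.g.
    # `await self._transcriber.run_remote(chunk)`.
    await asyncio.sleep(0)  # Simulate I/O-bound work.
    return chunk.upper()

async def transcribe_all(audio: str, chunk_size: int = 4) -> str:
    # Split the work into chunks.
    chunks = [audio[i:i + chunk_size] for i in range(0, len(audio), chunk_size)]
    # Start all sub-tasks, then gather, so the calls run concurrently
    # instead of one after another.
    tasks = [asyncio.create_task(transcribe_chunk(c)) for c in chunks]
    results = await asyncio.gather(*tasks)
    # Aggregate the partial results.
    return "".join(results)

print(asyncio.run(transcribe_all("hello world, this is a test")))
```

The `await asyncio.gather(*tasks)` yields to the event loop, which is what lets the created tasks actually start and run concurrently.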
This design pattern is extensively used in the [audio transcription example](/examples/chains-audio-transcription).

While using `asyncio` is essential for performance, it can also be tricky. Here are a few caveats to look out for:

* Avoid executing operations in an async function that block the event loop for more than a fraction of a second. Blocking hinders the "flow" of processing requests concurrently and starting RPCs to other Chainlets. Ideally, use native async APIs: frameworks like vLLM or Triton server offer such APIs; similarly, file downloads can be made async, and you might find [`AsyncBatcher`](https://github.com/hussein-awala/async-batcher) useful. If there is no async support, consider running blocking code in a thread/process pool (held as an attribute of a Chainlet).
* Creating async tasks (for example, with `asyncio.create_task`) does not start the task *immediately*. In particular, when starting several tasks in a loop, `create_task` must be alternated with operations that yield to the event loop so the tasks can actually start. If the loop is not an `async for` and contains no other `await` statements, a "dummy" await can be added, for example `await asyncio.sleep(0)`. This allows the tasks to be started concurrently.

# Engine-Builder LLM Models

Source: https://docs.baseten.co/development/chain/engine-builder-models

Engine-Builder LLM models are pre-trained models that are optimized for specific inference tasks.

Baseten's [Engine-Builder](/engines/engine-builder-llm/overview) enables the deployment of optimized model inference engines. Currently, it supports TensorRT-LLM. Truss Chains allows seamless integration of these engines into structured workflows. This guide provides a quick entry point for Chains users.

## Llama 7B example

Use the `EngineBuilderLLMChainlet` baseclass to configure an LLM engine.
The additional `engine_builder_config` field specifies the model architecture, checkpoint repository, engine parameters, and more; the full options are detailed in the [Engine-Builder configuration guide](/engines/engine-builder-llm/engine-builder-config).

```python theme={"system"}
import truss_chains as chains
from truss.base import trt_llm_config, truss_config


class Llama7BChainlet(chains.EngineBuilderLLMChainlet):
    remote_config = chains.RemoteConfig(
        compute=chains.Compute(gpu=truss_config.Accelerator.H100),
        assets=chains.Assets(secret_keys=["hf_access_token"]),
    )

    engine_builder_config = truss_config.TRTLLMConfiguration(
        build=trt_llm_config.TrussTRTLLMBuildConfiguration(
            base_model=trt_llm_config.TrussTRTLLMModel.LLAMA,
            checkpoint_repository=trt_llm_config.CheckpointRepository(
                source=trt_llm_config.CheckpointSource.HF,
                repo="meta-llama/Llama-3.1-8B-Instruct",
            ),
            max_batch_size=8,
            max_seq_len=4096,
            tensor_parallel_count=1,
        )
    )
```

## Differences from standard Chainlets

* **No `run_remote` implementation**: Unlike regular Chainlets, `EngineBuilderLLMChainlet` doesn't require users to implement `run_remote()`. Instead, it automatically wires into the deployed engine's API. All LLM Chainlets have the same function signature: `chains.EngineBuilderLLMInput` as input and a stream (`AsyncIterator`) of strings as output. Likewise, `EngineBuilderLLMChainlet`s can only be used as dependencies, but cannot have dependencies themselves.
* **No `run_local` ([guide](/development/chain/localdev)) or `watch` ([guide](/development/chain/watch))**: Standard Chains support a local debugging mode and watch. However, when using `EngineBuilderLLMChainlet`, local execution isn't available, and testing must be done after deployment. For a faster dev loop on the rest of your chain (everything except the engine builder chainlet), you can substitute those chainlets with stubs, just as you can for an already deployed truss model ([guide](/development/chain/stub)).
## Integrate the Engine-Builder chainlet After defining an `EngineBuilderLLMChainlet` subclass like `Llama7BChainlet` above, you can use it as a dependency in other conventional Chainlets: ```python theme={"system"} from typing import AsyncIterator import truss_chains as chains @chains.mark_entrypoint class TestController(chains.ChainletBase): """Example using the Engine-Builder Chainlet in another Chainlet.""" def __init__(self, llm=chains.depends(Llama7BChainlet)) -> None: self._llm = llm async def run_remote(self, prompt: str) -> AsyncIterator[str]: messages = [{"role": "user", "content": prompt}] llm_input = chains.EngineBuilderLLMInput(messages=messages) async for chunk in self._llm.run_remote(llm_input): yield chunk ``` # Error Handling Source: https://docs.baseten.co/development/chain/errorhandling Understanding and handling Chains errors Error handling in Chains follows the principle that the root cause "bubbles up" to the entrypoint, which returns an error response. This is similar to how Python stack traces contain all the layers from where an exception was raised up to the main function. Consider the case of a Chain where the entrypoint calls `run_remote` of a Chainlet named `TextToNum`, which in turn invokes `TextReplicator`. The respective `run_remote` methods might also use other helper functions that appear in the call stack.
Below is an example stack trace that shows how the root cause (a `ValueError`) is propagated up to the entrypoint's `run_remote` method (this is what you would see as an error log): ``` Chainlet-Traceback (most recent call last): File "/packages/itest_chain.py", line 132, in run_remote value = self._accumulate_parts(text_parts.parts) File "/packages/itest_chain.py", line 144, in _accumulate_parts value += self._text_to_num.run_remote(part) ValueError: (showing chained remote errors, root error at the bottom) ├─ Error in dependency Chainlet `TextToNum`: │ Chainlet-Traceback (most recent call last): │ File "/packages/itest_chain.py", line 87, in run_remote │ generated_text = self._replicator.run_remote(data) │ ValueError: (showing chained remote errors, root error at the bottom) │ ├─ Error in dependency Chainlet `TextReplicator`: │ │ Chainlet-Traceback (most recent call last): │ │ File "/packages/itest_chain.py", line 52, in run_remote │ │ validate_data(data) │ │ File "/packages/itest_chain.py", line 36, in validate_data │ │ raise ValueError(f"This input is too long: {len(data)}.") ╰ ╰ ValueError: This input is too long: 100. ``` ## Exception handling and retries The stack trace above is what you see if you don't catch the exception. It is possible to add error handling around each remote Chainlet invocation. Chains tries to raise the same exception class on the *caller* Chainlet as was raised in the *dependency* Chainlet. * Builtin exceptions (for example, `ValueError`) always work. * Custom or third-party exceptions (for example, from `torch`) can only be raised in the caller if they are also included in the caller's dependencies. If the exception class cannot be resolved, a `GenericRemoteException` is raised instead. Note that the *message* of re-raised exceptions is the concatenation of the original message and the formatted stack trace of the dependency Chainlet.
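The caller-side handling pattern can be sketched in self-contained Python. Note this is an illustration, not the Chains runtime: `FlakyDependency` is a hypothetical stand-in for a dependency Chainlet, and `GenericRemoteException` is defined locally here to represent the fallback class described above:

```python
import asyncio


class GenericRemoteException(Exception):
    """Local stand-in for the fallback exception Chains raises when the
    original exception class cannot be resolved on the caller side."""


class FlakyDependency:
    # Hypothetical dependency whose remote call raises a builtin exception.
    async def run_remote(self, data: str) -> str:
        raise ValueError(f"This input is too long: {len(data)}.")


async def call_with_handling(dep: FlakyDependency, data: str) -> str:
    try:
        return await dep.run_remote(data)
    except ValueError as e:
        # Builtin exceptions are re-raised as the same class; in Chains,
        # the dependency's formatted stack trace is appended to the message.
        return f"handled: {e}"
    except GenericRemoteException as e:
        # Fallback when the original exception class isn't importable here.
        return f"remote failure: {e}"


print(asyncio.run(call_with_handling(FlakyDependency(), "x" * 100)))
# prints "handled: This input is too long: 100."
```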
In some cases it might make sense to simply retry a remote invocation (for example, if it failed due to transient problems like networking issues or other "flaky" parts). `depends` can be configured with additional [options](/reference/sdk/chains#truss-chains-depends) for that. The example below shows how you can add automatic retries and error handling for the call to `TextReplicator` in `TextToNum`: ```python theme={"system"} import truss_chains as chains class TextToNum(chains.ChainletBase): def __init__( self, replicator: TextReplicator = chains.depends(TextReplicator, retries=3), ) -> None: self._replicator = replicator async def run_remote(self, data: ...): try: generated_text = await self._replicator.run_remote(data) except ValueError: ... # Handle error. ``` ## Stack filtering The stack trace is intended to show the user-implemented code in `run_remote` (and user-implemented helper functions). Under the hood, the calls from one Chainlet to another go through an HTTP connection managed by the Chains framework, and each Chainlet itself runs as a FastAPI server with several layers of request-handling code "above". To provide concise, readable stacks, all of this non-user code is filtered out. # Your first Chain Source: https://docs.baseten.co/development/chain/getting-started Build and deploy two example Chains This quickstart guide contains instructions for creating two Chains: 1. A simple CPU-only “hello world” Chain. 2. A Chain that implements Phi-3 Mini and uses it to write poems. ## Prerequisites You need [uv](https://docs.astral.sh/uv/) installed and a [Baseten account](https://app.baseten.co/signup) with an [API key](https://app.baseten.co/settings/account/api_keys). ## Example: Hello World Chains are written in Python files. In your working directory, create `hello_chain/hello.py`: ```sh theme={"system"} mkdir hello_chain cd hello_chain touch hello.py ``` In the file, we'll specify a basic Chain.
It has two Chainlets: * `HelloWorld`, the entrypoint, which handles the input and output. * `RandInt`, which generates a random integer. It is used as a dependency by `HelloWorld`. Via the entrypoint, the Chain takes a maximum value and returns the string "Hello World! " repeated a variable number of times. ```python hello.py theme={"system"} import random import truss_chains as chains class RandInt(chains.ChainletBase): async def run_remote(self, max_value: int) -> int: return random.randint(1, max_value) @chains.mark_entrypoint class HelloWorld(chains.ChainletBase): def __init__(self, rand_int=chains.depends(RandInt, retries=3)) -> None: self._rand_int = rand_int async def run_remote(self, max_value: int) -> str: num_repetitions = await self._rand_int.run_remote(max_value) return "Hello World! " * num_repetitions ``` ### The Chainlet class-contract Exactly one Chainlet must be marked as the entrypoint with the [`@chains.mark_entrypoint`](/reference/sdk/chains#truss-chains-mark-entrypoint) decorator. This Chainlet is responsible for handling public-facing input and output for the whole Chain in response to an API call. A Chainlet class has a single public method, [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets), which is the API endpoint for the entrypoint Chainlet and the function that other Chainlets can use as a dependency. The [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets) method must be fully type-annotated with primitive Python types or [pydantic models](https://docs.pydantic.dev/latest/). Chainlets cannot be naively instantiated. The only correct usages are: 1. Make one Chainlet depend on another one via the [`chains.depends()`](/reference/sdk/chains#truss-chains-depends) directive as an `__init__`-argument, as shown above for the `RandInt` Chainlet. 2. In the [local debugging mode](/development/chain/localdev#test-a-chain-locally).
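The annotation requirement is what lets a framework derive the public JSON schema of the endpoint from the function signature. A minimal sketch of that idea, using a plain class and the standard library's `typing.get_type_hints` rather than the actual Chains internals:

```python
import asyncio
import typing


class WordCounter:
    # A plain class standing in for a `chains.ChainletBase` subclass;
    # the fully annotated signature is what the class-contract requires.
    async def run_remote(self, text: str, repeat: int = 1) -> int:
        return len(text.split()) * repeat


# The request schema ({"text": ..., "repeat": ...}) and response type
# can be recovered from the annotations alone.
hints = typing.get_type_hints(WordCounter.run_remote)
print(hints)  # {'text': <class 'str'>, 'repeat': <class 'int'>, 'return': <class 'int'>}
print(asyncio.run(WordCounter().run_remote("hello world", repeat=2)))  # 4
```

An unannotated argument would leave a hole in this schema, which is why Chains validates signatures at module initialization.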
Beyond that, you can structure your code as you like, with private methods, imports from other files, and so forth. Keep in mind that Chainlets are intended for distributed, replicated, remote execution. Global variables, global state, and certain Python features, such as importing modules dynamically at runtime, should therefore be avoided, as they may not work as intended. ### Deploy your Chain to Baseten To deploy your Chain to Baseten, run: ```bash theme={"system"} truss chains push --watch hello.py ``` The deploy command results in an output like this: ``` ⛓️ HelloWorld - Chainlets ⛓️ ╭──────────────────────┬─────────────────────────┬─────────────╮ │ Status │ Name │ Logs URL │ ├──────────────────────┼─────────────────────────┼─────────────┤ │ 💚 ACTIVE │ HelloWorld (entrypoint) │ https://... │ ├──────────────────────┼─────────────────────────┼─────────────┤ │ 💚 ACTIVE │ RandInt (dep) │ https://... │ ╰──────────────────────┴─────────────────────────┴─────────────╯ Deployment succeeded. You can run the chain with: curl -X POST 'https://chain-.../run_remote' \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -d '' ``` Wait for the status to turn to `ACTIVE` and test invoking your Chain (replace `$INVOCATION_URL` in the command below): ```bash theme={"system"} curl -X POST $INVOCATION_URL \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -d '{"max_value": 10}' # "Hello World! Hello World! Hello World! " ``` ## Example: Poetry with LLMs Our second example also has two Chainlets, but is somewhat more complex and realistic. The Chainlets are: * `PoemGenerator`, the entrypoint, which handles the input and output and orchestrates calls to the LLM. * `PhiLLM`, which runs inference on Phi-3 Mini. This Chain takes a list of words and returns a poem about each word, written by Phi-3.
We build this Chain in a new working directory (if you are still inside `hello_chain/`, go up one level with `cd ..` first): ```sh theme={"system"} mkdir poetry_chain cd poetry_chain touch poems.py ``` A similar end-to-end code example, using Mistral as an LLM, is available in the [examples repo](https://github.com/basetenlabs/model/tree/main/truss-chains/examples/mistral). ### Building the LLM Chainlet The main difference between this Chain and the previous one is that we now have an LLM that needs a GPU and more complex dependencies. Copy the following code into `poems.py`: ```python poems.py theme={"system"} import asyncio from typing import List import pydantic import truss_chains as chains from truss import truss_config PHI_HF_MODEL = "microsoft/Phi-3-mini-4k-instruct" # This configures to cache model weights from the Hugging Face repo # in the Docker image used for deploying the Chainlet. PHI_CACHE = truss_config.ModelRepo( repo_id=PHI_HF_MODEL, allow_patterns=["*.json", "*.safetensors", ".model"], use_volume=False, ) class Messages(pydantic.BaseModel): messages: List[dict[str, str]] class PhiLLM(chains.ChainletBase): # `remote_config` defines the resources required for this chainlet. remote_config = chains.RemoteConfig( docker_image=chains.DockerImage( # The phi model needs some extra python packages. pip_requirements=[ "accelerate==0.30.1", "einops==0.8.0", "transformers==4.41.2", "torch==2.3.0", ] ), # The phi model needs a GPU and more CPUs. compute=chains.Compute(cpu_count=2, gpu="T4"), # Cache the model weights in the image assets=chains.Assets(cached=[PHI_CACHE]), ) def __init__(self) -> None: # Note the imports of the *specific* python requirements are # pushed down to here. This code will only be executed on the # remotely deployed Chainlet, not in the local environment, # so we don't need to install these packages in the local # dev environment.
import torch import transformers self._model = transformers.AutoModelForCausalLM.from_pretrained( PHI_HF_MODEL, torch_dtype=torch.float16, device_map="auto", ) self._tokenizer = transformers.AutoTokenizer.from_pretrained( PHI_HF_MODEL, ) self._generate_args = { "max_new_tokens" : 512, "temperature" : 1.0, "top_p" : 0.95, "top_k" : 50, "repetition_penalty" : 1.0, "no_repeat_ngram_size": 0, "use_cache" : True, "do_sample" : True, "eos_token_id" : self._tokenizer.eos_token_id, "pad_token_id" : self._tokenizer.pad_token_id, } async def run_remote(self, messages: Messages) -> str: import torch model_inputs = self._tokenizer.apply_chat_template( messages.messages, tokenize=False, add_generation_prompt=True ) inputs = self._tokenizer(model_inputs, return_tensors="pt") input_ids = inputs["input_ids"].to("cuda") with torch.no_grad(): outputs = self._model.generate( input_ids=input_ids, **self._generate_args) output_text = self._tokenizer.decode( outputs[0], skip_special_tokens=True) return output_text ``` ### Building the entrypoint Now that we have an LLM, we can use it in a poem generator Chainlet. Add the following code to `poems.py`: ```python poems.py theme={"system"} import asyncio @chains.mark_entrypoint class PoemGenerator(chains.ChainletBase): def __init__(self, phi_llm: PhiLLM = chains.depends(PhiLLM)) -> None: self._phi_llm = phi_llm async def run_remote(self, words: list[str]) -> list[str]: tasks = [] for word in words: messages = Messages( messages=[ { "role" : "system", "content": ( "You are a poet who writes short, " "lighthearted, amusing poetry." ), }, {"role": "user", "content": f"Write a poem about {word}"}, ] ) tasks.append( asyncio.create_task(self._phi_llm.run_remote(messages))) await asyncio.sleep(0) # Yield to event loop, to allow starting tasks. return list(await asyncio.gather(*tasks)) ``` Note that we use `asyncio.create_task` around each RPC to the LLM chainlet.
This makes the current Python process start these remote calls concurrently; that is, the next call is started before the previous one has finished, which minimizes overall runtime. To await the results of all calls, `asyncio.gather` is used, which gives us back normal Python objects. If the LLM is hit with many concurrent requests, it can scale up automatically (if autoscaling is configured). More advanced LLM models have batching capabilities, so for those even a single instance can serve concurrent requests. ### Deploy your Chain to Baseten To deploy your Chain to Baseten, run: ```bash theme={"system"} truss chains push --watch poems.py ``` Wait for the status to turn to `ACTIVE` and test invoking your Chain (replace `$INVOCATION_URL` in the command below): ```bash theme={"system"} curl -X POST $INVOCATION_URL \ -H "Authorization: Api-Key $BASETEN_API_KEY" \ -d '{"words": ["bird", "plane", "superman"]}' #[[ #" [INST] Generate a poem about: bird [/INST] In the quiet hush of...", #" [INST] Generate a poem about: plane [/INST] In the vast, boundless...", #" [INST] Generate a poem about: superman [/INST] In the realm where..." #]] ``` # Invocation Source: https://docs.baseten.co/development/chain/invocation Call your deployed Chain Once your Chain is deployed, you can call it via its API endpoint. Chains use the same inference API as models: * [Environment endpoint](/reference/inference-api/predict-endpoints/environments-run-remote) * [Development endpoint](/reference/inference-api/predict-endpoints/development-run-remote) * [Endpoint by ID](/reference/inference-api/predict-endpoints/deployment-run-remote) Here's an example that calls the development deployment: ```python call_chain.py theme={"system"} import requests import os # From the Chain overview page on Baseten # E.g. "https://chain-.api.baseten.co/development/run_remote" CHAIN_URL = "" baseten_api_key = os.environ["BASETEN_API_KEY"] # JSON keys and types match the `run_remote` method signature.
data = {...} resp = requests.post( CHAIN_URL, headers={"Authorization": f"Api-Key {baseten_api_key}"}, json=data, ) print(resp.json()) ``` ### How to pass chain input The data schema of the inference request corresponds to the function signature of [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets) in your entrypoint Chainlet. For example, for the Hello Chain, `HelloAll.run_remote()`: ```python theme={"system"} async def run_remote(self, names: list[str]) -> str: ``` You'd pass the following JSON payload: ```json theme={"system"} { "names": ["Marius", "Sid", "Bola"] } ``` That is, the keys in the JSON record match the argument names, and the values match the types, of `run_remote()`. ### Async chain inference Like Truss models, Chains support async invocation. The [guide for models](/inference/async) largely applies; in particular, it covers how to wrap the input and set up the webhook to process results. The following additional points are Chains-specific: * Use chain-based URLs: * `https://chain-{chain}.api.baseten.co/production/async_run_remote` * `https://chain-{chain}.api.baseten.co/development/async_run_remote` * `https://chain-{chain}.api.baseten.co/deployment/{deployment}/async_run_remote` * `https://chain-{chain}.api.baseten.co/environments/{env_name}/async_run_remote` * Only the entrypoint is invoked asynchronously. Internal Chainlet-Chainlet calls run synchronously. # Local Development Source: https://docs.baseten.co/development/chain/localdev Iterating, Debugging, Testing, Mocking Chains are designed for production in replicated remote deployments. But alongside that production-ready power, we offer great local development and deployment experiences. Chains exists to help you build multi-step, multi-model pipelines. The abstractions that Chains introduces are based on six opinionated principles: three for architecture and three for developer experience.
**Architecture principles** Each step in the pipeline can set its own hardware requirements and software dependencies, separating GPU and CPU workloads. Each component has independent autoscaling parameters for targeted resource allocation, removing bottlenecks from your pipelines. Components specify a single public interface for flexible-but-safe composition and are reusable between projects. **Developer experience principles** Eliminate entire taxonomies of bugs by writing typed Python code and validating inputs, outputs, module initializations, function signatures, and even remote server configurations. Seamless local testing and cloud deployments: test Chains locally with support for mocking the output of any step, and simplify your cloud deployment loops by separating large model deployments from quick updates to glue code. Use Chains to orchestrate existing model deployments, like pre-packaged models from Baseten’s model library, alongside new model pipelines built entirely within Chains. Locally, a Chain is just Python files in a source tree. While that gives you a lot of flexibility in how you structure your code, there are some constraints and rules to follow to ensure successful distributed, remote execution in production. The best thing you can do while developing locally with Chains is to run your code frequently, even if you do not have a `__main__` section: the Chains framework runs various validations at module initialization to help you catch issues early. Additionally, running `mypy` and fixing reported type errors can help you find problems early, in a rapid feedback loop, before attempting a (much slower) deployment. Complementary to purely local development, Chains also has a "watch" mode, like Truss; see the [watch guide](/development/chain/watch).
## Test a Chain locally Let's revisit our "Hello World" Chain: ```python hello_chain/hello.py theme={"system"} import asyncio import truss_chains as chains # This Chainlet does the work class SayHello(chains.ChainletBase): async def run_remote(self, name: str) -> str: return f"Hello, {name}" # This Chainlet orchestrates the work @chains.mark_entrypoint class HelloAll(chains.ChainletBase): def __init__(self, say_hello_chainlet=chains.depends(SayHello)) -> None: self._say_hello = say_hello_chainlet async def run_remote(self, names: list[str]) -> str: tasks = [] for name in names: tasks.append(asyncio.create_task( self._say_hello.run_remote(name))) return "\n".join(await asyncio.gather(*tasks)) # Test the Chain locally if __name__ == "__main__": with chains.run_local(): hello_chain = HelloAll() result = asyncio.run(hello_chain.run_remote(["Marius", "Sid", "Bola"])) print(result) ``` When the `__main__()` module is run, local instances of the Chainlets are created, allowing you to test the functionality of your Chain just by executing the Python file: ```bash theme={"system"} cd hello_chain python hello.py # Hello, Marius # Hello, Sid # Hello, Bola ``` ## Mock execution of GPU Chainlets Using `run_local()` to run your code locally requires that your development environment have the compute resources and dependencies that each Chainlet needs. But that often isn't possible when building with AI models. Chains offers a workaround, mocking, to let you test the coordination and business logic of your multi-step inference pipeline without worrying about running the model locally. The second example in the [getting started guide](/development/chain/getting-started) implements a Truss Chain for generating poems with Phi-3. This Chain has two Chainlets: 1. The `PhiLLM` Chainlet, which needs an NVIDIA GPU such as the T4. 2. The `PoemGenerator` Chainlet, which easily runs on a CPU. If you have an NVIDIA T4 under your desk, good for you.
For the rest of us, we can mock the `PhiLLM` Chainlet that is infeasible to run locally so that we can quickly test the `PoemGenerator` Chainlet. To do this, we define a mock Phi-3 model in our `__main__` module and give it a [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets) method that produces a test output that matches the output type we expect from the real Chainlet. Then, we inject an instance of this mock Chainlet into our Chain: ```python poems.py theme={"system"} if __name__ == "__main__": class FakePhiLLM: async def run_remote(self, prompt: str) -> str: return f"Here's a poem about {prompt.split(' ')[-1]}" with chains.run_local(): poem_generator = PoemGenerator(phi_llm=FakePhiLLM()) result = asyncio.run(poem_generator.run_remote(words=["bird", "plane", "superman"])) print(result) ``` And run your Python file: ```bash theme={"system"} python poems.py # ['Here's a poem about bird', 'Here's a poem about plane', 'Here's a poem about superman'] ``` ### Typing of mocks You may notice that the argument `phi_llm` expects a type `PhiLLM`, while we pass an instance of `FakePhiLLM`. These aren't the same, which is formally a type error. However, this works at runtime because we constructed `FakePhiLLM` to implement the same *protocol* as the real thing. We can make this explicit by defining a `Protocol` as a type annotation: ```python theme={"system"} from typing import Protocol class PhiProtocol(Protocol): async def run_remote(self, data: str) -> str: ... ``` and changing the argument type in `PoemGenerator`: ```python theme={"system"} @chains.mark_entrypoint class PoemGenerator(chains.ChainletBase): def __init__(self, phi_llm: PhiProtocol = chains.depends(PhiLLM)) -> None: self._phi_llm = phi_llm ``` This is a bit more work and not needed to execute the code, but it shows how typing consistency can be achieved if desired.
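The structural-typing mechanism behind this can be demonstrated independently of Chains with plain `typing.Protocol`. The names in this self-contained sketch (`LLMProtocol`, `FakeLLM`) are hypothetical:

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMProtocol(Protocol):
    async def run_remote(self, data: str) -> str: ...


class FakeLLM:
    # No inheritance needed: matching the method shape is enough
    # for static type checkers like mypy.
    async def run_remote(self, data: str) -> str:
        return f"fake reply to: {data}"


# `runtime_checkable` additionally allows an isinstance check; note it
# only verifies that the method exists, not its signature.
print(isinstance(FakeLLM(), LLMProtocol))  # True
print(asyncio.run(FakeLLM().run_remote("hi")))  # fake reply to: hi
```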
# Overview Source: https://docs.baseten.co/development/chain/overview Chains is a framework for building robust, performant multi-step and multi-model inference pipelines and deploying them to production. It addresses the common challenges of managing latency, cost and dependencies for complex workflows, while leveraging Truss’ existing battle-tested performance, reliability and developer toolkit.