# AI tools
Source: https://docs.baseten.co/ai-tools
Connect AI tools to Baseten documentation for context-aware assistance with deploying and serving models.
Baseten docs are optimized for AI tools. Connect your assistants, coding tools, and agents directly to the docs so they have up-to-date context when helping you build on Baseten.
Every page includes a contextual menu (the icon in the top-right corner of any page) with shortcuts to copy content and connect your MCP server.
## MCP server
The Model Context Protocol (MCP) connects AI tools directly to Baseten documentation. When connected, your AI tool searches the docs in real time while generating responses, so you get answers grounded in current documentation rather than stale training data.
The Baseten docs MCP server is available at:
```
https://docs.baseten.co/mcp
```
### Claude Code
Add the MCP server to Claude Code:
```bash theme={"system"}
claude mcp add --transport http baseten-docs https://docs.baseten.co/mcp
```
Claude Code searches Baseten docs automatically when relevant to your prompts.
### Claude
Navigate to the **Connectors** page in Claude settings.
Select **Add custom connector**, then enter:
* **Name:** Baseten Docs
* **URL:** `https://docs.baseten.co/mcp`
Select **Add**.
When starting a conversation, select the attachments button (the plus icon) and choose the Baseten Docs connector. Claude searches the docs as needed while responding.
### Cursor
Use `Cmd + Shift + P` (macOS) or `Ctrl + Shift + P` (Windows/Linux) to open the command palette. Search for **"Open MCP settings"**.
Select **Add custom MCP**. This opens your `mcp.json` file. Add the Baseten docs server:
```json mcp.json theme={"system"}
{
"mcpServers": {
"baseten-docs": {
"type": "http",
"url": "https://docs.baseten.co/mcp"
}
}
}
```
### VS Code
Create or update `.vscode/mcp.json` in your project:
```json .vscode/mcp.json theme={"system"}
{
"servers": {
"baseten-docs": {
"type": "http",
"url": "https://docs.baseten.co/mcp"
}
}
}
```
### Other MCP clients
Any MCP-compatible tool (Goose, ChatGPT, Windsurf, and others) can connect using the server URL `https://docs.baseten.co/mcp`. Refer to your tool's documentation for how to add an MCP server.
You can also use `npx add-mcp` to auto-detect supported AI tools on your system and configure them:
```bash theme={"system"}
npx add-mcp https://docs.baseten.co
```
## Skills
The skills file describes what AI agents can accomplish with Baseten, including required inputs and constraints. AI coding tools use this file to understand Baseten capabilities without reading every documentation page.
Install the Baseten docs skill into your AI coding tool:
```bash theme={"system"}
npx skills add https://docs.baseten.co
```
This gives your AI tool structured knowledge of Baseten's capabilities so it can help you deploy models, configure autoscaling, set up inference endpoints, and more with product-aware guidance.
View the skill file directly at [docs.baseten.co/skill.md](https://docs.baseten.co/skill.md).
Skills and MCP serve complementary purposes. **Skills** tell an AI tool *what Baseten can do* and how to do it. **MCP** lets the tool *search current documentation* for specific details. For the best results, install both.
## llms.txt
The `llms.txt` file is an industry-standard directory that helps LLMs index documentation efficiently, similar to how `sitemap.xml` helps search engines. Baseten docs automatically host two versions:
* [docs.baseten.co/llms.txt](https://docs.baseten.co/llms.txt): a structured list of all pages with descriptions.
* [docs.baseten.co/llms-full.txt](https://docs.baseten.co/llms-full.txt): the full text content of all pages.
These files stay up to date automatically and require no configuration.
AI assistants and answer engines such as ChatGPT, Perplexity, and Google AI Overviews use these files to understand and cite Baseten documentation.
## Markdown access
Every documentation page is available as Markdown by appending `.md` to the URL. For example:
```
https://docs.baseten.co/quickstart.md
```
AI agents receive page content as Markdown instead of HTML, which reduces token usage and improves processing speed. You can use this to quickly copy any page's content into an AI conversation.
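As a minimal sketch of this convention (the helper function is ours, not part of any Baseten SDK):

```python
def markdown_url(page_url: str) -> str:
    """Return the raw-Markdown variant of a docs page URL."""
    return page_url.rstrip("/") + ".md"

print(markdown_url("https://docs.baseten.co/quickstart"))
# → https://docs.baseten.co/quickstart.md
```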
## Contextual menu reference
The contextual menu on each page provides one-click access to these integrations. Select the menu icon in the top-right corner of any page.
| Option | Description |
| ------------------- | --------------------------------------------------------- |
| Copy page | Copies the page as Markdown for pasting into any AI tool. |
| View as Markdown | Opens the page as raw Markdown in a new tab. |
| Copy MCP server URL | Copies the MCP server URL to your clipboard. |
| Connect to Cursor | Installs the MCP server in Cursor. |
| Connect to VS Code | Installs the MCP server in VS Code. |
# How Baseten works
Source: https://docs.baseten.co/concepts/howbasetenworks
Follow a model from truss push to a running endpoint: the build pipeline, request routing, autoscaling, and deployment lifecycle.
The [overview](/overview) covers Baseten's capabilities. This page covers the underlying mechanics: how a config file becomes a running endpoint, how Baseten routes requests to your model, how the autoscaler manages capacity, and how you promote a model from development to production.
## Multi-cloud Capacity Management (MCM)
Behind every Baseten deployment is our Multi-cloud Capacity Management (MCM) system. MCM acts as the infrastructure control plane, unifying thousands of GPUs across 10+ cloud service providers and multiple geographic regions.
When you request a resource (an H100 in US-East-1 or a cluster of B200s in a private region), MCM provisions the hardware, configures networking, and monitors health. It abstracts differences between cloud providers to ensure the Baseten Inference Stack runs identically on any underlying infrastructure.
This system powers Baseten's high availability by enabling active-active deployments across different clouds. If a region or provider faces a capacity crunch or outage, MCM rapidly re-routes and re-provisions workloads to maintain service continuity.
## The build pipeline
When you run `truss push`, the CLI validates your `config.yaml`, archives your project directory, and uploads it to cloud storage. Baseten receives the archive and starts the build.
For [Engine-Builder-LLM](/engines/engine-builder-llm/overview), Baseten downloads model weights from the source repository (Hugging Face, S3, or GCS) and compiles them with TensorRT-LLM. The compilation step builds optimized CUDA kernels for the target GPU architecture, applies quantization (FP8, FP4) if configured, and sets up tensor parallelism across multiple GPUs.
Baseten packages the compiled engine, runtime configuration, and serving infrastructure into a container, deploys it to GPU infrastructure, and exposes it as an API endpoint.
The `truss push` command returns once the upload finishes. For engine-based deployments, compilation can take several minutes. Watch progress in the deployment logs or check the dashboard, which shows "Active" when the endpoint is ready for requests.
For [custom model code](/development/model/custom-model-code) deployments, the build is faster: Baseten installs your Python dependencies, packages your `Model` class into a container, and deploys it. You remain responsible for any inference optimization in custom builds.
## Request routing
Each deployment gets a dedicated subdomain: `https://model-{model_id}.api.baseten.co/`. The URL path determines which deployment handles the request. Requests to `/production/predict` go to the production environment, while `/development/predict` goes to the development deployment. You can also target a specific deployment by ID or a custom environment by name.
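The routing convention can be sketched as a small helper (the function is ours; the URL pattern follows the description above, with `target` as an environment name):

```python
def predict_url(model_id: str, target: str = "production") -> str:
    """Build the predict endpoint URL for an environment
    ("production", "development", or a custom environment name)."""
    return f"https://model-{model_id}.api.baseten.co/{target}/predict"

print(predict_url("abc123"))
print(predict_url("abc123", target="development"))
```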
Once the environment is resolved, the load balancer routes the request to an active replica. If the model has scaled to zero, Baseten spins up a replica and queues the request until the model loads and becomes ready. The caller receives the response regardless of whether the model was warm or cold.
Engine-based deployments serve an [OpenAI-compatible API](/reference/inference-api/chat-completions) at the `/v1/chat/completions` path, so any code written for the OpenAI SDK works without modification. Custom model deployments use the [predict API](/reference/inference-api/overview), which accepts and returns arbitrary JSON.
For long-running workloads, [async requests](/inference/async) return a request ID immediately. The request enters a queue managed by an async request service. A background worker then calls your model and delivers the result via webhook. Sync requests take priority over async requests when competing for concurrency slots to prevent background work from starving real-time traffic.
## Autoscaling
Baseten's autoscaler watches in-flight request counts and adjusts replicas to maintain each one near its [concurrency target](/deployment/autoscaling/overview).
Scaling up is immediate. When average utilization crosses the target threshold (default 70%) within the autoscaling window (default 60 seconds), the autoscaler adds replicas up to the configured maximum.
Scaling down is deliberately slow. When traffic drops, the autoscaler flags excess replicas for removal but keeps them alive for a configurable delay (default 900 seconds). It uses exponential backoff: removing half the excess replicas, waiting, and then removing half again. This prevents the cluster from thrashing during bursty traffic.
Setting `min_replica` to 0 enables scale-to-zero. The model stops incurring GPU cost when idle, but the next request triggers a cold start. Setting `min_replica` to 1 or higher keeps warm capacity ready at all times, trading cost for lower latency.
## Cold starts and the weight delivery network
The slowest part of a cold start is loading model weights, which can reach hundreds of gigabytes. Baseten addresses this with the [Baseten Delivery Network (BDN)](/development/model/bdn), a multi-tier caching system for model weights.
When you first deploy, BDN mirrors your model weights from the source repository to Baseten's own blob storage. After that, no cold start depends on an upstream service like Hugging Face or S3. When a new replica starts, the BDN agent on the node fetches a manifest for the weights, downloads them through an in-cluster cache (shared across all pods in the cluster), and stores them in a node-level cache (shared across all replicas on the same node). Identical files across different models are deduplicated, so a GLM fine-tune that shares most weights with the base model only downloads the delta.
Subsequent cold starts on the same node or in the same cluster are significantly faster than the first. Container images use streaming, so the model begins loading weights before the image download completes.
## Environments and promotion
Every model starts with a development deployment: a single replica with scale-to-zero enabled and live reload for fast iteration. When the model is ready for production traffic, promote it to an environment.
The [production environment](/deployment/environments) exists by default. You can create additional environments (staging, shadow, or canary) for testing and gradual rollouts. Each environment has a stable endpoint URL, its own autoscaling settings, and dedicated metrics. The endpoint URL remains constant when you promote new deployments, so your application code doesn't need to change.
Promotion replaces the current deployment in an environment with the new one. The new deployment inherits the environment's autoscaling settings. Baseten demotes the previous deployment and scales it to zero, allowing you to roll back by re-promoting it. You can also push directly to an environment with `truss push --environment staging` to skip the development stage.
Only one promotion can be active per environment at a time to prevent conflicting updates.
# Why Baseten
Source: https://docs.baseten.co/concepts/whybaseten
Mission-critical inference with dedicated infrastructure, global scale, and full control.
Baseten provides high-performance inference for teams that have outgrown shared API endpoints. We deliver the performance of custom-built infrastructure with the ease of a managed platform, allowing you to deploy and scale any model behind a production-grade API.
## Mission-critical inference
Inference is the core of your application. When it fails, your product stops working. We built Baseten to handle mission-critical workloads, offering 99.99% uptime and low-latency performance at any scale.
Operating thousands of GPUs across multiple regions and cloud providers exposes the limits of traditional deployment. Single points of failure, regional capacity constraints, and the overhead of managing heterogeneous clouds create significant operational risk. We solved these problems with our Multi-cloud Capacity Management (MCM) system.
## Multi-cloud Capacity Management (MCM)
MCM is a unified control layer that provisions and scales resources across 10+ clouds and regions. It handles the complexity of cloud-agnostic orchestration, giving you a single pane of glass for your entire inference fleet.
Whether you run in our cloud, yours, or both, the experience is identical. MCM enables three deployment modes, all sharing the same high-performance inference stack:
### Baseten Cloud
Fully managed, multi-cloud inference. This is the fastest path to production, offering limitless scale and global latency optimization. We manage the infrastructure so you can focus on your models.
### Baseten Self-hosted
The full Baseten stack inside your own VPC. Use this when you have strict data security, privacy, or sovereignty requirements. You maintain complete control over your data and networking while benefiting from Baseten’s autoscaling and performance optimizations.
### Baseten Hybrid
The best of both worlds. Run core workloads in your VPC for maximum control and burst to Baseten Cloud on demand. This approach eliminates the trade-off between strict compliance and the need for elastic flex capacity.
## The Baseten advantage
ML teams at Abridge, Writer, and Patreon use Baseten to serve millions of users. Our platform is built on four pillars that ensure your success in production:
* **Model performance:** Our engineers apply the latest research in custom kernels and runtimes, delivering low latency and high throughput out of the box.
* **Reliable infrastructure:** Deploy across clusters and clouds with active-active reliability and built-in redundancy.
* **Operational control:** Use deep observability, secret management, and fine-grained autoscaling to maintain your SLAs.
* **Compliance by design:** SOC 2 Type II, HIPAA, and GDPR compliance ensure that your deployments meet the highest standards for data security.
## Comparison of deployment options
| Feature | Baseten Cloud | Self-hosted | Hybrid |
| :----------------- | :--------------------- | :----------------- | :----------------------- |
| **Scaling** | Unlimited, multi-cloud | Within your VPC | VPC with Cloud spillover |
| **Data Residency** | Region-locked options | Full local control | Local with Cloud options |
| **Compliance** | SOC 2, HIPAA, GDPR | Your compliance | Hybrid compliance |
| **Time to Market** | Hours | Days | Days |
Baseten gives you the visibility and control of your own infrastructure without the operational burden. Whether you're deploying a single LLM or an entire library of models, you can start with a managed solution and transition to self-hosted or hybrid modes as your requirements evolve.
# Cold starts
Source: https://docs.baseten.co/deployment/autoscaling/cold-starts
Understand cold starts and how to minimize their impact on your deployments.
A *cold start* is the time required to initialize a new replica when scaling
up. Cold starts affect the latency of requests that trigger new replica
creation.
***
## When cold starts happen
Cold starts occur in two scenarios:
1. **Scale-from-zero**: When a deployment with zero active replicas receives its first request.
2. **Scaling events**: When traffic increases and the autoscaler adds new replicas.
***
## What contributes to cold start time
Cold start duration depends on several factors:
| Factor | Impact |
| -------------- | ---------------------------------------------------------------------- |
| Model loading | Loading model weights (10s–100s of GBs), typically the dominant factor |
| Container pull | Downloading Docker image layers |
| Initialization | Running your model's setup code |
For large models, cold starts can take minutes. Model weight downloads are usually the bottleneck. Even with optimizations, the physics of moving hundreds of gigabytes of data creates inherent lag.
***
## Minimizing cold starts
### Keep replicas warm
Set [`min_replica`](/reference/management-api/deployments/autoscaling/updates-a-deployments-autoscaling-settings) to always have at least one replica ready to serve requests. This eliminates cold starts for the first request but increases cost.
```json theme={"system"}
{
"min_replica": 1
}
```
For production redundancy, set `min_replica ≥ 2` so one replica can fail during maintenance without causing cold starts.
### Pre-warm before expected traffic
For predictable traffic spikes, increase min replicas before the expected load:
```bash theme={"system"}
# 10-15 minutes before expected spike
curl -X PATCH \
https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \
-H "Authorization: Api-Key $BASETEN_API_KEY" \
-d '{"min_replica": 5}'
```
After traffic stabilizes, reset to your normal minimum.
### Use longer scale-down delay
A longer scale-down delay keeps replicas warm during temporary traffic dips:
```json theme={"system"}
{
"scale_down_delay": 900
}
```
This prevents cold starts when traffic returns within the delay window.
***
## Platform optimizations
Baseten automatically applies several optimizations to reduce cold start times:
**Baseten Delivery Network (Recommended)**: The [`weights`](/development/model/bdn) configuration optimizes cold starts by mirroring weights to Baseten's infrastructure and caching them close to your model pods. See [Baseten Delivery Network (BDN)](/development/model/bdn) for full configuration options.
**Network accelerator (Legacy)**: Parallelized byte-range downloads speed up model loading from Hugging Face, S3, GCS, and R2. Network acceleration is deprecated in favor of the `weights` configuration, which provides superior cold start performance through multi-tier caching.
**Image streaming**: Optimized images stream into nodes, allowing model loading to begin before the full download completes:
```
Successfully pulled streaming-enabled image in 15.851s. Image size: 32 GB.
```
These optimizations are applied automatically.
***
## The tradeoff
Cold starts create a fundamental tradeoff between **cost** and **latency**:
| Approach | Cost | Latency |
| -------------------------------- | ----------------------------- | ------------------------------------------ |
| Scale to zero (`min_replica: 0`) | Lower: no cost when idle | Higher: first request waits for cold start |
| Always on (`min_replica: ≥1`) | Higher: pay for idle replicas | Lower: no cold starts |
For latency-sensitive production workloads, the cost of keeping replicas warm is usually justified. For batch workloads or development, scale-to-zero often makes sense.
***
## Next steps
* [Autoscaling](/deployment/autoscaling/overview): Configure min replicas and scale-down delay.
* [Traffic patterns](/deployment/autoscaling/traffic-patterns): Pre-warming strategies for different traffic types.
* [Troubleshooting](/troubleshooting/deployments#autoscaling-issues): Diagnose cold start issues.
# Autoscaling
Source: https://docs.baseten.co/deployment/autoscaling/overview
Configure autoscaling to dynamically adjust replicas based on traffic while minimizing idle compute costs.
Autoscaling is a control loop that adjusts the number of **replicas** backing a
deployment based on demand. The goal is to balance **performance** (latency and
throughput) against **cost** (GPU hours). Autoscaling is reactive by nature.
Baseten provides default settings that work for most workloads.
Tune your autoscaling settings based on your model and traffic.
| Parameter | Default | Range | What it controls |
| ------------------ | ------- | -------- | ---------------------------------------- |
| Min replicas | 0 | ≥ 0 | Baseline capacity (0 = scale to zero). |
| Max replicas | 1 | ≥ 1 | Cost/capacity ceiling. |
| Autoscaling window | 60s | 10–3600s | Time window for traffic analysis. |
| Scale-down delay | 900s | 0–3600s | Wait time before removing idle replicas. |
| Concurrency target | 1 | ≥ 1 | Requests per replica before scaling. |
| Target utilization | 70% | 1–100% | Headroom before scaling triggers. |
Configure autoscaling settings through the Baseten UI or API:
1. Select your deployment.
2. Under **Replicas** for your production environment, choose **Configure**.
3. Configure the autoscaling settings and choose **Update**.
```bash theme={"system"}
curl -X PATCH \
https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \
-H "Authorization: Api-Key $BASETEN_API_KEY" \
-d '{
"min_replica": 2,
"max_replica": 10,
"concurrency_target": 32,
"target_utilization_percentage": 70,
"autoscaling_window": 60,
"scale_down_delay": 900
}'
```
For more information, see the [API reference](/reference/management-api/deployments/autoscaling/updates-a-deployments-autoscaling-settings).
```python theme={"system"}
import requests
import os
API_KEY = os.environ.get("BASETEN_API_KEY")
response = requests.patch(
"https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings",
headers={"Authorization": f"Api-Key {API_KEY}"},
json={
"min_replica": 2,
"max_replica": 10,
"concurrency_target": 32,
"target_utilization_percentage": 70,
"autoscaling_window": 60,
"scale_down_delay": 900
}
)
print(response.json())
```
For more information, see the [API reference](/reference/management-api/deployments/autoscaling/updates-a-deployments-autoscaling-settings).
***
## How autoscaling works
When the **average requests per active replica** exceeds the **concurrency target × target utilization** within the **autoscaling window**, more replicas are created until:
* The concurrency target is met.
* The maximum replica count is reached.
When traffic drops below the concurrency target, excess replicas are flagged for removal.
The **scale-down delay** ensures replicas are not removed prematurely:
* If traffic returns before the delay ends, replicas remain active.
* Scale-down uses exponential back-off: cut half the excess replicas, wait, then cut half again.
* Scaling stops when the minimum replica count is reached.
***
## Replicas
Replicas are individual instances of your model, each capable of serving requests independently. The autoscaler adjusts the number of replicas based on traffic, but you control the boundaries with minimum and maximum replica settings.
### Minimum replicas
The floor for your deployment's capacity. The autoscaler won't scale below this number.
**Range:** ≥ 0
The default of 0 enables *scale-to-zero*: your deployment costs nothing when idle, but the first request triggers a [cold start](/deployment/autoscaling/cold-starts). For large models, cold starts can take minutes.
For production deployments, set `min_replica` to at least 2. This provides redundancy if one replica fails and eliminates cold starts.
### Maximum replicas
The ceiling for your deployment's capacity. The autoscaler won't scale above this number.
**Range:** ≥ 1
This setting protects against runaway scaling and unexpected costs. If traffic exceeds max replica capacity, requests queue rather than triggering new replicas. The default of 1 means no autoscaling: exactly one replica regardless of load.
Estimate max replicas:
$$
(peak\_requests\_per\_second / throughput\_per\_replica) + buffer
$$
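Expressed as a sketch in code (the function name and buffer default are ours):

```python
import math

def estimate_max_replicas(peak_rps: float, throughput_per_replica: float, buffer: int = 2) -> int:
    """Rough max-replica estimate from the formula above."""
    return math.ceil(peak_rps / throughput_per_replica) + buffer

# E.g., 120 req/s peak with 25 req/s per replica and a 2-replica buffer:
print(estimate_max_replicas(120, 25))  # → 7
```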
For high-volume workloads requiring guaranteed capacity, [contact Baseten](mailto:support@baseten.co) about reserved capacity options.
***
## Scaling triggers
Scaling triggers determine when the autoscaler adds or removes capacity. The two key settings, **concurrency target** and **target utilization**, work together to define when your deployment needs more or fewer replicas.
### Concurrency target
How many requests each replica can handle simultaneously. This directly determines replica count for a given load.
**Range:** ≥ 1
The autoscaler calculates desired replicas:
$$
ceiling(in\_flight\_requests / (concurrency\_target \times target\_utilization))
$$
*In-flight requests* are requests sent to your model that haven't returned a response (for streaming, until the stream completes). This count is exposed as [`baseten_concurrent_requests`](/observability/export-metrics/supported-metrics#baseten_concurrent_requests) in the metrics dashboard and metrics export.
The default of 1 is appropriate for models that process one request at a time (like image generation consuming all GPU memory). For models with batching (LLMs, embeddings), higher values reduce cost.
**Tradeoff:** Higher concurrency = fewer replicas (lower cost) but more per-replica queueing (higher latency). Lower concurrency = more replicas (higher cost) but less queueing (lower latency).
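The replica calculation above can be expressed directly in code (the function name is ours; the parameters mirror the symbols in the formula):

```python
import math

def desired_replicas(in_flight: int, concurrency_target: int, target_utilization: float) -> int:
    """Ceiling of in-flight requests over effective per-replica capacity."""
    return math.ceil(in_flight / (concurrency_target * target_utilization))

# 50 in-flight requests, concurrency target 32, 70% target utilization:
print(desired_replicas(50, 32, 0.70))  # → 3
```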
**Starting points by model type:**
| Model type | Starting concurrency |
| ----------------------- | -------------------- |
| Standard Truss model | 1 |
| vLLM / LLM inference | 32–128 |
| SGLang | 32 |
| Text embeddings (TEI) | 32 |
| BEI embeddings | 96+ (min ≥ 8) |
| Whisper (async batch) | 256 |
| Image generation (SDXL) | 1 |
For engine-specific guidance, see [Autoscaling engines](/engines/performance-concepts/autoscaling-engines).
**Concurrency target** controls requests sent *to* a replica and triggers autoscaling.
**predict\_concurrency** (Truss config.yaml) controls requests processed *inside* the container.
Concurrency target should be less than or equal to predict\_concurrency.
See the `predict_concurrency` field in the [Truss configuration reference](/reference/truss-configuration) for details.
### Target utilization
Headroom before scaling triggers. The autoscaler scales when utilization reaches this percentage of the concurrency target.
**Range:** 1–100%
The effective threshold is:
$$
concurrency\_target \times target\_utilization
$$
With concurrency target 10 and utilization 70%, scaling triggers at 7 concurrent requests (10 × 0.70), leaving 30% headroom.
Lower values (50–60%) provide more headroom for spikes but cost more. Higher values (80%+) are cost-efficient for steady traffic but absorb spikes less effectively.
Target utilization is **not** GPU utilization. It measures request slot usage relative to your concurrency target, not hardware utilization.
***
## Scaling dynamics
Scaling dynamics control how quickly and smoothly the autoscaler responds to traffic changes. These settings help you balance responsiveness against stability.
### Autoscaling window
How far back (in seconds) the autoscaler looks when measuring traffic. Traffic is averaged over this window to make scaling decisions.
**Range:** 10–3600 seconds
A 60-second window considers average load over the past minute, smoothing out momentary spikes. Shorter windows (30–60s) react quickly to traffic changes. Longer windows (2–5 min) ignore short-lived fluctuations and prevent chasing noise.
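To see why a longer window smooths spikes, here's a minimal sketch (the autoscaler's actual sampling is internal to Baseten; this only illustrates the averaging):

```python
def windowed_average(samples: list[float], window_size: int) -> float:
    """Average the most recent `window_size` samples."""
    recent = samples[-window_size:]
    return sum(recent) / len(recent)

# Concurrent-request samples, one per interval, with a spike at the end:
samples = [4, 4, 4, 4, 20]

print(windowed_average(samples, 1))  # 20.0 — a short window chases the spike
print(windowed_average(samples, 5))  # 7.2 — a longer window smooths it out
```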
### Scale-down delay
How long (in seconds) the autoscaler waits after load drops before removing replicas. Prevents premature scale-down during temporary dips.
**Range:** 0–3600 seconds
When load drops, the autoscaler starts a countdown. If load stays low for the full delay, it removes replicas using exponential back-off (half the excess, wait, half again).
This is your primary lever for preventing *oscillation* (thrashing). If replicas repeatedly scale up and down, increase this first.
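The halving pattern can be sketched as follows (illustrative only; the real scheduler's step timing and internals aren't specified here):

```python
def scale_down_schedule(current: int, minimum: int) -> list[int]:
    """Replica count after each back-off step: remove half the excess, repeat."""
    steps = []
    while current > minimum:
        excess = current - minimum
        current -= max(1, excess // 2)  # always remove at least one replica
        steps.append(current)
    return steps

# From 10 replicas down to a minimum of 2:
print(scale_down_schedule(10, 2))  # → [6, 4, 3, 2]
```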
A **short window** with a **long delay** gives you fast scale-up while maintaining capacity during temporary dips. This is a good starting configuration for most workloads.
***
## Development deployments
Development deployments have fixed replica limits but allow modification of other autoscaling settings.
The replica constraints are optimized for the development workflow of rapid iteration with live reloading via the [`truss watch`](/reference/cli/truss/watch) command, rather than for production traffic handling.
| Setting | Value | Modifiable |
| ------------------ | ----------- | ---------- |
| Min replicas | 0 | No |
| Max replicas | 1 | No |
| Autoscaling window | 60 seconds | Yes |
| Scale-down delay | 900 seconds | Yes |
| Concurrency target | 1 | Yes |
| Target utilization | 70% | Yes |
The single-replica limit means development deployments aren't suitable for load testing or handling real traffic.
To enable full autoscaling with configurable replica settings, [promote the deployment to production](/deployment/deployments).
***
## Next steps
* [Traffic patterns](/deployment/autoscaling/traffic-patterns): Identify your traffic pattern and get recommended starting settings.
* [Cold starts](/deployment/autoscaling/cold-starts): Understand cold starts and how to minimize their impact.
* [API reference](/reference/management-api/deployments/autoscaling/updates-a-deployments-autoscaling-settings): Complete autoscaling API documentation.
***
## Troubleshooting
Having issues with autoscaling? See [Autoscaling troubleshooting](/troubleshooting/deployments#autoscaling-issues) for solutions to common problems like oscillation, slow scale-up, and unexpected costs.
# Traffic patterns
Source: https://docs.baseten.co/deployment/autoscaling/traffic-patterns
Identify your traffic pattern and configure autoscaling settings to match.
Different traffic patterns require different autoscaling configurations.
Identify your pattern below for recommended starting settings.
These are **starting points**, not final answers. Monitor your
deployment's performance and adjust based on observed behavior. See
[Autoscaling](/deployment/autoscaling/overview) for parameter details.
***
## Jittery traffic
Small, frequent spikes that quickly return to baseline.
### Characteristics
* Baseline replica count is steady, but **spikes up by 2x several times per hour**.
* Spikes are short-lived and return to baseline quickly.
* Often not real load growth, just temporary surges that cause the autoscaler to overreact.
### Common causes
* Consumer products with intermittent usage bursts.
* Traffic splitting or A/B testing with low percentages.
* Polling clients with synchronized intervals.
### Recommended settings
| Parameter | Value | Why |
| ------------------ | ----------------- | ----------------------------------------------- |
| Autoscaling window | **2-5 minutes** | Smooth out noise, avoid reacting to every spike |
| Scale-down delay | **300-600s** | Moderate stability |
| Target utilization | **70%** | Default is fine |
| Concurrency target | Benchmarked value | Start conservative |
A longer autoscaling window averages out the jitter so the autoscaler doesn't chase every small spike. You're trading reaction speed for stability, which is acceptable when the spikes aren't sustained load increases.
If you're still seeing oscillation with these settings, increase the scale-down delay before lowering target utilization.
***
## Bursty traffic
Sharp traffic increases that stay elevated for a sustained period before dropping.
### Characteristics
* Traffic **jumps sharply** (2x+ within 60 seconds).
* Stays high for a sustained period before dropping.
* The "pain" is queueing and latency spikes while new replicas start.
### Common causes
* Daily morning ramp-up (users starting their day).
* Marketing events, product launches, viral moments.
* Top-of-hour scheduled jobs or cron-triggered traffic.
### Recommended settings
| Parameter | Value | Why |
| ------------------ | ---------- | --------------------------------------------- |
| Autoscaling window | **30-60s** | React quickly to genuine load increases |
| Scale-down delay | **900s+** | Handle back-to-back waves without thrashing |
| Target utilization | **50-60%** | More headroom absorbs the burst while scaling |
| Min replicas | **≥2** | Redundancy + reduces cold start impact |
Short window means fast reaction. Long delay prevents scaling down between waves. Lower utilization gives you buffer capacity while new replicas start.
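The headroom reasoning can be made concrete with a back-of-envelope calculation: the replicas needed to absorb a burst are roughly the peak concurrent requests divided by what each replica handles at the target utilization. This is an illustrative sketch, not the autoscaler's actual logic, which reacts over the autoscaling window.

```python theme={"system"}
import math

def replicas_needed(peak_concurrent_requests: int,
                    concurrency_target: int,
                    target_utilization: float) -> int:
    """Back-of-envelope replica count for absorbing a burst.

    Illustrative only: the real autoscaler reacts over the autoscaling
    window rather than computing this directly.
    """
    effective_per_replica = concurrency_target * target_utilization
    return math.ceil(peak_concurrent_requests / effective_per_replica)

# Lower target utilization means each replica absorbs less load,
# leaving buffer capacity while new replicas start.
print(replicas_needed(120, concurrency_target=8, target_utilization=0.6))
print(replicas_needed(120, concurrency_target=8, target_utilization=0.9))
```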
### Pre-warming for predictable bursts
If your bursts are predictable (morning ramp, scheduled events), pre-warm by bumping min replicas before the expected spike:
```bash theme={"system"}
curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"min_replica": 5}'
```
After the burst subsides, reset to your normal minimum:
```bash theme={"system"}
curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"min_replica": 2}'
```
Automate pre-warming with cron jobs or your orchestration system.
Bumping min replicas 10-15 minutes before known peaks avoids cold starts for the first requests after the spike.
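The scheduling decision itself can be sketched in a few lines: given the current time and a known peak window, pick the `min_replica` value to set. The function name and parameters below are illustrative; the actual PATCH call is shown in the curl examples above.

```python theme={"system"}
from datetime import datetime, time

def prewarm_min_replicas(now: datetime,
                         peak_start: time,
                         peak_end: time,
                         warm: int = 5,
                         baseline: int = 2,
                         lead_minutes: int = 15) -> int:
    """Return the min_replica value to set for the current time.

    Bumps capacity `lead_minutes` before the daily peak window and
    restores the baseline after it ends. Illustrative scheduling logic
    to pair with the API calls shown above.
    """
    minutes = now.hour * 60 + now.minute
    start = peak_start.hour * 60 + peak_start.minute - lead_minutes
    end = peak_end.hour * 60 + peak_end.minute
    return warm if start <= minutes < end else baseline

morning = datetime(2024, 1, 8, 8, 50)  # 10 minutes before a 9:00 ramp
print(prewarm_min_replicas(morning, time(9, 0), time(11, 0)))
```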
***
## Scheduled traffic
### Characteristics
* **Long periods of low or zero traffic**.
* Large bursts tied to job schedules (hourly, daily, weekly).
* Traffic patterns are predictable but infrequent.
### Common causes
* ETL pipelines and data processing jobs.
* Embedding backfills and batch inference.
* Periodic evaluation or testing jobs.
* Document processing triggered by user uploads.
### Recommended settings
| Parameter | Value | Why |
| ------------------ | --------------------------------------------------------------- | ----------------------------------------- |
| Min replicas | **0** (if cold starts acceptable) or **1** (during job windows) | Cost savings when idle |
| Scale-down delay | **Moderate to high** | Jobs often come in waves |
| Autoscaling window | **60-120s** | Don't overreact to the first few requests |
| Target utilization | **70%** | Default is fine |
Scale-to-zero saves significant cost during idle periods. The moderate window prevents overreacting to the initial requests of a batch. If jobs come in waves, a longer delay keeps replicas warm between them.
### Scheduled pre-warming
For predictable batch jobs, use cron + API to pre-warm.
5 minutes before the hourly job, scale up:
```bash theme={"system"}
0 * * * * curl -s -X PATCH https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings -H "Authorization: Api-Key $BASETEN_API_KEY" -H "Content-Type: application/json" -d '{"min_replica": 3}'
```
30 minutes after the job completes, scale back down:
```bash theme={"system"}
30 * * * * curl -s -X PATCH https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings -H "Authorization: Api-Key $BASETEN_API_KEY" -H "Content-Type: application/json" -d '{"min_replica": 0}'
```
If you use scale-to-zero, the first request of each batch will experience a [cold start](/deployment/autoscaling/cold-starts). For latency-sensitive batch jobs, keep min replicas at 1 during expected job windows.
***
## Steady traffic
### Characteristics
* Traffic **rises and falls gradually** over the day.
* Classic diurnal pattern with no sharp edges.
* Predictable, cyclical behavior.
### Common causes
* Always-on inference APIs with consistent user base.
* B2B applications with business-hours usage.
* Production workloads with stable, mature traffic.
### Recommended settings
| Parameter | Value | Why |
| ------------------ | ------------ | ------------------------------ |
| Target utilization | **70-80%** | Can run replicas hotter safely |
| Autoscaling window | **60-120s** | Moderate reaction speed |
| Scale-down delay | **300-600s** | Moderate |
| Min replicas | **≥2** | Redundancy for production |
Without sudden spikes, you don't need as much headroom. You can run replicas at higher utilization (lower cost) because load changes are gradual and predictable. The autoscaler has time to react.
Smooth traffic is the easiest to tune. Start with defaults, monitor for a week, then optimize for cost by gradually raising target utilization while watching p95 latency.
***
## Identifying your pattern
Not sure which pattern you have? Check your metrics:
1. Go to your model's **Metrics** tab in the Baseten dashboard
2. Look at **Inference volume** and **Replicas** over the past week
3. Compare to the patterns above
| You see... | Your pattern is... |
| ----------------------------------------------------- | ------------------ |
| Frequent small spikes that quickly return to baseline | Jittery |
| Sharp jumps that stay high for a while | Bursty |
| Long flat periods with occasional large bursts | Scheduled |
| Gradual rises and falls, smooth curves | Steady |
Some workloads are a mix of patterns. If your traffic has both smooth diurnal patterns AND occasional bursts, optimize for the bursts (they cause the most pain) and accept slightly higher cost during steady periods.
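The decision table above can be sketched as a rough heuristic over a replica-count series. This is illustrative only; eyeballing the Metrics tab is usually enough, and the thresholds below are arbitrary.

```python theme={"system"}
def classify_traffic(replicas: list[int]) -> str:
    """Rough heuristic matching the table above. Illustrative only."""
    if len(replicas) < 2:
        return "steady"
    if min(replicas) == 0 and max(replicas) > 0:
        return "scheduled"               # long idle periods with bursts
    base = min(replicas)
    jumps = [abs(b - a) for a, b in zip(replicas, replicas[1:])]
    if max(jumps) <= 1:
        return "steady"                  # only gradual rises and falls
    # How long does the series stay elevated once it spikes?
    longest = run = 0
    for r in replicas:
        run = run + 1 if r >= 2 * base else 0
        longest = max(longest, run)
    return "bursty" if longest > 2 else "jittery"

print(classify_traffic([2, 2, 5, 2, 2, 5, 2, 2]))   # short spikes
print(classify_traffic([2, 2, 6, 6, 6, 6, 6, 2]))   # sustained jump
```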
***
## Next steps
* [Autoscaling](/deployment/autoscaling/overview): Full parameter documentation.
* [Troubleshooting autoscaling](/troubleshooting/deployments#autoscaling-issues): Diagnose and fix common problems.
* [Truss configuration reference](/reference/truss-configuration): Configure predict\_concurrency in your model.
# Concepts
Source: https://docs.baseten.co/deployment/concepts
Deployments, environments, resources, and autoscaling on Baseten.
When you run `truss push`, Baseten creates a [deployment](/deployment/deployments): a running instance of your model on GPU infrastructure with an API endpoint. This page explains how deployments are managed, versioned, and scaled.
## Deployments
A [deployment](/deployment/deployments) is a single version of your model running on specific hardware. Every `truss push` creates a new deployment. You can have multiple deployments of the same model running simultaneously, which is how you test new versions without affecting production traffic. Deployments can be deactivated to stop serving (and stop incurring cost) or deleted permanently when no longer needed.
## Environments
As your model matures, you need a way to manage releases. [Environments](/deployment/environments) provide stable endpoints that persist across deployments. A typical setup has a development environment for testing and a production environment for live traffic. Each environment maintains its own autoscaling settings, metrics, and endpoint URL. When a new deployment is ready, you promote it to an environment, and traffic shifts to the new version without changing the endpoint your application calls.
## Resources
Every deployment runs on a specific [instance type](/deployment/resources) that defines its GPU, CPU, and memory allocation. Choosing the right instance balances inference speed against cost. You set the instance type in your `config.yaml` before deployment, or adjust it later in the dashboard. Smaller models run well on an L4 (24 GB VRAM), while large LLMs may need A100s or H100s with tensor parallelism across multiple GPUs.
## Autoscaling
You don't manage replicas manually. [Autoscaling](/deployment/autoscaling/overview) adjusts the number of running instances based on incoming traffic. You configure a minimum and maximum replica count, a concurrency target, and a scale-down delay. When traffic drops, replicas scale down (optionally to zero, eliminating all cost). When traffic spikes, new replicas come up within seconds. [Cold start optimization](/deployment/autoscaling/cold-starts) and network acceleration keep response times fast even when scaling from zero.
# Deployments
Source: https://docs.baseten.co/deployment/deployments
Deploy, manage, and scale machine learning models with Baseten
A **deployment** in Baseten is a **containerized instance of a model** that serves inference requests via an API endpoint. Deployments exist independently but can be **promoted to an environment** for structured access and scaling.
Baseten **automatically wraps every deployment in a REST API**. Once deployed, models can be queried with a simple HTTP request:
```python theme={"system"}
import requests

resp = requests.post(
    "https://model-{modelID}.api.baseten.co/deployment/{deploymentID}/predict",
    headers={"Authorization": "Api-Key YOUR_API_KEY"},
    json={"text": "Hello my name is {MASK}"},
)
print(resp.json())
```
[Learn more about running inference on your deployment](/inference/calling-your-model)
***
# Development deployment
A **development deployment** is a mutable instance designed for rapid iteration. Create one with `truss push --watch` (for models) or `truss chains push --watch` (for Chains). It always remains in the **development state** and can't be renamed; it leaves that state only when promoted.
Key characteristics:
* **Live reload** enables direct updates without redeployment.
* **Single replica, scales to zero** when idle to conserve compute resources.
* **No autoscaling or zero-downtime updates.**
* **Can be promoted** to create a persistent deployment.
Once promoted, the development deployment transitions to a **deployment** and can optionally be promoted to an environment.
***
# Environments and promotion
Environments provide **logical isolation** for managing deployments but are **not required** for a deployment to function. You can run a deployment independently or promote it to an environment for controlled traffic allocation and scaling.
* The **production environment** exists by default.
* **Custom environments** (e.g., staging) can be created for specific workflows.
* **Promoting a deployment doesn't modify its behavior**, only its routing and lifecycle management.
## Rolling deployments
Rolling deployments replace replicas incrementally when promoting a deployment to an environment. Instead of swapping all traffic at once, rolling deployments scale up the candidate, shift traffic proportionally, and scale down the previous deployment in controlled steps. You can pause, resume, cancel, or force-complete a rolling deployment at any point.
See [Rolling deployments](/deployment/rolling-deployments) for configuration, control actions, and status reference.
## Canary deployments (deprecated)
Canary deployments are deprecated. Use [rolling deployments](/deployment/rolling-deployments) for incremental traffic shifting with finer control over replica provisioning and rollback.
Canary deployments support incremental traffic shifting to a new deployment in 10 evenly distributed stages over a configurable time window. Canary rollouts can be enabled or canceled via the UI or [REST API](/reference/management-api/environments/update-an-environments-settings).
***
# Managing deployments
## Naming deployments
By default, deployments of a model are named `deployment-1`, `deployment-2`, and so forth sequentially. You can instead give deployments custom names via two methods:
1. While creating the deployment, using a [command line argument in truss push](/reference/sdk/truss#deploying-a-model).
2. After creating the deployment, in the model management page within your Baseten dashboard.
Renaming deployments is purely aesthetic and does not affect model management API paths, which work via model and deployment IDs.
## Deactivating a deployment
Deactivate a deployment to suspend inference execution while preserving configuration.
* **Remains visible in the dashboard.**
* **Consumes no compute resources** but can be reactivated anytime.
* **API requests return a 404 error while deactivated.**
For demand-driven deployments, consider [autoscaling with scale to zero](/reference/management-api/deployments/autoscaling/updates-a-deployments-autoscaling-settings).
## Deleting deployments
You can **permanently delete** deployments, but production deployments must be replaced before deletion.
* **Deleted deployments are purged from the dashboard** but retained in usage logs.
* **All associated compute resources are released.**
* **API requests return a 404 error post-deletion.**
Deletion is irreversible. Use deactivation if retention is required.
# Environments
Source: https://docs.baseten.co/deployment/environments
Manage your model’s release cycles with environments.
Environments provide structured management for deployments, ensuring controlled rollouts, stable endpoints, and autoscaling. They help teams stage, test, and release models without affecting production traffic.
Deployments can be promoted to an environment (e.g., "staging") to validate outputs before moving to production, allowing for safer model iteration and evaluation.
***
## Using environments to manage deployments
Environments support **structured validation** before promoting a deployment, including:
* **Automated tests and evaluations**
* **Manual testing in pre-production**
* **Gradual traffic shifts with canary deployments**
* **Shadow serving for real-world analysis**
Promoting a deployment ensures it inherits **environment-specific scaling and monitoring settings**, such as:
* **Dedicated API endpoint** → [Predict API Reference](/reference/inference-api/overview#predict-endpoints)
* **Autoscaling controls** → Scale behavior is managed per environment.
* **Traffic ramp-up** → Enable [canary rollouts](/deployment/deployments#canary-deployments) or [rolling deployments](/deployment/rolling-deployments).
* **Monitoring and metrics** → [Export environment metrics](/observability/export-metrics/overview).
A **production environment** operates like any other environment but has restrictions:
* **It can't be deleted** unless the entire model is removed.
* **You can't create additional environments named "production."**
***
## Creating custom environments
In addition to the standard **production** environment, you can create as many custom environments as needed. There are two ways to create a custom environment:
1. In the model management page on the Baseten dashboard.
2. Via the [create environment endpoint](/reference/management-api/environments/create-an-environment) in the model management API.
***
## Promoting deployments to environments
When you promote a deployment, Baseten follows a **three-step process**:
1. A **new deployment** is created with a unique deployment ID.
2. The deployment **initializes resources** and becomes active.
3. The new deployment **replaces the existing deployment** in that environment.
* If there was **no previous deployment, default autoscaling settings** are applied.
* If a **previous deployment existed**, the new one **inherits autoscaling settings**, and the old deployment is **demoted and scales to zero**.
### Promoting a published deployment
If a **published deployment** (not a development deployment) is promoted:
* Its **autoscaling settings are updated** to match the environment.
* If **inactive**, it must be **activated** before promotion.
Previous deployments are **demoted but remain in the system**, retaining their **deployment ID and scaling behavior**.
***
## Deploying directly to an environment
You can deploy directly to a named environment by specifying `--environment` in `truss push`:
```sh theme={"system"}
cd my_model/
truss push --environment {environment_name}
```
Only one active promotion per environment is allowed at a time.
***
## Accessing environments in your code
The **environment name** is available in `model.py` via the `environment` keyword argument:
```python theme={"system"}
def __init__(self, **kwargs):
    self._environment = kwargs["environment"]
```
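A slightly fuller sketch of how a model might branch on the environment name. The `Model` class shape follows the standard Truss model interface; the debug-output behavior is purely illustrative.

```python theme={"system"}
class Model:
    def __init__(self, **kwargs):
        # Baseten passes the environment name (or None for deployments
        # not promoted to an environment) as a keyword argument.
        self._environment = kwargs.get("environment")

    def predict(self, model_input):
        # Illustrative: include debug info everywhere except production.
        debug = self._environment != "production"
        result = {"echo": model_input}
        if debug:
            result["environment"] = self._environment
        return result

model = Model(environment="staging")
print(model.predict({"text": "hello"}))
```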
To ensure the **environment value stays current**, enable **"Re-deploy when promoting"** in the UI or via the [REST API](/reference/management-api/environments/update-an-environments-settings). This guarantees the deployment is re-initialized with the correct environment name after a promotion.
***
## Regional environments
Regional environments restrict inference traffic to a specific geographic region for data residency compliance. When your organization enables regional environments, each environment gets a dedicated regional endpoint that routes directly to infrastructure in the designated region.
Regional environments are configured at the organization level. Contact your Baseten account team to enable regional environments.
### Regional endpoint format
Regional endpoints embed the environment name in the hostname instead of the URL path:
Call a model's regional endpoint with `/predict` or `/async_predict`.
```
https://model-{model_id}-{env_name}.api.baseten.co/predict
```
For example, a model with ID `abc123` in the `prod-us` environment:
```
https://model-abc123-prod-us.api.baseten.co/predict
```
Call a chain's regional endpoint with `/run_remote` or `/async_run_remote`.
```
https://chain-{chain_id}-{env_name}.api.baseten.co/run_remote
```
Connect to a regional WebSocket endpoint for models or chains.
```
wss://model-{model_id}-{env_name}.api.baseten.co/websocket
wss://chain-{chain_id}-{env_name}.api.baseten.co/websocket
```
Connect to a regional gRPC endpoint using the `grpc.api.baseten.co` subdomain.
```
model-{model_id}-{env_name}.grpc.api.baseten.co:443
```
The regional endpoint URL appears in your model's API endpoint section in the Baseten dashboard once regional environments are configured for your organization.
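The hostname patterns above can be assembled with a small helper. This is a sketch for constructing URLs client-side; `regional_url` is a hypothetical name, not part of any Baseten SDK.

```python theme={"system"}
def regional_url(kind: str, resource_id: str, env_name: str,
                 path: str = "predict") -> str:
    """Build a regional endpoint URL from the patterns above.

    kind is "model" or "chain"; path is e.g. "predict",
    "async_predict", or "run_remote" for chains.
    """
    return f"https://{kind}-{resource_id}-{env_name}.api.baseten.co/{path}"

print(regional_url("model", "abc123", "prod-us"))
```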
### API restrictions on regional endpoints
Regional endpoints derive the environment exclusively from the hostname. Path-based routing (`/environments/`, `/production/`, `/deployment/`) is rejected. For gRPC, do not set `x-baseten-environment` or `x-baseten-deployment` metadata headers.
***
## Deleting environments
You can delete environments, **except for production**. To remove a **production deployment**, first **promote another deployment to production** or delete the entire model.
* **Deleted environments are removed from the overview** but remain in billing history.
* **They do not consume resources** after deletion.
* **API requests to a deleted environment return a 404 error.**
Deletion is permanent - consider deactivation instead.
# Resources
Source: https://docs.baseten.co/deployment/resources
Manage and configure model resources
Every AI/ML model on Baseten runs on an **instance**, a dedicated set of hardware allocated to the model server. Selecting the right instance type ensures **optimal performance** while controlling **compute costs**.
* **Insufficient resources**: Slow inference or failures.
* **Excess resources**: Higher costs without added benefit.
## Instance type resource components
* **Instance**: The allocated hardware for inference.
* **Node**: The compute unit within an instance, comprising 8 GPUs with associated vCPU, RAM, and VRAM.
* **vCPU**: Virtual CPU cores for general computing.
* **RAM**: Memory available to the CPU.
* **GPU**: Specialized hardware for accelerated ML workloads.
* **VRAM**: Dedicated GPU memory for model execution.
***
# Configuring model resources
Define resources **before deployment** in Truss or **adjust them later** via the Baseten UI.
### Defining resources in Truss
Define resource requirements in [`config.yaml`](/development/model/configuration) before running `truss push`.
* **Published deployment** (`truss push`): Creates a new deployment (named sequentially: `deployment-1`, `deployment-2`, etc.) using the resources in [`config.yaml`](/development/model/configuration).
* **Development deployment** (`truss push --watch`): Overwrites the existing development deployment with the specified resource configuration and starts watching for changes. Use [`truss watch`](/development/model/deploy-and-iterate) to resume watching an existing development deployment.
* **Production deployment** (`truss push --promote`): Creates a new deployment and promotes it to production, replacing the active deployment.
* **Environment deployment** (`truss push --environment `): Deploys directly to a [custom environment](/deployment/environments) like staging.
Changes to `config.yaml` only affect new deployments. To update resources on an existing published deployment, edit resources in the [Baseten UI](#updating-resources-in-the-baseten-ui).
You can configure resources in two ways:
**Option 1: Specify individual resource fields**
```yaml config.yaml theme={"system"}
resources:
  accelerator: L4
  cpu: "4"
  memory: 16Gi
```
Baseten provisions the **smallest instance that meets the specified constraints**:
* cpu: "3" or "4" → Maps to a 4-core instance.
* cpu: "5" to "8" → Maps to an 8-core instance.
`Gi` in `resources.memory` refers to **Gibibytes**, which are slightly larger
than **Gigabytes**.
**Option 2: Specify an exact instance type**
An instance type is the full SKU name that uniquely identifies a specific hardware configuration. When you specify individual resource fields like `cpu` and `accelerator`, Baseten selects the smallest instance that meets your requirements. With `instance_type`, you specify exactly which instance you want, no guessing required.
Use `instance_type` when you:
* Know the exact hardware configuration you need.
* Want to ensure consistent instance selection across deployments.
* Are following a recommendation for a specific model (for example, "use an L4 with 4 vCPUs and 16 GiB RAM").
```yaml config.yaml theme={"system"}
resources:
  instance_type: "L4:4x16"
```
The format encodes the hardware specs as `{GPU}:{vCPU}x{RAM}`. For example, `L4:4x16` means an L4 GPU with 4 vCPUs and 16 GiB of RAM. When `instance_type` is specified, other resource fields (`cpu`, `memory`, `accelerator`, `use_gpu`) are ignored.
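The single-GPU SKU format can be unpacked mechanically. A sketch; it covers only the `{GPU}:{vCPU}x{RAM}` form, not multi-GPU SKUs like `A100:2x24x288` or bare `H100` names.

```python theme={"system"}
def parse_instance_type(instance_type: str) -> dict:
    """Parse a single-GPU instance type like "L4:4x16".

    Returns the accelerator, vCPU count, and RAM in GiB. Multi-GPU
    SKUs and bare H100 names use different shapes, so this sketch
    handles only the {GPU}:{vCPU}x{RAM} form.
    """
    gpu, _, spec = instance_type.partition(":")
    vcpu, _, ram = spec.partition("x")
    return {"accelerator": gpu, "vcpu": int(vcpu), "ram_gib": int(ram)}

print(parse_instance_type("L4:4x16"))
```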
### Updating resources in the Baseten UI
Once deployed, you can only update resource configurations **through the Baseten UI**. Changing the instance type deploys a copy of the deployment using the specified instance type.
For a list of available instance types, see the [instance type reference](/deployment/resources#instance-type-reference).
***
# Instance type reference
Specs and benchmarks for every Baseten instance type.
Choosing the right instance for model inference means balancing performance and cost. This page lists all available instance types on Baseten to help you deploy and serve models effectively.
## CPU-only instances
Cost-effective options for lighter workloads. No GPU.
* **Starts at**: \$0.00058/min
* **Best for**: Transformers pipelines, small QA models, text embeddings
| Instance | \$/min | vCPU | RAM |
| -------- | --------- | ---- | ------ |
| 1x2 | \$0.00058 | 1 | 2 GiB |
| 1x4 | \$0.00086 | 1 | 4 GiB |
| 2x8 | \$0.00173 | 2 | 8 GiB |
| 4x16 | \$0.00346 | 4 | 16 GiB |
| 8x32 | \$0.00691 | 8 | 32 GiB |
| 16x64 | \$0.01382 | 16 | 64 GiB |
To select a CPU-only instance, use the format `CPU:{vCPU}x{RAM}` (e.g., `instance_type: "CPU:4x16"`).
**Example workloads:**
* `1x2`: Text classification (e.g., Truss quickstart)
* `4x16`: LayoutLM Document QA
* `4x16+`: Sentence Transformers embeddings on larger corpora
## GPU instances
Accelerated inference for LLMs, diffusion models, and Whisper.
| Instance | \$/min | vCPU | RAM | GPU | VRAM |
| -------------- | --------- | ---- | -------- | ---------------------- | ------- |
| T4x4x16 | \$0.01052 | 4 | 16 GiB | NVIDIA T4 | 16 GiB |
| T4x8x32 | \$0.01504 | 8 | 32 GiB | NVIDIA T4 | 16 GiB |
| T4x16x64 | \$0.02408 | 16 | 64 GiB | NVIDIA T4 | 16 GiB |
| L4x4x16 | \$0.01414 | 4 | 16 GiB | NVIDIA L4 | 24 GiB |
| L4:2x24x96 | \$0.04002 | 24 | 96 GiB | 2 NVIDIA L4s | 48 GiB |
| L4:4x48x192 | \$0.08003 | 48 | 192 GiB | 4 NVIDIA L4s | 96 GiB |
| A10Gx4x16 | \$0.02012 | 4 | 16 GiB | NVIDIA A10G | 24 GiB |
| A10Gx8x32 | \$0.02424 | 8 | 32 GiB | NVIDIA A10G | 24 GiB |
| A10Gx16x64 | \$0.03248 | 16 | 64 GiB | NVIDIA A10G | 24 GiB |
| A10G:2x24x96 | \$0.05672 | 24 | 96 GiB | 2 NVIDIA A10Gs | 48 GiB |
| A10G:4x48x192 | \$0.11344 | 48 | 192 GiB | 4 NVIDIA A10Gs | 96 GiB |
| A10G:8x192x768 | \$0.32576 | 192 | 768 GiB | 8 NVIDIA A10Gs | 188 GiB |
| A100x12x144 | \$0.10240 | 12 | 144 GiB | 1 NVIDIA A100 | 80 GiB |
| A100:2x24x288 | \$0.20480 | 24 | 288 GiB | 2 NVIDIA A100s | 160 GiB |
| A100:3x36x432 | \$0.30720 | 36 | 432 GiB | 3 NVIDIA A100s | 240 GiB |
| A100:4x48x576 | \$0.40960 | 48 | 576 GiB | 4 NVIDIA A100s | 320 GiB |
| A100:5x60x720 | \$0.51200 | 60 | 720 GiB | 5 NVIDIA A100s | 400 GiB |
| A100:6x72x864 | \$0.61440 | 72 | 864 GiB | 6 NVIDIA A100s | 480 GiB |
| A100:7x84x1008 | \$0.71680 | 84 | 1008 GiB | 7 NVIDIA A100s | 560 GiB |
| A100:8x96x1152 | \$0.81920 | 96 | 1152 GiB | 8 NVIDIA A100s | 640 GiB |
| H100 | \$0.10833 | - | - | 1 NVIDIA H100 | 80 GiB |
| H100:2 | \$0.21667 | - | - | 2 NVIDIA H100s | 160 GiB |
| H100:4 | \$0.43333 | - | - | 4 NVIDIA H100s | 320 GiB |
| H100:8 | \$0.86667 | - | - | 8 NVIDIA H100s | 640 GiB |
| H100MIG | \$0.06250 | - | - | Fractional NVIDIA H100 | 40 GiB |
To select a GPU instance with `instance_type`:
* **Single GPU**: `{GPU}:{vCPU}x{RAM}` (e.g., `"L4:4x16"`).
* **Multi-GPU**: `{GPU}:{count}x{vCPU}x{RAM}` (e.g., `"A100:2x24x288"`).
* **H100**: `H100` or `H100:{count}` (e.g., `"H100:2"`).
* **Fractional H100**: `"H100_40GB"`.
## GPU details and workloads
### T4
Turing-series GPU
* 2,560 CUDA / 320 Tensor cores
* 16 GiB VRAM
* **Best for:** Whisper, small LLMs like StableLM 3B
### L4
Ada Lovelace-series GPU
* 7,680 CUDA / 240 Tensor cores
* 24 GiB VRAM, 300 GiB/s
* 121 TFLOPS (fp16)
* **Best for**: Stable Diffusion XL
* **Limit**: Not suitable for LLMs due to bandwidth
### A10G
Ampere-series GPU
* 9,216 CUDA / 288 Tensor cores
* 24 GiB VRAM, 600 GiB/s
* 70 TFLOPS (fp16)
* **Best for**: Mistral 7B, Whisper, Stable Diffusion/SDXL
### A100
Ampere-series GPU
* 6,912 CUDA / 432 Tensor cores
* 80 GiB VRAM, 1.94 TB/s
* 312 TFLOPS (fp16)
* **Best for**: Mixtral, Llama 2 70B (2 A100s), Falcon 180B (5 A100s), SDXL
### H100
Hopper-series GPU
* 16,896 CUDA / 640 Tensor cores
* 80 GiB VRAM, 3.35 TB/s
* 990 TFLOPS (fp16)
* **Best for**: Mixtral 8x7B, Llama 2 70B (2xH100), SDXL
### H100MIG
Fractional H100 (3/7 compute, ½ memory)
* 7,242 CUDA cores, 40 GiB VRAM
* 1.675 TB/s bandwidth
* **Best for**: Efficient LLM inference at lower cost than A100
# Rolling deployments
Source: https://docs.baseten.co/deployment/rolling-deployments
Gradually shift traffic to a new deployment with replica-based rolling deployments.
Rolling deployments replace replicas incrementally when promoting a deployment to an environment.
Instead of swapping all traffic at once, rolling deployments scale up the candidate deployment, shift traffic proportionally, and scale down the previous deployment in controlled steps.
Use rolling deployments when you need zero-downtime updates with the ability to pause, cancel, or force-complete the deployment at any point.
Autoscaling is disabled for the entire duration of a rolling deployment.
Replica counts don't adjust automatically until the deployment reaches a
terminal status (SUCCEEDED, FAILED, or CANCELED). Use the
`replica_overhead_percent` setting to pre-provision additional capacity before
the deployment starts.
## How rolling deployments work
A rolling deployment follows a repeating three-step cycle:
1. **Scale up** candidate deployment replicas by the configured percentage.
2. **Shift traffic** proportionally to match the new replica ratio.
3. **Scale down** the previous deployment replicas by the same percentage.
This cycle repeats until all traffic and replicas run on the candidate deployment, at which point it becomes the active deployment in the environment.
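The repeating cycle can be sketched as a simple simulation of replica counts per step. This is an illustrative model, not the actual controller, which also waits `stabilization_time_seconds` between steps and handles failures.

```python theme={"system"}
import math

def simulate_rolling_deploy(total_replicas: int, max_surge_percent: int):
    """Return (candidate, previous) replica counts after each cycle.

    Illustrative model of the scale-up / shift / scale-down loop
    described above; not Baseten's actual controller logic.
    """
    step = max(1, math.ceil(total_replicas * max_surge_percent / 100))
    candidate, previous = 0, total_replicas
    steps = []
    while previous > 0:
        candidate = min(total_replicas, candidate + step)  # 1. scale up
        previous = max(0, total_replicas - candidate)      # 2-3. shift traffic, scale down
        steps.append((candidate, previous))
    return steps

for cand, prev in simulate_rolling_deploy(10, max_surge_percent=25):
    print(cand, prev)
```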
### Provisioning modes
Rolling deployments support two mutually exclusive provisioning modes.
You must configure exactly one:
* `max_surge_percent`: Scales up candidate replicas before scaling down previous replicas.
* `max_unavailable_percent`: Scales down previous replicas before scaling up candidate replicas.
Setting both to non-zero values, or both to zero, is invalid: exactly one must be non-zero.
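The exactly-one rule can be checked before sending a config. A small validation sketch; the function name is illustrative.

```python theme={"system"}
def validate_provisioning_mode(max_surge_percent: int,
                               max_unavailable_percent: int) -> str:
    """Enforce that exactly one provisioning mode is active."""
    if max_surge_percent and max_unavailable_percent:
        raise ValueError("Only one of max_surge_percent or "
                         "max_unavailable_percent may be non-zero.")
    if not max_surge_percent and not max_unavailable_percent:
        raise ValueError("One of max_surge_percent or "
                         "max_unavailable_percent must be non-zero.")
    return "max_surge" if max_surge_percent else "max_unavailable"

print(validate_provisioning_mode(10, 0))
```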
## Enabling rolling deployments
Enable rolling deployments on any environment by updating the environment's promotion settings.
Rolling deployments are disabled by default.
```bash theme={"system"}
curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/environments/production \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "promotion_settings": {
      "rolling_deploy": true,
      "rolling_deploy_config": {
        "max_surge_percent": 10,
        "max_unavailable_percent": 0,
        "stabilization_time_seconds": 60,
        "replica_overhead_percent": 0
      }
    }
  }'
```
```python theme={"system"}
import requests
import os

API_KEY = os.environ.get("BASETEN_API_KEY")

response = requests.patch(
    "https://api.baseten.co/v1/models/{model_id}/environments/production",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={
        "promotion_settings": {
            "rolling_deploy": True,
            "rolling_deploy_config": {
                "max_surge_percent": 10,
                "max_unavailable_percent": 0,
                "stabilization_time_seconds": 60,
                "replica_overhead_percent": 0,
            },
        }
    },
)
print(response.json())
```
Once rolling deployments are enabled, any subsequent [promotion to the environment](/reference/management-api/deployments/promote/promotes-a-deployment-to-an-environment) uses the rolling deployment workflow.
## Configuration reference
Configure rolling deployments through the `rolling_deploy_config` object in the environment's `promotion_settings`.
**`max_surge_percent`**: Percentage of additional replicas to provision during each step. Set to `0` to use max unavailable mode instead. **Range:** 0–50

**`max_unavailable_percent`**: Percentage of replicas that can be unavailable during each step. Set to `0` to use max surge mode instead. **Range:** 0–50

**`stabilization_time_seconds`**: Seconds to wait after each traffic shift before proceeding to the next step. Use this to monitor metrics between steps. **Range:** 0–3600

**`replica_overhead_percent`**: Percentage of additional replicas to pre-provision on the current deployment before the rolling deployment starts. Compensates for autoscaling being disabled. **Range:** 0–500

Additional promotion settings configured at the `promotion_settings` level:

**`rolling_deploy`**: Enables rolling deployments for the environment.
## Deployment statuses
The `in_progress_promotion` field on the [environment detail endpoint](/reference/management-api/environments/get-an-environments-details) tracks the current state of a rolling deployment.
| Status | Description |
| -------------- | -------------------------------------------------------------------------------------------------- |
| `RELEASING` | Candidate deployment is building and initializing replicas. |
| `RAMPING_UP` | Scaling up candidate replicas and shifting traffic. |
| `PAUSED` | Rolling deployment is paused at its current traffic split. Replicas stay at their current count. |
| `RAMPING_DOWN` | Graceful cancel in progress. Traffic is shifting back to the previous deployment. |
| `SUCCEEDED` | Rolling deployment completed. The candidate is now the active deployment. Autoscaling resumes. |
| `FAILED` | Rolling deployment failed. Traffic remains on the previous deployment. Autoscaling resumes. |
| `CANCELED` | Rolling deployment was canceled. Traffic returned to the previous deployment. Autoscaling resumes. |
The `in_progress_promotion` object also includes `percent_traffic_to_new_version`, which reports the current percentage of traffic routed to the candidate deployment.
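A client watching a rolling deployment typically polls the environment detail endpoint until the status reaches a terminal value. A sketch of the decision logic; the helper names are illustrative.

```python theme={"system"}
from typing import Optional

# Terminal statuses from the table above: autoscaling resumes
# once any of these is reached.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "CANCELED"}

def is_terminal(status: str) -> bool:
    """True once the rolling deployment has finished."""
    return status in TERMINAL_STATUSES

def should_keep_polling(promotion: Optional[dict]) -> bool:
    """Given the in_progress_promotion object (or None when no
    promotion is in flight), decide whether to keep polling."""
    if promotion is None:
        return False
    return not is_terminal(promotion.get("status", ""))

print(should_keep_polling({"status": "RAMPING_UP",
                           "percent_traffic_to_new_version": 30}))
```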
## Deployment control actions
### Pause
Pauses the rolling deployment after the current step completes. Use this to inspect metrics or logs before proceeding.
```bash theme={"system"}
curl -X POST \
https://api.baseten.co/v1/models/{model_id}/environments/production/pause_promotion \
-H "Authorization: Api-Key $BASETEN_API_KEY"
```
```python theme={"system"}
response = requests.post(
"https://api.baseten.co/v1/models/{model_id}/environments/production/pause_promotion",
headers={"Authorization": f"Api-Key {API_KEY}"},
)
print(response.json())
```
### Resume
Resumes a paused rolling deployment from where it left off.
```bash theme={"system"}
curl -X POST \
https://api.baseten.co/v1/models/{model_id}/environments/production/resume_promotion \
-H "Authorization: Api-Key $BASETEN_API_KEY"
```
```python theme={"system"}
response = requests.post(
"https://api.baseten.co/v1/models/{model_id}/environments/production/resume_promotion",
headers={"Authorization": f"Api-Key {API_KEY}"},
)
print(response.json())
```
### Cancel
Gracefully cancels the rolling deployment. Traffic ramps back to the previous deployment and candidate replicas scale down.
```bash theme={"system"}
curl -X POST \
https://api.baseten.co/v1/models/{model_id}/environments/production/cancel_promotion \
-H "Authorization: Api-Key $BASETEN_API_KEY"
```
```python theme={"system"}
response = requests.post(
"https://api.baseten.co/v1/models/{model_id}/environments/production/cancel_promotion",
headers={"Authorization": f"Api-Key {API_KEY}"},
)
print(response.json())
```
Returns a `status` of `CANCELED` (instant cancel for non-rolling deployments) or `RAMPING_DOWN` (graceful rollback for rolling deployments).
### Force cancel
Immediately cancels the rolling deployment and returns all traffic to the previous deployment. Use this when you need to roll back without waiting for the graceful ramp-down.
Force canceling may cause brief service disruption if the previous deployment
is under-provisioned.
```bash theme={"system"}
curl -X POST \
https://api.baseten.co/v1/models/{model_id}/environments/production/force_cancel_promotion \
-H "Authorization: Api-Key $BASETEN_API_KEY"
```
```python theme={"system"}
response = requests.post(
"https://api.baseten.co/v1/models/{model_id}/environments/production/force_cancel_promotion",
headers={"Authorization": f"Api-Key {API_KEY}"},
)
print(response.json())
```
### Force roll forward
Immediately completes the rolling deployment, shifting all traffic to the candidate deployment. This works even if the deployment is in the process of rolling back.
Force rolling forward may promote an under-provisioned deployment if the
candidate has not finished scaling up.
```bash theme={"system"}
curl -X POST \
https://api.baseten.co/v1/models/{model_id}/environments/production/force_roll_forward_promotion \
-H "Authorization: Api-Key $BASETEN_API_KEY"
```
```python theme={"system"}
response = requests.post(
"https://api.baseten.co/v1/models/{model_id}/environments/production/force_roll_forward_promotion",
headers={"Authorization": f"Api-Key {API_KEY}"},
)
print(response.json())
```
## Autoscaling during rolling deployments
To compensate for autoscaling being disabled during rolling deployments:
* Set `replica_overhead_percent` to pre-provision the current deployment before the rolling deployment starts. For example, a value of `50` adds 50% more replicas to the current deployment before any traffic shifts.
* Set `stabilization_time_seconds` to add a wait period between steps, giving you time to monitor metrics before the next traffic shift.
* Factor in expected traffic when setting your environment's `min_replica` and `max_replica` before starting the rolling deployment.
Autoscaling resumes automatically when the rolling deployment reaches a terminal status: `SUCCEEDED`, `FAILED`, or `CANCELED`.
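For example, a settings payload combining these options might look like the following sketch (nesting these fields under `promotion_settings` is an assumption here, mirroring the other settings examples on this page):

```python theme={"system"}
# Sketch: 50% replica overhead and a 5-minute stabilization window
# between traffic-shift steps. The nesting under "promotion_settings"
# is assumed for illustration.
payload = {
    "promotion_settings": {
        "replica_overhead_percent": 50,
        "stabilization_time_seconds": 300,
    }
}
```

You would send this as the JSON body of a `PATCH` to the environment endpoint, the same way as other promotion settings.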
## Deployment cleanup
After a rolling deployment completes, the `promotion_cleanup_strategy` setting controls what happens to the previous deployment.
* `SCALE_TO_ZERO`: Scales the previous deployment to zero replicas. It remains available for reactivation. This is the default.
* `KEEP`: Leaves the previous deployment running at its current replica count.
* `DEACTIVATE`: Deactivates the previous deployment. It stops serving traffic and releases all resources.
Set it alongside your other promotion settings:
```bash theme={"system"}
curl -X PATCH \
https://api.baseten.co/v1/models/{model_id}/environments/production \
-H "Authorization: Api-Key $BASETEN_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"promotion_settings": {
"promotion_cleanup_strategy": "DEACTIVATE"
}
}'
```
```python theme={"system"}
response = requests.patch(
"https://api.baseten.co/v1/models/{model_id}/environments/production",
headers={"Authorization": f"Api-Key {API_KEY}"},
json={
"promotion_settings": {
"promotion_cleanup_strategy": "DEACTIVATE"
}
},
)
print(response.json())
```
# Binary IO
Source: https://docs.baseten.co/development/chain/binaryio
Performant serialization of numeric data
Numeric data and audio/video are most efficiently transmitted as raw bytes.
Other representations such as JSON or base64 encoding can lose precision, add
significant parsing overhead, and increase message sizes (for example, a \~33%
increase for base64 encoding).
Chains extends the JSON-centric pydantic ecosystem with two ways to include
binary data: numpy array support and raw `bytes`.
## Numpy `ndarray` support
Once your data is represented as a numpy array, you can easily (and often
without copying) convert it to `torch`, `tensorflow`, or other common numeric
libraries' objects.
To include numpy arrays in a pydantic model, Chains provides a special field
type, `NumpyArrayField`. For example:
```python theme={"system"}
import numpy as np
import pydantic
from truss_chains import pydantic_numpy
class DataModel(pydantic.BaseModel):
some_numbers: pydantic_numpy.NumpyArrayField
other_field: str
...
numbers = np.random.random((3, 2))
data = DataModel(some_numbers=numbers, other_field="Example")
print(data)
# some_numbers=NumpyArrayField(shape=(3, 2), dtype=float64, data=[
# [0.39595027 0.23837526]
# [0.56714894 0.61244946]
# [0.45821942 0.42464844]])
# other_field='Example'
```
`NumpyArrayField` is a wrapper around the actual numpy array. Inside your
Python code, you can work with its `array` attribute:
```python theme={"system"}
data.some_numbers.array += 10
# some_numbers=NumpyArrayField(shape=(3, 2), dtype=float64, data=[
# [10.39595027 10.23837526]
# [10.56714894 10.61244946]
# [10.45821942 10.42464844]])
# other_field='Example'
```
The interesting part is how it serializes when communicating between
Chainlets or with a client.
It works in two modes: JSON and binary.
### Binary
As a JSON alternative that supports byte data, Chains uses `msgpack` (with
`msgpack_numpy`) to serialize the dict representation.
For Chainlet-Chainlet RPCs, this is done automatically when you enable binary
mode for the dependency Chainlet, see
[all options](/reference/sdk/chains#truss-chains-depends):
```python theme={"system"}
import numpy as np
import truss_chains as chains
# `DataModel` is the pydantic model defined in the example above.
class Worker(chains.ChainletBase):
    async def run_remote(self, data: DataModel) -> DataModel:
        data.some_numbers.array += 10
        return data
class Consumer(chains.ChainletBase):
    def __init__(self, worker=chains.depends(Worker, use_binary=True)):
        self._worker = worker
    async def run_remote(self) -> DataModel:
        numbers = np.random.random((3, 2))
        data = DataModel(some_numbers=numbers, other_field="Example")
        result = await self._worker.run_remote(data)
        return result
```
Now the data is transmitted between Chainlets in a fast and compact way,
which often improves performance.
### Binary client
If you want to send such data as input to a chain or parse binary output
from a chain, you have to add the `msgpack` serialization client-side:
```python theme={"system"}
import requests
import msgpack
import msgpack_numpy
msgpack_numpy.patch() # Register hook for numpy.
# Dump to "python" dict and then to binary.
data_dict = data.model_dump(mode="python")
data_bytes = msgpack.dumps(data_dict)
# Set binary content type in request header.
headers = {
"Content-Type": "application/octet-stream", "Authorization": ...
}
response = requests.post(url, data=data_bytes, headers=headers)
response_dict = msgpack.loads(response.content)
response_model = ResponseModel.model_validate(response_dict)
```
The steps of dumping from a pydantic model and validating the response dict
into a pydantic model can be skipped if you prefer working with raw dicts
on the client.
The implementation of `NumpyArrayField` only needs `pydantic` and no other
Chains dependencies, so you can take that implementation code in isolation
and integrate it into your client code.
Some version combinations of `msgpack` and `msgpack_numpy` give errors; we
know that `msgpack = ">=1.0.2"` and `msgpack-numpy = ">=0.4.8"` work.
### JSON
The JSON schema representing the array is a dict with fields `shape`
(`list[int]`), `dtype` (`str`), and `data_b64` (`str`). For example:
```python theme={"system"}
print(data.model_dump_json())
# '{"some_numbers":{"shape":[3,2],"dtype":"float64", "data_b64":"30d4/rnKJEAsvm...'
```
The base64 data corresponds to `np.ndarray.tobytes()`.
To get back to the array from the JSON string, use the model's
`model_validate_json` method.
As discussed at the beginning, this schema is not performant for numeric data
and is only offered as a compatibility layer (JSON does not allow bytes).
Generally, prefer the binary format.
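The round trip implied by this schema can be sketched with `numpy` and the standard library (field names are taken from the schema above; this is illustrative, not the Chains implementation):

```python theme={"system"}
import base64
import json
import numpy as np

arr = np.arange(6, dtype=np.float64).reshape(3, 2)
# Serialize the way the schema describes: shape, dtype, base64 of tobytes().
payload = json.dumps({
    "shape": list(arr.shape),
    "dtype": str(arr.dtype),
    "data_b64": base64.b64encode(arr.tobytes()).decode("ascii"),
})
# Decode back into an identical array.
fields = json.loads(payload)
restored = np.frombuffer(
    base64.b64decode(fields["data_b64"]), dtype=fields["dtype"]
).reshape(fields["shape"])
assert np.array_equal(arr, restored)
```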
## Simple `bytes` fields
It is possible to add a `bytes` field to a pydantic model used in a Chain,
or as a plain argument to `run_remote`. This is useful for including
non-numpy data formats such as images or audio/video snippets.
In this case, the "normal" JSON representation does not work, and all
involved requests and Chainlet-Chainlet invocations must use binary mode.
The same steps as for arrays [above](#binary-client) apply: construct dicts
with `bytes` values and keys corresponding to the `run_remote` argument
names or the field names in the pydantic model. Then use `msgpack` to
serialize and deserialize those dicts.
Don't forget to set the `Content-Type` header, and note that
`response.json()` will not work on binary responses.
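A minimal client-side sketch for a `bytes` payload (the argument names here are hypothetical; match them to your `run_remote` argument names or pydantic field names):

```python theme={"system"}
import msgpack

# Hypothetical keys for illustration; use your actual argument/field names.
payload = {"image": b"\x89PNG...", "caption": "a cat"}
data_bytes = msgpack.dumps(payload)
# On the receiving side (or when parsing a binary response):
restored = msgpack.loads(data_bytes)
```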
# Concepts
Source: https://docs.baseten.co/development/chain/concepts
Glossary of Chains concepts and terminology
## Chainlet
A Chainlet is the basic building block of Chains. A Chainlet is a Python class
that specifies:
* A set of compute resources.
* A Python environment with software dependencies.
* A typed interface [
`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets)
for other Chainlets to call.
This is the simplest possible Chainlet. Only the
[`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets) method is
required, and we can layer in other concepts to create a more capable Chainlet.
```python theme={"system"}
import truss_chains as chains
class SayHello(chains.ChainletBase):
async def run_remote(self, name: str) -> str:
return f"Hello, {name}"
```
You can modularize your code by creating your own Chainlet subclasses;
refer to our [subclassing guide](/development/chain/subclassing).
### Remote configuration
Chainlets are meant for deployment as remote services. Each Chainlet specifies
its own requirements for compute hardware (CPU count, GPU type and count, etc)
and software dependencies (Python libraries or system packages). This
configuration is built into a Docker image automatically as part of the
deployment process.
When no configuration is provided, the Chainlet will be deployed on a basic
instance with one vCPU, 2GB of RAM, no GPU, and a standard set of Python and
system packages.
Configuration is set using the
[`remote_config`](/reference/sdk/chains#remote-configuration) class variable
within the Chainlet:
```python theme={"system"}
import truss_chains as chains
class MyChainlet(chains.ChainletBase):
remote_config = chains.RemoteConfig(
docker_image=chains.DockerImage(
pip_requirements=["torch==2.3.0", ...]
),
compute=chains.Compute(gpu="H100", ...),
assets=chains.Assets(secret_keys=["hf_access_token"], ...),
)
```
To select an exact instance type instead of specifying individual resource fields, use `instance_type`:
```python theme={"system"}
compute=chains.Compute(instance_type="H100:8x80")
```
When `instance_type` is specified, `cpu_count`, `memory`, and `gpu` fields are ignored.
See the
[remote configuration reference](/reference/sdk/chains#remote-configuration)
for a complete list of options.
### Initialization
Chainlets are implemented as classes because we often want to set up expensive
static resources once at startup and then re-use them with each invocation of
the Chainlet. For example, we only want to initialize an AI model and download
its weights once, then re-use the model every time we run inference.
We do this setup in `__init__()`, which runs exactly once when the Chainlet is
deployed or scaled up.
```python theme={"system"}
import truss_chains as chains
class PhiLLM(chains.ChainletBase):
def __init__(self) -> None:
import torch
import transformers
self._model = transformers.AutoModelForCausalLM.from_pretrained(
PHI_HF_MODEL,
torch_dtype=torch.float16,
device_map="auto",
)
self._tokenizer = transformers.AutoTokenizer.from_pretrained(
PHI_HF_MODEL,
)
```
Chainlet initialization also has two important features: context and dependency
injection of other Chainlets, explained below.
#### Context (access information)
You can add a
[`DeploymentContext`](/reference/sdk/chains#class-truss-chains-deploymentcontext)
object as an optional argument to the `__init__` method of a Chainlet.
This allows you to use secrets within your Chainlet, such as a
`hf_access_token` to access a gated model on Hugging Face (note that when
using secrets, they also need to be added to the `assets`).
```python theme={"system"}
import truss_chains as chains
class MistralLLM(chains.ChainletBase):
remote_config = chains.RemoteConfig(
...
assets = chains.Assets(secret_keys=["hf_access_token"], ...),
)
def __init__(
self,
# Adding the `context` argument, allows us to access secrets
context: chains.DeploymentContext = chains.depends_context(),
) -> None:
import transformers
# Using the secret from context to access a gated model on HF
self._model = transformers.AutoModelForCausalLM.from_pretrained(
"mistralai/Mistral-7B-Instruct-v0.2",
use_auth_token=context.secrets["hf_access_token"],
)
```
#### Depends (call other Chainlets)
The Chains framework uses the
[`chains.depends()`](/reference/sdk/chains#truss-chains-depends) function in
Chainlets' `__init__()` method to track the dependency relationship between
different Chainlets within a Chain.
This syntax, inspired by dependency injection, is used to translate local Python
function calls into calls to the remote Chainlets in production.
Once a dependency Chainlet is added with
[`chains.depends()`](/reference/sdk/chains#truss-chains-depends), the depending
Chainlet's
[`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets)
method can call it. For example, below, `HelloAll` makes calls to `SayHello`:
```python theme={"system"}
import truss_chains as chains
class HelloAll(chains.ChainletBase):
def __init__(self, say_hello_chainlet=chains.depends(SayHello)) -> None:
self._say_hello = say_hello_chainlet
    async def run_remote(self, names: list[str]) -> str:
        output = []
        for name in names:
            output.append(await self._say_hello.run_remote(name))
        return "\n".join(output)
```
## Run remote (chaining Chainlets)
The `run_remote()` method is run each time the Chainlet is called. It is the
sole public interface for the Chainlet (though you can have as many private
helper functions as you want) and its inputs and outputs must have type
annotations.
In `run_remote()` you implement the actual work of the Chainlet, such as model
inference or data chunking:
```python theme={"system"}
import truss_chains as chains
class PhiLLM(chains.ChainletBase):
    async def run_remote(self, messages: Messages) -> str:
        import torch
        # Note: the `transformers` tokenizer and `generate` calls are synchronous.
        model_inputs = self._tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self._tokenizer(model_inputs, return_tensors="pt")
        input_ids = inputs["input_ids"].to("cuda")
        with torch.no_grad():
            outputs = self._model.generate(
                input_ids=input_ids, **self._generate_args)
        output_text = self._tokenizer.decode(
            outputs[0], skip_special_tokens=True)
        return output_text
```
We recommend implementing this as an `async` method and using async APIs for
all the work (for example, downloads, vLLM or TRT inference).
It is possible to stream results back; see our
[streaming guide](/development/chain/streaming).
If `run_remote()` makes calls to other Chainlets (for example, invoking a
dependency Chainlet for each element in a list), you can benefit from
concurrent execution by making `run_remote()` an `async` method and starting
the calls as concurrent tasks with
`asyncio.ensure_future(self._dep_chainlet.run_remote(...))`.
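The concurrent fan-out can be sketched with plain `asyncio` (the dependency call is mocked as a local coroutine here; in a real Chain it would be `self._dep_chainlet.run_remote(...)`):

```python theme={"system"}
import asyncio

async def mock_dependency_call(item: str) -> str:
    # Stand-in for `self._dep_chainlet.run_remote(item)`.
    await asyncio.sleep(0.01)
    return item.upper()

async def run_remote(items: list[str]) -> list[str]:
    # Start all dependency calls as concurrent tasks, then await them together.
    tasks = [asyncio.ensure_future(mock_dependency_call(item)) for item in items]
    return list(await asyncio.gather(*tasks))

results = asyncio.run(run_remote(["a", "b", "c"]))
print(results)  # ['A', 'B', 'C']
```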
## Entrypoint
The entrypoint is called directly from the deployed Chain's API endpoint and
kicks off the entire chain. The entrypoint is also responsible for returning the
final result back to the client.
Using the
[`@chains.mark_entrypoint`](/reference/sdk/chains#truss-chains-mark-entrypoint)
decorator, one Chainlet within a file is set as the entrypoint to the chain.
```python theme={"system"}
@chains.mark_entrypoint
class HelloAll(chains.ChainletBase):
```
Optionally you can also set a Chain display name (not to be confused with
Chainlet display name) with this decorator:
```python theme={"system"}
@chains.mark_entrypoint("My Awesome Chain")
class HelloAll(chains.ChainletBase):
```
## I/O and `pydantic` data types
To make orchestrating multiple remotely deployed services possible, Chains
relies heavily on typed inputs and outputs. Values must be serialized to a safe
exchange format to be sent over the network.
The Chains framework uses the type annotations to infer how data should be
serialized and is currently restricted to JSON-compatible types. Types can
be:
* Direct type annotations for simple types such as `int`, `float`,
or `list[str]`.
* Pydantic models to define a schema for nested data structures or multiple
arguments.
An example of pydantic input and output types for a Chainlet is given below:
```python theme={"system"}
import enum
import pydantic
class Modes(enum.Enum):
MODE_0 = "MODE_0"
MODE_1 = "MODE_1"
class SplitTextInput(pydantic.BaseModel):
data: str
num_partitions: int
mode: Modes
class SplitTextOutput(pydantic.BaseModel):
parts: list[str]
part_lens: list[int]
```
Refer to the [pydantic docs](https://docs.pydantic.dev/latest/) for more
details on how
to define custom pydantic data models.
Also refer to the [guide](/development/chain/binaryio) about efficient integration
of binary and numeric data.
## Chains compared to Truss
Chains is an alternate SDK for packaging and deploying AI models. It carries over many features and concepts from Truss and gives you access to the benefits of Baseten (resource provisioning, autoscaling, fast cold starts, etc), but it is not a 1-1 replacement for Truss.
Here are some key differences:
* Rather than running `truss init` and creating a Truss in a directory, a Chain
is a single file, giving you more flexibility for implementing multi-step
model inference. Create an example with `truss chains init`.
* Configuration is done inline in typed Python code rather than in a
`config.yaml` file.
* While Chainlets are converted to Truss models when run on Baseten,
`Chainlet != TrussModel`.
Chains is designed for compatibility and incremental adoption, with a stub
function for wrapping existing deployed models.
# Deploy
Source: https://docs.baseten.co/development/chain/deploy
Deploy your Chain on Baseten
Deploying a Chain is an atomic action that deploys every Chainlet
within the Chain. Each Chainlet specifies its own remote
environment: hardware resources, Python and system dependencies, autoscaling
settings.
### Published deployment
By default, pushing a Chain creates a published deployment:
```sh theme={"system"}
truss chains push ./my_chain.py
```
Where `my_chain.py` contains the entrypoint Chainlet for your Chain.
Published deployments have access to full autoscaling settings. Each time you
push, a new deployment is created.
### Development
To create a development deployment for rapid iteration, use `--watch`:
```sh theme={"system"}
truss chains push ./my_chain.py --watch
```
Development deployments are intended for testing and can't scale past one
replica. Each time you make a development deployment, it overwrites the existing
development deployment.
Development deployments support rapid iteration with live code patching. See the
[watch guide](/development/chain/watch).
### Environments
To deploy a Chain to an environment, run:
```sh theme={"system"}
truss chains push ./my_chain.py --environment {env_name}
```
Environments are intended for live traffic and have access to full
autoscaling settings. Each time you deploy to an environment, a new deployment is
created. Once the new deployment is live, it replaces the previous deployment,
which is relegated to the published deployments list.
[Learn more](/deployment/environments) about environments.
# Architecture and design
Source: https://docs.baseten.co/development/chain/design
How to structure your Chainlets
A Chain is composed of multiple connected Chainlets working together to perform
a task.
For example, the Chain in the diagram below takes a large audio file as input.
Then it splits it into smaller chunks, transcribes each chunk in parallel
(reducing the end-to-end latency), and finally aggregates and returns the
results.
To build an efficient Chain, we recommend drafting your high-level
structure as a flowchart or diagram. This can help you identify
parallelizable units of work and steps that need different (model/hardware)
resources.
If one Chainlet creates many "sub-tasks" by calling other dependency
Chainlets (for example, in a loop over partial work items),
these calls should be made as `asyncio` tasks that run concurrently.
That way you get the most out of the parallelism that Chains offers. This
design pattern is extensively used in the
[audio transcription example](/examples/chains-audio-transcription).
While using `asyncio` is essential for performance, it can also be tricky.
Here are a few caveats to look out for:
* Executing operations in an async function that block the event loop for
more than a fraction of a second. This hinders the "flow" of processing
requests concurrently and starting RPCs to other Chainlets. Ideally use
native async APIs. Frameworks like vLLM or triton server offer such APIs,
similarly file downloads can be made async and you might find
[`AsyncBatcher`](https://github.com/hussein-awala/async-batcher) useful.
If there is no async support, consider running blocking code in a
thread/process pool (as an attribute of a Chainlet).
* Creating async tasks (e.g. with `asyncio.ensure_future`) does not start
the task *immediately*. In particular, when starting several tasks in a loop,
`ensure_future` must be alternated with operations that yield to the event
loop so that the tasks can start. If the loop is not `async for` and
contains no other `await` statements, a "dummy" await can be added, for example
`await asyncio.sleep(0)`. This allows the tasks to start concurrently.
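A minimal, stdlib-only sketch of this pattern: without the `await asyncio.sleep(0)`, the tasks would only begin running once the loop finishes and something awaits them.

```python theme={"system"}
import asyncio

async def work(i: int) -> int:
    return i * 2

async def main() -> list[int]:
    tasks = []
    for i in range(3):
        tasks.append(asyncio.ensure_future(work(i)))
        # Yield to the event loop so the task just created can start
        # before the loop continues with other work.
        await asyncio.sleep(0)
    return [await t for t in tasks]

results = asyncio.run(main())
print(results)  # [0, 2, 4]
```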
# Engine-Builder LLM Models
Source: https://docs.baseten.co/development/chain/engine-builder-models
Engine-Builder LLM models are pre-trained models that are optimized for specific inference tasks.
Baseten's [Engine-Builder](/engines/engine-builder-llm/overview) enables the deployment of optimized model inference engines. Currently, it supports TensorRT-LLM. Truss Chains allows seamless integration of these engines into structured workflows. This guide provides a quick entry point for Chains users.
## Llama 7B example
Use the `EngineBuilderLLMChainlet` base class to configure an LLM engine. The additional `engine_builder_config` field specifies the model architecture, repository, engine parameters, and more; the full options are detailed in the [Engine-Builder configuration guide](/engines/engine-builder-llm/engine-builder-config).
```python theme={"system"}
import truss_chains as chains
from truss.base import trt_llm_config, truss_config
class Llama7BChainlet(chains.EngineBuilderLLMChainlet):
remote_config = chains.RemoteConfig(
compute=chains.Compute(gpu=truss_config.Accelerator.H100),
assets=chains.Assets(secret_keys=["hf_access_token"]),
)
engine_builder_config = truss_config.TRTLLMConfiguration(
build=trt_llm_config.TrussTRTLLMBuildConfiguration(
base_model=trt_llm_config.TrussTRTLLMModel.LLAMA,
checkpoint_repository=trt_llm_config.CheckpointRepository(
source=trt_llm_config.CheckpointSource.HF,
repo="meta-llama/Llama-3.1-8B-Instruct",
),
max_batch_size=8,
max_seq_len=4096,
tensor_parallel_count=1,
)
)
```
## Differences from standard Chainlets
* No `run_remote` implementation: unlike regular Chainlets, `EngineBuilderLLMChainlet` doesn't require you to implement `run_remote()`. Instead, it is automatically wired into the deployed engine's API. All LLM Chainlets have the same function signature: `chains.EngineBuilderLLMInput` as input and a stream (`AsyncIterator`) of strings as output. Likewise, `EngineBuilderLLMChainlet`s can only be used as dependencies, but cannot have dependencies themselves.
* No `run_local` ([guide](/development/chain/localdev)) or `watch` ([guide](/development/chain/watch)): standard Chains support a local debugging mode and watch. With `EngineBuilderLLMChainlet`, local execution isn't available, and testing must be done after deployment.
For a faster dev loop on the rest of your Chain (everything except the engine builder Chainlet), you can substitute those Chainlets with stubs, like you can for an already deployed Truss model ([guide](/development/chain/stub)).
## Integrate the Engine-Builder chainlet
After defining an `EngineBuilderLLMChainlet` subclass like `Llama7BChainlet` above, you can use it as a dependency in other conventional Chainlets:
```python theme={"system"}
from typing import AsyncIterator
import truss_chains as chains
@chains.mark_entrypoint
class TestController(chains.ChainletBase):
"""Example using the Engine-Builder Chainlet in another Chainlet."""
def __init__(self, llm=chains.depends(Llama7BChainlet)) -> None:
self._llm = llm
async def run_remote(self, prompt: str) -> AsyncIterator[str]:
messages = [{"role": "user", "content": prompt}]
llm_input = chains.EngineBuilderLLMInput(messages=messages)
async for chunk in self._llm.run_remote(llm_input):
yield chunk
```
# Error Handling
Source: https://docs.baseten.co/development/chain/errorhandling
Understanding and handling Chains errors
Error handling in Chains follows the principle that the root cause "bubbles
up" to the entrypoint, which returns an error response, similar to how
Python stack traces contain all the layers from where an exception was raised
up to the main function.
Consider the case of a Chain where the entrypoint calls `run_remote` of a
Chainlet named `TextToNum` and this in turn invokes `TextReplicator`. The
respective `run_remote` methods might also use other helper functions that
appear in the call stack.
Below is an example stack trace that shows how the root cause (a
`ValueError`) is propagated up to the entrypoint's `run_remote` method (this
is what you would see as an error log):
```
Chainlet-Traceback (most recent call last):
File "/packages/itest_chain.py", line 132, in run_remote
value = self._accumulate_parts(text_parts.parts)
File "/packages/itest_chain.py", line 144, in _accumulate_parts
value += self._text_to_num.run_remote(part)
ValueError: (showing chained remote errors, root error at the bottom)
├─ Error in dependency Chainlet `TextToNum`:
│ Chainlet-Traceback (most recent call last):
│ File "/packages/itest_chain.py", line 87, in run_remote
│ generated_text = self._replicator.run_remote(data)
│ ValueError: (showing chained remote errors, root error at the bottom)
│ ├─ Error in dependency Chainlet `TextReplicator`:
│ │ Chainlet-Traceback (most recent call last):
│ │ File "/packages/itest_chain.py", line 52, in run_remote
│ │ validate_data(data)
│ │ File "/packages/itest_chain.py", line 36, in validate_data
│ │ raise ValueError(f"This input is too long: {len(data)}.")
╰ ╰ ValueError: This input is too long: 100.
```
## Exception handling and retries
The stack trace above is what you see if you don't catch the exception. It is
possible to add error handling around each remote Chainlet invocation.
Chains tries to raise the same exception class on the *caller* Chainlet as was
raised in the *dependency* Chainlet.
* Builtin exceptions (for example, `ValueError`) always work.
* Custom or third-party exceptions (for example, from `torch`) can only be
raised in the caller if they are included in the caller's dependencies as
well. If the exception class cannot be resolved, a
`GenericRemoteException` is raised instead.
Note that the *message* of re-raised exceptions is the concatenation
of the original message and the formatted stack trace of the dependency
Chainlet.
In some cases it might make sense to simply retry a remote invocation (for
example, if it failed due to transient problems like networking or other
"flaky" parts). `depends` can be configured with additional
[options](/reference/sdk/chains#truss-chains-depends) for that.
The example below shows how to add automatic retries and error handling for
the call to `TextReplicator` in `TextToNum`:
```python theme={"system"}
import truss_chains as chains
class TextToNum(chains.ChainletBase):
def __init__(
self,
replicator: TextReplicator = chains.depends(TextReplicator, retries=3),
) -> None:
self._replicator = replicator
async def run_remote(self, data: ...):
try:
generated_text = await self._replicator.run_remote(data)
except ValueError:
... # Handle error.
```
## Stack filtering
The stack trace is intended to show the user-implemented code in
`run_remote` (and user-implemented helper functions). Under the hood, the
calls from one Chainlet to another go through an HTTP connection managed by
the Chains framework, and each Chainlet itself runs as a FastAPI server with
several layers of request-handling code "above". To provide concise,
readable stack traces, all of this non-user code is filtered out.
# Your first Chain
Source: https://docs.baseten.co/development/chain/getting-started
Build and deploy two example Chains
This quickstart guide contains instructions for creating two Chains:
1. A simple CPU-only "hello world" Chain.
2. A Chain that implements Phi-3 Mini and uses it to write poems.
## Prerequisites
Install [Truss](https://pypi.org/project/truss/):
```bash theme={"system"}
uv venv && source .venv/bin/activate
uv pip install --upgrade truss
```
```bash theme={"system"}
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade truss
```
```bash theme={"system"}
python3 -m venv .venv && .venv\Scripts\activate
pip install --upgrade truss
```
You also need a [Baseten account](https://app.baseten.co/signup) with an [API key](https://app.baseten.co/settings/account/api_keys).
## Example: Hello World
Chains are written in Python files. In your working directory,
create `hello_chain/hello.py`:
```sh theme={"system"}
mkdir hello_chain
cd hello_chain
touch hello.py
```
In the file, we'll specify a basic Chain. It has two Chainlets:
* `HelloWorld`, the entrypoint, which handles the input and output.
* `RandInt`, which generates a random integer. It is used as a dependency
by `HelloWorld`.
Via the entrypoint, the Chain takes a maximum value and returns the string
"Hello World!" repeated a variable number of times.
```python hello.py theme={"system"}
import random
import truss_chains as chains
class RandInt(chains.ChainletBase):
async def run_remote(self, max_value: int) -> int:
return random.randint(1, max_value)
@chains.mark_entrypoint
class HelloWorld(chains.ChainletBase):
def __init__(self, rand_int=chains.depends(RandInt, retries=3)) -> None:
self._rand_int = rand_int
async def run_remote(self, max_value: int) -> str:
num_repetitions = await self._rand_int.run_remote(max_value)
return "Hello World! " * num_repetitions
```
### The Chainlet class-contract
Exactly one Chainlet must be marked as the entrypoint with
the [`@chains.mark_entrypoint`](/reference/sdk/chains#truss-chains-mark-entrypoint)
decorator. This Chainlet is responsible for
handling public-facing input and output for the whole Chain in response to an
API call.
A Chainlet class has a single public method,
[`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets),
which is the API endpoint for the entrypoint Chainlet and the function that
other Chainlets can use as a dependency. The
[`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets)
method must be fully type-annotated with primitive Python types or
[pydantic models](https://docs.pydantic.dev/latest/).
Chainlets cannot be naively instantiated. The only correct usages are:
1. Make one Chainlet depend on another one via the
[`chains.depends()`](/reference/sdk/chains#truss-chains-depends) directive
as an `__init__`-argument as shown above for the `RandInt` Chainlet.
2. In the [local debugging mode](/development/chain/localdev#test-a-chain-locally).
Beyond that, you can structure your code as you like, with private methods,
imports from other files, and so forth.
Keep in mind that Chainlets are intended for distributed, replicated, remote
execution, so using global variables, global state, and certain Python
features like importing modules dynamically at runtime should be avoided as
they may not work as intended.
### Deploy your Chain to Baseten
To deploy your Chain to Baseten, run:
```bash theme={"system"}
truss chains push --watch hello.py
```
The deploy command produces output like this:
```
⛓️ HelloWorld - Chainlets ⛓️
╭──────────────────────┬─────────────────────────┬─────────────╮
│ Status │ Name │ Logs URL │
├──────────────────────┼─────────────────────────┼─────────────┤
│ 💚 ACTIVE │ HelloWorld (entrypoint) │ https://... │
├──────────────────────┼─────────────────────────┼─────────────┤
│ 💚 ACTIVE │ RandInt (dep) │ https://... │
╰──────────────────────┴─────────────────────────┴─────────────╯
Deployment succeeded.
You can run the chain with:
curl -X POST 'https://chain-.../run_remote' \
-H "Authorization: Api-Key $BASETEN_API_KEY" \
-d ''
```
Wait for the status to turn `ACTIVE`, then test invoking your Chain (replace
`$INVOCATION_URL` in the command below):
```bash theme={"system"}
curl -X POST $INVOCATION_URL \
-H "Authorization: Api-Key $BASETEN_API_KEY" \
-d '{"max_value": 10}'
# "Hello World! Hello World! Hello World! "
```
## Example: Poetry with LLMs
Our second example also has two Chainlets, but is somewhat more complex and
realistic. The Chainlets are:
* `PoemGenerator`, the entrypoint, which handles the input and output and
orchestrates calls to the LLM.
* `PhiLLM`, which runs inference on Phi-3 Mini.
This Chain takes a list of words and returns a poem about each word, written by
Phi-3.
We build this Chain in a new working directory (if you are still inside
`hello_chain/`, go up one level with `cd ..` first):
```sh theme={"system"}
mkdir poetry_chain
cd poetry_chain
touch poems.py
```
A similar end-to-end code example, using Mistral as an LLM, is available in
the [examples
repo](https://github.com/basetenlabs/model/tree/main/truss-chains/examples/mistral).
### Building the LLM Chainlet
The main difference between this Chain and the previous one is that we now have
an LLM that needs a GPU and more complex dependencies.
Copy the following code into `poems.py`:
```python poems.py theme={"system"}
import asyncio
from typing import List

import pydantic
import truss_chains as chains
from truss import truss_config

PHI_HF_MODEL = "microsoft/Phi-3-mini-4k-instruct"
# This configures caching of the model weights from the Hugging Face repo
# in the Docker image that is used for deploying the Chainlet.
PHI_CACHE = truss_config.ModelRepo(
    repo_id=PHI_HF_MODEL, allow_patterns=["*.json", "*.safetensors", ".model"]
)


class Messages(pydantic.BaseModel):
    messages: List[dict[str, str]]


class PhiLLM(chains.ChainletBase):
    # `remote_config` defines the resources required for this Chainlet.
    remote_config = chains.RemoteConfig(
        docker_image=chains.DockerImage(
            # The Phi model needs some extra Python packages.
            pip_requirements=[
                "accelerate==0.30.1",
                "einops==0.8.0",
                "transformers==4.41.2",
                "torch==2.3.0",
            ]
        ),
        # The Phi model needs a GPU and more CPUs.
        compute=chains.Compute(cpu_count=2, gpu="T4"),
        # Cache the model weights in the image.
        assets=chains.Assets(cached=[PHI_CACHE]),
    )

    def __init__(self) -> None:
        # Note that the imports of the *specific* Python requirements are
        # pushed down to here. This code is only executed on the
        # remotely deployed Chainlet, not in the local environment,
        # so we don't need to install these packages in the local
        # dev environment.
        import torch
        import transformers

        self._model = transformers.AutoModelForCausalLM.from_pretrained(
            PHI_HF_MODEL,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        self._tokenizer = transformers.AutoTokenizer.from_pretrained(
            PHI_HF_MODEL,
        )
        self._generate_args = {
            "max_new_tokens": 512,
            "temperature": 1.0,
            "top_p": 0.95,
            "top_k": 50,
            "repetition_penalty": 1.0,
            "no_repeat_ngram_size": 0,
            "use_cache": True,
            "do_sample": True,
            "eos_token_id": self._tokenizer.eos_token_id,
            "pad_token_id": self._tokenizer.pad_token_id,
        }

    async def run_remote(self, messages: Messages) -> str:
        import torch

        model_inputs = self._tokenizer.apply_chat_template(
            messages.messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self._tokenizer(model_inputs, return_tensors="pt")
        input_ids = inputs["input_ids"].to("cuda")
        with torch.no_grad():
            outputs = self._model.generate(
                input_ids=input_ids, **self._generate_args
            )
        output_text = self._tokenizer.decode(
            outputs[0], skip_special_tokens=True
        )
        return output_text
```
### Building the entrypoint
Now that we have an LLM, we can use it in a poem generator Chainlet. Add the
following code to `poems.py`:
```python poems.py theme={"system"}
import asyncio


@chains.mark_entrypoint
class PoemGenerator(chains.ChainletBase):
    def __init__(self, phi_llm: PhiLLM = chains.depends(PhiLLM)) -> None:
        self._phi_llm = phi_llm

    async def run_remote(self, words: list[str]) -> list[str]:
        tasks = []
        for word in words:
            messages = Messages(
                messages=[
                    {
                        "role": "system",
                        "content": (
                            "You are a poet who writes short, "
                            "lighthearted, amusing poetry."
                        ),
                    },
                    {"role": "user", "content": f"Write a poem about {word}"},
                ]
            )
            tasks.append(
                asyncio.ensure_future(self._phi_llm.run_remote(messages))
            )
            await asyncio.sleep(0)  # Yield to the event loop, to allow starting tasks.
        return list(await asyncio.gather(*tasks))
```
Note that we use `asyncio.ensure_future` around each RPC to the LLM Chainlet.
This makes the current Python process start the remote calls concurrently,
i.e. each call is started before the previous one has finished, minimizing
the overall runtime. To await the results of all calls, `asyncio.gather` is
used, which gives us back normal Python objects.
If the LLM is hit with many concurrent requests, it can scale up automatically
(if autoscaling is configured). More advanced LLMs have batching capabilities,
so even a single instance can serve concurrent requests.
### Deploy your Chain to Baseten
To deploy your Chain to Baseten, run:
```bash theme={"system"}
truss chains push --watch poems.py
```
Wait for the status to turn `ACTIVE`, then test invoking your Chain (replace
`$INVOCATION_URL` in the command below):
```bash theme={"system"}
curl -X POST $INVOCATION_URL \
-H "Authorization: Api-Key $BASETEN_API_KEY" \
-d '{"words": ["bird", "plane", "superman"]}'
#[[
#" [INST] Generate a poem about: bird [/INST] In the quiet hush of...",
#" [INST] Generate a poem about: plane [/INST] In the vast, boundless...",
#" [INST] Generate a poem about: superman [/INST] In the realm where..."
#]]
```
# Invocation
Source: https://docs.baseten.co/development/chain/invocation
Call your deployed Chain
Once your Chain is deployed, you can call it via its API endpoint. Chains use
the same inference API as models:
* [Environment endpoint](/reference/inference-api/predict-endpoints/environments-run-remote)
* [Development endpoint](/reference/inference-api/predict-endpoints/development-run-remote)
* [Endpoint by ID](/reference/inference-api/predict-endpoints/deployment-run-remote)
Here's an example which calls the development deployment:
```python call_chain.py theme={"system"}
import os

import requests

# From the Chain overview page on Baseten,
# e.g. "https://chain-.api.baseten.co/development/run_remote"
CHAIN_URL = ""
baseten_api_key = os.environ["BASETEN_API_KEY"]
# JSON keys and types match the `run_remote` method signature.
data = {...}

resp = requests.post(
    CHAIN_URL,
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json=data,
)
print(resp.json())
```
### How to pass chain input
The data schema of the inference request corresponds to the function
signature of [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets)
in your entrypoint Chainlet.
For example, for the Hello Chain, `HelloAll.run_remote()`:
```python theme={"system"}
async def run_remote(self, names: list[str]) -> str:
```
You'd pass the following JSON payload:
```json theme={"system"}
{ "names": ["Marius", "Sid", "Bola"] }
```
That is, the keys in the JSON record match the argument names of
`run_remote()`, and the values match their types.
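If an argument is itself annotated with a pydantic model, the corresponding JSON value is a nested object matching that model's schema. A sketch with hypothetical names:

```python
import pydantic


class Query(pydantic.BaseModel):
    text: str
    top_k: int


# For an entrypoint signature like
#   async def run_remote(self, query: Query, verbose: bool) -> str
# the request payload nests the model under the argument name:
payload = {
    "query": {"text": "what is a chainlet?", "top_k": 3},
    "verbose": False,
}

# Server-side, the nested object is validated against the pydantic schema:
query = Query(**payload["query"])
```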
### Async chain inference
Like Truss models, Chains support async invocation. The [guide for
models](/inference/async) largely applies, in particular for how to wrap the
input and how to set up the webhook that processes results.
The following additional points are Chains-specific:
* Use Chain-based URLs:
  * `https://chain-{chain}.api.baseten.co/production/async_run_remote`
  * `https://chain-{chain}.api.baseten.co/development/async_run_remote`
  * `https://chain-{chain}.api.baseten.co/deployment/{deployment}/async_run_remote`
  * `https://chain-{chain}.api.baseten.co/environments/{env_name}/async_run_remote`
* Only the entrypoint is invoked asynchronously. Internal Chainlet-Chainlet
calls run synchronously.
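As a sketch of submitting an async invocation from Python (the `model_input`/`webhook_endpoint` wrapping and the `request_id` response field follow the async models guide; verify the exact field names against that guide):

```python
import requests


def build_async_payload(data: dict, webhook: str) -> dict:
    # The chain input is wrapped under `model_input`; the result is later
    # delivered to the given webhook endpoint.
    return {"model_input": data, "webhook_endpoint": webhook}


def submit_async(chain_url: str, api_key: str, data: dict, webhook: str) -> str:
    resp = requests.post(
        chain_url,  # e.g. one of the `async_run_remote` URLs above
        headers={"Authorization": f"Api-Key {api_key}"},
        json=build_async_payload(data, webhook),
    )
    resp.raise_for_status()
    # ID for correlating the webhook result with this request.
    return resp.json()["request_id"]
```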
# Local Development
Source: https://docs.baseten.co/development/chain/localdev
Iterating, Debugging, Testing, Mocking
Chains are designed for production in replicated remote deployments. But
alongside that production-ready power, we offer great local development and
deployment experiences.
Locally, a Chain is just Python files in a source tree. While that gives you a
lot of flexibility in how you structure your code, there are some constraints
and rules to follow to ensure successful distributed, remote execution in
production.
The best thing you can do while developing locally with Chains is to run your
code frequently, even if you do not have a `__main__` section: the Chains
framework runs various validations at
module initialization to help
you catch issues early.
Additionally, running `mypy` and fixing reported type errors can help you
find problems early in a rapid feedback loop, before attempting a (much
slower) deployment.
Complementary to purely local development, Chains also has a "watch" mode,
like Truss; see the [watch guide](/development/chain/watch).
## Test a Chain locally
Let's revisit our "Hello World" Chain:
```python hello_chain/hello.py theme={"system"}
import asyncio

import truss_chains as chains


# This Chainlet does the work.
class SayHello(chains.ChainletBase):
    async def run_remote(self, name: str) -> str:
        return f"Hello, {name}"


# This Chainlet orchestrates the work.
@chains.mark_entrypoint
class HelloAll(chains.ChainletBase):
    def __init__(self, say_hello_chainlet=chains.depends(SayHello)) -> None:
        self._say_hello = say_hello_chainlet

    async def run_remote(self, names: list[str]) -> str:
        tasks = []
        for name in names:
            tasks.append(asyncio.ensure_future(self._say_hello.run_remote(name)))
        return "\n".join(await asyncio.gather(*tasks))


# Test the Chain locally.
if __name__ == "__main__":
    with chains.run_local():
        hello_chain = HelloAll()
        result = asyncio.get_event_loop().run_until_complete(
            hello_chain.run_remote(["Marius", "Sid", "Bola"])
        )
        print(result)
```
When the module is run as `__main__`, local instances of the Chainlets are
created, allowing you to test the functionality of your Chain just by executing
the Python file:
```bash theme={"system"}
cd hello_chain
python hello.py
# Hello, Marius
# Hello, Sid
# Hello, Bola
```
## Mock execution of GPU Chainlets
Using `run_local()` to run your code locally requires that your development
environment have the compute resources and dependencies that each Chainlet
needs. But that often isn't possible when building with AI models.
Chains offers a workaround, mocking, to let you test the coordination and
business logic of your multi-step inference pipeline without worrying about
running the model locally.
The second example in the [getting started guide](/development/chain/getting-started)
implements a Truss Chain for generating poems with Phi-3.
This Chain has two Chainlets:
1. The `PhiLLM` Chainlet, which can run on NVIDIA GPUs such as the T4.
2. The `PoemGenerator` Chainlet, which easily runs on a CPU.
If you have an NVIDIA T4 under your desk, good for you. For the rest of us, we
can mock the `PhiLLM` Chainlet that is infeasible to run locally so that we can
quickly test the `PoemGenerator` Chainlet.
To do this, we define a mock Phi-3 model in our `__main__` module and give it
a [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets) method that
produces a test output that matches the output type we expect from the real
Chainlet. Then, we inject an instance of this mock Chainlet into our Chain:
```python poems.py theme={"system"}
if __name__ == "__main__":

    class FakePhiLLM:
        async def run_remote(self, prompt: str) -> str:
            return f"Here's a poem about {prompt.split(' ')[-1]}"

    with chains.run_local():
        poem_generator = PoemGenerator(phi_llm=FakePhiLLM())
        result = asyncio.get_event_loop().run_until_complete(
            poem_generator.run_remote(words=["bird", "plane", "superman"])
        )
        print(result)
```
And run your Python file:
```bash theme={"system"}
python poems.py
# ["Here's a poem about bird", "Here's a poem about plane", "Here's a poem about superman"]
```
### Typing of mocks
You may notice that the argument `phi_llm` expects a type `PhiLLM`, while we
pass an instance of `FakePhiLLM`. These aren't the same, which is formally a
type error.
However, this works at runtime because we constructed `FakePhiLLM` to
implement the same *protocol* as the real thing. We can make this explicit by
defining a `Protocol` as a type annotation:
```python theme={"system"}
from typing import Protocol


class PhiProtocol(Protocol):
    async def run_remote(self, data: str) -> str:
        ...
```
and changing the argument type in `PoemGenerator`:
```python theme={"system"}
@chains.mark_entrypoint
class PoemGenerator(chains.ChainletBase):
    def __init__(self, phi_llm: PhiProtocol = chains.depends(PhiLLM)) -> None:
        self._phi_llm = phi_llm
```
This is a bit more work and not needed to execute the code, but it shows how
typing consistency can be achieved if desired.
# Overview
Source: https://docs.baseten.co/development/chain/overview
Chains is a framework for building robust, performant multi-step and multi-model
inference pipelines and deploying them to production. It addresses the common
challenges of managing latency, cost and dependencies for complex workflows,
while leveraging Truss’ existing battle-tested performance, reliability and
developer toolkit.
# User guides
Guides focus on specific features and use cases. Also refer to
[getting started](/development/chain/getting-started) and
[general concepts](/development/chain/concepts).
* How to structure your Chainlets, concurrency, file structure
* Iterating, debugging, testing, mocking
* Deploy your Chain on Baseten
* Call your deployed Chain
* Live-patch deployed code
* Modularize and re-use Chainlet implementations
* Streaming outputs, reducing latency, SSEs
* Performant serialization of numeric data
* Understanding and handling Chains errors
* Integrate deployed Truss models with stubs
## From model to system
Some models are actually pipelines (for example, invoking an LLM involves sequentially
tokenizing the input, predicting the next token, and then decoding the predicted
tokens). These pipelines generally make sense to bundle together in a monolithic
deployment because they have the same dependencies, require the same compute
resources, and have a robust ecosystem of tooling to improve efficiency and
performance in a single deployment.
Many other pipelines and systems do not share these properties. Some examples
include:
* Running multiple different models in sequence.
* Chunking/partitioning a set of files and concatenating/organizing results.
* Pulling inputs from or saving outputs to a database or vector store.
Each step in these workflows has different hardware requirements, software
dependencies, and scaling needs so it doesn’t make sense to bundle them in a
monolithic deployment. That’s where Chains comes in!
## Six principles behind Chains
Chains exists to help you build multi-step, multi-model pipelines. The
abstractions that Chains introduces are based on six opinionated principles:
three for architecture and three for developer experience.
**Architecture principles**
Each step in the pipeline can set its own hardware requirements and
software dependencies, separating GPU and CPU workloads.
Each component has independent autoscaling parameters for targeted
resource allocation, removing bottlenecks from your pipelines.
Components specify a single public interface for flexible-but-safe
composition and are reusable between projects
**Developer experience principles**
Eliminate entire taxonomies of bugs by writing typed Python code and
validating inputs, outputs, module initializations, function signatures,
and even remote server configurations.
Seamless local testing and cloud deployments: test Chains locally with
support for mocking the output of any step and simplify your cloud
deployment loops by separating large model deployments from quick
updates to glue code.
Use Chains to orchestrate existing model deployments, like pre-packaged
models from Baseten’s model library, alongside new model pipelines built
entirely within Chains.
## Hello World with Chains
Here’s a simple Chain that says “hello” to each person in a list of provided
names:
```python hello_chain/hello.py theme={"system"}
import asyncio

import truss_chains as chains


# This Chainlet does the work.
class SayHello(chains.ChainletBase):
    async def run_remote(self, name: str) -> str:
        return f"Hello, {name}"


# This Chainlet orchestrates the work.
@chains.mark_entrypoint
class HelloAll(chains.ChainletBase):
    def __init__(self, say_hello_chainlet=chains.depends(SayHello)) -> None:
        self._say_hello = say_hello_chainlet

    async def run_remote(self, names: list[str]) -> str:
        tasks = []
        for name in names:
            tasks.append(asyncio.ensure_future(self._say_hello.run_remote(name)))
        return "\n".join(await asyncio.gather(*tasks))
```
This is a toy example, but it shows how Chains can be used to separate
preprocessing steps like chunking from workload execution steps. If `SayHello`
were an LLM instead of a simple string template, we could perform a much more
complex action for each person on the list.
## What to build with Chains
Connect to vector databases and augment LLM results with additional
context information without introducing overhead to the model inference
step.
Try it yourself: [RAG Chain](/examples/chains-build-rag).
Transcribe large audio files by splitting them into smaller chunks and
processing them in parallel. We've used this approach to process 10-hour
files in minutes.
Try it yourself: [Audio Transcription Chain](/examples/chains-audio-transcription).
Build powerful experiences with optimal scaling in each step like:
* AI phone calling (transcription + LLM + speech synthesis)
* Multi-step image generation (SDXL + LoRAs + ControlNets)
* Multimodal chat (LLM + vision + document parsing + audio)
Since each stage runs on its own hardware with independent autoscaling,
you can achieve better hardware utilization and save costs.
Get started by
[building and deploying your first chain](/development/chain/getting-started).
# Streaming
Source: https://docs.baseten.co/development/chain/streaming
Streaming outputs, reducing latency, SSEs
Streaming outputs is useful for returning partial results to the client, before
all data has been processed.
For example, LLM text generation happens in incremental text chunks, so the
beginning of the reply can already be sent to the client before the whole
prediction is complete.
Similarly, transcribing audio to text happens in \~30-second chunks, and the
first ones can be returned before all of them are completed.
In general, this does not reduce the overall processing time (still the same
amount of work must be done), but the initial latency to get some response
can be reduced significantly.
In some cases it might even reduce the overall time: when streaming results
internally in a Chain, subsequent processing steps can start sooner,
i.e. the operations are pipelined more efficiently.
# Low-level streaming
At a low level, streaming works by sending byte chunks (unicode strings will be
implicitly encoded) via HTTP. The most primitive way of doing this in Chains
is by implementing `run_remote` as a bytes- or string-iterator, for example:
```python theme={"system"}
from typing import AsyncIterator

import truss_chains as chains


class Streamlet(chains.ChainletBase):
    async def run_remote(self, inputs: ...) -> AsyncIterator[str]:
        async for text_chunk in make_incremental_outputs(inputs):
            yield text_chunk
```
You are free to choose what data the byte/string chunks represent: raw text
generated by an LLM, JSON strings, binary data, or anything else.
# Server-sent events (SSEs)
A possible choice is to generate chunks that comply with the
[specification](https://html.spec.whatwg.org/multipage/server-sent-events.html)
of server-sent events.
Concretely, this means sending JSON strings with `data`, `event`, and
potentially other fields, with content type `text/event-stream`.
However, the SSE specification is not opinionated regarding what exactly is
encoded in `data` and what `event` types exist. You have to define your own
schema that is useful for the client that consumes the data.
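For illustration, chunks complying with the SSE format could be assembled like this (the `delta` event name and the JSON payload are made-up examples; you define your own schema):

```python
import json
from typing import Optional


def sse_chunk(data: dict, event: Optional[str] = None) -> str:
    # One server-sent event: an optional `event:` line, then a `data:` line,
    # terminated by a blank line.
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"


# Inside a streaming `run_remote`, each yielded chunk would then be e.g.:
chunk = sse_chunk({"text": "Hello"}, event="delta")
```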
# Pydantic and Chainlet-to-Chainlet streams
While the low-level streaming above is stable, the following helper APIs for
typed streaming are only stable for intra-Chain streaming.
If you want to use them for end clients, please reach out to Baseten support
so we can discuss stable solutions.
Unlike the "raw" stream example above, Chains takes the general opinion that
input and output types should be definite, so that divergence and type
errors can be avoided.
Just like you type-annotate Chainlet inputs and outputs in the non-streaming
case, and use pydantic to manage more complex data structures, we built
tooling to bring the same benefits to streaming.
## Headers and footers
This also helps solve another challenge of streaming: you might want to
send different kinds of data at the beginning or end of a stream than in
the "main" part.
For example, if you transcribe an audio file, you might want to send many
transcription segments in the stream and, at the end, send some aggregate
information such as duration or detected languages.
We model typed streaming like this:
* \[optionally] send a chunk that conforms to the schema of a `Header` pydantic
model.
* Send 0 to N chunks each conforming to the schema of an `Item` pydantic
model.
* \[optionally] send a chunk that conforms to the schema of a `Footer` pydantic
model.
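This ordering can be sketched with plain Python. Note this is *not* the actual Chains wire format (the `streaming` helpers described in the API sections handle that for you); it only illustrates the header/items/footer framing with a made-up length-prefixed encoding:

```python
import json
import struct
from typing import Iterator, Optional


def frame(obj: dict) -> bytes:
    # Length-prefixed JSON chunk (illustrative framing only).
    payload = json.dumps(obj).encode()
    return struct.pack(">I", len(payload)) + payload


def write_stream(
    header: Optional[dict], items: list[dict], footer: Optional[dict]
) -> bytes:
    out = b""
    if header is not None:
        out += frame(header)  # [optionally] one header chunk
    for item in items:
        out += frame(item)  # 0 to N item chunks
    if footer is not None:
        out += frame(footer)  # [optionally] one footer chunk
    return out


def read_stream(data: bytes) -> Iterator[dict]:
    # Chunks arrive, and must be consumed, strictly in order.
    offset = 0
    while offset < len(data):
        (length,) = struct.unpack_from(">I", data, offset)
        offset += 4
        yield json.loads(data[offset : offset + length])
        offset += length
```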
## APIs
### StreamTypes
To have a single source of truth for the types that can be shared between
the producing Chainlet and the consuming client (either a Chainlet in the
Chain or an external client), the chains framework uses a `StreamType`-object:
```python theme={"system"}
import pydantic

from truss_chains import streaming


class MyDataChunk(pydantic.BaseModel):
    words: list[str]


STREAM_TYPES = streaming.stream_types(
    MyDataChunk, header_type=..., footer_type=...
)
```
Note that header and footer types are optional and can be left out:
```python theme={"system"}
STREAM_TYPES = streaming.stream_types(MyDataChunk)
```
### StreamWriter
Use the `STREAM_TYPES` to create a matching stream writer:
```python theme={"system"}
from typing import AsyncIterator

import pydantic
import truss_chains as chains
from truss_chains import streaming


class MyDataChunk(pydantic.BaseModel):
    words: list[str]


STREAM_TYPES = streaming.stream_types(MyDataChunk)


class Streamlet(chains.ChainletBase):
    async def run_remote(self, inputs: ...) -> AsyncIterator[bytes]:
        stream_writer = streaming.stream_writer(STREAM_TYPES)
        async for item in make_pydantic_items(inputs):
            yield stream_writer.yield_item(item)
```
If your stream types have header or footer types, corresponding
`yield_header` and `yield_footer` methods are available on the writer.
The writer serializes the pydantic data to `bytes`, so you can also
efficiently represent numeric data (see the
[binary IO guide](/development/chain/binaryio)).
### StreamReader
To consume the stream, either in another Chainlet or in an external client, a
matching `StreamReader` is created from your `StreamTypes`. Besides the
types, you connect the reader to the bytes generator that you obtain from the
remote invocation of the streaming Chainlet:
```python theme={"system"}
import truss_chains as chains
from truss_chains import streaming


class Consumer(chains.ChainletBase):
    def __init__(self, streamlet=chains.depends(Streamlet)):
        self._streamlet = streamlet

    async def run_remote(self, data: ...):
        byte_stream = self._streamlet.run_remote(data)
        reader = streaming.stream_reader(STREAM_TYPES, byte_stream)
        chunks = []
        async for data in reader.read_items():
            chunks.append(data)
```
If you use headers or footers, the reader has async `read_header` and
`read_footer` methods.
Note that the stream can only be consumed once, and you have to consume the
header, items, and footer in order.
The implementation of `StreamReader` only needs `pydantic`, no other Chains
dependencies. So you can take that implementation code in isolation and
integrate it in your client code.
# Truss Integration
Source: https://docs.baseten.co/development/chain/stub
Integrate deployed Truss models with stubs
Chains can be combined with existing Truss models using Stubs.
A Stub acts as a substitute (client-side proxy) for a remotely deployed
dependency, either a Chainlet or a Truss model. The Stub performs the remote
invocations as if it were local by taking care of the transport layer,
authentication, data serialization and retries.
Stubs can be integrated into Chainlets by passing in a URL of the deployed
model. They also require
[`context`](/development/chain/concepts#context-access-information) to be initialized
(for authentication).
```python theme={"system"}
import truss_chains as chains


class LLMClient(chains.StubBase):
    async def run_remote(self, prompt: str) -> str:
        # Call the deployed model.
        resp = await self.predict_async(
            inputs={
                "messages": [{"role": "user", "content": prompt}],
                "stream": False,
            }
        )
        # Return a string with the model output.
        return resp["output"]


LLM_URL = ...


class MyChainlet(chains.ChainletBase):
    def __init__(
        self,
        context: chains.DeploymentContext = chains.depends_context(),
    ):
        self._llm = LLMClient.from_url(LLM_URL, context)
```
There are various ways to make a call to the other deployment:
* Input as a JSON dict (like above) or a pydantic model.
* Automatic parsing of the response into a pydantic model using the
`output_model` argument.
* `predict_async` (recommended) or `predict_sync`.
* Streaming responses using `predict_async_stream`, which returns an async
bytes iterator.
* Customized with `RPCOptions`.
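For instance, the response parsing that the `output_model` argument automates can be sketched as follows (the schema and response content are hypothetical):

```python
import pydantic


# Hypothetical response schema of the deployed model.
class LLMOutput(pydantic.BaseModel):
    output: str


# Without `output_model`, you validate the raw JSON response yourself:
raw = {"output": "Here is a poem..."}
parsed = LLMOutput(**raw)

# With `output_model=LLMOutput`, the stub returns validated `LLMOutput`
# instances directly instead of raw dicts.
```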
See the
[StubBase reference](/reference/sdk/chains#class-truss-chains-stubbase)
for all APIs.
# Subclassing
Source: https://docs.baseten.co/development/chain/subclassing
Modularize and re-use Chainlet implementations
Sometimes you want to write one "main" implementation of a complicated inference
task, but then re-use it for similar variations. For example:
* Deploy it on different hardware and with different concurrency.
* Replace a dependency (for example, silence detection in audio files) with a
different implementation of that step - while keeping all other processing
the same.
* Deploy the same inference flow, but exchange the model weights used, for example,
a large and a small version of an LLM, or model weights fine-tuned to different
domains.
* Add an adapter to convert between a different input/output schema.
In all of those cases, you can create lightweight subclasses of your main
chainlet.
Below are some example code snippets. They can all be combined with each other!
### Example base class
```python theme={"system"}
import asyncio

import truss_chains as chains


class Preprocess2x(chains.ChainletBase):
    async def run_remote(self, number: int) -> int:
        return 2 * number


class MyBaseChainlet(chains.ChainletBase):
    remote_config = chains.RemoteConfig(
        compute=chains.Compute(cpu_count=1, memory="100Mi"),
        options=chains.ChainletOptions(enable_b10_tracing=True),
    )

    def __init__(self, preprocess=chains.depends(Preprocess2x)):
        self._preprocess = preprocess

    async def run_remote(self, number: int) -> float:
        return 1.0 / await self._preprocess.run_remote(number)


# Assert base behavior.
with chains.run_local():
    chainlet = MyBaseChainlet()
    result = asyncio.get_event_loop().run_until_complete(chainlet.run_remote(4))

assert result == 1 / (4 * 2)
```
### Adapter for different I/O
The base class `MyBaseChainlet` works with integer inputs and returns floats. If
you want to reuse the computation, but provide an alternative interface (e.g.
for a different client with different request/response schema), you can create
a subclass which does the I/O conversion. The actual computation is delegated to
the base classes above.
```python theme={"system"}
class ChainletStringIO(MyBaseChainlet):
    async def run_remote(self, number: str) -> str:
        return str(await super().run_remote(int(number)))


# Assert new behavior.
with chains.run_local():
    chainlet_string_io = ChainletStringIO()
    result = asyncio.get_event_loop().run_until_complete(
        chainlet_string_io.run_remote("4")
    )

assert result == "0.125"
```
### Chain with substituted dependency
The base class `MyBaseChainlet` uses preprocessing that doubles the input. If
you want to use a different variant of preprocessing - while keeping
`MyBaseChainlet.run_remote` and everything else as is - you can define a shallow
subclass of `MyBaseChainlet` where you use a different dependency
`Preprocess8x`, which multiplies by 8 instead of 2.
```python theme={"system"}
class Preprocess8x(chains.ChainletBase):
    async def run_remote(self, number: int) -> int:
        return 8 * number


class Chainlet8xPreprocess(MyBaseChainlet):
    def __init__(self, preprocess=chains.depends(Preprocess8x)):
        super().__init__(preprocess=preprocess)


# Assert new behavior.
with chains.run_local():
    chainlet_8x_preprocess = Chainlet8xPreprocess()
    result = asyncio.get_event_loop().run_until_complete(
        chainlet_8x_preprocess.run_remote(4)
    )

assert result == 1 / (4 * 8)
```
### Override remote config
If you want to re-deploy a chain, but change some deployment options, for example, run
on different hardware, you can create a subclass and override `remote_config`.
```python theme={"system"}
class Chainlet16Core(MyBaseChainlet):
    remote_config = chains.RemoteConfig(
        compute=chains.Compute(cpu_count=16, memory="100Mi"),
        options=chains.ChainletOptions(enable_b10_tracing=True),
    )
```
Be aware that `remote_config` is a class variable. In the example above we
created a completely new `RemoteConfig` value, because changing fields
*in place* would also affect the base class.
If you want to share config between the base class and subclasses, you can
define them in additional variables e.g. for the image:
```python theme={"system"}
DOCKER_IMAGE = chains.DockerImage(pip_requirements=[...], ...)


class MyBaseChainlet(chains.ChainletBase):
    remote_config = chains.RemoteConfig(docker_image=DOCKER_IMAGE, ...)


class Chainlet16Core(MyBaseChainlet):
    remote_config = chains.RemoteConfig(docker_image=DOCKER_IMAGE, ...)
```
# Watch
Source: https://docs.baseten.co/development/chain/watch
Live-patch deployed code
The [watch command](/reference/cli/chains/chains-cli#watch) (`truss chains watch`) combines
the best of local development and full deployment. `watch` runs your chain on an
exact copy of the production hardware and interface, while live code patching
lets you test changes in seconds without creating a new deployment.
To use `truss chains watch`:
1. Push a chain in development mode with `truss chains push --watch SOURCE`.
This creates a development deployment and starts watching in one step.
You can also create the deployment separately and then run
`truss chains watch SOURCE` to attach the watcher.
2. Each time you edit a file and save the changes, the watcher patches the
remote deployments. Updating the deployments might take a moment, but is
generally *much* faster than creating a new deployment.
3. Call the chain with test data via `cURL` or the playground dialog
   in the UI and observe the result and logs.
4. Repeat steps 2 and 3 until your chain behaves in the desired way.
### Selective watch
Some large ML models have a slow reload cycle (e.g. if the weights are huge).
For this case, we provide a "selective" watch option. For example, if your
chain has such a heavy model Chainlet alongside other Chainlets that contain
only business logic, you can iterate on the latter without patching and
reloading the heavy model Chainlet.
This feature is useful for advanced use cases, but must be used with caution:
if you change the code of an unwatched Chainlet, in particular its I/O types,
the deployment becomes inconsistent.
Add the Chainlet names you want to watch as a comma-separated list:
```shell theme={"system"}
truss chains watch ... --experimental-chainlet-names=ChainletA,ChainletB
```
# Concepts
Source: https://docs.baseten.co/development/concepts
Choose between Model APIs, self-deployed models, and Chains, and learn the development cycle that applies to all three.
Baseten gives you three ways to run inference, each suited to a different stage of a project. You can start with a hosted model, deploy your own when you need control, and compose multiple models into a pipeline when the problem demands it.
## Choose your approach
**Model APIs** are the fastest path to inference. You call a hosted open-source model through an OpenAI-compatible endpoint. There's no deployment step, no GPU selection, and no scaling configuration. If the model you need is in the [supported list](/development/model-apis/overview), you can make your first call in under a minute.
**Self-deployed models** give you dedicated GPUs and full control over the serving stack. You point Baseten at a model on Hugging Face, choose a GPU, and `truss push` builds an optimized container with an API endpoint. For models that need custom preprocessing, postprocessing, or architectures that the config-only path doesn't support, you write a Python `Model` class with your own inference logic. Self-deployed models support [engine selection](/engines), [autoscaling](/deployment/autoscaling/overview), and [environment promotion](/deployment/environments).
**Chains** let you orchestrate multi-step inference across independent services. Each step in a Chain runs on its own hardware with its own scaling rules. A Chain can call self-deployed models, external APIs, or any Python code. Use Chains when your workflow involves multiple models (like a RAG pipeline with retrieval and generation) or when different steps need different hardware (like CPU for preprocessing and GPU for inference).
These three approaches aren't mutually exclusive. Many projects start with a Model API call during prototyping, move to a self-deployed model for customization, and eventually wrap the model in a Chain as the system grows.
## The development cycle
Self-deployed models and Chains share the same iteration workflow. You push a development deployment, make changes with live reload, and publish when you're ready for production traffic.
1. **Push to development.** Run `truss push --watch` to create a development deployment. This is a single-replica instance with live reload enabled, designed for fast iteration rather than production traffic.
2. **Iterate with live reload.** Run `truss watch` to start a file watcher that syncs local changes to your development deployment in seconds, without rebuilding the container. You edit code, save, and see the result in the deployment logs.
3. **Publish to production.** Run `truss push` to create an immutable, production-ready deployment with full autoscaling. Promote it to an [environment](/deployment/environments) for a stable endpoint URL that doesn't change between versions.
Development deployments have slightly lower performance than published deployments and are limited to one replica. They exist to give you a fast feedback loop, not to serve real traffic.
Call hosted models through OpenAI-compatible endpoints.
Deploy your own model to dedicated GPU infrastructure.
Orchestrate multi-step inference across independent services.
# Deprecation
Source: https://docs.baseten.co/development/model-apis/deprecation
Baseten's deprecation policy for Model APIs
As open source models advance rapidly, Baseten prioritizes serving the highest quality models and deprecates specific Model APIs when stronger alternatives are available. When a model is selected for deprecation, Baseten follows this process:
1. **Announcement**
* Deprecations are announced approximately two weeks before the deprecation date.
* Documentation is updated to identify the model being deprecated and recommend a replacement.
* Affected users are contacted via email.
2. **Transition**
* The deprecated model remains fully functional until the deprecation date. You have approximately two weeks to transition using one of these options:
1. Migrate to a dedicated deployment with the deprecated model weights. [Contact us](https://www.baseten.co/talk-to-us/deprecation-inquiry/) for assistance.
2. Update your code to use an active model (a recommendation is provided in the deprecation announcement).
3. **Deprecation date**
* The model ID for the deprecated model becomes inactive and returns an error for all requests.
* A changelog notification is published with the recommended replacement.
## Planned deprecations
| Deprecation Date | Model | Recommended Replacement | Dedicated Available |
| :--------------- | :--------------- | :---------------------------------------------------- | :-----------------: |
| 2026-03-06 | Kimi K2 0905 | [Kimi K2.5](https://www.baseten.co/library/kimi-k25/) | ✅ |
| 2026-03-06 | Kimi K2 Thinking | [Kimi K2.5](https://www.baseten.co/library/kimi-k25/) | ✅ |
| 2026-03-06 | DeepSeek V3.2 | [Kimi K2.5](https://www.baseten.co/library/kimi-k25/) | ✅ |
# Model APIs
Source: https://docs.baseten.co/development/model-apis/overview
OpenAI-compatible endpoints for high-performance LLMs
Model APIs provide instant access to high-performance LLMs through OpenAI-compatible endpoints. Point your existing OpenAI SDK at Baseten's inference endpoint and start making calls, no model deployment required.
Unlike [self-deployed models](/development/model/build-your-first-model), where you configure hardware, engines, and scaling yourself, Model APIs run on shared infrastructure that Baseten manages. You get a fixed set of popular models with optimized serving out of the box. When you need a model that isn't in the supported list, or want dedicated GPUs with custom scaling, deploy your own with [Truss](/development/model/overview).
## Supported models
Enable a model from the [Model APIs page](https://app.baseten.co/model-apis/create) in the Baseten dashboard.
| Model | Slug | Context | Max output |
| ------------------- | ------------------------------ | ------- | ---------- |
| DeepSeek V3 0324 | `deepseek-ai/DeepSeek-V3-0324` | 164k | 131k |
| DeepSeek V3.1 | `deepseek-ai/DeepSeek-V3.1` | 164k | 131k |
| GLM 4.6 | `zai-org/GLM-4.6` | 200k | 200k |
| GLM 4.7 | `zai-org/GLM-4.7` | 200k | 200k |
| GLM 5 | `zai-org/GLM-5` | 203k | 203k |
| Kimi K2.5 | `moonshotai/Kimi-K2.5` | 262k | 262k |
| Minimax M2.5 | `MiniMaxAI/MiniMax-M2.5` | 204k | 204k |
| OpenAI GPT OSS 120B | `openai/gpt-oss-120b` | 128k | 128k |
## Pricing
Pricing is per million tokens.
| Model | Input | Output |
| ------------------- | -----: | -----: |
| OpenAI GPT OSS 120B | \$0.10 | \$0.50 |
| Minimax M2.5 | \$0.30 | \$1.20 |
| DeepSeek V3.1 | \$0.50 | \$1.50 |
| GLM 4.6 | \$0.60 | \$2.20 |
| GLM 4.7 | \$0.60 | \$2.20 |
| Kimi K2.5 | \$0.60 | \$3.00 |
| DeepSeek V3 0324 | \$0.77 | \$0.77 |
| GLM 5 | \$0.95 | \$3.15 |
Query the [`/v1/models`](#list-available-models) endpoint for current pricing.
## Feature support
All models support [tool calling](/engines/performance-concepts/function-calling).
Support for other features varies by model. See [Reasoning](/development/model-apis/reasoning) for configuration details.
| Model | JSON mode | Structured outputs | Reasoning | Vision |
| ------------------- | :-------: | :----------------: | ------------------ | :----: |
| DeepSeek V3 0324 | Yes | Yes | Enabled by default | No |
| DeepSeek V3.1 | No | No | Enabled by default | No |
| GLM 4.6 | Yes | Yes | Opt-in | No |
| GLM 4.7 | Yes | Yes | Opt-in | No |
| GLM 5 | Yes | Yes | No | No |
| Kimi K2.5 | Yes | Yes | Opt-in | Yes |
| Minimax M2.5 | Yes | Yes | Enabled by default | No |
| OpenAI GPT OSS 120B | Yes | Yes | Enabled by default | No |
GLM models also support `top_p` and `top_k` sampling parameters.
## Create a chat completion
If you've already completed the [quickstart](/quickstart), you have a working client. The examples below show a multi-turn conversation with a system message, which you can adapt for your application.
```python theme={"system"}
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[
        {"role": "system", "content": "You are a concise technical writer."},
        {"role": "user", "content": "What is gradient descent?"},
        {"role": "assistant", "content": "An optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease in the loss function."},
        {"role": "user", "content": "How does the learning rate affect it?"}
    ],
)

print(response.choices[0].message.content)
```
```javascript theme={"system"}
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://inference.baseten.co/v1",
  apiKey: process.env.BASETEN_API_KEY,
});

const response = await client.chat.completions.create({
  model: "deepseek-ai/DeepSeek-V3.1",
  messages: [
    { role: "system", content: "You are a concise technical writer." },
    { role: "user", content: "What is gradient descent?" },
    { role: "assistant", content: "An optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease in the loss function." },
    { role: "user", content: "How does the learning rate affect it?" }
  ],
});

console.log(response.choices[0].message.content);
```
```bash theme={"system"}
curl https://inference.baseten.co/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3.1",
    "messages": [
      {"role": "system", "content": "You are a concise technical writer."},
      {"role": "user", "content": "What is gradient descent?"},
      {"role": "assistant", "content": "An optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease in the loss function."},
      {"role": "user", "content": "How does the learning rate affect it?"}
    ]
  }'
```
Replace the model slug with any model from the supported models table.
## Features
Model APIs are compatible with the OpenAI Chat Completions API. Available features include [structured outputs](/engines/performance-concepts/structured-outputs), [tool calling](/engines/performance-concepts/function-calling), [reasoning](/development/model-apis/reasoning), [vision](/development/model-apis/vision), and streaming (`stream: true`). Not all models support every feature. See [feature support](#feature-support) for per-model availability.
For the complete parameter reference, see the [Chat Completions API documentation](/reference/inference-api/chat-completions).
## List available models
Query the `/v1/models` endpoint for the current list of models with metadata including pricing, context lengths, and supported features.
```bash theme={"system"}
curl https://inference.baseten.co/v1/models \
  -H "Authorization: Api-Key $BASETEN_API_KEY"
```
## Migrate from OpenAI
To migrate existing OpenAI code to Baseten, change three values:
1. Replace your API key with a [Baseten API key](https://app.baseten.co/settings/api_keys).
2. Change the base URL to `https://inference.baseten.co/v1`.
3. Update the model name to a Baseten model slug.
```python theme={"system"}
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1", # [!code ++]
    api_key=os.environ["BASETEN_API_KEY"] # [!code ++]
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1", # [!code ++]
    messages=[{"role": "user", "content": "Hello"}]
)
```
## Handle errors
Model APIs return standard HTTP error codes:
| Code | Meaning |
| ---- | --------------------------------------- |
| 400 | Invalid request (check your parameters) |
| 401 | Invalid or missing API key |
| 402 | Payment required |
| 404 | Model not found |
| 429 | Rate limit exceeded |
| 500 | Internal server error |
The response body contains details about the error and suggested resolutions.
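As a sketch of client-side handling, the retryable codes from the table (429 and transient 500s) can be retried with backoff. The `send_request` callable and the retry policy here are illustrative assumptions, not part of the Baseten API:

```python
import time

# Status codes from the table above worth retrying:
# 429 (rate limit exceeded) and 500 (transient server error).
RETRYABLE = {429, 500}


def call_with_retries(send_request, max_attempts=3, backoff_s=1.0):
    """Call `send_request()` (a hypothetical callable returning
    (status_code, body)) and retry transient failures with
    exponential backoff."""
    for attempt in range(max_attempts):
        status, body = send_request()
        if status == 200:
            return body
        if status not in RETRYABLE or attempt == max_attempts - 1:
            raise RuntimeError(f"Request failed with HTTP {status}: {body}")
        time.sleep(backoff_s * 2 ** attempt)


# Example with a stubbed request that fails once with 429, then succeeds.
responses = iter([(429, "rate limited"), (200, "ok")])
result = call_with_retries(lambda: next(responses), backoff_s=0.01)
```

Non-retryable codes (400, 401, 402, 404) indicate a problem with the request itself and should surface immediately rather than be retried.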
## Next steps
Control extended thinking for complex tasks
Send images and videos alongside text
Understand and configure rate limits
Complete parameter documentation
# Rate limits and budgets
Source: https://docs.baseten.co/development/model-apis/rate-limits-and-budgets
Rate limits and usage budgets for Model APIs
Baseten enforces two rate limits to ensure fair use and system stability:
* **Request rate limits**: Maximum API requests per minute.
* **Token rate limits**: Maximum tokens processed per minute (input + output combined).
Default limits vary by account status.
| Account | RPM | TPM |
| :--------------------- | ------------------------------------------: | ------------------------------------------: |
| **Basic** (unverified) | 15 | 100,000 |
| **Basic** (verified) | 120 | 500,000 |
| **Pro** | 120 | 1,000,000 |
| **Enterprise** | [Custom](https://www.baseten.co/talk-to-us) | [Custom](https://www.baseten.co/talk-to-us) |
If you exceed these limits, the API returns a `429 Too Many Requests` error.
To request a rate limit increase, [contact us](https://www.baseten.co/talk-to-us/increase-rate-limits/).
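To stay under the per-minute request limit on the client side, a simple sliding-window throttle can help. This is an illustrative sketch, not a Baseten SDK feature:

```python
import collections
import time


class RateLimiter:
    """Track request timestamps and block until a slot in the
    60-second window is free."""

    def __init__(self, max_per_minute):
        self.max_per_minute = max_per_minute
        self.timestamps = collections.deque()

    def acquire(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop timestamps older than the 60-second window.
        while self.timestamps and now - self.timestamps[0] >= 60:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_per_minute:
            # Sleep until the oldest request leaves the window, then retry.
            time.sleep(60 - (now - self.timestamps[0]))
            return self.acquire()
        self.timestamps.append(now)


# With a 120 RPM limit (Basic verified / Pro tier), call
# limiter.acquire() before each request.
limiter = RateLimiter(max_per_minute=120)
```

Token-per-minute limits are harder to pre-compute because output length varies; for those, reacting to `429` responses with backoff is the practical approach.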
***
## Set budgets
Budgets let you control Model API usage and avoid unexpected costs. Budgets apply only to Model APIs, not dedicated deployments. Your team receives email notifications at 75%, 90%, and 100% of budget.
### Enforce budgets
Budgets can be enforced or non-enforced:
* **Enforced**: Requests are rejected when the budget is reached.
* **Not enforced**: You receive notifications but remain responsible for costs over the budget.
# Reasoning
Source: https://docs.baseten.co/development/model-apis/reasoning
Control extended thinking for reasoning-capable models
Some Model APIs support *extended thinking*, where the model reasons through a problem before producing a final answer.
The reasoning process generates additional tokens that appear in a separate `reasoning_content` field, distinct from the final response.
## Supported models
| Model | Slug | Reasoning |
| ------------------- | ------------------------------ | ------------------------------- |
| DeepSeek V3.1 | `deepseek-ai/DeepSeek-V3.1` | Enabled by default |
| DeepSeek V3 0324 | `deepseek-ai/DeepSeek-V3-0324` | Enabled by default |
| Minimax M2.5 | `MiniMaxAI/MiniMax-M2.5` | Enabled by default |
| OpenAI GPT OSS 120B | `openai/gpt-oss-120b` | Enabled by default |
| Kimi K2.5 | `moonshotai/Kimi-K2.5` | Opt-in via `chat_template_args` |
| GLM 4.7 | `zai-org/GLM-4.7` | Opt-in via `chat_template_args` |
| GLM 4.6 | `zai-org/GLM-4.6` | Opt-in via `chat_template_args` |
GPT OSS 120B also supports [`reasoning_effort`](#control-reasoning-depth).
Models not listed here don't support reasoning.
## Enable thinking
Enable thinking for Kimi K2.5 and GLM models by passing `chat_template_args`.
Pass `chat_template_args` through `extra_body` since it extends the standard OpenAI API:
```python theme={"system"}
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[{"role": "user", "content": "What is the sum of the first 100 prime numbers?"}],
    extra_body={"chat_template_args": {"enable_thinking": True}},
    max_tokens=4096,
    stream=True,
)
```
Include `chat_template_args` directly in the request options:
```javascript theme={"system"}
const response = await client.chat.completions.create({
  model: "moonshotai/Kimi-K2.5",
  messages: [{ role: "user", content: "What is the sum of the first 100 prime numbers?" }],
  chat_template_args: { enable_thinking: true },
  max_tokens: 4096,
  stream: true,
});
```
Include `chat_template_args` in the JSON request body:
```bash theme={"system"}
curl https://inference.baseten.co/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "moonshotai/Kimi-K2.5",
    "messages": [{"role": "user", "content": "What is the sum of the first 100 prime numbers?"}],
    "chat_template_args": {"enable_thinking": true},
    "max_tokens": 4096,
    "stream": false
  }'
```
## Control reasoning depth
The `reasoning_effort` parameter controls how thoroughly the model reasons through a problem.
Currently, only GPT OSS 120B supports this parameter.
| Value | Behavior |
| -------- | ----------------------------------------- |
| `low` | Faster responses, less thorough reasoning |
| `medium` | Balanced (default) |
| `high` | Slower responses, more thorough reasoning |
Pass `reasoning_effort` through `extra_body` since it extends the standard OpenAI API:
```python theme={"system"}
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ.get("BASETEN_API_KEY")
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "user", "content": "What is the sum of the first 100 prime numbers?"}
    ],
    extra_body={"reasoning_effort": "high"} # [!code ++]
)

print(response.choices[0].message.content)
```
Include `reasoning_effort` directly in the request options:
```javascript theme={"system"}
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://inference.baseten.co/v1",
  apiKey: process.env.BASETEN_API_KEY,
});

const response = await client.chat.completions.create({
  model: "openai/gpt-oss-120b",
  messages: [
    { role: "user", content: "What is the sum of the first 100 prime numbers?" }
  ],
  reasoning_effort: "high" // [!code ++]
});

console.log(response.choices[0].message.content);
```
Include `reasoning_effort` in the JSON request body:
```bash theme={"system"}
curl https://inference.baseten.co/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "What is the sum of the first 100 prime numbers?"}],
    "reasoning_effort": "high"
  }'
```
### Parse the response
The model's thinking process appears in `reasoning_content`, separate from the final answer in `content`. Both fields are returned on the message object.
```json theme={"system"}
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "The sum of the first 100 prime numbers is 24,133.",
        "reasoning_content": "Let me work through this step by step. The first prime number is 2..."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 90,
    "completion_tokens": 3423,
    "total_tokens": 3513
  }
}
```
Reasoning tokens are included in `completion_tokens` and count toward your total usage and billing.
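Reading both fields from a parsed response might look like this, using a plain dict shaped like the JSON above (with the OpenAI SDK you would read the same attributes off `response.choices[0].message`):

```python
# A response dict shaped like the example above.
response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "The sum of the first 100 prime numbers is 24,133.",
                "reasoning_content": "Let me work through this step by step...",
            }
        }
    ],
    "usage": {"prompt_tokens": 90, "completion_tokens": 3423, "total_tokens": 3513},
}

message = response["choices"][0]["message"]
# `reasoning_content` is absent on non-reasoning models, so default to "".
thinking = message.get("reasoning_content", "")
answer = message["content"]

# Reasoning tokens are counted inside completion_tokens, so this is the
# full billed output.
billed_output_tokens = response["usage"]["completion_tokens"]
```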
### Decide when to reason
Reasoning improves quality for tasks that benefit from step-by-step thinking: mathematical calculations, multi-step logic problems, code generation with complex requirements, and analysis requiring multiple considerations.
For straightforward tasks like simple Q\&A or text generation, reasoning adds latency and token cost without improving quality. In these cases, use a model without reasoning support or set `reasoning_effort` to `low`.
# Vision
Source: https://docs.baseten.co/development/model-apis/vision
Send images and videos alongside text to vision-capable models
Model APIs support both text and vision inputs, but multimodal capability
depends on the underlying model. Vision-capable models accept images alongside
text in the same request, using the OpenAI-compatible `image_url` content type.
The model processes both modalities together, so it can answer questions about
image content, compare multiple images, or extract structured data from
screenshots.
Not all models support vision. Check the table below before sending image
inputs.
## Supported models
| Model | Slug |
| --------- | ---------------------- |
| Kimi K2.5 | `moonshotai/Kimi-K2.5` |
## Send a vision request
Use the `image_url` content type to include images in your messages.
```python theme={"system"}
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe the natural environment in the image.",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"
                    },
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```
```javascript theme={"system"}
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://inference.baseten.co/v1",
  apiKey: process.env.BASETEN_API_KEY,
});

const response = await client.chat.completions.create({
  model: "moonshotai/Kimi-K2.5",
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "Describe the natural environment in the image.",
        },
        {
          type: "image_url",
          image_url: {
            url: "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png",
          },
        },
      ],
    },
  ],
});

console.log(response.choices[0].message.content);
```
```bash theme={"system"}
curl https://inference.baseten.co/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "moonshotai/Kimi-K2.5",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe the natural environment in the image."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"
            }
          }
        ]
      }
    ]
  }'
```
## Image constraints
Pass images as URLs or as base64-encoded data.
| Constraint | Limit |
| -------------------------------------- | ----- |
| Max size per image (URL) | 10 MB |
| Max total media size per request (URL) | 50 MB |
| Max images per request | 8 |
| Max request size (base64) | 50 MB |
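To send a local file instead of a URL, base64-encode it into a data URL. A minimal sketch using only the standard library (the PNG bytes here are a stand-in for a real file read):

```python
import base64


def image_content_part(image_bytes, mime_type="image/png"):
    """Build an OpenAI-style `image_url` content part from raw bytes,
    using a base64 data URL instead of a remote URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{b64}"},
    }


# Usage: read a local file and attach it to a message's content list.
# with open("photo.png", "rb") as f:
#     part = image_content_part(f.read())
part = image_content_part(b"\x89PNG...")
```

Keep the 50 MB base64 request limit from the table in mind: encoding inflates payload size by roughly a third.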
## Pricing
There is no additional per-image fee. Images are converted to input tokens and priced at the model's standard input rate. Higher resolution images produce more tokens and cost more to process.
The exact conversion from pixels to tokens depends on the model. For example, Kimi K2.5 divides each image into 14×14 pixel tiles where each tile becomes one input token. At Kimi K2.5's input rate of \$0.60 per million tokens:
| Image resolution | Tiles | Input tokens | Cost at \$0.60/M |
| ---------------- | -----: | -----------: | ---------------: |
| 256×256 | 324 | 324 | \$0.0002 |
| 512×512 | 1,296 | 1,296 | \$0.0008 |
| 1024×1024 | 5,329 | 5,329 | \$0.0032 |
| 1920×1080 | 10,234 | 10,234 | \$0.0061 |
For videos, token count scales with both resolution and the number of sampled frames.
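The tile arithmetic above can be sketched as follows. Flooring each dimension to whole tiles is an assumption that reproduces the square-image rows of the table; the exact rounding scheme is model-specific:

```python
def kimi_image_tokens(width, height, tile_px=14):
    """Approximate input tokens for an image under a 14x14-pixel tiling
    scheme, flooring each dimension to whole tiles (assumed rounding)."""
    return (width // tile_px) * (height // tile_px)


def image_cost_usd(tokens, rate_per_million=0.60):
    """Cost at the model's standard input rate (Kimi K2.5: $0.60/M)."""
    return tokens * rate_per_million / 1_000_000


tokens = kimi_image_tokens(512, 512)  # 36 * 36 = 1296 tiles
cost = image_cost_usd(tokens)         # roughly $0.0008
```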
# b10cache
Source: https://docs.baseten.co/development/model/b10cache
Persist data across replicas or deployments
### Deprecated
b10cache is deprecated. For model weight caching, use the new [`weights`](/development/model/bdn) configuration which offers faster cold starts through multi-tier caching.
For `torch.compile` caching, see [Torch Compile Cache](/development/model/torch-compile-cache).
### Early Access
Please contact our [support team](mailto:support@baseten.co) for access to b10cache.
Deployments often produce cache files that are useful to other replicas. For example, `torch.compile` writes a cache that speeds up subsequent `torch.compile` calls on the same function, which can reduce cold start times on other replicas.
**These files can be stored via b10cache**. b10cache is a volume mounted over the network onto each of your pods. Files can be stored in two locations:
#### 1. `/cache/org/`
This directory is shared and can be written to or read by every pod you deploy. Move a file here and all pods can access it.
#### 2. `/cache/model/`
This directory is shared by every pod within the scope of your deployment. This is excellent for keeping filesystems clean and limiting access.
### Not a persistent object storage
While b10cache is very reliable, it should not be used as persistent object storage or a database. **It should be considered a cache** that can be shared by deployments, so there should always be a fallback plan in case the b10cache path does not exist.
See two features built on b10cache:
1. [*model cache*](/development/model/model-cache)
2. [*torch compile cache*](/development/model/torch-compile-cache)
# Base Docker images
Source: https://docs.baseten.co/development/model/base-images
A guide to configuring a base image for your truss
Truss uses containerized environments to ensure consistent model execution across deployments. While the default Truss image works for most cases, you may need a custom base image to meet specific package or system requirements.
## Setting a base image in `config.yaml`
Specify a custom base image in `config.yaml`:
```yaml config.yaml theme={"system"}
base_image:
  image:
  python_executable_path:
```
* `image`: The Docker image to use.
* `python_executable_path`: The path to the Python binary inside the container.
### Example: NVIDIA NeMo model
Using a custom image to deploy the [NVIDIA NeMo TitaNet](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/titanet_large) model:
```yaml config.yaml theme={"system"}
base_image:
  image: nvcr.io/nvidia/nemo:23.03
  python_executable_path: /usr/bin/python
  apply_library_patches: true
requirements:
  - PySoundFile
resources:
  accelerator: T4
  cpu: 2500m
  memory: 4512Mi
  use_gpu: true
secrets: {}
system_packages:
  - python3.8-venv
```
## Using private base images
If your base image is private, ensure that you have configured your model to use a [private registry](/development/model/private-registries).
## Creating a custom base image
You can build a new base image using Truss’s base images as a foundation. Available images are listed on [Docker Hub](https://hub.docker.com/r/baseten/truss-server-base/tags).
#### Example: Customizing a Truss base image
```Dockerfile Dockerfile theme={"system"}
FROM baseten/truss-server-base:3.11-gpu-v0.7.16
RUN pip uninstall cython -y
RUN pip install cython==0.29.30
```
#### Building & pushing your custom image
Ensure Docker is installed and running. Then, build, tag, and push your image:
```sh theme={"system"}
docker build -t my-custom-base-image:0.1 .
docker tag my-custom-base-image:0.1 your-docker-username/my-custom-base-image:0.1
docker push your-docker-username/my-custom-base-image:0.1
```
# Baseten Delivery Network
Source: https://docs.baseten.co/development/model/bdn
Optimize cold starts with multi-tier caching and data delivery
Baseten Delivery Network (BDN) reduces cold start times by mirroring your model weights to Baseten's infrastructure and caching them close to your pods.
Instead of downloading hundreds of gigabytes from Hugging Face, S3, or GCS on every scale-up, BDN mirrors weights once and serves them from multi-tier caches.
Configure BDN using the `weights` key in your config.
This works with both `Model` class deployments and [custom Docker images](/development/model/custom-server).
Add weights to a new model
Use with vLLM, SGLang, and more
Move from `model_cache`
## Quick start
Add a `weights` section to your `config.yaml`:
```yaml config.yaml theme={"system"}
weights:
  - source: "hf://meta-llama/Llama-3.1-8B@main"
    mount_location: "/models/llama"
    allow_patterns: ["*.safetensors", "config.json"]
    ignore_patterns: ["*.md", "*.txt"]
```
| Field | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `source` | Where to fetch weights from. Supports [Hugging Face](#hugging-face), [S3](#aws-s3), [GCS](#google-cloud-storage), and [R2](#cloudflare-r2). |
| `mount_location` | Absolute path where weights appear in your container. |
| `allow_patterns` | Optional. Only download files matching these patterns. Useful for skipping large files you don't need. See [filtering files](#filter-files-with-patterns). |
| `ignore_patterns` | Optional. Exclude files matching these patterns. Useful for skipping documentation or unused formats. |
For private or gated models, add an `auth` section to reference a [Baseten secret](/development/model/secrets) with your credentials.
### Accessing weights in your model
When your model starts, weights are already downloaded and available at your `mount_location`.
The directory structure from the source is preserved:
```
/models/llama/                        # Your mount_location
├── config.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── ...
├── model.safetensors.index.json
├── tokenizer.json
├── tokenizer_config.json
└── original/                         # Subfolders are preserved
    ├── consolidated.00.pth
    └── params.json
```
Load weights directly from this path in your `load()` method. No download code needed:
```python model.py theme={"system"}
import torch
from transformers import AutoModelForCausalLM


class Model:
    def load(self):
        # Weights are already available at mount_location
        self._model = AutoModelForCausalLM.from_pretrained(
            "/models/llama",
            torch_dtype=torch.float16,
            device_map="auto"
        )
```
The mount is read-only.
Weights are fetched during `truss push` and cached, so cold starts only read from local or nearby caches.
***
## Configuration reference
### `weights`
A list of weight sources to mount into your model container.
```yaml config.yaml theme={"system"}
weights:
  - source: "hf://meta-llama/Llama-3.1-8B@main"
    mount_location: "/models/llama"
    auth:
      auth_method: CUSTOM_SECRET
      auth_secret_name: "hf_access_token"
    allow_patterns: ["*.safetensors", "config.json"]
    ignore_patterns: ["*.md", "*.txt"]
```
URI specifying where to fetch weights from. Supported schemes:
* `hf://`: Hugging Face Hub.
* `s3://`: AWS S3.
* `gs://`: Google Cloud Storage.
* `r2://`: Cloudflare R2.
For Hugging Face sources, specify a revision using `@revision` suffix (branch, tag, or commit SHA).
Absolute path where weights will be mounted in your container. **Must start with `/`**.
```yaml theme={"system"}
mount_location: "/models/llama" # Correct
mount_location: "models/llama" # Wrong - not absolute
```
**`auth`**: Authentication configuration for accessing private weight sources. See [Source types and authentication](#source-types-and-authentication) for the expected format for each source type.
* `auth_method`: The authentication method. Use `CUSTOM_SECRET` for secret-based auth, `AWS_OIDC` for AWS OIDC, or `GCP_OIDC` for GCP OIDC.
* `auth_secret_name`: Name of a [Baseten secret](/development/model/secrets) containing credentials (required for `CUSTOM_SECRET`).
**`allow_patterns`**: Optional. File patterns to include. Uses Unix shell-style wildcards. Only matching files will be downloaded.
```yaml theme={"system"}
allow_patterns:
- "*.safetensors"
- "config.json"
- "tokenizer.*"
```
**`ignore_patterns`**: Optional. File patterns to exclude. Uses Unix shell-style wildcards. Matching files will be skipped.
```yaml theme={"system"}
ignore_patterns:
- "*.md"
- "*.txt"
- "*.bin" # Skip PyTorch .bin files if using safetensors
```
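As a mental model, the pattern filtering behaves like Unix shell matching via Python's `fnmatch`. This is an illustrative sketch of the semantics, not Baseten's actual implementation:

```python
from fnmatch import fnmatch

def should_download(path, allow_patterns=None, ignore_patterns=None):
    """Sketch of allow/ignore semantics: a file is mirrored if it
    matches allow_patterns (when given) and matches no ignore_patterns."""
    if allow_patterns and not any(fnmatch(path, p) for p in allow_patterns):
        return False
    return not (ignore_patterns and any(fnmatch(path, p) for p in ignore_patterns))

should_download("model-00001-of-00004.safetensors", ["*.safetensors"], ["*.md"])  # True
should_download("README.md", ["*.safetensors"], ["*.md"])                         # False
```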
***
## Source types and authentication
For private weight sources, create a [Baseten secret](/development/model/secrets) with the appropriate credentials.
Manage secrets in your [Baseten settings](https://app.baseten.co/settings/secrets).
### Hugging Face
Download weights from Hugging Face Hub repositories.
```yaml config.yaml theme={"system"}
weights:
- source: "hf://meta-llama/Llama-3.1-8B@main"
mount_location: "/models/llama"
auth:
auth_method: CUSTOM_SECRET
auth_secret_name: "hf_access_token" # Required for private/gated repos
allow_patterns: ["*.safetensors", "config.json"]
```
**Format:** `hf://owner/repo@revision`
* `owner/repo`: The Hugging Face repository.
* `@revision`: Branch, tag, or commit SHA.
**Revision pinning:** When you use a branch name like `@main`, Baseten resolves it to the specific commit SHA at deploy time and mirrors those exact files. Your deployment stays pinned to that version. Subsequent scale-ups won't pick up new commits. To update to newer weights, push a new deployment.
**Authentication:** Hugging Face API token (plain text)
| Secret Name | Secret Value |
| ----------------- | ------------------------ |
| `hf_access_token` | `hf_xxxxxxxxxxxxxxxx...` |
Get your token from [Hugging Face settings](https://huggingface.co/settings/tokens).
### AWS S3
Download weights from an S3 bucket.
AWS supports using either [IAM credentials](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html) or OIDC for S3 authentication.
#### AWS OIDC (Recommended)
OIDC provides short-lived, narrowly scoped tokens for secure authentication without managing long-lived credentials.
1. [Configure AWS to trust the Baseten OIDC provider](/organization/oidc#aws-setup) and create an IAM role with S3 permissions.
2. Add the OIDC configuration to your `config.yaml`:
```yaml config.yaml theme={"system"}
weights:
- source: "s3://my-bucket/models/custom-weights"
mount_location: "/models/custom"
auth:
auth_method: AWS_OIDC
aws_oidc_role_arn: arn:aws:iam::123456789012:role/baseten-s3-access
aws_oidc_region: us-west-2
```
No secrets needed! The `aws_oidc_role_arn` and `aws_oidc_region` are not sensitive and can be committed to your repository.
See the [OIDC authentication guide](/organization/oidc) for detailed setup instructions and best practices.
#### IAM credentials
```yaml config.yaml theme={"system"}
weights:
- source: "s3://my-bucket/models/custom-weights"
mount_location: "/models/custom"
auth:
auth_method: CUSTOM_SECRET
auth_secret_name: "aws_credentials"
```
**Format:** `s3://bucket/path`
**Authentication:** JSON with AWS credentials
| Secret Name | Secret Value |
| ----------------- | --------------------------------------------------------------------------------------------- |
| `aws_credentials` | `{"aws_access_key_id": "AKIA...", "aws_secret_access_key": "...", "aws_region": "us-west-2"}` |
All three fields (`aws_access_key_id`, `aws_secret_access_key`, and `aws_region`) are required.
### Google Cloud Storage
Download weights from a GCS bucket.
GCP supports using either [service accounts](https://cloud.google.com/iam/docs/service-account-overview) or OIDC for GCS authentication.
#### GCP OIDC (Recommended)
OIDC provides short-lived, narrowly scoped tokens for secure authentication without managing long-lived credentials.
1. [Configure GCP Workload Identity](/organization/oidc#google-cloud-setup) to trust the Baseten OIDC provider and grant GCS permissions.
2. Add the OIDC configuration to your `config.yaml`:
```yaml config.yaml theme={"system"}
weights:
- source: "gs://my-bucket/models/weights"
mount_location: "/models/gcs-weights"
auth:
auth_method: GCP_OIDC
gcp_oidc_service_account: baseten-oidc@my-project.iam.gserviceaccount.com
gcp_oidc_workload_id_provider: projects/123456789/locations/global/workloadIdentityPools/baseten-pool/providers/baseten-provider
```
No secrets needed! The service account and workload identity provider are not sensitive and can be committed to your repository.
See the [OIDC authentication guide](/organization/oidc) for detailed setup instructions and best practices.
#### Service account
```yaml config.yaml theme={"system"}
weights:
- source: "gs://my-bucket/models/weights"
mount_location: "/models/gcs-weights"
auth:
auth_method: CUSTOM_SECRET
auth_secret_name: "gcp_service_account"
```
**Format:** `gs://bucket/path`
**Authentication:** GCP service account JSON key
| Secret Name | Secret Value |
| --------------------- | ------------------------------------------------------- |
| `gcp_service_account` | `{"type": "service_account", "project_id": "...", ...}` |
Download a JSON key from the GCP Console under **IAM & Admin > Service Accounts**.
### Cloudflare R2
Download weights from a Cloudflare R2 bucket.
```yaml config.yaml theme={"system"}
weights:
- source: "r2://abc123def.my-bucket/models/weights"
mount_location: "/models/r2-weights"
auth:
auth_method: CUSTOM_SECRET
auth_secret_name: "r2_credentials"
```
**Format:** `r2://account_id.bucket/path`
* `account_id`: Your Cloudflare account ID.
* `bucket`: R2 bucket name, separated from account\_id by a period.
* `path`: Path prefix within the bucket.
**Authentication:** JSON with R2 API credentials
| Secret Name | Secret Value |
| ---------------- | -------------------------------------------------------------- |
| `r2_credentials` | `{"aws_access_key_id": "...", "aws_secret_access_key": "..."}` |
Get your R2 API tokens from the Cloudflare dashboard under R2 > Manage R2 API Tokens.
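The `r2://account_id.bucket/path` format can be parsed by splitting on the first slash and first period. A sketch, assuming the account ID itself never contains a period:

```python
def parse_r2_source(uri: str):
    """Split an r2:// source URI into (account_id, bucket, path).

    Illustrative only: the first '.' separates the account ID
    from the bucket name, and the first '/' starts the path prefix.
    """
    prefix = "r2://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an R2 source: {uri}")
    host, _, path = uri[len(prefix):].partition("/")
    account_id, _, bucket = host.partition(".")
    return account_id, bucket, path

parse_r2_source("r2://abc123def.my-bucket/models/weights")
```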
***
## Migration from `model_cache`
`model_cache` is deprecated. Migrate to `weights` for faster cold starts through multi-tier caching.
### Automated migration with `truss migrate`
The `truss migrate` CLI command automatically converts `model_cache` configurations:
```bash theme={"system"}
# Run in your Truss directory
truss migrate
# Or specify a directory
truss migrate /path/to/truss
```
The command will:
1. Show a colorized diff of the proposed changes.
2. Prompt for confirmation before applying.
3. Create a backup of your original `config.yaml`.
4. Warn about any `model.py` path changes needed.
### Manual migration reference
**From `model_cache` to `weights`:**
| `model_cache` | `weights` |
| ----------------------- | ----------------------------------------- |
| `repo_id: "owner/repo"` | `source: "hf://owner/repo@rev"` |
| `revision: "main"` | Included in source URI as `@main` |
| `kind: "s3"` | Prefix: `s3://bucket/path` |
| `kind: "gcs"` | Prefix: `gs://bucket/path` |
| `kind: "r2"` | Prefix: `r2://account_id.bucket/path` |
| `volume_folder: "name"` | `mount_location: "/app/model_cache/name"` |
| `runtime_secret_name` | `auth.auth_secret_name` |
| `allow_patterns` | `allow_patterns` (same) |
| `ignore_patterns` | `ignore_patterns` (same) |
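The mapping above can be sketched as a small converter for Hugging Face sources (illustrative only; `truss migrate` handles this for you, including edge cases this sketch ignores):

```python
def model_cache_to_weights(entry: dict) -> dict:
    """Convert one model_cache entry (HF source) to a weights entry."""
    # repo_id + revision collapse into a single source URI
    source = f"hf://{entry['repo_id']}@{entry.get('revision', 'main')}"
    weights = {
        "source": source,
        # volume_folder becomes an absolute mount_location
        "mount_location": f"/app/model_cache/{entry['volume_folder']}",
    }
    # runtime_secret_name moves into an auth block
    if "runtime_secret_name" in entry:
        weights["auth"] = {
            "auth_method": "CUSTOM_SECRET",
            "auth_secret_name": entry["runtime_secret_name"],
        }
    # Pattern fields carry over unchanged; use_volume is dropped
    for key in ("allow_patterns", "ignore_patterns"):
        if key in entry:
            weights[key] = entry[key]
    return weights
```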
**Example migration.** The new `weights` configuration:
```yaml config.yaml theme={"system"}
weights:
- source: "hf://meta-llama/Llama-3.1-8B@main"
mount_location: "/app/model_cache/llama"
allow_patterns:
- "*.safetensors"
- "config.json"
auth:
auth_method: CUSTOM_SECRET
auth_secret_name: hf_access_token
```
The deprecated `model_cache` configuration it replaces:
```yaml config.yaml theme={"system"}
model_cache:
- repo_id: meta-llama/Llama-3.1-8B
revision: main
use_volume: true
volume_folder: llama
allow_patterns:
- "*.safetensors"
- "config.json"
runtime_secret_name: hf_access_token
```
### Chains migration
For Truss Chains, update `Assets.cached` to `Assets.weights` in your Python code:
The new `Assets.weights` version:
```python theme={"system"}
import truss_chains as chains
from truss.base import truss_config
class MyChainlet(chains.ChainletBase):
remote_config = chains.RemoteConfig(
assets=chains.Assets(
weights=[
truss_config.WeightsSource(
source="hf://meta-llama/Llama-3.1-8B@main",
mount_location="/app/model_cache/llama",
auth_secret_name="hf_access_token",
allow_patterns=["*.safetensors", "config.json"],
)
],
secret_keys=["hf_access_token"],
),
)
```
The deprecated `Assets.cached` version it replaces:
```python theme={"system"}
import truss_chains as chains
from truss.base import truss_config
class MyChainlet(chains.ChainletBase):
remote_config = chains.RemoteConfig(
assets=chains.Assets(
cached=[
truss_config.ModelRepo(
repo_id="meta-llama/Llama-3.1-8B",
revision="main",
use_volume=True,
volume_folder="llama",
allow_patterns=["*.safetensors", "config.json"],
runtime_secret_name="hf_access_token",
)
],
secret_keys=["hf_access_token"],
),
)
```
**Key changes:**
* `ModelRepo` → `WeightsSource`.
* `repo_id` + `revision` → `source` URI with `@revision` suffix.
* `volume_folder` → `mount_location` (must be absolute path).
* `runtime_secret_name` → `auth.auth_secret_name` (inside an `auth` block with `auth_method: CUSTOM_SECRET`).
* Remove `use_volume` and `kind` (inferred from URI scheme).
### Custom server migration
If you're using a [custom server](/examples/docker) with `model_cache`, you'll need to make additional changes when migrating to `weights`:
1. **Remove `truss-transfer-cli`** from your `start_command`. With `weights`, files are pre-mounted before your container starts.
2. **Update file paths** from `/app/model_cache/{volume_folder}` to your new `mount_location`.
The migrated `weights` configuration:
```yaml config.yaml theme={"system"}
docker_server:
# No truss-transfer-cli needed - weights are pre-mounted
start_command: text-embeddings-router --port 7997
--model-id /models/jina --max-client-batch-size 128
weights:
- source: "hf://jinaai/jina-embeddings-v2-base-code@516f4baf..."
mount_location: "/models/jina"
ignore_patterns: ["*.onnx"]
```
The previous `model_cache` setup for comparison:
```yaml config.yaml theme={"system"}
docker_server:
# Required truss-transfer-cli to download weights
start_command: bash -c "truss-transfer-cli && text-embeddings-router --port 7997
--model-id /app/model_cache/my_jina --max-client-batch-size 128"
model_cache:
- repo_id: jinaai/jina-embeddings-v2-base-code
revision: 516f4baf13dec4ddddda8631e019b5737c8bc250
use_volume: true
volume_folder: my_jina
ignore_patterns: ["*.onnx"]
```
***
## Best practices
### Pin to specific commits
Avoid using branch names like `@main` in production. While Baseten pins to the commit SHA at deploy time, using `@main` means each new deployment may get different weights, making debugging and rollbacks difficult.
Always pin to a specific commit SHA for reproducible deployments:
```yaml theme={"system"}
# Recommended - reproducible across deploys
weights:
- source: "hf://meta-llama/Llama-3.1-8B@5206a32e7b8a9f1c..."
mount_location: "/models/llama"
# Not recommended for production - each new deployment resolves to a different commit
weights:
- source: "hf://meta-llama/Llama-3.1-8B@main"
mount_location: "/models/llama"
```
To find the current commit SHA for a Hugging Face repo:
```bash theme={"system"}
# Look up the commit that a branch currently points to
# (for gated repos, authenticate git with your Hugging Face token first)
git ls-remote https://huggingface.co/meta-llama/Llama-3.1-8B refs/heads/main
```
### Filter files with patterns
Only download what you need to minimize cold start time:
```yaml theme={"system"}
weights:
- source: "hf://meta-llama/Llama-3.1-8B@main"
mount_location: "/models/llama"
allow_patterns:
- "*.safetensors" # Model weights
- "config.json" # Model config
- "tokenizer.*" # Tokenizer files
ignore_patterns:
- "*.bin" # Skip PyTorch format if using safetensors
- "*.md" # Skip documentation
- "*.txt" # Skip text files
```
### Use absolute mount paths
The `mount_location` must be an absolute path (starting with `/`):
```yaml theme={"system"}
# Correct
mount_location: "/models/llama"
mount_location: "/app/model_cache/my-model"
# Wrong - will fail validation
mount_location: "models/llama"
mount_location: "./my-model"
```
### Keep mount locations unique
Each weight source must have a unique `mount_location`:
```yaml theme={"system"}
# Correct - different paths
weights:
- source: "hf://meta-llama/Llama-3.1-8B@main"
mount_location: "/models/llama"
- source: "hf://sentence-transformers/all-MiniLM-L6-v2@main"
mount_location: "/models/embeddings"
# Wrong - duplicate paths will fail
weights:
- source: "hf://model-a@main"
mount_location: "/models/shared"
- source: "hf://model-b@main"
mount_location: "/models/shared"
```
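Both rules above can be checked with a few lines before you push. A hypothetical pre-push validator, not part of Truss:

```python
def validate_weights(entries: list[dict]) -> None:
    """Check that every mount_location is absolute and unique."""
    seen = set()
    for entry in entries:
        loc = entry["mount_location"]
        if not loc.startswith("/"):
            raise ValueError(f"mount_location must be absolute: {loc!r}")
        if loc in seen:
            raise ValueError(f"duplicate mount_location: {loc!r}")
        seen.add(loc)
```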
### When weights are re-mirrored
Baseten caches weights based on a hash of their configuration and reuses cached weights when possible to avoid redundant downloads.
**Deduplication and mutation detection:**
Baseten deduplicates files based on their etag (a content hash), not just filename, and only re-mirrors files that have been mutated since the last pull. Unchanged files are reused from blob storage, even across deployments.
**Changes that trigger re-mirroring:**
| Field | Re-mirrors? | Why |
| ----------------- | ----------- | --------------------------------------- |
| `source` | ✅ Yes | Different repository, revision, or path |
| `allow_patterns` | ✅ Yes | Different files will be downloaded |
| `ignore_patterns` | ✅ Yes | Different files will be downloaded |
**Changes that do NOT trigger re-mirroring:**
| Field | Re-mirrors? | Why |
| ---------------- | ----------- | --------------------------------------------------- |
| `auth` | ❌ No | Credentials don't affect which files are mirrored |
| `mount_location` | ❌ No | Only affects where weights appear in your container |
To force a fresh download of weights that haven't changed, modify the `source` to point to a specific commit SHA instead of a branch name, or add a trivial change to `allow_patterns`.
***
## How it works
### What happens when you `truss push`
```mermaid theme={"system"}
sequenceDiagram
participant User
participant Baseten as Baseten API
participant Mirror as Mirroring Worker Pool
participant Sources as Weight Sources
participant Storage as Baseten Blob Storage
participant WP as Workload Plane
User->>Baseten: truss push (with weights config)
Baseten->>User: Deployment created (returns immediately)
Baseten->>Mirror: Initiate mirroring job
Mirror->>Sources: Download weights (HF/S3/GCS/R2)
Mirror->>Storage: Upload mirrored weights (deduplicated)
Mirror->>Baseten: Mirroring complete
Note over Baseten,WP: Deploy to Workload Plane blocked until mirroring completes
Baseten->>WP: Deploy model (weights ready)
```
Your `truss push` command returns immediately after the deployment is created in Baseten. The mirroring process runs in the background, but your model will not be deployed to the Workload Plane until mirroring completes. This ensures weights are available before your model pod starts.
### What happens on cold start
Baseten runs multiple Workload Planes across regions and clusters. Each Workload Plane has its own in-cluster cache for fast weight delivery:
```mermaid theme={"system"}
flowchart TD
subgraph ControlPlane[Control Plane]
Storage[Baseten Blob Storage]
end
subgraph WP1[Workload Plane 1]
Cache1[In-Cluster Cache]
subgraph Node1[Node]
subgraph Agent1[BDN Agent]
NodeCache1[Node Cache]
end
Pod1[Model Pod]
end
end
subgraph WP2[Workload Plane 2]
Cache2[In-Cluster Cache]
subgraph Node2[Node]
subgraph Agent2[BDN Agent]
NodeCache2[Node Cache]
end
Pod2[Model Pod]
end
end
Storage --> Cache1
Storage --> Cache2
Cache1 --> Agent1
Cache2 --> Agent2
Agent1 --> Pod1
Agent2 --> Pod2
```
When your model pod starts:
1. The **BDN Agent** on the node fetches the manifest for your weights.
2. Weights are downloaded through the **In-Cluster Cache** (shared across pods in the cluster).
3. Weights are stored in the **Node Cache** (part of the BDN Agent, shared across pods on the same node).
4. Weights are mounted read-only to your model pod.
### Key benefits
* **Non-blocking push** → `truss push` returns immediately; mirroring happens in the background.
* **One-time mirroring** → Weights are mirrored to Baseten storage once, not on every cold start.
* **No upstream dependency at runtime** → Once mirrored, scale-ups and inference never contact the original source.
* **Multi-tier caching** → In-cluster cache prevents redundant downloads; node cache provides instant access for subsequent replicas.
* **Deduplication** → Identical weight files are stored once and shared via hardlinks.
* **Parallel downloads** → Large models download faster with concurrent chunk fetching.
***
## Next steps
* [Secrets](/development/model/secrets): Store credentials for private weight sources.
* [Custom Docker images](/development/model/custom-server): Deploy vLLM, SGLang, and other inference servers.
* [Autoscaling](/deployment/autoscaling): Configure replica scaling and cold start behavior.
* [Configuration reference](/reference/truss-configuration#weights): Full list of `weights` options.
# Custom build commands
Source: https://docs.baseten.co/development/model/build-commands
How to run your own Docker commands during the build stage.
The `build_commands` feature allows you to **run custom Docker commands** during the **build stage**, enabling **advanced caching**, **dependency management**, **and environment setup**.
**Use Cases:**
* Clone GitHub repositories
* Install dependencies
* Create directories
* Pre-download model weights
## 1. Using build commands in `config.yaml`
Add `build_commands` to your `config.yaml`:
```yaml theme={"system"}
build_commands:
- git clone https://github.com/comfyanonymous/ComfyUI.git
- cd ComfyUI && git checkout b1fd26fe9e55163f780bf9e5f56bf9bf5f035c93 && pip install -r requirements.txt
model_name: Build Commands Demo
python_version: py310
resources:
accelerator: A100
use_gpu: true
```
**What happens?**
* The GitHub repository is cloned.
* The specified commit is checked out.
* Dependencies are installed.
* **Everything is cached at build time**, reducing deployment cold starts.
## 2. Creating directories in your Truss
Use `build_commands` to **create directories** directly in the container.
```yaml theme={"system"}
build_commands:
- git clone https://github.com/comfyanonymous/ComfyUI.git
- cd ComfyUI && mkdir ipadapter
- cd ComfyUI && mkdir instantid
```
Useful for **large codebases** requiring additional structure.
## 3. Caching model weights efficiently
For large weights (10GB+), use `model_cache` or `external_data`.
For smaller weights, use `wget` in `build_commands`:
```yaml theme={"system"}
build_commands:
- git clone https://github.com/comfyanonymous/ComfyUI.git
- cd ComfyUI && pip install -r requirements.txt
- cd ComfyUI/models/controlnet && wget -O control-lora-canny-rank256.safetensors https://huggingface.co/stabilityai/control-lora/resolve/main/control-LoRAs-rank256/control-lora-canny-rank256.safetensors
- cd ComfyUI/models/controlnet && wget -O control-lora-depth-rank256.safetensors https://huggingface.co/stabilityai/control-lora/resolve/main/control-LoRAs-rank256/control-lora-depth-rank256.safetensors
model_name: Build Commands Demo
python_version: py310
resources:
accelerator: A100
use_gpu: true
system_packages:
- wget
```
**Why use this?**
* **Reduces startup time** by **preloading model weights** during the build stage.
* **Ensures availability** without runtime downloads.
## 4. Running any shell command
The `build_commands` feature lets you execute **any** shell command as if running it locally, with the benefit of **caching the results** at build time.
**Key Benefits:**
* **Reduces cold starts** by caching dependencies & data.
* **Ensures reproducibility** across deployments.
* **Optimizes environment setup** for fast execution.
# Your first model
Source: https://docs.baseten.co/development/model/build-your-first-model
Deploy a model to Baseten with just a config file. Pick an open-source model from Hugging Face, choose a GPU, and get an endpoint in minutes.
Baseten deploys models from a single `config.yaml` file. You point to a model on Hugging Face, choose a GPU, and Baseten builds a TensorRT-optimized container with an OpenAI-compatible API. No Python code, no Dockerfile, no container management.
This tutorial deploys [Qwen 2.5 3B Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct), a small but capable LLM, to a production-ready endpoint on an L4 GPU.
## Set up your environment
Install [Truss](https://pypi.org/project/truss/):
```bash theme={"system"}
uv venv && source .venv/bin/activate
uv pip install --upgrade truss
```
```bash theme={"system"}
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade truss
```
```bash theme={"system"}
python -m venv .venv && .venv\Scripts\activate
pip install --upgrade truss
```
You also need a [Baseten account](https://app.baseten.co/signup) with an [API key](https://app.baseten.co/settings/account/api_keys).
### Log in to Baseten
Generate an API key from [Settings > API keys](https://app.baseten.co/settings/account/api_keys), then authenticate the Truss CLI:
```bash theme={"system"}
truss login
```
Paste your API key when prompted:
```output theme={"system"}
💻 Let's add a Baseten remote!
🤫 Quietly paste your API_KEY:
```
You can skip the interactive prompt by setting `BASETEN_API_KEY` as an environment variable:
```bash theme={"system"}
export BASETEN_API_KEY="paste-your-api-key-here"
```
## Create the config
Create a project directory with a `config.yaml`:
```bash theme={"system"}
mkdir qwen-2.5-3b && cd qwen-2.5-3b
```
Create a `config.yaml` file with the following contents:
```yaml config.yaml theme={"system"}
model_name: Qwen-2.5-3B
resources:
accelerator: L4
use_gpu: true
model_metadata:
tags:
- openai-compatible
trt_llm:
build:
base_model: decoder
checkpoint_repository:
source: HF
repo: "Qwen/Qwen2.5-3B-Instruct"
max_seq_len: 8192
quantization_type: fp8
tensor_parallel_count: 1
```
That's the entire deployment specification. The `resources` section selects an L4 GPU, which has 24 GB of VRAM. The `trt_llm` section tells Baseten to use [Engine-Builder-LLM](/engines/engine-builder-llm/overview), which compiles the model with TensorRT-LLM for optimized inference. The `checkpoint_repository` points to the model weights on Hugging Face (Qwen 2.5 3B Instruct is ungated, so no access token is needed). Setting `quantization_type: fp8` compresses weights to 8-bit floating point, cutting memory usage roughly in half with negligible quality loss.
## Deploy
Push to Baseten:
```bash theme={"system"}
truss push
```
You should see:
```output theme={"system"}
✨ Model Qwen-2.5-3B was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abc1d2ef/logs/xyz123
```
The logs URL contains your model ID, the string after `/models/` (e.g. `abc1d2ef`). You'll need this to call the model's API. You can also find it in your [Baseten dashboard](https://app.baseten.co/models/).
Baseten now downloads the model weights, compiles them with TensorRT-LLM, and deploys the resulting container to an L4 GPU. You can watch progress in the logs linked above. When the deployment status shows "Active" in the dashboard, it's ready for requests.
New accounts include free credits. This deployment uses an L4 GPU, one of the most cost-effective options available.
## Call your model
Engine-based deployments serve an OpenAI-compatible API, so any code that works with the OpenAI SDK works with your model. Replace `{model_id}` with your model ID from the deployment output.
Install the OpenAI SDK if you don't have it:
```bash theme={"system"}
uv pip install openai
```
Create a chat completion:
```python theme={"system"}
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ["BASETEN_API_KEY"],
base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)
response = client.chat.completions.create(
model="Qwen-2.5-3B",
messages=[
{"role": "user", "content": "What is machine learning?"}
],
)
print(response.choices[0].message.content)
```
```bash theme={"system"}
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Api-Key $BASETEN_API_KEY" \
-d '{
"model": "Qwen-2.5-3B",
"messages": [
{"role": "user", "content": "What is machine learning?"}
]
}'
```
You should see a response like:
```output theme={"system"}
Machine learning is a branch of artificial intelligence where systems learn
patterns from data to make predictions or decisions without being explicitly
programmed for each task...
```
## What just happened
With a 17-line config file, you deployed a production-ready LLM endpoint. Here's what Baseten did:
1. Downloaded the Qwen 2.5 3B Instruct weights from Hugging Face.
2. Compiled the model with TensorRT-LLM, applying FP8 quantization for faster inference and lower memory usage.
3. Packaged everything into a container and deployed it to an L4 GPU.
4. Exposed an OpenAI-compatible API that handles tokenization, batching, and KV cache management automatically.
No `model.py`, no Docker setup, no inference server configuration. This config-only pattern works for most popular open-source LLMs, including Llama, Qwen, Mistral, Gemma, and Phi models.
## Next steps
Tune max sequence length, batch size, quantization, and runtime settings for your deployment.
Add custom Python when you need preprocessing, postprocessing, or unsupported model architectures.
Configure replicas, concurrency targets, and scale-to-zero for production traffic.
Move from development to production with `truss push --promote`.
# Python driven configuration for models
Source: https://docs.baseten.co/development/model/code-first-development
Use code-first development tools to streamline model production.
This feature is still in beta.
In addition to our normal YAML configuration, we support configuring your model using pure Python. This offers the following benefits:
* **Typed configuration via Python code** with IDE autocomplete, instead of a separate `yaml` configuration file
* **Simpler directory structure** that IDEs support for module resolution
In this guide, we go through deploying a simple Model using this new framework.
### Step 1: Initializing your project
We leverage traditional `truss init` functionality with a new flag to create the directory structure:
```bash theme={"system"}
truss init my-new-model --python-config
```
### Step 2: Write your model
To build a model with this new framework, we require two things:
* A class that inherits from `baseten.ModelBase`, which will serve as the entrypoint when invoking `/predict`
* A `predict` method with type hints
That’s it! The following is a contrived example of a complete model that keeps a running total of user-provided input:
```python my_model.py theme={"system"}
import truss_chains as baseten
class RunningTotalCalculator(baseten.ModelBase):
def __init__(self):
self._running_total = 0
async def predict(self, increment: int) -> int:
self._running_total += increment
return self._running_total
```
### Step 3: Deploy, patch, and publish your model
To deploy a development version and start watching for changes, run:
```bash theme={"system"}
truss push --watch my_model.py
```
Note that `push` (and all other commands below) requires the path to the file containing the model as the final argument.
This creates a development deployment and starts watching for changes. You can quickly iterate without building new images every time. To re-attach the watcher later:
```bash theme={"system"}
truss watch my_model.py
```
When you're ready for production, deploy a published version with `truss push my_model.py`.
### Model Configuration
Models can configure requirements for compute hardware (CPU count, GPU type and count, etc.) and software dependencies (Python libraries or system packages) via the [`remote_config`](/reference/sdk/chains#remote-configuration) class variable within the model:
```python my_model.py theme={"system"}
class RunningTotalCalculator(baseten.ModelBase):
remote_config: baseten.RemoteConfig = baseten.RemoteConfig(
compute=baseten.Compute(cpu_count=4, memory="1Gi", gpu="T4", gpu_count=2)
)
...
```
See the [remote configuration reference](/reference/sdk/chains#remote-configuration) for a complete list of options.
### Context (access information)
You can add a [`DeploymentContext`](/reference/sdk/chains#class-truss-chains-deploymentcontext) object as an optional final argument to the `__init__` method of a Model. This lets you use secrets within your Model; note that they also need to be added to the `assets`.
We only expose secrets to the model that were explicitly requested in `assets` to comply with best security practices.
```python my_model.py theme={"system"}
class RunningTotalCalculator(baseten.ModelBase):
remote_config: baseten.RemoteConfig = baseten.RemoteConfig(
...
assets=baseten.Assets(secret_keys=["token"])
)
def __init__(self, context: baseten.DeploymentContext = baseten.depends_context()):
...
self._token = context.secrets["token"]
```
### Packages
If you want to include modules in your model, you can easily create them from the root of the project:
```bash theme={"system"}
my-new-model/
module_1/
submodule/
script.py
module_2/
another_script.py
my_model.py
```
With this file structure, you would import in `my_model.py` as follows:
```python my_model.py theme={"system"}
import truss_chains as baseten
from module_1.submodule import script
from module_2 import another_script
class RunningTotalCalculator(baseten.ModelBase):
...
```
### Known limitations
* RemoteConfig doesn't support all the options exposed by the traditional `config.yaml`. If you’re excited about this new development experience but need a specific feature ported over, please reach out to us!
* This new framework doesn't support `preprocess` or `postprocess` hooks. We typically recommend inlining functionality from those functions if easy, or utilizing `chains` if the needs are more complex.
# Configuration
Source: https://docs.baseten.co/development/model/configuration
How to configure your model.
ML models depend on external libraries, data files, and specific hardware configurations.
This guide shows you how to configure your model's dependencies and resources.
The `config.yaml` file defines your model's configuration. Common options include:
# Environment variables
To set environment variables in the model serving environment, use the `environment_variables` key:
```yaml config.yaml theme={"system"}
environment_variables:
MY_ENV_VAR: my_value
```
# Python packages
Specify Python packages in `config.yaml` using either `requirements` (an inline list) or `requirements_file` (a path to a file). These two options are mutually exclusive.
## Inline list
List packages directly in `config.yaml`:
```yaml config.yaml theme={"system"}
requirements:
- package_name
- package_name2
```
Pin package versions with `==`:
```yaml config.yaml theme={"system"}
requirements:
- package_name==1.0.0
- package_name2==2.0.0
```
## Requirements file
Point `requirements_file` at a dependency file. Truss supports three formats:
Use a standard pip requirements file for full control over pip options and repositories.
```yaml config.yaml theme={"system"}
requirements_file: ./requirements.txt
```
Use a `pyproject.toml` to install dependencies from the `[project.dependencies]` table.
```yaml config.yaml theme={"system"}
requirements_file: ./pyproject.toml
```
Truss reads only the `[project.dependencies]` list. Optional dependency groups are ignored.
Use a `uv.lock` file for fully pinned, reproducible installs managed by [uv](https://docs.astral.sh/uv/).
```yaml config.yaml theme={"system"}
requirements_file: ./uv.lock
```
The `uv.lock` file must have a sibling `pyproject.toml` in the same directory. Truss copies both files into the build context.
### Chains
Chains supports the same three formats via `DockerImage.requirements_file`. Use [`make_abs_path_here`](/reference/sdk/chains#function-truss-chains-make-abs-path-here) to resolve the path relative to the source file:
```python theme={"system"}
import truss_chains as chains

class MyChainlet(chains.ChainletBase):
    remote_config = chains.RemoteConfig(
        docker_image=chains.DockerImage(
            requirements_file=chains.make_abs_path_here("requirements.txt"),
        ),
    )
```
`pyproject.toml` and `uv.lock` work the same way:
```python theme={"system"}
docker_image=chains.DockerImage(
    requirements_file=chains.make_abs_path_here("pyproject.toml"),
)
```
```python theme={"system"}
docker_image=chains.DockerImage(
    requirements_file=chains.make_abs_path_here("uv.lock"),
)
```
`pip_requirements_file` is deprecated; use `requirements_file` instead. You can't combine `pip_requirements` with `pyproject.toml` or `uv.lock` files; manage all dependencies in your `pyproject.toml`.
# System packages
Truss also supports installing Debian packages with `apt`. To add system packages to your model serving environment, add the following to your `config.yaml` file:
```yaml config.yaml theme={"system"}
system_packages:
- package_name
- package_name2
```
For example, to install Tesseract OCR:
```yaml config.yaml theme={"system"}
system_packages:
- tesseract-ocr
```
# Resources
Specify hardware resources in the `resources` section.
**Option 1: Specify individual resource fields**
For a CPU model:
```yaml config.yaml theme={"system"}
resources:
  cpu: "1"
  memory: 2Gi
```
For a GPU model:
```yaml config.yaml theme={"system"}
resources:
  accelerator: "L4"
```
When you push your model, Baseten assigns it an instance type that matches the specified requirements.
**Option 2: Specify an exact instance type**
```yaml config.yaml theme={"system"}
resources:
  instance_type: "L4:4x16"
```
Using `instance_type` lets you select an exact SKU. When specified, other resource fields are ignored.
See the [Resources](/deployment/resources) page for more information on the available options.
# Advanced configuration
There are numerous other options for configuring your model. See these guides:
* [Secrets](/development/model/secrets)
* [Data](/development/model/data-directory)
* [Custom Build Commands](/development/model/build-commands)
* [Base Docker Images](/development/model/base-images)
* [Custom Servers](/development/model/custom-server)
* [Custom Health Checks](/development/model/custom-health-checks)
# Custom health checks
Source: https://docs.baseten.co/development/model/custom-health-checks
Customize the health of your deployments.
**Why use custom health checks?**
* **Control traffic and restarts** by configuring failure thresholds to suit your needs.
* **Define replica health with custom logic** (for example, fail after a certain number of 500s or a specific CUDA error).
By default, health checks run every 10 seconds to verify that each replica of
your deployment is running successfully and can receive requests. If a health
check fails for an extended period, one or both of the following actions may
occur:
* Traffic is immediately stopped from reaching the failing replica.
* The failing replica is restarted.
The thresholds for each of these actions are configurable.
## Understanding readiness vs. liveness
Baseten uses two types of Kubernetes health probes that run continuously after
your container starts:
**Readiness probe** answers "Can I handle requests right now?" When it fails,
Kubernetes stops sending traffic to the container but doesn't restart it. Use
this to prevent traffic during startup or temporary unavailability. The failure
threshold is controlled by `stop_traffic_threshold_seconds`.
**Liveness probe** answers "Am I healthy enough to keep running?" When it fails,
Kubernetes restarts the container. Use this to recover from deadlocks or hung
processes. The failure threshold is controlled by `restart_threshold_seconds`.
For most servers, using the same endpoint (like `/health`) for both probes is
sufficient. The key difference is the action taken: readiness controls traffic
routing, while liveness controls container lifecycle.
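As a minimal stdlib sketch of such a shared endpoint, a server can expose a single `/health` route that both probes hit; the handler name and `HEALTHY` flag here are illustrative, and a real server would typically use its framework's routing instead:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

HEALTHY = True  # flip to False to simulate a failing replica

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Both readiness and liveness probes can hit the same route;
        # only the action taken on failure differs.
        if self.path == "/health":
            self.send_response(200 if HEALTHY else 503)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):
        pass  # silence per-request logging

server = HTTPServer(("127.0.0.1", 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
```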
Both probes wait before starting checks to allow your server time to initialize.
Configure this delay with `restart_check_delay_seconds`.
Custom health checks can be implemented in two ways:
1. [**Configuring thresholds**](#configuring-health-checks) for when health check failures should stop traffic to or restart a replica.
2. [**Writing custom health check logic**](#writing-custom-health-checks) to define how replica health is determined.
## Configuring health checks
### Parameters
You can customize the behavior of health checks on your deployments by setting
the following parameters:
* `stop_traffic_threshold_seconds`: The duration that health checks must continuously fail before traffic to the failing replica is stopped. Must be between `10` and `3000` seconds, inclusive.
* `restart_check_delay_seconds`: How long to wait before running health checks. Must be between `0` and `3000` seconds, inclusive.
* `restart_threshold_seconds`: The duration that health checks must continuously fail before triggering a restart of the failing replica. Must be between `10` and `3000` seconds, inclusive.

The combined value of `restart_check_delay_seconds` and `restart_threshold_seconds` must not exceed `1800` seconds.
### Model and custom server deployments
Configure health checks in your `config.yaml`.
```yaml config.yaml theme={"system"}
runtime:
  health_checks:
    restart_check_delay_seconds: 60  # Waits 60 seconds after deployment before starting health checks
    restart_threshold_seconds: 600  # Triggers a restart if health checks fail for 10 minutes
    stop_traffic_threshold_seconds: 300  # Stops traffic if health checks fail for 5 minutes
```
You can also specify custom health check endpoints for custom servers.
[See here](/development/model/custom-server#1-configuring-a-custom-server-in-config-yaml)
for more details.
### Chains
Use `remote_config` to configure health checks for your chainlet classes.
```python chain.py theme={"system"}
class CustomHealthChecks(chains.ChainletBase):
    remote_config = chains.RemoteConfig(
        options=chains.ChainletOptions(
            health_checks=truss_config.HealthChecks(
                restart_check_delay_seconds=30,  # Waits 30 seconds before starting health checks
                restart_threshold_seconds=600,  # Restart replicas after 10 minutes of failure
                stop_traffic_threshold_seconds=300,  # Stop traffic after 5 minutes of failure
            )
        )
    )
```
## Writing custom health checks
You can write custom health checks in both **model deployments** and **chain
deployments**.
Custom health checks are currently not supported in development deployments.
### Custom health checks in models
```python model.py theme={"system"}
class Model:
    def is_healthy(self) -> bool:
        # Add custom health check logic for your model here
        pass
```
### Custom health checks in chains
Health checks can be customized for each chainlet in your chain.
```python chain.py theme={"system"}
@chains.mark_entrypoint
class CustomHealthChecks(chains.ChainletBase):
    def is_healthy(self) -> bool:
        # Add custom health check logic for your chainlet here
        pass
```
## Health checks in action
### Identifying 5xx errors
You might create a custom health check to identify 5xx errors like the following:
```python model.py theme={"system"}
class Model:
    def __init__(self):
        ...
        self._is_healthy = True

    def load(self):
        # Perform load
        # Your custom health check won't run until after load completes
        ...

    def is_healthy(self):
        return self._is_healthy

    def predict(self, input):
        try:
            # Perform inference
            ...
        except Some5xxError:
            self._is_healthy = False
            raise
```
Custom health check failures are indicated by the following log:
```md Example health check failure log line theme={"system"}
Jan 27 10:36:03pm md2pg Health check failed.
```
Deployment restarts due to health check failures are indicated by the following log:
```md Example restart log line theme={"system"}
Jan 27 12:02:47pm zgbmb Model terminated unexpectedly. Exit code: 0, reason: Completed, restart count: 1
```
## FAQs
### Is there a rule of thumb for configuring thresholds for stopping traffic and restarting?
It depends on your health check implementation. If your health check relies on conditions that only change during inference (e.g., `_is_healthy` is set in `predict`), restarting before stopping traffic is generally better, as it allows recovery without disrupting traffic.
Stopping traffic first may be preferable if a failing replica is actively degrading performance or causing inference errors, as it prevents the failing replica from affecting the overall deployment while allowing time for debugging or recovery.
### When should I configure `restart_check_delay_seconds`?
Configure `restart_check_delay_seconds` to allow replicas sufficient time to initialize after deployment or a restart. This delay helps reduce unnecessary restarts, particularly for services with longer startup times.
### Why am I seeing two health check failure logs in my logs?
These refer to two separate health checks we run every 10 seconds:
* One to determine when to stop traffic to a replica.
* The other to determine when to restart a replica.
### Do stopped traffic or replica restarts affect autoscaling?
Yes, both can impact autoscaling. If traffic stops or replicas restart, the
remaining replicas handle more load. If the load exceeds the concurrency target
during the autoscaling window, additional replicas are spun up. Similarly, when
traffic stabilizes, excess replicas are scaled down after the scale down delay.
[See here](/deployment/autoscaling/overview#how-autoscaling-works) for more details on
autoscaling.
### How does billing get affected?
You're billed for the uptime of your deployment. This includes the time a
replica is running, even if it is failing health checks, until it scales down.
### Will failing health checks cause my deployment to stay up forever?
No. If your deployment is configured with a scale down delay and the minimum
number of replicas is set to 0, the replicas will scale down once the model is
no longer receiving traffic for the duration of the scale down delay. This
applies even if the replicas are failing health checks.
[See here](/deployment/autoscaling/overview#scale-to-zero) for more details on
autoscaling.
### What happens when my deployment is loading?
When your deployment is loading, your custom health check will not be running.
Once `load()` is completed, we'll start using your custom `is_healthy()` health
check.
# Custom model code
Source: https://docs.baseten.co/development/model/custom-model-code
Deploy a model with custom Python using the Truss Model class.
When you need custom preprocessing, postprocessing, or want to run a model that isn't supported by Baseten's built-in engines, you can write Python code in a `model.py` file. Truss provides a `Model` class with three methods (`__init__`, `load`, and `predict`) that give you full control over how your model initializes, loads weights, and handles requests.
Most deployments don't need custom Python at all. If you're deploying a supported open-source model, see [Your first model](/development/model/build-your-first-model) for the config-only approach. Use custom model code when you need to:
* Run a model architecture that Baseten's engines don't support.
* Add custom preprocessing or postprocessing around inference.
* Combine multiple models or libraries in a single endpoint.
## Prerequisites
Install [Truss](https://pypi.org/project/truss/):
```bash theme={"system"}
uv venv && source .venv/bin/activate
uv pip install --upgrade truss
```
```bash theme={"system"}
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade truss
```
```bash theme={"system"}
python -m venv .venv && .venv\Scripts\activate
pip install --upgrade truss
```
You also need a [Baseten account](https://app.baseten.co/signup) with an [API key](https://app.baseten.co/settings/account/api_keys).
## Initialize your model
Create a new Truss project with `truss init`.
```bash theme={"system"}
$ truss init hello-world
? 📦 Name this model: HelloWorld
Truss HelloWorld was created in ~/hello-world
```
This creates a directory with the following structure:
* `config.yaml`: Configuration for dependencies, resources, and deployment settings.
* `model/model.py`: Your model code.
* `packages/`: Optional local Python packages.
* `data/`: Optional data files bundled with your model.
### config.yaml
The `config.yaml` file configures dependencies, resources, and other settings. Here's the default:
```yaml config.yaml theme={"system"}
build_commands: []
environment_variables: {}
external_package_dirs: []
model_metadata: {}
model_name: HelloWorld
python_version: py311
requirements: []
resources:
  accelerator: null
  cpu: '1'
  memory: 2Gi
  use_gpu: false
secrets: {}
system_packages: []
```
The fields you'll use most often:
* `requirements`: Python packages installed at build time (pip format).
* `resources`: CPU, memory, and GPU allocation.
* `secrets`: Secret names your model needs at runtime, such as HuggingFace API keys.
See the [Configuration](/development/model/configuration) page for the full reference.
### model.py
The `model.py` file defines a `Model` class with three methods:
```python theme={"system"}
class Model:
    def __init__(self, **kwargs):
        pass

    def load(self):
        pass

    def predict(self, model_input):
        return model_input
```
* `__init__`: Runs when the class is created. Initialize variables and store configuration here.
* `load`: Runs once at startup, before any requests. Load model weights, tokenizers, and other heavy resources here. Separating this from `__init__` keeps expensive operations out of the request path.
* `predict`: Runs on every API request. Process input, run inference, and return the response.
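The constructor's `**kwargs` carry useful context. A hedged sketch (keys like `config`, `secrets`, and `data_dir` appear elsewhere in these docs; what's available depends on your configuration):

```python
class Model:
    def __init__(self, **kwargs):
        # Context Truss passes in at startup.
        self._config = kwargs.get("config")
        self._secrets = kwargs.get("secrets")
        self._data_dir = kwargs.get("data_dir")

    def load(self):
        # Stand-in for loading weights or other heavy resources.
        self._ready = True

    def predict(self, model_input):
        return {"echo": model_input, "ready": self._ready}
```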
## Deploy your model
Deploy with `truss push --watch`.
```bash theme={"system"}
$ truss push --watch
```
This packages your code and config, builds a container, and deploys it to Baseten.
## Invoke your model
After deployment, call your model at the invocation URL:
```bash theme={"system"}
$ curl -X POST https://model-{model-id}.api.baseten.co/development/predict \
-H "Authorization: Api-Key $BASETEN_API_KEY" \
-d '"some text"'
```
You should see:
```output theme={"system"}
"some text"
```
## Example: text classification
To see the `Model` class in action, deploy a text classification model from HuggingFace using the `transformers` library.
### config.yaml
Add `transformers` and `torch` as dependencies:
```yaml config.yaml theme={"system"}
requirements:
- transformers
- torch
```
### model.py
Load the classification pipeline in `load` and run it in `predict`:
```python model.py theme={"system"}
from transformers import pipeline

class Model:
    def __init__(self, **kwargs):
        pass

    def load(self):
        self._model = pipeline("text-classification")

    def predict(self, model_input):
        return self._model(model_input)
```
### Deploy and call
Deploy with `truss push --watch`, then call the endpoint:
```bash theme={"system"}
$ truss push --watch
```
```bash theme={"system"}
$ curl -X POST https://model-{model-id}.api.baseten.co/development/predict \
-H "Authorization: Api-Key $BASETEN_API_KEY" \
-d '"some text"'
```
## Next steps
* [Configuration](/development/model/configuration): Full reference for `config.yaml` options.
* [Implementation](/development/model/implementation): Advanced model patterns including streaming, async, and custom health checks.
* [Your first model](/development/model/build-your-first-model): Deploy a model with just a config file, no custom Python needed.
# Deploy custom Docker images
Source: https://docs.baseten.co/development/model/custom-server
Deploy custom Docker images to run inference servers like vLLM, SGLang, Triton, or any containerized application.
When you write a `Model` class, Truss uses the
[Truss server base image](https://hub.docker.com/r/baseten/truss-server-base/tags)
by default. However, you can deploy pre-built containers.
In this guide, you will learn how to set up your configuration file to run a
custom Docker image and deploy it to Baseten using Truss.
## Configuration
To deploy a custom Docker image, set
[`base_image`](/reference/truss-configuration#base-image-image) to your image
and use the `docker_server` argument to specify how to run it.
```yaml config.yaml theme={"system"}
base_image:
  image: your-registry/your-image:latest
docker_server:
  start_command: your-server-start-command
  server_port: 8000
  predict_endpoint: /predict
  readiness_endpoint: /health
  liveness_endpoint: /health
```
* `image`: The Docker image to use.
* `start_command`: The command to start the server.
* `server_port`: The port to listen on.
* `predict_endpoint`: The endpoint to forward requests to.
* `readiness_endpoint`: The endpoint to check if the server is ready.
* `liveness_endpoint`: The endpoint to check if the server is alive.
Port 8080 is reserved by Baseten's internal reverse proxy. If your server binds to port 8080, the deployment fails with `[Errno 98] address already in use`.
For the full list of fields, see the
[configuration reference](/reference/truss-configuration#docker_server).
### Non-root user
If your base image expects a specific non-root UID, set `run_as_user_id` under `docker_server`:
```yaml config.yaml theme={"system"}
base_image:
  image: your-registry/your-image:latest
docker_server:
  start_command: your-server-start-command
  server_port: 8000
  predict_endpoint: /predict
  readiness_endpoint: /health
  liveness_endpoint: /health
  run_as_user_id: 1000
```
The UID must already exist in the base image. Values `0` (root) and `60000` (platform default) are not allowed.
Many NVIDIA base images, including NIM and Triton, run as user ID `1000`. Set `run_as_user_id: 1000` when using these images.
Baseten automatically sets ownership of `/app`, `/workspace`, the packages directory, and `$HOME` to this UID. If your server writes to directories outside of these, ensure they are writable by the specified UID in your base image or via `build_commands`.
While `predict_endpoint` maps your server's inference route to Baseten's
`/predict` endpoint, you can access any route in your server using the
[sync endpoint](/inference/calling-your-model#sync-api-endpoints).
| Baseten endpoint | Maps to |
| ------------------------------------------- | ----------------------------- |
| `/environments/production/predict` | Your `predict_endpoint` route |
| `/environments/production/sync/{any/route}` | `/{any/route}` in your server |
**Example:** If you set `predict_endpoint: /v1/chat/completions`:
| Baseten endpoint | Maps to |
| ----------------------------------------- | ---------------------- |
| `/environments/production/predict` | `/v1/chat/completions` |
| `/environments/production/sync/v1/models` | `/v1/models` |
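A small helper illustrating how sync URLs are assembled (the model ID is a placeholder; the route mapping follows the table above):

```python
def sync_url(model_id: str, route: str, env: str = "production") -> str:
    # /environments/{env}/sync/{route} forwards to /{route} on your server.
    return (
        f"https://model-{model_id}.api.baseten.co"
        f"/environments/{env}/sync/{route.lstrip('/')}"
    )

# With predict_endpoint: /v1/chat/completions, other routes stay reachable:
url = sync_url("abc1d2ef", "/v1/models")
```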
## Deploy Ollama
This example deploys [Ollama](https://ollama.com/) with the TinyLlama model
using a custom Docker image. Ollama is a popular lightweight LLM inference
server, similar to vLLM or SGLang. TinyLlama is small enough to run on a CPU.
### 1. Create the config
Create a `config.yaml` file with the following configuration:
```yaml config.yaml theme={"system"}
model_name: ollama-tinyllama
base_image:
  image: python:3.11-slim
build_commands:
  - curl -fsSL https://ollama.com/install.sh | sh
docker_server:
  start_command: sh -c "ollama serve & sleep 5 && ollama pull tinyllama && wait"
  readiness_endpoint: /api/tags
  liveness_endpoint: /api/tags
  predict_endpoint: /api/generate
  server_port: 11434
resources:
  cpu: "4"
  memory: 8Gi
```
The `base_image` field specifies the Docker image to use as your starting
point, in this case a lightweight Python image. The `build_commands` section
installs Ollama into the container at build time. You can also use this to
install model weights or other dependencies.
The `start_command` launches the Ollama server, waits for it to initialize, and
then pulls the TinyLlama model.
The `readiness_endpoint` and `liveness_endpoint`
both point to `/api/tags`, which returns successfully when Ollama is running.
The `predict_endpoint` maps Baseten's `/predict` route to Ollama's
`/api/generate` endpoint.
Finally, declare your resource requirements. This example only needs 4 CPUs and
8GB of memory. For a complete list of resource options, see the
[Resources](/deployment/resources) page.
### 2. Deploy
To deploy the model, use the following:
```sh theme={"system"}
truss push --watch
```
This builds the Docker image and deploys it to Baseten.
Once the `readiness_endpoint` and `liveness_endpoint` checks succeed, the model is ready to use.
### 3. Run inference
Ollama generates text via its `/api/generate` endpoint. Since you mapped
Baseten's `/predict` route to Ollama's `/api/generate` endpoint, you can run
inference by calling the `/predict` endpoint.
To run inference with Truss, use the `predict` command:
```sh theme={"system"}
truss predict -d '{"model": "tinyllama", "prompt": "Write a short story about a robot dreaming", "options": {"num_predict": 50}}'
```
To run inference with cURL, use the following command:
```sh theme={"system"}
curl -s -X POST "https://model-MODEL_ID.api.baseten.co/development/predict" \
-H "Authorization: Api-Key $BASETEN_API_KEY" \
-d '{"model": "tinyllama", "prompt": "Write a short story about a robot dreaming", "options": {"num_predict": 50}}' \
| jq -j '.response'
```
To run inference with Python, use the following:
```python theme={"system"}
import os
import requests

model_id = "MODEL_ID"
baseten_api_key = os.environ["BASETEN_API_KEY"]

response = requests.post(
    f"https://model-{model_id}.api.baseten.co/development/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "model": "tinyllama",
        "prompt": "Write a short story about a robot dreaming",
        "options": {"num_predict": 50},
    },
)
print(response.json()["response"])
```
The following is an example of its response:
```output theme={"system"}
It was a dreary, grey day when the robots started to dream.
They had been programmed to think like humans, but it wasn't until they began to dream that they realized just how far apart they actually were.
```
Congratulations! You have successfully deployed a custom Docker image and run inference on it.
## No-build deployment
For security-hardened images that must remain completely unmodified, use [`no_build`](/reference/truss-configuration#no_build) to skip the build step entirely. Baseten copies the image to its container registry without running `docker build`.
No-build is only available for custom server deployments. Your Truss must use `docker_server` configuration. Standard Truss models with a `model.py` don't support `no_build`.
No-build deployments are not enabled by default. [Contact support](mailto:support@baseten.co) to enable this feature for your organization.
```yaml config.yaml theme={"system"}
base_image:
  image: your-registry/your-hardened-image:latest
docker_server:
  no_build: true
  server_port: 8000
  predict_endpoint: /predict
  readiness_endpoint: /health
  liveness_endpoint: /health
```
Set `no_build: true` and configure your server's port and endpoints. Since the image runs unmodified, it must include its own HTTP server and health check endpoints.
`start_command` is optional with `no_build`. If omitted, the image's original `ENTRYPOINT` runs. If your image needs a different startup command, set `start_command` to override the entrypoint.
### Constraints
* Requires a custom server deployment with `docker_server` configuration. Standard Truss models with a `model.py` don't support `no_build`.
* Development mode is not supported. Deploy with `truss push` (published deployments are the default).
* Truss config fields beyond `docker_server`, `base_image`, `environment_variables`, and `secrets` are not available. Pass any additional configuration as environment variables.
* If your image runs as a specific user, set `run_as_user_id` to that UID.
### Pass configuration as environment variables
Since Truss config fields aren't injected into no-build containers, use `environment_variables` to pass configuration:
```yaml config.yaml theme={"system"}
base_image:
  image: your-registry/your-hardened-image:latest
docker_server:
  no_build: true
  server_port: 8000
  predict_endpoint: /predict
  readiness_endpoint: /health
  liveness_endpoint: /health
environment_variables:
  MODEL_NAME: my-model
  MAX_BATCH_SIZE: "32"
```
Access these in your server code with `os.environ["MODEL_NAME"]`.
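Since everything arrives as a string, parse numeric values explicitly. A hedged sketch (the names match the example config above; the defaults are illustrative):

```python
import os

def read_settings(env=None):
    # Environment variables are always strings; convert as needed.
    env = os.environ if env is None else env
    return {
        "model_name": env.get("MODEL_NAME", "my-model"),
        "max_batch_size": int(env.get("MAX_BATCH_SIZE", "32")),
    }
```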
## Next steps
* [Private registries](/development/model/private-registries): Pull images from AWS ECR, Google Artifact Registry, or Docker Hub
* [Secrets](/development/model/secrets#custom-docker-images): Access API keys and tokens in your container
* [WebSockets](/development/model/websockets#websocket-usage-with-custom-servers): Enable WebSocket connections
* [vLLM](/examples/vllm), [SGLang](/examples/sglang), [TensorRT-LLM](/examples/tensorrt-llm): Deploy LLMs with popular inference servers
# Data and storage
Source: https://docs.baseten.co/development/model/data-directory
Load model weights without Hugging Face or S3
Model files, such as weights, can be **large** (often **multiple GBs**). Truss supports **multiple ways** to load them efficiently:
* **Public Hugging Face models** (default)
* **Bundled directly in Truss**
* **Private S3 storage**
## 1. Bundling model weights in Truss
Store model files **inside Truss** using the `data/` directory.
**Example: Stable Diffusion 2.1 Truss structure**
```text theme={"system"}
data/
  scheduler/
    scheduler_config.json
  text_encoder/
    config.json
    diffusion_pytorch_model.bin
  tokenizer/
    merges.txt
    tokenizer_config.json
    vocab.json
  unet/
    config.json
    diffusion_pytorch_model.bin
  vae/
    config.json
    diffusion_pytorch_model.bin
  model_index.json
```
**Access bundled files in `model.py`:**
```python theme={"system"}
class Model:
    def __init__(self, **kwargs):
        self._data_dir = kwargs["data_dir"]

    def load(self):
        self.model = StableDiffusionPipeline.from_pretrained(
            str(self._data_dir),
            revision="fp16",
            torch_dtype=torch.float16,
        ).to("cuda")
```
Limitation: Large weights increase the deployment size, which slows builds and deploys. Consider cloud storage instead.
## 2. Loading private model weights from S3
If using **private S3 storage**, first **configure secure authentication**.
### Step 1: Define AWS secrets in `config.yaml`
```yaml theme={"system"}
secrets:
  aws_access_key_id: null
  aws_secret_access_key: null
  aws_region: null # e.g., us-east-1
  aws_bucket: null
```
Do not store actual credentials here. Add them securely to [Baseten secrets
manager](https://app.baseten.co/settings/secrets).
### Step 2: Authenticate with AWS in `model.py`
```python theme={"system"}
import boto3

def __init__(self, **kwargs):
    self._config = kwargs.get("config")
    secrets = kwargs.get("secrets")
    self.s3_client = boto3.client(
        "s3",
        aws_access_key_id=secrets["aws_access_key_id"],
        aws_secret_access_key=secrets["aws_secret_access_key"],
        region_name=secrets["aws_region"],
    )
    self.s3_bucket = secrets["aws_bucket"]
```
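Building on that client, here is a sketch of downloading weights during `load` (the `weights/` prefix and `/tmp/weights` target directory are assumptions; adjust them to your bucket layout):

```python
from pathlib import Path

class Model:
    def __init__(self, **kwargs):
        self._secrets = kwargs["secrets"]
        self._weights_dir = Path("/tmp/weights")  # assumed local target

    def load(self):
        import boto3  # deferred so constructing the class doesn't need boto3

        s3 = boto3.client(
            "s3",
            aws_access_key_id=self._secrets["aws_access_key_id"],
            aws_secret_access_key=self._secrets["aws_secret_access_key"],
            region_name=self._secrets["aws_region"],
        )
        self._weights_dir.mkdir(parents=True, exist_ok=True)
        # Download every object under the assumed "weights/" prefix.
        paginator = s3.get_paginator("list_objects_v2")
        bucket = self._secrets["aws_bucket"]
        for page in paginator.paginate(Bucket=bucket, Prefix="weights/"):
            for obj in page.get("Contents", []):
                target = self._weights_dir / Path(obj["Key"]).name
                s3.download_file(bucket, obj["Key"], str(target))
```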
### Step 3: Deploy
```sh theme={"system"}
truss push --watch
```
# Deploy and iterate
Source: https://docs.baseten.co/development/model/deploy-and-iterate
Use development deployments with live patching for rapid iteration, then promote to production.
Development deployments let you iterate on your model without redeploying from scratch each time you make a change. When you save a file, Truss detects the change, calculates a patch, and applies it to the running deployment in seconds.
## Start a development deployment
Create a development deployment and start watching for changes:
```sh theme={"system"}
truss push --watch
```
Truss creates a development deployment, waits for it to build, and begins watching your project directory for file changes. Once the deployment reaches the `LOADING_MODEL` stage, Truss enters watch mode early so you can start iterating while the model finishes loading.
```output theme={"system"}
🪵 View logs for your deployment at https://app.baseten.co/models/abc1d2ef/logs/xyz123
👀 Watching for changes to truss...
```
## Re-attach to a development deployment
If you stop the watch session (Ctrl+C), re-attach to the existing development deployment with:
```sh theme={"system"}
truss watch
```
You should see:
```output theme={"system"}
🪵 View logs for your development model at https://app.baseten.co/models/abc1d2ef/logs/xyz123
🚰 Attempting to sync truss with remote
No changes observed, skipping patching.
👀 Watching for new changes.
```
`truss watch` syncs any changes made while disconnected, then resumes watching. It requires an existing development deployment. If you don't have one, use `truss push --watch` to create it.
## What gets live-patched
Truss monitors your project directory (respecting `.trussignore` patterns) and applies patches for the following changes without a full rebuild:
| Change type | Examples |
| --------------------- | -------------------------------------------------------------------------------- |
| Model code | Files in the `model/` directory: `model.py`, helper modules, utilities. |
| Bundled packages | Files in the `packages/` directory. |
| Python requirements | Adding, removing, or updating packages in `requirements` or a requirements file. |
| Environment variables | Adding, removing, or updating values in `environment_variables`. |
| External data | Adding or removing entries in `external_data`. |
| Config values | Most `config.yaml` changes (except those listed below). |
## What requires a full redeploy
The patch system doesn't support some changes. When you make these changes, stop the watch session and run `truss push` (or `truss push --watch` to start a new development deployment):
| Change type | Why |
| ----------------------------- | ------------------------------------------------------- |
| `resources` (GPU type, count) | Requires a new instance. |
| `python_version` | Requires a new base image. |
| `system_packages` | Requires apt installation in the container. |
| `live_reload` | Changes the deployment mode. |
| Data directory (`data/`) | The patch system doesn't track file changes in `data/`. |
If a patch fails, Truss prints an error and continues watching. Fix the issue in your source files and save again. For persistent failures, run `truss push --watch` to start fresh.
## Limitations
Development deployments optimize for iteration, not production traffic:
* **Single replica**: Fixed at 0 minimum, 1 maximum. No autoscaling beyond one replica.
* **No gRPC**: Trusses with gRPC transport require a published deployment.
* **No TRT-LLM engine builds**: TRT-LLM build flow requires a published deployment.
See [Development deployments](/deployment/autoscaling/overview#development-deployments) for the full autoscaling constraints.
## Deploy to production
When you're done iterating, deploy a published version:
```sh theme={"system"}
truss push
```
By default, `truss push` creates a published deployment with full autoscaling support. Published deployments can scale to multiple replicas and are suitable for production traffic.
To deploy and promote directly to the production environment:
```sh theme={"system"}
truss push --promote
```
* Full list of options for the `truss push` command.
* Full list of options for the `truss watch` command.
* Configure replicas, concurrency targets, and scale-to-zero for production.
* Manage staging, production, and custom environments.
# Access model environments
Source: https://docs.baseten.co/development/model/environments
A guide to leveraging environments in your models
Model environments help configure behavior based on **deployment stage** (for example, production vs. staging). You can access the environment details via `kwargs` in the `Model` class.
## 1. Retrieve the environment details
Access the environment in `__init__`:
```python theme={"system"}
def __init__(self, **kwargs):
    self._environment = kwargs["environment"]
```
## 2. Configure behavior based on environment
Use the environment details in the `load` function:
```python theme={"system"}
def load(self):
    if self._environment.get("name") == "production":
        # Production setup
        self.setup_sentry()
        self.setup_logging(level="INFO")
        self.load_production_weights()
    else:
        # Default setup for staging or development deployments
        self.setup_logging(level="DEBUG")
        self.load_default_weights()
```
**Why use this?**
* **Customize logging levels**
* **Load environment-specific model weights**
* **Enable monitoring tools (for example, Sentry)**
# gRPC
Source: https://docs.baseten.co/development/model/grpc
Invoke your model over gRPC.
## Overview
gRPC is a high-performance, open-source remote procedure call (RPC) framework that uses HTTP/2 for transport and Protocol Buffers for serialization. Unlike traditional HTTP APIs, gRPC provides strong type safety, high performance, and built-in support for streaming and bidirectional communication.
**Why use gRPC with Baseten?**
* **Type safety**: Protocol Buffers ensure strong typing and contract validation between client and server
* **Ecosystem integration**: Easily integrate Baseten with existing gRPC-based services
* **Streaming support**: Built-in support for server streaming, client streaming, and bidirectional streaming
* **Language interoperability**: Generate client libraries for multiple programming languages from a single `.proto` file
## gRPC on Baseten
gRPC support in Baseten is implemented using [Custom Servers](/development/model/custom-server). Unlike standard Truss models that implement the `load()` and `predict()` methods, gRPC models run their own server process that handles gRPC requests directly.
This approach gives developers full control over the gRPC server implementation.
For this to work, you must first package your gRPC server code into a Docker image.
Once that is done, you can set up your Truss `config.yaml` to configure your deployment
and push the server to Baseten.
## Setup
### Installation
1. **Install Truss**:
```bash theme={"system"}
uv venv && source .venv/bin/activate
uv pip install --upgrade truss
```
```bash theme={"system"}
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade truss
```
```bash theme={"system"}
python -m venv .venv && .venv\Scripts\activate
pip install --upgrade truss
```
2. **Install Protocol Buffer compiler:**
```bash theme={"system"}
# On macOS
brew install protobuf
# On Ubuntu/Debian
sudo apt-get install protobuf-compiler
# On other systems, see: https://protobuf.dev/getting-started/
```
3. **Install gRPC tools:**
```bash theme={"system"}
uv pip install grpcio-tools
```
### Protocol Buffer Definition
Your gRPC service starts with a `.proto` file that defines the service interface and message types. Create an `example.proto` file in your project root:
```protobuf example.proto theme={"system"}
syntax = "proto3";

package example;

// The greeting service definition
service Greeter {
  // Sends a greeting
  rpc SayHello (HelloRequest) returns (HelloReply) {}
}

// The request message containing the user's name
message HelloRequest {
  string name = 1;
}

// The response message containing the greeting
message HelloReply {
  string message = 1;
}
```
#### Generate Protocol Buffer Code
Generate the Python code from your `.proto` file:
```bash theme={"system"}
python -m grpc_tools.protoc --python_out=. --grpc_python_out=. --proto_path . example.proto
```
This generates the necessary Python files (`example_pb2.py` and `example_pb2_grpc.py`) for your gRPC service. For more information about Protocol Buffers, see the [official documentation](https://protobuf.dev/).
### Model Implementation
Create your gRPC server implementation in a file called `model.py`. Here's a basic example:
```python model.py theme={"system"}
import grpc
from concurrent import futures
import time

import example_pb2
import example_pb2_grpc

from grpc_health.v1 import health_pb2
from grpc_health.v1 import health_pb2_grpc
from grpc_health.v1.health import HealthServicer


class GreeterServicer(example_pb2_grpc.GreeterServicer):
    def SayHello(self, request, context):
        response = example_pb2.HelloReply()
        response.message = f"Hello, {request.name}!"
        return response


def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    example_pb2_grpc.add_GreeterServicer_to_server(GreeterServicer(), server)

    # The gRPC health check service must be used in order for Baseten
    # to consider the gRPC server healthy.
    health_servicer = HealthServicer()
    health_pb2_grpc.add_HealthServicer_to_server(health_servicer, server)
    health_servicer.set(
        "example.Greeter", health_pb2.HealthCheckResponse.SERVING
    )

    # Ensure the server runs on port 50051
    server.add_insecure_port("[::]:50051")
    server.start()
    print("gRPC server started on port 50051")

    # Keep the server running
    try:
        while True:
            time.sleep(86400)
    except KeyboardInterrupt:
        print("Shutting down server...")
        server.stop(0)


if __name__ == "__main__":
    serve()
```
## Deployment
### Step 1: Create a Dockerfile
Since gRPC on Baseten requires a custom server setup, you'll need to create a `Dockerfile` that bundles your gRPC server code and dependencies. Here's a basic skeleton:
```dockerfile Dockerfile theme={"system"}
FROM debian:latest

RUN apt-get update && apt-get install -y \
    build-essential \
    python3 \
    python3-pip \
    python3-venv \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY model.py ./model.py
COPY example_pb2.py example_pb2_grpc.py ./

EXPOSE 50051

CMD ["python", "model.py"]
```
Create a `requirements.txt` file with your gRPC dependencies:
```txt requirements.txt theme={"system"}
grpcio
grpcio-health-checking
grpcio-tools
protobuf
```
### Step 2: Build and push Docker image
Build and push your Docker image to a container registry:
```bash theme={"system"}
docker build -t your-registry/truss-grpc-demo:latest . --platform linux/amd64
docker push your-registry/truss-grpc-demo:latest
```
Replace `your-registry` with your actual container registry (e.g., Docker Hub, Google Container Registry, AWS ECR). You can create a Docker Hub container registry by [following their documentation](https://docs.docker.com/get-started/docker-concepts/the-basics/what-is-a-registry/#try-it-out).
### Step 3: Configure your Truss
Update your `config.yaml` to use the custom Docker image and configure the gRPC server:
```yaml config.yaml theme={"system"}
model_name: "gRPC Model Example"
base_image:
  image: your-registry/truss-grpc-demo:latest
docker_server:
  start_command: python model.py
  # 50051 is the only supported server port.
  server_port: 50051
  # Note that the _endpoint fields are ignored for gRPC models.
  predict_endpoint: /
  readiness_endpoint: /
  liveness_endpoint: /
resources:
  accelerator: L4 # or your preferred GPU
  use_gpu: true
runtime:
  transport:
    kind: "grpc"
```
### Step 4: Deploy with Truss
Deploy your model using the Truss CLI. gRPC models aren't supported in development deployments, so use the default published deployment or `--promote` to also promote to production.
```bash theme={"system"}
truss push --promote
```
For more detailed information about Truss deployment, see the [truss push documentation](/reference/cli/truss/push).
## Calling your model
### Using a gRPC client
Once deployed, you can call your model using any gRPC client. Here's an example Python client:
```python client.py theme={"system"}
import grpc

import example_pb2
import example_pb2_grpc


def run():
    channel = grpc.secure_channel(
        "model-{MODEL_ID}.grpc.api.baseten.co:443",
        grpc.ssl_channel_credentials(),
    )
    stub = example_pb2_grpc.GreeterStub(channel)
    request = example_pb2.HelloRequest(name="World")
    metadata = [
        ("baseten-authorization", "Api-Key {API_KEY}"),
        ("baseten-model-id", "model-{MODEL_ID}"),
    ]
    response = stub.SayHello(request, metadata=metadata)
    print(response.message)


if __name__ == "__main__":
    run()
```
### Inference for specific environments and deployments
If you want to perform inference against a specific environment or deployment,
you can do so by adding headers to your gRPC calls:
**Target a specific environment:**
```python theme={"system"}
metadata = [
    ('authorization', 'Api-Key YOUR_API_KEY'),
    ('baseten-model-id', 'model-{YOUR_MODEL_ID}'),
    ('x-baseten-environment', 'staging'),
]
```
**Target a specific deployment ID:**
```python theme={"system"}
metadata = [
    ('authorization', 'Api-Key YOUR_API_KEY'),
    ('baseten-model-id', 'model-{YOUR_MODEL_ID}'),
    ('x-baseten-deployment', 'your-deployment-id'),
]
```
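The header patterns above can be wrapped in a small helper. This is a sketch, not part of any Baseten SDK; only the header names come from the examples above:

```python
def build_metadata(api_key, model_id, environment=None, deployment_id=None):
    # Base headers used on every call (from the examples above).
    metadata = [
        ("authorization", f"Api-Key {api_key}"),
        ("baseten-model-id", f"model-{model_id}"),
    ]
    # Optionally target a specific environment or deployment.
    if environment is not None:
        metadata.append(("x-baseten-environment", environment))
    if deployment_id is not None:
        metadata.append(("x-baseten-deployment", deployment_id))
    return metadata

print(build_metadata("YOUR_API_KEY", "abc123", environment="staging"))
```

Pass the result as the `metadata` argument to any stub call, e.g. `stub.SayHello(request, metadata=build_metadata(...))`.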
### Inference for regional environments
If your organization uses [regional environments](/deployment/environments#regional-environments), use the regional hostname as the gRPC target. The environment is derived from the hostname, so do not set `x-baseten-environment` or `x-baseten-deployment` headers.
```python theme={"system"}
channel = grpc.secure_channel(
    "model-{MODEL_ID}-{ENV_NAME}.grpc.api.baseten.co:443",
    grpc.ssl_channel_credentials(),
)
metadata = [
    ('baseten-authorization', 'Api-Key {API_KEY}'),
    ('baseten-model-id', 'model-{MODEL_ID}'),
]
```
### Testing your deployment
Run your client to test the deployed model:
```bash theme={"system"}
python client.py
```
## Full example
See this [Github repository](https://github.com/basetenlabs/truss-examples/tree/main/grpc) for a full example.
## Scaling and monitoring
### Scaling
While many gRPC requests follow the traditional request-response pattern, gRPC also supports bidirectional streaming and long-lived connections. As a result, a single long-lived connection counts against the deployment's concurrency target even when no data is being sent over it.
### Promotion
Just like with HTTP, you can promote a gRPC deployment to an environment via the REST API or UI. When a gRPC deployment is promoted, new connections are routed to the new deployment, but existing connections stay attached to the old deployment until they terminate. Depending on connection length, old gRPC deployments can therefore take longer to scale down than HTTP deployments.
## Monitoring
As with HTTP deployments, we offer metrics on the performance of gRPC deployments.
### Inference volume
Inference volume is tracked as the number of RPCs per minute. These metrics are published *after* each request completes. See [gRPC status codes](https://grpc.io/docs/guides/status-codes/) for a full list of codes.
### End-to-end response time
End-to-end response time is measured at several percentiles (p50, p90, p95, p99). It includes cold starts, queuing, and inference time, and excludes client-side latency, so it reflects real-world performance.
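To make the percentile labels concrete, here is how p50/p90/p95/p99 can be computed from a set of latency samples (the numbers below are made up for illustration):

```python
import statistics

# Hypothetical end-to-end latencies in milliseconds.
latencies = [12, 15, 14, 90, 18, 16, 250, 17, 13, 19]

# statistics.quantiles with n=100 yields 99 cut points;
# index p-1 approximates the p-th percentile.
cuts = statistics.quantiles(latencies, n=100)
for p in (50, 90, 95, 99):
    print(f"p{p}: {cuts[p - 1]:.1f} ms")
```

Note how the tail percentiles (p95, p99) are dominated by the slowest requests, which is why they are the ones most affected by cold starts and queuing.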
# Implementation
Source: https://docs.baseten.co/development/model/implementation
How to implement your model.
In this section, we'll cover how to implement the actual logic for your model.
As was mentioned in [Your First Model](/development/model/build-your-first-model), the
logic for the model itself is specified in a `model/model.py` file. To recap, the simplest
directory structure for a model is:
```
model/
  model.py
config.yaml
```
It's expected that the `model.py` file contains a class with particular methods:
```python model.py theme={"system"}
class Model:
    def __init__(self):
        pass

    def load(self):
        pass

    def predict(self, input_data):
        pass
```
* The `__init__` method is used to initialize the `Model` class, and allows you to read
in configuration parameters and other information.
* The `load` method is where you define the logic for initializing the model. This might
include downloading model weights, or loading them onto a GPU.
* The `predict` method is where you define the logic for inference.
In the next sections, we'll cover each of these methods in more detail.
## `__init__`
As mentioned above, the `__init__` method is used to initialize the `Model` class, and allows you to
read in configuration parameters and runtime information.
The simplest signature for `__init__` is:
```python model.py theme={"system"}
def __init__(self):
    pass
```
If you need more information, however, you can define your `__init__` method so that it accepts the following parameters:
```python model.py theme={"system"}
def __init__(self, config: dict, data_dir: str, secrets: dict, environment: str):
    pass
```
* `config`: A dictionary containing the config.yaml for the model.
* `data_dir`: A string containing the path to the data directory for the model.
* `secrets`: A dictionary containing the secrets for the model. Note that at runtime,
these will be populated with the actual values as stored on Baseten.
* `environment`: A string containing the environment for the model, if the model has been
deployed to an environment.
You can then make use of these parameters in the rest of your model by saving them as attributes:
```python model.py theme={"system"}
def __init__(self, config: dict, data_dir: str, secrets: dict, environment: str):
    self._config = config
    self._data_dir = data_dir
    self._secrets = secrets
    self._environment = environment
```
## load
The `load` method is where you define the logic for initializing the model. As
mentioned before, this might include downloading model weights or loading them
onto the GPU.
`load`, unlike the other methods mentioned, does not accept any parameters:
```python model.py theme={"system"}
def load(self):
    pass
```
After deploying your model, the deployment will not be considered "Ready" until `load` has completed successfully. Note that `load` has a **timeout of 30 minutes**; if it hasn't completed by then, the deployment is marked as failed.
## predict
The `predict` method is where you define the logic for performing inference.
The simplest signature for `predict` is:
```python model.py theme={"system"}
def predict(self, input_data) -> str:
    return "Hello"
```
The return type of `predict` must be JSON-serializable, so it can be:
* `dict`
* `list`
* `str`
If you would like to return a more strictly typed object, you can return a Pydantic model:
```python model.py theme={"system"}
from pydantic import BaseModel


class Result(BaseModel):
    value: str
```
You can then return an instance of this model from `predict`:
```python model.py theme={"system"}
def predict(self, input_data) -> Result:
    return Result(value="Hello")
```
### Streaming
In addition to supporting a single request/response cycle, Truss also supports streaming.
See the [Streaming](/development/model/streaming) guide for more information.
### Async vs. Sync
Note that the `predict` method is synchronous by default. However, if your model inference depends on APIs that require `asyncio`, `predict` can also be written as a coroutine.
```python model.py theme={"system"}
import asyncio

async def predict(self, input_data) -> dict:
    # Async logic here
    await asyncio.sleep(1)
    return {"value": "Hello"}
```
If you are using `asyncio` in your `predict` method, be sure not to perform any blocking
operations, such as a synchronous file download. This can result in degraded performance.
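One common way to keep a coroutine-based `predict` from blocking the event loop is `asyncio.to_thread`. A minimal sketch, written as a standalone function (no `self`) for brevity; `blocking_download` is a stand-in for any synchronous call:

```python
import asyncio
import time

def blocking_download() -> str:
    # Stand-in for a synchronous operation, e.g. a file download.
    time.sleep(0.1)
    return "weights"

async def predict(input_data) -> dict:
    # Offload the blocking call to a worker thread so the
    # event loop can keep serving other requests.
    result = await asyncio.to_thread(blocking_download)
    return {"value": result}

print(asyncio.run(predict({})))
```

The same pattern works inside a `Model` method: wrap any synchronous I/O in `asyncio.to_thread` rather than calling it directly from the coroutine.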
# Cached weights
Source: https://docs.baseten.co/development/model/model-cache
Accelerate cold starts and availability by prefetching and caching your weights.
### Migrating to `weights` (Recommended for most use cases)
`model_cache` is superseded by the new [BDN (Baseten Delivery Network)](/development/model/bdn), which offers faster cold starts through multi-tier caching (in-cluster + node-level).
Use `truss migrate` to automatically convert your configuration:
```bash theme={"system"}
truss migrate
```
See [Baseten Delivery Network (BDN)](/development/model/bdn) for the new approach.
**When `model_cache` may still be needed:**
* Quantization workflows where you need to process weights after download
* Custom download timing via `lazy_data_resolver.block_until_download_complete()`
* Prototyping and iterating using direct downloads.
### What is a "cold start"?
"Cold start" describes the time between a request arriving at a model that is scaled to zero and the model being ready to handle that first request. Cold starts are a critical factor in keeping your deployments responsive to traffic while maintaining your SLAs and lowering your costs.
To optimize cold starts, we cover the following strategies: downloading weights in a background Rust thread that runs during module import, caching weights in a distributed filesystem, and baking weights into the Docker image.
In practice, this reduces the cold start for large models to just a few seconds. For example, Stable Diffusion XL can take a few minutes to boot up without caching. With caching, it takes just under 10 seconds.
## Enabling prefetching for a model
To enable caching, simply add `model_cache` to your `config.yaml` with a valid `repo_id`. The `model_cache` has a few key configurations:
* `repo_id` (required): The repo name from Hugging Face or bucket/container from GCS, S3, or Azure.
* `revision` (required for Hugging Face): The revision of the Hugging Face repo, such as a commit sha or a branch name like `main` or `refs/pr/1`. Not needed for GCS, S3, or Azure.
* `use_volume`: Boolean flag to determine if the weights are downloaded to the Baseten Distributed Filesystem at runtime (recommended) or bundled into the container image (legacy, not recommended).
* `volume_folder`: string, folder name under which the model weights appear. Setting it to `my-llama-model` will mount the repo to `/app/model_cache/my-llama-model` at runtime.
* `allow_patterns`: Only cache files that match specified patterns. Utilize Unix shell-style wildcards to denote these patterns.
* `ignore_patterns`: Conversely, you can also denote file patterns to ignore, hence streamlining the caching process.
* `runtime_secret_name`: The name of your secret containing the credentials for a private repository or bucket, such as a `hf_access_token` or `gcs_service_account`.
* `kind`: The storage provider type for the model weights.
* `"hf"` (default): Hugging Face
* `"gcs"`: Google Cloud Storage
* `"s3"`: AWS S3
* `"azure"`: Azure Blob Storage
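`allow_patterns` and `ignore_patterns` use Unix shell-style wildcards. You can preview which files a pattern list would keep with Python's `fnmatch`; the filenames below are illustrative, and the actual matching happens on Baseten's side when the cache is resolved:

```python
from fnmatch import fnmatch

# Illustrative repo contents; model_cache uses the same
# Unix shell-style wildcard semantics for its patterns.
files = ["config.json", "model.fp16.safetensors", "model.bin", "model.onnx"]
allow_patterns = ["*.json", "*.fp16.safetensors"]

kept = [f for f in files if any(fnmatch(f, p) for p in allow_patterns)]
print(kept)  # ['config.json', 'model.fp16.safetensors']
```

Checking your patterns this way before pushing helps avoid accidentally caching duplicate weight formats.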
Here is an example of a well-written `model_cache` config for Stable Diffusion XL. Note how it pulls only the model weights it needs using `allow_patterns`.
```yaml config.yaml theme={"system"}
model_cache:
  - repo_id: madebyollin/sdxl-vae-fp16-fix
    revision: 207b116dae70ace3637169f1ddd2434b91b3a8cd
    use_volume: true
    volume_folder: sdxl-vae-fp16
    allow_patterns:
      - config.json
      - diffusion_pytorch_model.safetensors
  - repo_id: stabilityai/stable-diffusion-xl-base-1.0
    revision: 462165984030d82259a11f4367a4eed129e94a7b
    use_volume: true
    volume_folder: stable-diffusion-xl-base
    allow_patterns:
      - "*.json"
      - "*.fp16.safetensors"
      - sd_xl_base_1.0.safetensors
  - repo_id: stabilityai/stable-diffusion-xl-refiner-1.0
    revision: 5d4cfe854c9a9a87939ff3653551c2b3c99a4356
    use_volume: true
    volume_folder: stable-diffusion-xl-refiner
    allow_patterns:
      - "*.json"
      - "*.fp16.safetensors"
      - sd_xl_refiner_1.0.safetensors
```
Many Hugging Face repos have model weights in different formats (`.bin`, `.safetensors`, `.h5`, `.msgpack`, etc.). You only need one of these most of the time. To minimize cold starts, ensure that you only cache the weights you need.
### What is weight "pre-fetching"?
With `model_cache`, weights are pre-fetched in a dedicated Rust thread, ahead of when they are needed. This means you can perform all kinds of preparation work (importing libraries, JIT compilation of torch/triton modules) until you need access to the files. In practice, statements like `import tensorrt_llm` typically take 10–15 seconds to execute; by that point, the first 5–10 GB of the weights will already have been downloaded.
To use the `model_cache` config with Truss, you must actively interact with the `lazy_data_resolver`. Before using any of the downloaded files, call `lazy_data_resolver.block_until_download_complete()`; this blocks until all files in the `/app/model_cache` directory are downloaded and ready to use. The call must be part of your `__init__` or `load` implementation.
```python model.py theme={"system"}
# <- download is invoked before here.
import torch  # this line usually takes 2-5 seconds.
import tensorrt_llm  # this line usually takes 10-15 seconds
import onnxruntime  # this line usually takes 5-10 seconds


class Model:
    """example usage of `model_cache` in truss"""

    def __init__(self, *args, **kwargs):
        # `lazy_data_resolver` is passed as keyword-argument in init
        self._lazy_data_resolver = kwargs["lazy_data_resolver"]

    def load(self):
        # work that does not require the download may be done beforehand
        random_vector = torch.randn(1000)
        # important to collect the download before using any incomplete data
        self._lazy_data_resolver.block_until_download_complete()
        # after the call, you may use the /app/model_cache directory and its contents
        torch.load(
            "/app/model_cache/stable-diffusion-xl-base/model.fp16.safetensors"
        )
```
## Private repositories/cloud storage
### Private Hugging Face repositories 🤗
For any public Hugging Face repo, you don't need to do anything else. Adding the `model_cache` key with an appropriate `repo_id` should be enough.
However, if you want to deploy a model from a gated repo like [Gemma](https://huggingface.co/google/gemma-3-27b-it) to Baseten, there are a few steps you need to take:
[Grab an API key](https://huggingface.co/settings/tokens) from Hugging Face with `read` access. Make sure you have access to the model you want to serve.
Paste your API key in your [secrets manager in Baseten](https://app.baseten.co/settings/secrets) under the specified key, such as `hf_access_token`. You can read more about secrets [here](/development/model/secrets).
In your Truss's `config.yaml`, add the secret key under `runtime_secret_name`:
```yaml config.yaml theme={"system"}
model_cache:
  - repo_id: your-org/your-private-repo
    revision: main # refs/pr/1
    runtime_secret_name: hf_access_token
```
Once your truss is pushed, we resolve the sha behind your branch (main), and protect the deployment against changes on this branch.
If you run into any issues, run through all the steps above again and make sure you did not misspell the name of the repo or secret name, or paste an incorrect API key.
### Private GCS buckets
If you want to deploy a model from a private GCS bucket to Baseten, there are a few steps you need to take:
Create a [service account key](https://cloud.google.com/iam/docs/keys-create-delete#creating) in your GCS account for the project which contains the model weights.
Paste the contents of the `service_account.json` in your [secrets manager in Baseten](https://app.baseten.co/settings/secrets) under the specified key, for example, `gcs_service_account`. You can read more about secrets [here](/development/model/secrets).
At a minimum, you should have these credentials:
```json gcs_service_account theme={"system"}
{
  "private_key_id": "xxxxxxx",
  "private_key": "-----BEGIN PRIVATE KEY-----\nMI",
  "client_email": "b10-some@xxx-example.iam.gserviceaccount.com"
}
```
In your Truss's `config.yaml`, make sure to add the `runtime_secret_name` to your `model_cache` matching the above secret name:
```yaml config.yaml theme={"system"}
model_cache:
  - repo_id: gs://your-private-bucket
    use_volume: true
    volume_folder: your-model-weights
    runtime_secret_name: gcs_service_account
    kind: "gcs"
    ignore_patterns: "*.protobuf"
```
Note: S3/Azure/GCS Buckets are immutable. Once the truss is pushed, you may no longer delete or modify files as they are referenced as required files for a model startup.
It's easy to make a mistake in any of these steps. If you run into issues, you're encouraged to go through the steps again just in case. Please contact [Baseten support](mailto:support@baseten.co) if you continue to experience issues.
### Private S3 buckets
If you want to deploy a model from a private S3 bucket to Baseten, there are a few steps you need to take:
[Get your `aws_access_key_id` and `aws_secret_access_key`](https://aws.amazon.com/blogs/security/how-to-find-update-access-keys-password-mfa-aws-management-console/) in your AWS account for the bucket that contains the model weights.
Paste the following `json` in your [secrets manager in Baseten](https://app.baseten.co/settings/secrets) under the specified key, for example, `aws_secret_json`. You can read more about secrets [here](/development/model/secrets).
```json aws_secret_json theme={"system"}
{
  "aws_access_key_id": "XXXXX",
  "aws_secret_access_key": "xxxxx/xxxxxx",
  "aws_region": "us-west-2"
}
```
In your Truss's `config.yaml`, make sure to add the `runtime_secret_name` to your `model_cache` matching the above secret name:
```yaml config.yaml theme={"system"}
model_cache:
  - repo_id: s3://your-bucket-west-2-name/path/to/model/
    use_volume: true
    volume_folder: your-model-weights # sync of s3 path/to/model/* to /app/model_cache/your-model-weights/*
    runtime_secret_name: aws_secret_json
    kind: "s3"
    ignore_patterns: "*.protobuf"
```
Note: S3/Azure/GCS Buckets are immutable. Once the truss is pushed, you may no longer delete or modify files as they are referenced as required files for a model startup.
It's easy to make a mistake in any of these steps. If you run into issues, you're encouraged to go through the steps again just in case. Please contact [Baseten support](mailto:support@baseten.co) if you continue to experience issues.
### Private Azure containers
If you want to deploy a model from a private Azure container to Baseten, there are a few steps you need to take:
[Get your `account_key`](https://learn.microsoft.com/en-us/azure/storage/common/storage-account-get-info?tabs=portal#get-a-connection-string-for-the-storage-account) in your Azure account for the container that holds the model weights.
Paste the following `json` in your [secrets manager in Baseten](https://app.baseten.co/settings/secrets) under the specified key, for example, `azure_secret_json`. You can read more about secrets [here](/development/model/secrets).
```json azure_secret_json theme={"system"}
{
  "account_key": "xxxxx"
}
```
In your Truss's `config.yaml`, make sure to add the `runtime_secret_name` to your `model_cache` matching the above secret name:
```yaml config.yaml theme={"system"}
model_cache:
  - repo_id: az://your-private-container/path/to/model/
    use_volume: true
    volume_folder: your-model-weights
    runtime_secret_name: azure_secret_json
    kind: "azure"
    ignore_patterns: "*.protobuf"
```
Note: S3/Azure/GCS Buckets are immutable. Once the truss is pushed, you may no longer delete or modify files as they are referenced as required files for a model startup.
It's easy to make a mistake in any of these steps. If you run into issues, you're encouraged to go through the steps again just in case. Please contact [Baseten support](mailto:support@baseten.co) if you continue to experience issues.
## `model_cache` within Chains
To use `model_cache` with [chains](/development/chain/getting-started), use the `Assets` specifier. In the example below, we download `Llama-3.2-1B`.
As this is a gated Hugging Face model, we set the access token as part of the assets: `chains.Assets(..., secret_keys=["hf_access_token"])`.
The model is quite small; in many cases, it will finish downloading while `from transformers import pipeline` and `import torch` are still running.
```python chain_cache.py theme={"system"}
import random

import truss_chains as chains

try:
    # imports on global level for PoemGeneratorLM, to save time during the download.
    from transformers import pipeline
    import torch
except ImportError:
    # RandInt does not have these dependencies.
    pass


class RandInt(chains.ChainletBase):
    async def run_remote(self, max_value: int) -> int:
        return random.randint(1, max_value)


@chains.mark_entrypoint
class PoemGeneratorLM(chains.ChainletBase):
    from truss import truss_config

    LLAMA_CACHE = truss_config.ModelRepo(
        repo_id="meta-llama/Llama-3.2-1B-Instruct",
        revision="c4219cc9e642e492fd0219283fa3c674804bb8ed",
        use_volume=True,
        volume_folder="llama_mini",
        ignore_patterns=["*.pth", "*.onnx"],
    )
    remote_config = chains.RemoteConfig(
        docker_image=chains.DockerImage(
            # The model needs some extra python packages.
            pip_requirements=[
                "transformers==4.48.0",
                "torch==2.6.0",
            ]
        ),
        # The model needs a GPU; for more CPUs use e.g.
        # chains.Compute(cpu_count=2, gpu="L4").
        compute=chains.Compute(gpu="L4"),
        # Cache the model weights via model_cache.
        assets=chains.Assets(cached=[LLAMA_CACHE], secret_keys=["hf_access_token"]),
    )

    # <- Download happens before __init__ is called.
    def __init__(self, rand_int=chains.depends(RandInt, retries=3)) -> None:
        self._rand_int = rand_int
        print("loading cached llama_mini model")
        self.pipeline = pipeline(
            "text-generation",
            model="/app/model_cache/llama_mini",
        )

    async def run_remote(self, max_value: int = 3) -> str:
        num_repetitions = await self._rand_int.run_remote(max_value)
        print("writing poem with num_repetitions", num_repetitions)
        poem = str(
            self.pipeline(
                text_inputs="Write a beautiful and descriptive poem about the ocean. Focus on its vastness, movement, and colors.",
                max_new_tokens=150,
                do_sample=True,
                return_full_text=False,
                temperature=0.7,
                top_p=0.9,
            )[0]["generated_text"]
        )
        return poem * num_repetitions
```
## `model_cache` for custom servers
If you are not using a Python `model.py` but instead a [custom server](/development/model/custom-server) such as [vLLM](/examples/vllm), TEI, or [SGLang](/examples/sglang), you must run the `truss-transfer-cli` command to populate the `/app/model_cache` location. The command blocks until the weights are downloaded.
Here is an example that uses text-embeddings-inference on an L4 to populate a Jina embeddings model from Hugging Face into the `model_cache`.
```yaml config.yaml theme={"system"}
base_image:
  image: baseten/text-embeddings-inference-mirror:89-1.6
docker_server:
  liveness_endpoint: /health
  predict_endpoint: /v1/embeddings
  readiness_endpoint: /health
  server_port: 7997
  # using `truss-transfer-cli` to download the weights into the model_cache volume
  start_command: bash -c "truss-transfer-cli && text-embeddings-router --port 7997
    --model-id /app/model_cache/my_jina --max-client-batch-size 128 --max-concurrent-requests
    128 --max-batch-tokens 16384 --auto-truncate"
model_cache:
  - repo_id: jinaai/jina-embeddings-v2-base-code
    revision: 516f4baf13dec4ddddda8631e019b5737c8bc250
    use_volume: true
    volume_folder: my_jina
    ignore_patterns: ["*.onnx"]
model_metadata:
  example_model_input:
    encoding_format: float
    input: text string
    model: model
model_name: TEI-jinaai-jina-embeddings-v2-base-code-truss-example
resources:
  accelerator: L4
```
# Developing a model on Baseten
Source: https://docs.baseten.co/development/model/overview
This page introduces the key concepts and workflow you'll use to package, configure, and iterate on models using Baseten's developer tooling.
Baseten makes it easy to go from a trained machine learning model to a fully-deployed, production-ready API. You'll use Truss, our open-source model packaging tool, to containerize your model code and configuration, and ship it to Baseten for deployment, testing, and scaling.
## What does it mean to develop a model?
In Baseten, developing a model means:
1. [Packaging your model code and weights](/development/model/implementation):
Wrap your trained model into a structured project that includes your inference logic and dependencies.
2. [Configuring the model environment](/development/model/configuration):
Define everything needed to run your model: Python packages, system dependencies, and secrets.
3. [Deploying and iterating quickly](/development/model/deploy-and-iterate):
Push your model to Baseten and iterate with live edits using `truss push --watch`.
Once your model works the way you want, you can promote it to [production](/deployment/environments), ready for live traffic.
## Development flow on Baseten
Here's what the typical model development loop looks like:
1. **Initialize a new model project** using the Truss CLI.
2. **Add your model logic** to a Python class (`model.py`), specifying how to load and run inference.
3. **Configure dependencies** in a YAML or Python config.
4. **Deploy the model** using `truss push` for a published deployment, or `truss push --watch` for development mode.
5. **Iterate fast** with `truss push --watch` or `truss watch` to live-reload your dev deployment as you make changes.
6. **Test and tune** the model until it's production-ready.
7. **Promote the model** to production when you're ready to scale.
**Note:** Truss runs your model in a standardized container without needing
Docker installed locally. It also gives you a fast developer loop and a
consistent way to configure and serve models.
## What is Truss?
Truss is the tool you use to:
* **Scaffold a new model project**
* **Serve models locally or in the cloud**
* **Package your code, config, and model files**
* **Push to Baseten for deployment**
You can think of it as the developer toolkit for building and managing model servers, built specifically for machine learning workflows.
With Truss, you can create a containerized model server **without needing to learn Docker**, and define everything about how your model runs: Python and system packages, GPU settings, environment variables, and custom inference logic. It gives you a fast, reproducible dev loop: test changes locally or in a remote environment that mirrors production.
Truss is **flexible enough to support a wide range of ML stacks**, including:
* Model frameworks like PyTorch, transformers, and diffusers
* [Inference engines](/development/model/performance-optimization) like TensorRT-LLM, SGLang, vLLM
* Serving technologies like Triton
* Any package installable with `pip` or `apt`
We'll use Truss throughout this guide, but the focus will stay on **how you develop models**, not just how Truss works.
## From model to server: the key components
When you develop a model on Baseten, you define:
* A `Model` **class**: This is where your model is loaded, inputs are preprocessed, inference runs, and results are returned.
* A **configuration file** (`config.yaml` or Python config): Defines the runtime environment, dependencies, and deployment settings.
* Optional **extra assets**, like model weights, secrets, or external packages.
These components together form a **Truss**, which is what you deploy to Baseten.
Truss simplifies and standardizes model packaging for seamless deployment. It encapsulates model code, dependencies, and configurations into a **portable, reproducible structure**, enabling efficient development, scaling, and optimization.
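These pieces can be pictured with a minimal sketch. The "model" below is a toy stub rather than a real framework call, and the method names follow the examples used throughout this guide:

```python
class Model:
    """Minimal Truss-style model: load() runs once per replica, predict() per request."""

    def __init__(self, **kwargs):
        # Truss passes configuration and secrets as keyword arguments.
        self._model = None

    def load(self):
        # Load weights or pipelines here. Stubbed with a toy "model".
        self._model = lambda text: text.upper()

    def predict(self, model_input):
        # model_input is the deserialized JSON payload of the request.
        return {"output": self._model(model_input["text"])}

model = Model()
model.load()
result = model.predict({"text": "hello"})
```

Paired with a `config.yaml` describing its environment, a class like this is everything Baseten needs to build and serve a model container.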
## Development vs. published deployments
By default, `truss push` creates a **published deployment**, which is stable, autoscaled, and ready for live traffic.
* **Published deployment** (`truss push`)
Stable, autoscaled, and ready for live traffic but **doesn't support live-reloading**.
* **Development deployment** (`truss push --watch`)
Meant for iteration and testing. It supports [live-reloading](/development/model/deploy-and-iterate#truss-watch) for quick feedback loops and will only scale to **one replica**, no autoscaling.
Use development mode to build and test, then deploy a published version with `truss push` when you're satisfied.
# Performance optimization
Source: https://docs.baseten.co/development/model/performance-optimization
Optimize model latency, throughput, and cost with Baseten engines
Model performance means optimizing every layer of your model serving infrastructure to balance four goals:
1. **Latency**: How quickly does each user get output from the model?
2. **Throughput**: How many requests can the deployment handle at once?
3. **Cost**: How much does a standardized unit of work cost?
4. **Quality**: Does your model consistently deliver high-quality output after optimization?
## Performance engines
Baseten's performance-optimized engines deliver the best possible inference speed and efficiency:
### **[Engine-Builder-LLM](/engines/engine-builder-llm/overview)** - Dense Models
* **Best for**: Llama, Mistral, Qwen, and other causal language models
* **Features**: TensorRT-LLM optimization, lookahead decoding, quantization
* **Performance**: Lowest latency and highest throughput for dense models
### **[BIS-LLM](/engines/bis-llm/overview)** - MoE Models
* **Best for**: DeepSeek, Mixtral, and other mixture-of-experts models
* **Features**: V2 inference stack, expert routing, structured outputs
* **Performance**: Optimized for large-scale MoE inference
### **[BEI](/engines/bei/overview)** - Embedding Models
* **Best for**: Sentence transformers, rerankers, classification models
* **Features**: OpenAI-compatible, high-performance embeddings
* **Performance**: Fastest embedding inference with optimized batching
## Performance concepts
Detailed performance optimization guides are now organized in the **[Performance Concepts](/engines/performance-concepts/quantization-guide)** section:
* **[Quantization Guide](/engines/performance-concepts/quantization-guide)** - FP8/FP4 trade-offs and hardware requirements
* **[Structured Outputs](/engines/performance-concepts/structured-outputs)** - JSON schema validation and controlled generation
* **[Function Calling](/engines/performance-concepts/function-calling)** - Tool use and function selection
* **[Performance Client](/engines/performance-concepts/performance-client)** - High-throughput client library
* **[Deployment Guide](/engines/performance-concepts/deployment-from-training-and-s3)** - Training checkpoints and cloud storage
## Quick performance wins
### **Quantization**
Reduce memory usage and improve speed with post-training quantization:
```yaml theme={"system"}
trt_llm:
  build:
    quantization_type: fp8 # 50% memory reduction
```
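The "50% memory reduction" follows directly from bytes per weight: FP16 stores each parameter in 2 bytes, FP8 in 1. A rough sizing sketch (illustrative parameter count; activations and KV cache are ignored):

```python
def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate GPU memory needed for model weights, in GiB."""
    return n_params * bytes_per_param / 1024**3

n = 7e9  # e.g. a 7B-parameter model
fp16_gib = weight_memory_gib(n, 2)  # ~13 GiB
fp8_gib = weight_memory_gib(n, 1)   # ~6.5 GiB, i.e. half
```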
### **Lookahead decoding**
Accelerate inference for predictable content (code, JSON):
```yaml theme={"system"}
trt_llm:
  build:
    speculator:
      speculative_decoding_mode: LOOKAHEAD_DECODING
      lookahead_windows_size: 5
```
### **Performance client**
Maximize client-side throughput with a Rust-based client:
```bash theme={"system"}
uv pip install baseten-performance-client
```
## Where to start
1. **Choose your engine**: [Engines overview](/engines)
2. **Configure your model**: Engine-specific configuration guides
3. **Optimize performance**: [Performance concepts](/engines/performance-concepts/quantization-guide)
4. **Deploy and monitor**: Use [performance client](/engines/performance-concepts/performance-client) for maximum throughput
***
Start with the default engine configuration, then apply quantization and other optimizations based on your specific performance requirements.
# Private Docker registries
Source: https://docs.baseten.co/development/model/private-registries
A guide to configuring a private container registry for your Truss
Truss uses containerized environments to ensure consistent model execution
across deployments. When deploying a custom base image or a custom server from a
private registry, you'll need to grant Baseten access to download that image.
## AWS Elastic Container Registry (ECR)
AWS supports [OIDC](/organization/oidc),
[service accounts](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html),
and
[access tokens](https://docs.aws.amazon.com/AmazonECR/latest/userguide/registry_auth.html#registry-auth-token)
for container registry authentication.
### AWS OIDC (Recommended)
OIDC provides short-lived, narrowly scoped tokens for secure authentication without managing long-lived credentials.
1. [Configure AWS to trust the Baseten OIDC provider](/organization/oidc#aws-setup) and create an IAM role with ECR permissions.
2. Add the OIDC configuration to your `config.yaml`:
```yaml config.yaml theme={"system"}
base_image:
  image: <account-id>.dkr.ecr.<region>.amazonaws.com/path/to/image
  docker_auth:
    auth_method: AWS_OIDC
    aws_oidc_role_arn: arn:aws:iam::<account-id>:role/baseten-ecr-access
    aws_oidc_region: <region>
    registry: <account-id>.dkr.ecr.<region>.amazonaws.com
```
No secrets needed! The `aws_oidc_role_arn` and `aws_oidc_region` are not sensitive and can be committed to your repository.
See the [OIDC authentication guide](/organization/oidc) for detailed setup instructions and best practices.
### AWS IAM Service accounts
To use an IAM service account for long-lived access, use the `AWS_IAM`
authentication method in Truss.
1. Get an `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` from the AWS dashboard.
2. Add these as [secrets](https://app.baseten.co/settings/secrets) in Baseten. Name them `aws_access_key_id` and `aws_secret_access_key`.
3. Choose the `AWS_IAM` authentication method when setting up your Truss. The `config.yaml` file should look like this:
```yaml config.yaml theme={"system"}
...
base_image:
  image: <account-id>.dkr.ecr.<region>.amazonaws.com/path/to/image
  docker_auth:
    auth_method: AWS_IAM
    registry: <account-id>.dkr.ecr.<region>.amazonaws.com
secrets:
  aws_access_key_id: null
  aws_secret_access_key: null
...
```
Specify the registry and image separately.
To use different secret names, configure the
`aws_access_key_id_secret_name` and `aws_secret_access_key_secret_name` options
under `docker_auth`:
```yaml theme={"system"}
...
base_image:
  ...
  docker_auth:
    auth_method: AWS_IAM
    registry: <account-id>.dkr.ecr.<region>.amazonaws.com
    aws_access_key_id_secret_name: custom_aws_access_key_secret
    aws_secret_access_key_secret_name: custom_aws_secret_key_secret
secrets:
  custom_aws_access_key_secret: null
  custom_aws_secret_key_secret: null
```
### Access Token
1. Get the **Base64-encoded** secret:
```sh theme={"system"}
PASSWORD=$(aws ecr get-login-password --region <region>)
echo -n "AWS:$PASSWORD" | base64
```
2. Add a new [secret](https://app.baseten.co/settings/secrets) to Baseten named `DOCKER_REGISTRY_<account-id>.dkr.ecr.<region>.amazonaws.com` with the Base64-encoded secret as the value.
3. Add the secret name to the `secrets` section of `config.yaml` to allow this model to access the secret when pushed.
```yaml config.yaml theme={"system"}
secrets:
  DOCKER_REGISTRY_<account-id>.dkr.ecr.<region>.amazonaws.com: null
```
## Google Cloud Artifact Registry
GCP supports [OIDC](/organization/oidc), [service accounts](https://cloud.google.com/iam/docs/service-account-overview), and [access tokens](https://cloud.google.com/artifact-registry/docs/docker/authentication#token) for container registry authentication.
This method also works with Google Container Registry (`gcr.io`, `<region>.gcr.io`).
### GCP OIDC (Recommended)
OIDC provides short-lived, narrowly scoped tokens for secure authentication without managing long-lived credentials.
1. [Configure GCP Workload Identity](/organization/oidc#google-cloud-setup) to trust the Baseten OIDC provider and grant Artifact Registry permissions.
2. Add the OIDC configuration to your `config.yaml`:
```yaml config.yaml theme={"system"}
base_image:
  image: gcr.io/my-project/my-image:latest
  docker_auth:
    auth_method: GCP_OIDC
    gcp_oidc_service_account: baseten-oidc@my-project.iam.gserviceaccount.com
    gcp_oidc_workload_id_provider: projects/<project-number>/locations/global/workloadIdentityPools/baseten-pool/providers/baseten-provider
    registry: gcr.io
```
No secrets needed! The service account and workload identity provider are not sensitive and can be committed to your repository.
See the [OIDC authentication guide](/organization/oidc) for detailed setup instructions and best practices.
### Service Account
1. Get your [service account key](https://cloud.google.com/artifact-registry/docs/docker/authentication#json-key) as a JSON key blob.
2. Add a new [secret](https://app.baseten.co/settings/secrets) to Baseten named `gcp-service-account` (or similar) with the JSON key blob as the value.
3. Add the secret name to the `secrets` section of `config.yaml` to allow this model to access the secret when pushed.
```yaml config.yaml theme={"system"}
secrets:
  gcp-service-account: null
```
4. Configure the `docker_auth` section of your `base_image` to use service account authentication:
```yaml theme={"system"}
base_image:
  ...
  docker_auth:
    auth_method: GCP_SERVICE_ACCOUNT_JSON
    secret_name: gcp-service-account
    registry: <region>-docker.pkg.dev
```
`secret_name` must match the secret you created in step 2.
### Access Token
1. Get your [access token](https://cloud.google.com/artifact-registry/docs/docker/authentication#token).
2. Add a new [secret](https://app.baseten.co/settings/secrets) to Baseten named `DOCKER_REGISTRY_<region>-docker.pkg.dev` with the Base64-encoded secret as the value.
3. Add the secret name to the `secrets` section of `config.yaml` to allow this model to access the secret when pushed.
```yaml config.yaml theme={"system"}
secrets:
  DOCKER_REGISTRY_<region>-docker.pkg.dev: null
```
## Docker Hub
1. Get the **Base64-encoded** secret:
```sh theme={"system"}
echo -n 'username:password' | base64
```
2. Add a new [secret](https://app.baseten.co/settings/secrets) to Baseten named `DOCKER_REGISTRY_https://index.docker.io/v1/` with the Base64-encoded secret as the value.
```yaml theme={"system"}
Name: DOCKER_REGISTRY_https://index.docker.io/v1/
Token: <base64-encoded secret>
```
3. Add the secret name to the `secrets` section of `config.yaml`:
```yaml config.yaml theme={"system"}
secrets:
  DOCKER_REGISTRY_https://index.docker.io/v1/: null
```
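The same token can be produced in Python, which is handy when scripting secret creation. The username and password here are placeholders:

```python
import base64

def registry_token(username: str, password: str) -> str:
    """Base64-encode `username:password`, matching the shell command above."""
    return base64.b64encode(f"{username}:{password}".encode()).decode()

token = registry_token("username", "password")
```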
## GitHub Container Registry (GHCR)
1. Create a GitHub [Personal Access Token](https://github.com/settings/tokens) with the `read:packages` scope. Use a **classic** token, not fine-grained.
2. Get the **Base64-encoded** secret:
```sh theme={"system"}
echo -n 'github_username:ghp_your_personal_access_token' | base64
```
3. Add a new [secret](https://app.baseten.co/settings/secrets) to Baseten named `DOCKER_REGISTRY_ghcr.io` with the Base64-encoded secret as the value.
```yaml theme={"system"}
Name: DOCKER_REGISTRY_ghcr.io
Token: <base64-encoded secret>
```
4. Add the secret name to the `secrets` section of `config.yaml`:
```yaml config.yaml theme={"system"}
base_image:
  image: ghcr.io/your-org/your-image:tag
secrets:
  DOCKER_REGISTRY_ghcr.io: null
```
## NVIDIA NGC
1. Generate an [NGC API Key](https://org.ngc.nvidia.com/setup/api-key) from your NVIDIA NGC account.
2. Get the **Base64-encoded** secret:
```sh theme={"system"}
echo -n '$oauthtoken:your_ngc_api_key' | base64
```
The username `$oauthtoken` is a literal string, not a variable. Use it exactly as shown.
3. Add a new [secret](https://app.baseten.co/settings/secrets) to Baseten named `DOCKER_REGISTRY_nvcr.io` with the Base64-encoded secret as the value.
```yaml theme={"system"}
Name: DOCKER_REGISTRY_nvcr.io
Token: <base64-encoded secret>
```
4. Add the secret name to the `secrets` section of `config.yaml`:
```yaml config.yaml theme={"system"}
base_image:
  image: nvcr.io/nvidia/pytorch:24.01-py3
secrets:
  DOCKER_REGISTRY_nvcr.io: null
```
# Using request objects / cancellation
Source: https://docs.baseten.co/development/model/requests
Get more control by directly using the request object.
Truss processes client requests by extracting and validating payloads. For **advanced use cases**, you can access the raw request object to:
* **Customize payload deserialization** (for example, binary protocol buffers).
* **Handle disconnections and cancel long-running predictions.**
You can mix request objects with standard inputs or use requests exclusively for performance optimization.
## Using request objects in Truss
You can define request objects in `preprocess`, `predict`, and `postprocess`:
```python theme={"system"}
import fastapi

class Model:
    def preprocess(self, request: fastapi.Request):
        ...

    def predict(self, inputs, request: fastapi.Request):
        ...

    def postprocess(self, inputs, request: fastapi.Request):
        ...
```
### Rules for using requests
* The request must be **type-annotated** as `fastapi.Request`.
* If **only** requests are used, Truss **skips payload extraction** for better performance.
* If **both** request objects and standard inputs are used:
* Request **must be the second argument**.
* **Preprocessing transforms inputs**, but the request object remains unchanged.
* `postprocess` can’t use only the request. It must receive the model’s output.
* If `predict` only uses the request, `preprocess` cannot be used.
For example, a streaming `predict` can poll `request.is_disconnected()` to stop work when the client drops:
```python theme={"system"}
import fastapi, asyncio, logging

class Model:
    async def predict(self, inputs, request: fastapi.Request):
        await asyncio.sleep(1)
        if await request.is_disconnected():
            logging.warning("Cancelled before generation.")
            return  # Cancel request on the model engine here.

        for i in range(5):
            await asyncio.sleep(1.0)
            logging.warning(i)
            yield str(i)  # Streaming response
            if await request.is_disconnected():
                logging.warning("Cancelled during generation.")
                return  # Cancel request on the model engine here.
```
You must implement request cancellation at the model level, which varies by framework.
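The polling pattern itself is framework-independent. A stdlib sketch, with `is_disconnected` stubbed in place of the FastAPI request method and the engine-side cancellation left as a comment:

```python
import asyncio

async def generate_with_cancellation(chunks, is_disconnected):
    """Yield chunks, polling for a client disconnect between each one."""
    for chunk in chunks:
        if await is_disconnected():
            # This is where you would tell the inference engine to stop work.
            return
        yield chunk

async def main():
    calls = 0

    async def disconnect_after_two():
        nonlocal calls
        calls += 1
        return calls > 2  # simulate the client dropping mid-stream

    return [c async for c in generate_with_cancellation("abcde", disconnect_after_two)]

received = asyncio.run(main())
```

Only "a" and "b" are produced before the simulated disconnect is observed, mirroring how a real deployment avoids wasting GPU time on abandoned requests.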
## Cancelling requests in specific frameworks
### TRT-LLM (polling-based cancellation)
For TensorRT-LLM, use `response_iterator.cancel()` to terminate streaming requests:
```python theme={"system"}
async for request_output in response_iterator:
    if await is_cancelled_fn():
        logging.info("Request cancelled. Cancelling Triton request.")
        response_iterator.cancel()
        return
```
See full example in [TensorRT-LLM Docs](https://developer.nvidia.com/tensorrt-llm).
### vLLM (abort API)
For vLLM, use `engine.abort()` to stop processing:
```python theme={"system"}
async for request_output in results_generator:
    if await request.is_disconnected():
        await engine.abort(request_id)
        return
```
See full example in [vLLM Docs](https://docs.vllm.ai/en/latest/dev/engine/async_llm_engine.html#vllm.AsyncLLMEngine.generate).
## Unsupported request features
* **Streaming file uploads**: Use URLs instead of embedding large data in the request.
* **Client-side headers**: Most headers are stripped; include necessary metadata in the payload.
# Custom responses
Source: https://docs.baseten.co/development/model/responses
Get more control by directly creating the response object.
By default, Truss wraps prediction results into an HTTP response. For **advanced use cases**, you can create response objects manually to:
* **Control HTTP status codes.**
* **Use server-sent events (SSEs) for streaming responses.**
You can return a response from `predict` or `postprocess`, but not both.
## Returning custom response objects
Any subclass of `starlette.responses.Response` is supported.
```python theme={"system"}
import fastapi

class Model:
    def predict(self, inputs) -> fastapi.Response:
        return fastapi.Response(...)
```
If `predict` returns a response, `postprocess` cannot be used.
## Example: Streaming with SSEs
For **server-sent events (SSEs)**, use `StreamingResponse`:
```python theme={"system"}
import time
from starlette.responses import StreamingResponse

class Model:
    def predict(self, model_input):
        def event_stream():
            while True:
                time.sleep(1)
                yield f"data: Server Time: {time.strftime('%Y-%m-%d %H:%M:%S')}\n\n"

        return StreamingResponse(event_stream(), media_type="text/event-stream")
```
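On the wire, an SSE stream is a sequence of `data:` lines separated by blank lines. A minimal parser for that format, using only the standard library (a real client would read incrementally from an HTTP response rather than a string; the parsing logic here is illustrative):

```python
def parse_sse(raw: str):
    """Extract `data:` payloads from a server-sent-events stream."""
    events = []
    for block in raw.split("\n\n"):
        for line in block.splitlines():
            if line.startswith("data:"):
                events.append(line[len("data:"):].strip())
    return events

stream = "data: Server Time: 2024-01-01 00:00:00\n\ndata: Server Time: 2024-01-01 00:00:01\n\n"
events = parse_sse(stream)
```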
## Limitations
* **Response headers aren't fully propagated**: include metadata in the response body.
Also see [Using Request Objects](/development/model/requests)
for handling raw requests.
# Secrets
Source: https://docs.baseten.co/development/model/secrets
Use secrets securely in your models
Truss allows you to securely manage API keys, access tokens, passwords, and other secrets without exposing them in code.
## Create a secret
1. Go to [Secrets](https://app.baseten.co/settings/secrets) in your account settings.
2. Enter the name and value of the secret, for example `hf_access_token` and `hf_...`.
3. Select **Add secret**.
To create a secret with the API, use the following command:
```bash theme={"system"}
curl --request POST \
  --url https://api.baseten.co/v1/secrets \
  --header "Authorization: Api-Key $BASETEN_API_KEY" \
  --data '{
    "name": "hf_access_token",
    "value": "hf_..."
  }'
```
For more information, see the
[Upsert a secret](/reference/management-api/secrets/upserts-a-secret) reference.
## Use secrets in your model
Once you've created a secret, declare it in your `config.yaml` and access it in your model code.
Never store actual secret values in `config.yaml`. Use `null` as a placeholder.
The secret in your `config.yaml` is a reference to the key in the secret manager.
Specify the reference to the secret in `config.yaml`:
```yaml config.yaml theme={"system"}
secrets:
  hf_access_token: null
```
Secrets are passed as keyword arguments to the `Model` class. To access them, store the secrets in `__init__`:
```python main.py theme={"system"}
def __init__(self, **kwargs):
    self._secrets = kwargs["secrets"]
```
Then use the secret in the `load` or `predict` method of your model by accessing it with its key:
```python main.py theme={"system"}
def load(self):
    self._model = pipeline(
        "fill-mask",
        model="baseten/docs-example-gated-model",
        use_auth_token=self._secrets["hf_access_token"]
    )
```
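Putting the two snippets together gives the following runnable sketch. The `pipeline` call is replaced with a stub so the example is self-contained; in a real model it would be the Hugging Face call shown above:

```python
class Model:
    def __init__(self, **kwargs):
        # Truss passes the secrets declared in config.yaml as a dict.
        self._secrets = kwargs["secrets"]
        self._model = None

    def load(self):
        token = self._secrets["hf_access_token"]
        # Stub standing in for pipeline("fill-mask", ..., use_auth_token=token).
        self._model = {"task": "fill-mask", "authenticated": token is not None}

model = Model(secrets={"hf_access_token": "hf_example_token"})
model.load()
```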
## Use secrets in custom Docker images
When using [custom Docker images](/development/model/custom-server), Truss
injects secrets into your container at `/secrets/{secret_name}` instead of
passing them through `kwargs`.
You must specify the reference to the secret and then access it in your `start_command` or application code.
Specify the reference to the secret in `config.yaml`:
```yaml config.yaml theme={"system"}
secrets:
  hf_access_token: null
```
### Read secrets in your `start_command`
To read a secret in your `start_command`:
```yaml config.yaml theme={"system"}
docker_server:
  start_command: sh -c "HF_TOKEN=$(cat /secrets/hf_access_token) my-server --port 8000"
```
### Read secrets in application code
To read a secret in application code:
```python main.py theme={"system"}
with open("/secrets/hf_access_token", "r") as f:
    hf_token = f.read().strip()
```
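A small helper makes that read reusable. The `/secrets/{secret_name}` layout is as described above; the `base` parameter is an addition here so the helper can be exercised locally:

```python
from pathlib import Path

def read_secret(name: str, base: str = "/secrets") -> str:
    """Read a Truss-injected secret file, stripping the trailing newline."""
    return (Path(base) / name).read_text().strip()
```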
# Streaming output
Source: https://docs.baseten.co/development/model/streaming
Streaming Output for LLMs
Streaming output significantly reduces wait time for generative AI models by returning results as they are generated instead of waiting for the full response.
## Why streaming?
* ✅ **Faster response time** – Get initial results in under **1 second** instead of waiting **10+ seconds**.
* ✅ **Improved user experience** – Partial outputs are **immediately usable**.
This guide walks through **deploying Falcon 7B** with streaming enabled.
### 1. Initialize Truss
```sh theme={"system"}
truss init falcon-7b && cd falcon-7b
```
### 2. Implement the model (non-streaming)
This first version loads the Falcon 7B model **without** streaming:
```python model/model.py theme={"system"}
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from typing import Dict

CHECKPOINT = "tiiuae/falcon-7b-instruct"
DEFAULT_MAX_NEW_TOKENS = 150
DEFAULT_TOP_P = 0.95

class Model:
    def __init__(self, **kwargs) -> None:
        self.tokenizer = None
        self.model = None

    def load(self):
        self.tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
        self.model = AutoModelForCausalLM.from_pretrained(
            CHECKPOINT, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"
        )

    def predict(self, request: Dict) -> Dict:
        prompt = request["prompt"]
        inputs = self.tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True, padding=True)
        input_ids = inputs["input_ids"].to("cuda")
        generation_config = GenerationConfig(temperature=1, top_p=DEFAULT_TOP_P, top_k=40)
        with torch.no_grad():
            return self.model.generate(
                input_ids=input_ids,
                generation_config=generation_config,
                return_dict_in_generate=True,
                output_scores=True,
                pad_token_id=self.tokenizer.eos_token_id,
                max_new_tokens=DEFAULT_MAX_NEW_TOKENS,
            )
```
### 3. Add streaming support
To enable streaming, we:
* Use `TextIteratorStreamer` to stream tokens as they are generated.
* Run `generate()` in a **separate thread** to prevent blocking.
* Return a **generator** that streams results.
```python model/model.py theme={"system"}
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig, TextIteratorStreamer
from threading import Thread
from typing import Dict

CHECKPOINT = "tiiuae/falcon-7b-instruct"

class Model:
    def __init__(self, **kwargs) -> None:
        self.tokenizer = None
        self.model = None

    def load(self):
        self.tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
        self.model = AutoModelForCausalLM.from_pretrained(
            CHECKPOINT, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"
        )

    def predict(self, request: Dict):
        prompt = request["prompt"]
        inputs = self.tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True, padding=True)
        input_ids = inputs["input_ids"].to("cuda")
        streamer = TextIteratorStreamer(self.tokenizer)
        generation_config = GenerationConfig(temperature=1, top_p=0.95, top_k=40)

        def generate():
            self.model.generate(
                input_ids=input_ids,
                generation_config=generation_config,
                return_dict_in_generate=True,
                output_scores=True,
                pad_token_id=self.tokenizer.eos_token_id,
                max_new_tokens=150,
                streamer=streamer,
            )

        thread = Thread(target=generate)
        thread.start()

        def stream_output():
            for text in streamer:
                yield text
            thread.join()

        return stream_output()
```
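The thread-plus-streamer pattern reduces to two stdlib primitives: a background thread pushes tokens into a queue while a generator drains it. A sketch with a token producer standing in for `model.generate(..., streamer=...)`:

```python
import queue
import threading

_SENTINEL = object()

def stream_tokens(produce_tokens):
    """Yield tokens from a producer running in a background thread."""
    q = queue.Queue()

    def generate():
        # Stands in for model.generate(..., streamer=streamer).
        for token in produce_tokens():
            q.put(token)
        q.put(_SENTINEL)  # signal end of stream

    thread = threading.Thread(target=generate)
    thread.start()
    while True:
        item = q.get()
        if item is _SENTINEL:
            break
        yield item
    thread.join()

tokens = list(stream_tokens(lambda: iter(["Falcons", " are", " fast"])))
```

`TextIteratorStreamer` plays the role of the queue here, so the consumer can start yielding tokens before generation finishes.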
### 4. Configure `config.yaml`
```yaml config.yaml theme={"system"}
model_name: falcon-streaming
requirements:
  - torch==2.0.1
  - peft==0.4.0
  - scipy==1.11.1
  - sentencepiece==0.1.99
  - accelerate==0.21.0
  - bitsandbytes==0.41.1
  - einops==0.6.1
  - transformers==4.31.0
resources:
  cpu: "4"
  memory: 16Gi
  use_gpu: true
  accelerator: L4
```
### 5. Deploy and invoke
Deploy the model:
```sh theme={"system"}
truss push --watch
```
Invoke with:
```sh theme={"system"}
truss predict -d '{"prompt": "Tell me about falcons", "do_sample": true}'
```
# Torch compile caching
Source: https://docs.baseten.co/development/model/torch-compile-cache
Accelerate cold starts by loading in previous compilation artifacts.
### Requires [b10cache](/development/model/b10cache) enabled
## Overview
PyTorch's `torch.compile` feature offers significant performance improvements for inference workloads, reducing inference time by up to 40%. However, this optimization comes with a trade-off: the initial compilation process adds considerable latency to cold starts, as the model must be compiled before serving its first inference request.
This compilation overhead becomes particularly problematic in production environments where:
* Models frequently scale up and down based on demand
* New pods are regularly spawned to handle traffic spikes
* Each new instance must repeat the compilation process from scratch
## Solution
Persist compilation artifacts across deployments and pod restarts by storing them in [b10cache](/development/model/b10cache). When a new pod starts, it can load previously compiled artifacts instead of recompiling from scratch. The library gracefully handles large scale-ups, managing race conditions and ensuring fault tolerance in the shared b10cache.
In practice, this strategy cuts compilation latency to roughly 5-20 seconds, depending on the model.
***
## Implementation options
There are two different deployment patterns that benefit from torch compile caching:
* **Truss Models**: `model.py` calling `torch.compile` ([Jump to](#truss-models-model-py))
* **vLLM Servers**: vLLM custom server ([Jump to](#vllm-servers-cli-tool))
***
## Truss models (`model.py`)
### API reference
We expose two API calls that return an `OperationStatus` object to help you control program flow based on the result.
**`load_compile_cache()`**: If you have previously saved a compilation cache for this model, load it to speed up compilation on this pod.
**Returns:**
* `OperationStatus.SUCCESS` → successful load
* `OperationStatus.SKIPPED` → if torch compilation artifacts already exist on the pod
* `OperationStatus.ERROR` → general catch-all errors
* `OperationStatus.DOES_NOT_EXIST` → if no cache file was found
**`save_compile_cache()`**: Save your model's torch compilation cache for future use. Call this after running warmup prompts that trigger compilation.
**Returns:**
* `OperationStatus.SUCCESS` → successful save
* `OperationStatus.SKIPPED` → skipped because compile cache already exists in shared directory
* `OperationStatus.ERROR` → general catch-all errors
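The intended control flow, compile unless loading errored and save only when nothing was loaded, can be sketched with stubs standing in for the `b10-transfer` calls (the enum mirrors the reference above; the stub behavior is illustrative):

```python
from enum import Enum, auto

class OperationStatus(Enum):
    SUCCESS = auto()
    SKIPPED = auto()
    ERROR = auto()
    DOES_NOT_EXIST = auto()

def warm_up_with_cache(load_fn, save_fn, compile_fn):
    """Load cache if available, compile and warm up, then save if needed."""
    loaded = load_fn()
    if loaded != OperationStatus.ERROR:
        compile_fn()  # torch.compile + warmup prompts in a real model
    if loaded != OperationStatus.SUCCESS:
        return save_fn()  # persist artifacts for the next cold start
    return OperationStatus.SKIPPED

# Simulate a first cold start: no cache exists yet, so we compile and save.
result = warm_up_with_cache(
    load_fn=lambda: OperationStatus.DOES_NOT_EXIST,
    save_fn=lambda: OperationStatus.SUCCESS,
    compile_fn=lambda: None,
)
```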
### Implementation example
Here is an example of compile caching for Flux, an image generation model. Note how we save the result of `load_compile_cache` to decide whether to call `save_compile_cache`.
#### Step 1: Update `config.yaml`
Under requirements, add `b10-transfer`:
```yaml theme={"system"}
requirements:
  - b10-transfer
```
#### Step 2: Update `model.py`
Import the library and use the two functions to speed up torch compilation time:
```python theme={"system"}
from b10_transfer import load_compile_cache, save_compile_cache, OperationStatus

class Model:
    def load(self):
        self.pipe = FluxPipeline.from_pretrained(
            self.model_name, torch_dtype=torch.bfloat16, token=self.hf_access_token
        ).to("cuda")

        # Try to load compile cache
        cache_loaded: OperationStatus = load_compile_cache()

        if cache_loaded == OperationStatus.ERROR:
            logging.info("Run in eager mode, skipping torch compile")
        else:
            logging.info("Compiling the model for performance optimization")
            self.pipe.transformer = torch.compile(
                self.pipe.transformer, mode="max-autotune-no-cudagraphs", dynamic=False
            )
            self.pipe.vae.decode = torch.compile(
                self.pipe.vae.decode, mode="max-autotune-no-cudagraphs", dynamic=False
            )

        seed = random.randint(0, MAX_SEED)
        generator = torch.Generator().manual_seed(seed)

        start_time = time.time()
        # Warmup the model with dummy prompts, also triggering compilation
        self.pipe(
            prompt="dummy prompt",
            prompt_2=None,
            guidance_scale=0.0,
            max_sequence_length=256,
            num_inference_steps=4,
            width=1024,
            height=1024,
            output_type="pil",
            generator=generator,
        )
        end_time = time.time()
        logging.info(
            f"Warmup completed in {(end_time - start_time)} seconds. "
            "This is expected to take a few minutes on the first run."
        )

        if cache_loaded != OperationStatus.SUCCESS:
            # Save compile cache for future runs
            outcome: OperationStatus = save_compile_cache()
```
See the [full example](https://github.com/basetenlabs/truss-examples/tree/main/flux/schnell).
***
## vLLM servers (CLI tool)
### Overview
Use this whenever vLLM runs with compilation enabled (on vLLM V1, compilation is the default behavior). The command-line tool spawns a fully automatic background process: it loads the compile cache if you've saved one before, and otherwise saves it after compilation.
### Implementation
There are two changes to make in `config.yaml`:
#### Step 1: Add requirements
Under requirements, add `b10-transfer`:
```yaml theme={"system"}
requirements:
  - b10-transfer
```
#### Step 2: Update start command
Under start command, add `b10-compile-cache &` right before the `vllm serve` call:
```yaml theme={"system"}
start_command: "... b10-compile-cache & vllm serve ..."
```
See the [full example](https://github.com/basetenlabs/truss-examples/tree/main/mistral/mistral-small-3.1).
***
## Advanced configuration
The torch compile caching library supports several environment variables for fine-tuning behavior in production environments:
### Cache directory configuration
**`TORCHINDUCTOR_CACHE_DIR`** (optional)
* **Default**: `/tmp/torchinductor_<username>`
* **Description**: Directory where PyTorch stores compilation artifacts locally
* **Allowed prefixes**: `/tmp/`, `/cache/`, `~/.cache`
* **Usage**: Set this if you need to customize where torch compilation artifacts are stored on the local filesystem
**`B10FS_CACHE_DIR`** (optional)
* **Default**: Derived from b10cache mount point + `/compile_cache`
* **Description**: Directory in b10cache where compilation artifacts are persisted across deployments
* **Usage**: Typically doesn't need to be changed as it's automatically configured based on your b10cache setup
**`LOCAL_WORK_DIR`** (optional)
* **Default**: `/app`
* **Description**: Local working directory for temporary operations
* **Allowed prefixes**: `/app/`, `/tmp/`, `/cache/`
### Performance and resource limits
**`MAX_CACHE_SIZE_MB`** (optional)
* **Default**: `1024` (1GB)
* **Cap**: Limited by `MAX_CACHE_SIZE_CAP_MB` for safety
* **Description**: Maximum size of a single cache archive in megabytes
* **Usage**: Increase for larger models with extensive compilation artifacts, decrease to save storage
**`MAX_CONCURRENT_SAVES`** (optional)
* **Default**: `50`
* **Cap**: Limited by `MAX_CONCURRENT_SAVES_CAP` for safety
* **Description**: Maximum number of concurrent save operations allowed
* **Usage**: Tune based on your deployment's concurrency requirements and storage performance
### Cleanup and maintenance
**`CLEANUP_LOCK_TIMEOUT_SECONDS`** (optional)
* **Default**: `30`
* **Cap**: Limited by `LOCK_TIMEOUT_CAP_SECONDS`
* **Description**: Timeout for cleaning up stale lock files, which prevents deadlocks when a replica crashes while holding the lock
* **Usage**: Decrease if you're experiencing deadlocks in high-load scenarios
**`CLEANUP_INCOMPLETE_TIMEOUT_SECONDS`** (optional)
* **Default**: `60`
* **Cap**: Limited by `INCOMPLETE_TIMEOUT_CAP_SECONDS`
* **Description**: Timeout for cleaning up incomplete cache files
* **Usage**: Increase for slower storage systems or larger cache files
### Example configuration
```yaml theme={"system"}
# config.yaml
environment_variables:
MAX_CACHE_SIZE_MB: "2048"
MAX_CONCURRENT_SAVES: "25"
CLEANUP_LOCK_TIMEOUT_SECONDS: "45"
```
Most users won't need to modify these settings. The defaults are optimized for typical production workloads. Only adjust these values if you're experiencing specific performance issues or have unusual deployment requirements.
***
## Further reading
For implementation details, see the [PyTorch `torch.compile` caching tutorial](https://docs.pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html).
# WebSockets
Source: https://docs.baseten.co/development/model/websockets
Enable real-time, streaming, bidirectional communication using WebSockets for Truss models and Chainlets.
## Overview
WebSockets provide a persistent, full-duplex communication channel between clients and server-side models or chains. Full duplex means that chunks of data can be sent client→server and server→client simultaneously and repeatedly.
This guide covers how to implement WebSocket-based interactions for Truss models and Chains/Chainlets.
Unlike traditional request-response models, WebSockets allow continuous data exchange without reopening connections. This is useful for real-time applications, streaming responses, and maintaining lightweight interactions. Example applications could be real-time audio transcription, AI phone calls or agents with turn-based interactions. WebSockets are also useful for situations where you want to manage some state on the server-side, and you want requests that are part of the same "session" to always be routed to the replica that maintains that state.
## WebSocket usage in Truss models
In Truss models, WebSockets replace the conventional request/response flow: a single `websocket` method handles all processing, and input/output communication goes through the WebSocket object (not arguments and return values). There are no separate `preprocess`, `predict`, and `postprocess` methods, but you can still implement `load`.
1. **Initialize your Truss**:
```bash theme={"system"}
truss init websocket-model
```
For more detailed information about this command, refer to the [truss init documentation](/reference/cli/truss/init).
2. Replace the `predict` method with a `websocket` method in `model/model.py`. For example:
```python theme={"system"}
import fastapi
class Model:
async def websocket(self, websocket: fastapi.WebSocket):
try:
while True:
message = await websocket.receive_text()
await websocket.send_text(f"WS obtained: {message}")
except fastapi.WebSocketDisconnect:
pass
```
3. Set `runtime.transport.kind=websocket` in `config.yaml`:
```yaml theme={"system"}
...
runtime:
transport:
kind: websocket
```
### Key points
* Continuous message exchange occurs in a loop until the client disconnects. You can also close the connection server-side by calling `websocket.close()` when a certain condition is reached.
* WebSockets enable bidirectional streaming, avoiding the need for multiple HTTP requests (or return values).
* You must not implement any of the traditional methods `predict`, `preprocess`, or `postprocess`.
* The WebSocket object passed to the `websocket` method has already accepted the connection, so you must not call `websocket.accept()` on it. You may close the connection though at the end of your processing. If you don’t close it explicitly, it will be closed after exiting your `websocket` method.
### Invocation
Using `websocat` ([get it](https://github.com/vi/websocat)), you can call the model like this:
```bash theme={"system"}
websocat -H="Authorization: Api-Key $BASETEN_API_KEY" \
wss://model-{MODEL_ID}.api.baseten.co/environments/production/websocket
Hello # Your input.
WS obtained: Hello # Echoed from model.
# ctrl+c to close connection.
```
The path you use depends on which environment or deployment of the model you'd like to call.
* Environment: `wss://model-{MODEL_ID}.api.baseten.co/environments/{ENVIRONMENT_NAME}/websocket`.
* Deployment: `wss://model-{MODEL_ID}.api.baseten.co/deployment/{DEPLOYMENT_NAME}/websocket`.
* Regional environment: `wss://model-{MODEL_ID}-{ENV_NAME}.api.baseten.co/websocket`. See [Regional environments](/deployment/environments#regional-environments).
See [Reference](/reference/inference-api/predict-endpoints/environments-websocket) for the full details.
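You can also connect programmatically. Below is a minimal async client sketch using the third-party `websockets` package (an assumption, not a Baseten SDK; in `websockets` versions before 14 the header argument is named `extra_headers` rather than `additional_headers`):

```python theme={"system"}
import asyncio
import os

def ws_url(model_id: str) -> str:
    # Production-environment WebSocket path, as documented above.
    return f"wss://model-{model_id}.api.baseten.co/environments/production/websocket"

async def echo_once(model_id: str, message: str) -> str:
    import websockets  # pip install websockets
    headers = {"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"}
    async with websockets.connect(ws_url(model_id), additional_headers=headers) as ws:
        await ws.send(message)
        return await ws.recv()  # the example model replies "WS obtained: <message>"

if __name__ == "__main__":
    print(asyncio.run(echo_once("{MODEL_ID}", "Hello")))
```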
## WebSocket usage in Chains/Chainlets
For Chains, WebSockets are wrapped in a reduced API object, `WebSocketProtocol`. All processing happens in the `run_remote` method as usual, but inputs and outputs (or "return values") are sent through the WebSocket object using async `send_{X}` and `receive_{X}` methods (there are variants for `text`, `bytes`, and `json`). As a convenience, there's also a `receive` method that can pass through both `str` and `bytes` types.
### Implementation Example
```python theme={"system"}
import fastapi
import truss_chains as chains
class Dependency(chains.ChainletBase):
async def run_remote(self, name: str) -> str:
return f"Hello from dependency, {name}."
@chains.mark_entrypoint
class WSEntrypoint(chains.ChainletBase):
def __init__(self, dependency=chains.depends(Dependency)):
self._dependency = dependency
async def run_remote(self, websocket: chains.WebSocketProtocol) -> None:
try:
while True:
message = await websocket.receive_text()
if message == "dep":
response = await self._dependency.run_remote("WSEntrypoint")
else:
response = f"You said: {message}"
await websocket.send_text(response)
except fastapi.WebSocketDisconnect:
print("Disconnected.")
```
### Key points
* WebSocket interactions in Chains must follow `WebSocketProtocol` (it is essentially the same as `fastapi.WebSocket`, but you cannot accept the connection, because inside the Chainlet the connection is already accepted).
* No other arguments are allowed in `run_remote()` when using WebSockets.
* The return type must be `None` (if you return data to the client, send it through the WebSocket itself).
* WebSockets can only be used in the *entrypoint*, not in dependencies.
* Unlike Truss models, Chains don't require explicitly setting `runtime.transport.kind`.
### Invocation
Using `websocat` ([get it](https://github.com/vi/websocat)), you can call the chain like this:
```bash theme={"system"}
websocat -H="Authorization: Api-Key $BASETEN_API_KEY" \
wss://chain-{CHAIN_ID}.api.baseten.co/environments/production/websocket
```
Similarly to models, WebSocket chains can also be invoked either via deployment or environment. For regional environments, use `wss://chain-{CHAIN_ID}-{ENV_NAME}.api.baseten.co/websocket`. See [Regional environments](/deployment/environments#regional-environments).
See [Reference](/reference/inference-api/predict-endpoints/environments-websocket) for the full details.
## WebSocket usage with custom servers
You can deploy WebSocket servers using **custom Docker images** with the `docker_server` configuration. This approach is useful when you have an existing WebSocket server packaged in a Docker container or need specific runtime environments.
### Configuration
To deploy a WebSocket server using a custom Docker image, configure your `config.yaml` as follows:
```yaml config.yaml theme={"system"}
base_image:
image: bryanzhang2/custom_ws:v0.0.4
docker_server:
start_command: /app/server
readiness_endpoint: /health
liveness_endpoint: /health
predict_endpoint: /v1/websocket
server_port: 8081
model_name: custom_ws
runtime:
transport:
kind: "websocket"
```
### Key configurations for WebSocket custom servers
* `predict_endpoint` (**required**) – The WebSocket endpoint path (e.g., `/v1/websocket`, `/ws`)
* `runtime.transport.kind` (**required**) – Must be set to `"websocket"`
* `start_command` (**required**) – Command to start your WebSocket server
* `readiness_endpoint` (**required**) – Health check endpoint for Kubernetes readiness probes
* `liveness_endpoint` (**required**) – Health check endpoint for Kubernetes liveness probes
### Invocation
Using `websocat`, you can connect to your custom WebSocket server:
```bash theme={"system"}
websocat -H="Authorization: Api-Key $BASETEN_API_KEY" \
wss://model-{MODEL_ID}.api.baseten.co/environments/production/websocket
```
The WebSocket connection will be routed to your custom server's `predict_endpoint` path.
For more details on custom server deployment, see [Custom servers documentation](/development/model/custom-server).
## Deployment and concurrency considerations
### Scheduling
The WebSocket scaling algorithm schedules new WebSocket connections to the least-utilized replica. Once every replica reaches `maxConcurrency - 1` concurrent WebSocket connections, the total number of replicas is incremented, up to the `maxReplica` setting.
Scale-down occurs when the number of replicas is greater than `minReplica` and there are replicas with 0 concurrent connections. At that point, idle replicas are scaled down one by one.
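The scheduling rule above can be sketched as follows. This is an illustrative model only, not Baseten's actual autoscaler: route each new connection to the least-utilized replica, add a replica once all are at `max_concurrency - 1`, and remove idle replicas while above the minimum.

```python theme={"system"}
def route_connection(replicas: list[int], max_concurrency: int, max_replicas: int) -> list[int]:
    """replicas[i] is the connection count on replica i; returns updated counts."""
    if all(c >= max_concurrency - 1 for c in replicas) and len(replicas) < max_replicas:
        replicas = replicas + [0]  # all replicas saturated: add a fresh one
    least_utilized = min(range(len(replicas)), key=lambda i: replicas[i])
    replicas[least_utilized] += 1
    return replicas

def scale_down(replicas: list[int], min_replicas: int) -> list[int]:
    """Remove one idle replica (0 connections) at a time, never below min_replicas."""
    if len(replicas) > min_replicas and 0 in replicas:
        replicas.remove(0)
    return replicas
```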
Some other scheduling factors to consider when using WebSockets:
* Resource utilization: Standard HTTP requests are stateless and allow Baseten to optimize replica utilization and autoscaling. With WebSockets, long-lived connections are tied to specific replicas and count against your concurrency targets, even if underutilized. It's your responsibility to manage connection efficiency.
* Stateful complexity: WebSocket handlers often assume server-side state. This adds complexity around connection lifecycle management (e.g., handling unexpected disconnects, cleanup, reconnection logic).
### Lifetime guarantees
WebSockets are guaranteed to last a minimum of *1 hour*. In reality, a single WebSocket connection should be able to continue for much longer, but this is the guarantee that we provide in order to ensure that we can make changes to our system at a reasonable rate (including restarting and moving internal services as needed).
### Concurrency changes
When scaling concurrency down, existing WebSockets will be allowed to continue until they complete, even if it means that a replica indefinitely has a greater number of ongoing connections than the max concurrency setting.
For instance, suppose:
* You have a concurrency setting of 10, and currently have 10 websocket connections active on a replica.
* Then, you change the concurrency setting to 5.
In this case, Baseten will not force any of the ongoing connections to close as a result of the concurrency change. They will be allowed to continue and close naturally (unless the 1 hour minimum has passed, and an internal restart is required).
### Promotion
Just like with HTTP, you can promote a WebSocket model or chain to an environment via the REST API or UI.
When promoting a WebSocket model or chain, new connections are routed to the new deployment, while existing connections remain on the current deployment until they close. Depending on connection length, old deployments may take longer to scale down than HTTP deployments.
### Maximum message size
As a hard limit, we enforce a 100MiB maximum message size for any individual message sent over a websocket. This means that both clients and models are limited to 100MiB for *each* outgoing message, though *there is no overall limit on the cumulative data that can be sent over a websocket*.
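If a single payload could exceed the limit, it has to be split across multiple messages client-side. A minimal sketch (the helper name is illustrative, not part of any Baseten SDK):

```python theme={"system"}
MAX_MESSAGE_BYTES = 100 * 1024 * 1024  # 100MiB hard limit per individual message

def chunk_payload(data: bytes, chunk_size: int = MAX_MESSAGE_BYTES) -> list[bytes]:
    """Split a payload into chunks that each fit within the per-message limit."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# Each chunk can then be sent as a separate WebSocket message; the cumulative
# total over the connection is unlimited.
```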
## Monitoring
Just like HTTP deployments, WebSocket deployments report metrics on their performance.
### Inference volume
Inference volume is tracked as the number of connections per minute. These metrics are published *after* a connection closes, so they include the status code the connection was closed with.
See [WebSocket connection close codes](https://developer.mozilla.org/en-US/docs/Web/API/CloseEvent/code) for a full list.
### End-to-end connection duration
Measured at different percentiles (p50, p90, p95, p99). Like connections per minute, connection duration is tracked after connections close.
### Connection input and output size
Measured at different percentiles (p50, p90, p95, p99):
* **Connection input size:** Bytes sent by the client to the server over the duration of the connection.
* **Connection output size:** Bytes sent by the server to the client over the duration of the connection.
# BEI-Bert
Source: https://docs.baseten.co/engines/bei/bei-bert
BERT-optimized embeddings with cold-start performance
BEI-Bert is a specialized variant of Baseten Embeddings Inference optimized for BERT-based model architectures. It provides superior cold-start performance and 16-bit precision for models that benefit from bidirectional attention patterns.
## When to use BEI-Bert
### Ideal use cases
**Model architectures:**
* **Sentence-transformers**: `sentence-transformers/all-MiniLM-L6-v2`
* **Jina models**: `jinaai/jina-embeddings-v2-base-en`, `jinaai/jina-embeddings-v2-base-code`
* **Nomic models**: `nomic-ai/nomic-embed-text-v1.5`, `nomic-ai/nomic-embed-code-v1.5`
* **BERT variants**: `FacebookAI/roberta-base`, `cardiffnlp/twitter-roberta-base`
* **Gemma3Bidirectional**: `google/embeddinggemma-300m`
* **ModernBERT**: `answerdotai/ModernBERT-base`
* **Qwen2Bidirectional**: `Alibaba-NLP/gte-Qwen2-7B-instruct`
* **Qwen3Bidirectional**: `voyageai/voyage-4-nano`
* **Llama3Bidirectional**: `nvidia/llama-embed-nemotron-8b`
**Deployment scenarios:**
* **Cold-start sensitive applications**: Where first-request latency is critical
* **Small to medium models**: (under 4B parameters) where quantization isn't needed
* **High-accuracy requirements**: Where 16-bit precision is preferred
* **Bidirectional attention**: Models with bidirectional attention run best on this engine.
### BEI-Bert vs BEI comparison
| Feature | BEI-Bert | BEI |
| ------------ | ------------------------------------ | --------------------------------- |
| Architecture | BERT-based (bidirectional) | Causal (unidirectional) |
| Precision | FP16 (16-bit) | BF16/FP16/FP8/FP4 (quantized) |
| Cold-start | Optimized for fast initialization | Standard startup |
| Quantization | Not supported | FP8/FP4 supported |
| Memory usage | Lower for small models | Higher or equal |
| Throughput | 600-900 embeddings/sec | 800-1400 embeddings/sec |
| Best for | Small BERT models, accuracy-critical | Large models, throughput-critical |
## Recommended models (MTEB ranking)
### Top-tier embeddings
**High performance (rank 2-8):**
* `Alibaba-NLP/gte-Qwen2-7B-instruct` (7.61B): Bidirectional.
* `intfloat/multilingual-e5-large-instruct` (560M): Multilingual.
* `google/embeddinggemma-300m` (308M): Google's compact model.
**Mid-range performance (rank 15-35):**
* `Alibaba-NLP/gte-Qwen2-1.5B-instruct` (1.78B): Cost-effective.
* `Salesforce/SFR-Embedding-2_R` (7.11B): Salesforce model.
* `Snowflake/snowflake-arctic-embed-l-v2.0` (568M): Snowflake large.
* `Snowflake/snowflake-arctic-embed-m-v2.0` (305M): Snowflake medium.
**Efficient models (rank 52-103):**
* `WhereIsAI/UAE-Large-V1` (335M): UAE large model.
* `nomic-ai/nomic-embed-text-v1` (137M): Nomic original.
* `nomic-ai/nomic-embed-text-v1.5` (137M): Nomic improved.
* `sentence-transformers/all-mpnet-base-v2` (109M): MPNet base.
**Specialized models:**
* `nomic-ai/nomic-embed-text-v2-moe` (475M-A305M): Mixture of experts.
* `Alibaba-NLP/gte-large-en-v1.5` (434M): Alibaba large English.
* `answerdotai/ModernBERT-large` (396M): Modern BERT large.
* `jinaai/jina-embeddings-v2-base-en` (137M): Jina English.
* `jinaai/jina-embeddings-v2-base-code` (137M): Jina code.
### Re-ranking models
**Top re-rankers:**
* `BAAI/bge-reranker-large`: XLM-RoBERTa based.
* `BAAI/bge-reranker-base`: XLM-RoBERTa base.
* `Alibaba-NLP/gte-multilingual-reranker-base`: GTE multilingual.
* `Alibaba-NLP/gte-reranker-modernbert-base`: ModernBERT reranker.
### Classification models
**Sentiment analysis:**
* `SamLowe/roberta-base-go_emotions`: RoBERTa for emotions.
## Supported model families
### Popular Hugging Face models
Find supported models on Hugging Face:
* [Embedding Models](https://huggingface.co/models?pipeline_tag=feature-extraction\&other=text-embeddings-inference\&sort=trending)
* [Classification Models](https://huggingface.co/models?pipeline_tag=text-classification\&other=text-embeddings-inference\&sort=trending)
### Sentence-transformers
The most common BERT-based embedding models, optimized for semantic similarity.
**Popular models:**
* `sentence-transformers/all-MiniLM-L6-v2` (384D, 22M params)
* `sentence-transformers/all-mpnet-base-v2` (768D, 110M params)
* `sentence-transformers/multi-qa-mpnet-base-dot-v1` (768D, 110M params)
**Configuration:**
```yaml theme={"system"}
trt_llm:
build:
base_model: encoder_bert
checkpoint_repository:
source: HF
repo: "sentence-transformers/all-MiniLM-L6-v2"
quantization_type: no_quant
runtime:
webserver_default_route: /v1/embeddings
```
### Voyage and Nemotron Bidirectional LLMs
Large decoder architectures with bidirectional attention, such as Qwen3 (`voyageai/voyage-4-nano`) or Llama3 (`nvidia/llama-embed-nemotron-8b`), can be deployed with BEI-Bert.
**Configuration:**
```yaml theme={"system"}
trt_llm:
build:
base_model: encoder_bert
checkpoint_repository:
source: HF
repo: "voyageai/voyage-4-nano"
# rewrite of the config files for compatibility (no custom code support)
revision: "refs/pr/5"
quantization_type: no_quant
runtime:
webserver_default_route: /v1/embeddings
```
### Jina AI embeddings
Jina's BERT-based models optimized for various domains including code.
**Popular models:**
* `jinaai/jina-embeddings-v2-base-en` (512D, 137M params)
* `jinaai/jina-embeddings-v2-base-code` (512D, 137M params)
* `jinaai/jina-embeddings-v2-base-es` (512D, 137M params)
**Configuration:**
```yaml theme={"system"}
trt_llm:
build:
base_model: encoder_bert
checkpoint_repository:
source: HF
repo: "jinaai/jina-embeddings-v2-base-en"
quantization_type: no_quant
runtime:
webserver_default_route: /v1/embeddings
```
### Nomic AI embeddings
Nomic's models with specialized training for text and code.
**Popular models:**
* `nomic-ai/nomic-embed-text-v1.5` (768D, 137M params)
* `nomic-ai/nomic-embed-code-v1.5` (768D, 137M params)
**Configuration:**
```yaml theme={"system"}
trt_llm:
build:
base_model: encoder_bert
checkpoint_repository:
source: HF
repo: "nomic-ai/nomic-embed-text-v1.5"
quantization_type: no_quant
runtime:
webserver_default_route: /v1/embeddings
```
### Alibaba GTE and Qwen models
Advanced multilingual models with instruction-tuning and long-context support.
**Popular models:**
* `Alibaba-NLP/gte-Qwen2-7B-instruct`: Top-ranked multilingual.
* `Alibaba-NLP/gte-Qwen2-1.5B-instruct`: Cost-effective alternative.
* `intfloat/multilingual-e5-large-instruct`: E5 multilingual variant.
**Configuration:**
```yaml theme={"system"}
trt_llm:
build:
base_model: encoder_bert
checkpoint_repository:
source: HF
repo: "Alibaba-NLP/gte-Qwen2-7B-instruct"
quantization_type: no_quant
runtime:
webserver_default_route: /v1/embeddings
```
## Configuration examples
### Cost-effective GTE-Qwen deployment
```yaml theme={"system"}
model_name: BEI-Bert-GTE-Qwen-1.5B
resources:
accelerator: L4
cpu: '1'
memory: 15Gi
use_gpu: true
trt_llm:
build:
base_model: encoder_bert
checkpoint_repository:
source: HF
repo: "Alibaba-NLP/gte-Qwen2-1.5B-instruct"
revision: main
max_num_tokens: 8192
quantization_type: no_quant
runtime:
webserver_default_route: /v1/embeddings
kv_cache_free_gpu_mem_fraction: 0.85
batch_scheduler_policy: guaranteed_no_evict
```
### Basic sentence-transformer deployment
```yaml theme={"system"}
model_name: BEI-Bert-MiniLM
resources:
accelerator: L4
cpu: '1'
memory: 10Gi
use_gpu: true
trt_llm:
build:
base_model: encoder_bert
checkpoint_repository:
source: HF
repo: "sentence-transformers/all-MiniLM-L6-v2"
revision: main
max_num_tokens: 8192
quantization_type: no_quant
runtime:
webserver_default_route: /v1/embeddings
kv_cache_free_gpu_mem_fraction: 0.9
batch_scheduler_policy: guaranteed_no_evict
```
### Jina code embeddings deployment
```yaml theme={"system"}
model_name: BEI-Bert-Jina-Code
resources:
accelerator: H100
cpu: '1'
memory: 10Gi
use_gpu: true
trt_llm:
build:
base_model: encoder_bert
checkpoint_repository:
source: HF
repo: "jinaai/jina-embeddings-v2-base-code"
revision: main
max_num_tokens: 8192
quantization_type: no_quant
runtime:
webserver_default_route: /v1/embeddings
kv_cache_free_gpu_mem_fraction: 0.9
batch_scheduler_policy: guaranteed_no_evict
```
### Nomic text embeddings with custom routing
```yaml theme={"system"}
model_name: BEI-Bert-Nomic-Text
resources:
accelerator: L4
cpu: '1'
memory: 10Gi
use_gpu: true
trt_llm:
build:
base_model: encoder_bert
checkpoint_repository:
source: HF
repo: "nomic-ai/nomic-embed-text-v1.5"
revision: main
max_num_tokens: 16384
quantization_type: no_quant
runtime:
webserver_default_route: /v1/embeddings
kv_cache_free_gpu_mem_fraction: 0.85
batch_scheduler_policy: guaranteed_no_evict
```
## Integration examples
### OpenAI client with Qwen3 instructions
```python theme={"system"}
from openai import OpenAI
import os
client = OpenAI(
api_key=os.environ['BASETEN_API_KEY'],
base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)
response = client.embeddings.create(
input="This is a test sentence for embedding.",
model="not-required"
)
# Batch embedding with multiple documents
documents = [
"Product documentation for software library",
"User question about API usage",
"Code snippet example"
]
response = client.embeddings.create(
input=documents,
model="not-required"
)
print(f"Embedding dimension: {len(response.data[0].embedding)}")
print(f"Processed {len(response.data)} embeddings")
```
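A common next step is ranking documents by similarity to a query embedding. Below is a dependency-free cosine similarity helper (an illustrative sketch, not part of any Baseten SDK):

```python theme={"system"}
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# e.g. score each document against a query embedding from the response above:
# scores = [cosine_similarity(query_embedding, d.embedding) for d in response.data]
```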
### Baseten Performance Client
For maximum throughput with BEI-Bert:
```python theme={"system"}
from baseten_performance_client import PerformanceClient
client = PerformanceClient(
api_key=os.environ['BASETEN_API_KEY'],
base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync"
)
# High-throughput batch processing
texts = [f"Sentence {i}" for i in range(1000)]
response = client.embed(
input=texts,
model="not-required",
batch_size=8,
max_concurrent_requests=16,
timeout_s=300
)
print(f"Processed {len(response.numpy())} embeddings")
print(f"Embedding shape: {response.numpy().shape}")
```
### Direct API usage
```python theme={"system"}
import requests
import os
import json
headers = {
"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}",
"Content-Type": "application/json"
}
data = {
"input": ["Text to embed", "Another text"],
"encoding_format": "float"
}
response = requests.post(
"https://model-xxxxxx.api.baseten.co/environments/production/sync/v1/embeddings",
headers=headers,
json=data
)
result = response.json()
print(f"Embeddings: {len(result['data'])} embeddings generated")
```
## Best practices
### Model selection guide
Choose based on your primary constraint:
**Cost-effective (balanced performance/cost):**
* `Alibaba-NLP/gte-Qwen2-7B-instruct`: Instruction-tuned, ranked #1 for multilingual.
* `Alibaba-NLP/gte-Qwen2-1.5B-instruct`: 1/5 the size, still top-tier.
* `Snowflake/snowflake-arctic-embed-m-v2.0`: Multilingual-optimized, MRL support.
**Lightweight & fast (under 500M):**
* `google/embeddinggemma-300m`: 300M params, 100+ languages.
* `Snowflake/snowflake-arctic-embed-m-v2.0`: 305M, compression-friendly.
* `nomic-ai/nomic-embed-text-v1.5`: 137M, minimal latency.
* `sentence-transformers/all-MiniLM-L6-v2`: 22M, legacy standard.
**Specialized:**
* **Code:** `jinaai/jina-embeddings-v2-base-code`
* **Long sequences:** `Alibaba-NLP/gte-large-en-v1.5`
* **Re-ranking:** `BAAI/bge-reranker-large`, `Alibaba-NLP/gte-reranker-modernbert-base`
### Hardware optimization
**Cost-effective deployments:**
* L4 GPUs for models under 200M parameters
* H100 GPUs for models with 200-500M parameters
* Enable autoscaling for variable traffic
**Performance optimization:**
* Use `max_num_tokens: 8192` for most use cases
* Use `max_num_tokens: 16384` for long documents
* Tune `batch_scheduler_policy` based on traffic patterns
### Deployment strategies
**For development:**
* Start with smaller models (MiniLM)
* Use L4 GPUs for cost efficiency
* Enable detailed logging
**For production:**
* Use larger models (MPNet) for better quality
* Use H100 GPUs for better performance
* Implement monitoring and alerting
**For edge deployments:**
* Use smallest suitable models
* Optimize for cold-start performance
* Consider model size constraints
## Troubleshooting
### Common issues
**Slow cold-start times:**
* Ensure model is properly cached
* Consider using smaller models
* Check GPU memory availability
**Lower than expected throughput:**
* Verify `max_num_tokens` is appropriate
* Check `batch_scheduler_policy` settings
* Monitor GPU utilization
**Memory issues:**
* Reduce `max_num_tokens` if needed
* Use smaller models for available memory
* Monitor memory usage during deployment
### Performance tuning
**For lower latency:**
* Reduce `max_num_tokens`
* Use `batch_scheduler_policy: guaranteed_no_evict`
* Consider smaller models
**For higher throughput:**
* Increase `max_num_tokens` appropriately
* Use `batch_scheduler_policy: max_utilization`
* Optimize batch sizes in client code
**For cost optimization:**
* Use L4 GPUs when possible
* Choose appropriately sized models
* Implement efficient autoscaling
## Migration from other systems
### From sentence-transformers library
**Python code:**
```python theme={"system"}
# Before (sentence-transformers)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
# After (BEI-Bert)
from openai import OpenAI
client = OpenAI(api_key=BASETEN_API_KEY, base_url=BASE_URL)
embeddings = client.embeddings.create(input=sentences, model="not-required")
```
### From other embedding services
BEI-Bert provides OpenAI-compatible endpoints:
1. **Update base URL**: Point to Baseten deployment
2. **Update API key**: Use Baseten API key
3. **Test compatibility**: Verify embedding dimensions and quality
4. **Optimize**: Tune batch sizes and concurrency for performance
## Related
* [BEI overview](/engines/bei/overview) - General BEI documentation.
* [BEI reference config](/engines/bei/bei-reference) - Complete configuration options.
* [Embedding examples](/examples/bei) - Concrete deployment examples.
* [Performance client documentation](/engines/performance-concepts/performance-client) - Client usage with embeddings.
* [Performance optimization](/development/model/performance-optimization) - General performance guidance.
# Configuration reference
Source: https://docs.baseten.co/engines/bei/bei-reference
Complete reference config for BEI and BEI-Bert engines
This reference covers all configuration options for BEI and BEI-Bert deployments. All settings use the `trt_llm` section in `config.yaml`.
## Configuration structure
```yaml theme={"system"}
trt_llm:
inference_stack: v1 # Always v1 for BEI
build:
base_model: encoder | encoder_bert
checkpoint_repository: {...}
max_num_tokens: 16384
quantization_type: no_quant | fp8 | fp4 | fp4_kv
quantization_config: {...}
plugin_configuration: {...}
runtime:
webserver_default_route: /v1/embeddings | /rerank | /predict
kv_cache_free_gpu_mem_fraction: 0.9
enable_chunked_context: true
batch_scheduler_policy: guaranteed_no_evict
```
## Build configuration
The `build` section configures model compilation and optimization settings.
The base model architecture determines which BEI variant to use.
**Options:**
* `encoder`: BEI - for causal embedding models (Llama, Mistral, Qwen, Gemma)
* `encoder_bert`: BEI-Bert - for BERT-based models (BERT, RoBERTa, Jina, Nomic)
```yaml theme={"system"}
build:
base_model: encoder
```
Specifies where to find the model checkpoint. Repository must follow the standard HuggingFace structure.
**Source options:**
* `HF`: Hugging Face Hub (default)
* `GCS`: Google Cloud Storage
* `S3`: AWS S3
* `AZURE`: Azure Blob Storage
* `REMOTE_URL`: HTTP URL to tar.gz file
* `BASETEN_TRAINING`: Baseten Training checkpoints
For detailed configuration options including training checkpoints and cloud storage setup, see [Deploy training and S3 checkpoints](/engines/performance-concepts/deployment-from-training-and-s3).
```yaml theme={"system"}
checkpoint_repository:
source: HF
repo: "BAAI/bge-large-en-v1.5"
revision: main
runtime_secret_name: hf_access_token # Optional, for private repos
```
Maximum number of tokens that can be processed in a single batch. BEI and BEI-Bert run without chunked-prefill for performance reasons. This limits the effective context length to the `max_position_embeddings` value.
**Range:** 64 to 131072, must be multiple of 64. Use higher values (up to 131072) for long context models. Most models use 16384 as default.
```yaml theme={"system"}
build:
max_num_tokens: 16384
```
Not supported for BEI engines. Leave this value unset. BEI automatically sets it and truncates if context length is exceeded.
Specifies the quantization format for model weights. `FP8` quantization maintains accuracy within 1% of `FP16` for embedding models.
**Options for BEI:**
* `no_quant`: `FP16`/`BF16` precision
* `fp8`: `FP8` weights + 16-bit KV cache
* `fp4`: `FP4` weights + 16-bit KV cache (B200 only)
* `fp4_mlp_only`: `FP4` MLP weights only (B200 only)
**Options for BEI-Bert:**
* `no_quant`: `FP16` precision (only option)
For detailed quantization guidance, see [Quantization guide](/engines/performance-concepts/quantization-guide).
```yaml theme={"system"}
build:
quantization_type: fp8
```
Configuration for post-training quantization calibration.
**Fields:**
* `calib_size`: Size of calibration dataset (64-16384, multiple of 64)
* `calib_dataset`: HuggingFace dataset for calibration
* `calib_max_seq_length`: Maximum sequence length for calibration
```yaml theme={"system"}
quantization_config:
calib_size: 1024
calib_dataset: "abisee/cnn_dailymail"
calib_max_seq_length: 1536
```
BEI automatically configures optimal TensorRT-LLM plugin settings. Manual configuration is not required or supported.
**Automatic optimizations:**
* XQA kernels for maximum throughput
* Dynamic batching for optimal utilization
* Memory-efficient attention mechanisms
* Hardware-specific optimizations
**Note:** Plugin configuration is only available for Engine-Builder-LLM engine.
## Runtime configuration
The `runtime` section configures serving behavior.
The default API endpoint for the deployment.
**Options:**
* `/v1/embeddings`: OpenAI-compatible embeddings endpoint
* `/rerank`: Reranking endpoint
* `/predict`: Classification/prediction endpoint
BEI automatically detects embedding models and sets `/v1/embeddings`. Classification models default to `/predict`.
```yaml theme={"system"}
runtime:
webserver_default_route: /v1/embeddings
```
**`kv_cache_free_gpu_mem_fraction`**: Available but has no effect for BEI embedding models, which do not use a KV cache. Only relevant for generative (decoder) models.
**`enable_chunked_context`**: Available but has no effect for BEI embedding models. Only relevant for generative (decoder) models.
**`batch_scheduler_policy`**: Available but has no effect for BEI embedding models. Only relevant for generative (decoder) models.
## HuggingFace model repository structure
All model sources (S3, GCS, HuggingFace, or tar.gz) must follow the standard HuggingFace repository structure. Files must be in the root directory, similar to running:
```bash theme={"system"}
git clone https://huggingface.co/michaelfeil/bge-small-en-v1.5
```
### Model configuration
**config.json**
* `max_position_embeddings`: Limits maximum context size (content beyond this is truncated)
* `id2label`: Required dictionary mapping IDs to labels for classification models.
* **Note**: Its length must match the output dimension of the last dense layer. Each dense output needs a `name` for the JSON response.
* `architectures`: Must be `ModelForSequenceClassification` or similar (cannot be `ForCausalLM`)
* **Note**: Remote code execution is not supported; architecture is inferred automatically
* `torch_dtype`: Default inference dtype (BEI-Bert: always `fp16`, BEI: `float16`, `bfloat16`)
* **Note**: Pre-quantized weight loading is not supported; weights must be `float16`, `bfloat16`, or `float32` for all engines.
* `quant_config`: Not allowed, since pre-quantized weights are not supported.
#### Model weights
**model.safetensors** (preferred)
* Or: `model.safetensors.index.json` + `model-xx-of-yy.safetensors` (sharded)
* **Note**: Convert to safetensors if you encounter issues with other formats
#### Tokenizer files
**tokenizer\_config.json** and **tokenizer.json**
* Must be "fast" tokenizers compatible with the Rust `tokenizers` library
* Custom Python tokenizer code is not supported; it will be ignored
#### Embedding model files (sentence-transformers)
**1\_Pooling/config.json**
* Required for embedding models to define pooling strategy
**modules.json**
* Required for embedding models
* Shows available pooling layers and configurations
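The required files above can be checked locally before uploading. A minimal sketch (the file list mirrors this section's requirements for an embedding model; the helper function is our own, not part of any Baseten tooling):

```python theme={"system"}
from pathlib import Path

# Files this section requires for a BEI embedding model.
REQUIRED_FILES = [
    "config.json",
    "tokenizer_config.json",
    "tokenizer.json",
    "1_Pooling/config.json",
    "modules.json",
]


def missing_files(repo_dir: str) -> list[str]:
    """Return required files missing from a HuggingFace-style repo directory."""
    root = Path(repo_dir)
    missing = [f for f in REQUIRED_FILES if not (root / f).exists()]
    # Weights: either a single safetensors file or a sharded index.
    if not (root / "model.safetensors").exists() and not (
        root / "model.safetensors.index.json"
    ).exists():
        missing.append("model.safetensors (or sharded index)")
    return missing
```

Run it against a `git clone` of the repository; an empty list means the layout matches what BEI expects for an embedding model.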
### Pooling layer support
| **Engine** | **Classification Layers** | **Pooling Types** | **Notes** |
| ------------ | -------------------------- | --------------------------------------------- | ------------------------ |
| **BEI** | 1 layer maximum | Last token, first token | Limited pooling options |
| **BEI-Bert** | Multiple layers or 1 layer | Last token, first token, mean, SPLADE pooling | Advanced pooling support |
## Complete configuration examples
### BEI with `FP8` quantization (embedding model)
```yaml theme={"system"}
model_name: BEI-BGE-Large-FP8
resources:
accelerator: H100
use_gpu: true
trt_llm:
build:
base_model: encoder
checkpoint_repository:
source: HF
repo: "Qwen/Qwen3-Embedding-8B"
revision: main
max_num_tokens: 16384
quantization_type: fp8
quantization_config:
calib_size: 1536
calib_dataset: "abisee/cnn_dailymail"
calib_max_seq_length: 1536
# plugin_configuration is auto-configured for BEI models.
# Encoder models disable paged_kv_cache and use_paged_context_fmha automatically.
runtime:
webserver_default_route: /v1/embeddings
```
### BEI-Bert for small BERT model
```yaml theme={"system"}
model_name: BEI-Bert-MiniLM-L6
resources:
accelerator: L4
use_gpu: true
trt_llm:
build:
base_model: encoder_bert
checkpoint_repository:
source: HF
repo: "sentence-transformers/all-MiniLM-L6-v2"
revision: main
max_num_tokens: 8192
quantization_type: no_quant
# plugin_configuration is auto-configured for BEI-Bert models.
# paged_kv_cache and use_paged_context_fmha are disabled automatically.
runtime:
webserver_default_route: /v1/embeddings
```
### BEI for reranking model
```yaml theme={"system"}
model_name: BEI-BGE-Reranker
resources:
accelerator: H100
use_gpu: true
trt_llm:
build:
base_model: encoder
checkpoint_repository:
source: HF
repo: "BAAI/bge-reranker-large"
revision: main
max_num_tokens: 16384
quantization_type: fp8
quantization_config:
calib_size: 1024
calib_dataset: "abisee/cnn_dailymail"
calib_max_seq_length: 2048
runtime:
webserver_default_route: /rerank
```
### BEI-Bert for classification model
```yaml theme={"system"}
model_name: BEI-Bert-Language-Detection
resources:
accelerator: L4
use_gpu: true
trt_llm:
build:
base_model: encoder_bert
checkpoint_repository:
source: HF
repo: "papluca/xlm-roberta-base-language-detection"
revision: main
max_num_tokens: 8192
quantization_type: no_quant
runtime:
webserver_default_route: /predict
```
## Validation and troubleshooting
### Common configuration errors
**Error:** `encoder does not have a kv-cache, therefore a kv specific datatype is not valid`
* **Cause:** Using KV quantization (fp8\_kv, fp4\_kv) with encoder models
* **Fix:** Use `fp8` or `no_quant` instead
**Error:** `FP8 quantization is only supported on L4, H100, H200, B200`
* **Cause:** Using `FP8` quantization on unsupported GPU.
* **Fix:** Use H100 or newer GPU, or use `no_quant`.
**Error:** `FP4 quantization is only supported on B200`
* **Cause:** Using `FP4` quantization on unsupported GPU.
* **Fix:** Use B200 GPU or `FP8` quantization.
### Performance tuning
**For maximum throughput:**
* Use `max_num_tokens: 16384` for BEI.
* Enable `FP8` quantization on supported hardware.
* Use `batch_scheduler_policy: max_utilization` for high load.
**For lowest latency:**
* Use smaller `max_num_tokens` for your use case
* Use `batch_scheduler_policy: guaranteed_no_evict`
* Consider BEI-Bert for small models with cold-start optimization
**For cost optimization:**
* Use L4 GPUs with `FP8` quantization.
* Use BEI-Bert for small models.
* Tune `max_num_tokens` to your actual requirements.
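The throughput-oriented settings above combine into a config like the following sketch (values are illustrative; the `batch_scheduler_policy` placement under `runtime` is an assumption based on the V1 stack's runtime options):

```yaml theme={"system"}
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      source: HF
      repo: "BAAI/bge-large-en-v1.5"
    max_num_tokens: 16384          # maximum throughput for BEI
    quantization_type: fp8         # supported on L4, H100, H200, B200
  runtime:
    webserver_default_route: /v1/embeddings
    batch_scheduler_policy: max_utilization  # assumed placement, for high load
```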
## Migration from older configurations
If you're migrating from older BEI configurations:
1. **Update base\_model**: Change from specific model types to `encoder` or `encoder_bert`
2. **Add checkpoint\_repository**: Use the new structured repository configuration
3. **Review quantization**: Ensure quantization type matches hardware capabilities
4. **Update engine**: Add engine configuration for better performance
**Old configuration:**
```yaml theme={"system"}
trt_llm:
build:
model_type: "bge"
checkpoint_repo: "BAAI/bge-large-en-v1.5"
```
**New configuration:**
```yaml theme={"system"}
trt_llm:
build:
base_model: encoder
checkpoint_repository:
source: HF
repo: "BAAI/bge-large-en-v1.5"
max_num_tokens: 16384
quantization_type: fp8
runtime:
webserver_default_route: /v1/embeddings
```
# Overview
Source: https://docs.baseten.co/engines/bei/overview
Production-grade embeddings, reranking, and classification models
Baseten Embeddings Inference (BEI) is Baseten's solution for production-grade inference on embedding, classification, and reranking models using TensorRT-LLM. BEI delivers the lowest latency and highest throughput of any embedding solution.
## BEI vs BEI-Bert
BEI comes in two variants, each optimized for different model architectures:
Causal embedding models with quantization support and maximum throughput.
BERT-based models with cold-start optimization, 16-bit precision and bidirectional attention.
### BEI features
**Use BEI when:**
* Model uses causal architecture (Llama, Mistral, Qwen for embeddings)
* You need quantization support (FP8, FP4)
* Maximum throughput is required
* Models like BAAI/bge, Qwen3-Embedding, Salesforce/SFR-Embedding
**Benefits:**
* **Quantization Support**: FP8 and FP4 quantization for 2-4x speedup
* **Highest Throughput**: Up to 1400 client embeddings per second
* **XQA Kernels**: Optimized attention kernels for maximum performance
* **Dynamic Batching**: Automatic batch optimization for varying loads
**Supported Architectures:**
* `LlamaModel` (e.g., BAAI/bge-multilingual-gemma2)
* `MistralModel` (e.g., Salesforce/SFR-Embedding-Mistral)
* `Qwen2Model` (e.g., Qwen/Qwen3-Embedding-8B)
* `Gemma2Model` (e.g., Google/EmbeddingGemma)
### BEI-Bert features
**Use BEI-Bert when:**
* Model uses BERT-based architecture (sentence-transformers, jinaai, nomic-ai) or generic bidirectional attention models
* You need cold-start optimization for small models (`<4B` parameters)
* 16-bit precision is sufficient for your use case
* Model architectures like Jina-BERT, Nomic, or ModernBERT
**Benefits:**
* **Cold-Start Optimization**: Optimized for fast initialization and small models
* **16-bit Precision**: Models run in FP16 precision
* **BERT Architecture Support**: Specialized optimization for bidirectional models
* **Low Memory Footprint**: Efficient for smaller models and edge deployments
**Supported Architectures:**
* `BertModel` (e.g., sentence-transformers/all-MiniLM-L6-v2)
* `RobertaModel` (e.g., FacebookAI/roberta-base)
* `Jina-BERT` (e.g., jinaai/jina-embeddings-v2-base-en)
* `Nomic-BERT` (e.g., nomic-ai/nomic-embed-text-v1.5)
* `Alibaba-GTE` (e.g., Alibaba-NLP/gte-large-en-v1.5)
* `Llama Bidirectional` (e.g., nvidia/llama-embed-nemotron-8b)
## Model types and use cases
### Embedding models
Embedding models convert text into numerical representations for semantic search, clustering, and retrieval-augmented generation (RAG).
**Examples:**
* **BAAI/bge-large-en-v1.5**: General-purpose English embeddings
* **michaelfeil/Qwen3-Embedding-8B-auto**: Multilingual embeddings with quantization support
* **Salesforce/SFR-Embedding-Mistral**: Instruction-tuned embeddings
**Configuration:**
```yaml theme={"system"}
trt_llm:
build:
base_model: encoder
checkpoint_repository:
source: HF
repo: "BAAI/bge-large-en-v1.5"
    quantization_type: no_quant # fp8 is also supported for causal architectures
```
### Reranking models
Reranking models are actually classification models that score document relevance for search and retrieval tasks. They work by classifying query-document pairs as relevant or not relevant.
**How rerankers work:**
* Rerankers are sequence classification models (ending with `ForSequenceClassification`)
* They take a query and document as input and output a relevance score
* The "reranking" is accomplished by scoring multiple documents and ranking them by the classification score
* You can implement reranking by using the classification endpoint with proper prompt templates
**Recommended:**
* **BAAI/bge-reranker-v2-m3**: Great reranking model (279M params). Performs well in RAG systems where a first pass of vector retrieval surfaces dozens of snippets of data.
* **michaelfeil/Qwen3-Reranker-8B-seq**: Best multilingual and general-purpose reranker. **Note:** Needs to be used with the `webserver_default_route: /predict` setting.
**Configuration:**
```yaml theme={"system"}
trt_llm:
build:
base_model: encoder
checkpoint_repository:
source: HF
repo: "BAAI/bge-reranker-v2-m3"
max_num_tokens: 16384
runtime:
webserver_default_route: /rerank
```
**Implementation:**
Use the `/predict` endpoint with proper prompt formatting for query-document pairs. The baseten-performance-client handles reranking template formatting automatically.
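To sketch what happens under the hood (a minimal sketch; the payload shape and pairing template are illustrative assumptions, not the exact `/predict` schema): each query-document pair is scored through the classification endpoint, and documents are ranked by score.

```python theme={"system"}
def build_predict_payload(query: str, documents: list[str]) -> dict:
    """Build a /predict-style request body scoring each query-document pair.

    The "query: ... document: ..." template is an illustrative assumption;
    models like Qwen3-Reranker define their own prompt format.
    """
    pairs = [f"query: {query} document: {doc}" for doc in documents]
    return {"inputs": pairs}


def rank_by_score(documents: list[str], scores: list[float]) -> list[str]:
    """Order documents by descending relevance score."""
    return [doc for _, doc in sorted(zip(scores, documents), reverse=True)]
```

POST the payload to your deployment's `/predict` route, then feed the returned scores to `rank_by_score`; the baseten-performance-client does this formatting for you.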
### Classification models
Classification models categorize text into predefined classes for tasks like sentiment analysis, content moderation, and language detection.
**Examples:**
* **papluca/xlm-roberta-base-language-detection**: Language identification
* **samlowe/roberta-base-go\_emotions**: Emotion classification
* **Reward Models**: RLHF reward model examples
**Configuration:**
```yaml theme={"system"}
trt_llm:
build:
    base_model: encoder_bert
checkpoint_repository:
source: HF
repo: "papluca/xlm-roberta-base-language-detection"
quantization_type: no_quant # BEI-Bert required for classification models
runtime:
webserver_default_route: /predict
```
## Performance and optimization
### Throughput benchmarks
For detailed performance benchmarks, see: [Run Qwen3 Embedding on NVIDIA Blackwell GPUs](https://www.baseten.co/blog/run-qwen3-embedding-on-nvidia-blackwell-gpus/#bei-provides-the-fastest-embeddings-inference-on-b200s)
| Framework | Precision | GPU | Max Token/s Throughput | Max Request/s Throughput |
| --------- | --------- | ---- | ---------------------- | ------------------------ |
| TEI | FP16 | H100 | 34,055 | 824.25 |
| BEI-Bert | FP16 | H100 | 36,520 | 841.05 |
| vLLM | BF16 | H100 | 36,625 | 155.23 |
| BEI | BF16 | H100 | 47,549 | 761.44 |
| BEI | FP8 | H100 | 77,107 | 855.96 |
| BEI | FP8 | B200 | 121,443 | 1,310.52 |
* **Token Throughput/s**: Measured on 500 tokens per request
* **Request Throughput/s**: Measured on 5 tokens per request
### Quantization impact
| **Quantization** | **Speed Improvement** | **Memory Reduction** | **Accuracy Impact** |
| ---------------- | --------------------- | -------------------- | ------------------- |
| FP16/BF16 vLLM | Baseline | None | None |
| FP16/BF16 BEI | 1.3x | None | None |
| FP8 BEI | 2x faster | 50% | \~1% |
| FP4 BEI | 3.5x faster | 75% | 1-2% |
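The memory-reduction column follows from bytes per weight. A worked sketch for an 8B-parameter model (weights only; KV cache and activations are excluded):

```python theme={"system"}
# Bytes per weight for each precision.
BYTES_PER_WEIGHT = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 1e9


params = 8e9  # e.g. an 8B embedding model
print(weight_memory_gb(params, "fp16"))  # 16.0 -- baseline
print(weight_memory_gb(params, "fp8"))   # 8.0  -- the 50% reduction above
print(weight_memory_gb(params, "fp4"))   # 4.0  -- the 75% reduction above
```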
### Hardware requirements
| **GPU Type** | **BEI Support** | **BEI-Bert Support** | **Recommended For** |
| ------------ | --------------- | -------------------- | -------------------------- |
| L4 | Full | Full | Cost-effective deployments |
| A10G, A100 | Full | Full | Legacy support |
| T4 | No | Full | Legacy support |
| H100 | Full | Full | Maximum performance |
| B200 | Full | Full | FP4 quantization |
## OpenAI compatibility
BEI deployments are fully OpenAI compatible for embeddings:
```python theme={"system"}
from openai import OpenAI
import os
client = OpenAI(
api_key=os.environ['BASETEN_API_KEY'],
base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)
embedding = client.embeddings.create(
input=["Baseten Embeddings are fast.", "Embed this sentence!"],
model="not-required"
)
```
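Returned embeddings can be compared directly. For example, cosine similarity in pure Python (the `embedding.data[i].embedding` access follows the OpenAI response shape):

```python theme={"system"}
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# With a response from the embeddings endpoint:
# cosine_similarity(embedding.data[0].embedding, embedding.data[1].embedding)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 -- orthogonal vectors
```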
### Baseten Performance Client
For maximum throughput, use the [Baseten Performance Client](/engines/performance-concepts/performance-client).
```python theme={"system"}
from baseten_performance_client import PerformanceClient
client = PerformanceClient(
api_key=os.environ['BASETEN_API_KEY'],
base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync"
)
texts = ["Hello world", "Example text", "Another sample"]
response = client.embed(
input=texts,
model="my_model",
batch_size=4,
max_concurrent_requests=32,
timeout_s=360
)
```
## Reference config
For complete configuration options, see the [BEI reference config](/engines/bei/bei-reference).
### Key configuration options
```yaml theme={"system"}
trt_llm:
build:
base_model: encoder # or encoder_bert for BEI-Bert
checkpoint_repository:
source: HF # or GCS, S3, AZURE, REMOTE_URL
repo: "model-repo-name"
revision: main
runtime_secret_name: hf_access_token
max_num_tokens: 16384 # BEI automatically upgrades to 16384
quantization_type: fp8 # or no_quant for BEI-Bert
runtime:
webserver_default_route: /v1/embeddings # or /rerank, /predict
```
## Production best practices
### GPU selection guidelines
* **L4**: Best for models `<4B` parameters, cost-effective
* **H100**: Required for models 4B+ parameters or long context (>8K tokens)
* **H100\_40GB**: Use for models with memory constraints
### Build job optimization
```yaml theme={"system"}
# H100 builds (default)
trt_llm:
build:
num_builder_gpus: 2
# L4 builds (memory-constrained)
trt_llm:
build:
num_builder_gpus: 4
```
### Model-specific recommendations
**BERT-based models (BEI-Bert):**
* Use `encoder_bert` base model
* No quantization support (FP16/BF16 only)
* Best for models `<200M` parameters on L4
**ModernBERT and newer architectures:**
* Support longer contexts (up to 8192 tokens)
* Use H100 for models >1B parameters
* Consider memory requirements for long sequences
**Qwen embedding models:**
* Use regular FP8 quantization
* Support very long contexts (up to 131K tokens)
* Higher memory requirements for long sequences
### Token limit optimization
```yaml theme={"system"}
trt_llm:
build:
max_num_tokens: 16384 # Default, automatically set by BEI
# Override for specific use cases:
# max_num_tokens: 8192 # Standard embeddings
# max_num_tokens: 131072 # Qwen long-context models
```
## Getting started
1. **Choose your variant**: BEI for causal models and quantization, BEI-Bert for BERT models
2. **Review configuration**: See [BEI reference config](/engines/bei/bei-reference)
3. **Deploy your model**: Use the configuration templates and examples
4. **Test integration**: Use OpenAI client or Performance Client for maximum throughput
## Examples and further reading
* [BEI-Bert examples](/engines/bei/bei-bert) - BERT-specific configurations
* [BEI reference config](/engines/bei/bei-reference) - Complete configuration options
* [Embedding examples](/examples/bei) - Concrete deployment examples
* [Performance client documentation](/engines/performance-concepts/performance-client) - Client Usage with Embeddings
# Gated features for BIS-LLM
Source: https://docs.baseten.co/engines/bis-llm/advanced-features
KV-aware routing, disaggregated serving, and other gated features
BIS-LLM provides features for large-scale deployments: KV cache optimization, disaggregated serving, and specialized inference strategies.
These advanced features are not fully self-serviceable. [Contact us](mailto:support@baseten.co) to enable them for your organization.
## Available advanced features
### Routing and scaling
*KV-aware routing* and *disaggregated serving* optimize multi-replica deployments. KV-aware routing directs requests to replicas with the best cache hit potential, while disaggregated serving separates prefill and decode phases into independent clusters that scale separately. *Separate prefill and decode autoscaling* uses token-exact metrics to right-size each phase.
### MoE optimization
*WideEP* (expert parallelism) distributes experts across multiple GPUs for extremely large expert counts. These features work together to maximize hardware utilization on models like DeepSeek-V3 and Qwen3MoE.
### Attention and memory
*DP attention for MLA* (Multi-Head Latent Attention) compresses the KV cache by projecting attention tensors into a compact latent space and balances the KV cache across GPU ranks, tuning DeepSeek deployments for high throughput. *DeepSparseAttention* sparsifies the attention matrix based on token relevance. *Distributed KV storage* spreads KV cache across devices for long-context inference beyond single-device memory limits.
### Speculative decoding
*Speculative n-gram automata-based decoding* uses automata to predict tokens from n-gram patterns without full model computation. *Speculative MTP or Eagle3 decoding* uses draft-model approaches to predict and verify multiple future tokens.
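To illustrate the n-gram idea (a conceptual sketch, not the production automata implementation): when the recent context matches a pattern seen earlier in the sequence, the tokens that followed it last time are proposed as a draft, and the target model only verifies them.

```python theme={"system"}
def ngram_draft(tokens: list[int], n: int = 3, max_draft: int = 4) -> list[int]:
    """Propose draft tokens by matching the last n tokens against
    earlier occurrences in the sequence (conceptual sketch)."""
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    # Search backwards for a previous occurrence of the suffix.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == suffix:
            return tokens[i + n:i + n + max_draft]
    return []


# Repetitive sequence: the draft predicts the continuation of the pattern.
seq = [1, 2, 3, 4, 1, 2, 3]
print(ngram_draft(seq, n=3))  # [4, 1, 2, 3] -- what followed [1, 2, 3] before
```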
### Kernel optimization
*Zero-overlap scheduling* overlaps computation and communication to hide latency. *Auto-tuned kernels* optimize kernel parameters for your specific hardware and model topology.
## KV-aware routing
KV-aware routing directs requests to replicas with the best chance of KV cache hits, routing based on cache availability and replica utilization.
KV-aware routing reduces inter-token latency by distributing load across replicas, improves time-to-first-token through cache hits on repeated queries, and increases global throughput through cache reuse.
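A conceptual sketch of the routing decision (not Baseten's implementation; the scoring weights are illustrative): each replica is scored by how much of the incoming prompt's prefix it already has cached, penalized by its current load.

```python theme={"system"}
def pick_replica(prompt_tokens: list[int], replicas: dict[str, dict]) -> str:
    """Choose the replica maximizing cached-prefix overlap minus a load
    penalty. Each replica dict holds `cached_prefix` (a token list) and
    `load` (0.0-1.0). The penalty weight is arbitrary, for illustration."""
    def score(r: dict) -> float:
        overlap = 0
        for a, b in zip(prompt_tokens, r["cached_prefix"]):
            if a != b:
                break
            overlap += 1
        return overlap - 10.0 * r["load"]

    return max(replicas, key=lambda name: score(replicas[name]))


replicas = {
    "replica-a": {"cached_prefix": [1, 2, 3, 4], "load": 0.2},
    "replica-b": {"cached_prefix": [], "load": 0.0},
}
print(pick_replica([1, 2, 3, 4, 5], replicas))  # replica-a: cache hit wins
```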
## Disaggregated serving
Disaggregated serving separates prefill and decode phases into independent clusters, allowing each to scale and be optimized independently. This architecture is particularly valuable for large MoE models.
Disaggregated serving is available as a gated feature. [Contact us](mailto:support@baseten.co) to be paired with an engineer to discuss your needs.
Disaggregated serving enables independent scaling of prefill and decode resources, isolates time-critical TTFT metrics from throughput-focused phases, and optimizes costs by right-sizing each phase for its workload.
## Get started
### Choose the right configuration
**For advanced deployments** with large MoE models and planet-scale inference, [contact us](mailto:support@baseten.co).
**For standard deployments**:
Use the standard BIS-LLM configuration as documented in [BIS-LLM configuration](/engines/bis-llm/bis-llm-config).
## Model recommendations
### Models that benefit from advanced features
**Large MoE models:**
* DeepSeek-V3
* Qwen3MoE
* Kimi-K2
* GLM-4.7
* GPT-OSS
**Ideal use cases:**
* High-throughput API services
* Complex reasoning tasks
* Long-context applications, including agentic coding
* Planet-scale deployments
### When to use standard BIS-LLM or Engine-Builder-LLM
* Dense models under 70B parameters
* Standard MoE models under 30B parameters
* Development and testing environments
* Workloads with low KV cache hit rates
## Related
* [BIS-LLM overview](/engines/bis-llm/overview): Main engine documentation.
* [BIS-LLM reference config](/engines/bis-llm/bis-llm-config): Configuration options.
* [Structured outputs documentation](/engines/performance-concepts/structured-outputs): JSON schema validation.
* [Examples section](/examples/overview): Deployment examples.
# Reference Config (BIS-LLM)
Source: https://docs.baseten.co/engines/bis-llm/bis-llm-config
Complete reference config for V2 inference stack and MoE models
This reference provides complete configuration options for BIS-LLM (Baseten Inference Stack V2) engine. BIS-LLM uses the V2 inference stack with simplified configuration and enhanced features for MoE models and advanced use cases.
## Configuration structure
```yaml theme={"system"}
trt_llm:
inference_stack: v2 # Always v2 for BIS-LLM
build:
checkpoint_repository: {...}
quantization_type: no_quant | fp8 | fp8_kv | fp4 | fp4_kv | fp4_mlp_only
quantization_config: {...}
num_builder_gpus: 1
skip_build_result: false
runtime:
max_seq_len: 32768
max_batch_size: 256
max_num_tokens: 8192
tensor_parallel_size: 1
enable_chunked_prefill: true
served_model_name: "model-name"
patch_kwargs: {...}
```
## Build configuration
### `checkpoint_repository`
Specifies where to find the model checkpoint. Same structure as V1 but with V2-specific optimizations.
**Structure:**
```yaml theme={"system"}
checkpoint_repository:
source: HF | GCS | S3 | AZURE | REMOTE_URL | BASETEN_TRAINING
repo: "model-repository-name"
revision: main # Optional, only for HF
runtime_secret_name: hf_access_token # Optional, for private repos
```
For detailed configuration options including training checkpoints and cloud storage setup, see [Deploy training and S3 checkpoints](/engines/performance-concepts/deployment-from-training-and-s3).
### `quantization_type`
Quantization options for V2 inference stack (simplified from V1):
**Options:**
* `no_quant`: Uses the checkpoint's native precision (FP16 or BF16). Unique to BIS-LLM: pre-quantized checkpoints from NVIDIA ModelOpt libraries are also supported.
* `fp8`: FP8 weights + 16-bit KV cache
* `fp8_kv`: FP8 weights + FP8 KV cache
* `fp4`: FP4 weights + 16-bit KV cache (B200 only)
* `fp4_kv`: FP4 weights + FP8 KV cache (B200 only)
* `fp4_mlp_only`: FP4 MLP layers only + 16-bit KV cache
For detailed quantization guidance including hardware requirements, calibration strategies, and model-specific recommendations, see [Quantization Guide](/engines/performance-concepts/quantization-guide).
### `quantization_config`
Configuration for post-training quantization calibration:
**Structure:**
```yaml theme={"system"}
quantization_config:
calib_size: 1024
calib_dataset: "abisee/cnn_dailymail"
calib_max_seq_length: 2048
```
### `num_builder_gpus`
Number of GPUs to use during the build process.
**Default:** `1` (auto-detected from resources)\
**Range:** 1 to 8
**Example:**
```yaml theme={"system"}
build:
num_builder_gpus: 4 # For large models or complex quantization
```
### `skip_build_result`
Skip the engine build step and use a pre-built model that does not require quantization.
**Default:** `false`\
**Use case:** When you have a pre-built engine from model cache
**Example:**
```yaml theme={"system"}
build:
skip_build_result: true
```
## Engine configuration
### `max_seq_len`
Maximum sequence length (context) for single requests.
**Default:** `None` (auto-detected from model config)
**Range:** 1 to 1048576
**Example:**
```yaml theme={"system"}
runtime:
max_seq_len: 131072 # 128K context
```
### `max_batch_size`
Maximum number of input sequences processed concurrently.
**Default:** `256`\
**Range:** 1 to 2048
**Example:**
```yaml theme={"system"}
runtime:
max_batch_size: 128 # Lower for better latency
```
### `max_num_tokens`
Maximum number of batched input tokens after padding removal.
**Default:** `8192`\
**Range:** 64 to 131072
**Example:**
```yaml theme={"system"}
runtime:
max_num_tokens: 16384 # Higher for better throughput
```
### `tensor_parallel_size`
Number of GPUs to use for tensor parallelism.
**Default:** `1` (auto-detected from resources)\
**Range:** 1 to 8
**Example:**
```yaml theme={"system"}
runtime:
tensor_parallel_size: 4 # For large models
```
### `enable_chunked_prefill`
Enable chunked prefilling for long sequences.
**Default:** `true`
**Example:**
```yaml theme={"system"}
runtime:
enable_chunked_prefill: true
```
### `served_model_name`
Model name returned in API responses.
**Default:** `None` (uses model name from config)
**Example:**
```yaml theme={"system"}
runtime:
served_model_name: "gpt-oss-120b"
```
### `patch_kwargs`
Advanced configuration patches for V2 inference stack.
**Structure:**
```yaml theme={"system"}
patch_kwargs:
custom_setting: "value"
advanced_config:
nested_setting: true
```
**Note:** This is a preview feature and may change in future versions.
## Complete configuration examples
### Qwen3-30B-A3B-Instruct-2507 MoE with FP4 on B200
```yaml theme={"system"}
model_name: Qwen3-30B-A3B-Instruct-2507-FP4
resources:
accelerator: B200:1
cpu: '4'
memory: 40Gi
use_gpu: true
trt_llm:
inference_stack: v2
build:
checkpoint_repository:
source: HF
repo: "Qwen/Qwen3-Coder-30B-A3B-Instruct"
revision: main
quantization_type: fp4
quantization_config:
calib_size: 2048
calib_dataset: "abisee/cnn_dailymail"
calib_max_seq_length: 4096
num_builder_gpus: 1
runtime:
max_seq_len: 65536
max_batch_size: 256
max_num_tokens: 8192
tensor_parallel_size: 1
enable_chunked_prefill: true
served_model_name: "Qwen3-30B-A3B-Instruct-2507"
```
### GPT-OSS 120B on B200:1 with no\_quant
**Note**: Our hosted GPT-OSS deployments are much more optimized. The example below is functional, but you can squeeze out much more performance on `B200`, e.g. with Baseten's custom Eagle heads.
```yaml theme={"system"}
model_name: gpt-oss-120b-b200
resources:
accelerator: B200:1
cpu: '4'
memory: 40Gi
use_gpu: true
trt_llm:
inference_stack: v2
build:
checkpoint_repository:
source: HF
repo: "openai/gpt-oss-120b"
revision: main
runtime_secret_name: hf_access_token
quantization_type: no_quant
quantization_config:
calib_size: 1024
calib_dataset: "abisee/cnn_dailymail"
calib_max_seq_length: 2048
runtime:
max_seq_len: 131072
max_batch_size: 256
max_num_tokens: 16384
tensor_parallel_size: 1
enable_chunked_prefill: true
served_model_name: "gpt-oss-120b"
```
### DeepSeek V3
**Note**: Our hosted DeepSeek V3 / V3.1 / V3.2 deployments are much more optimized. The example below is functional, but you can squeeze out much more performance on `B200:4`, e.g. with MTP heads, disaggregated serving, or data-parallel attention.
```yaml theme={"system"}
model_name: nvidia/DeepSeek-V3.1-NVFP4
resources:
accelerator: B200:4
cpu: '8'
memory: 80Gi
use_gpu: true
trt_llm:
inference_stack: v2
build:
checkpoint_repository:
source: HF
repo: "nvidia/DeepSeek-V3.1-NVFP4"
revision: main
runtime_secret_name: hf_access_token
quantization_type: no_quant # nvidia/DeepSeek-V3.1-NVFP4 is already modelopt compatible
quantization_config:
calib_size: 1024
calib_dataset: "abisee/cnn_dailymail"
calib_max_seq_length: 2048
runtime:
max_seq_len: 131072
max_batch_size: 256
max_num_tokens: 16384
    tensor_parallel_size: 4
enable_chunked_prefill: true
served_model_name: "nvidia/DeepSeek-V3.1-NVFP4"
```
## V2 vs V1 configuration differences
### Simplified build configuration
**V1 build configuration:**
```yaml theme={"system"}
trt_llm:
build:
base_model: decoder
max_seq_len: 131072
max_batch_size: 256
max_num_tokens: 8192
quantization_type: fp8_kv
tensor_parallel_count: 4
plugin_configuration: {...}
speculator: {...}
```
**V2 build configuration:**
```yaml theme={"system"}
trt_llm:
inference_stack: v2
build:
checkpoint_repository: {...}
quantization_type: fp8
num_builder_gpus: 4
runtime:
max_seq_len: 131072
max_batch_size: 256
max_num_tokens: 8192
tensor_parallel_size: 4
```
### Key differences
1. **`inference_stack`**: Explicitly set to `v2`
2. **Simplified build options**: Many V1 options moved to engine
3. **No `base_model`**: Automatically detected from checkpoint
4. **No `plugin_configuration`**: Handled automatically
5. **No `speculator`**: Lookahead decoding requires FDE involvement.
6. **Tensor parallel**: Moved to engine as `tensor_parallel_size`
## Validation and troubleshooting
### Common V2 configuration errors
**Error:** `Field trt_llm.build.base_model is not allowed to be set when using v2 inference stack`
* **Cause:** Setting `base_model` in V2 configuration
* **Fix:** Remove `base_model` field, V2 detects automatically
**Error:** `Field trt_llm.build.quantization_type is not allowed to be set when using v2 inference stack`
* **Cause:** Using unsupported quantization type
* **Fix:** Use supported quantization: `no_quant`, `fp8`, `fp4`, `fp4_mlp_only`, `fp4_kv`, `fp8_kv`
**Error:** `Field trt_llm.build.speculator is not allowed to be set when using v2 inference stack`
* **Cause:** Trying to use lookahead decoding in V2
* **Fix:** Use the V1 stack for lookahead decoding, use V2 without speculation, or contact us to enable speculation on V2
## Migration from V1
### V1 to V2 migration
**V1 configuration:**
```yaml theme={"system"}
trt_llm:
build:
base_model: decoder
checkpoint_repository:
source: HF
repo: "Qwen/Qwen3-4B"
max_seq_len: 32768
max_batch_size: 256
max_num_tokens: 8192
quantization_type: fp8_kv
tensor_parallel_count: 1
plugin_configuration:
paged_kv_cache: true
use_paged_context_fmha: true
use_fp8_context_fmha: true
runtime:
kv_cache_free_gpu_mem_fraction: 0.9
enable_chunked_context: true
```
**V2 configuration:**
```yaml theme={"system"}
trt_llm:
inference_stack: v2
build:
checkpoint_repository:
source: HF
repo: "Qwen/Qwen3-4B"
quantization_type: fp8_kv
runtime:
max_seq_len: 32768
max_batch_size: 256
max_num_tokens: 8192
tensor_parallel_size: 1
enable_chunked_prefill: true
```
### Migration steps
1. **Add `inference_stack: v2`**
2. **Remove `base_model`** (auto-detected)
3. **Move `max_seq_len`, `max_batch_size`, and `max_num_tokens` to `runtime`**
4. **Change `tensor_parallel_count` to `tensor_parallel_size`**
5. **Remove `plugin_configuration`** (handled automatically)
6. **Update quantization type** (V2 has simplified options)
7. **Remove `speculator`** (not supported in V2)
## Hardware selection
**GPU recommendations for V2:**
* **B200**: Best for FP4 quantization and next-gen performance
* **H100**: Best for FP8 quantization and production workloads
* **Multi-GPU**: Required for large MoE models (>30B parameters)
**Configuration guidelines:**
| **Model Size** | **Recommended GPU** | **Quantization** | **Tensor Parallel** |
| -------------- | ------------------- | ---------------- | ------------------- |
| `<30B` MoE | H100:2-4 | FP8 | 2-4 |
| 30-100B MoE | H100:4-8 | FP8 | 4-8 |
| 100B+ MoE | B200:4-8 | FP4 | 4-8 |
| Dense >30B | H100:2-4 | FP8 | 2-4 |
## Related
* [BIS-LLM overview](/engines/bis-llm/overview) - Main engine documentation.
* [Advanced features documentation](/engines/bis-llm/advanced-features) - Enterprise features and capabilities.
* [Structured outputs for BIS-LLM](/engines/performance-concepts/structured-outputs) - Advanced JSON schema validation.
* [Examples section](/examples/overview) - Concrete deployment examples.
# Overview
Source: https://docs.baseten.co/engines/bis-llm/overview
Next-generation engine for MoE models with advanced optimizations
BIS-LLM (Baseten Inference Stack V2) is Baseten's next-generation engine for Mixture of Experts (MoE) models and advanced text generation use cases. Built on the V2 inference stack, it provides cutting-edge optimizations including KV-aware routing, disaggregated serving, expert parallel load balancing and DP attention.
Before you continue reading: only a small subset of these features is enabled for self-serve customers. The primary way to deploy these large models is through Forward Deployed Engineers.
## Overview and use cases
BIS-LLM is designed for MoE models and scenarios requiring the most advanced inference optimizations.
### Ideal for:
**MoE model families:**
* **DeepSeek**: `deepseek-ai/DeepSeek-R1`, `deepseek-ai/DeepSeek-V3.1`, `deepseek-ai/DeepSeek-V3.2`
* **Qwen MoE**: `Qwen/Qwen3-30B-A3B`, `Qwen/Qwen3-Coder-480B-A35B-Instruct`
* **Kimi**: `moonshotai/Kimi-K2-Instruct`
* **GLM**: `zai-org/GLM-4.7`
* **Llama 4**: `meta-llama/llama-4-maverick`
* **GPT-OSS**: Various open-source GPT variants
**Advanced use cases:**
* **High-performance inference**: FP4 quantization on GB200/B200 GPUs
* **Complex reasoning**: Advanced tool calling and structured outputs
* **Large-scale deployments**: Multi-node setups and distributed inference
## Forward deployed engineer gated features
Some of the more advanced features are gated behind feature flags that we toggle internally. They are not the easiest to use, and some are mutually exclusive, which makes them hard to document well here.
The features below power some of the largest LLM deployments among the customers on our website, as well as several [world records on GPUs](https://www.baseten.co/blog/how-we-made-the-fastest-gpt-oss-on-nvidia-gpus-60-percent-faster/).
For detailed information on each advanced feature, see [Gated Features for BIS-LLM](/engines/bis-llm/advanced-features).
## Architecture support
### MoE model support
BIS-LLM specifically optimizes for Mixture of Experts architectures:
**Primary MoE architectures:**
* `DeepseekV32ForCausalLM` - DeepSeek family
* `Qwen3MoEForCausalLM` - Qwen3 MoE family
* `KimiK2ForCausalLM` - Kimi K2 family
* `Glm4MoeForCausalLM` - GLM MoE variants
* `GPTOSS` - OpenAI GPT-OSS variants
* ...
### Dense model support
While optimized for MoE, BIS-LLM also supports dense models with advanced features:
**Benefits for dense models:**
* **GB200/B200 optimization**: Advanced GPU kernel optimization
* **FP4 quantization**: Next-generation quantization support
* **Enhanced memory management**: Improved KV cache handling
**When to use BIS-LLM for dense models:**
* Models >30B parameters requiring maximum performance
* Deployments on GB200/B200 GPUs with advanced quantization
* You tried V1 and want to compare it against V2
* You want to try V2 features like KV-aware routing or disaggregated serving
* You need speculative decoding on GB200/B200
### Advanced quantization
BIS-LLM supports next-generation quantization formats for maximum performance:
**Quantization options:**
* `no_quant`: FP16/BF16 precision; automatically uses `hf_quant_config.json` from ModelOpt if present in the checkpoint
* `fp8`: FP8 weights + 16-bit KV cache
* `fp4`: FP4 weights + 16-bit KV cache
* `fp8_kv`: FP8 weights + 8-bit symmetric KV cache
* `fp4_kv`: FP4 weights + 8-bit symmetric KV cache
* `fp4_mlp_only`: FP4 weights (MLP layers only) + 16-bit KV cache and attention computation
**B200 optimization:**
* **FP4 kernels**: Custom B200 kernels for maximum performance
* **Memory efficiency**: ~75% weight-memory reduction with FP4; some models, such as DeepSeek-V3, run best on B200 due to kernel selection.
* **Speed improvement**: 4x-8x faster inference with minimal accuracy loss
* **Cascaded improvements**: More memory and faster inference leading to improved system performance, especially under high load.
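The memory numbers above can be sanity-checked with quick arithmetic. This sketch (assuming a hypothetical 30B-parameter model; real footprints also include the KV cache and activations) compares weight memory at BF16 and FP4:

```python theme={"system"}
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB at a given precision."""
    return num_params * bits_per_weight / 8 / (1024 ** 3)

params = 30e9  # hypothetical 30B-parameter model
bf16 = weight_memory_gib(params, 16)
fp4 = weight_memory_gib(params, 4)
print(f"BF16: {bf16:.1f} GiB, FP4: {fp4:.1f} GiB, saving: {1 - fp4 / bf16:.0%}")
```

Going from 16-bit to 4-bit weights is a 4x reduction, i.e. the 75% saving quoted above.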
**Example:**
```yaml theme={"system"}
trt_llm:
inference_stack: v2
build:
checkpoint_repository:
source: HF
repo: "Qwen/Qwen3-30B-A3B"
quantization_type: fp4 # B200 only
```
### Structured outputs and tool calling
Advanced JSON schema validation and function calling capabilities:
**Features:**
* **JSON schema validation**: Precise structured output generation
* **Function calling**: Advanced tool selection and execution
* **Multi-tool support**: Complex tool chains and reasoning
* **Schema inheritance**: Nested and complex schema support
**Example:**
```python theme={"system"}
import os

from pydantic import BaseModel
from openai import OpenAI
class ResearchResult(BaseModel):
topic: str
findings: list[str]
confidence: float
sources: list[str]
client = OpenAI(
api_key=os.environ['BASETEN_API_KEY'],
base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)
response = client.beta.chat.completions.parse(
model="not-required",
messages=[
{"role": "user", "content": "Analyze the latest AI research papers"}
],
response_format=ResearchResult
)
result = response.choices[0].message.parsed
```
## Configuration examples
**Note**: The examples below are functional but minimal. Advanced features change frequently, so please reach out for help configuring a specific or fine-tuned model; we are happy to help.
### GPT-OSS 120B deployment
```yaml theme={"system"}
model_name: gpt-oss-120b
resources:
accelerator: H100:8 # 8 GPUs for large dense model
cpu: '8'
memory: 80Gi
use_gpu: true
trt_llm:
inference_stack: v2
build:
checkpoint_repository:
source: HF
repo: "openai/gpt-oss-120b"
revision: main
runtime_secret_name: hf_access_token
# GPT-OSS runs in MXFP4 - which is supported by H100.
# by selecting `no_quant` we apply no special quantization.
# MXFP4 and modelopt-style nvfp4 are supported out of the box.
quantization_type: no_quant
num_builder_gpus: 8
runtime:
max_seq_len: 32768
max_batch_size: 256
max_num_tokens: 16384
tensor_parallel_size: 8
enable_chunked_prefill: true
served_model_name: "gpt-oss-120b"
```
### Qwen3-30B-A3B-Instruct-2507 MoE with FP4 quantization
```yaml theme={"system"}
model_name: Qwen3-30B-A3B-Instruct-2507-FP4
resources:
accelerator: B200:2
cpu: '4'
memory: 40Gi
use_gpu: true
trt_llm:
inference_stack: v2
build:
checkpoint_repository:
source: HF
repo: "Qwen/Qwen3-30B-A3B-Instruct-2507"
revision: main
quantization_type: fp4
num_builder_gpus: 2
runtime:
max_seq_len: 65536
max_batch_size: 128
max_num_tokens: 8192
tensor_parallel_size: 2
enable_chunked_prefill: true
served_model_name: "Qwen3-30B-A3B-Instruct-2507"
```
### Dense model with BIS-LLM V2
```yaml theme={"system"}
model_name: Llama-3.3-70B-V2
resources:
accelerator: H100:4
cpu: '4'
memory: 40Gi
use_gpu: true
trt_llm:
inference_stack: v2
build:
checkpoint_repository:
source: HF
repo: "meta-llama/Llama-3.3-70B-Instruct"
revision: main
runtime_secret_name: hf_access_token
quantization_type: fp8
num_builder_gpus: 4
runtime:
max_seq_len: 131072
max_batch_size: 256
max_num_tokens: 8192
tensor_parallel_size: 4
enable_chunked_prefill: true
served_model_name: "Llama-3.3-70B-Instruct"
```
## Integration examples
### OpenAI-compatible inference
```python theme={"system"}
from openai import OpenAI
import os
client = OpenAI(
api_key=os.environ['BASETEN_API_KEY'],
base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)
# Standard chat completion
response = client.chat.completions.create(
model="not-required",
messages=[
{"role": "system", "content": "You are an advanced AI assistant."},
{"role": "user", "content": "Explain the concept of mixture of experts in AI."}
],
temperature=0.7,
max_tokens=1000
)
print(response.choices[0].message.content)
```
### Advanced structured outputs
```python theme={"system"}
import os

from pydantic import BaseModel
from openai import OpenAI
class ExpertAnalysis(BaseModel):
routing_decision: str
expert_utilization: dict[str, float]
processing_time: float
confidence_score: float
client = OpenAI(
api_key=os.environ['BASETEN_API_KEY'],
base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)
response = client.beta.chat.completions.parse(
model="not-required",
messages=[
{"role": "user", "content": "Analyze the expert routing for this complex query"}
],
response_format=ExpertAnalysis
)
analysis = response.choices[0].message.parsed
print(f"Routing decision: {analysis.routing_decision}")
print(f"Expert utilization: {analysis.expert_utilization}")
```
### Multi-tool function calling
```python theme={"system"}
import os

from openai import OpenAI

client = OpenAI(
api_key=os.environ['BASETEN_API_KEY'],
base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)
tools = [
{
"type": "function",
"function": {
"name": "analyze_expert_routing",
"description": "Analyze expert routing patterns",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string"},
"expert_count": {"type": "integer"}
}
}
}
},
{
"type": "function",
"function": {
"name": "optimize_performance",
"description": "Optimize model performance",
"parameters": {
"type": "object",
"properties": {
"target_tps": {"type": "number"},
"memory_budget": {"type": "integer"}
}
}
}
}
]
response = client.chat.completions.create(
model="not-required",
messages=[
{"role": "user", "content": "Analyze and optimize the performance of this MoE model"}
],
tools=tools
)
for tool_call in response.choices[0].message.tool_calls or []:
print(f"Function: {tool_call.function.name}")
print(f"Arguments: {tool_call.function.arguments}")
```
## Best practices
### Hardware selection
**GPU recommendations:**
* **B200**: Best for FP4 quantization and next-gen performance
* **H100**: Best for FP8 quantization and production workloads
* **Multi-GPU**: Required for large MoE models (>30B parameters)
* **Multi-node**: For the largest deployments requiring distributed inference
**Configuration guidelines:**
| **Model Size** | **Recommended GPU** | **Quantization** | **Tensor Parallel** |
| -------------- | ------------------- | ---------------- | ------------------- |
| `<30B` MoE | H100:2-4 | FP8 | 2-4 |
| 30-100B MoE | H100:4-8 | FP8 | 4-8 |
| 100B+ MoE | B200:4-8 | FP4 | 4-8 |
| Dense >30B | H100:2-4 | FP8 | 2-4 |
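As a rough heuristic, the guidelines table can be expressed as a small helper (illustrative only; `recommend_config` is not a Baseten API):

```python theme={"system"}
def recommend_config(params_b: float, moe: bool) -> dict:
    """Map model size and architecture to the guideline table (approximate)."""
    if moe and params_b >= 100:
        return {"accelerator": "B200:4-8", "quantization": "fp4", "tensor_parallel": "4-8"}
    if moe and params_b >= 30:
        return {"accelerator": "H100:4-8", "quantization": "fp8", "tensor_parallel": "4-8"}
    # <30B MoE and dense >30B both fall in the H100:2-4 / FP8 row
    return {"accelerator": "H100:2-4", "quantization": "fp8", "tensor_parallel": "2-4"}

print(recommend_config(120, moe=True))
```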
## Production best practices
### V2 inference stack optimization
#### Configuration differences from V1
```yaml theme={"system"}
# V2 (recommended for MoE and advanced models)
trt_llm:
inference_stack: v2
build:
checkpoint_repository:
source: HF
repo: "openai/gpt-oss-120b"
quantization_type: fp8
runtime:
max_seq_len: 32768 # Set in engine for V2
max_batch_size: 32
tensor_parallel_size: 8 # Engine configuration
```
## Migration guide
### From Engine-Builder-LLM
**V1 configuration:**
```yaml theme={"system"}
trt_llm:
build:
base_model: decoder
checkpoint_repository:
source: HF
repo: "Qwen/Qwen3-32B"
quantization_type: fp8_kv
tensor_parallel_count: 8
```
**V2 configuration:**
```yaml theme={"system"}
trt_llm:
inference_stack: v2
build:
checkpoint_repository:
source: HF
repo: "Qwen/Qwen3-32B"
quantization_type: fp8_kv
runtime:
tensor_parallel_size: 8
enable_chunked_prefill: true
```
### Key differences
1. **`inference_stack`**: Explicitly set to `v2`
2. **Build configuration**: Simplified with fewer options
3. **Engine configuration**: Enhanced with V2-specific features
4. **Performance**: Better optimization for MoE models
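The key differences can be sketched as a mechanical transform over the `trt_llm` dictionary (illustrative only; it covers just the renames listed above):

```python theme={"system"}
def migrate_v1_to_v2(trt_llm: dict) -> dict:
    """Rewrite a V1 `trt_llm` section for V2: set inference_stack,
    drop `base_model`, and move tensor parallelism into `runtime`."""
    build = dict(trt_llm.get("build", {}))
    runtime = dict(trt_llm.get("runtime", {}))
    build.pop("base_model", None)  # not used by V2
    tp = build.pop("tensor_parallel_count", None)
    if tp is not None:
        runtime["tensor_parallel_size"] = tp
    runtime.setdefault("enable_chunked_prefill", True)
    return {"inference_stack": "v2", "build": build, "runtime": runtime}

v1 = {"build": {"base_model": "decoder",
                "checkpoint_repository": {"source": "HF", "repo": "Qwen/Qwen3-32B"},
                "quantization_type": "fp8_kv",
                "tensor_parallel_count": 8}}
print(migrate_v1_to_v2(v1))
```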
## Related
* [BIS-LLM reference config](/engines/bis-llm/bis-llm-config) - Complete V2 configuration options.
* [Advanced features documentation](/engines/bis-llm/advanced-features) - Enterprise features and capabilities.
* [Structured outputs](/engines/performance-concepts/structured-outputs) - Advanced JSON schema validation.
* [Examples section](/examples/overview) - Concrete deployment examples.
# Custom engine builder
Source: https://docs.baseten.co/engines/engine-builder-llm/custom-engine-builder
Implement custom model.py for business logic, logging, and advanced inference patterns
Implement custom business logic, request handling, and inference patterns in `model.py` while maintaining TensorRT-LLM performance. Custom engine builder enables billing integration, request tracing, fan-out generation, and multi-response workflows.
## Overview
The custom engine builder lets you:
* **Implement business logic**: Billing, usage tracking, access control.
* **Add custom logging**: Request tracing, performance monitoring, audit trails.
* **Create advanced inference patterns**: Fan-out generation, custom chat templates.
* **Integrate external services**: APIs, databases, monitoring systems.
* **Optimize performance**: Concurrent processing, custom batching strategies.
## When to use custom engine builder
### Ideal use cases
**Business logic integration:**
* **Usage tracking**: Monitor token usage per customer/request.
* **Access control**: Implement custom authentication/authorization.
* **Rate limiting**: Custom rate limiting based on user tiers.
* **Audit logging**: Compliance and security requirements.
**Advanced inference patterns:**
* **Fan-out generation**: Generate multiple responses from one request.
* **Custom chat templates**: Domain-specific conversation formats.
* **Multi-response workflows**: Parallel processing of variations.
* **Conditional generation**: Business rule-based output modification.
**Performance and monitoring:**
* **Custom logging**: Request tracing, performance metrics.
* **Concurrent processing**: Parallel generation for improved throughput.
* **Usage analytics**: Track patterns and optimize accordingly.
* **Error handling**: Custom error responses and fallback logic.
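As one concrete sketch of the usage-tracking pattern (the `UsageTracker` class is hypothetical; a real implementation would forward counts to your billing or analytics backend):

```python theme={"system"}
from collections import defaultdict

class UsageTracker:
    """In-memory per-customer token counter (illustrative only)."""

    def __init__(self) -> None:
        self._tokens = defaultdict(int)

    def record(self, customer_id: str, total_tokens: int) -> None:
        # In production, forward this to your billing/analytics system.
        self._tokens[customer_id] += total_tokens

    def total(self, customer_id: str) -> int:
        return self._tokens[customer_id]

tracker = UsageTracker()
tracker.record("acme", 1200)  # e.g. response.usage.total_tokens
tracker.record("acme", 800)
print(tracker.total("acme"))  # 2000
```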
## Implementation
### Fan-out generation example
Multi-generation fan-out produces multiple completions from a single request. The first generation runs on its own so the shared KV cache is populated before the remaining generations run concurrently.
```python model/model.py theme={"system"}
# model/model.py
import copy
import asyncio
from typing import Any, Dict, List, Optional, Tuple
from fastapi import HTTPException, Request
from starlette.responses import JSONResponse, StreamingResponse
Message = Dict[str, str] # {"role": "...", "content": "..."}
class Model:
def __init__(self, trt_llm, **kwargs) -> None:
self._secrets = kwargs["secrets"]
self._engine = trt_llm["engine"]
async def predict(self, model_input: Dict[str, Any], request: Request) -> Any:
# Validate request structure
if not isinstance(model_input, dict):
raise HTTPException(status_code=400, detail="Request body must be a JSON object.")
# Enforce non-streaming for this example
if bool(model_input.get("stream", False)):
raise HTTPException(status_code=400, detail="stream=true is not supported here; set stream=false.")
# Extract base messages and fan-out tasks
prompt_key, base_messages = self._get_base_messages(model_input)
n, suffix_tasks = self._parse_fanout(model_input)
# Build reusable request (don't forward fan-out params to engine)
base_req = copy.deepcopy(model_input)
base_req.pop("suffix_messages", None)
# Extract debug ID for logging/tracing
debug_id = request.headers.get("X-Debug-ID", "")
# Run sequential generations
per_gen_payloads: List[Any] = []
async def run_generation(i: int) -> Any:
msgs_i = copy.deepcopy(base_messages)
if suffix_tasks is not None:
msgs_i.extend(suffix_tasks[i])
base_req[prompt_key] = msgs_i
# Debug logging
if debug_id:
print(f"Running generation {debug_id} {i} with messages: {msgs_i}")
# Time the generation
start_time = asyncio.get_event_loop().time()
resp = await self._engine.chat_completions(request=request, model_input=base_req)
end_time = asyncio.get_event_loop().time()
# Debug logging
if debug_id:
duration = end_time - start_time
print(f"Result Generation {debug_id} {i} response: {resp} (took {duration:.3f}s)")
# Validate response type
if isinstance(resp, StreamingResponse) or hasattr(resp, "body_iterator"):
raise HTTPException(status_code=400, detail="Engine returned streaming but stream=false was requested.")
return resp
# Run first generation
payload = await run_generation(0)
per_gen_payloads.append(payload)
# Run remaining generations concurrently
if n > 1:
results = await asyncio.gather(*(run_generation(i) for i in range(1, n)))
per_gen_payloads.extend(results)
# Convert to OpenAI-ish multi-choice response
out = self._to_openai_choices(per_gen_payloads)
return JSONResponse(content=out.model_dump())
# Helper methods
def _get_base_messages(self, model_input: Dict[str, Any]) -> Tuple[str, List[Message]]:
"""Extract and validate base messages from request."""
if "prompt" in model_input:
raise HTTPException(status_code=400, detail='Use "messages" instead of "prompt" for chat models.')
if "messages" not in model_input:
raise HTTPException(status_code=400, detail='Request must include "messages" field.')
key = "messages"
msgs = model_input.get(key)
if not isinstance(msgs, list):
raise HTTPException(status_code=400, detail=f'"{key}" must be a list of messages.')
for m in msgs:
if not isinstance(m, dict) or "role" not in m or "content" not in m:
raise HTTPException(status_code=400, detail=f'Each item in "{key}" must have role+content.')
return key, msgs
def _parse_fanout(self, model_input: Dict[str, Any]) -> Tuple[int, Optional[List[List[Message]]]]:
"""Parse and validate fan-out configuration."""
        suffix = model_input.get("suffix_messages", None)
        if suffix is None:
            return 1, None  # no fan-out requested; run a single generation
        if not isinstance(suffix, list) or any(not isinstance(t, list) for t in suffix):
raise HTTPException(status_code=400, detail='"suffix_messages" must be a list of tasks (each task is a list of messages).')
if len(suffix) < 1 or len(suffix) > 256:
raise HTTPException(status_code=400, detail='"suffix_messages" must have between 1 and 256 tasks.')
for task in suffix:
for m in task:
if not isinstance(m, dict) or "role" not in m or "content" not in m:
raise HTTPException(status_code=400, detail="Each suffix message must have role+content.")
return len(suffix), suffix
def _to_openai_choices(self, payloads: List[Any]) -> Any:
"""Convert multiple payloads to OpenAI-style choices."""
base = payloads[0]
if hasattr(base, "choices") and hasattr(base, "model_dump"):
new_choices = []
for i, p in enumerate(payloads):
c0 = p.choices[0]
# Ensure index matches OpenAI n semantics
try:
c0.index = i
except Exception:
c0 = c0.model_copy(update={"index": i})
new_choices.append(c0)
# Aggregate usage statistics
base.usage.completion_tokens += p.usage.completion_tokens
base.usage.prompt_tokens += p.usage.prompt_tokens
base.usage.total_tokens += p.usage.total_tokens
base.choices = new_choices
return base
raise HTTPException(status_code=500, detail=f"Unsupported engine response type for fanout. {type(base)}")
async def chat_completions( # if you need to use /v1/completions use def completions(..)
self,
model_input: Dict[str, Any],
request: Request,
) -> Any:
# alias to predict, so that both /predict and (/sync)/v1/chat/completions work
return await self.predict(model_input, request)
```
### Fan-out generation configuration
To deploy the example above, create a new directory, e.g. `fanout`, and save the code as `fanout/model/model.py`.
Then create the following `config.yaml` at `fanout/config.yaml`:
```yaml config.yaml theme={"system"}
model_name: Multi-Generation-LLM
resources:
accelerator: H100
cpu: '2'
memory: 20Gi
use_gpu: true
trt_llm:
build:
base_model: decoder
checkpoint_repository:
source: HF
repo: "meta-llama/Llama-3.1-8B-Instruct"
quantization_type: fp8
runtime:
served_model_name: "Multi-Generation-LLM"
```
Finally, push the model with `truss push --watch`.
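Once deployed, a client can exercise the fan-out behavior like this (the model URL is a placeholder; `SEND_REQUEST` guards the actual network call):

```python theme={"system"}
import os

# Request payload for the fan-out model. "suffix_messages" is the custom
# field that model.py parses: each task's messages are appended to the
# base messages for one generation.
payload = {
    "messages": [{"role": "user", "content": "Summarize this document."}],
    "suffix_messages": [
        [{"role": "user", "content": "Answer in one sentence."}],
        [{"role": "user", "content": "Answer as bullet points."}],
    ],
    "stream": False,  # the example model.py rejects streaming
    "max_tokens": 256,
}

SEND_REQUEST = False  # set True (with BASETEN_API_KEY exported) to call the model
if SEND_REQUEST:
    import requests  # third-party: pip install requests

    resp = requests.post(
        "https://model-xxxxxx.api.baseten.co/environments/production/predict",
        headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
        json=payload,
        timeout=600,
    )
    print(resp.json()["choices"])
```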
## Limitations and considerations
### What custom engine builder cannot do
**Custom tokenization:**
* Cannot modify the underlying tokenizer implementation
* Cannot add custom vocabulary or special tokens
* Must use the model's native tokenization
**Model architecture changes:**
* Cannot modify the TensorRT-LLM engine structure
* Cannot change attention mechanisms or model layers
* Cannot add custom model components
### When to use standard engine instead
* Standard chat completions without special requirements
* No need for business logic integration
## Monitoring and debugging
### Request tracing
```python theme={"system"}
import os
import time
import uuid
from contextlib import asynccontextmanager
from typing import Any, Dict

from fastapi import Request
class Model:
def __init__(self, trt_llm, **kwargs):
self._engine = trt_llm["engine"]
        # Environment variables are strings; parse to a bool (enabled by default)
        self._trace_enabled = os.environ.get("enable_tracing", "true").lower() == "true"
@asynccontextmanager
async def _trace_request(self, request_id: str):
"""Context manager for request tracing."""
if self._trace_enabled:
print(f"[TRACE] Start: {request_id}")
start_time = time.time()
try:
yield
finally:
if self._trace_enabled:
duration = time.time() - start_time
print(f"[TRACE] End: {request_id} (duration: {duration:.3f}s)")
async def predict(self, model_input: Dict[str, Any], request: Request) -> Any:
request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
async with self._trace_request(request_id):
# Main logic here
response = await self._engine.chat_completions(request=request, model_input=model_input)
return response
```
## Related
* [Engine-Builder-LLM overview](/engines/engine-builder-llm/overview): Main engine documentation.
* [Engine-Builder-LLM configuration](/engines/engine-builder-llm/engine-builder-config): Complete reference config.
* [Examples section](/examples/overview): Deployment examples.
* [Chains documentation](/development/chain/overview): Multi-model workflows.
# Reference config (Engine-Builder-LLM)
Source: https://docs.baseten.co/engines/engine-builder-llm/engine-builder-config
Complete reference config for dense text generation models
This reference covers all build and runtime options for Engine-Builder-LLM deployments. All settings use the `trt_llm` section in `config.yaml`.
## Configuration structure
```yaml theme={"system"}
trt_llm:
inference_stack: v1 # Always v1 for Engine-Builder-LLM
build:
base_model: decoder
checkpoint_repository: {...}
max_seq_len: 131072
max_batch_size: 256
max_num_tokens: 8192
quantization_type: no_quant | fp8 | fp8_kv | fp4 | fp4_kv | fp4_mlp_only
quantization_config: {...}
tensor_parallel_count: 1
plugin_configuration: {...}
speculator: {...} # Optional for lookahead decoding
runtime:
kv_cache_free_gpu_mem_fraction: 0.9
enable_chunked_context: true
batch_scheduler_policy: guaranteed_no_evict
served_model_name: "model-name"
total_token_limit: 500000
```
## Build configuration
The `build` section configures model compilation and optimization settings.
The base model architecture for your model checkpoint.
**Options:**
* `decoder`: For CausalLM models (Llama, Mistral, Qwen, Gemma, Phi)
```yaml theme={"system"}
build:
base_model: decoder
```
Specifies where to find the model checkpoint. Repository must be a valid Hugging Face model repository with the standard structure (config.json, tokenizer files, model weights).
**Source options:**
* `HF`: Hugging Face Hub (default)
* `GCS`: Google Cloud Storage
* `S3`: AWS S3
* `AZURE`: Azure Blob Storage
* `REMOTE_URL`: HTTP URL to tar.gz file
* `BASETEN_TRAINING`: Baseten Training checkpoints
For detailed configuration options including training checkpoints and cloud storage setup, see [Deploy training and S3 checkpoints](/engines/performance-concepts/deployment-from-training-and-s3).
```yaml theme={"system"}
checkpoint_repository:
source: HF
repo: "meta-llama/Llama-3.3-70B-Instruct"
revision: main
runtime_secret_name: hf_access_token
```
Maximum sequence length (context) for single requests. Range: 1 to 1048576.
```yaml theme={"system"}
build:
max_seq_len: 131072 # 128K context
```
Maximum number of input sequences processed concurrently. Range: 1 to 2048.
Unless lookahead decoding is enabled, this parameter has little effect on performance; keep it at 256 for most cases.
Avoid setting it below 8, so the scheduler retains flexibility across varied workloads.
```yaml theme={"system"}
build:
max_batch_size: 256
```
Maximum number of batched input tokens after padding removal in each batch. Range: >64 to 1048576.
If `enable_chunked_prefill: false`, this also limits the `max_seq_len` that can be processed. Recommended: `8192` or `16384`.
```yaml theme={"system"}
build:
max_num_tokens: 16384
```
Specifies the quantization format for model weights.
**Options:**
* `no_quant`: `FP16`/`BF16` precision
* `fp8`: `FP8` weights + 16-bit KV cache
* `fp8_kv`: `FP8` weights + `FP8` KV cache
* `fp4`: `FP4` weights + 16-bit KV cache (B200 only)
* `fp4_kv`: `FP4` weights + `FP8` KV cache (B200 only)
* `fp4_mlp_only`: `FP4` MLP only + 16-bit KV (B200 only)
For detailed quantization guidance, see [Quantization Guide](/engines/performance-concepts/quantization-guide).
```yaml theme={"system"}
build:
quantization_type: fp8_kv
```
Configuration for post-training quantization calibration.
**Fields:**
* `calib_size`: Size of calibration dataset (64-16384, multiple of 64). Defines how many rows of the train split with text column to take.
* `calib_dataset`: HuggingFace dataset for calibration. Dataset must have 'text' column (str type) for samples, or 'train' split as subsection.
* `calib_max_seq_length`: Maximum sequence length for calibration (default: 1536).
```yaml theme={"system"}
build:
quantization_type: fp8
quantization_config:
calib_size: 1536
calib_dataset: "abisee/cnn_dailymail"
calib_max_seq_length: 1536
```
Number of GPUs to use for tensor parallelism. Range: 1 to 8.
```yaml theme={"system"}
build:
tensor_parallel_count: 4 # For 70B+ models
```
TensorRT-LLM plugin configuration for performance optimization.
**Fields:**
* `paged_kv_cache`: Enable paged KV cache (recommended: true)
* `use_paged_context_fmha`: Enable paged context FMHA (recommended: true)
* `use_fp8_context_fmha`: Enable `FP8` context FMHA (requires `FP8_KV` quantization)
```yaml theme={"system"}
build:
plugin_configuration:
paged_kv_cache: true
use_paged_context_fmha: true
use_fp8_context_fmha: true # For FP8_KV quantization
```
Configuration for speculative decoding with lookahead. For detailed configuration, see [Lookahead decoding](/engines/engine-builder-llm/lookahead-decoding).
**Fields:**
* `speculative_decoding_mode`: `LOOKAHEAD_DECODING` (recommended)
* `lookahead_windows_size`: Window size for speculation (1-8)
* `lookahead_ngram_size`: N-gram size for patterns (1-16)
* `lookahead_verification_set_size`: Verification buffer size (1-8)
* `enable_b10_lookahead`: Enable Baseten's lookahead algorithm
```yaml theme={"system"}
build:
speculator:
speculative_decoding_mode: LOOKAHEAD_DECODING
lookahead_windows_size: 3
lookahead_ngram_size: 8
lookahead_verification_set_size: 3
enable_b10_lookahead: true
```
Number of GPUs to use during the build job. Only set this if you encounter errors during the build job. It has no impact once the model reaches the deploying stage. If not set, equals `tensor_parallel_count`.
```yaml theme={"system"}
build:
num_builder_gpus: 2
```
## Runtime configuration
The `runtime` section configures inference engine behavior.
Fraction of GPU memory to reserve for KV cache. Range: 0.1 to 1.0.
```yaml theme={"system"}
runtime:
kv_cache_free_gpu_mem_fraction: 0.85
```
Enable chunked prefilling for long sequences.
```yaml theme={"system"}
runtime:
enable_chunked_context: true
```
Policy for scheduling requests in batches.
**Options:**
* `max_utilization`: Maximize GPU utilization (may evict requests)
* `guaranteed_no_evict`: Guarantee request completion (recommended)
```yaml theme={"system"}
runtime:
batch_scheduler_policy: guaranteed_no_evict
```
Model name returned in API responses.
```yaml theme={"system"}
runtime:
served_model_name: "Llama-3.3-70B-Instruct"
```
Default maximum number of tokens to generate per request when not specified by the client. If not set, the engine uses its own default.
```yaml theme={"system"}
runtime:
request_default_max_tokens: 4096
```
Number of bytes to reserve on host (CPU) memory for KV cache offloading. Set to a high value to enable KV cache offloading from GPU to host memory. Only set this if you need to support longer contexts than GPU memory alone can handle.
```yaml theme={"system"}
runtime:
kv_cache_host_memory_bytes: 10000000000 # ~10GB host memory for KV cache
```
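To size this value, it helps to estimate the per-token KV footprint. This sketch assumes a hypothetical 8B-class model shape (32 layers, 8 grouped-query KV heads, head dimension 128, FP16 cache):

```python theme={"system"}
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Per-token KV cache footprint: key + value, across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(32, 8, 128)  # hypothetical 8B-class shape
budget = 10_000_000_000  # the ~10GB host budget from the example above
print(f"{per_token} bytes/token -> room for ~{budget // per_token:,} extra cached tokens")
```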
Maximum number of tokens that can be scheduled at once. Range: 1 to 1000000.
```yaml theme={"system"}
runtime:
total_token_limit: 1000000
```
## Configuration examples
### Llama 3.3 70B
```yaml theme={"system"}
model_name: Llama-3.3-70B-Instruct
resources:
accelerator: H100:4
cpu: '4'
memory: 40Gi
use_gpu: true
trt_llm:
build:
base_model: decoder
checkpoint_repository:
source: HF
repo: "meta-llama/Llama-3.3-70B-Instruct"
revision: main
runtime_secret_name: hf_access_token
max_seq_len: 131072
max_batch_size: 256
max_num_tokens: 8192
quantization_type: fp8_kv
tensor_parallel_count: 4
plugin_configuration:
paged_kv_cache: true
use_paged_context_fmha: true
use_fp8_context_fmha: true
quantization_config:
calib_size: 1024
calib_dataset: "abisee/cnn_dailymail"
calib_max_seq_length: 2048
runtime:
kv_cache_free_gpu_mem_fraction: 0.9
enable_chunked_context: true
batch_scheduler_policy: guaranteed_no_evict
served_model_name: "Llama-3.3-70B-Instruct"
```
### Qwen 2.5 32B with lookahead decoding
```yaml theme={"system"}
model_name: Qwen-2.5-32B-Lookahead
resources:
accelerator: H100:2
cpu: '2'
memory: 20Gi
use_gpu: true
trt_llm:
build:
base_model: decoder
checkpoint_repository:
source: HF
repo: "Qwen/Qwen2.5-32B-Instruct"
revision: main
max_seq_len: 32768
max_batch_size: 128
max_num_tokens: 8192
quantization_type: fp8_kv
tensor_parallel_count: 2
speculator:
speculative_decoding_mode: LOOKAHEAD_DECODING
lookahead_windows_size: 3
lookahead_ngram_size: 8
lookahead_verification_set_size: 3
enable_b10_lookahead: true
plugin_configuration:
paged_kv_cache: true
use_paged_context_fmha: true
use_fp8_context_fmha: true
runtime:
kv_cache_free_gpu_mem_fraction: 0.85
enable_chunked_context: true
batch_scheduler_policy: guaranteed_no_evict
served_model_name: "Qwen-2.5-32B-Instruct"
```
### Small model on L4
```yaml theme={"system"}
model_name: Llama-3.2-3B-Instruct
resources:
accelerator: L4
cpu: '1'
memory: 10Gi
use_gpu: true
trt_llm:
build:
base_model: decoder
checkpoint_repository:
source: HF
repo: "meta-llama/Llama-3.2-3B-Instruct"
revision: main
max_seq_len: 8192
max_batch_size: 256
max_num_tokens: 4096
quantization_type: fp8
tensor_parallel_count: 1
plugin_configuration:
paged_kv_cache: true
use_paged_context_fmha: true
use_fp8_context_fmha: false
runtime:
kv_cache_free_gpu_mem_fraction: 0.9
enable_chunked_context: true
batch_scheduler_policy: guaranteed_no_evict
served_model_name: "Llama-3.2-3B-Instruct"
```
### B200 with `FP4` quantization
```yaml theme={"system"}
model_name: Qwen-2.5-32B-FP4
resources:
accelerator: B200
cpu: '2'
memory: 20Gi
use_gpu: true
trt_llm:
build:
base_model: decoder
checkpoint_repository:
source: HF
repo: "Qwen/Qwen2.5-32B-Instruct"
revision: main
max_seq_len: 32768
max_batch_size: 256
max_num_tokens: 8192
quantization_type: fp4_kv
tensor_parallel_count: 1
plugin_configuration:
paged_kv_cache: true
use_paged_context_fmha: true
use_fp8_context_fmha: true
quantization_config:
calib_size: 1024
calib_dataset: "abisee/cnn_dailymail"
calib_max_seq_length: 2048
runtime:
kv_cache_free_gpu_mem_fraction: 0.9
enable_chunked_context: true
batch_scheduler_policy: guaranteed_no_evict
served_model_name: "Qwen-2.5-32B-Instruct"
```
## Validation and troubleshooting
### Common errors
**Error:** `FP8 quantization is only supported on L4, H100, H200, B200`
* **Cause:** Using `FP8` quantization on unsupported GPU.
* **Fix:** Use H100 or newer GPU, or use `no_quant`.
**Error:** `FP4 quantization is only supported on B200`
* **Cause:** Using `FP4` quantization on unsupported GPU.
* **Fix:** Use B200 GPU or `FP8` quantization.
**Error:** `Using fp8 context fmha requires fp8 kv, or fp4 with kv cache dtype`
* **Cause:** Mismatch between quantization and context FMHA settings.
* **Fix:** Use `fp8_kv` quantization or disable `use_fp8_context_fmha`.
**Error:** `Tensor parallelism and GPU count must be the same`
* **Cause:** Mismatch between `tensor_parallel_count` and GPU count.
* **Fix:** Ensure `tensor_parallel_count` matches `accelerator` count.
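The last check can be automated before pushing. This sketch (not a Baseten API) parses the `accelerator` string format used above, e.g. `H100:4`:

```python theme={"system"}
def gpu_count(accelerator: str) -> int:
    """Parse the GPU count from a spec like 'H100:4' (no suffix means 1)."""
    _, _, count = accelerator.partition(":")
    return int(count) if count else 1

def check_tensor_parallel(config: dict) -> None:
    """Raise if tensor_parallel_count does not match the accelerator GPU count."""
    gpus = gpu_count(config["resources"]["accelerator"])
    tp = config["trt_llm"]["build"].get("tensor_parallel_count", 1)
    if tp != gpus:
        raise ValueError(f"tensor_parallel_count={tp} but accelerator has {gpus} GPUs")

check_tensor_parallel({
    "resources": {"accelerator": "H100:4"},
    "trt_llm": {"build": {"tensor_parallel_count": 4}},
})  # passes silently
```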
### Performance tuning
**For lowest latency:**
* Reduce `max_batch_size` and `max_num_tokens`.
* Use `batch_scheduler_policy: guaranteed_no_evict`.
* Consider smaller models or quantization.
**For highest throughput:**
* Increase `max_batch_size` and `max_num_tokens`.
* Use `batch_scheduler_policy: max_utilization`.
* Enable quantization on supported hardware.
**For cost optimization:**
* Use L4 GPUs with `FP8` quantization.
* Choose appropriately sized models.
* Tune `max_seq_len` to your actual requirements.
## Model repository structure
All model sources (S3, GCS, HuggingFace, or tar.gz) must follow the standard HuggingFace repository structure. Files must be in the root directory, similar to running:
```bash theme={"system"}
git clone https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
```
### Required files
**Model configuration (`config.json`):**
* `max_position_embeddings`: Limits maximum context size (content beyond this is truncated).
* `vocab_size`: Vocabulary size for the model.
* `architectures`: Must include `LlamaForCausalLM`, `MistralForCausalLM`, or similar causal LM architectures. Custom code is typically not read.
* `torch_dtype`: Default inference dtype (`float16` or `bfloat16`). Cannot be a pre-quantized model.
**Model weights (`model.safetensors`):**
* Or: `model.safetensors.index.json` + `model-xx-of-yy.safetensors` (sharded).
* Convert to safetensors if you encounter issues with other formats.
* Cannot be a pre-quantized model. Model must be an `fp16`, `bf16`, or `fp32` checkpoint.
**Tokenizer files (`tokenizer_config.json` and `tokenizer.json`):**
* For maximum compatibility, use "FAST" tokenizers compatible with Rust.
* Cannot contain custom Python code.
* For chat completions: must contain `chat_template`, a Jinja2 template.
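A quick local sanity check against these requirements can catch problems before a build starts. This is an illustrative stdlib-only sketch, not the engine's actual validation; treating a `quantization_config` key as a sign of a pre-quantized checkpoint is a heuristic.

```python theme={"system"}
import json
from pathlib import Path

def validate_checkpoint(repo_dir: str) -> list[str]:
    """Sanity-check a local HF-style checkpoint against the requirements above."""
    root = Path(repo_dir)
    problems = []
    # Required configuration and tokenizer files in the repo root.
    for name in ("config.json", "tokenizer_config.json", "tokenizer.json"):
        if not (root / name).is_file():
            problems.append(f"missing {name}")
    # Weights: single file or sharded index.
    has_weights = (root / "model.safetensors").is_file() or \
        (root / "model.safetensors.index.json").is_file()
    if not has_weights:
        problems.append("missing model.safetensors (or sharded index)")
    cfg_file = root / "config.json"
    if cfg_file.is_file():
        cfg = json.loads(cfg_file.read_text())
        archs = cfg.get("architectures", [])
        if not any(a.endswith("ForCausalLM") for a in archs):
            problems.append(f"no causal LM architecture in {archs}")
        # Heuristic: pre-quantized checkpoints usually carry a quantization_config.
        if "quantization_config" in cfg:
            problems.append("checkpoint appears pre-quantized")
    return problems
```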
### Architecture support
| **Model family** | **Supported architectures** | **Notes** |
| ---------------- | -------------------------------------- | --------------------------------------------------- |
| **Llama** | `LlamaForCausalLM` | Full support for Llama 3. For Llama 4, use BIS-LLM. |
| **Mistral** | `MistralForCausalLM` | Including v0.3 and Small variants. |
| **Qwen** | `Qwen2ForCausalLM`, `Qwen3ForCausalLM` | Including Qwen 2.5 and Qwen 3 series. |
| **QwenMoE** | `Qwen3MoEForCausalLM` | Specific support for Qwen3MoE. |
| **Gemma** | `GemmaForCausalLM` | Including Gemma 2 and Gemma 3 series, bf16 only. |
## Best practices
### Model size and GPU selection
| **Model size** | **Recommended GPU** | **Quantization** | **Tensor parallel** |
| -------------- | ------------------- | ---------------- | ------------------- |
| `<8B` | L4/H100 | `FP8_KV` | 1 |
| 8B-70B | H100 | `FP8_KV` | 1-2 |
| 70B+ | H100/B200 | `FP8_KV`/`FP4` | 4+ |
### Production recommendations
* Use `quantization_type: fp8_kv` for best performance/accuracy balance.
* Set `max_batch_size` based on your expected traffic patterns.
* Enable `paged_kv_cache` and `use_paged_context_fmha` for optimal performance.
### Development recommendations
* Use `quantization_type: no_quant` for fastest iteration.
* Set smaller `max_seq_len` to reduce build time.
* Use `batch_scheduler_policy: guaranteed_no_evict` for predictable behavior.
# Speculative decoding guide
Source: https://docs.baseten.co/engines/engine-builder-llm/lookahead-decoding
Faster inference with speculative decoding for coding agents and text generation
Lookahead decoding is a speculative decoding technique that provides 2x-4x faster inference for suitable workloads by predicting future tokens using n-gram patterns. It's particularly effective for coding agents and content with predictable patterns.
## Overview
Lookahead decoding identifies n-gram patterns in the input context and past tokens, speculates on future tokens by generating candidate sequences, verifies predictions against the model's actual output, and accepts verified tokens in a single step.
The technique works with any model compatible with Engine-Builder-LLM. Baseten's B10 Lookahead implementation searches up to 10M past tokens for n-gram matches across language patterns.
## When to use lookahead decoding
Lookahead decoding excels at code generation, where programming language syntax creates predictable patterns; function signatures, variable names, and common idioms all benefit. It also accelerates prompt-lookup scenarios where you provide example completions in the prompt, and general low-latency use cases where you can trade slightly lower throughput for faster individual responses.
### Limitations
* Lookahead is supported on A10G, L4, A100, H100\_40GB, H200, and H100. Other GPUs may not be supported.
* During speculative decoding, sampling is disabled and temperature is set to 0.0.
* Speculative decoding does not affect output quality. The output depends only on model weights and prompt.
* Speculative decoding generates multiple tokens at a time. Structured output backends (xgrammar, outlines) with state-machine guarantees (enforced JSON via `response_format`) aren't available with `engine-builder-llm` lookahead decoding.
* Chunked prefill is not compatible with lookahead decoding; when both are configured, chunked prefill is disabled dynamically.
## Configuration
### Basic lookahead configuration
Add a `speculator` section to your build configuration:
```yaml theme={"system"}
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen2.5-7B-Instruct"
    speculator:
      speculative_decoding_mode: LOOKAHEAD_DECODING
      lookahead_windows_size: 3
      lookahead_ngram_size: 8
      lookahead_verification_set_size: 3
      enable_b10_lookahead: true
```
### Configuration parameters
**`speculative_decoding_mode`**: Set to `LOOKAHEAD_DECODING` to enable Baseten's lookahead decoding algorithm.
**`lookahead_ngram_size`**: Size of n-gram patterns for speculation. Range: 1-64, default: 8. Use `4` for simple patterns, `8` for general use (recommended), or `16-32` for complex, highly predictable patterns.
**`lookahead_verification_set_size`**: Size of verification buffer for speculation. Range: 1-8. Use `1` for high-confidence patterns, `3` for general use (recommended), or `5` for complex patterns requiring more verification.
**`lookahead_windows_size`**: Size of the speculation window. Range: 1-8. Set to the same value as `lookahead_verification_set_size`.
**`enable_b10_lookahead`**: Enable Baseten's optimized lookahead algorithm. Default: `true`. Recommended to keep at `true`.
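As a quick sanity check, the documented ranges can be validated before deploying. This is an illustrative helper, not part of the Baseten SDK; the defaults mirror the recommended values above.

```python theme={"system"}
def validate_speculator(cfg: dict) -> list[str]:
    """Check lookahead parameters against the documented ranges."""
    problems = []
    ngram = cfg.get("lookahead_ngram_size", 8)
    window = cfg.get("lookahead_windows_size", 3)
    verify = cfg.get("lookahead_verification_set_size", 3)
    if not 1 <= ngram <= 64:
        problems.append("lookahead_ngram_size must be in [1, 64]")
    if not 1 <= window <= 8:
        problems.append("lookahead_windows_size must be in [1, 8]")
    if not 1 <= verify <= 8:
        problems.append("lookahead_verification_set_size must be in [1, 8]")
    # The guide recommends lookahead_windows_size == lookahead_verification_set_size.
    return problems
```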
### Performance tuning
**For coding agents:** Use smaller window sizes with moderate n-gram sizes:
```yaml theme={"system"}
speculator:
  speculative_decoding_mode: LOOKAHEAD_DECODING
  lookahead_windows_size: 1
  lookahead_ngram_size: 8
  lookahead_verification_set_size: 3
  enable_b10_lookahead: true
```
**For general text generation:** Use balanced window and n-gram sizes:
```yaml theme={"system"}
speculator:
  speculative_decoding_mode: LOOKAHEAD_DECODING
  lookahead_windows_size: 3
  lookahead_ngram_size: 8
  lookahead_verification_set_size: 3
  enable_b10_lookahead: true
```
**For highly predictable content:** Use larger n-gram sizes with conservative verification:
```yaml theme={"system"}
speculator:
  speculative_decoding_mode: LOOKAHEAD_DECODING
  lookahead_windows_size: 1
  lookahead_ngram_size: 32
  lookahead_verification_set_size: 1
  enable_b10_lookahead: true
```
## Performance impact
### Batch size considerations
Lookahead decoding performs best with smaller batch sizes. Set `max_batch_size` to 32 or 64, depending on your use case.
### Memory overhead
Lookahead decoding adds only minimal additional GPU memory overhead.
## Production best practices
### Recommended configurations
**Standard (general purpose):** Balanced settings for general-purpose text generation:
```yaml theme={"system"}
speculator:
  speculative_decoding_mode: LOOKAHEAD_DECODING
  lookahead_windows_size: 3
  lookahead_ngram_size: 8
  lookahead_verification_set_size: 3
  enable_b10_lookahead: true
```
**Dynamic content (less predictable):**
Setting `enable_b10_lookahead: true` with `lookahead_windows_size: 1` and `lookahead_verification_set_size: 1` enables dynamic-length speculation.
The speculated length depends on the quality of the lookup match; by default, the engine speculates an n-gram of k tokens for a k-token suffix match.
```yaml theme={"system"}
speculator:
  speculative_decoding_mode: LOOKAHEAD_DECODING
  lookahead_windows_size: 1
  lookahead_ngram_size: 32
  lookahead_verification_set_size: 1
  enable_b10_lookahead: true
```
**Code generation (highly predictable):** Code has predictable syntax patterns, so you can use larger windows:
```yaml theme={"system"}
speculator:
  speculative_decoding_mode: LOOKAHEAD_DECODING
  lookahead_windows_size: 7
  lookahead_ngram_size: 5
  lookahead_verification_set_size: 7
  enable_b10_lookahead: true
```
### Build configuration
Set `max_batch_size` to control batch size limits:
```yaml theme={"system"}
trt_llm:
  build:
    max_batch_size: 64  # Recommended for lookahead decoding
    speculator:
      speculative_decoding_mode: LOOKAHEAD_DECODING
      # ... other speculator config
```
### Engine optimization
* Use smaller batch sizes for maximum benefit (1-8 requests)
* Monitor memory overhead and adjust KV cache allocation
* Test with your specific workload for optimal parameters
## Examples
### Code generation example
Deploy a coding model with lookahead decoding on an H100:
```yaml theme={"system"}
model_name: Qwen-Coder-7B-Lookahead
resources:
  accelerator: H100
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen2.5-7B-Instruct"
    quantization_type: fp8
    max_batch_size: 64
    speculator:
      speculative_decoding_mode: LOOKAHEAD_DECODING
      lookahead_windows_size: 1
      lookahead_ngram_size: 8
      lookahead_verification_set_size: 1
      enable_b10_lookahead: true
  runtime:
    served_model_name: "Qwen-Coder-7B"
```
## Integration examples
### Python code generation
Generate code using the chat completions API:
```python theme={"system"}
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

# Generate a Python function refactor with lookahead decoding
code = "def hello_world(name):\n    print(42)"

response = client.chat.completions.create(
    model="not-required",
    messages=[
        {
            "role": "system",
            "content": "You are a Python programming assistant. Write clean, efficient code."
        },
        {
            # Providing the code anywhere in the prompt makes generation much faster.
            "role": "user",
            "content": f"Please refactor the following function to have docstrings. {code}"
        }
    ],
    temperature=0.0,
    max_tokens=200
)
print(response.choices[0].message.content)
```
## Best practices
### Configuration optimization
* **Coding assistants:** use `lookahead_windows_size: 1` with `lookahead_ngram_size: 8` and keep batch sizes under 16 for best performance.
* **Structured content (YAML, XML):** use `lookahead_windows_size: 1` with `lookahead_ngram_size: 8`. Note that `response_format` enforcement isn't available with Engine-Builder-LLM lookahead decoding.
* **General use:** start with the default settings (window = 3, n-gram = 8) and adjust based on your content patterns.
### Performance monitoring
Track tokens/second with and without lookahead to measure speed improvement, verification accuracy to see how often speculations succeed, and memory usage to catch overhead. If speed improvement diminishes, reduce batch size. Adjust window size based on content predictability and ngram size based on verification accuracy.
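A minimal way to track tokens per second is to time a streaming response. This sketch assumes an OpenAI-style stream of chunks; the whitespace-based `count_tokens` default is a rough stand-in, so swap in your model's tokenizer for accurate numbers.

```python theme={"system"}
import time

def measure_tokens_per_second(stream, count_tokens=lambda s: len(s.split())):
    """Consume an OpenAI-style streaming response and report rough throughput."""
    start = time.perf_counter()
    text = ""
    for chunk in stream:
        # Each chunk carries a partial completion in delta.content.
        text += chunk.choices[0].delta.content or ""
    elapsed = time.perf_counter() - start
    return text, count_tokens(text) / max(elapsed, 1e-9)
```

Run it once with the `speculator` section enabled and once without to quantify the speedup on your workload.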
### Troubleshooting
**Common issues:**
**Low speed improvement:**
* Check if content is suitable for lookahead decoding
* Reduce batch size for better performance
* Adjust window and ngram sizes
**Blackwell support**
* Lookahead isn't fully supported on Blackwell GPUs with `Engine-Builder-LLM`; see the [BIS-LLM overview](/engines/bis-llm/overview) for Blackwell support.
## Related
* [Engine-Builder-LLM overview](/engines/engine-builder-llm/overview): Main engine documentation.
* [Engine-Builder-LLM configuration](/engines/engine-builder-llm/engine-builder-config): Complete reference config.
* [Structured outputs documentation](/engines/performance-concepts/structured-outputs): JSON schema validation.
* [Examples section](/examples/speculative-decoding): Deployment examples.
# LoRA support
Source: https://docs.baseten.co/engines/engine-builder-llm/lora-support
Multi-LoRA adapters for Engine-Builder-LLM engine
Engine-Builder-LLM supports multi-LoRA deployments with runtime adapter switching. Share base model weights across fine-tuned variants and switch adapters without redeployment.
## Overview
Deploy multiple LoRA adapters on a single base model and switch between them at inference time. The engine shares base model weights across all adapters for memory efficiency.
## Configuration
### Basic LoRA configuration
```yaml theme={"system"}
model_name: Qwen2.5-Coder-LoRA
resources:
  accelerator: H100
  cpu: '2'
  memory: 20Gi
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen2.5-Coder-1.5B-Instruct"
      revision: "2e1fd397ee46e1388853d2af2c993145b0f1098a"
    lora_adapters:
      lora1:
        repo: "ai-blond/Qwen-Qwen2.5-Coder-1.5B-Instruct-lora"
        revision: "9cde18d8ed964b0519fb481cca6acd936b2ca811"
        source: "HF"
    lora_configuration:
      max_lora_rank: 16
  runtime:
    served_model_name: "Qwen2.5-Coder-base"
```
## Limitations
* **Same rank and same modules**: For optimal performance and stability, the LoRA adapters for one deployment should be uniform. All target modules must be the same.
* **Build time availability**: The engine relies on numpy-style weights. These need to be pre-converted during deployment and distributed to each replica. For Engine-Builder-LLM, these repos must be known ahead of time.
* **Inference performance**: If you're using only one LoRA adapter, merging the adapter into the base weights provides better performance. Additional LoRA adapters add complexity to kernel selection and fundamentally increase FLOPs.
## LoRA adapter configuration
### Adapter repository structure
LoRA adapters must follow the standard HuggingFace repository structure:
```
adapter-repo/
├── adapter_config.json
├── adapter_model.safetensors
└── README.md
```
### Required files
**adapter\_config.json**
```yaml theme={"system"}
# same base model for all configs
"base_model_name_or_path": "Qwen/Qwen2.5-Coder-1.5B-Instruct",
# same target modules among all lora adapters
"target_modules": [
  "attn_q",
  "attn_k",
  "attn_v",
  "attn_dense",
  "mlp_h_to_4h",
  "mlp_4h_to_h",
  "mlp_gate"
],
# same rank among all lora adapters
"r": 16
```
**model.lora\_weights.npy**
* NumPy array containing LoRA weight matrices
* Shape: `(num_layers, rank, hidden_size, hidden_size)`
* Must match the target modules specified in config
**model.lora\_config.npy**
* NumPy array containing LoRA configuration
* Includes scaling factors and other parameters
* Must match the adapter\_config.json specifications
## Build configuration options
### `lora_adapters`
Dictionary of LoRA adapters to load during build:
```yaml theme={"system"}
lora_adapters:
  adapter_name:
    repo: "username/model-name"
    revision: "main"
    source: "HF"  # or "GCS", "S3", "AZURE"
```
### `max_lora_rank`
Maximum LoRA rank for all adapters.
```yaml theme={"system"}
max_lora_rank: 16 # Default: 64
```
**Range**: 1 to 64, must be power of 2
**Recommended**: Set to exactly the rank `r` that you use for all adapters.
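The constraint can be checked with one line. An illustrative helper, not part of the Baseten SDK:

```python theme={"system"}
def check_max_lora_rank(rank: int) -> bool:
    """True if `rank` is a valid max_lora_rank: in [1, 64] and a power of 2."""
    # rank & (rank - 1) == 0 holds exactly for powers of two.
    return 1 <= rank <= 64 and (rank & (rank - 1)) == 0
```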
### `lora_configuration`
LoRA-specific configuration nested under `build`:
```yaml theme={"system"}
lora_configuration:
  max_lora_rank: 16
  lora_target_modules: []  # Auto-detected from adapter_config.json
```
**Fields:**
* `max_lora_rank`: Maximum LoRA rank across all adapters. Default: 64.
* `lora_target_modules`: Target modules for LoRA. Usually auto-detected from adapter config.
## Engine inference configuration
The model parameter in OpenAI-format requests selects which adapter to use. For the above example, valid model names are `Qwen2.5-Coder-base` or `lora1`.
This lets you select different adapters at runtime through the OpenAI client.
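For the configuration above, the set of valid `model` values can be derived directly from the config. A minimal sketch (the helper name is hypothetical):

```python theme={"system"}
def valid_model_names(served_model_name: str, lora_adapters: dict) -> list[str]:
    """Values accepted in the OpenAI `model` field: base name plus adapter keys."""
    return [served_model_name, *lora_adapters]

# For the deployment above, this yields ["Qwen2.5-Coder-base", "lora1"].
names = valid_model_names("Qwen2.5-Coder-base", {"lora1": {"repo": "..."}})
```

Passing `model="lora1"` routes the request through that adapter; passing the served model name uses the base weights.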
## Related
* [Engine-Builder-LLM overview](/engines/engine-builder-llm/overview): Main engine documentation.
* [Engine-Builder-LLM configuration](/engines/engine-builder-llm/engine-builder-config): Complete reference config.
* [Custom engine builder](/engines/engine-builder-llm/custom-engine-builder): Custom model.py implementation.
* [Quantization guide](/engines/performance-concepts/quantization-guide): Performance optimization.
# Overview
Source: https://docs.baseten.co/engines/engine-builder-llm/overview
Dense LLM text generation with lookahead decoding and structured outputs
Engine-Builder-LLM optimizes dense text generation models with [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), delivering up to 4000 tokens/second for code generation with [lookahead decoding](/engines/engine-builder-llm/lookahead-decoding). The engine supports [structured outputs](/engines/performance-concepts/structured-outputs) for JSON schema validation.
## Use cases
**Model families:**
* **Llama**: `meta-llama/Llama-3.3-70B-Instruct`, `meta-llama/Llama-3.2-3B-Instruct`.
* **Qwen**: `Qwen/Qwen2.5-72B-Instruct`, `Qwen/Qwen3-8B`, `Qwen/QwQ-32B-Preview`.
* **Mistral**: `mistralai/Mistral-7B-Instruct-v0.3`, `mistralai/Mistral-Small-24B-Instruct`.
* **DeepSeek**: `deepseek-ai/DeepSeek-R1-Distill-Llama-70B`.
* **Gemma 3**: `google/gemma-3-27b-it`, `google/gemma-3-12b-it`.
* **Microsoft**: `microsoft/Phi-4`.
Engine-Builder-LLM handles high-throughput dialogue systems, coding assistants with lookahead decoding, and content generation with structured outputs. The engine's speculative decoding accelerates code generation by 2-4x, making it ideal for coding agents and JSON-heavy workloads.
### LoRA support
Engine-Builder-LLM supports [multi-LoRA](/engines/engine-builder-llm/lora-support) deployments with runtime adapter switching: multiple adapters share one base model, and parameter-efficient fine-tuned variants can be deployed in minutes.
### Structured outputs
Engine-Builder-LLM supports OpenAI-compatible structured outputs with full JSON schema validation, including complex nested schemas.
### Key benefits
* TensorRT-LLM compilation optimizes time-to-first-token.
* Batching and kernel optimization maximize tokens per second.
* Speculative decoding accelerates coding agents and predictable content.
* JSON schema validation enables controlled text generation.
## Architecture support
### Supported model types
Engine-Builder-LLM supports all causal language model architectures that end with `ForCausalLM`:
**Primary architectures:**
* `LlamaForCausalLM`: Llama family models.
* `Qwen2ForCausalLM`: Qwen family models.
* `MistralForCausalLM`: Mistral family models.
* `Gemma2ForCausalLM`: Gemma family models.
* `Phi3ForCausalLM`: Phi family models.
**Automatic detection:**
The engine automatically detects the model architecture from the checkpoint repository and applies appropriate optimizations.
### Model size support
| **Model Size** | **Single GPU** | **Tensor Parallel** | **Recommended GPU** |
| -------------- | -------------- | ------------------- | ------------------- |
| `<8B` | L4, A10G, H100 | N/A | L4 (cost-effective) |
| 8B-70B | H100 | TP1-TP2 | H100 (2 GPUs) |
| 70B+ | H100 / B200 | TP4+ | H100 (4+ GPUs) |
## Advanced features
### Lookahead decoding
Lookahead decoding accelerates inference for code generation, JSON output, and templated content by speculating on future tokens using n-gram patterns.
**Best for:**
* **Code generation**: Highly predictable patterns in code.
* **Structured content**: Reliable JSON, YAML, XML generation.
* **Mathematical expressions**: Predictable mathematical notation.
* **Template completion**: Filling in predictable templates.
Enable lookahead decoding by adding a `speculator` section:
```yaml theme={"system"}
trt_llm:
  build:
    speculator:
      speculative_decoding_mode: LOOKAHEAD_DECODING
      lookahead_windows_size: 1
      lookahead_ngram_size: 8
      lookahead_verification_set_size: 1
      enable_b10_lookahead: true
```
**Performance impact:**
* **Speed improvement**: Up to 2x faster for code and structured content.
* **Prompt lookup**: Up to 10x faster for prompt-lookup workloads like code apply, reaching 4000 tokens/s per request on Qwen-3-8B with a single H100.
* **Optimal batch size**: Less than 32 requests for best performance.
### Structured outputs
Generate text that conforms to JSON schemas for reliable data extraction and controlled generation.
**Use cases:**
* **Data extraction**: Extract structured information from unstructured text.
* **API response generation**: Generate JSON responses for APIs.
* **Configuration generation**: Create structured configuration files.
* **Content validation**: Ensure generated content meets specific criteria.
Structured outputs work out of the box with no extra configuration. Define a Pydantic schema:
```python theme={"system"}
import os
from pydantic import BaseModel
from openai import OpenAI

class User(BaseModel):
    name: str
    age: int
    email: str

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

response = client.beta.chat.completions.parse(
    model="not-required",
    messages=[
        {"role": "user", "content": "Extract user info from: John is 25 years old and his email is john@example.com"}
    ],
    response_format=User
)

user = response.choices[0].message.parsed
print(f"Name: {user.name}, Age: {user.age}, Email: {user.email}")
```
### Quantization options
Engine-Builder-LLM supports multiple [quantization](/engines/performance-concepts/quantization-guide) formats for different performance and accuracy trade-offs.
**Quantization types:**
* `no_quant`: `FP16`/`BF16` precision (baseline).
* `fp8`: `FP8` weights + 16-bit KV cache (2x speedup).
* `fp8_kv`: `FP8` weights + `FP8` KV cache (2.5x speedup).
* `fp4`: `FP4` weights + 16-bit KV cache (4x speedup, B200 only).
* `fp4_kv`: `FP4` weights + `FP8` KV cache (4.5x speedup, B200 only).
* `fp4_mlp_only`: `FP4` MLP only + 16-bit KV (3x speedup, B200 only).
**Hardware requirements:**
Hardware requirements vary by quantization type.
| **Quantization** | **Minimum GPU** | **Memory reduction** | **Speed improvement** |
| ------------------------------- | -------------------- | -------------------- | --------------------- |
| `no_quant` | A100 | None | Baseline |
| `fp8` | L4, H100, H200, B200 | 50% | 2x |
| `fp8_kv` | L4, H100, H200, B200 | 60% | 2.5x |
| `fp4`, `fp4_kv`, `fp4_mlp_only` | B200 only | 75% | 3-4.5x |
## Configuration examples
### Basic Llama 3.3 70B deployment
Llama 3.3 70B on H100 GPUs with `FP8` quantization:
```yaml theme={"system"}
model_name: Llama-3.3-70B-Instruct
resources:
  accelerator: H100:4  # 4 GPUs for 70B model
  cpu: '4'
  memory: 40Gi
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "meta-llama/Llama-3.3-70B-Instruct"
      revision: main
      runtime_secret_name: hf_access_token
    max_seq_len: 131072
    max_batch_size: 256
    max_num_tokens: 8192
    quantization_type: fp8_kv
    tensor_parallel_count: 4
    plugin_configuration:
      paged_kv_cache: true
      use_paged_context_fmha: true
      use_fp8_context_fmha: true
    quantization_config:
      calib_size: 1024
      calib_dataset: "abisee/cnn_dailymail"
      calib_max_seq_length: 2048
  runtime:
    kv_cache_free_gpu_mem_fraction: 0.9
    enable_chunked_context: true
    batch_scheduler_policy: guaranteed_no_evict
    served_model_name: "Llama-3.3-70B-Instruct"
```
### Qwen 2.5 32B with lookahead decoding
Qwen 2.5 32B with *speculative decoding* for faster inference. Read more in the [lookahead decoding guide](/engines/engine-builder-llm/lookahead-decoding).
```yaml theme={"system"}
model_name: Qwen-2.5-32B-Lookahead
resources:
  accelerator: H100:1
  cpu: '2'
  memory: 20Gi
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen2.5-Coder-32B-Instruct"
      revision: main
    max_seq_len: 32768
    max_batch_size: 128
    max_num_tokens: 8192
    quantization_type: fp8  # no fp8_kv for Qwen 2.5 models
    tensor_parallel_count: 1
    num_builder_gpus: 2  # weights load in bf16 for quantization, requiring ~2x 32 GB of memory (2 H100s)
    speculator:
      speculative_decoding_mode: LOOKAHEAD_DECODING
      lookahead_windows_size: 3
      lookahead_ngram_size: 8
      lookahead_verification_set_size: 3
      enable_b10_lookahead: true
    plugin_configuration:
      paged_kv_cache: true
      use_paged_context_fmha: true
      use_fp8_context_fmha: true
  runtime:
    kv_cache_free_gpu_mem_fraction: 0.85
    enable_chunked_context: true
    batch_scheduler_policy: guaranteed_no_evict
    served_model_name: "Qwen-2.5-Coder-32B-Instruct"
```
### Small model for cost-effective deployment
Llama 3.2 3B on an L4 GPU for cost efficiency:
```yaml theme={"system"}
model_name: Llama-3.2-3B-Instruct
resources:
  accelerator: L4
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "meta-llama/Llama-3.2-3B-Instruct"
      revision: main
    max_seq_len: 8192
    max_batch_size: 256
    max_num_tokens: 4096
    quantization_type: fp8
    tensor_parallel_count: 1
    plugin_configuration:
      paged_kv_cache: true
      use_paged_context_fmha: true
      use_fp8_context_fmha: false
  runtime:
    kv_cache_free_gpu_mem_fraction: 0.9
    enable_chunked_context: true
    batch_scheduler_policy: guaranteed_no_evict
    served_model_name: "Llama-3.2-3B-Instruct"
```
## Performance characteristics
### Latency and throughput factors
Performance depends on model size (smaller models respond faster), quantization (`FP8`/`FP4` reduces memory and improves throughput), lookahead decoding (effective for code and structured content), batch size (larger batches improve throughput at the cost of latency), and hardware (H100 and B200 GPUs deliver the best results).
### Memory usage considerations
**Memory optimization factors:**
* **Quantization**: `FP8` reduces memory by \~50%, `FP4` by \~75%.
* **Lookahead decoding**: Minimal additional memory overhead.
* **Tensor parallelism**: Distributes memory across multiple GPUs.
* **KV cache management**: Configurable memory allocation for context handling.
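These reduction factors give a rough back-of-the-envelope estimate of per-GPU weight memory. The numbers below are approximations derived from the percentages above and ignore KV cache, activations, and runtime overhead:

```python theme={"system"}
# Approximate weight memory per billion parameters (GB), by quantization type.
GB_PER_BILLION_PARAMS = {
    "no_quant": 2.0,  # FP16/BF16 baseline: 2 bytes per parameter
    "fp8": 1.0,       # ~50% reduction
    "fp8_kv": 1.0,
    "fp4": 0.5,       # ~75% reduction
    "fp4_kv": 0.5,
}

def weight_memory_gb(params_billions: float, quantization_type: str,
                     tensor_parallel_count: int = 1) -> float:
    """Rough per-GPU weight memory after quantization and TP sharding."""
    total = params_billions * GB_PER_BILLION_PARAMS[quantization_type]
    return total / tensor_parallel_count
```

For example, a 70B model in `fp8_kv` across 4 GPUs needs roughly 17.5 GB of weights per GPU, leaving the rest for KV cache.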
## Integration examples
### OpenAI-compatible inference
Engine-Builder-LLM deployments are OpenAI compatible, enabling use of the standard OpenAI SDK.
```python theme={"system"}
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

# Standard chat completion
response = client.chat.completions.create(
    model="not-required",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)
print(response.choices[0].message.content)

# Streaming completion
for chunk in client.chat.completions.create(
    model="not-required",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="")
```
Point `base_url` to your model's production endpoint. Find this URL in your Baseten dashboard after deployment. The `model` parameter can be any string since Baseten routes based on the URL, not this field. Set `stream=True` to receive tokens as they're generated.
Running this returns a chat completion response with the model's answer in `response.choices[0].message.content`, or streams chunks with partial content in `delta.content`.
### Performance Client usage
For high-throughput batch processing, use the [Performance Client](/engines/performance-concepts/performance-client) which handles concurrent requests efficiently.
```python theme={"system"}
import os
from baseten_performance_client import PerformanceClient

client = PerformanceClient(
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync",
    api_key=os.environ['BASETEN_API_KEY']
)

# Batch chat completions with stream=False
payloads = [
    {
        "model": "model",
        "messages": [{"role": "user", "content": "Explain quantum computing"}],
        "stream": False,
        "max_tokens": 500
    },
    {
        "model": "model",
        "messages": [{"role": "user", "content": "Write a poem about AI"}],
        "stream": False,
        "max_tokens": 300
    }
] * 10  # 20 total requests

response = client.batch_post(
    url_path="/v1/chat/completions",
    payloads=payloads,
)

# Access 20 responses
for i, resp in enumerate(response.data):
    print(f"Response {i+1}: {resp['choices'][0]['message']['content']}")
```
**Use cases:** bulk content generation, batch data processing, performance benchmarking, and high-concurrency workloads (the client releases Python's GIL during requests).
### Structured outputs
*Structured outputs* guarantee the response matches your Pydantic schema.
```python theme={"system"}
import os
from pydantic import BaseModel
from openai import OpenAI

class Task(BaseModel):
    title: str
    priority: str
    due_date: str
    description: str

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

response = client.beta.chat.completions.parse(
    model="not-required",
    messages=[
        {"role": "user", "content": "Create a task for: Review the quarterly report by next Friday"}
    ],
    response_format=Task
)

task = response.choices[0].message.parsed
print(f"Task: {task.title}")
print(f"Priority: {task.priority}")
```
Define your schema as a Pydantic model with typed fields. Pass it to `response_format` and use `beta.chat.completions.parse` instead of the regular `create` method.
The response includes a `parsed` attribute with your data already converted to a `Task` object, so no JSON parsing is needed.
### Function calling
*Function calling* lets the model invoke your functions with structured arguments. Define available tools, and the model returns function calls when appropriate.
```python theme={"system"}
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City name, e.g., San Francisco"
                }
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="not-required",
    messages=[{"role": "user", "content": "What's the weather like in Boston?"}],
    tools=tools
)

tool_call = response.choices[0].message.tool_calls[0]
print(f"Function: {tool_call.function.name}")
print(f"Arguments: {tool_call.function.arguments}")
```
Define each tool with a `name`, `description`, and JSON schema for `parameters`. The description helps the model decide when to use the tool.
When the model chooses to call a function, `tool_calls` contains the function name and JSON-encoded arguments. Your code executes the function and optionally sends the result back for a final response.
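The execute-and-respond step can be sketched as a small dispatch helper. The `get_weather` implementation and `TOOL_REGISTRY` here are hypothetical; only the message shape (`role: "tool"` with a `tool_call_id`) follows the OpenAI format:

```python theme={"system"}
import json

def get_weather(location: str) -> dict:
    """Hypothetical local implementation of the get_weather tool."""
    return {"location": location, "temperature_c": 18, "conditions": "cloudy"}

TOOL_REGISTRY = {"get_weather": get_weather}

def execute_tool_call(name: str, arguments_json: str) -> str:
    """Run the named tool with the model's JSON arguments; return a JSON result."""
    args = json.loads(arguments_json)
    result = TOOL_REGISTRY[name](**args)
    return json.dumps(result)

def tool_result_message(tool_call_id: str, content: str) -> dict:
    """OpenAI-format message that feeds the tool result back for a final answer."""
    return {"role": "tool", "tool_call_id": tool_call_id, "content": content}
```

Append the tool result message to the conversation and call `chat.completions.create` again to get the model's final, natural-language answer.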
## Best practices
### Model selection
**For cost-effective deployments:**
* Use models under 8B parameters on L4, H100, or H100\_40GB GPUs.
* Consider quantization for memory efficiency.
* Implement autoscaling for variable traffic.
**For high-performance deployments:**
* Use H100 GPUs with `FP8` quantization.
* Enable lookahead decoding for code generation.
* Use tensor parallelism for large models.
**For coding assistants:**
* Use models trained on code (Qwen-Coder, CodeLlama).
* Enable lookahead decoding with window size 1 for maximum throughput.
* Consider smaller models for faster response times.
### Hardware optimization
**GPU selection:**
* **L4 or H100\_40GB**: Best for models under 15B parameters, cost-effective.
* **H100 (80 GB)**: Best for models 15-70B parameters, high performance.
* **B200**: Required for `FP4` quantization.
**Memory optimization:**
* Use quantization to reduce memory usage.
* Lower max\_seq\_len or enable chunked prefill.
* Monitor memory usage during deployment.
### Performance tuning
**For lowest latency:**
* Use smaller models when possible.
* Enable lookahead decoding for code generation.
**For highest throughput:**
* Use larger batch sizes.
* Enable `FP8`/`FP4` quantization.
* Use tensor parallelism for large models.
**For cost efficiency:**
* Use L4 GPUs with quantization.
* Implement efficient autoscaling.
* Choose appropriately sized models.
## Migration guide
### From other deployment systems
Coming from vLLM? Here's how the configuration maps:
```yaml theme={"system"}
# vLLM configuration (old)
model: "meta-llama/Llama-3.3-70B-Instruct"
tensor_parallel_size: 4
quantization: "fp8"

# Engine-Builder-LLM configuration (new)
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "meta-llama/Llama-3.3-70B-Instruct"
    quantization_type: fp8_kv
    tensor_parallel_count: 4
```
## Related
* [Engine-Builder-LLM reference config](/engines/engine-builder-llm/engine-builder-config): Complete configuration options.
* [Structured outputs](/engines/performance-concepts/structured-outputs): JSON schema validation and controlled generation.
* [Lookahead decoding guide](/engines/engine-builder-llm/lookahead-decoding): Advanced speculative decoding.
* [Custom engine builder](/engines/engine-builder-llm/custom-engine-builder): Custom model.py implementation.
* [Quantization guide](/engines/performance-concepts/quantization-guide): `FP8`/`FP4` trade-offs and hardware requirements.
* [TensorRT-LLM examples](/examples/tensorrt-llm): Concrete deployment examples.
# Overview
Source: https://docs.baseten.co/engines/index
Engine selection guide for embeddings, dense LLMs, and MoE models
Baseten engines optimize model inference for specific architectures using [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). Select an engine based on your model type (embeddings, dense LLMs, or mixture-of-experts) to achieve the best latency and throughput.
## Engine ecosystem
* **BEI**: Embeddings, reranking, and classification models with up to 1400 embeddings/sec throughput.
* **Engine-Builder-LLM**: Dense text generation models with [lookahead decoding](/engines/engine-builder-llm/lookahead-decoding), [structured outputs](/engines/performance-concepts/structured-outputs), and single node inference.
* **BIS-LLM**: MoE models with [KV-aware routing](/engines/bis-llm/advanced-features#kv-aware-routing), [tool calling](/engines/performance-concepts/function-calling), and speculative decoding.
* **Custom engines**: Specialized engines for models like Whisper, Orpheus, or Flux, available as dedicated deployments rather than self-serviceable options.
## Engine selection
Select an engine based on your model's architecture and expected workload.
| Model type | Architecture | Recommended engine | Key features | **Hardware** |
| ------------------ | ----------------------------- | ------------------ | ----------------------------------------- | ------------------------ |
| **Dense LLM** | CausalLM (text generation) | Engine-Builder-LLM | Lookahead decoding, structured outputs | H100, B200 |
| **MoE Models** | Mixture of Experts | BIS-LLM | KV-aware routing, advanced quantization | H100, B200 |
| **Large Models** | 700B+ parameters | BIS-LLM | Distributed inference, `FP4` support | H100, B200 |
| **Embeddings** | BERT-based (bidirectional) | BEI-Bert | Cold-start optimization, 16-bit precision | T4, L4, A10G, H100, B200 |
| **Embeddings** | Causal (Llama, Mistral, Qwen) | BEI | `FP8` quantization, high throughput | L4, A10G, H100, B200 |
| **Reranking** | Cross-encoder architectures | BEI / BEI-Bert | Low latency, batch processing | L4, A10G, H100, B200 |
| **Classification** | Sequence classification | BEI / BEI-Bert | High throughput, cached weights | L4, A10G, H100, B200 |
### Feature availability
| Feature | BIS-LLM | Engine-Builder-LLM | BEI | BEI-Bert | Notes |
| ------------------------------------ | ------- | ------------------ | --- | -------- | ------------------------------------------------ |
| **Quantization** | ✅ | ✅ | ✅ | ❌ | BEI-Bert: `FP16`/`BF16` only |
| **KV quantization** | ✅ | ✅ | ⚠️ | ⚠️ | `FP8_KV`, `FP4_KV` supported |
| **Speculative lookahead decoding** | Gated | ✅ | ❌ | ❌ | n-gram based speculation |
| **Self-serviceable** | Gated/✅ | ✅ | ✅ | ✅ | BIS-LLM partially gated; other engines fully self-service |
| **KV-routing** | Gated | ❌ | ❌ | ❌ | BIS-LLM only |
| **Disaggregated serving** | Gated | ❌ | ❌ | ❌ | BIS-LLM enterprise |
| **Tool calling & structured output** | ✅ | ✅ | ❌ | ❌ | Function calling support |
| **Classification models** | ❌ | ❌ | ✅ | ✅ | Sequence classification |
| **Embedding models** | ❌ | ❌ | ✅ | ✅ | Embedding generation |
| **Mixture-of-experts** | ✅ | ⚠️ (Qwen3MoE only) | ❌ | ❌ | Mixture of Experts models like DeepSeek |
| **MTP and Eagle 3 speculation** | Gated | ❌ | ❌ | ❌ | Model-based speculation |
| **HTTP request cancellation** | ✅ | ❌ | ✅ | ✅ | Engine-Builder-LLM supports it only within the first 10ms |
| **MultiModal Inputs** | Gated | ❌ | ⚠️ | ❌ | Selected architectures only |
## Architecture recommendations
### BEI vs BEI-Bert (embeddings)
BEI-Bert optimizes BERT-based architectures (sentence-transformers, jinaai, nomic-ai) with fast cold-start performance and 16-bit precision. Choose BEI-Bert for bidirectional models under 4B parameters where cold-start latency matters. Jina-BERT, Nomic, and ModernBERT architectures all run well on this engine.
BEI handles causal embedding architectures (Llama, Mistral, Qwen) with `FP8`/`FP4` quantization support. Choose BEI when you need maximum throughput or want to run larger embedding models like BAAI/bge, Qwen3-Embedding, or Salesforce/SFR-Embedding with quantization.
### Engine-Builder-LLM vs BIS-LLM (text generation)
Engine-Builder-LLM serves dense models (non-MoE) with lookahead decoding and structured outputs. Choose it for Llama 3.3, Qwen-3, Qwen2.5, Mistral, or Gemma-3 when you need speculative decoding for coding agents or JSON schema validation.
BIS-LLM serves large MoE models with KV-aware routing and advanced tool calling. Choose it for DeepSeek-R1, Qwen3MoE, Kimi-K2, Llama-4, or GLM-4.7 when you need enterprise features like disaggregated serving or H100/B200 optimization.
## Performance benchmarks
Benchmark results depend on model size, GPU type, and quantization settings. The figures below represent typical performance on H100 GPUs.
### Embedding performance (BEI/BEI-Bert)
* **Throughput**: Up to 1400 embeddings per second, measured client-side.
* **Latency**: Sub-millisecond response times.
* **Quantization**: `FP8`/`FP4` provides 2x speedup with less than 1% accuracy loss.
### Text generation performance (Engine-Builder-LLM/BIS-LLM)
* **Speculative decoding**: Faster inference for code and structured content through lookahead decoding.
* **Quantization**: Memory reduction and speed improvements with `FP8`/`FP4`.
* **Distributed inference**: Scalable deployment with tensor parallelism.
## Hardware requirements and optimization
*[Quantization](/engines/performance-concepts/quantization-guide)* reduces memory usage and improves inference speed.
| Quantization | Minimum GPU | Recommended GPU | Memory reduction | Notes |
| ------------- | ----------- | --------------- | ---------------- | ------------------------------------------- |
| `FP16`/`BF16` | A100 | H100 | None | Baseline precision |
| `FP8` | L4 | H100 | \~50% | Good balance of performance and accuracy |
| `FP8_KV` | L4 | H100 | \~60% | KV cache quantization for memory efficiency |
| `FP4` | B200 | B200 | \~75% | B200-only quantization |
| `FP4_KV` | B200 | B200 | \~80% | Maximum memory reduction |
Some models require specialized engines that aren't self-serviceable:
* **Whisper**: Audio transcription and speech recognition.
* **Orpheus**: Audio generation.
## Next steps
* [BEI documentation](/engines/bei/overview): Embeddings and classification.
* [Engine-Builder-LLM documentation](/engines/engine-builder-llm/overview): Dense text generation.
* [BIS-LLM documentation](/engines/bis-llm/overview): MoE and advanced features.
**Examples:**
* [BEI deployment guide](/examples/bei): Complete embedding model setup.
* [TensorRT-LLM examples](/examples/tensorrt-llm): Dense LLM deployment.
* [DeepSeek R1](https://www.baseten.co/library/deepseek-r1/): Large MoE deployment.
# Autoscaling engines
Source: https://docs.baseten.co/engines/performance-concepts/autoscaling-engines
Engine-specific autoscaling settings for BEI and Engine-Builder-LLM
BEI and Engine-Builder-LLM use **dynamic batching** to process multiple requests in parallel. This increases throughput but requires different autoscaling settings than standard models.
## Quick reference
| Setting | BEI | Engine-Builder-LLM |
| -------------------------- | ----------------------------------------------- | ----------------------------- |
| **Target utilization** | 25% | 40–50% |
| **Concurrency target** | 96+ (min ≥ 8) | 32–256 |
| **Special considerations** | Use Performance client for multi-payload routes | Never exceed max\_batch\_size |
For general autoscaling concepts, see [Autoscaling](/deployment/autoscaling/overview).
***
## BEI
BEI provides millisecond-range inference times and scales differently than other models. With too few replicas, backpressure can build up quickly.
### Recommendations
| Setting | Value | Why |
| ------------------ | ----------------- | ----------------------------------------------- |
| Target utilization | **25%** | Low target provides headroom for traffic spikes |
| Concurrency target | **96+** (min ≥ 8) | High concurrency allows maximum throughput |
| Autoscaling | **Enabled** | Required for variable traffic |
### Multi-payload routes
The `/rerank` and `/v1/embeddings` routes can send multiple items per request, which challenges request-based autoscaling. Each API call counts as one request regardless of how many items it contains.
Use the [Performance client](/engines/performance-concepts/performance-client) for optimal scaling with multi-payload routes.
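For example, a single `/v1/embeddings` call can carry many items, so request-based autoscaling sees one request no matter how large the payload is. A minimal sketch (the model ID, API key, and model name below are placeholders, and the route follows the sync endpoint format used elsewhere in these docs):

```python
import requests

MODEL_ID = "YOUR_MODEL_ID"        # placeholder
BASETEN_API_KEY = "YOUR_API_KEY"  # placeholder

# Three items in one payload: autoscaling counts this as a single request.
payload = {
    "model": "my_model",
    "input": [
        "first document",
        "second document",
        "third document",
    ],
}

def embed(payload):
    # Sync endpoint for the OpenAI-compatible embeddings route.
    return requests.post(
        f"https://model-{MODEL_ID}.api.baseten.co/environments/production/sync/v1/embeddings",
        headers={"Authorization": f"Api-Key {BASETEN_API_KEY}"},
        json=payload,
    ).json()
```

Because the replica does three items of work for one counted request, concurrency-based autoscaling underestimates load on these routes.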
***
## Engine-Builder-LLM
Engine-Builder-LLM uses dynamic batching similar to BEI but doesn't face the multi-payload challenge.
### Recommendations
| Setting | Value | Why |
| ------------------ | ---------- | -------------------------------------- |
| Target utilization | **40–50%** | Accommodates dynamic batching behavior |
| Concurrency target | **32–256** | Match or stay below max\_batch\_size |
| Min concurrency | **≥ 8** | Optimal performance floor |
**Never set concurrency target above `max_batch_size`.** This causes on-replica queueing and negates the benefits of autoscaling. If your max\_batch\_size is 64, keep concurrency target at 64 or below.
### Lookahead decoding
Set the concurrency target at or slightly below `max_batch_size` so lookahead decoding has room to perform its optimizations. The same guidance applies to all Engine-Builder-LLM deployments, not just those using lookahead.
***
## Related
* [Autoscaling](/deployment/autoscaling/overview): Full parameter reference.
* [Traffic patterns](/deployment/autoscaling/traffic-patterns): Pattern-specific settings.
* [BEI overview](/engines/bei/overview): General BEI documentation.
* [Engine-Builder-LLM overview](/engines/engine-builder-llm/overview): Generation model details.
* [Performance client](/engines/performance-concepts/performance-client): Client usage for batch processing.
# Deploy training and S3 checkpoints
Source: https://docs.baseten.co/engines/performance-concepts/deployment-from-training-and-s3
Deploy training checkpoints and cloud storage models with TensorRT-LLM optimization.
Deploy training checkpoints and cloud storage models with Engine-Builder-LLM, BEI, or BIS-LLM.
## Training checkpoint deployment
Deploy fine-tuned models from Baseten Training with Engine-Builder-LLM. Specify `BASETEN_TRAINING` as the source:
```yaml config.yaml theme={"system"}
model_name: My Fine-Tuned LLM
resources:
  accelerator: H100:1
  use_gpu: true
secrets:
  hf_access_token: null # do not set value here
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: BASETEN_TRAINING
      repo: YOUR_TRAINING_PROJECT_NAME
      revision: YOUR_TRAINING_JOB_ID/checkpoint-100
```
**Key fields:**
* `base_model`: `decoder` for LLMs, `encoder`/`encoder_bert` for embeddings
* `source`: `BASETEN_TRAINING` for Baseten Training checkpoints
* `repo`: Your training project name
* `revision`: Controls which job and checkpoint to deploy. Supports several formats:
  * `JOB_ID/CHECKPOINT_NAME`: deploy a specific checkpoint from a specific job (e.g. `abc123/checkpoint-100`)
  * `JOB_ID`: deploy the latest checkpoint from a specific job
  * `latest` or omitted: deploy the latest checkpoint from the latest job
Find your checkpoint details with:
```sh theme={"system"}
truss train get_checkpoint_urls --job-id=YOUR_TRAINING_JOB_ID
```
### Encoder model requirements
To deploy a fine-tuned encoder model (embeddings, rerankers) from training checkpoints, use `encoder` or `encoder_bert` as the base model:
```yaml config.yaml theme={"system"}
model_name: My Fine-Tuned Embeddings
resources:
  accelerator: A10G:1
  use_gpu: true
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: BASETEN_TRAINING
      repo: YOUR_TRAINING_PROJECT_NAME
      revision: YOUR_TRAINING_JOB_ID/checkpoint-100
  runtime:
    webserver_default_route: /v1/embeddings
```
Use `encoder_bert` for BERT-based models (sentence-transformers, classification, reranking). Use `encoder` for causal embedding models.
Encoder models have specific requirements:
* **No tensor parallelism**: Omit `tensor_parallel_count` or set it to `1`.
* **Fast tokenizer required**: Your checkpoint must include a `tokenizer.json` file. Models using only the legacy `vocab.txt` format aren't supported.
* **Embedding model files**: For sentence-transformer models, include `modules.json` and `1_Pooling/config.json` in your checkpoint.
The `webserver_default_route` configures the inference endpoint. Options include `/v1/embeddings` for embeddings, `/rerank` for rerankers, and `/predict` for classification.
## Cloud storage deployment
Deploy models directly from S3, GCS, or Azure. Specify the storage source and bucket path:
```yaml config.yaml theme={"system"}
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: S3 # or GCS, AZURE, HF
      repo: s3://your-bucket/path/to/model/
```
**Storage sources:**
* `S3`: Amazon S3 buckets
* `GCS`: Google Cloud Storage
* `AZURE`: Azure Blob Storage
* `HF`: Hugging Face repositories
### Private storage setup
All runtimes use the same downloader system as [model\_cache](/development/model/model-cache). As a result, you configure the `runtime_secret_name` and `repo` identically across model\_cache and runtimes like Engine-Builder-LLM or BEI.
**Secret Setup:**
Add these JSON secrets to your [Baseten secrets manager](https://app.baseten.co/settings/secrets).
For more details, refer to the documentation in [model\_cache](/development/model/model-cache).
**S3:**
```json theme={"system"}
{
  "access_key_id": "XXXXX",
  "secret_access_key": "xxxxx/xxxxxx",
  "region": "us-west-2"
}
```
**GCS:**
```json theme={"system"}
{
  "private_key_id": "xxxxxxx",
  "private_key": "-----BEGIN PRIVATE KEY-----\nMI",
  "client_email": "b10-some@xxx-example.iam.gserviceaccount.com"
}
```
**Azure:**
```json theme={"system"}
{
  "account_key": "xxxxx"
}
```
Reference the secret in your config:
```yaml theme={"system"}
secrets:
  aws_secret_json: "set token in baseten workspace"
trt_llm:
  build:
    checkpoint_repository:
      source: S3
      repo: s3://your-private-bucket/model
      runtime_secret_name: aws_secret_json
```
**For Baseten Training deployments:** These secrets are automatically mounted and available to your deployment.
## Related
* [Engine-Builder-LLM configuration](/engines/engine-builder-llm/engine-builder-config): Complete build and runtime options for LLMs.
* [BEI reference configuration](/engines/bei/bei-reference): Complete configuration for encoder models.
* [Model cache documentation](/development/model/model-cache): Caching strategies used by the engines.
* [Secrets management](/development/model/secrets): Configure credentials for private storage.
# Function calling
Source: https://docs.baseten.co/engines/performance-concepts/function-calling
Tool selection and structured function calls with LLMs
Function calling is supported by Baseten engines including [BIS-LLM](/engines/bis-llm/overview) and [Engine-Builder-LLM](/engines/engine-builder-llm/overview), as well as [Model APIs](/development/model-apis/overview) for instant access. It's also compatible with other inference frameworks like [vLLM](/examples/vllm) and [SGLang](/examples/sglang).
## Overview
*Function calling* (also known as *tool calling*) lets a model **choose a tool and produce arguments** based on a user request.
**Important:** the model **doesn't execute** your Python function. Your application must:
1. run the tool, and
2. optionally send the tool’s output back to the model to produce a final, user-facing response.
This is a great fit for [chains](/development/chain/overview) and other orchestrators.
***
## How tool calling works
A typical tool-calling loop looks like:
1. **Send** the user message and a list of tools.
2. The model returns either normal text or one or more **tool calls** (name and JSON arguments).
3. **Execute** the tool calls in your application.
4. **Send tool output** back to the model.
5. Receive a **final response** or additional tool calls.
***
## 1. Define tools
Tools can be anything: API calls, database queries, internal scripts, etc.
Docstrings matter. Models use them to decide which tool to call and how to fill parameters:
```python theme={"system"}
def multiply(a: float, b: float):
    """Multiply two numbers.

    Args:
        a: The first number.
        b: The second number.
    """
    return a * b


def divide(a: float, b: float):
    """Divide two numbers.

    Args:
        a: The dividend.
        b: The divisor (must be non-zero).
    """
    return a / b


def add(a: float, b: float):
    """Add two numbers.

    Args:
        a: The first number.
        b: The second number.
    """
    return a + b


def subtract(a: float, b: float):
    """Subtract two numbers.

    Args:
        a: The number to subtract from.
        b: The number to subtract.
    """
    return a - b
```
### Tool-writing tips
Design small, single-purpose tools and document constraints in docstrings (units, allowed values, required fields). Treat model-provided arguments as untrusted input and validate before execution.
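A minimal sketch of that validation step (the `validate_args` helper is illustrative, not part of any Baseten API):

```python
def validate_args(args: dict, required: dict) -> dict:
    """Check required keys and coerce types before executing a tool
    with model-provided (untrusted) arguments."""
    validated = {}
    for key, expected_type in required.items():
        if key not in args:
            raise ValueError(f"missing required argument: {key}")
        validated[key] = expected_type(args[key])  # raises on bad values
    return validated

# Model-provided arguments arrive as untrusted JSON-derived values.
safe = validate_args({"a": "3.5", "b": 2}, {"a": float, "b": float})
# safe == {"a": 3.5, "b": 2.0}
```

Reject or log anything that fails validation rather than passing it through to real systems.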
***
## 2. Serialize functions
Convert functions into JSON-schema tool definitions (OpenAI-compatible format):
```python theme={"system"}
from transformers.utils import get_json_schema

calculator_functions = {
    "multiply": multiply,
    "divide": divide,
    "add": add,
    "subtract": subtract,
}

tools = [get_json_schema(f) for f in calculator_functions.values()]
```
***
## 3. Call the model
Include the `tools` array in your request:
```python theme={"system"}
import requests

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 3.14 + 3.14?"},
    ],
    "tools": tools,
    "tool_choice": "auto",  # default
}

MODEL_ID = ""
BASETEN_API_KEY = ""

resp = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {BASETEN_API_KEY}"},
    json=payload,
)
```
***
## 4. Control tool selection
Set `tool_choice` to control how the model uses tools. With `auto` (default), the model may respond with text or tool calls. With `required`, the model must return at least one tool call. With `none`, the model returns plain text only. To force a specific tool:
```python theme={"system"}
"tool_choice": {"type": "function", "function": {"name": "subtract"}}
```
***
## 5. Parse and execute tool calls
Depending on the engine and model, tool calls are typically returned in an assistant message under `tool_calls`:
```python theme={"system"}
import json

data = resp.json()
message = data["choices"][0]["message"]
tool_calls = message.get("tool_calls") or []

for tool_call in tool_calls:
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    # Validate args in production.
    result = calculator_functions[name](**args)
    print(result)
```
### Full loop: send tool output back for a final answer
If you want the model to turn raw tool output into a user-facing response, append the assistant message and a tool response with the matching `tool_call_id`:
```python theme={"system"}
# Continue the conversation
messages = payload["messages"]
messages.append(message)  # assistant tool call message

# Example: respond to the first tool call
tool_call = tool_calls[0]
name = tool_call["function"]["name"]
args = json.loads(tool_call["function"]["arguments"])
result = calculator_functions[name](**args)

messages.append({
    "role": "tool",
    "tool_call_id": tool_call["id"],
    "content": json.dumps({"result": result}),
})

final_payload = {
    **payload,
    "messages": messages,
}

final_resp = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {BASETEN_API_KEY}"},
    json=final_payload,
)
print(final_resp.json()["choices"][0]["message"].get("content"))
```
***
## Practical tips
Use low temperature (0.0–0.3) for reliable tool selection and argument values. Add `enum` and `required` constraints in your JSON schema to guide model outputs. Consider parallel tool calls only if your model supports them. Always validate and sanitize inputs before calling real systems.
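For instance, a hand-written tool definition for a hypothetical `get_weather` tool, using `enum` and `required` to constrain arguments in the OpenAI-compatible schema format:

```python
# Hypothetical tool definition showing `enum` and `required` constraints.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name."},
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],  # constrain allowed values
                },
            },
            "required": ["city"],  # the model must always supply a city
        },
    },
}
```

Constraints like these reduce malformed arguments far more reliably than prompt instructions alone.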
***
## Related
* [Chains](/development/chain/overview): Orchestrate multi-step workflows.
* [Custom engine builder](/engines/engine-builder-llm/custom-engine-builder): Advanced configuration options.
# Configure HTTP clients
Source: https://docs.baseten.co/engines/performance-concepts/http-client-configuration
Connection pooling, retries, and timeouts for reliable inference requests
When calling Baseten at scale, HTTP client configuration directly affects reliability and throughput.
Misconfigured clients cause `Connection refused` and `Client closed connection` errors that look like platform issues but originate client-side.
This page covers the settings that matter.
For a drop-in solution, use the [Performance Client](/engines/performance-concepts/performance-client), which handles connection pooling, retries, and concurrency automatically.
***
## Reuse client sessions
Creating a new HTTP client per request is the most common misconfiguration. Each
new client opens a fresh TCP connection, performs a full TLS handshake, and then
discards the connection after a single use. Under load, this pattern quickly
exhausts available ports and produces `Connection refused` errors that appear
intermittent and difficult to diagnose.
A reused client maintains a pool of open connections that are ready for
subsequent requests. This eliminates per-request connection overhead and keeps
your throughput stable as concurrency increases.
Create a single client session and reuse it for all requests.
For example, in Python you would set up the client at the start of the script and reuse it for all requests.
```python theme={"system"}
# Correct: reuse a client session
import httpx

client = httpx.Client(
    base_url=f"https://model-{model_id}.api.baseten.co",
    headers={"Authorization": f"Api-Key {api_key}"},
)

def predict(payload):
    response = client.post("/environments/production/predict", json=payload)
    return response.json()
```
Creating a new client session for each request opens a fresh TCP connection every time.
```python theme={"system"}
# Anti-pattern: new client per request
def predict(payload):
    response = httpx.post(
        url, json=payload, headers=headers
    )  # New connection every time
    return response.json()
```
***
## Choose an HTTP client
Your choice of HTTP client library determines which connection management
features are available to you. The [httpx](https://www.python-httpx.org/)
library is recommended over
[requests](https://requests.readthedocs.io/en/latest/) for Baseten workloads
because it provides built-in connection pooling, native async support, and
optional HTTP/2. The `requests` library can achieve connection reuse through its
[`Session`](https://requests.readthedocs.io/en/latest/user/advanced/#session-objects) object, but lacks async support and requires more manual
configuration.
The OpenAI Python SDK uses httpx internally, so if you're already using it, you
benefit from httpx's connection handling by default.
For example, here is how to create a basic [`httpx.Client`](https://www.python-httpx.org/api/#client):
```python theme={"system"}
import httpx

client = httpx.Client(
    base_url=f"https://model-{model_id}.api.baseten.co",
    headers={"Authorization": f"Api-Key {api_key}"},
)
```
***
## Configure connection pooling
Connection pooling keeps a set of open TCP connections ready for reuse. When
your client sends a request, it draws from this pool instead of opening a new
connection. This avoids the cost of repeated TCP handshakes and TLS
negotiations, which can add 50-100ms of latency per request.
The default httpx pool limits (100 total connections, 20 per host) work for
moderate workloads, but high-throughput applications that send hundreds of
concurrent requests will exhaust these limits. When the pool is full, new
requests block until a connection becomes available, resulting in [`PoolTimeout`](https://www.python-httpx.org/exceptions/#pooltimeout)
errors or increased latency.
Increase the pool limits based on your peak concurrency using [`httpx.Limits`](https://www.python-httpx.org/advanced/resource-limits/). The
`max_keepalive_connections` setting controls how many idle connections stay
open, and `keepalive_expiry` controls how long idle connections persist before
closing. Baseten keeps connections alive for 60-120 seconds, so setting
your client's expiry below the server minimum avoids hitting dead connections.
```python theme={"system"}
import httpx

limits = httpx.Limits(
    max_connections=256,
    max_keepalive_connections=128,
    keepalive_expiry=30,
)

client = httpx.Client(
    base_url=f"https://model-{model_id}.api.baseten.co",
    headers={"Authorization": f"Api-Key {api_key}"},
    limits=limits,
)
```
### Recommended values
| Setting | Default (httpx) | Recommended |
| ------------------------- | --------------- | ----------- |
| Max connections | 100 | 256 |
| Max keepalive connections | 20 | 128 |
| Keep-alive idle timeout | 5s | 30s |
| Keep-alives | Enabled | Enabled |
These values apply when calling a single Baseten model endpoint.
If you call multiple models, increase max connections proportionally.
Keep-alives are always enabled on Baseten.
***
## Set timeouts
httpx applies a default 5-second timeout to all operations, which is too short
for most inference workloads. LLM generation, image processing, and other model
inference tasks routinely take tens of seconds to minutes. Without properly
configured timeouts, your client will close connections before the model
finishes processing.
Set client timeouts based on your model's expected response time. Baseten's
ingress proxy allows up to 10 minutes (600 seconds) for synchronous predict
requests, but your client-side timeouts should reflect your actual workload
rather than matching the server maximum.
httpx lets you configure four separate timeout values with [`httpx.Timeout`](https://www.python-httpx.org/advanced/timeouts/). Separating connect and
read timeouts prevents slow network conditions from being confused with slow
model responses.
```python theme={"system"}
import httpx

timeout = httpx.Timeout(
    connect=10.0,  # Time to establish connection
    read=600.0,    # Time to receive response
    write=30.0,    # Time to send request body
    pool=10.0,     # Time to acquire a connection from the pool
)

client = httpx.Client(
    base_url=f"https://model-{model_id}.api.baseten.co",
    headers={"Authorization": f"Api-Key {api_key}"},
    timeout=timeout,
)
```
### Timeout guidance by use case
| Use case | Connect | Read | Notes |
| ------------------------ | ------- | ---- | ------------------------- |
| LLM inference (sync) | 10s | 600s | Long generation times |
| Embedding/classification | 10s | 60s | Faster response |
| Async predict (submit) | 10s | 30s | Just submitting the job |
| Streaming | 10s | 600s | Keep open for full stream |
For long-running requests that exceed sync timeouts, use [async inference](/inference/async) with polling.
***
## Implement retries
Transient errors happen at scale and can negatively impact your application's reliability and throughput.
Retry with exponential backoff using libraries like [tenacity](https://tenacity.readthedocs.io/en/stable/).
Only retry on transient errors. Retrying client errors like 400 or 401 wastes
time and can mask bugs in your request payload.
Retry on these status codes and connection errors:
* **429** (rate limited)
* **500** (internal server error)
* **502** (bad gateway)
* **503** (service unavailable)
* **504** (gateway timeout)
* Connection errors ([`ConnectError`](https://www.python-httpx.org/exceptions/), `ReadTimeout`)
Don't retry on these status codes:
* **400** (bad request)
* **401** (unauthorized)
* **403** (forbidden)
* **404** (not found)
* **422** (validation error)
The following example uses httpx with tenacity to retry failed requests with exponential backoff.
```python theme={"system"}
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception

def is_retryable(exception):
    if isinstance(exception, httpx.HTTPStatusError):
        return exception.response.status_code in (429, 500, 502, 503, 504)
    return isinstance(exception, (httpx.ConnectError, httpx.ReadTimeout))

@retry(
    retry=retry_if_exception(is_retryable),
    wait=wait_exponential(multiplier=1, min=1, max=30),
    stop=stop_after_attempt(5),
)
def predict(client, payload):
    response = client.post("/environments/production/predict", json=payload)
    response.raise_for_status()
    return response.json()
```
***
## Handle errors
Many errors that look like platform outages actually originate from client-side
misconfiguration. Before opening a support ticket, check whether your error
matches one of these common patterns. If you see `PoolTimeout` or
`Connection refused` under high concurrency, the issue is almost always your
client's pool configuration, not Baseten's servers.
| Error | Likely cause | Resolution |
| -------------------- | ----------------------------------- | ---------------------------------------- |
| `PoolTimeout` | Connection pool exhausted | Increase pool size or reduce concurrency |
| `ConnectTimeout` | Network issue or server unavailable | Check network, then retry |
| `ReadTimeout` | Model taking longer than expected | Increase read timeout for your use case |
| `Connection refused` | Client-side port or pool exhaustion | Increase pool limits, check NAT config |
***
## Monitor connections
Connection problems tend to surface as intermittent failures rather than
complete outages, making them difficult to diagnose without proper monitoring. A
gradually exhausting connection pool won't cause errors until it's completely
full, at which point requests start failing unpredictably.
Watch for these signals:
* **Rising p99 latency** without changes to model performance, which often indicates pool contention.
* **Sporadic `Connection refused` errors** under load, which point to port or pool exhaustion.
* **TCP retransmits** increasing over time, which suggest connections are being dropped and recreated.
If you route traffic through a NAT gateway, monitor port utilization.
Each outbound connection consumes a port, and high-concurrency workloads can exhaust the available port range, causing intermittent connection failures that are difficult to distinguish from server-side issues.
***
## Use with proxies
Enterprise deployments often route traffic through HTTP proxies for security, logging, or network policy enforcement. httpx supports proxy configuration at the client level, so connection pooling and keep-alives continue to work through the proxy.
You may need to increase your pool limits when using a proxy, since the additional network hop increases per-request latency, which means connections are held open longer and the pool drains faster under the same concurrency.
```python theme={"system"}
import httpx

client = httpx.Client(
    base_url=f"https://model-{model_id}.api.baseten.co",
    headers={"Authorization": f"Api-Key {api_key}"},
    proxy="http://corporate-proxy.example.com:8080",
    limits=httpx.Limits(max_connections=300),
)
```
***
## Further reading
* [Performance Client](/engines/performance-concepts/performance-client): Handles connection pooling, retries, and concurrency automatically.
* [Async inference](/inference/async): For long-running requests that exceed sync timeout limits.
* [Streaming](/inference/streaming): For streaming model responses.
# Performance client
Source: https://docs.baseten.co/engines/performance-concepts/performance-client
High-performance client library for embeddings, reranking, classification, and generic batch requests
Built in Rust and integrated with Python, Node.js, and native Rust, the *Performance Client* handles concurrent POST requests efficiently.
It releases the Python GIL while executing requests, enabling simultaneous sync and async usage.
[Benchmarks](https://www.baseten.co/blog/your-client-code-matters-10x-higher-embedding-throughput-with-python-and-rust/) show the Performance Client reaches 1200+ requests per second per client.
Use it with **Baseten deployments** or **third-party providers** like OpenAI.
***
## Install the client
Install the Performance Client:
```bash theme={"system"}
uv pip install "baseten_performance_client>=0.1.0"
```
To install the Performance Client for JavaScript, use npm:
```bash theme={"system"}
npm install baseten-performance-client
```
***
## Get started
To initialize the Performance Client in Python, import the class and provide your base URL and API key:
```python theme={"system"}
from baseten_performance_client import PerformanceClient
client = PerformanceClient(
base_url="https://model-YOUR_MODEL_ID.api.baseten.co/environments/production/sync",
api_key="YOUR_API_KEY"
)
```
To initialize the Performance Client with JavaScript, require the package and create a new instance:
```javascript theme={"system"}
const { PerformanceClient } = require("baseten-performance-client");
const client = new PerformanceClient(
"https://model-YOUR_MODEL_ID.api.baseten.co/environments/production/sync",
process.env.BASETEN_API_KEY
);
```
The client also works with third-party providers like OpenAI by replacing the `base_url`.
### Advanced setup
Configure HTTP version selection and *connection pooling* for optimal performance.
To configure HTTP version and connection pooling in Python, use the `http_version` parameter and `HttpClientWrapper`:
```python theme={"system"}
from baseten_performance_client import PerformanceClient, HttpClientWrapper
# HTTP/1.1 (default, better for high concurrency)
client_http1 = PerformanceClient(BASE_URL, API_KEY, http_version=1)
# HTTP/2 (not recommended on Baseten)
client_http2 = PerformanceClient(BASE_URL, API_KEY, http_version=2)
# Connection pooling for multiple clients
wrapper = HttpClientWrapper(http_version=1)
client1 = PerformanceClient(base_url="https://api1.example.com", client_wrapper=wrapper)
client2 = PerformanceClient(base_url="https://api2.example.com", client_wrapper=wrapper)
```
To configure HTTP version and connection pooling with JavaScript, pass the version as the third argument and use `HttpClientWrapper`:
```javascript theme={"system"}
const { PerformanceClient, HttpClientWrapper } = require('baseten-performance-client');
// HTTP/1.1 (default, better for high concurrency)
const clientHttp1 = new PerformanceClient(BASE_URL, API_KEY, 1);
// HTTP/2
const clientHttp2 = new PerformanceClient(BASE_URL, API_KEY, 2);
// Connection pooling for multiple clients
const wrapper = new HttpClientWrapper(1);
const pooledClient1 = new PerformanceClient(BASE_URL_1, API_KEY, 1, wrapper);
const pooledClient2 = new PerformanceClient(BASE_URL_2, API_KEY, 1, wrapper);
```
***
## Core features
### Embeddings
The client provides efficient embedding requests with configurable *batching*, concurrency, and latency optimizations. Compatible with [BEI](/engines/bei/overview).
To generate embeddings with Python, configure a `RequestProcessingPreference` and call `client.embed()`:
```python theme={"system"}
from baseten_performance_client import PerformanceClient, RequestProcessingPreference
client = PerformanceClient(base_url=BASE_URL, api_key=API_KEY)
texts = ["Hello world", "Example text", "Another sample"] * 10
preference = RequestProcessingPreference(
batch_size=16,
max_concurrent_requests=256,
max_chars_per_request=10000,
hedge_delay=0.5,
timeout_s=360,
total_timeout_s=600,
extra_headers={"x-custom-header": "value"}
)
response = client.embed(
input=texts,
model="my_model",
preference=preference
)
print(f"Model used: {response.model}")
print(f"Total tokens used: {response.usage.total_tokens}")
print(f"Total time: {response.total_time:.4f}s")
# Convert to numpy array (requires numpy)
numpy_array = response.numpy()
print(f"Embeddings shape: {numpy_array.shape}")
```
For async usage, call `await client.async_embed(input=texts, model="my_model", preference=preference)`.
To generate embeddings with JavaScript, configure a `RequestProcessingPreference` and call `client.embed()`:
```javascript theme={"system"}
const { PerformanceClient, RequestProcessingPreference } = require('baseten-performance-client');
const client = new PerformanceClient(BASE_URL, API_KEY);
const texts = ["Hello world", "Example text", "Another sample"];
const preference = new RequestProcessingPreference(
32, // maxConcurrentRequests
4, // batchSize
10000, // maxCharsPerRequest
360.0, // timeoutS
0.5 // hedgeDelay
);
const response = await client.embed(
texts, // input
"my_model", // model
null, // encodingFormat
null, // dimensions
null, // user
preference // preference parameter
);
console.log(`Model used: ${response.model}`);
console.log(`Total tokens used: ${response.usage.total_tokens}`);
console.log(`Total time: ${response.total_time.toFixed(4)}s`);
```
### Generic batch POST
Send HTTP requests to any URL with any JSON payload. Compatible with [Engine-Builder-LLM](/engines/engine-builder-llm/overview) and other models. Because responses are collected whole rather than streamed, set `stream: false` in payloads sent to endpoints that default to SSE.
To send batch POST requests with Python, define your payloads and call `client.batch_post()`:
```python theme={"system"}
from baseten_performance_client import PerformanceClient, RequestProcessingPreference
client = PerformanceClient(base_url=BASE_URL, api_key=API_KEY)
payloads = [
{"model": "my_model", "prompt": "Batch request 1", "stream": False},
{"model": "my_model", "prompt": "Batch request 2", "stream": False}
] * 10
preference = RequestProcessingPreference(
max_concurrent_requests=96,
timeout_s=720,
hedge_delay=0.5,
extra_headers={"x-custom-header": "value"}
)
response = client.batch_post(
url_path="/v1/completions",
payloads=payloads,
preference=preference,
method="POST"
)
print(f"Total time: {response.total_time:.4f}s")
```
Supported methods: `GET`, `POST`, `PUT`, `PATCH`, `DELETE`, `HEAD`, `OPTIONS`.
For async usage, call `await client.async_batch_post(url_path, payloads, preference, method)`.
To send batch POST requests with JavaScript, define your payloads and call `client.batchPost()`:
```javascript theme={"system"}
const { PerformanceClient, RequestProcessingPreference } = require('baseten-performance-client');
const client = new PerformanceClient(BASE_URL, API_KEY);
const payloads = [
{ model: "my_model", prompt: "Batch request 1", stream: false },
{ model: "my_model", prompt: "Batch request 2", stream: false }
];
const preference = new RequestProcessingPreference(
32, // maxConcurrentRequests
undefined, // batchSize
undefined, // maxCharsPerRequest
360.0, // timeoutS
0.5 // hedgeDelay
);
const response = await client.batchPost(
"/v1/completions",
payloads,
preference,
"POST"
);
console.log(`Total time: ${response.total_time.toFixed(4)}s`);
```
### Reranking
Rerank documents by relevance to a query. Compatible with [BEI](/engines/bei/overview), [BEI-Bert](/engines/bei/overview), and text-embeddings-inference reranking endpoints.
To rerank documents with Python, provide a query and list of documents to `client.rerank()`:
```python theme={"system"}
from baseten_performance_client import PerformanceClient, RequestProcessingPreference
client = PerformanceClient(base_url=BASE_URL, api_key=API_KEY)
query = "What is the best framework?"
documents = ["Doc 1 text", "Doc 2 text", "Doc 3 text"]
preference = RequestProcessingPreference(
batch_size=16,
max_concurrent_requests=32,
timeout_s=360,
max_chars_per_request=256000,
hedge_delay=0.5,
extra_headers={"x-rerank-header": "value"}
)
response = client.rerank(
query=query,
texts=documents,
model="rerank-model",
return_text=True,
preference=preference
)
for res in response.data:
print(f"Index: {res.index} Score: {res.score}")
```
For async usage, call `await client.async_rerank(query, texts, model, return_text, preference)`.
To rerank documents with JavaScript, provide a query and list of documents to `client.rerank()`:
```javascript theme={"system"}
const { PerformanceClient, RequestProcessingPreference } = require('baseten-performance-client');
const client = new PerformanceClient(BASE_URL, API_KEY);
const query = "What is the best framework?";
const documents = ["Doc 1 text", "Doc 2 text", "Doc 3 text"];
const preference = new RequestProcessingPreference(
32, // maxConcurrentRequests
16, // batchSize
undefined, // maxCharsPerRequest
360.0, // timeoutS
0.5 // hedgeDelay
);
const response = await client.rerank(query, documents, "rerank-model", true, preference);
response.data.forEach(res => console.log(`Index: ${res.index} Score: ${res.score}`));
```
### Classification
Classify text inputs into categories. Compatible with [BEI](/engines/bei/overview) and text-embeddings-inference classification endpoints.
To classify text with Python, provide a list of inputs to `client.classify()`:
```python theme={"system"}
from baseten_performance_client import PerformanceClient, RequestProcessingPreference
client = PerformanceClient(base_url=BASE_URL, api_key=API_KEY)
texts_to_classify = [
"This is great!",
"I did not like it.",
"Neutral experience."
]
preference = RequestProcessingPreference(
batch_size=16,
max_concurrent_requests=32,
timeout_s=360.0,
max_chars_per_request=256000,
hedge_delay=0.5,
extra_headers={"x-classify-header": "value"}
)
response = client.classify(
inputs=texts_to_classify,
model="classification-model",
preference=preference
)
for group in response.data:
for result in group:
print(f"Label: {result.label}, Score: {result.score}")
```
For async usage, call `await client.async_classify(inputs, model, preference)`.
To classify text with JavaScript, provide a list of inputs to `client.classify()`:
```javascript theme={"system"}
const { PerformanceClient, RequestProcessingPreference } = require('baseten-performance-client');
const client = new PerformanceClient(BASE_URL, API_KEY);
const texts = ["This is great!", "I did not like it.", "Neutral experience."];
const preference = new RequestProcessingPreference(32, 16, undefined, 360.0, 0.5);
const response = await client.classify(texts, "classification-model", preference);
response.data.forEach(group => {
group.forEach(result => console.log(`Label: ${result.label}, Score: ${result.score}`));
});
```
***
## Advanced features
### Configure RequestProcessingPreference
The `RequestProcessingPreference` class provides unified configuration for all request processing parameters.
To configure request processing in Python, create a `RequestProcessingPreference` instance:
```python theme={"system"}
from baseten_performance_client import RequestProcessingPreference
preference = RequestProcessingPreference(
max_concurrent_requests=64,
batch_size=32,
timeout_s=30.0,
hedge_delay=0.5,
hedge_budget_pct=0.15,
retry_budget_pct=0.08,
total_timeout_s=300.0
)
```
To configure request processing with JavaScript, create a `RequestProcessingPreference` instance:
```javascript theme={"system"}
const { RequestProcessingPreference } = require('baseten-performance-client');
const preference = new RequestProcessingPreference(
64, // maxConcurrentRequests
32, // batchSize
undefined, // maxCharsPerRequest
30.0, // timeoutS
0.5, // hedgeDelay
undefined, // totalTimeoutS
0.15, // hedgeBudgetPct
0.08 // retryBudgetPct
);
```
#### Parameter reference
| Parameter | Type | Default | Range | Description |
| ------------------------- | ----- | ------- | ----------- | ------------------------------------------- |
| `max_concurrent_requests` | int | 128 | 1-1024 | Maximum parallel requests |
| `batch_size` | int | 128 | 1-1024 | Items per batch |
| `timeout_s` | float | 3600.0 | 1.0-7200.0 | Per-request timeout in seconds |
| `hedge_delay` | float | None | 0.2-30.0 | *Hedge delay* in seconds (see below) |
| `hedge_budget_pct` | float | 0.10 | 0.0-3.0 | Percentage of requests allowed for hedging |
| `retry_budget_pct` | float | 0.05 | 0.0-3.0 | Percentage of requests allowed for retries |
| `total_timeout_s` | float | None | ≥timeout\_s | Total operation timeout |
| `extra_headers` | dict | None | - | Custom headers to include with all requests |
*Hedge delay* sends duplicate requests after a specified delay to reduce p99 latency. After the delay, the request is cloned and raced against the original. Requests that fail with 429 or 5xx status codes are always retried automatically.
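The hedge-then-race behavior can be sketched conceptually in a few lines of asyncio. This models the idea only; the actual client implements hedging in Rust:

```python theme={"system"}
import asyncio

async def hedged(request, hedge_delay: float):
    """Race the original request against a duplicate started after hedge_delay."""
    original = asyncio.create_task(request())
    done, _ = await asyncio.wait({original}, timeout=hedge_delay)
    if done:                       # original finished before the hedge fired
        return original.result()
    hedge = asyncio.create_task(request())
    done, pending = await asyncio.wait(
        {original, hedge}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:           # cancel the loser to free its connection
        task.cancel()
    return done.pop().result()

async def demo():
    calls = 0
    async def request():
        nonlocal calls
        calls += 1
        # Simulate a slow first attempt and a fast hedged duplicate.
        await asyncio.sleep(1.0 if calls == 1 else 0.05)
        return f"answer from attempt {calls}"
    return await hedged(request, hedge_delay=0.2)

result = asyncio.run(demo())
print(result)
```

Here the original attempt stalls, the hedge fires after 200 ms, and the duplicate's response is returned, trading a bounded amount of extra load for a much tighter tail latency.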
### Select HTTP version
Choose between HTTP/1.1 and HTTP/2 for optimal performance. HTTP/1.1 is recommended for high concurrency workloads.
To select the HTTP version in Python, use the `http_version` parameter:
```python theme={"system"}
from baseten_performance_client import PerformanceClient
# HTTP/1.1 (default, better for high concurrency)
client_http1 = PerformanceClient(BASE_URL, API_KEY, http_version=1)
# HTTP/2 (better for single requests)
client_http2 = PerformanceClient(BASE_URL, API_KEY, http_version=2)
```
To select the HTTP version with JavaScript, pass the version as the third argument:
```javascript theme={"system"}
const { PerformanceClient } = require('baseten-performance-client');
// HTTP/1.1 (default, better for high concurrency)
const clientHttp1 = new PerformanceClient(BASE_URL, API_KEY, 1);
// HTTP/2 (better for single requests)
const clientHttp2 = new PerformanceClient(BASE_URL, API_KEY, 2);
```
### Share connection pools
Share connection pools across multiple client instances to reduce overhead when connecting to multiple endpoints.
To share a connection pool in Python, create an `HttpClientWrapper` and pass it to each client:
```python theme={"system"}
from baseten_performance_client import PerformanceClient, HttpClientWrapper
wrapper = HttpClientWrapper(http_version=1)
client1 = PerformanceClient(base_url="https://api1.example.com", client_wrapper=wrapper)
client2 = PerformanceClient(base_url="https://api2.example.com", client_wrapper=wrapper)
```
To share a connection pool with JavaScript, create an `HttpClientWrapper` and pass it to each client:
```javascript theme={"system"}
const { PerformanceClient, HttpClientWrapper } = require('baseten-performance-client');
const wrapper = new HttpClientWrapper(1);
const client1 = new PerformanceClient(BASE_URL_1, API_KEY, 1, wrapper);
const client2 = new PerformanceClient(BASE_URL_2, API_KEY, 1, wrapper);
```
### Cancel operations
Cancel long-running operations using `CancellationToken`. The token provides immediate cancellation, resource cleanup, Ctrl+C support, token sharing across operations, and status checking with `is_cancelled()`.
To cancel operations in Python, create a `CancellationToken` and pass it to your preference:
```python theme={"system"}
from baseten_performance_client import (
PerformanceClient,
CancellationToken,
RequestProcessingPreference
)
import threading
import time
client = PerformanceClient(base_url=BASE_URL, api_key=API_KEY)
cancel_token = CancellationToken()
preference = RequestProcessingPreference(
max_concurrent_requests=32,
batch_size=16,
timeout_s=360.0,
cancel_token=cancel_token
)
def long_operation():
try:
response = client.embed(
input=["large batch"] * 1000,
model="embedding-model",
preference=preference
)
print("Operation completed")
except ValueError as e:
if "cancelled" in str(e):
print("Operation was cancelled")
threading.Thread(target=long_operation).start()
time.sleep(2)
cancel_token.cancel()
```
To cancel operations with JavaScript, create a `CancellationToken` and pass it to your preference:
```javascript theme={"system"}
const { PerformanceClient, CancellationToken, RequestProcessingPreference } = require('baseten-performance-client');
const client = new PerformanceClient(BASE_URL, API_KEY);
const cancelToken = new CancellationToken();
const preference = new RequestProcessingPreference(
32, 16, undefined, 360.0, undefined, undefined,
undefined, undefined, undefined, undefined, cancelToken
);
const operation = client.embed(
["large batch"].concat(Array(1000).fill("sample")),
"model",
undefined,
undefined,
undefined,
preference
);
setTimeout(() => cancelToken.cancel(), 2000);
try {
const response = await operation;
console.log("Operation completed");
} catch (error) {
if (error.message.includes("cancelled")) {
console.log("Operation was cancelled");
}
}
```
***
## Handle errors
The client raises standard exceptions for error conditions:
* **`HTTPError`**: Authentication failures (403), server errors (5xx), endpoint not found (404).
* **`Timeout`**: Request or total operation timeout based on `timeout_s` or `total_timeout_s`.
* **`ValueError`**: Invalid input parameters (empty input list, invalid batch size, inconsistent embedding dimensions).
To handle errors in Python, catch the appropriate exception types:
```python theme={"system"}
import requests
from baseten_performance_client import PerformanceClient, RequestProcessingPreference
client = PerformanceClient(base_url=BASE_URL, api_key=API_KEY)
preference = RequestProcessingPreference(timeout_s=30.0)
try:
response = client.embed(input=["text"], model="model", preference=preference)
print(f"Model used: {response.model}")
except requests.exceptions.HTTPError as e:
print(f"HTTP error: {e}, status code: {e.response.status_code}")
except requests.exceptions.Timeout as e:
print(f"Timeout error: {e}")
except ValueError as e:
print(f"Input error: {e}")
```
To handle errors with JavaScript, use a try-catch block and inspect the error object:
```javascript theme={"system"}
const { PerformanceClient, RequestProcessingPreference } = require('baseten-performance-client');
const client = new PerformanceClient(BASE_URL, API_KEY);
const preference = new RequestProcessingPreference(32, 16, undefined, 30.0);
try {
const response = await client.embed(texts, "model", undefined, undefined, undefined, preference);
console.log("Success:", response.model);
} catch (error) {
if (error.response) {
console.log(`HTTP error: ${error.response.status}`);
} else if (error.code === 'TIMEOUT') {
console.log("Timeout error");
} else {
console.log(`Error: ${error.message}`);
}
}
```
***
## Configure the client
### Environment variables
* **`BASETEN_API_KEY`**: Your Baseten API key. Also checks `OPENAI_API_KEY` as fallback.
* **`PERFORMANCE_CLIENT_LOG_LEVEL`**: Logging level. Overrides `RUST_LOG`. Valid values: `trace`, `debug`, `info`, `warn`, `error`. Default: `warn`.
* **`PERFORMANCE_CLIENT_REQUEST_ID_PREFIX`**: Custom prefix for request IDs. Default: `perfclient`.
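The documented key lookup can be mimicked in plain Python; the helper name below is ours, not part of the client:

```python theme={"system"}
import os

def resolve_api_key() -> str:
    """Documented lookup order: BASETEN_API_KEY first, then OPENAI_API_KEY."""
    key = os.environ.get("BASETEN_API_KEY") or os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Set BASETEN_API_KEY (or OPENAI_API_KEY) first")
    return key

# Demo only; normally the key is set in your shell, not in code.
os.environ.setdefault("BASETEN_API_KEY", "demo-key")
print(resolve_api_key())
```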
### Configure logging
To set the logging level, use the `PERFORMANCE_CLIENT_LOG_LEVEL` environment variable:
```bash theme={"system"}
PERFORMANCE_CLIENT_LOG_LEVEL=info python script.py
PERFORMANCE_CLIENT_LOG_LEVEL=debug cargo run
```
The `PERFORMANCE_CLIENT_LOG_LEVEL` variable takes precedence over `RUST_LOG`.
***
## Use with Rust
The Performance Client is also available as a native Rust library.
To use the Performance Client in Rust, add the dependencies and create a `PerformanceClientCore` instance:
```rust theme={"system"}
use baseten_performance_client_core::{PerformanceClientCore, ClientError};
use tokio;
#[tokio::main]
async fn main() -> Result<(), Box> {
let api_key = std::env::var("BASETEN_API_KEY").expect("BASETEN_API_KEY not set");
let base_url = "https://model-YOUR_MODEL_ID.api.baseten.co/environments/production/sync";
let client = PerformanceClientCore::new(base_url, Some(api_key), None, None);
// Generate embeddings
let texts = vec!["Hello world".to_string(), "Example text".to_string()];
let embedding_response = client.embed(
texts,
"my_model".to_string(),
Some(16),
Some(32),
Some(360.0),
Some(256000),
Some(0.5),
Some(360.0),
).await?;
println!("Model: {}", embedding_response.model);
println!("Total tokens: {}", embedding_response.usage.total_tokens);
// Send batch POST requests
let payloads = vec![
serde_json::json!({"model": "my_model", "input": ["Rust sample 1"]}),
serde_json::json!({"model": "my_model", "input": ["Rust sample 2"]}),
];
let batch_response = client.batch_post(
"/v1/embeddings".to_string(),
payloads,
Some(32),
Some(360.0),
Some(0.5),
Some(360.0),
None,
).await?;
println!("Batch POST total time: {:.4}s", batch_response.total_time);
Ok(())
}
```
Add these dependencies to your `Cargo.toml`:
```toml theme={"system"}
[dependencies]
baseten_performance_client_core = "0.1.4"
tokio = { version = "1.0", features = ["full"] }
serde_json = "1.0"
```
***
## Related
* [GitHub: baseten-performance-client](https://github.com/basetenlabs/truss/tree/main/baseten-performance-client): Complete source code and additional examples.
* [Performance benchmarks blog](https://www.baseten.co/blog/your-client-code-matters-10x-higher-embedding-throughput-with-python-and-rust/): Detailed performance analysis and comparisons.
# Quantization guide
Source: https://docs.baseten.co/engines/performance-concepts/quantization-guide
FP8 and FP4 trade-offs and hardware requirements for all engines
*Quantization* trades precision for speed and memory efficiency. This guide covers Baseten's supported formats, hardware requirements, and model-specific recommendations.
## Quantization options
Quantization type availability depends on the engine and GPU.
### Engine support
| **Quantization** | [**BIS-LLM**](/engines/bis-llm/overview) | [**Engine-Builder-LLM**](/engines/engine-builder-llm/overview) | [**BEI**](/engines/bei/overview) |
| ---------------- | ---------------------------------------- | -------------------------------------------------------------- | -------------------------------- |
| `FP8` | ✅ | ✅ | ✅ |
| `FP8_KV` | ✅ | ✅ | ⚠️ |
| `FP4` | ✅ | ✅ | ⚠️ |
| `FP4_KV` | ✅ | ✅ | ⚠️ |
| `FP4_MLP_ONLY` | ✅ | ✅ | ✅ |
### GPU support
| **GPU type** | `FP8` | `FP8_KV` | `FP4` | `FP4_KV` | `FP4_MLP_ONLY` |
| ------------ | ----- | -------- | ----- | -------- | -------------- |
| **L4** | ✅ | ✅ | ❌ | ❌ | ❌ |
| **H100** | ✅ | ✅ | ❌ | ❌ | ❌ |
| **H200** | ✅ | ✅ | ❌ | ❌ | ❌ |
| **B200** | ✅ | ✅ | ✅ | ✅ | ✅ |
## Model recommendations
Some model families have specific quantization requirements that affect accuracy.
### Qwen2 models
Qwen2 retains QKV projection bias (attention bias), while Qwen3, Llama3, Llama2, and most other models remove it. This makes Qwen2 sensitive to symmetric KV cache quantization, so `FP8_KV` causes quality degradation. Use regular `FP8` instead and increase calibration size to 1024 or greater for better accuracy.
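Putting that recommendation into a build config might look like this (an illustrative fragment; pair it with your own checkpoint and resource settings):

```yaml theme={"system"}
trt_llm:
  build:
    base_model: decoder
    quantization_type: fp8      # plain FP8; avoid fp8_kv for Qwen2
    quantization_config:
      calib_size: 1024          # larger calibration set for better accuracy
```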
### Llama models
Llama variants work well with `FP8_KV` and standard calibration sizes (1024-1536). For B200 deployments, use `FP4_MLP_ONLY` for the best balance of speed and quality.
### BEI models (embeddings)
Use `FP8` for embedding models built on causal LLM architectures (Llama, Mistral, Qwen). Skip quantization for smaller models, where the overhead isn't worth the minimal benefit, and for Bert-based models, which don't support it. BEI doesn't support `FP8_KV`.
## Calibration
Quantization requires calibration data to determine optimal scaling factors. Larger models generally need more calibration samples.
### Calibration datasets
The default dataset is `cnn_dailymail` (general news text). For specialized models, or fine-tunes that rely on a specific chat template, use a domain-specific dataset when available.
To use a custom dataset, set its Hugging Face name under `calib_dataset`, and make sure the dataset has a `train` split with a `text` or `messages` column.
When the `messages` column is used, your model's tokenizer must provide an `apply_chat_template()` function, which is applied as `apply_chat_template(row["messages"])` for each row.
If you want to use a dataset without preprocessing, provide a `text` column instead.
For chat-based calibration with thinking content, we open-sourced [`baseten/quant_calibration_dataset_v1`](https://huggingface.co/datasets/baseten/quant_calibration_dataset_v1) as an example.
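The column handling described above can be sketched as follows. This is a conceptual illustration, not Baseten's actual preprocessing code, and the stub tokenizer stands in for your model's real one:

```python theme={"system"}
def rows_to_calibration_texts(rows, tokenizer):
    """Prefer the `messages` column (chat template), fall back to `text`."""
    texts = []
    for row in rows:
        if "messages" in row:
            texts.append(tokenizer.apply_chat_template(row["messages"], tokenize=False))
        elif "text" in row:
            texts.append(row["text"])
        else:
            raise ValueError("calibration rows need a `text` or `messages` column")
    return texts

class StubTokenizer:
    # Stand-in for a Hugging Face tokenizer with apply_chat_template().
    def apply_chat_template(self, messages, tokenize=False):
        return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

rows = [
    {"messages": [{"role": "user", "content": "Hi"}]},
    {"text": "Plain calibration text."},
]
print(rows_to_calibration_texts(rows, StubTokenizer()))
```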
### Calibration configuration
```yaml theme={"system"}
quantization_config:
calib_size: 768 # Number of samples
calib_dataset: "abisee/cnn_dailymail" # Dataset name
calib_max_seq_length: 1024 # Max sequence length
```
Increase `calib_size` for larger models. Use domain-specific datasets when available for better accuracy on specialized tasks.
## Hardware requirements
`FP4` quantization requires B200 GPUs. `FP8` runs on L4 and above.
| **Quantization** | **Minimum GPU** | **Recommended GPU** | **Memory reduction** |
| ---------------- | --------------- | ------------------- | -------------------- |
| `FP16`/`BF16` | A100 | H100 | None |
| `FP8` | L4 | H100 | \~50% |
| `FP8_KV` | L4 | H100 | \~60% |
| `FP4` | B200 | B200 | \~75% |
| `FP4_KV` | B200 | B200 | \~80% |
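The memory-reduction column follows from bytes per parameter. A back-of-the-envelope check for weight memory on a hypothetical 70B-parameter model (weights only; KV cache and activations add more):

```python theme={"system"}
# Approximate weight memory at each precision for a 70B-parameter model.
params = 70e9
bytes_per_param = {"BF16": 2.0, "FP8": 1.0, "FP4": 0.5}
for fmt, b in bytes_per_param.items():
    print(f"{fmt}: {params * b / 1e9:.0f} GB")
```

At 140 GB in `BF16`, such a model doesn't fit on a single H100 (80 GB), while the `FP8` weights at roughly 70 GB do, which is where the cost savings come from.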
### Configuration examples
**Engine-Builder-LLM:**
```yaml theme={"system"}
trt_llm:
build:
base_model: decoder
quantization_type: fp8
quantization_config:
calib_size: 1024
```
**BIS-LLM:**
```yaml theme={"system"}
trt_llm:
inference_stack: v2
build:
quantization_type: fp8
quantization_config:
calib_size: 1024
runtime:
max_seq_len: 32768
```
**BEI:**
```yaml theme={"system"}
trt_llm:
build:
base_model: encoder
quantization_type: fp8
max_num_tokens: 16384
```
Set `quantization_type` in the build section and add `quantization_config` to customize calibration. BIS-LLM uses `inference_stack: v2` while Engine-Builder-LLM uses `base_model: decoder`.
## Best practices
### When to use quantization
Use `FP8` for production deployments to achieve cost-effective scaling. For memory-constrained environments, `FP8_KV` or `FP4` variants provide additional memory reduction. Quantization becomes essential for models over 15B parameters where memory and cost savings are significant.
### When to avoid quantization
Skip quantization when maximum accuracy is critical. Use `FP16`/`BF16` instead. Small models under 8B parameters see minimal benefit from quantization. BEI-Bert models don't support quantization at all. During research and development, `FP16` provides faster iteration without calibration overhead.
### Optimization tips
Use calibration datasets that match your domain for best accuracy. Test quantized models with your specific data before production deployment. Monitor the accuracy vs. performance trade-off and consider your hardware constraints when selecting quantization type.
## Related
* [Engine-Builder-LLM configuration](/engines/engine-builder-llm/engine-builder-config): Dense model configuration.
* [BIS-LLM configuration](/engines/bis-llm/bis-llm-config): MoE model configuration.
* [BEI configuration](/engines/bei/bei-reference): Embedding model configuration.
# Structured outputs
Source: https://docs.baseten.co/engines/performance-concepts/structured-outputs
JSON schema validation and controlled text generation across all engines
Structured outputs let you generate text that conforms to specific JSON schemas, providing reliable data extraction and controlled text generation. This feature is supported by Baseten engines like [BIS-LLM](/engines/bis-llm/overview) and [Engine-Builder-LLM](/engines/engine-builder-llm/overview), as well as other inference frameworks like [vLLM](/examples/vllm) and [SGLang](/examples/sglang).
## Quick start
Structured outputs require two components: a Pydantic schema defining your expected output format, and an API call that enforces that schema.
### Define a schema
```python theme={"system"}
from pydantic import BaseModel
class Task(BaseModel):
title: str
priority: str # "low", "medium", "high"
due_date: str
description: str
```
Each field requires a type annotation. The model's response will conform to these types exactly.
### Generate structured output
```python theme={"system"}
import os
from pydantic import BaseModel
from openai import OpenAI
class Task(BaseModel):
title: str
priority: str
due_date: str
description: str
client = OpenAI(
api_key=os.environ['BASETEN_API_KEY'],
base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)
response = client.beta.chat.completions.parse(
model="not-required",
messages=[
{"role": "user", "content": "Create a task for: Review the quarterly report by next Friday"}
],
response_format=Task
)
task = response.choices[0].message.parsed
print(f"Task: {task.title}")
print(f"Priority: {task.priority}")
```
Point `base_url` to your model's production endpoint. Pass your Pydantic class to `response_format` and use `beta.chat.completions.parse` instead of the regular `create` method.
The response includes a `parsed` attribute with your data already converted to a `Task` object, so no JSON parsing is needed.
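If your OpenAI SDK version predates the `parse` helper, the same validation can be done by hand with plain Pydantic. The JSON literal below is a stand-in for `response.choices[0].message.content` from a structured-output call:

```python theme={"system"}
import json
from pydantic import BaseModel

class Task(BaseModel):
    title: str
    priority: str
    due_date: str
    description: str

# Stand-in for the raw JSON string a structured-output call returns.
raw = '{"title": "Review quarterly report", "priority": "high", "due_date": "2025-06-13", "description": "Review by Friday"}'
task = Task(**json.loads(raw))   # raises if the payload doesn't match the schema
print(task.title, task.priority)
```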
## Engine support
Structured outputs are compatible with:
* **Engine-Builder-LLM**, except when Lookahead speculative decoding is configured.
* **BIS-LLM**, with a few exceptions, such as when the overlap scheduler is enabled.
### Model support
All Engine-Builder-LLM and BIS-LLM models support structured outputs out of the box with no additional configuration required.
## Best practices
### Schema design
* **Keep schemas simple**: 2-3 levels max nesting for best results.
* **Use basic types**: str, int, float, bool when possible.
* **Set defaults**: Provide reasonable default values for optional fields.
* **Descriptive names**: Use clear, descriptive field names.
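Putting those guidelines together, a schema might look like this (the field names and defaults are illustrative):

```python theme={"system"}
from pydantic import BaseModel

class Invoice(BaseModel):
    customer_name: str       # descriptive name, basic type
    total_cents: int         # integers avoid float rounding issues
    currency: str = "USD"    # sensible default for an optional field
    is_paid: bool = False

inv = Invoice(customer_name="Acme Co", total_cents=4999)
print(inv.currency, inv.is_paid)
```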
### Prompt engineering
* **Low temperature**: Use 0.1-0.3 for consistent outputs.
* **Provide schema**: Dump the model schema and few-shot examples into context.
* **Provide context**: Give background for complex schemas.
## Related
* [Engine-Builder-LLM overview](/engines/engine-builder-llm/overview): Dense model documentation.
* [BIS-LLM overview](/engines/bis-llm/overview): MoE model documentation.
* [Quantization guide](/engines/performance-concepts/quantization-guide): `FP8`/`FP4` trade-offs.
# Embeddings with BEI
Source: https://docs.baseten.co/examples/bei
Serve embedding, reranking, and classification models
Baseten Embeddings Inference (BEI) is Baseten's solution for production-grade inference on embedding, classification, and reranking models using TensorRT-LLM.
With Baseten Embeddings Inference you get the following benefits:
* Lowest-latency inference across embedding solutions (vLLM, SGLang, Infinity, TEI, Ollama)
* Highest-throughput inference across embedding solutions (vLLM, SGLang, Infinity, TEI, Ollama), thanks to XQA kernels, FP8, and dynamic batching
* High parallelism: up to 1400 client embeddings per second
* Cached model weights for fast vertical scaling and high availability - no Hugging Face hub dependency at runtime
* Ahead-of-time compilation, memory allocation and fp8 post-training quantization
### Getting started with embedding models:
Embedding models are LLMs without an `lm_head` for language generation.
Typical architectures that are supported for embeddings are `LlamaModel`, `BertModel`, `RobertaModel` or `Gemma2Model`, and contain the safetensors, config, tokenizer and sentence-transformer config files.
A good example is the repo [BAAI/bge-multilingual-gemma2](https://huggingface.co/BAAI/bge-multilingual-gemma2).
To deploy a model for embeddings, set the following config in your local directory.
```yaml config.yaml theme={"system"}
model_name: BEI-Linq-Embed-Mistral
resources:
accelerator: H100_40GB
use_gpu: true
trt_llm:
build:
base_model: encoder
checkpoint_repository:
# for a different model, change the repo to e.g. to "Salesforce/SFR-Embedding-Mistral"
# "BAAI/bge-en-icl" or "BAAI/bge-m3"
repo: "Linq-AI-Research/Linq-Embed-Mistral"
revision: main
source: HF
# only Llama, Mistral and Qwen Models support quantization.
# others, use: "quantization_type: no_quant"
quantization_type: fp8
```
With `config.yaml` in your local directory, you can deploy the model to Baseten.
```bash theme={"system"}
truss push --promote
```
Deployed embedding models are OpenAI compatible without any additional settings.
You may use the client code below to consume the model.
```python theme={"system"}
from openai import OpenAI
import os
client = OpenAI(
api_key=os.environ['BASETEN_API_KEY'],
# add the deployment URL
base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)
embedding = client.embeddings.create(
input=["Baseten Embeddings are fast.", "Embed this sentence!"],
model="not-required"
)
```
### Example deployment of reranking and classification models
Besides embedding models, BEI deploys high-throughput rerank and classification models.
You can identify suitable architectures by their `ForSequenceClassification` suffix in the Hugging Face repo.
Typical use cases for these models include reward modeling, reranking documents in RAG, and content moderation.
```yaml theme={"system"}
model_name: BEI-mixedbread-rerank-large-v2-fp8
resources:
  accelerator: H100_40GB
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      repo: michaelfeil/mxbai-rerank-large-v2-seq
      revision: main
      source: HF
    # only Llama, Mistral, and Qwen models support quantization
    quantization_type: fp8
```
As OpenAI does not offer reranking or classification endpoints, we send a plain HTTP request to the model's predict endpoint.
Depending on the model, you might want to apply a specific prompt template first.
```python theme={"system"}
import requests
import os

headers = {
    "Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"
}

# model-specific prompt for mixedbread's reranker v2.
prompt = (
    "<|endoftext|><|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.\n<|im_end|>\n<|im_start|>user\n"
    "query: {query} \ndocument: {doc} \nYou are a search relevance expert who evaluates how well documents match search queries. For each query-document pair, carefully analyze the semantic relationship between them, then provide your binary relevance judgment (0 for not relevant, 1 for relevant).\nRelevance:<|im_end|>\n<|im_start|>assistant\n"
).format(query="What is Baseten?", doc="Baseten is a fast inference provider.")

requests.post(
    headers=headers,
    url="https://model-xxxxxx.api.baseten.co/environments/production/sync/predict",
    json={
        "inputs": prompt,
        "raw_scores": True,
    },
)
```
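With `raw_scores: True`, the endpoint returns unnormalized logits rather than probabilities. If you want a 0–1 relevance score, you can squash each logit with a sigmoid; a minimal sketch (the example scores are made up):

```python theme={"system"}
import math

def sigmoid(x: float) -> float:
    # Map an unbounded logit to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical raw scores, as a reranker might return them.
raw_scores = [4.2, -1.3, 0.0]
relevance = [sigmoid(s) for s in raw_scores]
print(relevance)
```

Depending on the model server, omitting `raw_scores` may return already-normalized scores, so check the response before applying a second sigmoid.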
### Benchmarks and performance optimizations
Embedding models on BEI are fast, currently offering the fastest embedding implementation across open-source and closed-source providers.
The implementation comes from the authors of [infinity](https://github.com/michaelfeil/infinity).
We recommend fp8 quantization for Llama, Mistral, and Qwen2 models on L4 or newer GPUs (L4, H100, H200, and B200).
The quality difference between fp8 and bfloat16 is usually negligible: embedding models often retain >99% cosine similarity between the two precisions,
and reranking models preserve the ranking order despite small differences in the raw scores.
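That cosine-similarity comparison is straightforward to reproduce yourself: embed the same inputs with an fp8 and a bfloat16 deployment and compare the vectors. A minimal sketch, with made-up vectors standing in for the two precisions:

```python theme={"system"}
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Standard cosine similarity: dot product over the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of the same input at two precisions.
emb_bf16 = [0.12, -0.48, 0.87, 0.05]
emb_fp8 = [0.121, -0.479, 0.869, 0.051]
print(cosine_similarity(emb_bf16, emb_fp8))  # close to 1.0
```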
For more details, check out the [technical launch post](https://www.baseten.co/blog/how-we-built-high-throughput-embedding-inference-with-tensorrt-llm/).
Baseten also offers options for sharing cached model weights across replicas, enabling very fast horizontal scaling.
Please contact us to enable this option.
### Deploy custom or fine-tuned models on BEI
We support deployment of the models below, as well as all fine-tuned variants of these models (same architecture, customized weights).
The following repositories are supported; this list is not exhaustive.
| Model Repository | Architecture | Function |
| ------------------------------------------------------------------------------------------------------------- | ----------------------------------- | ------------------- |
| [`Salesforce/SFR-Embedding-Mistral`](https://huggingface.co/Salesforce/SFR-Embedding-Mistral) | MistralModel | embedding |
| [`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3) | BertModel | embedding |
| [`BAAI/bge-multilingual-gemma2`](https://huggingface.co/BAAI/bge-multilingual-gemma2) | Gemma2Model | embedding |
| [`mixedbread-ai/mxbai-embed-large-v1`](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) | BertModel | embedding |
| [`BAAI/bge-large-en-v1.5`](https://huggingface.co/BAAI/bge-large-en-v1.5) | BertModel | embedding |
| [`allenai/Llama-3.1-Tulu-3-8B-RM`](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-RM) | LlamaForSequenceClassification | classifier |
| [`ncbi/MedCPT-Cross-Encoder`](https://huggingface.co/ncbi/MedCPT-Cross-Encoder) | BertForSequenceClassification | reranker/classifier |
| [`SamLowe/roberta-base-go_emotions`](https://huggingface.co/SamLowe/roberta-base-go_emotions) | RobertaForSequenceClassification | classifier |
| [`michaelfeil/mxbai-rerank-large-v2-seq`](https://huggingface.co/michaelfeil/mxbai-rerank-large-v2-seq) | Qwen2ForSequenceClassification | reranker/classifier |
| [`BAAI/bge-en-icl`](https://huggingface.co/BAAI/bge-en-icl) | LlamaModel | embedding |
| [`BAAI/bge-reranker-v2-m3`](https://huggingface.co/BAAI/bge-reranker-v2-m3) | BertForSequenceClassification | reranker/classifier |
| [`Skywork/Skywork-Reward-Llama-3.1-8B-v0.2`](https://huggingface.co/Skywork/Skywork-Reward-Llama-3.1-8B-v0.2) | LlamaForSequenceClassification | classifier |
| [`Snowflake/snowflake-arctic-embed-l`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) | BertModel | embedding |
| [`nomic-ai/nomic-embed-code`](https://huggingface.co/nomic-ai/nomic-embed-code) | Qwen2Model | embedding |
1 measured on H100-HBM3 (bert-large-335M, for BAAI/bge-en-icl: 9ms)
2 measured on H100-HBM3 (leading model architecture on MTEB, MistralModel-7B)
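Because support is keyed on architecture rather than on a specific checkpoint, one way to check whether a fine-tuned model will work is to inspect the `architectures` field of its `config.json` on Hugging Face. A hedged sketch (the mapping below is illustrative, drawn from the table above):

```python theme={"system"}
SUPPORTED_ARCHITECTURES = {
    "MistralModel": "embedding",
    "BertModel": "embedding",
    "Gemma2Model": "embedding",
    "LlamaModel": "embedding",
    "Qwen2Model": "embedding",
    "BertForSequenceClassification": "reranker/classifier",
    "LlamaForSequenceClassification": "classifier",
    "Qwen2ForSequenceClassification": "reranker/classifier",
}

def bei_function(hf_config: dict):
    # config.json lists architectures as e.g. ["BertModel"].
    for arch in hf_config.get("architectures", []):
        if arch in SUPPORTED_ARCHITECTURES:
            return SUPPORTED_ARCHITECTURES[arch]
    return None

# Example: a fine-tuned BGE-style checkpoint.
print(bei_function({"architectures": ["BertModel"]}))  # embedding
```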
# Transcribe audio with Chains
Source: https://docs.baseten.co/examples/chains-audio-transcription
Process hours of audio in seconds using efficient chunking, distributed inference, and optimized GPU resources.
This guide walks through building an audio transcription pipeline using Chains. You'll break down large media files, distribute transcription tasks across autoscaling deployments, and leverage high-performance GPUs for rapid inference.
# 1. Overview
This Chain enables fast, high-quality transcription by:
* **Partitioning** long files (10+ hours) into smaller segments.
* **Detecting silence** to optimize split points.
* **Parallelizing inference** across multiple GPU-backed deployments.
* **Batching requests** to maximize throughput.
* **Using range downloads** for efficient data streaming.
* **Leveraging `asyncio`** for concurrent execution.
# 2. Chain structure
Transcription is divided into two processing layers:
1. **Macro chunks:** Large segments (\~300s) split from the source media file. These are processed in parallel to handle massive files efficiently.
2. **Micro chunks:** Smaller segments (\~5–30s) extracted from macro chunks and sent to the Whisper model for transcription.
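The two-level chunking can be sketched as plain interval arithmetic (the real Chain additionally nudges split points toward silent sections; the sizes below are the approximate defaults mentioned above):

```python theme={"system"}
def chunk_spans(duration_sec: float, chunk_size_sec: float) -> list:
    # Split [0, duration) into consecutive spans of at most chunk_size_sec.
    spans = []
    start = 0.0
    while start < duration_sec:
        end = min(start + chunk_size_sec, duration_sec)
        spans.append((start, end))
        start = end
    return spans

# A 1-hour file: 12 macro chunks of 300s, each split into 10 micro chunks of 30s.
macro = chunk_spans(3600, 300)
micro = chunk_spans(macro[0][1] - macro[0][0], 30)
print(len(macro), len(micro))  # 12 10
```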
# 3. Implementing the Chainlets
## `Transcribe` (Entrypoint Chainlet)
Handles transcription requests and dispatches tasks to worker Chainlets.
Function signature:
```python theme={"system"}
async def run_remote(
    self,
    media_url: str,
    params: data_types.TranscribeParams
) -> data_types.TranscribeOutput:
```
**Steps:**
* Validates that the media source supports **range downloads**.
* Uses **FFmpeg** to extract metadata and duration.
* Splits the file into **macro chunks**, optimizing split points at silent sections.
* Dispatches **macro chunk tasks** to the MacroChunkWorker for processing.
* Collects **micro chunk transcriptions**, merges results, and returns the final text.
**Example request:**
```bash theme={"system"}
curl -X POST $INVOCATION_URL \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "media_url": "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/TearsOfSteel.mp4",
    "params": {
      "micro_chunk_size_sec": 30,
      "macro_chunk_size_sec": 300
    }
  }'
```
## `MacroChunkWorker` (Processing Chainlet)
Processes **macro chunks** by:
* **Extracting** relevant time segments using **FFmpeg**.
* **Streaming audio** instead of downloading full files for low latency.
* **Splitting segments** at silent points.
* **Encoding** audio in base64 for efficient transfer.
* **Distributing micro chunks** to the Whisper model for transcription.
This Chainlet **runs in parallel** with multiple instances autoscaled dynamically.
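The fan-out pattern, dispatching every macro chunk concurrently and gathering the partial transcripts, can be sketched with plain `asyncio` (the worker below is a stand-in for the real Chainlet call):

```python theme={"system"}
import asyncio

async def process_macro_chunk(chunk_id: int) -> str:
    # Stand-in for MacroChunkWorker.run_remote(): extract, split, transcribe.
    await asyncio.sleep(0)  # simulate remote I/O
    return f"transcript of chunk {chunk_id}"

async def transcribe(num_chunks: int) -> str:
    # Dispatch all macro chunks at once; gather preserves input order.
    parts = await asyncio.gather(
        *(process_macro_chunk(i) for i in range(num_chunks))
    )
    return " ".join(parts)

print(asyncio.run(transcribe(3)))
```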
## `WhisperModel` (Inference Model)
A separately deployed **Whisper** model Chainlet handles speech-to-text transcription.
* Deployed **independently** to allow fast iteration on business logic without redeploying the model.
* Used **across different Chains** or accessed directly as a standalone model.
* Supports **multiple environments** (e.g., dev, prod) using the same instance.
Whisper can also be deployed as a **standard Truss model**, separate from the Chain.
# 4. Optimizing performance
Even for very large files, **processing time remains bounded** by parallel execution.
## Key performance tuning parameters:
* `micro_chunk_size_sec` → Balance GPU utilization and inference latency.
* `macro_chunk_size_sec` → Adjust chunk size for optimal parallelism.
* **Autoscaling settings** → Tune concurrency and replica counts for load balancing.
Example speedup:
```json theme={"system"}
{
"input_duration_sec": 734.26,
"processing_duration_sec": 82.42,
"speedup": 8.9
}
```
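The `speedup` field is simply the input duration divided by the processing time:

```python theme={"system"}
# Values from the example response above.
input_duration_sec = 734.26
processing_duration_sec = 82.42

speedup = input_duration_sec / processing_duration_sec
print(round(speedup, 1))  # 8.9
```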
# 5. Deploy and run the Chain
## Deploy WhisperModel first:
```bash theme={"system"}
truss chains push whisper_chainlet.py
```
Copy the **invocation URL** and update `WHISPER_URL` in `transcribe.py`.
## Deploy the transcription Chain:
```bash theme={"system"}
truss chains push transcribe.py
```
## Run transcription on a sample file:
```bash theme={"system"}
curl -X POST $INVOCATION_URL \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{"media_url": "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/TearsOfSteel.mp4"}'
```
***
# Next steps
* Learn more about [Chains](/development/chain/overview).
* Optimize GPU **autoscaling** for peak efficiency.
* Extend the pipeline with **custom business logic**.
# RAG pipeline with Chains
Source: https://docs.baseten.co/examples/chains-build-rag
Build a RAG (retrieval-augmented generation) pipeline with Chains
[Learn more about Chains](/development/chain/overview)
## Prerequisites
Install [Truss](https://pypi.org/project/truss/):
```bash theme={"system"}
uv venv && source .venv/bin/activate
uv pip install --upgrade truss
```
```bash theme={"system"}
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade truss
```
```bash theme={"system"}
python3 -m venv .venv && .venv\Scripts\activate
pip install --upgrade truss
```
You also need a [Baseten account](https://app.baseten.co/signup) with an [API key](https://app.baseten.co/settings/account/api_keys).
If you want to run this example in
[local debugging mode](/development/chain/localdev#test-a-chain-locally), you'll also need to
install chromadb:
```shell theme={"system"}
uv pip install chromadb
```
The complete code used in this tutorial can also be found in the
[Chains examples repo](https://github.com/basetenlabs/truss/tree/main/truss-chains/examples/rag).
# Overview
Retrieval-augmented generation (RAG) is a multi-model pipeline for generating
context-aware answers from LLMs.
There are a number of ways to build a RAG system. This tutorial shows a minimum
viable implementation with a basic vector store and retrieval function. It's
intended as a starting point to show how Chains helps you flexibly combine model
inference and business logic.
In this tutorial, we'll build a simple RAG pipeline for a hypothetical alumni
matching service for a university. The system:
1. Takes a bio with information about a new graduate
2. Uses a vector database to retrieve semantically similar bios of other alums
3. Uses an LLM to explain why the new graduate should meet the selected alums
4. Returns the writeup from the LLM
Let's dive in!
## Building the Chain
Create a file `rag.py` in a new directory with:
```sh theme={"system"}
mkdir rag
touch rag/rag.py
cd rag
```
Our RAG Chain is composed of three parts:
* `VectorStore`, a Chainlet that implements a vector database with a retrieval
function.
* `LLMClient`, a Stub for connecting to a deployed LLM.
* `RAG`, the entrypoint Chainlet that orchestrates the RAG pipeline and
has `VectorStore` and `LLMClient` as dependencies.
We'll examine these components one by one and then see how they all work
together.
### Vector store Chainlet
A real production RAG system would use a hosted vector database with a massive
number of stored embeddings. For this example, we're using a small local vector
store built with `chromadb` to stand in for a more complex system.
The Chainlet has three parts:
* [`remote_config`](/reference/sdk/chains#remote-configuration), which
configures a Docker image on deployment with dependencies.
* `__init__()`, which runs once when the Chainlet is spun up, and creates the
vector database with ten sample bios.
* [`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets), which runs
each time the Chainlet is called and is the sole public interface for the
Chainlet.
```python rag/rag.py theme={"system"}
import truss_chains as chains


# Create a Chainlet to serve as our vector database.
class VectorStore(chains.ChainletBase):
    # Add chromadb as a dependency for deployment.
    remote_config = chains.RemoteConfig(
        docker_image=chains.DockerImage(
            pip_requirements=["chromadb"]
        )
    )

    # Runs once when the Chainlet is deployed or scaled up.
    def __init__(self):
        # Import Chainlet-specific dependencies in init, not at the top of
        # the file.
        import chromadb

        self._chroma_client = chromadb.EphemeralClient()
        self._collection = self._chroma_client.create_collection(name="bios")
        # Sample documents are hard-coded for your convenience
        documents = [
            "Angela Martinez is a tech entrepreneur based in San Francisco. As the founder and CEO of a successful AI startup, she is a leading figure in the tech community. Outside of work, Angela enjoys hiking the trails around the Bay Area and volunteering at local animal shelters.",
            "Ravi Patel resides in New York City, where he works as a financial analyst. Known for his keen insight into market trends, Ravi spends his weekends playing chess in Central Park and exploring the city's diverse culinary scene.",
            "Sara Kim is a digital marketing specialist living in San Francisco. She helps brands build their online presence with creative strategies. Outside of work, Sara is passionate about photography and enjoys hiking the trails around the Bay Area.",
            "David O'Connor calls New York City his home and works as a high school teacher. He is dedicated to inspiring the next generation through education. In his free time, David loves running along the Hudson River and participating in local theater productions.",
            "Lena Rossi is an architect based in San Francisco. She designs sustainable and innovative buildings that contribute to the city's skyline. When she's not working, Lena enjoys practicing yoga and exploring art galleries.",
            "Akio Tanaka lives in Tokyo and is a software developer specializing in mobile apps. Akio is an avid gamer and enjoys attending eSports tournaments. He also has a passion for cooking and often experiments with new recipes in his spare time.",
            "Maria Silva is a nurse residing in New York City. She is dedicated to providing compassionate care to her patients. Maria finds joy in gardening and often spends her weekends tending to her vibrant flower beds and vegetable garden.",
            "John Smith is a journalist based in San Francisco. He reports on international politics and has a knack for uncovering compelling stories. Outside of work, John is a history buff who enjoys visiting museums and historical sites.",
            "Aisha Mohammed lives in Tokyo and works as a graphic designer. She creates visually stunning graphics for a variety of clients. Aisha loves to paint and often showcases her artwork in local exhibitions.",
            "Carlos Mendes is an environmental engineer in San Francisco. He is passionate about developing sustainable solutions for urban areas. In his leisure time, Carlos enjoys surfing and participating in beach clean-up initiatives."
        ]
        # Add all documents to the database
        self._collection.add(
            documents=documents,
            ids=[f"id{n}" for n in range(len(documents))]
        )

    # Runs each time the Chainlet is called
    async def run_remote(self, query: str) -> list[str]:
        # This query includes embedding the query string.
        results = self._collection.query(query_texts=[query], n_results=2)
        if results is None or not results:
            raise ValueError("No bios returned from the query")
        if not results["documents"] or not results["documents"][0]:
            raise ValueError("Bios are empty")
        return results["documents"][0]
```
### LLM inference stub
Now that we can retrieve relevant bios from the vector database, we need to pass
that information to an LLM to generate our final output.
Chains can integrate previously deployed models using a Stub. Like Chainlets,
Stubs implement
[`run_remote()`](/development/chain/concepts#run-remote-chaining-chainlets), but as a call
to the deployed model.
For our LLM, we'll use Phi-3 Mini Instruct, a small-but-mighty open-source LLM. You can deploy it with one click from Baseten's model library.
While the model is deploying, note down the model's invocation URL from
the model dashboard for use in the next step.
To use our deployed LLM in the RAG Chain, we define a Stub:
```python rag/rag.py theme={"system"}
class LLMClient(chains.StubBase):
    # Runs each time the Stub is called
    async def run_remote(self, new_bio: str, bios: list[str]) -> str:
        # Use the retrieved bios to augment the prompt -- here's the "A" in RAG!
        prompt = f"""You are matching alumni of a college to help them make connections. Explain why the person described first would want to meet the people selected from the matching database.
Person you're matching: {new_bio}
People from database: {" ".join(bios)}"""
        # Call the deployed model.
        resp = await self._remote.predict_async(json_payload={
            "messages": [{"role": "user", "content": prompt}],
            "stream": False
        })
        return resp["output"][len(prompt):].strip()
```
### RAG entrypoint Chainlet
The entrypoint to a Chain is the Chainlet that specifies the public-facing input
and output of the Chain and orchestrates calls to dependencies.
The `__init__` function in this Chainlet takes two new arguments:
* Add dependencies to any Chainlet with
[`chains.depends()`](/reference/sdk/chains#truss-chains-depends). Only
Chainlets, not Stubs, need to be added in this fashion.
* Use
[`chains.depends_context()`](/reference/sdk/chains#truss-chains-depends-context)
to inject a context object at runtime. This context object is required to
initialize the `LLMClient` stub.
* Visit your [Baseten workspace](https://app.baseten.co/models) to find the URL of the previously deployed Phi-3 model and insert it as the value for `LLM_URL`.
```python rag/rag.py theme={"system"}
# Insert the URL from the previously deployed Phi-3 model.
LLM_URL = ...


@chains.mark_entrypoint
class RAG(chains.ChainletBase):
    # Runs once when the Chainlet is spun up
    def __init__(
        self,
        # Declare dependency chainlets.
        vector_store: VectorStore = chains.depends(VectorStore),
        context: chains.DeploymentContext = chains.depends_context(),
    ):
        self._vector_store = vector_store
        # The stub needs the context for setting up authentication.
        self._llm = LLMClient.from_url(LLM_URL, context)

    # Runs each time the Chain is called
    async def run_remote(self, new_bio: str) -> str:
        # Use the VectorStore Chainlet for context retrieval.
        bios = await self._vector_store.run_remote(new_bio)
        # Use the LLMClient Stub for augmented generation.
        contacts = await self._llm.run_remote(new_bio, bios)
        return contacts
```
## Testing locally
Because our Chain uses a Stub for the LLM call, we can run the whole Chain
locally without any GPU resources.
Before running the Chainlet, make sure to set your Baseten API key as an
environment variable `BASETEN_API_KEY`.
```python rag/rag.py theme={"system"}
if __name__ == "__main__":
    import os
    import asyncio

    with chains.run_local(
        # This secret is needed even locally, because part of this chain
        # calls the separately deployed Phi-3 model. Only the Chainlets
        # actually run locally.
        secrets={"baseten_chain_api_key": os.environ["BASETEN_API_KEY"]}
    ):
        rag_client = RAG()
        result = asyncio.get_event_loop().run_until_complete(
            rag_client.run_remote(
                """
                Sam just moved to Manhattan for his new job at a large bank.
                In college, he enjoyed building sets for student plays.
                """
            )
        )
        print(result)
```
We can run our Chain locally:
```sh theme={"system"}
python rag.py
```
After a few moments, we should get a recommendation for why Sam should meet the
alumni selected from the database.
## Deploying to production
Once we're satisfied with our Chain's local behavior, we can deploy it to
Baseten. To deploy the Chain, run:
```sh theme={"system"}
truss chains push rag.py
```
This deploys the Chain as a published deployment. Once it's running, call it
from its API endpoint.
You can do this in the console with cURL:
```sh theme={"system"}
curl -X POST 'https://chain-5wo86nn3.api.baseten.co/production/run_remote' \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{"new_bio": "Sam just moved to Manhattan for his new job at a large bank. In college, he enjoyed building sets for student plays."}'
```
Alternatively, you can also integrate this in a Python application:
```python call_chain.py theme={"system"}
import requests
import os

# Insert the URL from the deployed rag chain. You can get it from the CLI
# output or the status page, e.g.
# "https://chain-6wgeygoq.api.baseten.co/production/run_remote".
RAG_CHAIN_URL = ""
baseten_api_key = os.environ["BASETEN_API_KEY"]

if not RAG_CHAIN_URL:
    raise ValueError("Please insert the URL for the RAG chain.")

new_bio = (
    "Sam just moved to Manhattan for his new job at a large bank. "
    "In college, he enjoyed building sets for student plays."
)

resp = requests.post(
    RAG_CHAIN_URL,
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={"new_bio": new_bio},
)
print(resp.json())
```
The published deployment has access to full autoscaling settings and will
scale to zero when not in use.
To iterate on the Chain during development, use `truss chains push --watch rag.py`
to create a development deployment with live code patching.
# Deploy a ComfyUI project
Source: https://docs.baseten.co/examples/comfyui
Deploy your ComfyUI workflow as an API endpoint
In this example, we'll deploy an **anime style transfer** ComfyUI workflow using Truss.
This example doesn't require any Python code, but there are a few prerequisites to get started.
Prerequisites:
1. Convert your ComfyUI workflow to an **API-compatible JSON format**. The regular JSON format used to export Comfy workflows will not work here.
2. Have a list of the models your workflow requires, along with URLs where each model can be downloaded.
## Setup
Clone the truss-examples repository and navigate to the `comfyui-truss` directory:
```bash theme={"system"}
git clone https://github.com/basetenlabs/truss-examples.git
cd truss-examples/comfyui-truss
```
This repository already contains all the files we need to deploy our ComfyUI workflow.
There are just two files we need to modify:
1. `config.yaml`
2. `data/comfy_ui_workflow.json`
## Setting up the `config.yaml`
```yaml theme={"system"}
build_commands:
- git clone https://github.com/comfyanonymous/ComfyUI.git
- cd ComfyUI && git checkout b1fd26fe9e55163f780bf9e5f56bf9bf5f035c93 && pip install -r requirements.txt
- cd ComfyUI/custom_nodes && git clone https://github.com/LykosAI/ComfyUI-Inference-Core-Nodes --recursive && cd ComfyUI-Inference-Core-Nodes && pip install -e .[cuda12]
- cd ComfyUI/custom_nodes && git clone https://github.com/ZHO-ZHO-ZHO/ComfyUI-Gemini --recursive && cd ComfyUI-Gemini && pip install -r requirements.txt
- cd ComfyUI/custom_nodes && git clone https://github.com/kijai/ComfyUI-Marigold --recursive && cd ComfyUI-Marigold && pip install -r requirements.txt
- cd ComfyUI/custom_nodes && git clone https://github.com/omar92/ComfyUI-QualityOfLifeSuit_Omar92 --recursive
- cd ComfyUI/custom_nodes && git clone https://github.com/Fannovel16/comfyui_controlnet_aux --recursive && cd comfyui_controlnet_aux && pip install -r requirements.txt
- cd ComfyUI/models/controlnet && wget -O control-lora-canny-rank256.safetensors https://huggingface.co/stabilityai/control-lora/resolve/main/control-LoRAs-rank256/control-lora-canny-rank256.safetensors
- cd ComfyUI/models/controlnet && wget -O control-lora-depth-rank256.safetensors https://huggingface.co/stabilityai/control-lora/resolve/main/control-LoRAs-rank256/control-lora-depth-rank256.safetensors
- cd ComfyUI/models/checkpoints && wget -O dreamshaperXL_v21TurboDPMSDE.safetensors https://civitai.com/api/download/models/351306
- cd ComfyUI/models/loras && wget -O StudioGhibli.Redmond-StdGBRRedmAF-StudioGhibli.safetensors https://huggingface.co/artificialguybr/StudioGhibli.Redmond-V2/resolve/main/StudioGhibli.Redmond-StdGBRRedmAF-StudioGhibli.safetensors
environment_variables: {}
external_package_dirs: []
model_metadata: {}
model_name: Anime Style Transfer
python_version: py310
requirements:
- websocket-client
- accelerate
- opencv-python
resources:
  accelerator: H100
  use_gpu: true
secrets: {}
system_packages:
- wget
- ffmpeg
- libgl1-mesa-glx
```
The main part that needs to be filled out is `build_commands`. Build commands are shell commands that run during the build stage of the Docker image.
In this example, the first two lines clone the ComfyUI repository and install its Python requirements.
The remaining commands install various custom nodes and download models into their respective directories within the ComfyUI repository.
## Modifying `data/comfy_ui_workflow.json`
The `comfy_ui_workflow.json` contains the entire ComfyUI workflow in an API compatible format. This is the workflow that will get executed by the ComfyUI server.
Here is the workflow we will be using for this example.
```json theme={"system"}
{
"1": {
"inputs": {
"ckpt_name": "dreamshaperXL_v21TurboDPMSDE.safetensors"
},
"class_type": "CheckpointLoaderSimple",
"_meta": {
"title": "Load Checkpoint"
}
},
"3": {
"inputs": {
"image": "{{input_image}}",
"upload": "image"
},
"class_type": "LoadImage",
"_meta": {
"title": "Load Image"
}
},
"4": {
"inputs": {
"text": [
"160",
0
],
"clip": [
"154",
1
]
},
"class_type": "CLIPTextEncode",
"_meta": {
"title": "CLIP Text Encode (Prompt)"
}
},
"12": {
"inputs": {
"strength": 0.8,
"conditioning": [
"131",
0
],
"control_net": [
"13",
0
],
"image": [
"71",
0
]
},
"class_type": "ControlNetApply",
"_meta": {
"title": "Apply ControlNet"
}
},
"13": {
"inputs": {
"control_net_name": "control-lora-canny-rank256.safetensors"
},
"class_type": "ControlNetLoader",
"_meta": {
"title": "Load ControlNet Model"
}
},
"15": {
"inputs": {
"strength": 0.8,
"conditioning": [
"12",
0
],
"control_net": [
"16",
0
],
"image": [
"18",
0
]
},
"class_type": "ControlNetApply",
"_meta": {
"title": "Apply ControlNet"
}
},
"16": {
"inputs": {
"control_net_name": "control-lora-depth-rank256.safetensors"
},
"class_type": "ControlNetLoader",
"_meta": {
"title": "Load ControlNet Model"
}
},
"18": {
"inputs": {
"seed": 995352869972963,
"denoise_steps": 4,
"n_repeat": 10,
"regularizer_strength": 0.02,
"reduction_method": "median",
"max_iter": 5,
"tol": 0.001,
"invert": true,
"keep_model_loaded": true,
"n_repeat_batch_size": 2,
"use_fp16": true,
"scheduler": "LCMScheduler",
"normalize": true,
"model": "marigold-lcm-v1-0",
"image": [
"3",
0
]
},
"class_type": "MarigoldDepthEstimation",
"_meta": {
"title": "MarigoldDepthEstimation"
}
},
"19": {
"inputs": {
"images": [
"71",
0
]
},
"class_type": "PreviewImage",
"_meta": {
"title": "Preview Image"
}
},
"20": {
"inputs": {
"images": [
"18",
0
]
},
"class_type": "PreviewImage",
"_meta": {
"title": "Preview Image"
}
},
"21": {
"inputs": {
"seed": 358881677137626,
"steps": 20,
"cfg": 7,
"sampler_name": "dpmpp_2m_sde",
"scheduler": "karras",
"denoise": 0.7000000000000001,
"model": [
"154",
0
],
"positive": [
"15",
0
],
"negative": [
"4",
0
],
"latent_image": [
"25",
0
]
},
"class_type": "KSampler",
"_meta": {
"title": "KSampler"
}
},
"25": {
"inputs": {
"pixels": [
"70",
0
],
"vae": [
"1",
2
]
},
"class_type": "VAEEncode",
"_meta": {
"title": "VAE Encode"
}
},
"27": {
"inputs": {
"samples": [
"21",
0
],
"vae": [
"1",
2
]
},
"class_type": "VAEDecode",
"_meta": {
"title": "VAE Decode"
}
},
"70": {
"inputs": {
"upscale_method": "lanczos",
"megapixels": 1,
"image": [
"3",
0
]
},
"class_type": "ImageScaleToTotalPixels",
"_meta": {
"title": "ImageScaleToTotalPixels"
}
},
"71": {
"inputs": {
"low_threshold": 50,
"high_threshold": 150,
"resolution": 1024,
"image": [
"3",
0
]
},
"class_type": "CannyEdgePreprocessor",
"_meta": {
"title": "Canny Edge"
}
},
"123": {
"inputs": {
"images": [
"27",
0
]
},
"class_type": "PreviewImage",
"_meta": {
"title": "Preview Image"
}
},
"131": {
"inputs": {
"text": [
"159",
0
],
"clip": [
"154",
1
]
},
"class_type": "CLIPTextEncode",
"_meta": {
"title": "CLIP Text Encode (Prompt)"
}
},
"152": {
"inputs": {
"text": "{{prompt}}"
},
"class_type": "Text _O",
"_meta": {
"title": "Text_1"
}
},
"154": {
"inputs": {
"lora_name": "StudioGhibli.Redmond-StdGBRRedmAF-StudioGhibli.safetensors",
"strength_model": 0.6,
"strength_clip": 1,
"model": [
"1",
0
],
"clip": [
"1",
1
]
},
"class_type": "LoraLoader",
"_meta": {
"title": "Load LoRA"
}
},
"156": {
"inputs": {
"text_1": [
"152",
0
],
"text_2": [
"158",
0
]
},
"class_type": "ConcatText_Zho",
"_meta": {
"title": "✨ConcatText_Zho"
}
},
"157": {
"inputs": {
"text": "StdGBRedmAF,Studio Ghibli,"
},
"class_type": "Text _O",
"_meta": {
"title": "Text _2"
}
},
"158": {
"inputs": {
"text": "looking at viewer, anime artwork, anime style, key visual, vibrant, studio anime, highly detailed"
},
"class_type": "Text _O",
"_meta": {
"title": "Text _O"
}
},
"159": {
"inputs": {
"text_1": [
"156",
0
],
"text_2": [
"157",
0
]
},
"class_type": "ConcatText_Zho",
"_meta": {
"title": "✨ConcatText_Zho"
}
},
"160": {
"inputs": {
"text": "photo, deformed, black and white, realism, disfigured, low contrast"
},
"class_type": "Text _O",
"_meta": {
"title": "Text _O"
}
}
}
```
**Important:**
If you look at the JSON file above, you'll notice we have templatized a few items using the **`{{handlebars}}`** templating style.
If any inputs in your ComfyUI workflow should be variables, such as input prompts or images, templatize them using the handlebars format.
In this example workflow, there are two inputs: **`{{input_image}}`** and **`{{prompt}}`**.
When making an API call to this workflow, we can pass in a value for each of these inputs.
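A hedged sketch of what the substitution amounts to: serialize the workflow, replace each `{{name}}` placeholder, and parse it back. The Truss handler in this repository does this for you; `fill_template` here is illustrative only:

```python theme={"system"}
import json

# A one-node excerpt of the workflow above, kept as a JSON string template.
workflow_template = json.dumps({
    "152": {"inputs": {"text": "{{prompt}}"}, "class_type": "Text _O"},
})

def fill_template(template: str, values: dict) -> dict:
    # Replace each {{name}} placeholder with its value, then parse the JSON.
    # (Values containing quotes would need escaping; fine for a sketch.)
    for name, value in values.items():
        template = template.replace("{{" + name + "}}", value)
    return json.loads(template)

workflow = fill_template(workflow_template, {"prompt": "american Shorthair"})
print(workflow["152"]["inputs"]["text"])  # american Shorthair
```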
## Deploying the Workflow to Baseten
Once you have both your `config.yaml` and `data/comfy_ui_workflow.json` filled out, you can deploy this workflow just like any other model on Baseten.
Install Truss:
```bash theme={"system"}
uv venv && source .venv/bin/activate
uv pip install truss --upgrade
```
```bash theme={"system"}
python3 -m venv .venv && source .venv/bin/activate
pip install truss --upgrade
```
```bash theme={"system"}
python3 -m venv .venv && .venv\Scripts\activate
pip install truss --upgrade
```
Then deploy:
```bash theme={"system"}
truss push
```
## Running inference
When you deploy the Truss, it spins up a new deployment in your Baseten account. Each deployment exposes a REST API endpoint that we can use to call this workflow.
```python theme={"system"}
import requests
import os
import base64
from PIL import Image
from io import BytesIO

# Replace the empty string with your model id below
model_id = ""
baseten_api_key = os.environ["BASETEN_API_KEY"]
BASE64_PREAMBLE = "data:image/png;base64,"

def pil_to_b64(pil_img):
    buffered = BytesIO()
    pil_img.save(buffered, format="PNG")
    img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
    return img_str

def b64_to_pil(b64_str):
    return Image.open(BytesIO(base64.b64decode(b64_str.replace(BASE64_PREAMBLE, ""))))

values = {
    "prompt": "american Shorthair",
    "input_image": {"type": "image", "data": pil_to_b64(Image.open("/path/to/cat.png"))}
}

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={"workflow_values": values}
)

res = resp.json()
results = res.get("result")

# Save each returned PNG under a distinct filename.
for i, item in enumerate(results):
    if item.get("format") == "png":
        data = item.get("data")
        img = b64_to_pil(data)
        img.save(f"pet-style-transfer-{i}.png")
```
If you recall, we templatized two variables in our workflow: `prompt` and `input_image`. In our API call we can specify the values for these two variables like so:
```json theme={"system"}
values = {
"prompt": "Maltipoo",
"input_image": {"type": "image", "data": pil_to_b64(Image.open("/path/to/dog.png"))}
}
```
If your workflow contains more variables, simply add them to the dictionary above.
The API call returns an image in the form of a base64 string, which we convert to a PNG image.
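The base64 handling is a plain encode/decode round trip. A stdlib-only sketch of the conversion, including stripping the data-URL preamble used in the snippet above:

```python theme={"system"}
import base64

BASE64_PREAMBLE = "data:image/png;base64,"

def bytes_to_b64(data: bytes) -> str:
    # Encode raw bytes as a base64 string.
    return base64.b64encode(data).decode("utf-8")

def b64_to_bytes(b64_str: str) -> bytes:
    # Strip the data-URL preamble if present, then decode.
    return base64.b64decode(b64_str.replace(BASE64_PREAMBLE, ""))

# Round trip on the PNG magic bytes.
png_header = b"\x89PNG\r\n\x1a\n"
encoded = BASE64_PREAMBLE + bytes_to_b64(png_header)
print(b64_to_bytes(encoded) == png_header)  # True
```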
# Customize a model
Source: https://docs.baseten.co/examples/customize-a-model
Deploy a model with custom Python code using the Truss Model class.
Most models on Baseten deploy with just a `config.yaml` and an inference engine. But when you need custom preprocessing, postprocessing, or want to run a model architecture that the built-in engines don't support, you can write Python code in a `model.py` file. Truss provides a `Model` class with three methods (`__init__`, `load`, and `predict`) that give you full control over how your model initializes, loads weights, and handles requests.
This guide walks through deploying [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct), a 3.8B parameter LLM, using custom Python code. If you haven't deployed a config-only model yet, start with [Deploy your first model](/examples/deploy-your-first-model).
## Set up your environment
Before you begin, [sign up](https://app.baseten.co/signup) or [sign in](https://app.baseten.co/login) to Baseten.
### Install Truss
[Truss](https://pypi.org/project/truss/) is Baseten's model packaging framework. It handles containerization, dependencies, and deployment configuration.
[uv](https://docs.astral.sh/uv/) is a fast Python package manager:
```sh theme={"system"}
uv venv && source .venv/bin/activate
uv pip install truss
```
These commands create a virtual environment, activate it, and install Truss:
```sh theme={"system"}
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade truss
```
These commands create a virtual environment, activate it, and install Truss:
```sh theme={"system"}
python3 -m venv .venv && .venv\Scripts\activate
pip install --upgrade truss
```
New accounts include free credits. This guide uses less than \$1 in GPU costs.
***
## Create a Truss project
Create a new Truss:
```sh theme={"system"}
truss init phi-3-mini && cd phi-3-mini
```
When prompted, give your Truss a name like `Phi 3 Mini`.
This command scaffolds a project with the following structure:
```
phi-3-mini/
  model/
    __init__.py
    model.py
  config.yaml
  data/
  packages/
The key files are:
* `model/model.py`: Your model code with `load()` and `predict()` methods.
* `config.yaml`: Dependencies, resources, and deployment settings.
* `data/`: Optional directory for data files bundled with your model.
* `packages/`: Optional directory for local Python packages.
Truss uses this structure to build and deploy your model automatically. You define your model in `model.py` and your infrastructure in `config.yaml`; no Dockerfiles or container management required.
***
## Implement model code
Replace the contents of `model/model.py` with the following code. This loads [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) using the `transformers` library and PyTorch:
```python model/model.py theme={"system"}
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class Model:
    def __init__(self, **kwargs):
        self._model = None
        self._tokenizer = None

    def load(self):
        self._model = AutoModelForCausalLM.from_pretrained(
            "microsoft/Phi-3-mini-4k-instruct",
            device_map="cuda",
            torch_dtype="auto"
        )
        self._tokenizer = AutoTokenizer.from_pretrained(
            "microsoft/Phi-3-mini-4k-instruct"
        )

    def predict(self, request):
        messages = request.pop("messages")
        model_inputs = self._tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self._tokenizer(model_inputs, return_tensors="pt").to("cuda")
        with torch.no_grad():
            outputs = self._model.generate(input_ids=inputs["input_ids"], max_length=256)
        return {"output": self._tokenizer.decode(outputs[0], skip_special_tokens=True)}
```
Truss models follow a three-method pattern that separates initialization from inference:
| Method | When it's called | What to do here |
| ---------- | ------------------------------------ | ---------------------------------------------------------- |
| `__init__` | Once when the class is created | Initialize variables, store configuration, set secrets. |
| `load` | Once at startup, before any requests | Load model weights, tokenizers, and other heavy resources. |
| `predict` | On every API request | Process input, run inference, return response. |
The `load` method runs during the container's cold start, before your model receives traffic. This keeps expensive operations (like downloading large model weights) out of the request path.
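To see the lifecycle in isolation, here's a minimal sketch of the same three-method pattern with no ML dependencies (the `lambda` stands in for real model weights):

```python theme={"system"}
class Model:
    def __init__(self, **kwargs):
        # Called once when the class is created: store config, no heavy work.
        self._model = None

    def load(self):
        # Called once at cold start, before any requests: load heavy resources here.
        self._model = lambda text: text.upper()  # stand-in for real weights

    def predict(self, request):
        # Called on every request: the returned dict becomes the JSON response.
        return {"output": self._model(request["text"])}

# Baseten calls these in order: __init__, then load, then predict per request.
model = Model()
model.load()
print(model.predict({"text": "hello"}))  # {'output': 'HELLO'}
```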
### Understand the request/response flow
The `predict` method receives `request`, a dictionary containing the JSON body from the API call:
```python theme={"system"}
# API call with: {"messages": [{"role": "user", "content": "Hello"}]}
def predict(self, request):
    messages = request.pop("messages")  # Extract from request
    # ... run inference ...
    return {"output": result}  # Return dict becomes JSON response
```
Whatever dictionary you return becomes the API response. You control the input parameters and output format.
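For example, a hypothetical `predict` (none of these field names are required by Truss itself) can define optional parameters, its own error shape, and extra response fields:

```python theme={"system"}
class Model:
    def predict(self, request):
        prompt = request.get("prompt", "")
        max_words = request.get("max_words", 10)
        if not prompt:
            # You define the error shape your clients see, too.
            return {"error": "missing required field: prompt"}
        words = prompt.split()[:max_words]
        # Extra fields are fine: the whole dict becomes the JSON body.
        return {"output": " ".join(words), "word_count": len(words)}

m = Model()
print(m.predict({"prompt": "one two three four", "max_words": 2}))
# {'output': 'one two', 'word_count': 2}
```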
### GPU and memory patterns
A few patterns in this code are common across GPU models:
* **`device_map="cuda"`**: Loads model weights directly to GPU.
* **`.to("cuda")`**: Moves input tensors to GPU for inference.
* **`torch.no_grad()`**: Disables gradient tracking to save memory (gradients aren't needed for inference).
***
## Configure dependencies and GPU
The `config.yaml` file defines your model's environment and compute resources.
### Set Python version and dependencies
```yaml config.yaml theme={"system"}
python_version: py311
requirements:
  - six==1.17.0
  - accelerate==0.30.1
  - einops==0.8.0
  - transformers==4.41.2
  - torch==2.3.0
```
**Key configuration options:**
| Field | Purpose | Example |
| ----------------- | ----------------------------------------- | --------------------------------- |
| `python_version` | Python version for your container. | `py39`, `py310`, `py311`, `py312` |
| `requirements` | Python packages to install (pip format). | `torch==2.3.0` |
| `system_packages` | System-level dependencies (apt packages). | `ffmpeg`, `libsm6` |
For the complete list of configuration options, see the [Truss reference config](/reference/truss-configuration).
Always pin exact versions (e.g., `torch==2.3.0`, not `torch>=2.0`). This ensures reproducible builds, so your model behaves the same way every time it's deployed.
### Allocate a GPU
The `resources` section specifies what hardware your model runs on:
```yaml config.yaml theme={"system"}
resources:
  accelerator: T4
  use_gpu: true
```
Match your GPU to your model's VRAM requirements. For Phi-3-mini (approximately 7.6 GB), a T4 (16 GB) provides headroom for inference.
| GPU | VRAM | Good for |
| ---- | -------- | -------------------------------------------- |
| T4 | 16 GB | Small models, embeddings, fine-tuned models. |
| L4 | 24 GB | Medium models (7B parameters). |
| A10G | 24 GB | Medium models, image generation. |
| A100 | 40/80 GB | Large models (13B-70B parameters). |
| H100 | 80 GB | Very large models, high throughput. |
A rough rule for estimating VRAM: 2 GB per billion parameters for float16 models. A 7B model needs approximately 14 GB VRAM minimum.
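That rule of thumb is simple enough to encode. This illustrative helper (not part of Truss) estimates minimum VRAM from parameter count and bytes per weight:

```python theme={"system"}
def estimate_vram_gb(num_params_billions: float, bytes_per_param: int = 2) -> float:
    """Rough minimum VRAM for model weights alone (float16 = 2 bytes/param).

    Real usage is higher: activations, KV cache, and CUDA overhead all add up,
    so pick a GPU with comfortable headroom above this number.
    """
    return num_params_billions * bytes_per_param

print(estimate_vram_gb(7))    # 14.0 -> a 7B float16 model needs ~14 GB minimum
print(estimate_vram_gb(3.8))  # 7.6  -> Phi-3-mini fits a 16 GB T4 with headroom
```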
***
## Deploy the model
### Authenticate with Baseten
Generate an API key from [Baseten settings](https://app.baseten.co/settings/account/api_keys), then log in:
```sh theme={"system"}
truss login
```
You should see:
```output theme={"system"}
💻 Let's add a Baseten remote!
🤫 Quietly paste your API_KEY:
```
Paste your API key when prompted. Truss saves your credentials for future deployments.
### Push your model to Baseten
```sh theme={"system"}
truss push --watch
```
You should see:
```output theme={"system"}
✨ Model Phi 3 Mini was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abc1d2ef/logs/xyz123
```
The logs URL contains your model ID, the string after `/models/` (e.g., `abc1d2ef`). You'll need this to call the model's API. You can also find it in your [Baseten dashboard](https://app.baseten.co/models/).
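If you're scripting deployments, the model ID can be pulled out of the logs URL. This small helper is illustrative, not part of the Truss CLI:

```python theme={"system"}
import re

def model_id_from_logs_url(url: str) -> str:
    """Extract the model ID (the segment after /models/) from a Baseten logs URL."""
    match = re.search(r"/models/([^/]+)", url)
    if not match:
        raise ValueError(f"no model ID found in {url!r}")
    return match.group(1)

logs_url = "https://app.baseten.co/models/abc1d2ef/logs/xyz123"
print(model_id_from_logs_url(logs_url))  # abc1d2ef
```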
***
## Call the model API
After the deployment shows "Active" in the dashboard, call the model API:
From your Truss project directory, run:
```sh theme={"system"}
truss predict --data '{"messages": [{"role": "user", "content": "What is AGI?"}]}'
```
You should see:
```output theme={"system"}
Calling predict on development deployment...
{
  "output": "AGI stands for Artificial General Intelligence..."
}
```
The Truss CLI uses your saved credentials and automatically targets the correct deployment.
Replace `YOUR_MODEL_ID` with your model ID (e.g., `abc1d2ef`):
```sh theme={"system"}
curl -X POST https://model-YOUR_MODEL_ID.api.baseten.co/development/predict \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What is AGI?"}]}'
```
You should see:
```output theme={"system"}
{"output": "AGI stands for Artificial General Intelligence..."}
```
Replace `YOUR_MODEL_ID` with your model ID:
```python main.py theme={"system"}
import requests
import os

model_id = "YOUR_MODEL_ID"  # Replace with your model ID (e.g., "abc1d2ef")
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/development/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "messages": [
            {"role": "user", "content": "What is AGI?"}
        ]
    }
)

print(resp.json())
```
You should see:
```output theme={"system"}
{"output": "AGI stands for Artificial General Intelligence..."}
```
***
## Use live reload for development
To avoid long deploy times when testing changes, use live reload:
```sh theme={"system"}
truss watch
```
You should see:
```output theme={"system"}
🪵 View logs for your deployment at https://app.baseten.co/models//logs/
🚰 Attempting to sync truss with remote
No changes observed, skipping patching.
👀 Watching for changes to truss...
```
When you save changes to `model.py`, Truss automatically patches the deployed model:
```output theme={"system"}
Changes detected, creating patch...
Created patch to update model code file: model/model.py
Model Phi 3 Mini patched successfully.
```
This saves time by patching only the updated code without rebuilding Docker containers or restarting the model server.
***
## Promote to production
Once you're happy with the model, deploy it to production:
```sh theme={"system"}
truss push --promote
```
This changes the API endpoint from `/development/predict` to `/production/predict`:
```sh theme={"system"}
curl -X POST https://model-YOUR_MODEL_ID.api.baseten.co/production/predict \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What is AGI?"}]}'
```
Your model ID is the string after `/models/` in the logs URL from `truss push`. You can also find it in your [Baseten dashboard](https://app.baseten.co/models/).
***
## Next steps
Full reference for dependencies, secrets, resources, and deployment settings.
Advanced patterns including streaming, async, and custom health checks.
Scale GPU replicas based on demand with configurable concurrency targets.
Deploy a model with just a config file, no custom Python needed.
# Deploy your first model
Source: https://docs.baseten.co/examples/deploy-your-first-model
Deploy an open-source LLM to Baseten with just a config file and get an OpenAI-compatible API endpoint.
Deploying a model to Baseten turns a Hugging Face model into a production-ready API endpoint. You write a `config.yaml` that specifies the model, the hardware, and the engine, then `truss push` builds a TensorRT-optimized container and deploys it. No Python code, no Dockerfile, no container management.
This guide walks through deploying [Qwen 2.5 3B Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct), a small but capable LLM, from a config file to a production API. You'll set up Truss, write a config, deploy to Baseten, call the model's OpenAI-compatible endpoint, and promote to production.
## Set up your environment
Before you begin, [sign up](https://app.baseten.co/signup) or [sign in](https://app.baseten.co/login) to Baseten.
### Install Truss
[Truss](https://pypi.org/project/truss/) is Baseten's open-source framework for packaging models into deployable containers.
[uv](https://docs.astral.sh/uv/) is a fast Python package manager:
```sh theme={"system"}
uv venv && source .venv/bin/activate
uv pip install truss
```
These commands create a virtual environment, activate it, and install Truss:
```sh theme={"system"}
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade truss
```
These commands create a virtual environment, activate it, and install Truss:
```sh theme={"system"}
python3 -m venv .venv && .venv\Scripts\activate
pip install --upgrade truss
```
### Authenticate with Baseten
Generate an API key from [Settings > API keys](https://app.baseten.co/settings/account/api_keys), then log in:
```sh theme={"system"}
truss login
```
Paste your API key when prompted:
```output theme={"system"}
💻 Let's add a Baseten remote!
🤫 Quietly paste your API_KEY:
```
You can skip the interactive prompt by setting `BASETEN_API_KEY` as an environment variable:
```bash theme={"system"}
export BASETEN_API_KEY="paste-your-api-key-here"
```
New accounts include free credits. This guide uses an L4 GPU, one of the most cost-effective options available.
***
## Create a Truss project
Scaffold a new project:
```sh theme={"system"}
truss init qwen-2.5-3b && cd qwen-2.5-3b
```
When prompted, name the model `Qwen 2.5 3B`.
```output theme={"system"}
? 📦 Name this model: Qwen 2.5 3B
Truss Qwen 2.5 3B was created in ~/qwen-2.5-3b
```
This creates a directory with a `config.yaml`, a `model/` directory, and supporting files. For engine-based deployments like this one, you only need `config.yaml`. The `model/` directory is for [custom Python code](/examples/customize-a-model) when you need custom preprocessing, postprocessing, or unsupported model architectures.
***
## Write the config
Replace the contents of `config.yaml` with:
```yaml config.yaml theme={"system"}
model_name: Qwen-2.5-3B
resources:
  accelerator: L4
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen2.5-3B-Instruct"
    max_seq_len: 8192
    quantization_type: fp8
    tensor_parallel_count: 1
```
That's the entire deployment specification.
* `model_name` identifies the model in your Baseten dashboard.
* `resources` selects an L4 GPU (24 GB VRAM), which is plenty for a 3B parameter model.
* `trt_llm` tells Baseten to use [Engine-Builder-LLM](/engines/engine-builder-llm/overview), which compiles the model with TensorRT-LLM for optimized inference.
* `checkpoint_repository` points to the model weights on Hugging Face. Qwen 2.5 3B Instruct is ungated, so no access token is needed.
* `quantization_type: fp8` compresses weights to 8-bit floating point, cutting memory usage roughly in half with negligible quality loss.
* `max_seq_len: 8192` sets the maximum context length for requests.
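To see why fp8 on an L4 is a comfortable fit, here's a quick back-of-the-envelope calculation (weights only, ignoring KV cache and runtime overhead):

```python theme={"system"}
PARAMS_BILLIONS = 3  # Qwen 2.5 3B
L4_VRAM_GB = 24

fp16_weights_gb = PARAMS_BILLIONS * 2  # 2 bytes per parameter
fp8_weights_gb = PARAMS_BILLIONS * 1   # 1 byte per parameter

print(f"fp16 weights: ~{fp16_weights_gb} GB")  # ~6 GB
print(f"fp8 weights:  ~{fp8_weights_gb} GB")   # ~3 GB
# Either fits on a 24 GB L4, but fp8 leaves far more room for the KV cache,
# which is what max_seq_len: 8192 consumes at runtime.
```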
***
## Deploy
Push the model to Baseten. We'll start by deploying in development mode so we can iterate quickly:
```sh theme={"system"}
truss push --watch
```
You should see:
```output theme={"system"}
✨ Model Qwen 2.5 3B was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abc1d2ef/logs/xyz123
👀 Watching for changes to truss...
```
The logs URL contains your model ID, the string after `/models/` (e.g., `abc1d2ef`). You'll need this to call the model's API. You can also find it in your [Baseten dashboard](https://app.baseten.co/models/).
Baseten now downloads the model weights from Hugging Face, compiles them with TensorRT-LLM, and deploys the resulting container to an L4 GPU. You can watch progress in the logs linked above.
***
## Call the model
Engine-based deployments serve an OpenAI-compatible API. Once the deployment shows "Active" in the dashboard, call it using the OpenAI SDK or cURL. Replace `{model_id}` with your model ID from the deployment output.
Install the OpenAI SDK if you don't have it:
```sh theme={"system"}
uv pip install openai
```
Create a chat completion:
```python theme={"system"}
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/development/sync/v1",
)

response = client.chat.completions.create(
    model="Qwen-2.5-3B",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
```
```sh theme={"system"}
curl -s https://model-{model_id}.api.baseten.co/environments/development/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen-2.5-3B",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
```
You should see a response like:
```output theme={"system"}
Machine learning is a branch of artificial intelligence where systems learn
patterns from data to make predictions or decisions without being explicitly
programmed for each task...
```
Any code that works with the OpenAI SDK works with your deployment. Just point the `base_url` at your model's endpoint.
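The base URL follows a predictable pattern, so it's easy to build programmatically. This helper is illustrative, not part of any SDK:

```python theme={"system"}
def baseten_openai_base_url(model_id: str, environment: str = "development") -> str:
    """Build the OpenAI-compatible base URL for a Baseten deployment."""
    return f"https://model-{model_id}.api.baseten.co/environments/{environment}/sync/v1"

print(baseten_openai_base_url("abc1d2ef"))
# https://model-abc1d2ef.api.baseten.co/environments/development/sync/v1
print(baseten_openai_base_url("abc1d2ef", "production"))
# https://model-abc1d2ef.api.baseten.co/environments/production/sync/v1
```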
***
## Iterate with live reload
When you change your `config.yaml` and want to test quickly, use live reload:
```sh theme={"system"}
truss watch
```
You should see:
```output theme={"system"}
🪵 View logs for your deployment at https://app.baseten.co/models//logs/
🚰 Attempting to sync truss with remote
No changes observed, skipping patching.
👀 Watching for changes to truss...
```
When you save changes, Truss automatically syncs them with the deployed model. This saves time by patching without a full rebuild.
If you stopped the watch session, you can re-attach with:
```sh theme={"system"}
truss watch
```
When you're ready, promote the deployment with `truss push --promote`. This creates a production deployment with its own endpoint. The API URL changes from `/environments/development/` to `/environments/production/`:
```python theme={"system"}
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)
```
```sh theme={"system"}
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen-2.5-3B",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
```
Your model ID is the string after `/models/` in the logs URL from `truss push`. You can also find it in your [Baseten dashboard](https://app.baseten.co/models/).
***
## Next steps
Tune max sequence length, batch size, quantization, and runtime settings.
Add custom Python code when you need preprocessing, postprocessing, or unsupported model architectures.
Configure replicas, concurrency targets, and scale-to-zero for production traffic.
# Dockerized model
Source: https://docs.baseten.co/examples/docker
Deploy any model in a pre-built Docker container
In this example, we deploy a dockerized model for [infinity embedding server](https://github.com/michaelfeil/infinity), a high-throughput, low-latency REST API server for serving vector embeddings.
# Setting up the `config.yaml`
To deploy a dockerized model, all you need is a `config.yaml`. It specifies how to build your Docker image, start the server, and manage resources. Let’s break down each section.
## Base image
Sets the foundational Docker image to a lightweight Python 3.11 environment.
```yaml config.yaml theme={"system"}
base_image:
  image: python:3.11-slim
```
## Docker server configuration
Configures the server's startup command, health check endpoints, prediction endpoint, and the port on which the server will run.
```yaml config.yaml theme={"system"}
docker_server:
  start_command: sh -c "HF_TOKEN=$(cat /secrets/hf_access_token) infinity_emb v2 --batch-size 64 --model-id BAAI/bge-small-en-v1.5 --revision main"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /embeddings
  server_port: 7997
```
## Build commands (optional)
Pre-downloads model weights during the build phase to ensure the model is ready at container startup.
```yaml config.yaml theme={"system"}
build_commands: # optional step to download the weights of the model into the image
  - sh -c "HF_TOKEN=$(cat /secrets/hf_access_token) infinity_emb v2 --preload-only --no-model-warmup --model-id BAAI/bge-small-en-v1.5 --revision main"
```
## Configure resources
Note that we need an L4 to run this model.
```yaml config.yaml theme={"system"}
resources:
  accelerator: L4
  use_gpu: true
```
## Requirements
Lists the Python package dependencies required for the infinity embedding server.
```yaml config.yaml theme={"system"}
requirements:
  - infinity-emb[all]==0.0.72
```
## Runtime settings
Sets the server to handle up to 40 concurrent inferences to manage load efficiently.
```yaml config.yaml theme={"system"}
runtime:
  predict_concurrency: 40
```
## Environment variables
Defines essential environment variables including the Hugging Face access token, request batch size, queue size limit, and a flag to disable tracking.
```yaml config.yaml theme={"system"}
environment_variables:
  hf_access_token: null
  # constrain api to at most 256 sentences per request, for better load-balancing
  INFINITY_MAX_CLIENT_BATCH_SIZE: 256
  # constrain model to a max backpressure of INFINITY_MAX_CLIENT_BATCH_SIZE * predict_concurrency = 10241 requests
  INFINITY_QUEUE_SIZE: 10241
  DO_NOT_TRACK: 1
```
# Deploy dockerized model
Deploy the model like you would other Trusses, with:
```bash theme={"system"}
truss push infinity-embedding-server
```
# Image generation
Source: https://docs.baseten.co/examples/image-generation
Building a text-to-image model with Flux Schnell
In this example, we go through a Truss that serves a text-to-image model. We use Flux Schnell, one of the highest-performing text-to-image models available today.
# Set up imports and torch settings
In this example, we use the Hugging Face diffusers library to build our text-to-image model.
```python model/model.py theme={"system"}
import base64
import math
import random
import logging
from io import BytesIO
import numpy as np
import torch
from diffusers import FluxPipeline
from PIL import Image
logging.basicConfig(level=logging.INFO)
MAX_SEED = np.iinfo(np.int32).max
```
# Define the `Model` class and load function
In the `load` function of the Truss, we implement logic involved in
downloading and setting up the model. For this model, we use the
`FluxPipeline` class in `diffusers` to instantiate our Flux pipeline,
and configure a number of relevant parameters.
See the [diffusers docs](https://huggingface.co/docs/diffusers/index) for details
on all of these parameters.
```python model/model.py theme={"system"}
class Model:
    def __init__(self, **kwargs):
        self.pipe = None
        self.repo_id = "black-forest-labs/FLUX.1-schnell"

    def load(self):
        self.pipe = FluxPipeline.from_pretrained(self.repo_id, torch_dtype=torch.bfloat16).to("cuda")
```
This is a utility function for converting a PIL image to base64.
```python model/model.py theme={"system"}
    def convert_to_b64(self, image: Image) -> str:
        buffered = BytesIO()
        image.save(buffered, format="JPEG")
        img_b64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
        return img_b64
```
# Define the predict function
The `predict` function contains the actual inference logic. The steps here are:
* Setting up the generation params, including the prompt, image width, image height, and number of inference steps.
* Running the diffusion pipeline.
* Converting the resulting image to base64 and returning it.
```python model/model.py theme={"system"}
    def predict(self, model_input):
        seed = model_input.get("seed")
        prompt = model_input.get("prompt")
        prompt2 = model_input.get("prompt2")
        max_sequence_length = model_input.get(
            "max_sequence_length", 256
        )  # 256 is max for FLUX.1-schnell
        guidance_scale = model_input.get(
            "guidance_scale", 0.0
        )  # 0.0 is the only value for FLUX.1-schnell
        num_inference_steps = model_input.get(
            "num_inference_steps", 4
        )  # schnell is timestep-distilled
        width = model_input.get("width", 1024)
        height = model_input.get("height", 1024)

        if not math.isclose(guidance_scale, 0.0):
            logging.warning(
                "FLUX.1-schnell does not support guidance_scale other than 0.0"
            )
            guidance_scale = 0.0

        if not seed:
            seed = random.randint(0, MAX_SEED)

        if len(prompt.split()) > max_sequence_length:
            logging.warning(
                "FLUX.1-schnell does not support prompts longer than 256 tokens, truncating"
            )
            tokens = prompt.split()
            prompt = " ".join(tokens[: min(len(tokens), max_sequence_length)])

        generator = torch.Generator().manual_seed(seed)

        image = self.pipe(
            prompt=prompt,
            guidance_scale=guidance_scale,
            max_sequence_length=max_sequence_length,
            num_inference_steps=num_inference_steps,
            width=width,
            height=height,
            output_type="pil",
            generator=generator,
        ).images[0]

        b64_results = self.convert_to_b64(image)
        return {"data": b64_results}
```
# Setting up the `config.yaml`
Running Flux Schnell requires a handful of Python libraries, including
`diffusers`, `transformers`, and others.
```yaml config.yaml theme={"system"}
external_package_dirs: []
model_cache:
  - repo_id: black-forest-labs/FLUX.1-schnell
    allow_patterns:
      - "*.json"
      - "*.safetensors"
    ignore_patterns:
      - "flux1-schnell.safetensors"
model_metadata:
  example_model_input: {"prompt": 'black forest gateau cake spelling out the words "FLUX SCHNELL", tasty, food photography, dynamic shot'}
model_name: Flux.1-schnell
python_version: py311
requirements:
  - git+https://github.com/huggingface/diffusers.git@v0.32.2
  - transformers
  - accelerate
  - sentencepiece
  - protobuf
resources:
  accelerator: H100_40GB
  use_gpu: true
secrets: {}
system_packages:
  - ffmpeg
  - libsm6
  - libxext6
## Configuring resources for Flux Schnell
Note that we need an H100 40GB GPU to run this model.
```yaml config.yaml theme={"system"}
resources:
  accelerator: H100_40GB
  use_gpu: true
secrets: {}
```
## System packages
Running diffusers requires `ffmpeg` and a couple other system
packages.
```yaml config.yaml theme={"system"}
system_packages:
  - ffmpeg
  - libsm6
  - libxext6
```
## Enabling caching
Flux Schnell is a large model, and downloading it could take several minutes. This means
that the cold start time for this model is long. We can solve that by using our build
caching feature. This moves the model download to the build stage of your model.
Caching the model will take about 15 minutes initially but you will get \~20s cold starts
subsequently.
To enable caching, add the following to the config:
```yaml theme={"system"}
model_cache:
  - repo_id: black-forest-labs/FLUX.1-schnell
    allow_patterns:
      - "*.json"
      - "*.safetensors"
    ignore_patterns:
      - "flux1-schnell.safetensors"
```
# Deploy the model
Deploy the model like you would other Trusses, with:
```bash theme={"system"}
truss push flux/schnell
```
# Run an inference
Use a Python script to call the model once it's deployed and parse its response. We parse the resulting base64-encoded string output into an actual image file: `output_image.jpg`.
```python infer.py theme={"system"}
import httpx
import os
import base64
from PIL import Image
from io import BytesIO

# Replace the empty string with your model id below
model_id = ""
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Function used to convert a base64 string to a PIL image
def b64_to_pil(b64_str):
    return Image.open(BytesIO(base64.b64decode(b64_str)))

data = {
    "prompt": 'red velvet cake spelling out the words "FLUX SCHNELL", tasty, food photography, dynamic shot'
}

# Call model endpoint
res = httpx.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json=data,
    timeout=None  # image generation can exceed httpx's default 5-second timeout
)

# Get output image
res = res.json()
output = res.get("data")

# Convert the base64 model output to an image
img = b64_to_pil(output)
img.save("output_image.jpg")
```
# Building with Baseten
Source: https://docs.baseten.co/examples/overview
These examples cover a variety of use cases on Baseten, from [deploying your first LLM](/examples/deploy-your-first-model) and [image generation](/examples/image-generation) to [transcription](/examples/chains-audio-transcription), [embeddings](/examples/bei), and [RAG pipelines](/examples/chains-build-rag). Whether you're optimizing inference with [TensorRT-LLM](/examples/tensorrt-llm) or deploying a model with [Truss](/development/model/overview), these guides help you build and scale efficiently.
## Choosing the right engine
Not sure which engine to use? Check out our [engine documentation](/engines) to:
* **Select the appropriate engine** for your model architecture (embeddings, dense LLMs, or MoE models)
* **Understand performance trade-offs** between different engine options
* **Configure advanced features** like quantization and speculative decoding
* **Optimize for your specific use case** with engine-specific guidance
## Featured examples
## Training
Train and fine-tune models with Baseten's scalable training infrastructure. From [fine-tuning large language models](/training/getting-started) to training custom models, our platform provides the tools and compute you need.
Our training infrastructure supports popular frameworks including VERL, Megatron, and Unsloth, as well as models trained directly with Hugging Face Transformers.
# Deploy LLMs with SGLang
Source: https://docs.baseten.co/examples/sglang
Deploy models with SGLang on Baseten
[SGLang](https://docs.sglang.ai/) is a high-performance serving framework for LLMs that supports a wide range of models and optimization techniques. This guide deploys an SGLang model as a custom Docker server on Baseten.
## Example: Deploy Qwen 2.5 3B on an L4
This configuration serves [Qwen 2.5 3B](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) with SGLang on an L4 GPU. The deployment process is the same for larger models like [GLM-4.7](https://huggingface.co/zai-org/GLM-4.7). Adjust the `resources` and `start_command` to match your model's requirements.
## Set up your environment
Before you deploy a model, complete these three setup steps.
Create an [API key](https://app.baseten.co/settings/api_keys) and save it as an environment variable:
```sh theme={"system"}
export BASETEN_API_KEY="abcd.123456"
```
Some models require that you accept terms and conditions on Hugging Face before deployment. To prevent issues:
1. Accept the license for any gated models you wish to access, like [Gemma 3](https://huggingface.co/google/gemma-3-27b-it).
2. Create a read-only [user access token](https://huggingface.co/docs/hub/en/security-tokens) from your Hugging Face account.
3. Add the `hf_access_token` secret [to your Baseten workspace](https://app.baseten.co/settings/secrets).
Install [Truss](https://pypi.org/project/truss/) and the OpenAI SDK:
```sh theme={"system"}
uv venv && source .venv/bin/activate
uv pip install --upgrade truss openai
```
```sh theme={"system"}
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade truss openai
```
```sh theme={"system"}
python3 -m venv .venv && .venv\Scripts\activate
pip install --upgrade truss openai
```
## Configure the model
Create a directory with a `config.yaml` file:
```sh theme={"system"}
mkdir qwen-2-5-3b-sglang
touch qwen-2-5-3b-sglang/config.yaml
```
Copy the following configuration into `config.yaml`:
```yaml config.yaml theme={"system"}
model_metadata:
  example_model_input:
    messages:
      - role: system
        content: "You are a helpful assistant."
      - role: user
        content: "What does Tongyi Qianwen mean?"
    stream: true
    model: Qwen/Qwen2.5-3B-Instruct
    max_tokens: 512
    temperature: 0.6
  tags:
    - openai-compatible
model_name: Qwen 2.5 3B SGLang
base_image:
  image: lmsysorg/sglang:v0.5.8.post1
docker_server:
  start_command: sh -c "truss-transfer-cli && python3 -m sglang.launch_server --model-path /app/model_cache/qwen --served-model-name Qwen/Qwen2.5-3B-Instruct --host 0.0.0.0 --port 8000"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
model_cache:
  - repo_id: Qwen/Qwen2.5-3B-Instruct
    revision: aa8e72537993ba99e69dfaafa59ed015b17504d1
    use_volume: true
    volume_folder: qwen
resources:
  accelerator: L4
  use_gpu: true
runtime:
  predict_concurrency: 32
health_checks:
  restart_check_delay_seconds: 300
  restart_threshold_seconds: 300
  stop_traffic_threshold_seconds: 120
environment_variables:
  hf_access_token: null
```
The `base_image` specifies the [SGLang Docker image](https://hub.docker.com/r/lmsysorg/sglang/tags). The `model_cache` pre-downloads the model from Hugging Face and stores it on a [cached volume](/development/model/model-cache). At startup, `truss-transfer-cli` loads the cached weights into `/app/model_cache/qwen`, then SGLang serves the model with `--served-model-name` to set the model identifier for the OpenAI-compatible API. The `readiness_endpoint` and `liveness_endpoint` use `/health`, which returns 200 once the server is running. The `health_checks` give the server time to load the model before Baseten checks readiness.
## Deploy the model
Push the model to Baseten to start the deployment:
```sh theme={"system"}
truss push qwen-2-5-3b-sglang
```
You should see output like:
```
Deploying truss using L4:4x16 instance type.
Model Qwen 2.5 3B SGLang was successfully pushed.
View logs at https://app.baseten.co/models/XXXXXXX/logs/XXXXXXX
```
Copy the model URL from the output for the next step.
## Call the model
Call the deployed model with the OpenAI client:
```python call_model.py theme={"system"}
import os
from openai import OpenAI
model_url = "https://model-XXXXXXX.api.baseten.co/environments/production/sync/v1"
client = OpenAI(
base_url=model_url,
api_key=os.environ.get("BASETEN_API_KEY"),
)
stream = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What does Tongyi Qianwen mean?"}
],
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta.content is not None:
print(chunk.choices[0].delta.content, end="")
```
Replace the `model_url` with the URL from your deployment output.
# Speculative decoding examples
Source: https://docs.baseten.co/examples/speculative-decoding
Lookahead decoding configurations for faster inference
Speculative decoding with [lookahead decoding](/engines/engine-builder-llm/lookahead-decoding) accelerates inference for predictable workloads using n-gram patterns.
## Quick start
```yaml theme={"system"}
trt_llm:
build:
speculator:
enable_b10_lookahead: true
speculative_decoding_mode: LOOKAHEAD_DECODING
lookahead_windows_size: 8
lookahead_ngram_size: 1
lookahead_verification_set_size: 1
```
## Engine compatibility
| Feature | [Engine-Builder-LLM](/engines/engine-builder-llm/overview) | [BIS-LLM](/engines/bis-llm/overview) |
| ---------------------- | ---------------------------------------------------------- | ------------------------------------ |
| **Lookahead decoding** | ✅ Supported | ✅ Gated Feature |
| **Structured outputs** | ❌ Incompatible | ✅ Supported |
| **Tool calling** | ❌ Incompatible | ✅ Supported |
| **Eagle speculation** | ❌ Not supported | ✅ Gated Feature |
## Configuration examples
### Code generation (Qwen2.5-Coder)
```yaml theme={"system"}
model_name: Qwen2.5-Coder-7B-Lookahead
resources:
accelerator: H100
trt_llm:
build:
base_model: decoder
checkpoint_repository:
source: HF
repo: "Qwen/Qwen2.5-Coder-7B-Instruct"
quantization_type: fp8_kv
speculator:
enable_b10_lookahead: true
speculative_decoding_mode: LOOKAHEAD_DECODING
lookahead_windows_size: 3
lookahead_ngram_size: 8
lookahead_verification_set_size: 3
```
### Large model (Llama-3.3-70B)
```yaml theme={"system"}
model_name: Llama-3.3-70B-Lookahead
resources:
accelerator: H100:2
trt_llm:
build:
base_model: decoder
checkpoint_repository:
source: HF
repo: "meta-llama/Llama-3.3-70B-Instruct"
quantization_type: fp8_kv
tensor_parallel_count: 2
speculator:
enable_b10_lookahead: true
speculative_decoding_mode: LOOKAHEAD_DECODING
lookahead_windows_size: 3
lookahead_ngram_size: 5
lookahead_verification_set_size: 3
```
## Parameter tuning
See [lookahead decoding documentation](/engines/engine-builder-llm/lookahead-decoding) for detailed parameter explanations.
**Quick guidelines:**
* **lookahead\_windows\_size**: 1-7 (set to 1 for predictable content; 3 or 5 otherwise)
* **lookahead\_ngram\_size**: 4-32 (larger for code, smaller for creative tasks)
* **lookahead\_verification\_set\_size**: Usually equal to lookahead\_windows\_size
## Use cases
| Use case | lookahead\_windows\_size | lookahead\_ngram\_size | Why |
| ----------------------- | ------------------------ | ---------------------- | ------------------------------ |
| **Code generation** | 7 | 3 | Code patterns, smaller n-grams |
| **Free-form JSON/YAML** | 5 | 5 | Balanced for structured data |
| **Template completion** | 7-10 | 5-7 | Highly predictable content |
## Limitations
❌ **Not compatible with:**
* [Structured outputs](/engines/performance-concepts/structured-outputs) - Use BIS-LLM instead
* [Function calling](/engines/performance-concepts/function-calling) - Use BIS-LLM instead
* BIS-LLM engine - lookahead decoding on the V2 stack is a gated feature and isn't self-serviceable.
## Related
* [Lookahead decoding guide](/engines/engine-builder-llm/lookahead-decoding) - Complete reference config.
* [Engine-Builder-LLM overview](/engines/engine-builder-llm/overview) - Dense model engine.
* [BIS-LLM overview](/engines/bis-llm/overview) - MoE engine with structured outputs.
* [Quantization guide](/engines/performance-concepts/quantization-guide) - Performance optimization.
# LLM with Streaming
Source: https://docs.baseten.co/examples/streaming
Building an LLM with streaming output
In this example, we go through a Truss that serves the Qwen 7B Chat LLM, and streams the output to the client.
# Why streaming?
For certain ML models, generation can take a long time. With LLMs especially, a long output can take
10-20 seconds to generate. However, because LLMs generate tokens in sequence, useful output can be
made available to users sooner. To enable this, Truss supports streaming output.
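The pattern is simple: the model's `predict` returns a Python generator, and each yielded chunk is flushed to the client as it's produced. Here is a toy sketch of the idea with no model involved (the token list is purely illustrative):

```python
import time

def fake_llm_tokens(prompt: str):
    # Stand-in for a model that produces tokens one at a time.
    for token in ["Streaming", " lets", " users", " read", " output", " early."]:
        time.sleep(0.01)  # simulate per-token generation latency
        yield token

def predict(request: dict):
    # Returning a generator from predict() is what enables streaming.
    return fake_llm_tokens(request["prompt"])

for chunk in predict({"prompt": "hi"}):
    print(chunk, end="", flush=True)
```

The rest of this example applies the same pattern to a real LLM.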
# Set up the imports
In this example, we use the Hugging Face `transformers` library to build a text generation model.
```python model/model.py theme={"system"}
from threading import Thread
from typing import Dict
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from transformers.generation import GenerationConfig
```
# Define the load function
In the `load` function of the Truss, we implement the logic for downloading the chat version of the Qwen 7B model and loading it into memory.
```python model/model.py theme={"system"}
class Model:
def __init__(self, **kwargs):
self.model = None
self.tokenizer = None
def load(self):
self.tokenizer = AutoTokenizer.from_pretrained(
"Qwen/Qwen-7B-Chat", trust_remote_code=True
)
self.model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True
).eval()
```
# Define the preprocess function
In the `preprocess` function of the Truss, we set up a `generate_args` dictionary with some generation arguments from the inference request to be used in the `predict` function.
```python model/model.py theme={"system"}
def preprocess(self, request: dict) -> dict:
generate_args = {
"max_new_tokens": request.get("max_new_tokens", 512),
"temperature": request.get("temperature", 0.5),
"top_p": request.get("top_p", 0.95),
"top_k": request.get("top_k", 40),
"repetition_penalty": 1.0,
"no_repeat_ngram_size": 0,
"use_cache": True,
"do_sample": True,
"eos_token_id": self.tokenizer.eos_token_id,
"pad_token_id": self.tokenizer.pad_token_id,
}
request["generate_args"] = generate_args
return request
```
# Define the predict function
In the `predict` function of the Truss, we implement the actual
inference logic.
The two main steps are:
* Tokenize the input
* Call the model's `generate` function if we're not streaming the output, otherwise call the `stream` helper function
```python model/model.py theme={"system"}
def predict(self, request: Dict):
stream = request.pop("stream", False)
prompt = request.pop("prompt")
generation_args = request.pop("generate_args")
input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids.cuda()
if stream:
return self.stream(input_ids, generation_args)
with torch.no_grad():
output = self.model.generate(inputs=input_ids, **generation_args)
return self.tokenizer.decode(output[0])
```
## Define the `stream` helper function
In this helper function, we'll instantiate the `TextIteratorStreamer` object, which we'll later use for
returning the LLM output to users.
```python model/model.py theme={"system"}
def stream(self, input_ids: list, generation_args: dict):
streamer = TextIteratorStreamer(self.tokenizer)
```
When creating the generation parameters, make sure to pass the `streamer` object
created previously.
```python model/model.py theme={"system"}
generation_config = GenerationConfig(**generation_args)
generation_kwargs = {
"input_ids": input_ids,
"generation_config": generation_config,
"return_dict_in_generate": True,
"output_scores": True,
"max_new_tokens": generation_args["max_new_tokens"],
"streamer": streamer,
}
```
Spawn a thread to run the generation, so that it does not block the main
thread.
```python model/model.py theme={"system"}
with torch.no_grad():
# Begin generation in a separate thread
thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
thread.start()
```
In Truss, the way to achieve streaming output is to return a generator
that yields content. In this example, we yield the output of the `streamer`,
which produces output and yields it until the generation is complete.
We define this `inner` function to create our generator.
```python model/model.py theme={"system"}
# Yield generated text as it becomes available
def inner():
for text in streamer:
yield text
thread.join()
return inner()
```
# Setting up the `config.yaml`
Running Qwen 7B requires torch, transformers,
and a few other related libraries.
```yaml config.yaml theme={"system"}
model_name: qwen-7b-chat
model_metadata:
example_model_input:
prompt: What is the meaning of life?
requirements:
- accelerate==0.23.0
- tiktoken==0.5.1
- einops==0.6.1
- scipy==1.11.3
- transformers_stream_generator==0.0.4
- peft==0.5.0
- deepspeed==0.11.1
- torch==2.0.1
- transformers==4.32.0
```
## Configure resources for Qwen
We will use an L4 to run this model.
```yaml config.yaml theme={"system"}
resources:
accelerator: L4
cpu: "4"
memory: 16Gi
use_gpu: true
```
# Deploy Qwen 7B Chat
Deploy the model like you would other Trusses, with:
```bash theme={"system"}
truss push qwen-7b-chat
```
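Once the deployment is live, you can stream from it over HTTP. Below is a sketch of a client, assuming the standard Baseten predict endpoint and a `BASETEN_API_KEY` environment variable; the `build_payload` helper is just for illustration, and you need to fill in your own model ID:

```python
import os

def build_payload(prompt: str, stream: bool = True, max_new_tokens: int = 512) -> dict:
    # Fields match what preprocess() and predict() read from the request.
    return {"prompt": prompt, "stream": stream, "max_new_tokens": max_new_tokens}

if __name__ == "__main__":
    import requests  # third-party; pip install requests

    model_id = ""  # find this in your Baseten model dashboard
    resp = requests.post(
        f"https://model-{model_id}.api.baseten.co/production/predict",
        headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
        json=build_payload("What is the meaning of life?"),
        stream=True,  # keep the connection open and read chunks as they arrive
    )
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)
```

Because `predict` returns a generator when `stream` is true, the chunks print as they are generated rather than all at once.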
# Fast LLMs with TensorRT-LLM
Source: https://docs.baseten.co/examples/tensorrt-llm
Optimize LLMs for low latency and high throughput
To get the best performance, we recommend using our [TensorRT-LLM Engine-Builder](/engines/engine-builder-llm/overview) when deploying LLMs. Models deployed with the Engine-Builder are [OpenAI compatible](/inference/calling-your-model), support [structured output](/engines/performance-concepts/structured-outputs) and [function calling](/engines/performance-concepts/function-calling), and offer deploy-time post-training quantization to FP8 with Hopper GPUs and NVFP4 with Blackwell GPUs.
The Engine-Builder supports LLMs from the following families, both foundation models and fine-tunes:
* Llama 3.0 and later (including DeepSeek-R1 distills)
* Qwen 2.5 and later (including Math, Coder, and DeepSeek-R1 distills)
* Mistral (all LLMs)
You can find preset Engine-Builder configs for common models in the [Engine-Builder reference](/engines/engine-builder-llm/engine-builder-config).
The Engine-Builder does not support vision-language models like Llama 3.2 11B or Pixtral. For these models, we recommend [vLLM](/examples/vllm).
## Example: Deploy Qwen 2.5 3B on an H100
This configuration builds an inference engine to serve [Qwen 2.5 3B](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) on an H100 GPU. This model is fast and cheap to run, making it a good example for documentation, and the deployment process is very similar for larger models like [GLM-4.7](https://huggingface.co/zai-org/GLM-4.7).
## Setup
Before you deploy a model, you'll need three quick setup steps.
Create an [API key](https://app.baseten.co/settings/api_keys) and save it as an environment variable:
```sh theme={"system"}
export BASETEN_API_KEY="abcd.123456"
```
Some models require that you accept terms and conditions on Hugging Face before deployment. To prevent issues:
1. Accept the license for any gated models you wish to access, like [Gemma 3](https://huggingface.co/google/gemma-3-27b-it).
2. Create a read-only [user access token](https://huggingface.co/docs/hub/en/security-tokens) from your Hugging Face account.
3. Add the `hf_access_token` secret [to your Baseten workspace](https://app.baseten.co/settings/secrets).
Install [Truss](https://pypi.org/project/truss/) and the OpenAI SDK:
```sh theme={"system"}
uv venv && source .venv/bin/activate
uv pip install --upgrade truss openai
```
```sh theme={"system"}
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade truss openai
```
```sh theme={"system"}
python3 -m venv .venv && .venv\Scripts\activate
pip install --upgrade truss openai
```
## Configuration
Start with an empty configuration file.
```sh theme={"system"}
mkdir qwen-2-5-3b-engine
touch qwen-2-5-3b-engine/config.yaml
```
This configuration file specifies model information and Engine-Builder arguments. You can find details on each config option in the [Engine-Builder reference](/engines/engine-builder-llm/engine-builder-config).
Below is an example for Qwen 2.5 3B.
```yaml config.yaml theme={"system"}
model_metadata:
example_model_input: # Loads sample request into Baseten playground
messages:
- role: system
content: "You are a helpful assistant."
- role: user
content: "What does Tongyi Qianwen mean?"
stream: true
max_tokens: 512
temperature: 0.6 # Check recommended temperature per model
repo_id: Qwen/Qwen2.5-3B-Instruct
model_name: Qwen 2.5 3B Instruct
python_version: py39
resources: # Engine-Builder GPU cannot be changed post-deployment
accelerator: H100
use_gpu: true
secrets: {}
trt_llm:
build:
base_model: decoder
checkpoint_repository:
repo: Qwen/Qwen2.5-3B-Instruct
source: HF
num_builder_gpus: 1
quantization_type: no_quant # `fp8_kv` often recommended for large models
    max_seq_len: 32768 # Option to vary the max sequence length, e.g. 131072 for Llama models
tensor_parallel_count: 1 # Set equal to number of GPUs
plugin_configuration:
use_paged_context_fmha: true
use_fp8_context_fmha: false # Set to true when using `fp8_kv`
paged_kv_cache: true
runtime:
batch_scheduler_policy: max_utilization
enable_chunked_context: true
request_default_max_tokens: 32768 # 131072 for Llama models
```
## Deployment
Pushing the model to Baseten kicks off a multi-stage build and deployment process.
```sh theme={"system"}
truss push qwen-2-5-3b-engine
```
Upon deployment, check your terminal logs or Baseten account to find the URL for the model server.
## Inference
This model is OpenAI compatible and can be called using the OpenAI client.
```python theme={"system"}
import os
from openai import OpenAI
# https://model-XXXXXXX.api.baseten.co/environments/production/sync/v1
model_url = ""
client = OpenAI(
base_url=model_url,
api_key=os.environ.get("BASETEN_API_KEY"),
)
stream = client.chat.completions.create(
model="baseten",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What does Tongyi Qianwen mean?"}
],
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta.content is not None:
print(chunk.choices[0].delta.content, end="")
```
That's it! You have successfully deployed and called an LLM optimized with the TensorRT-LLM Engine-Builder. Check the [Engine-Builder reference](/engines/engine-builder-llm/engine-builder-config) for details on each config option.
# Text to speech
Source: https://docs.baseten.co/examples/text-to-speech
Building a text-to-speech model with Kokoro
In this example, we go through a Truss that serves Kokoro, a frontier text-to-speech model.
# Set up imports
We import necessary libraries and enable Hugging Face file transfers. We also download the NLTK tokenizer data.
```python model/model.py theme={"system"}
import logging
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
import base64
import io
import sys
import time
import nltk
import numpy as np
import scipy.io.wavfile as wav
import torch
from huggingface_hub import snapshot_download
from nltk.tokenize import sent_tokenize
from models import build_model
from kokoro import generate
logger = logging.getLogger(__name__)
nltk.download("punkt")
```
# Downloading model weights
We need to prepare model weights by doing the following:
* Create a directory for the model data
* Download the Kokoro model from Hugging Face into the created model data directory
* Add the model data directory to the system path
```python model/model.py theme={"system"}
# Ensure data directory exists
os.makedirs("/app/data/Kokoro-82M", exist_ok=True)
# Download model
snapshot_download(
repo_id="hexgrad/Kokoro-82M",
repo_type="model",
revision="c97b7bbc3e60f447383c79b2f94fee861ff156ac",
local_dir="/app/data/Kokoro-82M",
ignore_patterns=["*.onnx", "kokoro-v0_19.pth", "demo/"],
max_workers=8,
)
# Add data_dir to the system path
sys.path.append("/app/data/Kokoro-82M")
```
# Define the `Model` class and `load` function
In the `load` function of the Truss, we download and set up the model. This `load` function handles setting up the device, loading the model weights, and loading the default voice. We also define the available voices.
```python model/model.py theme={"system"}
class Model:
def __init__(self, **kwargs):
self._data_dir = kwargs["data_dir"]
self.model = None
self.device = None
self.default_voice = None
self.voices = None
return
def load(self):
logger.info("Starting setup...")
self.device = "cuda" if torch.cuda.is_available() else "cpu"
logger.info(f"Using device: {self.device}")
# Load model
logger.info("Loading model...")
model_path = "/app/data/Kokoro-82M/fp16/kokoro-v0_19-half.pth"
logger.info(f"Model path: {model_path}")
if not os.path.exists(model_path):
            logger.error(f"Error: Model file not found at {model_path}")
raise FileNotFoundError(f"Model file not found at {model_path}")
try:
self.model = build_model(model_path, self.device)
logger.info("Model loaded successfully")
except Exception as e:
            logger.error(f"Error loading model: {str(e)}")
raise
# Load default voice
logger.info("Loading default voice...")
voice_path = "/app/data/Kokoro-82M/voices/af.pt"
if not os.path.exists(voice_path):
            logger.error(f"Error: Voice file not found at {voice_path}")
raise FileNotFoundError(f"Voice file not found at {voice_path}")
try:
self.default_voice = torch.load(voice_path).to(self.device)
logger.info("Default voice loaded successfully")
except Exception as e:
            logger.error(f"Error loading default voice: {str(e)}")
raise
# Dictionary of available voices
self.voices = {
"default": "af",
"bella": "af_bella",
"sarah": "af_sarah",
"adam": "am_adam",
"michael": "am_michael",
"emma": "bf_emma",
"isabella": "bf_isabella",
"george": "bm_george",
"lewis": "bm_lewis",
"nicole": "af_nicole",
"sky": "af_sky",
}
return
```
# Define the `predict` function
The `predict` function contains the actual inference logic. The steps here are:
* Process input text and handle voice selection
* Chunk text for long inputs
* Generate audio
* Convert resulting audio to base64 and return it
```python model/model.py theme={"system"}
def predict(self, model_input):
# Run model inference here
start = time.time()
text = str(model_input.get("text", "Hi, I'm kokoro"))
voice = str(model_input.get("voice", "af"))
speed = float(model_input.get("speed", 1.0))
logger.info(
f"Text has {len(text)} characters. Using voice {voice} and speed {speed}."
)
if voice != "af":
voicepack = torch.load(f"/app/data/Kokoro-82M/voices/{voice}.pt").to(
self.device
)
else:
voicepack = self.default_voice
if len(text) >= 400:
logger.info("Text is longer than 400 characters, splitting into sentences.")
wavs = []
def group_sentences(text, max_length=400):
sentences = sent_tokenize(text)
# Split long sentences
while max([len(sent) for sent in sentences]) > max_length:
max_sent = max(sentences, key=len)
sentences_before = sentences[: sentences.index(max_sent)]
sentences_after = sentences[sentences.index(max_sent) + 1 :]
new_sentences = [
s.strip() + "." for s in max_sent.split(".") if s.strip()
]
sentences = sentences_before + new_sentences + sentences_after
return sentences
sentences = group_sentences(text)
logger.info(f"Processing {len(sentences)} chunks. Starting generation...")
for sent in sentences:
if sent.strip():
audio, _ = generate(
self.model, sent.strip(), voicepack, lang=voice[0], speed=speed
)
# Remove potential artifacts at the end
audio = audio[:-2000] if len(audio) > 2000 else audio
wavs.append(audio)
# Concatenate all audio chunks
audio = np.concatenate(wavs)
else:
logger.info("No splitting needed. Generating audio...")
audio, _ = generate(self.model, text, voicepack, lang=voice[0], speed=speed)
# Write audio to in-memory buffer
buffer = io.BytesIO()
wav.write(buffer, 24000, audio)
wav_bytes = buffer.getvalue()
duration_seconds = len(audio) / 24000
logger.info(
f"Generation took {time.time()-start} seconds to generate {duration_seconds:.2f} seconds of audio"
)
return {"base64": base64.b64encode(wav_bytes).decode("utf-8")}
```
# Setting up the `config.yaml`
Running Kokoro requires a handful of Python libraries, including `torch`, `transformers`, and others.
```yaml config.yaml theme={"system"}
build_commands:
- python3 -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"
environment_variables: {}
model_metadata:
example_model_input: {"text": "Kokoro is a frontier TTS model for its size of 82 million parameters (text in/audio out). On 25 Dec 2024, Kokoro v0.19 weights were permissively released in full fp32 precision under an Apache 2.0 license. As of 2 Jan 2025, 10 unique Voicepacks have been released, and a .onnx version of v0.19 is available.In the weeks leading up to its release, Kokoro v0.19 was the #1🥇 ranked model in TTS Spaces Arena. Kokoro had achieved higher Elo in this single-voice Arena setting over other models, using fewer parameters and less data. Kokoro's ability to top this Elo ladder suggests that the scaling law (Elo vs compute/data/params) for traditional TTS models might have a steeper slope than previously expected.", "voice": "af", "speed": 1.0}
model_name: kokoro
python_version: py311
requirements:
- torch==2.5.1
- transformers==4.48.0
- scipy==1.15.1
- phonemizer==3.3.0
- nltk==3.9.1
- numpy
- huggingface_hub[hf_transfer]
- hf_transfer==0.1.9
- munch==4.0.0
resources:
accelerator: T4
use_gpu: true
runtime:
predict_concurrency: 1
secrets: {}
system_packages:
- espeak-ng
```
## Configuring resources for Kokoro
Note that we need a T4 GPU to run this model.
```yaml config.yaml theme={"system"}
resources:
accelerator: T4
use_gpu: true
```
## System packages
Running Kokoro requires `espeak-ng` to synthesize speech output.
```yaml config.yaml theme={"system"}
system_packages:
- espeak-ng
```
# Deploy the model
Deploy the model like you would other Trusses by running the following command:
```bash theme={"system"}
truss push kokoro
```
# Run an inference
Use a Python script to call the deployed model and parse its response. In this example, the script sends text input to the model and saves the returned audio (decoded from base64) as a WAV file: `output.wav`.
```python infer.py theme={"system"}
import httpx
import base64
import os
# Replace the empty string with your model id below
model_id = ""
baseten_api_key = os.environ["BASETEN_API_KEY"]
with httpx.Client() as client:
# Make the API request
resp = client.post(
f"https://model-{model_id}.api.baseten.co/production/predict",
headers={"Authorization": f"Api-Key {baseten_api_key}"},
json={"text": "Hello world", "voice": "af", "speed": 1.0},
timeout=None,
)
# Get the base64 encoded audio
response_data = resp.json()
audio_base64 = response_data["base64"]
# Decode the base64 string
audio_bytes = base64.b64decode(audio_base64)
# Write to a WAV file
with open("output.wav", "wb") as f:
f.write(audio_bytes)
print("Audio saved to output.wav")
```
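As a quick sanity check (not part of the original script), you can verify that the decoded bytes form a valid WAV container before playing them. This helper only inspects the RIFF header:

```python
def looks_like_wav(data: bytes) -> bool:
    # A WAV file starts with a "RIFF" chunk whose format field is "WAVE":
    # bytes 0-3 are b"RIFF", bytes 8-11 are b"WAVE".
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"

# Example: looks_like_wav(open("output.wav", "rb").read())
```

If the check fails, inspect the raw response body; an error message from the server would fail base64 decoding or produce non-WAV bytes.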
# Run any LLM with vLLM
Source: https://docs.baseten.co/examples/vllm
Deploy models with vLLM on Baseten
[vLLM](https://docs.vllm.ai/) supports a wide range of models and performance optimizations. This guide deploys a vLLM model as a custom Docker server on Baseten.
## Example: Deploy Qwen 2.5 3B on an L4
This configuration serves [Qwen 2.5 3B](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) with vLLM on an L4 GPU. The deployment process is the same for larger models like [GLM-4.7](https://huggingface.co/zai-org/GLM-4.7). Adjust the `resources` and `start_command` to match your model's requirements.
## Set up your environment
Before you deploy a model, you'll need three setup steps.
Create an [API key](https://app.baseten.co/settings/api_keys) and save it as an environment variable:
```sh theme={"system"}
export BASETEN_API_KEY="abcd.123456"
```
Some models require that you accept terms and conditions on Hugging Face before deployment. To prevent issues:
1. Accept the license for any gated models you wish to access, like [Gemma 3](https://huggingface.co/google/gemma-3-27b-it).
2. Create a read-only [user access token](https://huggingface.co/docs/hub/en/security-tokens) from your Hugging Face account.
3. Add the `hf_access_token` secret [to your Baseten workspace](https://app.baseten.co/settings/secrets).
Install [Truss](https://pypi.org/project/truss/) and the OpenAI SDK:
```sh theme={"system"}
uv venv && source .venv/bin/activate
uv pip install --upgrade truss openai
```
```sh theme={"system"}
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade truss openai
```
```sh theme={"system"}
python3 -m venv .venv && .venv\Scripts\activate
pip install --upgrade truss openai
```
## Configure the model
Create a directory with a `config.yaml` file:
```sh theme={"system"}
mkdir qwen-2-5-3b-vllm
touch qwen-2-5-3b-vllm/config.yaml
```
Copy the following configuration into `config.yaml`:
```yaml config.yaml theme={"system"}
model_metadata:
example_model_input:
messages:
- role: system
content: "You are a helpful assistant."
- role: user
content: "What does Tongyi Qianwen mean?"
stream: true
model: Qwen/Qwen2.5-3B-Instruct
max_tokens: 512
temperature: 0.6
tags:
- openai-compatible
model_name: Qwen 2.5 3B vLLM
base_image:
image: vllm/vllm-openai:v0.15.1
docker_server:
start_command: sh -c "truss-transfer-cli && vllm serve /app/model_cache/qwen --served-model-name Qwen/Qwen2.5-3B-Instruct --host 0.0.0.0 --port 8000 --enable-prefix-caching"
readiness_endpoint: /health
liveness_endpoint: /health
predict_endpoint: /v1/chat/completions
server_port: 8000
model_cache:
- repo_id: Qwen/Qwen2.5-3B-Instruct
revision: aa8e72537993ba99e69dfaafa59ed015b17504d1
use_volume: true
volume_folder: qwen
resources:
accelerator: L4
use_gpu: true
runtime:
predict_concurrency: 256
health_checks:
restart_check_delay_seconds: 300
restart_threshold_seconds: 300
stop_traffic_threshold_seconds: 120
environment_variables:
hf_access_token: null
```
The `base_image` specifies the [vLLM Docker image](https://hub.docker.com/r/vllm/vllm-openai/tags). The `model_cache` pre-downloads the model from Hugging Face and stores it on a [cached volume](/development/model/model-cache). At startup, `truss-transfer-cli` loads the cached weights into `/app/model_cache/qwen`, then vLLM serves the model with `--served-model-name` to set the model identifier for the OpenAI-compatible API. The `health_checks` give the server time to load the model before Baseten checks readiness.
## Deploy the model
Push the model to Baseten to start the deployment:
```sh theme={"system"}
truss push qwen-2-5-3b-vllm
```
You should see output like:
```
Deploying truss using L4:4x16 instance type.
Model Qwen 2.5 3B vLLM was successfully pushed.
View logs at https://app.baseten.co/models/XXXXXXX/logs/XXXXXXX
```
Copy the model URL from the output for the next step.
## Call the model
Call the deployed model with the OpenAI client:
```python call_model.py theme={"system"}
import os
from openai import OpenAI
model_url = "https://model-XXXXXXX.api.baseten.co/environments/production/sync/v1"
client = OpenAI(
base_url=model_url,
api_key=os.environ.get("BASETEN_API_KEY"),
)
stream = client.chat.completions.create(
model="Qwen/Qwen2.5-3B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What does Tongyi Qianwen mean?"}
],
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta.content is not None:
print(chunk.choices[0].delta.content, end="")
```
Replace the `model_url` with the URL from your deployment output.
# Async inference
Source: https://docs.baseten.co/inference/async
Run asynchronous inference on deployed models
Async inference is a *fire and forget* pattern for model requests. Instead of
waiting for a response, you receive a request ID immediately while inference
runs in the background. When complete, results are delivered to your webhook
endpoint.
Async requests work with any deployed model; no code changes are needed.
Requests can queue for up to 72 hours and run for up to 1 hour. Async inference is not
compatible with streaming output.
Use async inference for:
* **Long-running tasks** that would otherwise hit request timeouts.
* **Batch processing** where you don't need immediate responses.
* **Priority queuing** to serve VIP customers faster.
Baseten does not store model outputs. If webhook delivery fails after all retries,
your data is lost. See [Webhook delivery](#webhook-delivery) for mitigation
strategies.
## Quick start
Create an HTTPS endpoint to receive results.
Use [this Repl](https://replit.com/@baseten-team/Baseten-Async-Inference-Starter-Code) as a starting point, or deploy to any service that can receive POST requests.
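If you just want to watch payloads arrive during local testing, a minimal receiver can be sketched with the Python standard library. This is a local-development sketch only; production webhooks must be served over HTTPS, and a real service would use a web framework:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        # Baseten includes the request ID in the X-BASETEN-REQUEST-ID header.
        request_id = self.headers.get("X-BASETEN-REQUEST-ID")
        print(f"Result for request {request_id}: {payload}")
        self.send_response(200)  # acknowledge delivery
        self.end_headers()

# To run locally (then expose it via an HTTPS tunnel for testing):
# HTTPServer(("", 8000), WebhookHandler).serve_forever()
```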
Call your model's `/async_predict` endpoint with your webhook URL:
```python theme={"system"}
import requests
import os
model_id = "YOUR_MODEL_ID"
webhook_endpoint = "YOUR_WEBHOOK_ENDPOINT"
baseten_api_key = os.environ["BASETEN_API_KEY"]
# Call the async_predict endpoint of the production deployment
resp = requests.post(
f"https://model-{model_id}.api.baseten.co/production/async_predict",
headers={"Authorization": f"Api-Key {baseten_api_key}"},
json={
"model_input": {"prompt": "hello world!"},
"webhook_endpoint": webhook_endpoint,
# "priority": 0,
# "max_time_in_queue_seconds": 600,
},
)
print(resp.json())
```
You'll receive a `request_id` immediately.
When inference completes, Baseten sends a POST request to your webhook with the model output.
See [Webhook payload](#webhook-payload) for the response format.
**Chains** support async inference through `async_run_remote`.
Inference requests to the entrypoint are queued, but internal Chainlet-to-Chainlet calls run synchronously.
## How async works
Async inference decouples request submission from processing, letting you queue work without waiting for results.
### Request lifecycle
When you submit an async request:
1. You call `/async_predict` and immediately receive a `request_id`.
2. Your request enters a queue managed by the Async Request Service.
3. A background worker picks up your request and calls your model's predict endpoint.
4. Your model runs inference and returns a response.
5. Baseten sends the response to your webhook URL using POST.
The `max_time_in_queue_seconds` parameter controls how long a request waits
before expiring. It defaults to 10 minutes but can extend to 72 hours.
### Autoscaling behavior
The async queue is decoupled from model scaling. Requests queue successfully
even when your model has zero replicas.
When your model is scaled to zero:
1. Your request enters the queue while the model has no running replicas.
2. The queue processor attempts to call your model, triggering the autoscaler.
3. Your request waits while the model cold-starts.
4. Once the model is ready, inference runs and completes.
5. Baseten delivers the result to your webhook.
If the model doesn't become ready within `max_time_in_queue_seconds`, the
request expires with status `EXPIRED`. Set this parameter to account for your
model's startup time. For models with long cold starts, consider keeping minimum
replicas running using
[autoscaling settings](/deployment/autoscaling/overview).
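For example, to give a cold-starting model up to 30 minutes before the request expires, set the parameter in the request body. The endpoint and other fields are the same as in the quick start; the value here is illustrative:

```python
# Request body for /async_predict with an extended queue window.
payload = {
    "model_input": {"prompt": "hello world!"},
    "webhook_endpoint": "YOUR_WEBHOOK_ENDPOINT",
    # Default is 600 seconds (10 minutes); maximum is 72 hours.
    "max_time_in_queue_seconds": 30 * 60,
}
```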
### Async priority
Async requests are subject to two levels of priority: how they compete with sync
requests for model capacity, and how they're ordered relative to other async
requests in the queue.
#### Sync vs async concurrency
Sync and async requests share your model's concurrency pool, controlled by
`predict_concurrency` in your model configuration:
```yaml config.yaml theme={"system"}
runtime:
  predict_concurrency: 10
```
The `predict_concurrency` setting defines how many requests your model can
process simultaneously per replica. When both sync and async requests are in
flight, sync requests take priority. The queue processor monitors your model's
capacity and backs off when it receives 429 responses, ensuring sync traffic
isn't starved.
For example, if your model has `predict_concurrency=10` and 8 sync requests are
running, only 2 slots remain for async requests. The remaining async requests
stay queued until capacity frees up.
#### Async queue priority
Within the async queue itself, you can control processing order using the
`priority` parameter. This is useful for serving specific requests faster or
ensuring critical batch jobs run before lower-priority work.
```python theme={"system"}
import requests
import os

model_id = "YOUR_MODEL_ID"
webhook_endpoint = "YOUR_WEBHOOK_URL"
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/async_predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "webhook_endpoint": webhook_endpoint,
        "model_input": {"prompt": "hello world!"},
        "priority": 0,
    },
)
print(resp.json())
```
The `priority` parameter accepts values 0, 1, or 2. Lower values indicate higher
priority: a request with `priority: 0` is processed before requests with
`priority: 1` or `priority: 2`. If you don't specify a priority, requests
default to priority 1.
Use priority 0 sparingly for truly urgent requests. If all requests are marked
priority 0, the prioritization has no effect.
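Conceptually, the queue orders requests by priority value first (lower first), with ties keeping arrival order. This sketch illustrates the ordering rule only; it is not Baseten's scheduler:

```python
# Requests listed in arrival order, each with its `priority` value.
requests_in_queue = [
    {"id": "a", "priority": 1},
    {"id": "b", "priority": 0},
    {"id": "c", "priority": 2},
    {"id": "d", "priority": 1},
]

# Lower priority values are processed first; Python's stable sort
# preserves arrival order among requests with equal priority.
processing_order = [r["id"] for r in sorted(requests_in_queue, key=lambda r: r["priority"])]
```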
## Webhooks
Baseten delivers async results to your webhook endpoint when inference completes.
### Request format
When inference completes, Baseten sends a POST request to your webhook with these headers and body:
```text theme={"system"}
POST /your-webhook-path HTTP/2.0
Content-Type: application/json
X-BASETEN-REQUEST-ID: 9876543210abcdef1234567890fedcba
X-BASETEN-SIGNATURE: v1=abc123...
```
The `X-BASETEN-REQUEST-ID` header contains the request ID for correlating webhooks with your original requests.
The `X-BASETEN-SIGNATURE` header is only included if a [webhook secret](#secure-webhooks) is configured.
Webhook endpoints must use HTTPS (except `localhost` for development). Baseten
supports HTTP/2 and HTTP/1.1 connections.
```json theme={"system"}
{
  "request_id": "9876543210abcdef1234567890fedcba",
  "model_id": "abc123",
  "deployment_id": "def456",
  "type": "async_request_completed",
  "time": "2024-04-30T01:01:08.883423Z",
  "data": { "output": "model response here" },
  "errors": []
}
```
The body contains the `request_id` matching your original `/async_predict`
response, along with `model_id` and `deployment_id` identifying which deployment
ran the request. The `data` field contains your model output, or `null` if an
error occurred. The `errors` array is empty on success, or contains error
objects on failure.
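This shape is straightforward to unpack in a handler. As an illustrative sketch (the `parse_webhook` helper is our own, not part of a Baseten SDK), split the payload into the pieces you need before storing the result:

```python
import json


def parse_webhook(raw_body: bytes):
    """Split a webhook payload into (request_id, output, errors)."""
    payload = json.loads(raw_body)
    # `data` is null when an error occurred, so guard before indexing.
    output = payload["data"]["output"] if payload["data"] else None
    return payload["request_id"], output, payload["errors"]


# Example body matching the payload shape above:
body = b'{"request_id": "abc", "model_id": "m", "deployment_id": "d", "type": "async_request_completed", "time": "2024-04-30T01:01:08Z", "data": {"output": "hi"}, "errors": []}'
request_id, output, errors = parse_webhook(body)
```

In a real handler, correlate `request_id` with your original submission and persist `output` before returning a 2xx response.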
### Webhook delivery
Baseten delivers webhooks on a best-effort basis with automatic retries. If all delivery attempts fail, your model output is permanently lost.
| Setting | Value |
| --------------- | -------------------------- |
| Total attempts | 3 (1 initial + 2 retries). |
| Backoff | 1 second, then 4 seconds. |
| Timeout | 10 seconds per attempt. |
| Retryable codes | 500, 502, 503, 504. |
**To prevent data loss:**
1. **Save outputs in your model.** Use the `postprocess()` function to write to
cloud storage:
```python theme={"system"}
import json

import boto3


class Model:
    # ...
    def postprocess(self, model_output):
        s3 = boto3.client("s3")
        s3.put_object(
            Bucket="my-bucket",
            Key=f"outputs/{self.context.get('request_id')}.json",
            Body=json.dumps(model_output),
        )
        return model_output
```
This saves a copy of the model output to cloud storage before the webhook is sent.
The `postprocess` method runs after inference completes. Use
`self.context.get('request_id')` to access the async request ID for correlating
outputs with requests.
2. **Use a reliable endpoint.** Deploy your webhook to a highly available
service like a cloud function or message queue.
### Secure webhooks
Create a webhook secret in the
[Secrets tab](https://app.baseten.co/settings/secrets) to verify requests are
from Baseten.
When configured, Baseten includes an `X-BASETEN-SIGNATURE` header:
```text theme={"system"}
X-BASETEN-SIGNATURE: v1=abc123...
```
To validate, compute an HMAC-SHA256 of the request body using your secret and compare:
```python theme={"system"}
import hashlib
import hmac


def verify_signature(body: bytes, signature: str, secret: str) -> bool:
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    actual = signature.replace("v1=", "").split(",")[0]
    return hmac.compare_digest(expected, actual)
```
The function computes an HMAC-SHA256 hash of the raw request body using your
webhook secret. It extracts the signature value after `v1=` and uses
`compare_digest` for timing-safe comparison to prevent timing attacks.
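You can exercise the function end to end by signing a body yourself. The function is repeated here so the example is self-contained, and the secret is made up for illustration:

```python
import hashlib
import hmac


def verify_signature(body: bytes, signature: str, secret: str) -> bool:
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    actual = signature.replace("v1=", "").split(",")[0]
    return hmac.compare_digest(expected, actual)


secret = "whsec_example"  # made-up secret for illustration
body = b'{"request_id": "abc"}'

# Simulate the header value Baseten would send for this body:
header = "v1=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()

assert verify_signature(body, header, secret)        # valid signature passes
assert not verify_signature(b"tampered", header, secret)  # altered body fails
```

Always verify against the raw request bytes, before any JSON parsing or re-serialization, since any change to the body changes the digest.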
Rotate secrets periodically. During rotation, both old and new secrets remain
valid for 24 hours.
## Manage requests
You can check the status of async requests or cancel them while they're queued.
### Check request status
To check the status of an async request, call the status endpoint with your request ID:
```python theme={"system"}
import requests
import os

model_id = "YOUR_MODEL_ID"
request_id = "YOUR_REQUEST_ID"
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.get(
    f"https://model-{model_id}.api.baseten.co/async_request/{request_id}",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
)
print(resp.json())
```
Status is available for 1 hour after completion. See the
[status API reference](/reference/inference-api/status-endpoints/get-async-request-status)
for details.
| Status | Description |
| ---------------- | ------------------------------------------------ |
| `QUEUED` | Waiting in queue. |
| `IN_PROGRESS` | Currently processing. |
| `SUCCEEDED` | Completed successfully. |
| `FAILED` | Failed after retries. |
| `EXPIRED` | Exceeded `max_time_in_queue_seconds`. |
| `CANCELED` | Canceled by user. |
| `WEBHOOK_FAILED` | Inference succeeded but webhook delivery failed. |
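A polling loop should stop as soon as a request reaches a terminal status. One way to sketch this, with the status lookup injected as a callable so the loop itself is testable (`get_status` would wrap the GET request above and return the status value from the response, which we assume carries one of the values in the table):

```python
import time

# Statuses after which the request will never change again.
TERMINAL = {"SUCCEEDED", "FAILED", "EXPIRED", "CANCELED", "WEBHOOK_FAILED"}


def poll_until_done(get_status, interval_s: float = 3.0, max_polls: int = 100) -> str:
    """Call `get_status()` until it returns a terminal status, then return it."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL:
            return status
        time.sleep(interval_s)
    raise TimeoutError("request did not reach a terminal status")


# Simulated status sequence for illustration:
fake = iter(["QUEUED", "IN_PROGRESS", "SUCCEEDED"])
final = poll_until_done(lambda: next(fake), interval_s=0.0)
```

Keep the interval coarse in practice: the status endpoint is rate-limited, and webhooks are the preferred way to learn about completion.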
### Cancel a request
Only `QUEUED` requests can be canceled. To cancel a request, call the cancel
endpoint with your request ID:
```python theme={"system"}
import requests
import os

model_id = "YOUR_MODEL_ID"
request_id = "YOUR_REQUEST_ID"
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.delete(
    f"https://model-{model_id}.api.baseten.co/async_request/{request_id}",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
)
print(resp.json())
```
For more information, see the [cancel async request API reference](/reference/inference-api/predict-endpoints/cancel-async-request).
## Error codes
When inference fails, the webhook payload returns an `errors` array:
```json theme={"system"}
{
  "errors": [{ "code": "MODEL_PREDICT_ERROR", "message": "Details here" }]
}
```
| Code | HTTP | Description | Retried |
| ----------------------- | ------- | ------------------------------- | ------- |
| `MODEL_NOT_READY` | 400 | Model is loading or starting. | Yes |
| `MODEL_DOES_NOT_EXIST` | 404 | Model or deployment not found. | No |
| `MODEL_INVALID_INPUT` | 422 | Invalid input format. | No |
| `MODEL_PREDICT_ERROR` | 500 | Exception in `model.predict()`. | Yes |
| `MODEL_UNAVAILABLE` | 502/503 | Model crashed or scaling. | Yes |
| `MODEL_PREDICT_TIMEOUT` | 504 | Inference exceeded timeout. | Yes |
### Inference retries
When inference fails with a retryable error, Baseten automatically retries the
request using exponential backoff. Configure this behavior with
`inference_retry_config`:
```python theme={"system"}
import requests
import os

model_id = "YOUR_MODEL_ID"
webhook_endpoint = "YOUR_WEBHOOK_URL"
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/async_predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "model_input": {"prompt": "hello world!"},
        "webhook_endpoint": webhook_endpoint,
        "inference_retry_config": {
            "max_attempts": 3,
            "initial_delay_ms": 1000,
            "max_delay_ms": 5000,
        },
    },
)
print(resp.json())
```
| Parameter | Range | Default | Description |
| ------------------ | -------- | ------- | ------------------------------------------------ |
| `max_attempts` | 1-10 | 3 | Total inference attempts including the original. |
| `initial_delay_ms` | 0-10,000 | 1000 | Delay before the first retry (ms). |
| `max_delay_ms` | 0-60,000 | 5000 | Maximum delay between retries (ms). |
Retries use exponential backoff with a multiplier of 2. With the default
configuration, delays progress as: 1s → 2s → 4s → 5s (capped at `max_delay_ms`).
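The delay schedule can be reproduced in a couple of lines. This is a sketch of the documented doubling-and-cap behavior, not Baseten's internal code:

```python
def retry_delays_ms(max_attempts: int, initial_delay_ms: int, max_delay_ms: int) -> list:
    """Delay before each retry: double the previous delay, capped at max_delay_ms."""
    return [
        min(initial_delay_ms * 2**i, max_delay_ms)
        for i in range(max_attempts - 1)  # one delay per retry after the first attempt
    ]


# A config with enough attempts reproduces the 1s -> 2s -> 4s -> 5s progression:
delays = retry_delays_ms(max_attempts=5, initial_delay_ms=1000, max_delay_ms=5000)
```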
Only requests that fail with retryable error codes (500, 502, 503, 504) are
retried. Non-retryable errors like invalid input (422) or model not found (404)
fail immediately.
Inference retries are distinct from [webhook delivery retries](#webhook-delivery).
Inference retries happen when calling your model fails. Webhook retries happen
when delivering results to your endpoint fails.
## Rate limits
Rate limits apply to the async predict, status, and cancel endpoints. Exceeding a limit returns a 429 status code.
| Endpoint | Limit |
| -------------------------------------------- | ----------------------------------- |
| Predict endpoint requests (`/async_predict`) | 12,000 requests/minute (org-level). |
| Status polling | 20 requests/second. |
| Cancel request | 20 requests/second. |
Use webhooks instead of polling to avoid status endpoint limits. Contact
[support@baseten.co](mailto:support@baseten.co) to request increases.
## Observability
Async metrics are available on the
[Metrics tab](/observability/metrics#async-queue-metrics) of your model
dashboard:
* **Inference latency/volume**: includes async requests.
* **Time in async queue**: time spent in `QUEUED` state.
* **Async queue size**: number of queued requests.
## Related
Fork this Repl to quickly set up a webhook endpoint for testing async inference.
Configure webhook secrets in your Baseten settings to secure webhook delivery.
# Call your model
Source: https://docs.baseten.co/inference/calling-your-model
Run inference on deployed models
Once deployed, your model is accessible through an [API endpoint](/reference/inference-api/overview). To make an inference request, you'll need:
* **Model ID**: Found in the Baseten dashboard or returned when you deploy.
* **[API key](/organization/api-keys)**: Authenticates your requests.
* **JSON-serializable model input**: The data your model expects.
## Authentication
Include your API key in the `Authorization` header:
```sh theme={"system"}
curl -X POST https://model-YOUR_MODEL_ID.api.baseten.co/environments/production/predict \
-H "Authorization: Api-Key $BASETEN_API_KEY" \
-H "Content-Type: application/json" \
-d '{"prompt": "Hello, world!"}'
```
In Python with requests:
```python theme={"system"}
import requests
import os

api_key = os.environ["BASETEN_API_KEY"]
model_id = "YOUR_MODEL_ID"

response = requests.post(
    f"https://model-{model_id}.api.baseten.co/environments/production/predict",
    headers={"Authorization": f"Api-Key {api_key}"},
    json={"prompt": "Hello, world!"},
)
print(response.json())
```
## Predict API endpoints
Baseten provides multiple endpoints for different inference modes:
* [`/predict`](/reference/inference-api/overview#predict-endpoints): Standard synchronous inference.
* [`/async_predict`](/reference/inference-api/overview#predict-endpoints): Asynchronous inference for long-running tasks.
Endpoints are available for environments and all deployments. See the [API reference](/reference/inference-api/overview) for details.
## Sync API endpoints
Custom servers support the standard `predict` endpoints as well as a special `sync` endpoint. The `sync` endpoint lets you call any route defined in your custom server.
```
https://model-{model_id}.api.baseten.co/environments/production/sync/{route}
```
Here are a few examples that show how the sync endpoint maps to the custom server's routes.
* `https://model-{model_id}.../sync/health` -> `/health`
* `https://model-{model_id}.../sync/items` -> `/items`
* `https://model-{model_id}.../sync/items/123` -> `/items/123`
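In other words, the path you append after `/sync` is forwarded verbatim to your server. A tiny helper makes the mapping concrete (the helper itself is our own sketch, not part of a Baseten SDK):

```python
def sync_url(model_id: str, route: str, env: str = "production") -> str:
    """Build the sync-endpoint URL that forwards to `route` on your custom server."""
    return f"https://model-{model_id}.api.baseten.co/environments/{env}/sync/{route.lstrip('/')}"


url = sync_url("abc123", "/items/123")
```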
## OpenAI SDK
When you deploy a model with Engine-Builder, you get an OpenAI-compatible server. If you're already using one of the OpenAI SDKs, update the base URL to your Baseten model URL and include your Baseten API key.
```python theme={"system"}
import os

from openai import OpenAI

model_id = "abcdef"  # TODO: replace with your model id
api_key = os.environ.get("BASETEN_API_KEY")
model_url = f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"

client = OpenAI(
    base_url=model_url,
    api_key=api_key,
)

stream = client.chat.completions.create(
    model="baseten",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```
## Alternative invocation methods
* **Truss CLI**: [`truss predict`](/reference/cli/truss/predict)
* **Model Dashboard**: "Playground" button in the Baseten UI
# Concepts
Source: https://docs.baseten.co/inference/concepts
Inference on Baseten is designed for flexibility, efficiency, and scalability. Models can be served [synchronously](/inference/calling-your-model), [asynchronously](/inference/async), or with [streaming](/inference/streaming) to meet different performance and latency needs.
* [Synchronous](/inference/calling-your-model) inference is ideal for low-latency, real-time responses.
* [Asynchronous](/inference/async) inference handles long-running tasks efficiently without blocking resources.
* [Streaming](/inference/streaming) inference delivers partial results as they become available for faster response times.
Baseten supports various input and output formats, including structured data, binary files, and function calls, making it adaptable to different workloads.
# Integrations
Source: https://docs.baseten.co/inference/integrations
Integrate your models with tools like LangChain, LiteLLM, and more.
Use frontier open-source models like Kimi K2 Thinking, GLM 4.6, DeepSeek V3.1 inside your IDE with Baseten and Cline.
Build agents with human-in-the-loop powered by Baseten LLMs and HumanLayer.
Use Baseten models within your RAG applications with LlamaIndex.
Use your Baseten models with LangChain V1.0 to build workflows and agents.
Use your Baseten models in LiteLLM projects.
Build real-time voice agents with TTS models hosted on Baseten.
Use frontier open-source models like Kimi K2 Thinking, GLM 4.6, DeepSeek V3.1 inside your IDE with Baseten and Roo Code.
Power your Next.js web apps using Baseten models through AI SDK v5.
Want to integrate Baseten with your platform or project? Reach out to
[support@baseten.co](mailto:support@baseten.co) and we'll help with building
and marketing the integration.
# Model I/O in binary
Source: https://docs.baseten.co/inference/output-format/binary
Decode and save binary model output
Baseten and Truss natively support model I/O in binary and use msgpack encoding for efficiency.
## Deploy a basic Truss for binary I/O
If you need a deployed model to try the invocation examples below, follow these steps to create and deploy a super basic Truss that accepts and returns binary data. The Truss performs no operations and is purely illustrative.
To create a Truss, run:
```sh theme={"system"}
truss init binary_test
```
This creates a Truss in a new directory `binary_test`. By default, newly created Trusses implement an identity function that returns the exact input they are given.
Optionally, modify `binary_test/model/model.py` to log that the data received is of type `bytes`:
```python binary_test/model/model.py theme={"system"}
def predict(self, model_input):
    # Run model inference here
    print(f"Input type: {type(model_input['byte_data'])}")
    return model_input
```
Deploy the Truss to Baseten with:
```sh theme={"system"}
truss push --watch
```
## Send raw bytes as model input
To send binary data as model input:
1. Set the `content-type` HTTP header to `application/octet-stream`
2. Use `msgpack` to encode the data or file
3. Make a POST request to the model
This code sample assumes you have a file `Gettysburg.mp3` in the current working directory. You can download the [11-second file from our CDN](https://cdn.baseten.co/docs/production/Gettysburg.mp3) or replace it with your own file.
```python call_model.py theme={"system"}
import os

import msgpack
import requests

model_id = "MODEL_ID"  # Replace this with your model ID
deployment = "development"  # `development`, `production`, or a deployment ID
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Specify the URL to which you want to send the POST request
url = f"https://model-{model_id}.api.baseten.co/{deployment}/predict"
headers = {
    "Authorization": f"Api-Key {baseten_api_key}",
    "content-type": "application/octet-stream",
}

with open("Gettysburg.mp3", "rb") as file:
    response = requests.post(
        url,
        headers=headers,
        data=msgpack.packb({"byte_data": file.read()}),
    )

print(response.status_code)
print(response.headers)
```
To support certain types like numpy and datetime values, you may need to
extend client-side `msgpack` encoding with the same [encoder and decoder used
by
Truss](https://github.com/basetenlabs/truss/blob/main/truss/templates/shared/serialization.py).
## Parse raw bytes from model output
To use the output of a non-streaming model response, decode the response content.
```python call_model.py theme={"system"}
# Continues `call_model.py` from above
binary_output = msgpack.unpackb(response.content)

# Change extension if not working with mp3 data
with open("output.mp3", "wb") as file:
    file.write(binary_output["byte_data"])
```
## Streaming binary outputs
You can also stream output as binary. This is useful for sending large files or reading binary output as it is generated.
In `model.py`, the `predict` function must return a generator that yields the output in chunks.
```python model/model.py theme={"system"}
# Replace the predict function in your Truss
def predict(self, model_input):
    import os

    current_dir = os.path.dirname(__file__)
    file_path = os.path.join(current_dir, "tmpfile.txt")
    with open(file_path, mode="wb") as file:
        file.write(bytes(model_input["text"], encoding="utf-8"))

    def iterfile():
        # Get the directory of the current file
        current_dir = os.path.dirname(__file__)
        # Construct the full path to the temporary file
        file_path = os.path.join(current_dir, "tmpfile.txt")
        with open(file_path, mode="rb") as file_like:
            yield from file_like

    return iterfile()
```
Then, in your client, you can use streaming output directly without decoding.
```python stream_model.py theme={"system"}
import os
import json

import requests

model_id = "MODEL_ID"  # Replace this with your model ID
deployment = "development"  # `development`, `production`, or a deployment ID
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Specify the URL to which you want to send the POST request
url = f"https://model-{model_id}.api.baseten.co/{deployment}/predict"
headers = {
    "Authorization": f"Api-Key {baseten_api_key}",
}

s = requests.Session()
with s.post(
    url,
    headers=headers,
    data=json.dumps({"text": "Lorem Ipsum"}),
    # Include stream=True as an argument so the requests library knows to stream
    stream=True,
) as response:
    for token in response.iter_content(1):
        print(token)  # Prints bytes
```
# Model I/O with files
Source: https://docs.baseten.co/inference/output-format/files
Call models by passing a file or URL
Baseten supports a wide variety of file-based I/O approaches. These examples show our recommendations for working with files during model inference, whether local or remote, public or private, in the Truss or in your invocation code.
## Files as input
### Example: Send a file with JSON-serializable content
The Truss CLI has a `-f` flag to pass file input. If you're using the API endpoint via Python, get file contents with the standard `f.read()` function.
```sh Truss CLI theme={"system"}
truss predict -f input.json
```
```python Python script theme={"system"}
import json
import os

import urllib3

model_id = ""

# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Read input as JSON
with open("input.json", "r") as f:
    data = json.loads(f.read())

resp = urllib3.request(
    "POST",
    # Endpoint for production deployment, see API reference for more
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json=data,
)
print(resp.json())
```
### Example: Send a file with non-serializable content
The `-f` flag for `truss predict` only applies to JSON-serializable content. For other files, like the audio files required by MusicGen Melody, the file content needs to be base64 encoded before it is sent.
```python theme={"system"}
import base64
import os

import urllib3

model_id = ""

# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Open a local file
with open("mymelody.wav", "rb") as f:  # mono wav file, 48khz sample rate
    # Convert file contents into JSON-serializable format
    encoded_data = base64.b64encode(f.read())
    encoded_str = encoded_data.decode("utf-8")

# Define the data payload
data = {"prompts": ["happy rock", "energetic EDM", "sad jazz"], "melody": encoded_str, "duration": 8}

# Make the POST request
resp = urllib3.request(
    "POST",
    # Endpoint for production deployment, see API reference for more
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json=data,
)

data = resp.json()["data"]

# Save output to files
for idx, clip in enumerate(data):
    with open(f"clip_{idx}.wav", "wb") as f:
        f.write(base64.b64decode(clip))
```
### Example: Send a URL to a public file
Rather than encoding and serializing a file to send in the HTTP request, you can instead write a Truss that takes a URL as input and loads the content in the `preprocess()` function.
Here's an example from [Whisper in the model library](https://www.baseten.co/library/whisper/).
```python theme={"system"}
from tempfile import NamedTemporaryFile

import requests


# Get file content without blocking GPU
def preprocess(self, request):
    resp = requests.get(request["url"])
    return {"content": resp.content}


# Use file content in model inference
def predict(self, model_input):
    with NamedTemporaryFile() as fp:
        fp.write(model_input["content"])
        result = whisper.transcribe(
            self._model,
            fp.name,
            temperature=0,
            best_of=5,
            beam_size=5,
        )
    segments = [
        {"start": r["start"], "end": r["end"], "text": r["text"]}
        for r in result["segments"]
    ]
    return {
        "language": whisper.tokenizer.LANGUAGES[result["language"]],
        "segments": segments,
        "text": result["text"],
    }
```
## Files as output
### Example: Save model output to local file
When saving model output to a local file, there's nothing Baseten-specific about the code. Just use the standard `>` operator in bash or `file.write()` function in Python to save the model output.
```sh Truss CLI theme={"system"}
truss predict -d '"Model input!"' > output.json
```
```python Python script theme={"system"}
import json
import os

import urllib3

model_id = ""

# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Call model
resp = urllib3.request(
    "POST",
    # Endpoint for production deployment, see API reference for more
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json="Model input!",
)

# Write results to file
with open("output.json", "w") as f:
    json.dump(resp.json(), f)
```
Output for some models, like image and audio generation models, may need to be decoded before you save it. See our [image generation example](/examples/image-generation) for how to parse base64 output.
# Streaming
Source: https://docs.baseten.co/inference/streaming
How to call a model that has a streaming-capable endpoint.
Any model could be packaged with support for streaming output, but it only makes sense to do so for models where:
* Generating a complete output takes a relatively long time.
* The first tokens of output are useful without the context of the rest of the output.
* Reducing the time to first token improves the user experience.
LLMs in chat applications are the perfect use case for streaming model output.
## Example: Streaming with Mistral
[Mistral 7B Instruct](https://www.baseten.co/library/mistral-7b-instruct) from Baseten's model library is a recent LLM with streaming support. Invocation should be the same for any other model library LLM as well as any Truss that follows the same standard.
[Deploy Mistral 7B Instruct](https://www.baseten.co/library/mistral-7b-instruct) or a similar LLM to run the following examples.
### Truss CLI
The Truss CLI has built-in support for streaming model output.
```sh theme={"system"}
truss predict -d '{"prompt": "What is the Mistral wind?", "stream": true}'
```
### API endpoint
When using a streaming endpoint with cURL, use the `--no-buffer` flag to stream output as it is received.
As with all cURL invocations, you'll need a model ID and API key.
```sh theme={"system"}
curl -X POST https://model-MODEL_ID.api.baseten.co/production/predict \
-H 'Authorization: Api-Key YOUR_API_KEY' \
-d '{"prompt": "What is the Mistral wind?", "stream": true}' \
--no-buffer
```
### Python application
Let's take things a step further and look at how to integrate streaming output with a Python application.
```python theme={"system"}
import requests
import json
import os

# Model ID for production deployment
model_id = ""

# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Open session to enable streaming
s = requests.Session()
with s.post(
    # Endpoint for production deployment, see API reference for more
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    # Include "stream": True in the data dict so the model knows to stream
    data=json.dumps({
        "prompt": "What even is AGI?",
        "stream": True,
        "max_new_tokens": 4096,
    }),
    # Include stream=True as an argument so the requests library knows to stream
    stream=True,
) as resp:
    # Print the generated tokens as they get streamed
    for content in resp.iter_content():
        print(content.decode("utf-8"), end="", flush=True)
```
# Export to Datadog
Source: https://docs.baseten.co/observability/export-metrics/datadog
Export metrics from Baseten to Datadog
The Baseten metrics endpoint can be integrated with [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) by configuring a Prometheus receiver that scrapes the endpoint. This allows Baseten metrics to be pushed to a variety of popular exporters. See the [OpenTelemetry registry](https://opentelemetry.io/ecosystem/registry/?component=exporter) for a full list.
**Using OpenTelemetry Collector to push to Datadog**
```yaml config.yaml theme={"system"}
receivers:
  # Configure a Prometheus receiver to scrape the Baseten metrics endpoint.
  prometheus:
    config:
      scrape_configs:
        - job_name: 'baseten'
          scrape_interval: 60s
          metrics_path: '/metrics'
          scheme: https
          authorization:
            type: "Api-Key"
            credentials: "{BASETEN_API_KEY}"
          static_configs:
            - targets: ['app.baseten.co']

processors:
  batch:

exporters:
  # Configure a Datadog exporter.
  datadog:
    api:
      key: "{DATADOG_API_KEY}"

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [datadog]
```
# Export to Grafana Cloud
Source: https://docs.baseten.co/observability/export-metrics/grafana
Export metrics from Baseten to Grafana Cloud
The Baseten + Grafana Cloud integration enables you to get real-time inference metrics within your existing Grafana setup.
## Video tutorial
See below for step-by-step details from the video.
## Set up the integration
For a visual guide, please follow along with the video above.
Open your Grafana Cloud account:
1. Navigate to "Home > Connections > Add new connection".
2. In the search bar, type `Metrics Endpoint` and select it.
3. Give your scrape job a name like `baseten_metrics_scrape`.
4. Set the scrape job URL to `https://app.baseten.co/metrics`.
5. Leave the scrape interval set to `Every minute`.
6. Select `Bearer` for authentication credentials.
7. In your Baseten account, generate a metrics-only workspace API key.
8. In Grafana, enter the Bearer Token as `Api-Key abcd.1234567890` where the latter value is replaced by your API key.
9. Use the "Test Connection" button to ensure everything is entered correctly.
10. Click "Save Scrape Job."
11. Click "Install."
12. In your integrations list, select your new export and go through the "Enable" flow shown on video.
Now, you can navigate to your Dashboards tab, where you'll see your data! Please note that it can take a couple of minutes for data to arrive and only new data will be scraped, not historical metrics.
## Build a Grafana dashboard
Importing the data is a great first step, but you'll need a dashboard to properly visualize the incoming information.
We've prepared a basic dashboard to get you started, which you can import by:
1. Downloading `baseten_grafana_dashboard.json` from [this GitHub Gist](https://gist.github.com/philipkiely-baseten/9952e7592775ce1644944fb644ba2a9c).
2. Selecting "New > Import" from the dropdown in the top-right corner of the Dashboard page.
3. Dropping in the provided JSON file.
For visual reference in navigating the dashboard, please see the video above.
# Export to New Relic
Source: https://docs.baseten.co/observability/export-metrics/new-relic
Export metrics from Baseten to New Relic
Export Baseten metrics to New Relic by integrating with [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/). This involves configuring a Prometheus receiver that scrapes Baseten's metrics endpoint and configuring a New Relic exporter to send the metrics to your observability backend.
**Using OpenTelemetry Collector to push to New Relic**
```yaml config.yaml theme={"system"}
receivers:
  # Configure a Prometheus receiver to scrape the Baseten metrics endpoint.
  prometheus:
    config:
      scrape_configs:
        - job_name: 'baseten'
          scrape_interval: 60s
          metrics_path: '/metrics'
          scheme: https
          authorization:
            type: "Api-Key"
            credentials: "{BASETEN_API_KEY}"
          static_configs:
            - targets: ['app.baseten.co']

processors:
  batch:

exporters:
  # Configure a New Relic exporter. Visit New Relic documentation to get your regional otlp endpoint.
  otlphttp/newrelic:
    endpoint: https://otlp.nr-data.net
    headers:
      api-key: "{NEW_RELIC_KEY}"

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlphttp/newrelic]
```
# Overview
Source: https://docs.baseten.co/observability/export-metrics/overview
Export metrics from Baseten to your observability stack
Baseten provides a metrics endpoint in Prometheus format, allowing integration with observability tools like Prometheus, OpenTelemetry Collector, Datadog Agent, and Vector.
## Setting up metrics scraping
Use the Authorization header with a [Baseten API key](https://app.baseten.co/settings/api_keys):
```json theme={"system"}
{"Authorization": "Api-Key YOUR_API_KEY"}
```
A 1-minute scrape interval is recommended (metrics update every 30 seconds).
## Supported integrations
Baseten metrics can be collected via [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) and exported to:
* [Prometheus](/observability/export-metrics/prometheus)
* [Datadog](/observability/export-metrics/datadog)
* [Grafana](/observability/export-metrics/grafana)
* [New Relic](/observability/export-metrics/new-relic)
For available metrics, see the [supported metrics reference](/observability/export-metrics/supported-metrics).
## Rate limits
* **6 requests per minute per organization**
* Exceeding this limit results in **HTTP 429 (Too Many Requests)** responses.
* To stay within limits, use a **1-minute scrape interval**.
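If a scraper occasionally bursts past the limit, backing off on 429 responses keeps it healthy. A minimal sketch; `fetch` is a hypothetical callable returning `(status_code, body)`, not a Baseten API:

```python
import time

def scrape_with_backoff(fetch, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch() and retry with exponential backoff on HTTP 429."""
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status != 429:
            return status, body
        if attempt < max_retries:
            # Back off exponentially: 2s, 4s, 8s, ...
            sleep(base_delay * (2 ** attempt))
    return status, body
```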
# Export to Prometheus
Source: https://docs.baseten.co/observability/export-metrics/prometheus
Export metrics from Baseten to Prometheus
To integrate with Prometheus, specify the Baseten metrics endpoint in a scrape config. For example:
```yaml prometheus.yml theme={"system"}
global:
  scrape_interval: 60s
scrape_configs:
  - job_name: 'baseten'
    metrics_path: '/metrics'
    authorization:
      type: "Api-Key"
      credentials: "{BASETEN_API_KEY}"
    static_configs:
      - targets: ['app.baseten.co']
    scheme: https
```
See the Prometheus docs for more details on [getting started](https://prometheus.io/docs/prometheus/latest/getting_started/) and [configuration options](https://prometheus.io/docs/prometheus/latest/configuration/configuration/).
# Metrics support matrix
Source: https://docs.baseten.co/observability/export-metrics/supported-metrics
Which metrics can be exported
## `baseten_inference_requests_total`
Cumulative number of requests to the model.
Type: `counter`
Labels:
* The ID of the model.
* The name of the model.
* The ID of the deployment.
* The status code of the response.
* Whether the request was an [async inference request](/inference/async).
* The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
* The phase of the deployment in the [promote to production process](/deployment/deployments#environments-and-promotion). Empty if the deployment is not associated with an environment. Possible values: `"promoting"` or `"stable"`.
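Since this is a cumulative counter, dashboards usually query its per-second rate. A Prometheus recording rule sketch (the rule name is illustrative):

```yaml
groups:
  - name: baseten-requests
    rules:
      # Per-second request rate over a 5-minute window.
      - record: job:baseten_inference_requests:rate5m
        expr: sum(rate(baseten_inference_requests_total[5m]))
```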
## `baseten_end_to_end_response_time_seconds`
End-to-end response time in seconds.
Type: `histogram`
Labels:
* The ID of the model.
* The name of the model.
* The ID of the deployment.
* The status code of the response.
* Whether the request was an [async inference request](/inference/async).
* The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
* The phase of the deployment in the [promote to production process](/deployment/deployments#environments-and-promotion). Empty if the deployment is not associated with an environment. Possible values: `"promoting"` or `"stable"`.
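Because this metric is a histogram, percentiles can be computed with `histogram_quantile()`. A sketch, assuming Prometheus's standard `_bucket` series suffix for this metric:

```yaml
groups:
  - name: baseten-latency
    rules:
      # Approximate p95 end-to-end latency over a 5-minute window.
      - record: job:baseten_e2e_latency_seconds:p95_5m
        expr: histogram_quantile(0.95, sum by (le) (rate(baseten_end_to_end_response_time_seconds_bucket[5m])))
```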
## `baseten_container_cpu_usage_seconds_total`
Cumulative CPU time consumed by the container in core-seconds.
Type: `counter`
Labels:
* The ID of the model.
* The name of the model.
* The ID of the deployment.
* The ID of the replica.
* The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
* The phase of the deployment in the [promote to production process](/deployment/deployments#environments-and-promotion). Empty if the deployment is not associated with an environment. Possible values: `"promoting"` or `"stable"`.
## `baseten_replicas_active`
Number of replicas ready to serve model requests.
Type: `gauge`
Labels:
* The ID of the model.
* The name of the model.
* The ID of the deployment.
* The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
* The phase of the deployment in the [promote to production process](/deployment/deployments#environments-and-promotion). Empty if the deployment is not associated with an environment. Possible values: `"promoting"` or `"stable"`.
## `baseten_replicas_starting`
Number of replicas starting up, meaning they are either waiting for resources to become available or loading the model.
Type: `gauge`
Labels:
* The ID of the model.
* The name of the model.
* The ID of the deployment.
* The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
* The phase of the deployment in the [promote to production process](/deployment/deployments#environments-and-promotion). Empty if the deployment is not associated with an environment. Possible values: `"promoting"` or `"stable"`.
## `baseten_container_cpu_memory_working_set_bytes`
Working set memory usage of the container in bytes.
Type: `gauge`
Labels:
* The ID of the model.
* The name of the model.
* The ID of the deployment.
* The ID of the replica.
* The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
* The phase of the deployment in the [promote to production process](/deployment/deployments#environments-and-promotion). Empty if the deployment is not associated with an environment. Possible values: `"promoting"` or `"stable"`.
## `baseten_request_size_bytes`
Request size in bytes. Proxy for input tokens.
Type: `histogram`
Labels:
* The ID of the model.
* The name of the model.
* The ID of the deployment.
* The status code of the response.
* Whether the request was an [async inference request](/inference/async).
* The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
* The phase of the deployment in the [promote to production process](/deployment/deployments#environments-and-promotion). Empty if the deployment is not associated with an environment. Possible values: `"promoting"` or `"stable"`.
## `baseten_response_size_bytes`
Response size in bytes. Proxy for generated tokens.
Type: `histogram`
Labels:
* The ID of the model.
* The name of the model.
* The ID of the deployment.
* The status code of the response.
* Whether the request was an [async inference request](/inference/async).
* The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
* The phase of the deployment in the [promote to production process](/deployment/deployments#environments-and-promotion). Empty if the deployment is not associated with an environment. Possible values: `"promoting"` or `"stable"`.
## `baseten_time_to_first_byte_seconds`
Time to first byte/write in seconds. Proxy for time-to-first-token (TTFT).
Type: `histogram`
Labels:
* The ID of the model.
* The name of the model.
* The ID of the deployment.
* The status code of the response.
* Whether the request was an [async inference request](/inference/async).
* The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
* The phase of the deployment in the [promote to production process](/deployment/deployments#environments-and-promotion). Empty if the deployment is not associated with an environment. Possible values: `"promoting"` or `"stable"`.
## `baseten_time_in_async_queue_seconds`
Time async requests spend queued before processing.
Type: `histogram`
Labels:
* The ID of the model.
* The name of the model.
* The ID of the deployment.
* The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
* The phase of the deployment in the [promote to production process](/deployment/deployments#environments-and-promotion). Empty if the deployment is not associated with an environment. Possible values: `"promoting"` or `"stable"`.
## `baseten_async_queue_size`
Number of queued async requests over time.
Type: `gauge`
Labels:
* The ID of the model.
* The name of the model.
* The ID of the deployment.
* The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
* The phase of the deployment in the [promote to production process](/deployment/deployments#environments-and-promotion). Empty if the deployment is not associated with an environment. Possible values: `"promoting"` or `"stable"`.
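Once exported, this gauge works well for backpressure alerts. A Prometheus alerting-rule sketch (the threshold and duration are illustrative, not recommendations):

```yaml
groups:
  - name: baseten-async
    rules:
      - alert: AsyncQueueBacklog
        # Fire if more than 100 async requests stay queued for 10 minutes.
        expr: baseten_async_queue_size > 100
        for: 10m
        labels:
          severity: warning
```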
## `baseten_gpu_memory_used`
GPU memory used in MiB.
Type: `gauge`
Labels:
* The ID of the model.
* The name of the model.
* The ID of the deployment.
* The ID of the replica.
* The ID of the GPU.
* The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
* The phase of the deployment in the [promote to production process](/deployment/deployments#environments-and-promotion). Empty if the deployment is not associated with an environment. Possible values: `"promoting"` or `"stable"`.
## `baseten_gpu_utilization`
GPU utilization as a percentage (between 0 and 100).
Type: `gauge`
Labels:
* The ID of the model.
* The name of the model.
* The ID of the deployment.
* The ID of the replica.
* The ID of the GPU.
* The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
* The phase of the deployment in the [promote to production process](/deployment/deployments#environments-and-promotion). Empty if the deployment is not associated with an environment. Possible values: `"promoting"` or `"stable"`.
## `baseten_ongoing_websocket_connections`
Number of ongoing websocket connections.
Type: `gauge`
Labels:
* The ID of the model.
* The name of the model.
* The ID of the deployment.
* The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
* The phase of the deployment in the [promote to production process](/deployment/deployments#environments-and-promotion). Empty if the deployment is not associated with an environment. Possible values: `"promoting"` or `"stable"`.
## `baseten_concurrent_requests`
Total number of concurrent inference requests for a deployment, including both requests currently being serviced by replicas and requests waiting in the queue. This is the primary signal that drives [autoscaling](/deployment/autoscaling/overview) decisions.
Type: `gauge`
Labels:
* The ID of the model.
* The name of the model.
* The ID of the deployment.
* The environment that the deployment corresponds to. Empty if the deployment is not associated with an environment.
* The phase of the deployment in the [promote to production process](/deployment/deployments#environments-and-promotion). Empty if the deployment is not associated with an environment. Possible values: `"promoting"` or `"stable"`.
# Status and health
Source: https://docs.baseten.co/observability/health
Every model deployment in your Baseten workspace has a status to represent its activity and health.
## Model statuses
**Healthy states:**
* **Active**: The deployment is active and available. It can be called with `truss predict` or from its API endpoints.
* **Scaled to zero**: The deployment is active but is not consuming resources. It will automatically start up when called, then scale back to zero after traffic ceases.
* **Starting up**: The deployment is starting up from a scaled to zero state after receiving a request.
* **Inactive**: The deployment is unavailable and is not consuming resources. It may be manually reactivated.
**Error states:**
* **Unhealthy**: The deployment is active but is in an unhealthy state due to errors while running, such as an external service it relies on going down or a problem in your Truss that prevents it from responding to requests.
* **Build failed**: The deployment is not active due to a Docker build failure.
* **Deployment failed**: The deployment is not active due to a model deployment failure.
## Fixing unhealthy deployments
If you have an unhealthy or failed deployment, check the model logs to see if there's any indication of what the problem is. You can try deactivating and reactivating your deployment to see if the issue goes away. In the case of an external service outage, you may need to wait for the service to come back up before your deployment works again. For issues inside your Truss, you'll need to diagnose your code to see what is making it unresponsive.
# Metrics
Source: https://docs.baseten.co/observability/metrics
Understand the load and performance of your model
The Metrics tab in the model dashboard provides deployment-specific insights
into model load and performance. Use the dropdowns to filter by environment or
deployment and time range.
## Inference volume
Tracks the request rate over time, segmented by HTTP status codes:
* `2xx`: 🟢 Successful requests
* `4xx`: 🟡 Client errors
* `5xx`: 🔴 Server errors (includes model prediction exceptions)
Note that for non-HTTP models and Chains (WebSockets and gRPC), the status codes reflect those of the underlying protocol rather than HTTP.
***
## Response time
Measured at different percentiles (p50, p90, p95, p99):
* **End-to-end response time:** Includes cold starts, queuing, and inference (excludes client-side latency). Reflects real-world performance.
* **Inference time:** Covers only model execution, including pre/post-processing. Useful for optimizing single-replica performance.
* **Time to first byte:** Tracks the time-to-first-byte distribution, including any queueing and routing time. A proxy for time to first token (TTFT).
***
## Request and response size
Measured at different percentiles (p50, p90, p95, p99):
* **Request size:** Tracks the request size distribution. A proxy for input tokens.
* **Response size:** Tracks the response size distribution. A proxy for generated tokens.
***
## Replicas
Tracks the number of **active** and **starting** replicas:
* **Starting:** Waiting for resources or loading the model.
* **Active:** Ready to serve requests.
* For development deployments, a replica is considered active while running the live reload server.
***
## Concurrent requests
Tracks the total number of in-progress inference requests across replicas, including both requests currently being serviced and requests waiting in the queue.
This is the primary signal that drives [autoscaling](/deployment/autoscaling/overview) decisions. For the full metric definition and labels, see [`baseten_concurrent_requests`](/observability/export-metrics/supported-metrics#baseten_concurrent_requests).
***
## CPU usage and memory
Displays resource utilization across replicas. Metrics are averaged and may not capture short spikes.
### Considerations:
* **High CPU/memory usage**: May degrade performance. Consider upgrading to a larger instance type.
* **Low CPU/memory usage**: Possible overprovisioning. Switch to a smaller instance to reduce costs.
***
## GPU usage and memory
Shows GPU utilization across replicas.
* **GPU usage**: Percentage of time a kernel function occupies the GPU.
* **GPU memory**: Total memory used.
### Considerations:
* **High GPU load**: Can slow inference. Check response time metrics.
* **High memory usage**: May cause out-of-memory failures.
* **Low utilization**: May indicate overprovisioning. Consider a smaller GPU.
***
## Async queue metrics
* **Time in Async Queue**: Time spent in the async queue before execution (p50, p90, p95, p99).
* **Async Queue Size**: Number of queued async requests.
### Considerations:
* Large queue size indicates requests are queued faster than they are processed.
* To improve async throughput, increase the max replicas or adjust autoscaling concurrency.
***
## Using metrics for autoscaling
Use these metrics to diagnose autoscaling behavior and tune your settings.
### Key metrics to watch
| Metric | What it tells you |
| --------------------------------- | ------------------------------------------------------------------------------------- |
| **Concurrent requests** | Shows total demand (queued + active). This is the signal driving autoscaling. |
| **Replicas** (active vs starting) | Shows scaling activity. Large gaps indicate cold start delays. |
| **Inference volume** | Shows traffic patterns. Use to identify if you have noisy, bursty, or steady traffic. |
| **Response time** (p95, p99) | Shows if scaling is keeping up. Spikes aligned with replica changes indicate thrash. |
| **Async queue size** | Shows backpressure. Growing queue means you need more capacity. |
### Diagnosing autoscaling issues
| You see... | Likely cause | Fix |
| ------------------------------------------------- | --------------------------- | ----------------------------------------------- |
| Latency spikes aligned with replica count changes | Oscillation (thrash) | Increase scale-down delay |
| Replicas at max, latency still degrading | Insufficient capacity | Increase max replicas or concurrency target |
| Large gap between active and starting replicas | Cold start delays | Increase min replicas, check image optimization |
| Traffic high but replicas staying low | Concurrency target too high | Lower concurrency target or target utilization |
| Replicas scaling down too quickly | Scale-down delay too short | Increase scale-down delay |
For solutions to common autoscaling problems, see [Autoscaling troubleshooting](/troubleshooting/deployments#autoscaling-issues).
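As a rough mental model, an autoscaler sized on concurrency needs about `ceil(concurrent_requests / concurrency_target)` replicas, clamped to the configured bounds. A sketch of that heuristic (illustrative, not Baseten's exact algorithm):

```python
import math

def replicas_needed(concurrent_requests, concurrency_target,
                    min_replicas=1, max_replicas=10):
    """Enough replicas so each handles at most `concurrency_target`
    concurrent requests, clamped to [min_replicas, max_replicas]."""
    needed = math.ceil(concurrent_requests / concurrency_target)
    return max(min_replicas, min(max_replicas, needed))
```

For example, 25 concurrent requests with a concurrency target of 4 calls for 7 replicas; if max replicas is lower than that, latency degrades and the queue grows.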
# Secure model inference
Source: https://docs.baseten.co/observability/security
Keeping your models safe and private
Baseten maintains [SOC 2 Type II certification](https://www.baseten.co/blog/soc-2-type-2) and [HIPAA compliance](https://www.baseten.co/blog/baseten-announces-hipaa-compliance), with robust security measures beyond compliance.
## Data privacy
Baseten does not store model inputs, outputs, or weights by default.
* **Model inputs/outputs**: Inputs for [async inference](/inference/async) are temporarily stored until processed. Outputs are never stored.
* **Model weights**: Loaded dynamically from sources like Hugging Face, GCS, or S3, moving directly to GPU memory.
* Users can enable caching via Truss. You can permanently delete cached weights on request.
* **Postgres data tables**: Existing users may store data in Baseten’s hosted Postgres tables, which can be deleted anytime.
Baseten’s network accelerator optimizes model downloads. [Contact support](mailto:support@baseten.co) to disable it.
## Workload security
Baseten isolates inference workloads to protect users and Baseten’s infrastructure.
* **Container security**:
* Baseten never shares GPUs across users.
* Security tooling: Falco (Sysdig), Gatekeeper (Pod Security Policies).
* Minimal privileges for workloads and nodes to limit incident impact.
* **Network security**:
* Each customer has a dedicated Kubernetes namespace.
* Isolation enforced via [Calico](https://docs.tigera.io/calico/latest/about).
* Nodes run in a private subnet with firewall protections.
* **Pentesting**:
* Extended pentesting by [RunSybil](https://www.runsybil.com/) (ex-OpenAI and CrowdStrike experts).
* Malicious model deployments tested in a dedicated prod-like environment.
## Self-hosted model inference
Baseten offers single-tenant environments and self-hosted deployments. The cloud version is recommended for ease of setup, cost efficiency, and elastic GPU access.
For self-hosting, [contact support](mailto:support@baseten.co).
# Tracing
Source: https://docs.baseten.co/observability/tracing
Investigate the prediction flow in detail
Baseten’s Truss server includes built-in [OpenTelemetry](https://opentelemetry.io/) (OTEL) instrumentation, with support for custom tracing.
Tracing helps diagnose performance bottlenecks but introduces minor overhead, so it is **disabled by default**.
## Exporting builtin trace data to Honeycomb
1. **Create a Honeycomb API key** and add it to [Baseten secrets](https://app.baseten.co/settings/secrets).
2. **Update** `config.yaml` for the target model:
```yaml config.yaml theme={"system"}
environment_variables:
  HONEYCOMB_DATASET: your_dataset_name
runtime:
  enable_tracing_data: true
secrets:
  HONEYCOMB_API_KEY: '***'
```
3. **Send requests with tracing**
* Provide traceparent headers for distributed tracing.
* If omitted, Baseten generates random trace IDs.
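A `traceparent` header follows the W3C Trace Context format: `00-<32 hex trace ID>-<16 hex parent span ID>-<2 hex flags>`. A minimal way to generate one client-side:

```python
import secrets

def make_traceparent(sampled=True):
    """Build a W3C Trace Context `traceparent` header value."""
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex chars
    parent_id = secrets.token_hex(8)   # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"  # 01 = sampled
    return f"00-{trace_id}-{parent_id}-{flags}"
```

Pass the result as a `traceparent` header on your inference request to correlate Baseten's spans with your own distributed trace.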
## Adding custom OTEL instrumentation
To define custom spans and events, integrate OTEL directly:
```python model.py theme={"system"}
import time
from typing import Any, Generator

import opentelemetry.exporter.otlp.proto.http.trace_exporter as otlp_exporter
import opentelemetry.sdk.resources as resources
import opentelemetry.sdk.trace as sdk_trace
import opentelemetry.sdk.trace.export as trace_export
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({resources.SERVICE_NAME: "UserModel"}))
)
tracer = trace.get_tracer(__name__)
trace_provider = trace.get_tracer_provider()


class Model:
    def __init__(self, **kwargs) -> None:
        honeycomb_api_key = kwargs["secrets"]["HONEYCOMB_API_KEY"]
        honeycomb_exporter = otlp_exporter.OTLPSpanExporter(
            endpoint="https://api.honeycomb.io/v1/traces",
            headers={
                "x-honeycomb-team": honeycomb_api_key,
                "x-honeycomb-dataset": "marius_testing_user",
            },
        )
        honeycomb_processor = sdk_trace.export.BatchSpanProcessor(honeycomb_exporter)
        trace_provider.add_span_processor(honeycomb_processor)

    @tracer.start_as_current_span("load_model")
    def load(self):
        ...

    def preprocess(self, model_input):
        with tracer.start_as_current_span("preprocess"):
            ...
        return model_input

    @tracer.start_as_current_span("predict")
    def predict(self, model_input: Any) -> Generator[str, None, None]:
        with tracer.start_as_current_span("start-predict") as span:
            def inner():
                time.sleep(0.01)
                for i in range(5):
                    span.add_event("yield")
                    yield str(i)

            return inner()
```
Baseten’s built-in tracing **does not interfere** with user-defined OTEL implementations.
# Billing and usage
Source: https://docs.baseten.co/observability/usage
Manage payments and track overall Baseten usage
The [billing and usage dashboard](https://app.baseten.co/settings/billing) provides a breakdown of model usage and costs, updated hourly. Usage is tracked per deployment, and any available credits are automatically applied to your bill.
## Billing
### Credits
* New workspaces receive free credits for testing and deployment.
* If credits run out and no payment method is set, models will be deactivated until a payment method is added.
### Payment method
* Payment details can be added or updated on the [billing page](https://app.baseten.co/settings/billing).
* Our payment processor securely stores your payment information, not Baseten.
### Invoice history
* View past invoices and payments in the billing dashboard.
* For questions, [contact support](mailto:support@baseten.co).
***
## Usage and billing FAQs
For full details, see our [pricing page](https://www.baseten.co/pricing/), but here are answers to some common questions:
### How exactly is usage calculated?
* Usage is billed per minute while a model is deploying, scaling, or serving requests.
* Costs are based on the [instance type](/deployment/resources#choosing-the-right-instance-type) used.
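In other words, estimated cost is active minutes multiplied by the instance's per-minute rate. A sketch with a hypothetical rate (not a real Baseten price):

```python
def usage_cost(active_minutes, per_minute_rate):
    """Usage is billed per minute while a deployment is deploying,
    scaling, or serving requests."""
    return active_minutes * per_minute_rate

# e.g. 3 hours of active time on a hypothetical $0.02/minute instance
cost = usage_cost(3 * 60, 0.02)
```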
### How often are payments due?
* Initially, charges occur when usage **exceeds \$50** or at the **end of the month**, whichever comes first.
* After a history of successful payments, billing occurs **monthly**.
### Do you offer volume discounts?
* Volume discounts are available on the **Pro plan**. [Contact support](mailto:support@baseten.co) for details.
### Do you offer education and non-profit discounts?
* Yes, discounts are available for educational and nonprofit ML projects. [Contact support](mailto:support@baseten.co) to apply.
# Access control
Source: https://docs.baseten.co/organization/access
Manage access to your Baseten organization with role-based access control.
Baseten uses role-based access control (RBAC) to manage organization access.
Every organization member has one of two roles.
| Permission | Admin | Member |
| :----------------------- | ----- | ------ |
| Manage members | ✅ | ❌ |
| Manage billing | ✅ | ❌ |
| Deploy models and Chains | ✅ | ✅ |
| Call models | ✅ | ✅ |
**Admins** have full control over the organization, including member management and billing.
**Members** can deploy and call models but can't manage organization settings or other users.
If your organization uses multiple teams, see [Teams](/organization/teams) for information about team-level roles and permissions.
# API keys
Source: https://docs.baseten.co/organization/api-keys
Authenticate requests to Baseten for deployment, inference, and management.
API keys authenticate your requests to Baseten. You need an API key to:
* Deploy models, Chains, and training projects with the Truss CLI.
* Call model endpoints for inference.
* Use the management API.
## API key types
Baseten supports two types of API keys:
**Personal API keys** are tied to your user account. Actions performed with a personal key are attributed to you. Use personal keys for local development and testing.
**Team API keys** are not tied to an individual user. When your organization has [teams](/organization/teams) enabled, team keys can be scoped to a specific team. Team keys can have different permission levels:
* **Full access**: Deploy models, call endpoints, and manage resources.
* **Inference only**: Call model endpoints but cannot deploy or manage.
* **Metrics only**: Export metrics but cannot deploy or call models.
Use team keys for CI/CD pipelines, production applications, and shared automation.
If your organization uses [teams](/organization/teams), Team Admins can create team API keys scoped to their team. See [Teams](/organization/teams) for more information.
### Environment-scoped API keys
Environment-scoped API keys are team API keys restricted to specific [environments](/deployment/environments). Use them for least-privilege access when sharing keys with external partners or production integrations.
You can scope a key in two ways:
* **By environment**: The key can only call models in the selected environments (for example, `production` only, or `production` and `staging`).
* **By environment and model**: The key can only call specific models within the selected environments.
To create an environment-scoped key, select **Manage and call all team models** or **Call certain models** when [creating a team API key](#create-an-api-key), then choose the environments from the **Environment access** dropdown.
## Create an API key
1. Navigate to [API keys](https://app.baseten.co/settings/api_keys) in your account settings.
2. Select **Create API key**.
3. Select **Personal** or **Team**, then click **Next**.

For a personal key:

4. Enter a name for the key (lowercase letters, numbers, and hyphens only).
5. Select **Create API key**.

For a team key:

4. If your organization has multiple teams, select the team.
5. Enter a name for the key (lowercase letters, numbers, and hyphens only).
6. Select the permission level:
   * **Manage and call all team models**: Full access to deploy, call, and manage.
   * **Call certain models**: Inference-only access.
   * **Export model metrics**: Metrics-only access.
7. For **Manage and call all team models** or **Call certain models**, optionally use the **Environment access** dropdown to restrict the key to specific environments.
8. Select **Create API key**.
Copy the key immediately. You won't be able to view it again.
## Use API keys with the CLI
The first time you run `truss push`, the CLI prompts you for your API key and saves it to `~/.trussrc`:
```
$ truss push
💻 Let's add a Baseten remote!
🤫 Quietly paste your API_KEY:
💾 Remote config `baseten` saved to `~/.trussrc`.
```
To manually configure or update your API key, edit `~/.trussrc`:
```sh theme={"system"}
[baseten]
remote_provider = baseten
api_key = YOUR_API_KEY
```
## Use API keys with endpoints
To call model endpoints with your API key, see [Call your model](/inference/calling-your-model).
## Manage API keys
The [API keys page](https://app.baseten.co/settings/api_keys) shows all your keys with their creation date and last used timestamp. Use this information to identify unused keys.
To rename a key, select the pencil icon next to the key name.
To rotate a key, create a new key, update your applications to use it, then revoke the old key.
To revoke a key, select the trash icon next to the key. Revoked keys cannot be restored.
You can also manage API keys programmatically with the [REST API](/reference/management-api/api-keys/creates-an-api-key).
### Security recommendations
* Store API keys in environment variables or secret managers, not in code.
* Never commit API keys to version control.
* Use [environment-scoped keys](#environment-scoped-api-keys) to limit access to specific environments and models.
* Use team keys with minimal permissions for production applications.
* Rotate keys periodically and revoke unused keys.
# OpenID Connect (OIDC) authentication
Source: https://docs.baseten.co/organization/oidc
Use short-lived OIDC tokens to securely authenticate to cloud resources
OpenID Connect (OIDC) lets your Baseten deployments authenticate to cloud
resources like S3 buckets and container registries using short-lived tokens
instead of long-lived credentials.
Without OIDC, accessing cloud resources requires long-lived credentials: static
API keys or service account keys stored as secrets in Baseten. These keys don't
expire on their own, so if they're leaked or forgotten, they remain valid until
someone manually rotates them. You're responsible for tracking which keys exist,
where they're used, and when to rotate them.
OIDC takes a different approach. Instead of static keys, Baseten issues
short-lived tokens scoped to a specific deployment. There are no secrets to
store, rotate, or clean up.
Baseten OIDC currently supports:
* **AWS**: Amazon ECR (container images) and Amazon S3 (model weights)
* **Google Cloud**: Artifact Registry, GCR (container images), and Google Cloud Storage (model weights)
## How Baseten OIDC works
Baseten acts as an OIDC identity provider with the following configuration:
* **Issuer**: `https://oidc.baseten.co`
* **Audience**: `oidc.baseten.co`
When you deploy your model, Baseten generates short-lived OIDC tokens that
identify your specific workload. Your cloud provider validates these tokens
against the trust relationship you configure, then grants access to the
specified resources.
## Token structure
Each OIDC token includes standard JWT claims and custom claims that identify the
workload. Here's an example unsigned payload:
```json theme={"system"}
{
  "iss": "https://oidc.baseten.co",
  "sub": "v=1:org=Mvg9jrRd:team=AviIZ0y3:model=kW9wuKFN:deployment=e5f6g7h8:environment=production:type=model_container",
  "aud": "oidc.baseten.co",
  "iat": 1700000000,
  "exp": 1700003600,
  "jti": "550e8400-e29b-41d4-a716-446655440000",
  "org": "Mvg9jrRd",
  "team": "AviIZ0y3",
  "model": "kW9wuKFN",
  "deployment": "e5f6g7h8",
  "environment": "production",
  "type": "model_container"
}
```
The `sub` claim uses a structured format that encodes the workload identity:
```text theme={"system"}
v=1:org={org_id}:team={team_id}:model={model_id}:deployment={deployment_id}:environment={environment}:type={workload_type}
```
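The format is colon-delimited `key=value` pairs, so it can be split mechanically (this assumes no component value contains a colon, which holds for the IDs shown here):

```python
def parse_sub(sub):
    """Parse a Baseten OIDC `sub` claim into its key=value components."""
    parts = {}
    for field in sub.split(":"):
        key, _, value = field.partition("=")
        parts[key] = value
    return parts

claims = parse_sub(
    "v=1:org=Mvg9jrRd:team=AviIZ0y3:model=kW9wuKFN"
    ":deployment=e5f6g7h8:environment=production:type=model_container"
)
```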
### Claim components
| Component | Description | Example |
| ------------- | ---------------------------------------------------------------------------------- | ------------- |
| `org` | Your organization ID | `Mvg9jrRd` |
| `team` | Team ID within your organization | `AviIZ0y3` |
| `model` | Model ID | `kW9wuKFN` |
| `deployment` | Specific deployment/version ID | `e5f6g7h8` |
| `environment` | User-defined environment name (max 40 characters). Empty string if not set | `production` |
| `type` | Workload type: `model_build` or `model_container` | `model_build` |
### Workload types
* **`model_build`**: Token used during model image building (for example, pulling base images from ECR/GCR).
* **`model_container`**: Token used by running model containers (for example, downloading weights from S3/GCS).
## Subject claim patterns
Common patterns for scoping which workloads can access your resources:
* **AWS**: Use these in the IAM role **trust policy** under `Condition.StringLike` for `oidc.baseten.co:sub`. Wildcards (`*`) are supported.
* **GCP**: Use these in the Workload Identity Provider **attribute-condition**. With the mapping `google.subject=assertion.sub` (see [Create a Workload Identity Provider](#create-a-workload-identity-provider)), reference the sub claim as `google.subject`. GCP does not support wildcards; use `startsWith()` (and `contains()` where needed).
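For a local sanity check of an AWS-style wildcard pattern against a token's `sub` claim, Python's `fnmatch` approximates `StringLike` matching, where `*` matches any run of characters. This is a testing aid, not an IAM policy evaluator:

```python
from fnmatch import fnmatchcase

sub = ("v=1:org=Mvg9jrRd:team=AviIZ0y3:model=kW9wuKFN"
       ":deployment=e5f6g7h8:environment=production:type=model_container")

# A team-wide pattern and a model-scoped pattern both match this workload;
# a pattern for a different org does not.
assert fnmatchcase(sub, "v=1:org=Mvg9jrRd:team=AviIZ0y3:*")
assert fnmatchcase(sub, "v=1:org=Mvg9jrRd:team=AviIZ0y3:model=kW9wuKFN:*")
assert not fnmatchcase(sub, "v=1:org=Other:*")
```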
### All workloads in a team
To give every workload in your team access to a resource, match on the team ID with a wildcard for everything else.
```text theme={"system"}
v=1:org=Mvg9jrRd:team=AviIZ0y3:*
```
```text theme={"system"}
google.subject.startsWith('v=1:org=Mvg9jrRd:team=AviIZ0y3:')
```
### Specific model, all deployments
To restrict access to a single model while allowing all of its deployments and environments, match on the model ID.
```text theme={"system"}
v=1:org=Mvg9jrRd:team=AviIZ0y3:model=kW9wuKFN:*
```
```text theme={"system"}
google.subject.startsWith('v=1:org=Mvg9jrRd:team=AviIZ0y3:model=kW9wuKFN:')
```
### Specific environment, all models
To scope access by environment, match workloads deployed to a specific environment like `production`.
```text theme={"system"}
v=1:org=Mvg9jrRd:team=AviIZ0y3:*:environment=production:*
```
```text theme={"system"}
google.subject.startsWith('v=1:org=Mvg9jrRd:team=AviIZ0y3:') && google.subject.contains('environment=production')
```
### Build-time only access
To limit access to the build phase, like pulling base images from a private registry, match on the `model_build` workload type.
```text theme={"system"}
v=1:org=Mvg9jrRd:team=AviIZ0y3:*:type=model_build
```
```text theme={"system"}
google.subject.startsWith('v=1:org=Mvg9jrRd:team=AviIZ0y3:') && google.subject.endsWith('type=model_build')
```
### Runtime only access
To limit access to running containers, like downloading model weights, match on the `model_container` workload type.
```text theme={"system"}
v=1:org=Mvg9jrRd:team=AviIZ0y3:*:type=model_container
```
```text theme={"system"}
google.subject.startsWith('v=1:org=Mvg9jrRd:team=AviIZ0y3:') && google.subject.endsWith('type=model_container')
```
### Specific model and environment
To apply the most restrictive access, combine model and environment matching so only a specific model in a specific environment can authenticate.
```text theme={"system"}
v=1:org=Mvg9jrRd:team=AviIZ0y3:model=kW9wuKFN:*:environment=production:*
```
```text theme={"system"}
google.subject.startsWith('v=1:org=Mvg9jrRd:team=AviIZ0y3:model=kW9wuKFN:') && google.subject.contains('environment=production')
```
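Before committing a pattern to a trust policy, you can sanity-check it against a known `sub` value. Python's `fnmatch` approximates AWS `StringLike` wildcard matching (a rough sketch: `fnmatch` also treats `?` and `[...]` specially, so keep test patterns to plain `*` wildcards):

```python theme={"system"}
from fnmatch import fnmatchcase

sub = ("v=1:org=Mvg9jrRd:team=AviIZ0y3:model=kW9wuKFN:"
       "deployment=e5f6g7h8:environment=production:type=model_container")

patterns = {
    "team-wide":    "v=1:org=Mvg9jrRd:team=AviIZ0y3:*",
    "model-scoped": "v=1:org=Mvg9jrRd:team=AviIZ0y3:model=kW9wuKFN:*",
    "build-only":   "v=1:org=Mvg9jrRd:team=AviIZ0y3:*:type=model_build",
}

# fnmatchcase is case-sensitive and platform-independent, unlike fnmatch.
for name, pattern in patterns.items():
    print(name, fnmatchcase(sub, pattern))
# team-wide and model-scoped match; build-only doesn't (type is model_container)
```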
## Finding your OIDC identifiers
Use [`truss whoami --show-oidc`](/reference/cli/truss/whoami) to view your organization and team IDs, issuer, audience, and subject claim format needed for configuring cloud provider trust policies.
## Cloud provider setup
### Create an OIDC identity provider
Register Baseten as a trusted OIDC provider in your AWS account.
1. Navigate to the [AWS IAM Console](https://console.aws.amazon.com/iam/).
2. Go to **Identity providers** → **Add provider**.
3. Select **OpenID Connect**.
4. Configure the provider:
* **Provider URL**: `https://oidc.baseten.co`
* Click **Get thumbprint** to verify the provider.
* **Audience**: `oidc.baseten.co`
5. Click **Add provider**.
If your AWS account requires `sts.amazonaws.com` as a trusted audience, add it to the OIDC provider first, then add `oidc.baseten.co` as an additional audience.
### Create an IAM role
Create a role that your Baseten workloads can assume via OIDC.
1. Go to **Roles** → **Create role**.
2. Select **Web identity** as the trusted entity type.
3. Choose the OIDC provider you created.
4. Select **Audience**: `oidc.baseten.co`, then click **Next**.
5. On the next page, **attach permissions policies** for the resources your models need to access:
#### ECR access (for base images)
Attach this policy to allow pulling container images from ECR.
```json theme={"system"}
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ecr:GetAuthorizationToken",
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage"
],
"Resource": "*"
}
]
}
```
#### S3 access (for model weights)
Attach this policy to allow reading model weights from S3.
```json theme={"system"}
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::my-model-weights-bucket",
"arn:aws:s3:::my-model-weights-bucket/*"
]
}
]
}
```
6. **Configure the trust policy**: Edit the role's trust policy to include subject claim conditions. After creating the role, go to the role → **Trust relationships** → **Edit** and use a policy like this:
```json theme={"system"}
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam:::oidc-provider/oidc.baseten.co"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"oidc.baseten.co:aud": "oidc.baseten.co"
},
"StringLike": {
"oidc.baseten.co:sub": "v=1:org=Mvg9jrRd:team=AviIZ0y3:*"
}
}
}
]
}
```
Replace the account ID placeholder in the `Federated` ARN with your AWS account ID, and adjust the `sub` claim pattern to match your security requirements.
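If you manage IAM with scripts, the same trust policy can be rendered programmatically. A minimal sketch, where the account ID and `sub` pattern are placeholders to substitute with your own values:

```python theme={"system"}
import json

def baseten_trust_policy(account_id: str, sub_pattern: str) -> str:
    """Render a trust policy that lets matching Baseten workloads assume the role."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/oidc.baseten.co"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                # Exact match on audience, wildcard match on subject.
                "StringEquals": {"oidc.baseten.co:aud": "oidc.baseten.co"},
                "StringLike": {"oidc.baseten.co:sub": sub_pattern},
            },
        }],
    }
    return json.dumps(policy, indent=2)

print(baseten_trust_policy("123456789012", "v=1:org=Mvg9jrRd:team=AviIZ0y3:*"))
```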
### Create a service account
Create a service account that Baseten workloads will impersonate.
```bash theme={"system"}
gcloud iam service-accounts create baseten-oidc \
--display-name="Baseten OIDC Service Account"
```
### Grant permissions to the service account
Grant the service account access to the resources you need. You can grant one or both depending on your use case.
#### Artifact Registry access (for base images)
Grant read access to Artifact Registry for pulling container images.
```bash theme={"system"}
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:baseten-oidc@PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/artifactregistry.reader"
```
#### GCS access (for model weights)
Grant read access to Cloud Storage for downloading model weights.
```bash theme={"system"}
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:baseten-oidc@PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/storage.objectViewer"
```
### Create a Workload Identity Pool
Create a pool to manage external identities from Baseten.
```bash theme={"system"}
gcloud iam workload-identity-pools create baseten-pool \
--location="global" \
--display-name="Baseten Workload Identity Pool"
```
### Create a Workload Identity Provider
Add Baseten as an OIDC provider in the pool.
```bash theme={"system"}
gcloud iam workload-identity-pools providers create-oidc baseten-provider \
--location="global" \
--workload-identity-pool="baseten-pool" \
--issuer-uri="https://oidc.baseten.co" \
--allowed-audiences="oidc.baseten.co" \
--attribute-mapping="google.subject=assertion.sub" \
--attribute-condition="google.subject.startsWith('v=1:org=Mvg9jrRd:team=AviIZ0y3:')"
```
The **attribute mapping** `google.subject=assertion.sub` maps the OIDC `sub` claim into the `google.subject` attribute. After this mapping, you can use `google.subject` everywhere (including in `attribute-condition`) to reference the subject claim.
GCP doesn't support wildcard subject claims. Use `startsWith()` in `attribute-condition` to match workloads by prefix. Replace the organization and team IDs with your own values.
### Allow the Workload Identity to impersonate the service account
Grant the workload identity pool permission to act as the service account.
```bash theme={"system"}
gcloud iam service-accounts add-iam-policy-binding \
baseten-oidc@PROJECT_ID.iam.gserviceaccount.com \
--role="roles/iam.workloadIdentityUser" \
--member="principalSet://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/baseten-pool/*"
```
## Using OIDC in your Truss configuration
Once you've completed the AWS or GCP setup above, you can configure OIDC authentication in your Truss:
### Private registries (ECR, GCR)
For authenticating to private Docker registries using OIDC, see:
* **[AWS ECR OIDC](/development/model/private-registries#aws-oidc-recommended)**: Configure OIDC for AWS Elastic Container Registry.
* **[GCP Artifact Registry OIDC](/development/model/private-registries#gcp-oidc-recommended)**: Configure OIDC for Google Container Registry / Artifact Registry.
### Model weights (S3, GCS)
For downloading model weights from cloud storage using OIDC, see:
* **[AWS S3 OIDC](/development/model/bdn#aws-oidc-recommended)**: Configure OIDC for S3 model weights.
* **[GCS OIDC](/development/model/bdn#gcp-oidc-recommended)**: Configure OIDC for Google Cloud Storage model weights.
## Best practices
### Use least-privilege access
Use the most specific [subject claim pattern](#subject-claim-patterns) that fits your use case. Create separate roles or Workload Identity providers for different environments, workload types, or models rather than one role with broad permissions. Always test your OIDC configuration in a non-production environment first.
Don't grant access to `v=1:org=*:team=*:*`. This allows any Baseten workload to access your resources.
### Monitor and audit
* Enable CloudTrail (AWS) or Cloud Audit Logs (GCP) to track OIDC token usage.
* Set up alerts for unexpected access patterns.
* Regularly review which roles are being used.
## Troubleshooting
### Authentication failures
If your model fails to authenticate:
1. **Verify the trust relationship**: Ensure your IAM role trusts the Baseten OIDC provider (`https://oidc.baseten.co`).
2. **Check the audience**: Confirm the audience is set to `oidc.baseten.co`.
3. **Review subject claim conditions**: Verify your `sub` claim pattern matches the workload identity.
4. **Inspect your identifiers**: Run `truss whoami --show-oidc` to confirm your org and team IDs.
### Permission denied errors
If authentication succeeds but operations fail:
1. **Check IAM policies**: Ensure the role has the necessary permissions (for example, `s3:GetObject`, `ecr:BatchGetImage`).
2. **Verify resource ARNs**: Confirm bucket names, registry URLs, and other resource identifiers are correct.
3. **Review resource policies**: Some resources (like S3 buckets) have their own policies that may block access.
### Common error messages
| Error | Likely Cause | Solution |
| --------------------------------------------------------- | --------------------------------------- | ------------------------------------------- |
| "Not authorized to perform sts:AssumeRoleWithWebIdentity" | Trust policy doesn't match the workload | Check subject claim pattern in trust policy |
| "Access Denied" | Missing permissions in IAM policy | Add required permissions to the role |
| "Invalid identity token" | Issuer or audience mismatch | Verify OIDC provider configuration |
| "Token has expired" | Clock skew or token refresh issue | Contact Baseten support |
### Debugging with CloudWatch/Cloud Logging
Enable detailed logging to see exactly why authentication or authorization is failing:
**AWS CloudTrail**: Look for `AssumeRoleWithWebIdentity` events to see token validation attempts.
**GCP Cloud Audit Logs**: Check `iam.googleapis.com` logs for workload identity authentication events.
## Migration from long-lived credentials
If you're currently using long-lived AWS or GCP credentials:
1. Set up OIDC as described above.
2. Update your Truss configuration to use OIDC authentication.
3. Deploy and test your model.
4. Once confirmed working, remove the long-lived credentials.
5. Delete any secrets containing long-lived credentials from Baseten.
Both OIDC and long-lived credential authentication methods are supported. You can migrate gradually, starting with non-production environments.
## Limitations
* OIDC tokens can't be customized.
* Baseten manages token lifetime and claims.
* Only AWS and GCP services are supported.
* GCP doesn't support wildcard subject claims or subject-based scoping in IAM role conditions. Use the Workload Identity Provider `attribute-condition` instead.
* Cloudflare R2, Azure containers, and Hugging Face aren't yet supported.
# Organization settings
Source: https://docs.baseten.co/organization/overview
Manage your Baseten organization's access, security, and resources.
* **[Access control](/organization/access)**: Manage roles and permissions.
* **[Teams](/organization/teams)**: Segment resources across multiple teams (Enterprise).
* **[API keys](/organization/api-keys)**: Authenticate requests for deployment, inference, and management.
* **[Secrets](/organization/secrets)**: Store and access sensitive credentials in deployed models.
* **[Restricted environments](/organization/restricted-environments)**: Control environment access.
# Restricted environments
Source: https://docs.baseten.co/organization/restricted-environments
Control access to sensitive environments like production with environment-level permissions.
Restricted environments let organization Admins lock down specific environments so that
only designated users can modify settings and configurations.
Use restricted environments to prevent unauthorized changes to critical
environments like production.
For more information on user roles, see
[Access control](/organization/access) and
[Environments](/deployment/environments).
## How restricted environments work
By default, environments are unrestricted, meaning any organization member can modify
deployments, autoscaling settings, and other configurations.
When you mark an environment as restricted, only users you explicitly grant access can
make changes.
Restricted environments apply across all models and Chains in your organization.
For example, if you restrict an environment named `production`, that restriction applies to
every model and chain's production environment, not just one specific model or chain.
If your organization uses [teams](/organization/teams), restricted environments are scoped to individual teams.
Team Admins can create and manage restricted environments for their team.
### Permissions by access level
| Action | With access | Without access |
| :------------------------------------- | ----------- | -------------- |
| View environment and configuration | ✅ | ✅ (read-only) |
| View metrics | ✅ | ✅ (read-only) |
| Call inference on models and chains | ✅ | ✅ |
| View logs | ✅ | ✅ |
| Modify deployment settings | ✅ | ❌ |
| Change autoscaling configurations | ✅ | ❌ |
| Promote deployments to the environment | ✅ | ❌ |
| Manage environment-specific settings | ✅ | ❌ |
Users without access see a grayed-out UI for restricted actions.
They retain full read access and can still call inference endpoints.
## Managing restricted environments
Only organization **Admins** can create or modify restricted environments.
Members (non-admin users) can only create unrestricted environments and can't change
environment restrictions.
### From the environments page
1. Navigate to **Settings** and then choose **Environments**.
2. Select an existing environment to modify, or select **Create environment** to create a new one.
3. Set the access level to **Restricted**.
4. Add users by searching by name or by email.
5. Select **Save changes** or **Create environment**.
### From a model or chain
1. Go to your model or chain's management page.
2. Select an existing environment to modify, or select **Add environment** then **Create environment** to create a new one.
3. Set the access level to **Restricted**.
4. Add users by searching by name or by email.
5. Select **Save changes** or **Create environment**.
Only admins can create restricted environments, and all admins have implicit
access to every restricted environment. If an admin is later demoted to a member
role, they lose this implicit access and can be removed from the environment
like any other member.
## API behavior
Restricted environments apply the same permission checks to
[API](/reference/management-api/environments/create-an-environment) and
[truss CLI](/reference/cli/truss/push) operations as the UI. API keys inherit
the permissions of their associated user.
If you attempt to modify a restricted environment using an API key associated with a
user without access, you'll receive a `403 Forbidden` error.
This includes operations like:
* Promoting deployments through the
[promote endpoint](/reference/management-api/deployments/promote/promotes-a-deployment-to-an-environment).
* Updating autoscaling settings through the
[autoscaling endpoint](/reference/management-api/deployments/autoscaling/updates-a-deployments-autoscaling-settings).
* Modifying environment configurations through the
[update environment endpoint](/reference/management-api/environments/update-an-environments-settings).
Users without access can still call inference endpoints, as restrictions only apply to
management operations.
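A script that promotes deployments can distinguish a restriction error from other failures by checking for a `403` status. A minimal stdlib sketch: the `Api-Key` header format matches the inference examples in these docs, but the endpoint path here is illustrative and should be confirmed against the management API reference before use:

```python theme={"system"}
import json
import urllib.error
import urllib.request

class RestrictedEnvironmentError(Exception):
    """Raised when the API key's user lacks access to a restricted environment."""

def raise_for_restriction(status: int, body: str) -> None:
    if status == 403:
        raise RestrictedEnvironmentError(body)

def promote(api_key: str, model_id: str, deployment_id: str) -> dict:
    # Illustrative path; confirm against the management API reference.
    url = (f"https://api.baseten.co/v1/models/{model_id}"
           f"/deployments/{deployment_id}/promote")
    req = urllib.request.Request(
        url, method="POST", headers={"Authorization": f"Api-Key {api_key}"}
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as err:
        raise_for_restriction(err.code, err.read().decode())
        raise

# The pure check can be exercised without the network:
try:
    raise_for_restriction(403, "user lacks access to restricted environment")
except RestrictedEnvironmentError as err:
    print("blocked:", err)
```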
# Secrets
Source: https://docs.baseten.co/organization/secrets
Store and access sensitive credentials in your deployed models.
Secrets store sensitive credentials like API keys, access tokens, and passwords that your models need at runtime.
Secrets are encrypted and injected into your model's environment when it runs.
If your organization uses [teams](/organization/teams), secrets are scoped to individual teams.
Models, Chains, and training projects deployed to a team can only access that team's secrets.
## Create a secret
To create a secret:
1. Navigate to the **Secrets** tab in your settings. If your organization uses [teams](/organization/teams), navigate to the team's settings page.
2. Enter a name for the secret.
3. Enter the secret value.
4. Select **Add secret**.
Secret names follow these rules:
* Non-alphanumeric characters are normalized (for example, `hf_access_token` and `hf-access-token` are treated as the same name).
* Editing a secret's value overwrites the previous value.
* Changes take effect immediately for all deployments using the secret.
## Use secrets in your model
To use secrets in your Truss model, see [Secrets](/development/model/secrets).
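At runtime, Truss injects secrets through the keyword arguments passed to your model's `__init__`. A minimal sketch (see the linked Secrets page for the full contract, including declaring each secret in `config.yaml`):

```python theme={"system"}
class Model:
    def __init__(self, **kwargs):
        # Truss injects a dict-like "secrets" object at runtime; each key must
        # also be declared in config.yaml under `secrets:`.
        self._secrets = kwargs["secrets"]
        self._hf_token = self._secrets["hf_access_token"]

    def load(self):
        pass  # e.g. authenticate to a model hub with self._hf_token

    def predict(self, model_input):
        return model_input

# Local smoke test with a fake secrets mapping:
model = Model(secrets={"hf_access_token": "test-token"})
print(model._hf_token)
# → test-token
```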
## Security recommendations
* Create secrets through the Baseten dashboard, not in code.
* Use descriptive names that indicate the secret's purpose.
* Rotate secrets periodically by updating the value in the dashboard.
* Delete unused secrets to reduce exposure risk.
# Teams
Source: https://docs.baseten.co/organization/teams
Organize your organization into multiple teams with isolated resources and granular access control.
Teams let you segment your Baseten organization into multiple isolated
groups, each with its own resources, members, and access controls. Use teams to
separate environments by function, project, or access level.
Teams are available for organizations on our Enterprise tier.
[Contact us](mailto:support@baseten.co) to enable teams for your
organization.
## How teams work
Every organization has a **default team** that contains all existing resources.
Until additional teams are enabled, you work within this default team without seeing any
team-specific UI.
When teams are enabled, Organization Admins can create additional teams within the
organization. Each team operates as an isolated unit with its own:
* Models, Chains, and training projects
* Secrets
* Team-level API keys
* Restricted environments
* Team members and roles
Billing remains at the organization level. All teams within an organization
share the same billing account and usage tracking.
## Roles and permissions
Teams introduce a two-level role hierarchy:
* Organization roles
* Team roles
### Organization roles
Organization-level roles determine what a user can do across the entire organization:
| Permission | Admin | Member |
| :-------------------------- | ----- | ------ |
| Manage billing | ✅ | ❌ |
| Manage teams | ✅ | ❌ |
| Manage organization members | ✅ | ❌ |
| View all teams | ✅ | ❌ |
Organization Admins have implicit admin-level access to all teams and all restricted environments.
### Team roles
Team-level roles determine what a user can do within a specific team:
| Permission | Team Admin | Team Member |
| :------------------------------------------- | ---------- | ----------- |
| Manage team members | ✅ | ❌ |
| Create restricted environments | ✅ | ❌ |
| Create team API keys | ✅ | ❌ |
| Deploy models, Chains, and training projects | ✅ | ✅ |
| Call models | ✅ | ✅ |
| View team resources | ✅ | ✅ |
A user can have different roles in different teams. For example, a data scientist might be a Team Admin for the Research team where they run experiments, while having Team Member access to the Inference team to deploy trained models.
## Manage teams
Organization Admins can create and delete teams. Team Admins can manage membership within their teams.
### Create a team
To create a team:
1. From the left navigation, select the dropdown next to the team name and select **Create new team**.
2. Enter a team name and optionally select an icon.
3. Choose **Create team**.
The default team cannot be deleted, but you can rename it.
### Invite members to a team
To invite a new member and add them to teams:
1. Navigate to **Organization settings** and select the **Members** tab.
2. Select **Invite member**.
3. Enter the member's email address.
4. Select the organization role: **Admin** or **Member**.
5. Select the teams to add them to.
6. For each team, set their team role: **Team Admin** or **Team Member**.
7. Select **Invite member**.
The invited user receives an email to join the organization and is automatically added to the selected teams with the specified roles.
To add an existing organization member to a team, navigate to the team's settings page, select the **Members** tab, and add them from there.
### Remove a member
To remove a member from the organization:
1. Navigate to **Organization settings** and select the **Members** tab.
2. Find the member you want to remove.
3. Select the trash icon next to their name.
Removing a member from the organization removes them from all teams.
To remove a member from a specific team without removing them from the organization, navigate to the team's settings page, select the **Members** tab, and remove them from there.
### Change a member's role
To change a member's organization or team roles:
1. Navigate to **Organization settings** and select the **Members** tab.
2. Select the pencil icon next to the member's name.
3. Update their organization role or team assignments as needed.
4. Select **Save changes**.
You can also change a member's team role from the team's settings page by navigating to the **Members** tab.
### Switch between teams
Use the team selector in the navigation to switch between teams.
The team selector displays all teams you have access to.
Selecting a team filters the view to show only that team's resources and settings.
## Team-scoped resources
### Secrets
Secrets are scoped to individual teams.
Each team maintains its own set of secrets, and models deployed to a team can only access that team's secrets.
To manage secrets for a team:
1. Switch to the team using the team selector in the navigation.
2. Navigate to **Settings** and select **Secrets**.
3. Add or modify secrets for that team.
For more information, see [Best practices for secrets](/organization/secrets).
### API keys
API keys can be personal or team-scoped:
* **Personal API keys** are tied to your user account and provide access to resources across all teams you belong to. Use personal keys for local development and testing.
* **Team API keys** are scoped to a single team and can only access that team's resources. Use team keys for automation and production deployments. Only Team Admins and organization Admins can create team API keys.
To create a team API key:
1. Navigate to **Settings** and select **API Keys**.
2. Select **Create API Key**.
3. Choose the team to scope the key to.
4. Name the key and select **Create**.
For more information, see [Best practices for API keys](/organization/api-keys).
### Restricted environments
Restricted environments work at the team level. When you create a restricted
environment, it applies to all models and Chains within that team.
For more information, see
[Restricted environments](/organization/restricted-environments).
## Deploy to a team
To deploy to a team, you can use the Truss CLI or the UI.
### Use the Truss CLI
To deploy a model to a specific team, use the `--team` flag with `truss push`:
```sh theme={"system"}
truss push --team your-team-name
```
If you omit the `--team` flag, Truss infers the target team using the following logic:
1. If you belong to only one team, Truss deploys to that team.
2. If a model with the same name exists in only one of your accessible teams, Truss deploys to that team.
3. If there is ambiguity (for example, the same model name exists in multiple teams), Truss prompts you to select a team.
### Use the UI
The team selector determines which team a model belongs to when you create or deploy through the Baseten console.
To deploy to a specific team, switch to that team before creating or deploying resources.
## Considerations
### Model APIs
Model APIs are only available in the default team.
You can't create or access Model APIs from other teams.
### Billing
Billing is managed at the organization level.
There's no team-level billing breakdown or budget controls.
All usage across teams is aggregated in the organization's billing dashboard, which is visible only to organization Admins.
### Resource naming
Model and Chain names must be unique within a team.
The same name can exist in different teams, but this may require explicit team specification when using the Truss CLI.
## Migrate to multiple teams
When teams are enabled for your organization, all existing resources remain in the default team.
You can then create additional teams and organize resources based on your needs.
Common team structures include:
* **By organizational structure**: Create teams for distinct departments or groups within your organization using Baseten.
* **By function**: Separate teams for different projects or use cases (for example, a training team and an inference team).
* **By access level**: Separate teams based on who should have access to modify production resources.
Rather than creating a team per environment, the recommended way to manage environments on Baseten is with [deployment environments](/deployment/environments), since this allows for centralized management, promotion workflows, and varying levels of access control.
There is no single correct way to structure teams.
Consider your organization's access control needs, how you want to isolate secrets and credentials, and how different groups within your organization work with Baseten.
To move a model or Chain to a different team, redeploy it while switched to the target team. The original resource in the default team can then be deleted if no longer needed.
# Overview
Source: https://docs.baseten.co/overview
Baseten helps you train, deploy, and serve AI models at scale with high performance and cost efficiency.
Baseten is a training and inference platform.
Bring a model (an open-source LLM from Hugging Face, a fine-tuned checkpoint, or a custom model) and Baseten turns it into a production API endpoint with autoscaling, observability, and optimized serving infrastructure.
Baseten handles containerization, GPU scheduling across multiple clouds, and engine-level optimizations like TensorRT-LLM compilation, so you can focus on your model and your application.
If you want to skip deployment entirely and start making inference calls right now, [Model APIs](/development/model-apis/overview) provide OpenAI-compatible endpoints for models like DeepSeek, Qwen, and GLM.
Point the OpenAI SDK at Baseten's URL to run inference in seconds.
Call a model through Model APIs in under two minutes. No deployment, no setup, just an API key and a request.
## How models get deployed
The most common way to deploy a model on Baseten is with [Truss](https://pypi.org/project/truss/), an open-source framework that packages your model into a deployable container.
For supported architectures (most popular open-source LLMs, embedding models, and image generators), deployment requires only a `config.yaml` file.
Specify the model, the hardware, and the engine, and Truss handles the rest.
```yaml config.yaml theme={"system"}
model_name: Qwen-2.5-3B
resources:
accelerator: L4
use_gpu: true
trt_llm:
build:
base_model: decoder
checkpoint_repository:
source: HF
repo: "Qwen/Qwen2.5-3B-Instruct"
```
Run `truss push` and Baseten builds a TensorRT-optimized container, deploys it to GPU infrastructure, and provides an endpoint.
The model serves an OpenAI-compatible API out of the box.
When you need custom behavior like preprocessing, postprocessing, or a model architecture that the built-in engines don't support, Truss also supports [custom Python model code](/development/model/custom-model-code).
Write a `Model` class with `load` and `predict` methods and Truss packages it the same way.
Most teams start with config-only deployments and add custom code only when they need it.
Deploy a model to Baseten with just a config file. No custom code needed.
## Inference engines
Baseten optimizes every deployment with an inference engine tuned for your model's architecture. Select the engine that best supports your use case and it handles the low-level performance work: quantization, tensor parallelism, KV cache management, and batching.
Dense text generation models compiled with TensorRT-LLM. Supports lookahead decoding and structured outputs.
Large mixture-of-experts models like DeepSeek R1 and Qwen3 MoE with KV-aware routing and distributed inference.
Embedding, reranking, and classification models with up to 1,400 embeddings per second throughput.
Choose the engine through a field in your `config.yaml`, or Baseten selects it automatically based on your model architecture.
## Multi-step workflows with Chains
Some applications need more than a single model call. A RAG pipeline retrieves documents, embeds them, and generates a response. An image generation workflow runs a diffusion model, upscales the result, and applies safety filtering.
[Chains](/development/chain/overview) is Baseten's framework for orchestrating these multi-step pipelines. Each step runs on its own hardware with its own dependencies, and Chains manages the data flow between them. Define the pipeline in Python, and Chains deploys, scales, and monitors each step independently.
## Training
Baseten also provides [training infrastructure](/training/overview) for fine-tuning and pre-training. Bring your training scripts (Axolotl, TRL, Megatron, or custom code) and run jobs on H200 or A10G GPUs. Checkpoints sync automatically during training, and you can deploy a fine-tuned model from checkpoint to production endpoint in a single command with `truss train deploy_checkpoints`.
## Production infrastructure
Every deployment on Baseten runs on autoscaling infrastructure that adjusts replicas based on traffic. Configure minimum and maximum replicas, concurrency targets, and scale-down delays. Or use the defaults, which handle most workloads well. Models scale to zero when idle, eliminating costs during quiet periods, and scale up within seconds when traffic arrives.
Baseten schedules workloads across multiple cloud providers and regions through Multi-Cloud Capacity Management. This means your models stay available even during provider-level disruptions, and traffic routes to the lowest-latency region automatically.
Built-in [observability](/observability/metrics) gives you real-time metrics, logs, and request traces for every deployment. Export data to tools like Datadog or Prometheus, and debug behavior with full visibility into inputs, outputs, and errors.
## Next steps
The build pipeline, request routing, autoscaling, and deployment lifecycle under the hood.
End-to-end guides for deploying and optimizing popular models.
Reference for the inference API, management API, and Truss CLI.
# Quickstart
Source: https://docs.baseten.co/quickstart
Start running inference on Baseten.
This quickstart walks you through calling an LLM on Baseten using Model APIs.
Sign up, create an API key, and make a chat completion request in just a few minutes with no model deployment required.
Model APIs provide OpenAI-compatible endpoints for high-performance open-source models. If your code already works with the OpenAI SDK, it works with Baseten. Change the base URL and API key to start running inference.
## Prerequisites
* A **[Baseten account](https://app.baseten.co/signup)** with an [API key](https://app.baseten.co/settings/account/api_keys)
* **Python 3.9+** (check with `python3 --version`)
Set up your environment:
```bash theme={"system"}
uv venv && source .venv/bin/activate
export BASETEN_API_KEY="paste-your-api-key-here"
```
```bash theme={"system"}
python3 -m venv .venv && source .venv/bin/activate
export BASETEN_API_KEY="paste-your-api-key-here"
```
```bash theme={"system"}
python -m venv .venv && .venv\Scripts\activate
set BASETEN_API_KEY=paste-your-api-key-here
```
## Run inference
Call a model using the OpenAI SDK.
This example uses GLM-4.7, but you can substitute any model from the [supported models list](/development/model-apis/overview#supported-models).
Install the OpenAI SDK if you don't have it:
```bash theme={"system"}
uv pip install openai
```
Create a chat completion:
```python chat.py theme={"system"}
from openai import OpenAI
import os
client = OpenAI(
base_url="https://inference.baseten.co/v1",
api_key=os.environ["BASETEN_API_KEY"],
)
response = client.chat.completions.create(
model="zai-org/GLM-4.7",
messages=[
{"role": "user", "content": "What is inference in machine learning?"}
],
)
print(response.choices[0].message.content)
```
Install the OpenAI SDK if you don't have it:
```bash theme={"system"}
npm install openai
```
Create a chat completion:
```javascript chat.js theme={"system"}
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "https://inference.baseten.co/v1",
apiKey: process.env.BASETEN_API_KEY,
});
const response = await client.chat.completions.create({
model: "zai-org/GLM-4.7",
messages: [
{ role: "user", content: "What is inference in machine learning?" }
],
});
console.log(response.choices[0].message.content);
```
```bash theme={"system"}
curl https://inference.baseten.co/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Api-Key $BASETEN_API_KEY" \
-d '{
"model": "zai-org/GLM-4.7",
"messages": [
{"role": "user", "content": "What is inference in machine learning?"}
]
}'
```
Success looks like this:
```output theme={"system"}
Inference in machine learning refers to the process of using a trained model
to make predictions or generate outputs from new input data...
```
That's it. You're running inference on Baseten.
## Stream the response
For real-time applications, set `stream: true` to receive tokens as they're generated:
```python stream.py theme={"system"}
stream = client.chat.completions.create(
model="zai-org/GLM-4.7",
messages=[
{"role": "user", "content": "Write a haiku about machine learning."}
],
stream=True,
)
for chunk in stream:
content = chunk.choices[0].delta.content
if content:
print(content, end="")
```
```javascript stream.js theme={"system"}
const stream = await client.chat.completions.create({
model: "zai-org/GLM-4.7",
messages: [
{ role: "user", content: "Write a haiku about machine learning." }
],
stream: true,
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) process.stdout.write(content);
}
```
## Explore Model API features
Model APIs support the full OpenAI Chat Completions API. Constrain outputs to a JSON schema, let the model call functions you define, or enable extended thinking for complex tasks. See the [Model APIs documentation](/development/model-apis/overview) for the full parameter reference and supported models.
Generate JSON that conforms to a schema you define.
Let the model invoke functions and use the results in its response.
Enable extended thinking for multi-step problem solving.
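As a sketch of the structured-output pattern: the payload below constrains the model to a JSON schema via the OpenAI-style `response_format` parameter. The schema and field names are illustrative, and parameter support varies by model; check the Model APIs reference for what each model accepts.

```python
# Illustrative JSON schema; any valid JSON Schema object works here.
WEATHER_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "temperature_c": {"type": "number"},
    },
    "required": ["city", "temperature_c"],
}

def structured_request(prompt: str) -> dict:
    """Build chat-completion kwargs that constrain output to a JSON schema."""
    return {
        "model": "zai-org/GLM-4.7",
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "weather_report", "schema": WEATHER_SCHEMA},
        },
    }

payload = structured_request("Report the weather in Paris as JSON.")
print(payload["response_format"]["type"])  # json_schema
```

Pass the same keyword arguments to the OpenAI client from the quickstart: `client.chat.completions.create(**payload)`.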
## Deploy your own model
Model APIs offer the fastest start, but when you need dedicated infrastructure or want to run a model Baseten doesn't host, deploy your own with [Truss](https://pypi.org/project/truss/). A `config.yaml` is all it takes. Point Truss at a Hugging Face model, choose a GPU, and run `truss push`:
```yaml config.yaml theme={"system"}
model_name: Qwen-2.5-3B
resources:
accelerator: L4
use_gpu: true
trt_llm:
build:
base_model: decoder
checkpoint_repository:
source: HF
repo: "Qwen/Qwen2.5-3B-Instruct"
```
Baseten builds a TensorRT-optimized container and provides an OpenAI-compatible endpoint.
Walk through a full config-only deployment from scratch.
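Dedicated deployments are invoked with your API key against a model-specific URL. The sketch below builds such a request with only the standard library; the model ID `abcd1234` is a placeholder, and the URL shape is an assumption based on the standard dedicated-endpoint format, so copy the exact invocation URL from your deployment's page in the Baseten dashboard.

```python
import json
import os
import urllib.request

MODEL_ID = "abcd1234"  # placeholder; copy your model ID from the Baseten dashboard

def build_predict_request(payload: dict) -> urllib.request.Request:
    """Build a POST request against a dedicated deployment's predict endpoint."""
    url = f"https://model-{MODEL_ID}.api.baseten.co/environments/production/predict"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Api-Key {os.environ.get('BASETEN_API_KEY', '')}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_predict_request({"prompt": "What is inference?"})
print(req.full_url)
# Send it with: response = urllib.request.urlopen(req)
```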
## Choose an inference engine
Every deployment on Baseten uses an inference engine tuned for the model's architecture. The engine handles quantization, tensor parallelism, KV cache management, and batching. Select the engine in your `config.yaml`, or let Baseten select it automatically based on the model.
Dense text generation models compiled with TensorRT-LLM. Supports lookahead decoding and structured outputs.
Large mixture-of-experts models like DeepSeek R1 and Qwen3 MoE with KV-aware routing and distributed inference.
Embedding, reranking, and classification models with throughput of up to 1,400 embeddings per second.
## Build multi-step workflows
Some applications need more than a single model call. A RAG pipeline retrieves documents, embeds them, and generates a response. An image generation workflow runs a diffusion model, upscales the result, and applies safety filtering. [Chains](/development/chain/overview) orchestrates these multi-step pipelines, with each step running on its own hardware and scaling independently.
Build your first multi-step pipeline.
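A plain-Python sketch of that RAG shape, with stubbed logic for illustration; with Chains, each function below would become a chainlet running on its own hardware and scaling independently:

```python
def retrieve(query: str) -> list[str]:
    # Stub: in a real Chain this step might query a vector database.
    return [f"doc about {query}"]

def embed(docs: list[str]) -> list[list[float]]:
    # Stub: in a real Chain this step would run an embedding model on a GPU.
    return [[float(len(d))] for d in docs]

def generate(query: str, docs: list[str]) -> str:
    # Stub: in a real Chain this step would call an LLM with the retrieved context.
    return f"Answer to '{query}' using {len(docs)} document(s)"

def rag_pipeline(query: str) -> str:
    docs = retrieve(query)
    _ = embed(docs)  # e.g. used to rerank documents before generation
    return generate(query, docs)

print(rag_pipeline("inference"))
```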
## Train and fine-tune models
Baseten provides [training infrastructure](/training/overview) for fine-tuning and pre-training. Bring your training scripts (Axolotl, TRL, or custom code) and run jobs on H200 GPUs. Push a training job and deploy the result in two commands:
```bash theme={"system"}
truss train push config.yaml
truss train deploy_checkpoints --training-job-id
```
Run your first fine-tuning job and deploy the checkpoint.
## Scale and monitor in production
Every deployment on Baseten runs on autoscaling infrastructure that adjusts replicas based on traffic. Models scale to zero when idle and scale up within seconds when requests arrive. Built-in observability gives you real-time metrics, logs, and request traces for every deployment.
Configure replicas, concurrency targets, and scale-to-zero.
Monitor performance with metrics, logs, and traces.
## Find your path
If you're integrating a model into your application, start with Model APIs and explore the features that support production use cases.
* [Model APIs overview](/development/model-apis/overview)
* [Structured outputs](/engines/performance-concepts/structured-outputs)
* [Tool calling](/engines/performance-concepts/function-calling)
* [RAG pipeline example](/examples/chains-build-rag)
If you're deploying models on dedicated infrastructure, start with a config-only Truss deployment and tune from there.
* [Deploy your first model](/development/model/build-your-first-model)
* [Engine selection](/engines/index)
* [Autoscaling](/deployment/autoscaling/overview)
* [Performance optimization](/development/model/performance-optimization)
If you're training or fine-tuning models, start with a training job and deploy the result directly to an endpoint.
* [Training overview](/training/overview)
* [Get started with training](/training/getting-started)
* [Deploy from checkpoint](/training/deployment)
## Next steps
End-to-end guides for deploying and optimizing popular models.
Reference for the inference API, management API, and Truss CLI.
# Chains CLI reference
Source: https://docs.baseten.co/reference/cli/chains/chains-cli
Deploy, manage, and develop Chains using the Truss CLI.
```sh theme={"system"}
truss chains [OPTIONS] COMMAND [ARGS]...
```
| Command | Description |
| ----------------- | -------------------------- |
| [`init`](#init) | Initialize a Chain project |
| [`push`](#push) | Deploy a Chain |
| [`watch`](#watch) | Live reload development |
***
## `init`
Initialize a Chain project.
```sh theme={"system"}
truss chains init [OPTIONS] [DIRECTORY]
```
* `DIRECTORY` (optional): Path to a new or empty directory for the Chain. Defaults to the current directory if omitted.
**Options:**
* `--log` `[humanfriendly | INFO | DEBUG]`: Set log verbosity.
* `--help`: Show this message and exit.
**Example:**
To create a new Chain project in a directory called `my-chain`, use the following:
```sh theme={"system"}
truss chains init my-chain
```
***
## `push`
Deploy a Chain.
```sh theme={"system"}
truss chains push [OPTIONS] SOURCE [ENTRYPOINT]
```
* `SOURCE`: Path to a Python file that contains the entrypoint chainlet.
* `ENTRYPOINT` (optional): Class name of the entrypoint chainlet. If omitted, the chainlet tagged with `@chains.mark_entrypoint` is used.
**Options:**
* `--name` (TEXT): Custom name for the Chain (defaults to entrypoint name).
* `--promote / --no-promote`: Promote newly deployed chainlets into production.
* `--environment` (TEXT): Deploy chainlets into a particular environment.
* `--wait / --no-wait`: Wait until all chainlets are ready (or deployment failed).
* `--watch / --no-watch`: Watch the Chains source code and apply live patches. Using this option waits for the Chain to be deployed (the `--wait` flag is applied) before starting to watch for changes. This option requires the deployment to be a development deployment.
* `--experimental-watch-chainlet-names` (TEXT): Run `watch`, but only apply patches to specified chainlets. The option is a comma-separated list of chainlet (display) names. This option can give faster dev loops, but also lead to inconsistent deployments. Use with caution and refer to the [chain watch documentation](/development/chain/watch).
* `--dryrun`: Produce only generated files, but don't deploy anything.
* `--remote` (TEXT): Name of the remote in .trussrc to push to.
* `--include-git-info`: Attach git versioning info (sha, branch, tag) to the deployment.
* `--disable-chain-download`: Disable downloading of pushed chain source code from the UI.
* `--deployment-name` (TEXT): Name of the deployment created by the push. Can be used with `--promote`.
* `--team` (TEXT): Name of the team to deploy to. If not specified, Truss infers the team or prompts for selection.
* `--log` `[humanfriendly|I|INFO|D|DEBUG]`: Customize logging.
* `--help`: Show this message and exit.
The `--team` flag is only available if your organization has teams enabled. [Contact us](mailto:support@baseten.co) to enable teams, or see [Teams](/organization/teams) for more information.
**Example:**
To deploy a Chain as a published deployment:
```sh theme={"system"}
truss chains push my_chain.py
```
To create a development deployment and start watching for changes:
```sh theme={"system"}
truss chains push my_chain.py --watch
```
To deploy and promote to production:
```sh theme={"system"}
truss chains push my_chain.py --promote
```
To deploy to a specific team:
```sh theme={"system"}
truss chains push my_chain.py --team my-team-name
```
***
## `watch`
Live reload development.
```sh theme={"system"}
truss chains watch [OPTIONS] SOURCE [ENTRYPOINT]
```
* `SOURCE`: Path to a Python file containing the entrypoint chainlet.
* `ENTRYPOINT` (optional): Class name of the entrypoint chainlet. If omitted, the chainlet tagged with `@chains.mark_entrypoint` is used.
**Options:**
* `--name` (TEXT): Name of the Chain to be deployed. If not given, the entrypoint name is used.
* `--remote` (TEXT): Name of the remote in .trussrc to push to.
* `--team` (TEXT): Name of the team to deploy to. If not specified, Truss infers the team or prompts for selection.
The `--team` flag is only available if your organization has teams enabled. [Contact us](mailto:support@baseten.co) to enable teams, or see [Teams](/organization/teams) for more information.
* `--experimental-chainlet-names` (TEXT): Run `watch`, but only apply patches to specified chainlets. The option is a comma-separated list of chainlet (display) names. This option can give faster dev loops, but also lead to inconsistent deployments. Use with caution and refer to the [chain watch documentation](/development/chain/watch).
* `--log` `[humanfriendly|W|WARNING|I|INFO|D|DEBUG]`: Customize logging.
* `--help`: Show this message and exit.
**Example:**
To watch a Chain for live reload during development, use the following:
```sh theme={"system"}
truss chains watch my_chain.py
```
# Truss CLI overview
Source: https://docs.baseten.co/reference/cli/index
Install and configure the Truss CLI for deploying models, chains, and training jobs.
The `truss` CLI is your primary interface for everything from packaging and
deploying AI models to building and orchestrating multi-step chains to launching and
managing training jobs.
Use the following commands to manage your models, chains, and training jobs:
* **Models**: Package and deploy individual model servers.
* **Chains**: Build and deploy multi-step inference pipelines.
* **Training**: Launch and manage training jobs.
Install [Truss](https://pypi.org/project/truss/):
```bash theme={"system"}
uv venv && source .venv/bin/activate
uv pip install --upgrade truss
```
```bash theme={"system"}
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade truss
```
```bash theme={"system"}
python -m venv .venv && .venv\Scripts\activate
pip install --upgrade truss
```
You also need a [Baseten account](https://app.baseten.co/signup) with an [API key](https://app.baseten.co/settings/account/api_keys).
## CLI structure
The `truss` CLI organizes commands by workflow:
```
truss [OPTIONS] COMMAND [ARGS]...
```
### Model commands
Use these commands to package, deploy, and iterate on individual models.
| Command | Description |
| ----------------------------------------------------- | --------------------------------- |
| [`truss login`](/reference/cli/truss/login) | Authenticate with Baseten |
| [`truss init`](/reference/cli/truss/init) | Create a new Truss project |
| [`truss push`](/reference/cli/truss/push) | Deploy a model to Baseten |
| [`truss watch`](/reference/cli/truss/watch) | Live reload during development |
| [`truss predict`](/reference/cli/truss/predict) | Call the packaged model |
| [`truss model-logs`](/reference/cli/truss/model-logs) | Fetch logs for the packaged model |
### Chain commands
Use these commands to build multi-model pipelines with shared dependencies.
| Command | Description |
| -------------------------------------------------------------- | ------------------------------ |
| [`truss chains init`](/reference/cli/chains/chains-cli#init) | Initialize a new Chain project |
| [`truss chains push`](/reference/cli/chains/chains-cli#push) | Deploy a Chain to Baseten |
| [`truss chains watch`](/reference/cli/chains/chains-cli#watch) | Live reload Chain development |
### Training commands
Use these commands to launch, monitor, and manage training jobs.
| Command | Description |
| --------------------------------------------------------------- | ------------------------------- |
| [`truss train init`](/reference/cli/training/training-cli#init) | Initialize a training project |
| [`truss train push`](/reference/cli/training/training-cli#push) | Deploy and run a training job |
| [`truss train logs`](/reference/cli/training/training-cli#logs) | Stream logs from a training job |
| [`truss train view`](/reference/cli/training/training-cli#view) | List and inspect training jobs |
## Authentication
After installing Truss, authenticate with Baseten using either method:
**Option 1: Environment variable (recommended for CI/CD)**
```sh theme={"system"}
export BASETEN_API_KEY="YOUR_API_KEY"
```
**Option 2: Interactive login**
```sh theme={"system"}
truss login
```
This opens a browser window to authenticate and stores your credentials locally.
## Next steps
Package and deploy a model in minutes.
Create multi-step inference pipelines.
Fine-tune models on Baseten infrastructure.
Configure dependencies, resources, and more.
# Training CLI reference
Source: https://docs.baseten.co/reference/cli/training/training-cli
Deploy, manage, and monitor training jobs using the Truss CLI.
The `truss train` command provides subcommands for managing the full training job lifecycle.
```sh theme={"system"}
truss train [COMMAND] [OPTIONS]
```
### Universal options
The following options are available for all `truss train` commands:
* `--help`: Show help message and exit.
* `--non-interactive`: Disable interactive prompts (for CI/automated environments).
* `--remote TEXT`: Name of the remote in `.trussrc`.
***
## init
Initialize a training project from templates or create an empty project.
```sh theme={"system"}
truss train init [OPTIONS]
```
### Options
Template name or comma-separated list of templates to initialize. See the [ML Cookbook](https://github.com/basetenlabs/ml-cookbook) for available examples.
Directory to initialize the project in. Defaults to current directory.
List all available example templates.
### Examples
Initialize a project from a template:
```sh theme={"system"}
truss train init --examples qwen3-8b-lora-dpo-trl
```
Initialize multiple templates:
```sh theme={"system"}
truss train init --examples qwen3-8b-lora-dpo-trl,qwen3-8b-lora-verl
```
List available templates:
```sh theme={"system"}
truss train init --list-examples
```
Create an empty training project:
```sh theme={"system"}
truss train init
```
***
## push
Submit and run a training job.
```sh theme={"system"}
truss train push [OPTIONS] CONFIG
```
### Arguments
Path to the training configuration file (e.g., `config.py`).
### Options
Stream status and logs after submitting the job.
Name for the training job.
Team name for the training project. If not specified, Truss infers the team or prompts for selection.
The `--team` flag is only available if your organization has teams enabled. [Contact us](mailto:support@baseten.co) to enable teams, or see [Teams](/organization/teams) for more information.
Enable an [interactive rSSH session](/training/interactive-sessions) on the training job. Options: `on_startup`, `on_failure`, `on_demand`.
Session timeout in minutes. Defaults to 480 (8 hours).
Override the training job's entrypoint command. Use `"bash"` with `--interactive` for a clean container to experiment in before running anything.
GPU type and count in `TYPE:COUNT` format (e.g., `H200:8`).
Number of compute nodes for the training job.
### Examples
Submit a training job:
```sh theme={"system"}
truss train push config.py
```
Submit and stream logs:
```sh theme={"system"}
truss train push config.py --tail
```
Submit to a specific team:
```sh theme={"system"}
truss train push config.py --team my-team-name
```
Submit with a custom job name:
```sh theme={"system"}
truss train push config.py --job-name fine-tune-v1
```
***
## logs
Fetch and stream logs from a training job.
```sh theme={"system"}
truss train logs [OPTIONS]
```
### Options
Job ID to fetch logs from.
Project name or project ID.
Project ID.
Continuously stream new logs.
### Examples
Stream logs for a specific job:
```sh theme={"system"}
truss train logs --job-id abc123 --tail
```
View logs for a job without streaming:
```sh theme={"system"}
truss train logs --job-id abc123
```
***
## metrics
View real-time metrics for a training job including CPU, GPU, and storage usage.
```sh theme={"system"}
truss train metrics [OPTIONS]
```
### Options
Job ID to fetch metrics from.
Project name or project ID.
Project ID.
### Examples
View metrics for a specific job:
```sh theme={"system"}
truss train metrics --job-id abc123
```
***
## view
List training projects and jobs, or view details for a specific job.
```sh theme={"system"}
truss train view [OPTIONS]
```
### Options
View details for a specific training job.
View jobs for a specific project (name or ID).
View jobs for a specific project ID.
### Examples
List all training projects:
```sh theme={"system"}
truss train view
```
View jobs in a specific project:
```sh theme={"system"}
truss train view --project my-project
```
View details for a specific job:
```sh theme={"system"}
truss train view --job-id abc123
```
***
## stop
Stop a running training job.
```sh theme={"system"}
truss train stop [OPTIONS]
```
### Options
Job ID to stop.
Project name or project ID.
Project ID.
Stop all running jobs. Prompts for confirmation.
### Examples
Stop a specific job:
```sh theme={"system"}
truss train stop --job-id abc123
```
Stop all running jobs:
```sh theme={"system"}
truss train stop --all
```
***
## recreate
Recreate an existing training job with the same configuration.
```sh theme={"system"}
truss train recreate [OPTIONS]
```
### Options
Job ID of the training job to recreate. If not provided, defaults to the last created job.
Stream status and logs after recreating the job.
### Examples
Recreate a specific job:
```sh theme={"system"}
truss train recreate --job-id abc123
```
Recreate and stream logs:
```sh theme={"system"}
truss train recreate --job-id abc123 --tail
```
***
## download
Download training job artifacts to your local machine.
```sh theme={"system"}
truss train download [OPTIONS]
```
### Options
Job ID to download artifacts from.
Directory to download files to. Defaults to current directory.
Keep the compressed archive without extracting.
### Examples
Download artifacts to current directory:
```sh theme={"system"}
truss train download --job-id abc123
```
Download to a specific directory:
```sh theme={"system"}
truss train download --job-id abc123 --target-directory ./downloads
```
Download without extracting:
```sh theme={"system"}
truss train download --job-id abc123 --no-unzip
```
***
## deploy\_checkpoints
Deploy a trained model checkpoint to Baseten's inference platform.
```sh theme={"system"}
truss train deploy_checkpoints [OPTIONS]
```
### Options
Job ID containing the checkpoints to deploy.
Project name or project ID.
Project ID.
Path to a Python file defining a `DeployCheckpointsConfig`.
Generate a Truss config without deploying. Useful for previewing the deployment configuration.
Path to output the generated Truss config. Defaults to `truss_configs/_`.
### Examples
Deploy checkpoints interactively:
```sh theme={"system"}
truss train deploy_checkpoints
```
Deploy checkpoints from a specific job:
```sh theme={"system"}
truss train deploy_checkpoints --job-id abc123
```
Preview deployment without deploying:
```sh theme={"system"}
truss train deploy_checkpoints --job-id abc123 --dry-run
```
***
## get\_checkpoint\_urls
Get presigned URLs for checkpoint artifacts.
```sh theme={"system"}
truss train get_checkpoint_urls [OPTIONS]
```
### Options
Job ID containing the checkpoints.
### Examples
Get checkpoint URLs for a job:
```sh theme={"system"}
truss train get_checkpoint_urls --job-id abc123
```
***
## cache summarize
View a summary of the training cache for a project.
```sh theme={"system"}
truss train cache summarize [OPTIONS] PROJECT
```
### Arguments
Project name or project ID.
### Options
Sort files by column. Options: `filepath`, `size`, `modified`, `type`, `permissions`.
Sort order: `asc` (ascending) or `desc` (descending).
Output format: `cli-table` (default), `csv`, or `json`.
### Examples
View cache summary:
```sh theme={"system"}
truss train cache summarize my-project
```
Sort by size descending:
```sh theme={"system"}
truss train cache summarize my-project --sort size --order desc
```
Export as JSON:
```sh theme={"system"}
truss train cache summarize my-project --output-format json
```
***
## isession
View interactive session details for a training job, including auth codes and connection status. Can also update session configuration.
```sh theme={"system"}
truss train isession [OPTIONS]
```
### Options
Job ID to view interactive session details for.
Minutes to extend the session timeout by.
Change the session trigger. Options: `on_startup`, `on_failure`, `on_demand`. Cannot be changed on `on_startup` sessions.
Output format: `table` (default) or `json`.
### Examples
View session details for a job:
```sh theme={"system"}
truss train isession --job-id abc123
```
Extend session timeout:
```sh theme={"system"}
truss train isession --job-id abc123 --update-timeout 60
```
Output as JSON:
```sh theme={"system"}
truss train isession --job-id abc123 --format json
```
***
## update\_session
Update the interactive session configuration on a running training job. At least one of `--trigger` or `--timeout-minutes` must be provided.
```sh theme={"system"}
truss train update_session [OPTIONS] JOB_ID
```
### Arguments
Job ID of the training job to update.
### Options
New trigger mode for the session. Options: `on_startup`, `on_failure`, `on_demand`.
Number of minutes before the interactive session times out.
### Examples
Change the session trigger:
```sh theme={"system"}
truss train update_session abc123 --trigger on_startup
```
Update the session timeout:
```sh theme={"system"}
truss train update_session abc123 --timeout-minutes 120
```
`truss train update_session` requires API support that may not be available in all environments.
If you receive a 404 error, set the trigger mode at push time using `--interactive on_startup` or `--interactive on_failure` instead.
***
## Ignore files and folders
Create a `.truss_ignore` file in your project root to exclude files from upload. Uses `.gitignore` syntax.
```plaintext .truss_ignore theme={"system"}
# Python cache files
__pycache__/
*.pyc
*.pyo
*.pyd
# Type checking
.mypy_cache/
# Testing
.pytest_cache/
# Large data files
data/
*.bin
```
# truss cleanup
Source: https://docs.baseten.co/reference/cli/truss/cleanup
Clean up Truss data.
```sh theme={"system"}
truss cleanup [OPTIONS]
```
Clears temporary directories created by Truss for operations like building Docker images. Use this to free up disk space.
**Example:**
Clean up temporary Truss data:
```sh theme={"system"}
truss cleanup
```
This command produces no output on success. Temporary files are removed from `~/.truss/`.
# truss configure
Source: https://docs.baseten.co/reference/cli/truss/configure
Configure Truss settings.
```sh theme={"system"}
truss configure [OPTIONS]
```
Opens the `.trussrc` configuration file in your system editor. Use this command to view or modify your local Truss configuration (API keys, remote URLs, etc.).
**Example:**
Open the Truss configuration file:
```sh theme={"system"}
truss configure
```
You should see a configuration file that you can edit, for example:
```ini ~/.trussrc theme={"system"}
[baseten]
remote_provider = baseten
api_key = EMPTY
remote_url = https://app.baseten.co
```
# truss container
Source: https://docs.baseten.co/reference/cli/truss/container
Run and manage Truss containers locally.
```sh theme={"system"}
truss container [OPTIONS] COMMAND [ARGS]...
```
Manage Docker containers for your Truss.
***
## `kill`
Kill containers related to a specific Truss.
```sh theme={"system"}
truss container kill [OPTIONS] [TARGET_DIRECTORY]
```
### Arguments
A Truss directory. Defaults to current directory.
**Example:**
Kill containers for the current Truss:
```sh theme={"system"}
truss container kill
```
***
## `kill-all`
Kill all Truss containers that are not manually persisted.
```sh theme={"system"}
truss container kill-all [OPTIONS]
```
**Example:**
Kill all Truss containers:
```sh theme={"system"}
truss container kill-all
```
***
## `logs`
Get logs from a running Truss container.
```sh theme={"system"}
truss container logs [OPTIONS] [TARGET_DIRECTORY]
```
### Arguments
A Truss directory. Defaults to current directory.
**Example:**
View logs from the current Truss container:
```sh theme={"system"}
truss container logs
```
# truss image
Source: https://docs.baseten.co/reference/cli/truss/image
Build and manage Truss Docker images.
```sh theme={"system"}
truss image [OPTIONS] COMMAND [ARGS]...
```
Build and manage Docker images for your Truss.
***
## `build`
Build the Docker image for a Truss.
```sh theme={"system"}
truss image build [OPTIONS] [TARGET_DIRECTORY] [BUILD_DIR]
```
### Options
Docker image tag.
Use the host network for the Docker build.
### Arguments
A Truss directory. Defaults to current directory.
Image context directory. If not provided, a temp directory is created.
**Example:**
Build a Docker image for your Truss:
```sh theme={"system"}
truss image build
```
Build with a custom tag:
```sh theme={"system"}
truss image build --tag my-model:v1
```
***
## `build-context`
Create a Docker build context for a Truss without building the image.
```sh theme={"system"}
truss image build-context [OPTIONS] BUILD_DIR [TARGET_DIRECTORY]
```
### Arguments
Directory where image context is created.
A Truss directory. Defaults to current directory.
**Example:**
Create a build context in a specific directory:
```sh theme={"system"}
truss image build-context ./build-context
```
***
## `run`
Run the Docker image for a Truss locally.
```sh theme={"system"}
truss image run [OPTIONS] [TARGET_DIRECTORY] [BUILD_DIR]
```
### Options
Docker image tag to run.
Local port to expose the model on. Default: `8080`.
Attach to the container process.
Use the host network for the Docker build and run.
### Arguments
A Truss directory. Defaults to current directory.
Image context directory. If not provided, a temp directory is created.
**Example:**
Build and run a Truss locally:
```sh theme={"system"}
truss image run
```
Run on a custom port:
```sh theme={"system"}
truss image run --port 9000
```
Run in attached mode:
```sh theme={"system"}
truss image run --attach
```
# truss init
Source: https://docs.baseten.co/reference/cli/truss/init
Create a new Truss project.
```sh theme={"system"}
truss init [OPTIONS] TARGET_DIRECTORY
```
Creates a new Truss project in the specified directory with the standard file structure.
### Options
* `--backend`: Server type to create. Default: `TrussServer`.
* `--name` (TEXT): The value assigned to `model_name` in `config.yaml`.
* `--python-config / --no-python-config`: Use code-first tooling to build the model. Default: `--no-python-config`.
### Arguments
Directory where the Truss project is created.
**Examples:**
Create a new Truss project:
```sh theme={"system"}
truss init my-model
```
You should see:
```
Truss my-model was created in /path/to/my-model
```
This creates the following directory structure:
```
my-model/
├── config.yaml
├── data/
├── model/
│ ├── __init__.py
│ └── model.py
└── packages/
```
Create a Truss with a custom name:
```sh theme={"system"}
truss init --name "My Model" my-model
```
Create a Truss with TRT\_LLM backend:
```sh theme={"system"}
truss init --backend TRT_LLM my-trt-model
```
# truss login
Source: https://docs.baseten.co/reference/cli/truss/login
Authenticate with Baseten.
```sh theme={"system"}
truss login [OPTIONS]
```
Authenticates with Baseten, storing the API key in the local configuration file.
If used with no options, runs in interactive mode. Otherwise, the API key can be passed as an option.
### Options
Baseten API key. If provided, the command runs in non-interactive mode.
**Examples:**
Authenticate interactively:
```sh theme={"system"}
truss login
```
You should see:
```
💻 Let's add a Baseten remote!
🤫 Quietly paste your API_KEY:
💾 Remote config `baseten` saved to `~/.trussrc`.
```
Authenticate non-interactively with your API key:
```sh theme={"system"}
truss login --api-key YOUR_API_KEY
```
# truss model-logs
Source: https://docs.baseten.co/reference/cli/truss/model-logs
Fetch logs for a deployed model.
```sh theme={"system"}
truss model-logs [OPTIONS]
```
Fetches logs for a deployed model. Use this command to debug issues or monitor model behavior in production.
### Options
The ID of the model to fetch logs from.
The ID of the deployment to fetch logs from.
Tail for ongoing logs. Streams new log entries as they arrive.
Name of the remote in .trussrc to fetch logs from.
**Example:**
Fetch logs for a specific deployment:
```sh theme={"system"}
truss model-logs --model-id YOUR_MODEL_ID --deployment-id YOUR_DEPLOYMENT_ID
```
Stream logs in real-time:
```sh theme={"system"}
truss model-logs --model-id YOUR_MODEL_ID --deployment-id YOUR_DEPLOYMENT_ID --tail
```
# Truss CLI reference
Source: https://docs.baseten.co/reference/cli/truss/overview
Deploy, manage, and develop models using the Truss CLI.
```sh theme={"system"}
truss [OPTIONS] COMMAND [ARGS]...
```
**Options:**
Show the version and exit.
Customize logging verbosity.
Disable interactive prompts. Use in CI or automated execution contexts.
Show help message and exit.
### Main commands
| Command | Description |
| ----------------------------------------------- | --------------------------------- |
| [`init`](/reference/cli/truss/init) | Create a new Truss project |
| [`push`](/reference/cli/truss/push) | Deploy a model to Baseten |
| [`watch`](/reference/cli/truss/watch) | Live reload during development |
| [`predict`](/reference/cli/truss/predict) | Call the packaged model |
| [`model-logs`](/reference/cli/truss/model-logs) | Fetch logs for the packaged model |
### Advanced commands
| Command | Description |
| --------------------------------------------- | --------------------------------------- |
| [`image`](/reference/cli/truss/image) | Build and manage Truss Docker images |
| [`container`](/reference/cli/truss/container) | Run and manage Truss containers locally |
| [`cleanup`](/reference/cli/truss/cleanup) | Clean up Truss data |
### Other commands
| Command | Description |
| ----------------------------------------------- | -------------------------------------------- |
| [`login`](/reference/cli/truss/login) | Authenticate with Baseten |
| [`configure`](/reference/cli/truss/configure) | Configure Truss settings |
| [`run-python`](/reference/cli/truss/run-python) | Run a Python script in the Truss environment |
| [`whoami`](/reference/cli/truss/whoami) | Show user information and exit |
All commands support `--help` to display usage information.
# truss predict
Source: https://docs.baseten.co/reference/cli/truss/predict
Call the packaged model.
```sh theme={"system"}
truss predict [OPTIONS]
```
Calls the packaged model with the provided input data. Use this to test your model locally or remotely.
### Options
A Truss directory. Defaults to current directory.
JSON string representing the request payload.
Path to a JSON file containing the request payload.
Name of the remote in .trussrc to invoke.
ID of the model to invoke.
ID of the model deployment to invoke.
**Deprecated:** Use `--model-deployment` instead. ID of the model deployment to invoke.
Invoke the published (production) deployment.
**Examples:**
Call a deployed model with inline JSON data:
```sh theme={"system"}
truss predict --remote baseten --model YOUR_MODEL_ID -d '{"prompt": "What is the meaning of life?"}'
```
The response is printed as formatted JSON. For streaming models, output is printed as chunks arrive.
Call a model using a JSON file:
```sh theme={"system"}
truss predict -f request.json
```
Call the production deployment:
```sh theme={"system"}
truss predict --published -d '{"prompt": "Hello, world!"}'
```
# truss push
Source: https://docs.baseten.co/reference/cli/truss/push
Deploy a model to Baseten.
```sh theme={"system"}
truss push [OPTIONS] [TARGET_DIRECTORY]
```
Deploys a Truss to Baseten. By default, creates a published deployment.
### Options
Path to a custom config file. Defaults to `config.yaml` in the Truss directory.
Name of the remote in .trussrc to push to.
Create a development deployment, wait for it to deploy, then watch for source code changes and apply live patches. Use this for rapid iteration during development. Cannot be used with `--promote`, `--environment`, or `--tail`.
Create a published deployment. This is now the default behavior for `truss push`, so this flag is no longer required. If no production deployment exists, the first published deployment is automatically promoted to production.
Push as a published deployment and promote to production, even if a production deployment already exists.
Push as a published deployment and promote into the specified environment. When specified, `--promote` is ignored.
Preserve the previous production deployment's autoscaling settings. Can only be used with `--promote`.
When pushing to an environment, preserve the instance type configured in the environment instead of using the resources from the Truss config. Default: `--preserve-env-instance-type`. Ignored if `--environment` is not specified.
Temporarily overrides the model name for this deployment without updating `config.yaml`.
Name of the deployment. Only applies to published deployments (not development deployments created with `--watch`). Must contain only alphanumeric, `.`, `-`, or `_` characters.
Wait for deployment to complete before returning. Returns non-zero exit code if deploy or build fails.
Stream deployment logs after push. Cannot be used with `--wait`.
Pass a JSON string with key-value pairs. This will be attached to the
deployment and can be used for searching and filtering.
```sh theme={"system"}
truss push --labels '{"env": "staging", "team": "ml-platform", "version": "1.2.0"}'
```
Maximum time to wait for deployment status polling in seconds. Only applies when `--wait` is used. This is a client-side timeout for the polling loop. For a server-side deploy operation timeout, use `--deploy-timeout-minutes`.
Attach git versioning info (sha, branch, tag) to the deployment. Can also be set permanently in `.trussrc`.
Disable downloading the Truss directory from the UI.
Force a full rebuild without using cached layers.
Timeout in minutes for the deploy operation.
Name of the team to deploy to. If not specified, Truss infers the team based on your team membership and existing models, or prompts for selection when ambiguous.
The `--team` flag is only available if your organization has teams enabled. [Contact us](mailto:support@baseten.co) to enable teams, or see [Teams](/organization/teams) for more information.
### Arguments
A Truss directory. Defaults to current directory.
**Examples:**
Deploy a published deployment from the current directory:
```sh theme={"system"}
truss push
```
You should see:
```
Deploying as a published deployment. Use --watch for a development deployment.
✨ Model my-model was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/.../logs
```
Create a development deployment and start watching for changes:
```sh theme={"system"}
truss push --watch
```
Deploy and promote to production:
```sh theme={"system"}
truss push --promote
```
Deploy to a specific environment:
```sh theme={"system"}
truss push --environment staging
```
Deploy with a custom deployment name:
```sh theme={"system"}
truss push --deployment-name my-model_v1.0
```
Deploy with a custom config file:
```sh theme={"system"}
truss push --config my-config.yaml
```
Deploy to a specific team:
```sh theme={"system"}
truss push --team my-team-name
```
# truss run-python
Source: https://docs.baseten.co/reference/cli/truss/run-python
Run a Python script in the Truss environment.
```sh theme={"system"}
truss run-python [OPTIONS] SCRIPT [TARGET_DIRECTORY]
```
Runs a Python script in the same environment as your Truss. This builds a Docker
image matching your Truss environment, mounts the script, and executes it. Use
this to test scripts with the same dependencies your model uses.
### Arguments
Path to the Python script to run.
A Truss directory. Defaults to current directory.
**Examples:**
Run a script in the Truss environment:
```sh theme={"system"}
truss run-python test_script.py
```
Run a script with a specific Truss directory:
```sh theme={"system"}
truss run-python test_script.py /path/to/my-truss
```
# truss watch
Source: https://docs.baseten.co/reference/cli/truss/watch
Live reload during development.
```sh theme={"system"}
truss watch [OPTIONS] [TARGET_DIRECTORY]
```
Watches for source code changes and applies live patches to a development deployment. This enables rapid iteration without redeploying.
You can create a development deployment and start watching in one step with `truss push --watch`.
### Options
Name of the remote in .trussrc to patch changes to.
Path to a custom config file. Defaults to `config.yaml` in the Truss directory.
Name of the team to deploy to. If not specified, Truss infers the team or prompts for selection.
The `--team` flag is only available if your organization has teams enabled. [Contact us](mailto:support@baseten.co) to enable teams, or see [Teams](/organization/teams) for more information.
Keep the development model warm by preventing scale-to-zero while watching.
Temporarily overrides the model name for this session without updating `config.yaml`.
### Arguments
A Truss directory. Defaults to current directory.
**Examples:**
Watch for changes in the current directory:
```sh theme={"system"}
truss watch
```
You should see:
```
🪵 View logs for your development model at https://app.baseten.co/models/.../logs
👀 Watching for changes to truss at '/path/to/my-model'...
```
When you edit a file, Truss detects the change and applies a live patch to the running deployment.
Watch a specific Truss directory:
```sh theme={"system"}
truss watch /path/to/my-truss
```
Watch with a custom config file:
```sh theme={"system"}
truss watch --config my-config.yaml
```
# truss whoami
Source: https://docs.baseten.co/reference/cli/truss/whoami
Show user information.
```sh theme={"system"}
truss whoami [OPTIONS]
```
Shows information about the currently authenticated user and exits. Use this command to verify your authentication status.
### Options
Name of the remote in .trussrc to check.
Display your [OIDC configuration](/organization/oidc) for workload identity, including org ID, team IDs, issuer, audience, and the subject claim format used for cloud provider trust policies.
**Examples:**
Check the current authenticated user:
```sh theme={"system"}
truss whoami
```
You should see:
```
my-workspace\user@example.com
```
View your OIDC configuration for setting up cloud provider trust policies:
```sh theme={"system"}
truss whoami --show-oidc
```
This displays your OIDC configuration for workload identity:
| Field | Description | Example |
| --------------------- | ------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------- |
| Org ID | Your organization identifier | `Mvg9jrRd` |
| Teams | Team IDs within your organization | `AviIZ0y3 (my-team)` |
| Issuer | The Baseten OIDC issuer URL | `https://oidc.baseten.co` |
| Audience | The expected audience claim | `oidc.baseten.co` |
| Workload Type Options | Available workload types for subject claims | `model_container`, `model_build` |
| Subject Claim Format | Pattern used in cloud provider trust policies to scope access | `v=1:org=:team=:model=:deployment=:environment=:type=` |
Use the org and team IDs from this output when configuring trust policies in [AWS](/organization/oidc#aws-setup) or [GCP](/organization/oidc#google-cloud-setup).
# Chat Completions
Source: https://docs.baseten.co/reference/inference-api/chat-completions
reference/inference-api/llm-openapi-spec.json post /v1/chat/completions
Create chat completions using Baseten Model APIs, an OpenAI-compatible endpoint for managed LLMs.
Download the [OpenAPI schema](/reference/inference-api/llm-openapi-spec.json) for code generation and client libraries.
[Model APIs](https://app.baseten.co/model-apis/create) provide instant access to high-performance open-source LLMs through an OpenAI-compatible endpoint.
## Replace OpenAI with Baseten
Switching from OpenAI to Baseten takes two changes: the base URL and API key.
To switch to Baseten with the Python SDK, change `base_url` and `api_key` when initializing the client:
```python theme={"system"}
from openai import OpenAI
import os
client = OpenAI(
base_url="https://inference.baseten.co/v1",
api_key=os.environ["BASETEN_API_KEY"],
)
response = client.chat.completions.create(
model="deepseek-ai/DeepSeek-V3.1",
messages=[{"role": "user", "content": "Hello!"}],
)
```
To switch to Baseten with the JavaScript SDK, change `baseURL` and `apiKey` when initializing the client:
```javascript theme={"system"}
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "https://inference.baseten.co/v1",
apiKey: process.env.BASETEN_API_KEY,
});
const response = await client.chat.completions.create({
model: "deepseek-ai/DeepSeek-V3.1",
messages: [{ role: "user", content: "Hello!" }],
});
```
To call Baseten with cURL, send a POST request to `inference.baseten.co` with your API key:
```bash theme={"system"}
curl https://inference.baseten.co/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Api-Key $BASETEN_API_KEY" \
-d '{
"model": "deepseek-ai/DeepSeek-V3.1",
"messages": [{"role": "user", "content": "Hello!"}]
}'
```
Deploy a [Model API](https://app.baseten.co/model-apis/create) to get started.
For detailed usage guides including structured outputs and tool calling, see [Using Model APIs](/development/model-apis/overview).
OpenAI-compatible models you deploy yourself also support the chat completions format at their own base URL: `https://model-{model_id}.api.baseten.co/v1/chat/completions`. See [deployed model endpoints](/reference/inference-api/overview#deployed-model-endpoints) for URL formats.
# Overview
Source: https://docs.baseten.co/reference/inference-api/overview
Baseten provides two ways to call models: Model APIs for managed LLMs and deployed model endpoints for custom models and chains.
Every model running on Baseten is accessible over HTTPS through the inference API.
The API provides two paths depending on how your model is served.
Model APIs offer managed, high-performance LLMs through a single OpenAI-compatible endpoint, with no deployment step required.
Deployed model endpoints serve custom models and chains that you package and deploy with Truss, each routed through a dedicated subdomain.
## Model APIs
Model APIs give you instant access to popular open-source LLMs with optimized serving. Baseten manages the infrastructure (shared GPU clusters, model weights, and serving configuration), so there's no deployment step and nothing to configure. The supported catalog includes models like DeepSeek, GLM, and Kimi, with all models supporting tool calling and most supporting structured outputs. Pricing is per million tokens.
Because Model APIs implement the OpenAI chat completions format, switching from OpenAI to Baseten requires only changing the base URL and API key in your existing client. All requests route through a single endpoint:
```sh theme={"system"}
https://inference.baseten.co/v1/chat/completions
```
The [Chat Completions](/reference/inference-api/chat-completions) reference covers request and response schemas. For usage details including structured outputs and tool calling, refer to the [Model APIs guide](/development/model-apis/overview).
## Deployed model endpoints
When you deploy a custom model or chain with Truss, Baseten assigns it a dedicated subdomain for routing. This is the path for models that aren't in the Model APIs catalog: models with custom serving logic, fine-tuned weights, or multi-step inference pipelines built as chains. You control the hardware, scaling behavior, and serving engine.
Each endpoint URL includes a deployment target: an environment name like `production`, the `development` deployment, or a specific deployment ID.
**For models:**
```
https://model-{model_id}.api.baseten.co/{deployment_type_or_id}/{endpoint}
```
**For chains:**
```
https://chain-{chain_id}.api.baseten.co/{deployment_type_or_id}/{endpoint}
```
* `model_id`: the model's alphanumeric ID, found in your model dashboard.
* `chain_id`: the chain's alphanumeric ID, found in your chain dashboard.
* `deployment_type_or_id`: either `development`, `production`, or a specific deployment's alphanumeric ID.
* `endpoint`: the API action, such as `predict`.
For [regional environments](/deployment/environments#regional-environments), the environment name is embedded in the hostname instead of the URL path:
```
https://model-{model_id}-{env_name}.api.baseten.co/{endpoint}
https://chain-{chain_id}-{env_name}.api.baseten.co/{endpoint}
```
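The URL patterns above can be captured in a small helper. The following sketch is illustrative only (the `inference_url` function is not part of any Baseten SDK); it simply applies the hostname and path formats documented here:

```python
def inference_url(entity: str, entity_id: str, endpoint: str,
                  target: str = "production", regional: bool = False) -> str:
    """Build an inference URL following the patterns above.

    `target` is the deployment target: "development", "production",
    an environment name (with regional=True), or a "deployment/{id}" path.
    Illustrative helper only; not part of the Baseten SDK.
    """
    if entity not in ("model", "chain"):
        raise ValueError("entity must be 'model' or 'chain'")
    if regional:
        # Regional environments embed the environment name in the hostname.
        return f"https://{entity}-{entity_id}-{target}.api.baseten.co/{endpoint}"
    return f"https://{entity}-{entity_id}.api.baseten.co/{target}/{endpoint}"

print(inference_url("model", "abcd1234", "predict"))
# https://model-abcd1234.api.baseten.co/production/predict
```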
For long-running tasks, the inference API supports [asynchronous inference](/inference/async) with priority queuing.
### Predict endpoints
| Method | Endpoint | Description |
| :---------- | :----------------------------------------------------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------------------- |
| `POST` | [`/environments/{env_name}/predict`](/reference/inference-api/predict-endpoints/environments-predict) | Call an **environment**. |
| `POST` | [`/development/predict`](/reference/inference-api/predict-endpoints/development-predict) | Call the **development deployment**. |
| `POST` | [`/deployment/{deployment_id}/predict`](/reference/inference-api/predict-endpoints/deployment-predict) | Call any **deployment**. |
| `POST` | [`/environments/{env_name}/async_predict`](/reference/inference-api/predict-endpoints/environments-async-predict) | For [Async inference](/inference/async), call the deployment associated with the specified **environment**. |
| `POST` | [`/development/async_predict`](/reference/inference-api/predict-endpoints/development-async-predict) | For [Async inference](/inference/async), call the **development deployment**. |
| `POST` | [`/deployment/{deployment_id}/async_predict`](/reference/inference-api/predict-endpoints/deployment-async-predict) | For [Async inference](/inference/async), call any published **deployment** of your model. |
| `WEBSOCKET` | [`/environments/{env_name}/websocket`](/reference/inference-api/predict-endpoints/environments-websocket) | For WebSockets, connect to an **environment**. |
| `WEBSOCKET` | [`/development/websocket`](/reference/inference-api/predict-endpoints/development-websocket) | For WebSockets, connect to the **development deployment**. |
| `WEBSOCKET` | [`/deployment/{deployment_id}/websocket`](/reference/inference-api/predict-endpoints/deployment-websocket) | For WebSockets, connect to a **deployment**. |
| Method | Endpoint | Description |
| :---------- | :----------------------------------------------------------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------------------------- |
| `POST` | [`/environments/{env_name}/run_remote`](/reference/inference-api/predict-endpoints/environments-run-remote) | Call a Chain **environment**. |
| `POST` | [`/development/run_remote`](/reference/inference-api/predict-endpoints/development-run-remote) | Call a Chain **development deployment**. |
| `POST` | [`/deployment/{deployment_id}/run_remote`](/reference/inference-api/predict-endpoints/deployment-run-remote) | Call a Chain **deployment**. |
| `POST` | [`/environments/{env_name}/async_run_remote`](/reference/inference-api/predict-endpoints/environments-async-run-remote) | For [Async inference](/inference/async), call the Chain deployment associated with the specified **environment**. |
| `POST` | [`/development/async_run_remote`](/reference/inference-api/predict-endpoints/development-async-run-remote) | For [Async inference](/inference/async), call a Chain **development deployment**. |
| `POST` | [`/deployment/{deployment_id}/async_run_remote`](/reference/inference-api/predict-endpoints/deployment-async-run-remote) | For [Async inference](/inference/async), call any published Chain **deployment**. |
| `WEBSOCKET` | [`/environments/{env_name}/websocket`](/reference/inference-api/predict-endpoints/environments-websocket) | For WebSockets, connect to an **environment**. |
| `WEBSOCKET` | [`/development/websocket`](/reference/inference-api/predict-endpoints/development-websocket) | For WebSockets, connect to the **development deployment**. |
| `WEBSOCKET` | [`/deployment/{deployment_id}/websocket`](/reference/inference-api/predict-endpoints/deployment-websocket) | For WebSockets, connect to a **deployment**. |
### Async status endpoints
| Method | Endpoint | Description |
| :----- | :------------------------------------------------------------------------------------------------------------------------------ | :---------------------------------------------------------------------------------------- |
| `GET` | [`/async_request/{request_id}`](/reference/inference-api/status-endpoints/get-async-request-status) | Get the **status** of a **model** async request. |
| `GET` | [`/async_request/{request_id}`](/reference/inference-api/status-endpoints/get-chain-async-request-status) | Get the **status** of a **chain** async request. |
| `DELETE` | [`/async_request/{request_id}`](/reference/inference-api/predict-endpoints/cancel-async-request) | **Cancel** an async request. |
| `GET` | [`/environments/{env_name}/async_queue_status`](/reference/inference-api/status-endpoints/environments-get-async-queue-status) | Get the **async queue status** for a model associated with the **specified environment**. |
| `GET` | [`/development/async_queue_status`](/reference/inference-api/status-endpoints/development-get-async-queue-status) | Get the **status** of a **development deployment's** async queue. |
| `GET` | [`/deployment/{deployment_id}/async_queue_status`](/reference/inference-api/status-endpoints/deployment-get-async-queue-status) | Get the **status** of a **deployment's** async queue. |
### Wake endpoints
| Method | Endpoint | Description |
| :----- | :---------------------------------------------------------------------------------- | :------------------------------------------------- |
| `POST` | [`/production/wake`](/reference/inference-api/wake/production-wake) | Wake the **production environment** of your model. |
| `POST` | [`/development/wake`](/reference/inference-api/wake/development-wake) | Wake the **development deployment** of your model. |
| `POST` | [`/deployment/{deployment_id}/wake`](/reference/inference-api/wake/deployment-wake) | Wake any **deployment** of your model. |
# Async cancel request
Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/cancel-async-request
DELETE https://model-{model_id}.api.baseten.co/async_request/{request_id}
Use this endpoint to cancel a queued async request.
Only `QUEUED` requests may be canceled.
### Parameters
The ID of the model.
The ID of the chain.
The ID of the async request.
### Headers
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
### Response
The ID of the async request.
Whether the request was canceled.
Additional details about whether the request was canceled.
### Rate limits
Calls to the cancel async request endpoint are limited to **20 requests per second**. If this limit is exceeded, subsequent requests will receive a 429 status code.
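To stay under that limit from a busy client, spacing requests on the client side helps. A minimal sketch, illustrative only (the `RateLimiter` class is not part of any Baseten SDK):

```python
import time

class RateLimiter:
    """Space calls so they stay under a requests-per-second budget.

    Illustrative client-side helper, not part of the Baseten SDK.
    """

    def __init__(self, max_per_second: float):
        self.min_interval = 1.0 / max_per_second
        self.next_allowed = 0.0  # earliest monotonic time for the next call

    def wait(self, now=None):
        """Block until a request is allowed; return the delay applied.

        Pass `now` explicitly to compute the delay without sleeping.
        """
        real = now is None
        t = time.monotonic() if real else now
        delay = max(0.0, self.next_allowed - t)
        if real and delay:
            time.sleep(delay)
        self.next_allowed = max(t, self.next_allowed) + self.min_interval
        return delay

limiter = RateLimiter(20)  # 20 requests per second
# Call limiter.wait() before each DELETE to /async_request/{request_id}.
```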
```python Python (Model) theme={"system"}
import requests
import os
model_id = ""
request_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = requests.delete(
f"https://model-{model_id}.api.baseten.co/async_request/{request_id}",
headers={"Authorization": f"Api-Key {baseten_api_key}"}
)
print(resp.json())
```
```python Python (Chain) theme={"system"}
import requests
import os
chain_id = ""
request_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = requests.delete(
f"https://chain-{chain_id}.api.baseten.co/async_request/{request_id}",
headers={"Authorization": f"Api-Key {baseten_api_key}"}
)
print(resp.json())
```
# Async deployment
Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/deployment-async-predict
POST https://model-{model_id}.api.baseten.co/deployment/{deployment-id}/async_predict
Use this endpoint to call any [published deployment](/deploy/lifecycle) of your model.
### Parameters
The ID of the model you want to call.
The ID of the specific deployment you want to call.
### Headers
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
### Body
There is a 256 KiB size limit to `/async_predict` request payloads.
JSON-serializable model input.
Baseten **does not** store model outputs. If `webhook_endpoint` is empty, your model must save prediction outputs so they can be accessed later.
URL of the webhook endpoint. We require that webhook endpoints use HTTPS. Both HTTP/2 and HTTP/1.1 protocols are supported.
Priority of the request. A lower value corresponds to a higher priority (e.g. requests with priority 0 are scheduled before requests of priority 1).
`priority` is between 0 and 2, inclusive.
Maximum time a request will spend in the queue before expiring.
`max_time_in_queue_seconds` must be between 10 seconds and 72 hours, inclusive.
Exponential backoff parameters used to retry the model predict request.
Number of predict request attempts.
`max_attempts` must be between 1 and 10, inclusive.
Minimum time between retries in milliseconds.
`initial_delay_ms` must be between 0 and 10,000 milliseconds, inclusive.
Maximum time between retries in milliseconds.
`max_delay_ms` must be between 0 and 60,000 milliseconds, inclusive.
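Under a typical exponential-backoff interpretation (the delay doubles per attempt and is capped at `max_delay_ms`; the doubling factor is an assumption for illustration, only the parameter bounds are documented), the retry schedule looks like:

```python
def retry_delays_ms(max_attempts: int, initial_delay_ms: int, max_delay_ms: int) -> list:
    """Delays (ms) before each retry, assuming the delay doubles per
    attempt and is capped at max_delay_ms. Illustrative only; the exact
    growth factor used by the service is not documented here."""
    if not 1 <= max_attempts <= 10:
        raise ValueError("max_attempts must be between 1 and 10")
    delays = []
    delay = initial_delay_ms
    for _ in range(max_attempts - 1):  # no delay before the first attempt
        delays.append(min(delay, max_delay_ms))
        delay *= 2
    return delays

print(retry_delays_ms(max_attempts=3, initial_delay_ms=1000, max_delay_ms=5000))
# [1000, 2000]
```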
### Response
The ID of the async request.
### Rate limits
Two types of rate limits apply when making async requests:
* Calls to the `/async_predict` endpoint are limited to **200 requests per second**.
* Each organization is limited to **50,000 `QUEUED` or `IN_PROGRESS` async requests**, summed across all deployments.
If either limit is exceeded, subsequent `/async_predict` requests will receive a 429 status code.
To avoid hitting these rate limits, we advise:
* Implementing a backpressure mechanism, such as calling `/async_predict` with exponential backoff in response to 429 errors.
* Monitoring the [async queue size metric](/observability/metrics#async-queue-metrics). If your model is accumulating a backlog of requests, consider increasing the number of requests your model can process at once by increasing the number of max replicas or the concurrency target in your autoscaling settings.
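A backpressure loop of that shape might look like the following sketch, with the HTTP call injected so the retry logic is explicit (`send` stands in for, e.g., a `requests.post` to `/async_predict`):

```python
import random
import time

def submit_with_backoff(send, payload, max_retries=5, base_delay=0.5, max_delay=30.0):
    """Call `send(payload)`, retrying with jittered exponential backoff
    whenever it returns a 429 status. `send` should return an object with
    a `status_code` attribute (e.g. a requests.Response). Illustrative
    sketch only, not part of the Baseten SDK."""
    for attempt in range(max_retries + 1):
        resp = send(payload)
        if resp.status_code != 429:
            return resp  # success or a non-rate-limit error; caller decides
        # Exponential backoff with jitter, capped at max_delay seconds.
        delay = min(base_delay * (2 ** attempt), max_delay)
        time.sleep(delay * random.uniform(0.5, 1.0))
    raise RuntimeError("still rate limited after retries")
```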
```python Python theme={"system"}
import requests
import os
model_id = ""
deployment_id = ""
webhook_endpoint = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = requests.post(
f"https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/async_predict",
headers={"Authorization": f"Api-Key {baseten_api_key}"},
json={
"model_input": {"prompt": "hello world!"},
"webhook_endpoint": webhook_endpoint # Optional fields for priority, max_time_in_queue_seconds, etc
},
)
print(resp.json())
```
```sh cURL theme={"system"}
curl --request POST \
--url https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/async_predict \
--header "Authorization: Api-Key $BASETEN_API_KEY" \
--data '{
"model_input": {"prompt": "hello world!"},
"webhook_endpoint": "https://my_webhook.com/webhook",
"priority": 1,
"max_time_in_queue_seconds": 100,
"inference_retry_config": {
"max_attempts": 3,
"initial_delay_ms": 1000,
"max_delay_ms": 5000
}
}'
```
```javascript Node.js theme={"system"}
const fetch = require("node-fetch");
const resp = await fetch(
"https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/async_predict",
{
method: "POST",
headers: { Authorization: `Api-Key ${process.env.BASETEN_API_KEY}` },
body: JSON.stringify({
model_input: { prompt: "hello world!" },
webhook_endpoint: "https://my_webhook.com/webhook",
priority: 1,
max_time_in_queue_seconds: 100,
inference_retry_config: {
max_attempts: 3,
initial_delay_ms: 1000,
max_delay_ms: 5000,
},
}),
}
);
const data = await resp.json();
console.log(data);
```
```json 201 theme={"system"}
{
"request_id": ""
}
```
# Async chains deployment
Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/deployment-async-run-remote
POST https://chain-{chain_id}.api.baseten.co/deployment/{deployment-id}/async_run_remote
Call a specific chain deployment asynchronously by deployment ID.
Use this endpoint to call any [deployment](/deployment/deployments) of your
chain asynchronously.
```sh theme={"system"}
https://chain-{chain_id}.api.baseten.co/deployment/{deployment_id}/async_run_remote
```
### Parameters
The ID of the chain you want to call.
The ID of the specific deployment you want to call.
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
### Body
JSON-serializable chain input. The input schema corresponds to the signature
of the entrypoint's `run_remote` method: the top-level keys are the argument
names, and the values are the JSON representations of the corresponding
types.
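For example, given a hypothetical entrypoint signature (the names `prompt` and `max_tokens` are invented for illustration), the request body's top-level keys mirror the argument names:

```python
import inspect
import json

# Hypothetical chain entrypoint signature, for illustration only:
def run_remote(self, prompt: str, max_tokens: int = 64) -> str: ...

# The async_run_remote body's top-level keys mirror the argument names.
params = [p for p in inspect.signature(run_remote).parameters if p != "self"]
payload = {"prompt": "hello world!", "max_tokens": 64}
assert set(payload) == set(params)
print(json.dumps(payload))
```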
```python Python theme={"system"}
import urllib3
import os
chain_id = ""
deployment_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = urllib3.request(
"POST",
f"https://chain-{chain_id}.api.baseten.co/deployment/{deployment_id}/async_run_remote",
headers={"Authorization": f"Api-Key {baseten_api_key}"},
json={}, # JSON-serializable chain input
)
print(resp.json())
```
```sh cURL theme={"system"}
curl -X POST https://chain-{chain_id}.api.baseten.co/deployment/{deployment_id}/async_run_remote \
-H "Authorization: Api-Key $BASETEN_API_KEY" \
-d '{}' # JSON-serializable chain input
```
```javascript Node.js theme={"system"}
const fetch = require("node-fetch");
const resp = await fetch(
"https://chain-{chain_id}.api.baseten.co/deployment/{deployment_id}/async_run_remote",
{
method: "POST",
headers: { Authorization: `Api-Key ${process.env.BASETEN_API_KEY}` },
body: JSON.stringify({}), // JSON-serializable chain input
}
);
const data = await resp.json();
console.log(data);
```
```json 201 theme={"system"}
{
"request_id": ""
}
```
# Deployment
Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/deployment-predict
POST https://model-{model_id}.api.baseten.co/deployment/{deployment-id}/predict
Call a specific deployment of a model by deployment ID.
Use this endpoint to call any [published deployment](/deployment/deployments) of your model.
```sh theme={"system"}
https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/predict
```
### Parameters
The ID of the model you want to call.
The ID of the specific deployment you want to call.
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
### Body
JSON-serializable model input.
```python Python theme={"system"}
import urllib3
import os
model_id = ""
deployment_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = urllib3.request(
"POST",
f"https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/predict",
headers={"Authorization": f"Api-Key {baseten_api_key}"},
json={}, # JSON-serializable model input
)
print(resp.json())
```
```sh cURL theme={"system"}
curl -X POST https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/predict \
-H "Authorization: Api-Key $BASETEN_API_KEY" \
-d '{}' # JSON-serializable model input
```
```sh Truss theme={"system"}
truss predict --model-deployment DEPLOYMENT_ID -d '{}' # JSON-serializable model input
```
```javascript Node.js theme={"system"}
const fetch = require("node-fetch");
const resp = await fetch(
"https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/predict",
{
method: "POST",
headers: { Authorization: `Api-Key ${process.env.BASETEN_API_KEY}` },
body: JSON.stringify({}), // JSON-serializable model input
}
);
const data = await resp.json();
console.log(data);
```
```json Example Response // JSON-serializable output varies by model theme={"system"}
{}
```
# Chains deployment
Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/deployment-run-remote
POST https://chain-{chain_id}.api.baseten.co/deployment/{deployment-id}/run_remote
Call a specific chain deployment by deployment ID.
Use this endpoint to call any [deployment](/deployment/deployments) of your
chain.
```sh theme={"system"}
https://chain-{chain_id}.api.baseten.co/deployment/{deployment_id}/run_remote
```
### Parameters
The ID of the chain you want to call.
The ID of the specific deployment you want to call.
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
### Body
JSON-serializable chain input. The input schema corresponds to the signature
of the entrypoint's `run_remote` method: the top-level keys are the argument
names, and the values are the JSON representations of the corresponding types.
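As a sketch of that mapping, suppose a hypothetical entrypoint whose `run_remote` takes a `prompt` string and a `max_steps` integer (these names are illustrative, not part of any real chain). The request body simply maps each argument name to its JSON value:

```python
import json

# Hypothetical entrypoint signature (illustrative only):
#   async def run_remote(self, prompt: str, max_steps: int) -> str: ...
# Top-level keys are the argument names; values are their JSON representations.
body = {"prompt": "hello world!", "max_steps": 3}
payload = json.dumps(body)
print(payload)  # {"prompt": "hello world!", "max_steps": 3}
```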
```python Python theme={"system"}
import urllib3
import os
chain_id = ""
deployment_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = urllib3.request(
"POST",
f"https://chain
-{chain_id}.api.baseten.co/deployment/{deployment_id}/run_remote",
headers={"Authorization": f"Api-Key {baseten_api_key}"},
json={}, # JSON-serializable chain input
)
print(resp.json())
```
```sh cURL theme={"system"}
curl -X POST https://chain-{chain_id}.api.baseten.co/deployment/{deployment_id}/run_remote \
-H 'Authorization: Api-Key EMPTY' \
-d '{}' # JSON-serializable chain input
```
```javascript Node.js theme={"system"}
const fetch = require("node-fetch");
const resp = await fetch(
"https://chain-{chain_id}.api.baseten.co/deployment/{deployment_id}/run_remote",
{
method: "POST",
headers: { Authorization: "Api-Key EMPTY" },
body: JSON.stringify({}), // JSON-serializable chain input
}
);
const data = await resp.json();
console.log(data);
```
```json Example Response // JSON-serializable output varies by chain theme={"system"}
{}
```
# Websocket deployment
Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/deployment-websocket
Connect via WebSocket to a specific deployment.
Use this endpoint to connect via WebSockets to a specific deployment.
Note that `entity` here can be either `model` or `chain`, depending on whether you're using Baseten models or Chains.
```sh theme={"system"}
wss://{entity}-{entity_id}.api.baseten.co/deployment/{deployment_id}/websocket
```
See [WebSockets](/development/model/websockets) for more details.
### Parameters
The type of entity you want to connect to. Either `model` or `chain`.
The ID of the model or chain you want to connect to.
The ID of the deployment you want to connect to.
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
```sh websocat theme={"system"}
websocat -H 'Authorization: Api-Key EMPTY' \
wss://{entity}-{entity_id}.api.baseten.co/deployment/{deployment_id}/websocket
```
# Async development
Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/development-async-predict
POST https://model-{model_id}.api.baseten.co/development/async_predict
Use this endpoint to call the [development deployment](/deploy/lifecycle) of your model asynchronously.
### Parameters
The ID of the model you want to call.
### Headers
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
### Body
There is a 256 KiB size limit to `/async_predict` request payloads.
JSON-serializable model input.
Baseten **does not** store model outputs. If `webhook_endpoint` is empty, your model must save prediction outputs so they can be accessed later.
URL of the webhook endpoint. We require that webhook endpoints use HTTPS. Both HTTP/2 and HTTP/1.1 protocols are supported.
Priority of the request. A lower value corresponds to a higher priority (e.g. requests with priority 0 are scheduled before requests of priority 1).
`priority` is between 0 and 2, inclusive.
Maximum time a request will spend in the queue before expiring.
`max_time_in_queue_seconds` must be between 10 seconds and 72 hours, inclusive.
Exponential backoff parameters used to retry the model predict request.
Number of predict request attempts.
`max_attempts` must be between 1 and 10, inclusive.
Minimum time between retries in milliseconds.
`initial_delay_ms` must be between 0 and 10,000 milliseconds, inclusive.
Maximum time between retries in milliseconds.
`max_delay_ms` must be between 0 and 60,000 milliseconds, inclusive.
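To build intuition for these bounds, here is a sketch of the delay schedule an exponential backoff with these parameters could produce, assuming the delay doubles between attempts and is capped at `max_delay_ms` (the exact formula used server-side is not specified here, so treat this as illustrative):

```python
def retry_delays_ms(max_attempts: int, initial_delay_ms: int, max_delay_ms: int) -> list[int]:
    """Delays before each retry, assuming doubling capped at max_delay_ms.

    max_attempts includes the first try, so there are max_attempts - 1 retries.
    """
    return [min(initial_delay_ms * 2**i, max_delay_ms) for i in range(max_attempts - 1)]

# With the cURL example's config: max_attempts=3, initial_delay_ms=1000, max_delay_ms=5000
print(retry_delays_ms(3, 1000, 5000))  # [1000, 2000]
```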
### Response
The ID of the async request.
### Rate limits
Two types of rate limits apply when making async requests:
* Calls to the `/async_predict` endpoint are limited to **200 requests per second**.
* Each organization is limited to **50,000 `QUEUED` or `IN_PROGRESS` async requests**, summed across all deployments.
If either limit is exceeded, subsequent `/async_predict` requests will receive a 429 status code.
To avoid hitting these rate limits, we advise:
* Implementing a backpressure mechanism, such as calling `/async_predict` with exponential backoff in response to 429 errors.
* Monitoring the [async queue size metric](/observability/metrics#async-queue-metrics). If your model is accumulating a backlog of requests, consider increasing the number of requests your model can process at once by increasing the number of max replicas or the concurrency target in your autoscaling settings.
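The backpressure advice above can be sketched as a small retry wrapper around `/async_predict`, using `requests` and exponential backoff on 429 responses. The function name and delay constants are illustrative, not an official client:

```python
import time

import requests


def async_predict_with_backoff(url: str, api_key: str, payload: dict,
                               max_retries: int = 5, base_delay_s: float = 1.0) -> dict:
    """POST to an /async_predict URL, retrying on 429 with exponential backoff."""
    for attempt in range(max_retries + 1):
        resp = requests.post(
            url,
            headers={"Authorization": f"Api-Key {api_key}"},
            json=payload,
        )
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()  # e.g. {"request_id": "..."}
        if attempt < max_retries:
            time.sleep(base_delay_s * 2**attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("Rate limited: /async_predict still returning 429 after retries")
```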
```python Python theme={"system"}
import requests
import os
model_id = ""
webhook_endpoint = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = requests.post(
f"https://model-{model_id}.api.baseten.co/development/async_predict",
headers={"Authorization": f"Api-Key {baseten_api_key}"},
json={
"model_input": {"prompt": "hello world!"},
"webhook_endpoint": webhook_endpoint # Optional fields for priority, max_time_in_queue_seconds, etc
},
)
print(resp.json())
```
```sh cURL theme={"system"}
curl --request POST \
--url https://model-{model_id}.api.baseten.co/development/async_predict \
--header "Authorization: Api-Key $BASETEN_API_KEY" \
--data '{
"model_input": {"prompt": "hello world!"},
"webhook_endpoint": "https://my_webhook.com/webhook",
"priority": 1,
"max_time_in_queue_seconds": 100,
"inference_retry_config": {
"max_attempts": 3,
"initial_delay_ms": 1000,
"max_delay_ms": 5000
}
}'
```
```javascript Node.js theme={"system"}
const fetch = require("node-fetch");
const resp = await fetch(
"https://model-{model_id}.api.baseten.co/development/async_predict",
{
method: "POST",
headers: { Authorization: "Api-Key EMPTY" },
body: JSON.stringify({
model_input: { prompt: "hello world!" },
webhook_endpoint: "https://my_webhook.com/webhook",
priority: 1,
max_time_in_queue_seconds: 100,
inference_retry_config: {
max_attempts: 3,
initial_delay_ms: 1000,
max_delay_ms: 5000,
},
}),
}
);
const data = await resp.json();
console.log(data);
```
```json 201 theme={"system"}
{
"request_id": ""
}
```
# Async chains development
Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/development-async-run-remote
POST https://chain-{chain_id}.api.baseten.co/development/async_run_remote
Call the development deployment of a chain asynchronously.
Use this endpoint to call the [development deployment](/development/chain/deploy#development) of
your chain asynchronously.
```sh theme={"system"}
https://chain-{chain_id}.api.baseten.co/development/async_run_remote
```
### Parameters
The ID of the chain you want to call.
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
### Body
JSON-serializable chain input. The input schema corresponds to the signature
of the entrypoint's `run_remote` method: the top-level keys are the argument
names, and the values are the JSON representations of the corresponding types.
```python Python theme={"system"}
import urllib3
import os
chain_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = urllib3.request(
"POST",
f"https://chain-{chain_id}.api.baseten.co/development/async_run_remote",
headers={"Authorization": f"Api-Key {baseten_api_key}"},
json={}, # JSON-serializable chain input
)
print(resp.json())
```
```sh cURL theme={"system"}
curl -X POST https://chain-{chain_id}.api.baseten.co/development/async_run_remote \
-H 'Authorization: Api-Key EMPTY' \
-d '{}' # JSON-serializable chain input
```
```javascript Node.js theme={"system"}
const fetch = require('node-fetch');
const resp = await fetch(
'https://chain-{chain_id}.api.baseten.co/development/async_run_remote',
{
method: 'POST',
headers: { Authorization: 'Api-Key EMPTY' },
body: JSON.stringify({}), // JSON-serializable chain input
}
);
const data = await resp.json();
console.log(data);
```
```json 201 theme={"system"}
{
"request_id": ""
}
```
# Development
Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/development-predict
POST https://model-{model_id}.api.baseten.co/development/predict
Call the development deployment of a model.
Use this endpoint to call the [development deployment](/deployment/deployments) of your model.
```sh theme={"system"}
https://model-{model_id}.api.baseten.co/development/predict
```
### Parameters
The ID of the model you want to call.
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
### Body
JSON-serializable model input.
```python Python theme={"system"}
import urllib3
import os
model_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = urllib3.request(
"POST",
f"https://model-{model_id}.api.baseten.co/development/predict",
headers={"Authorization": f"Api-Key {baseten_api_key}"},
json={}, # JSON-serializable model input
)
print(resp.json())
```
```sh cURL theme={"system"}
curl -X POST https://model-{model_id}.api.baseten.co/development/predict \
-H 'Authorization: Api-Key EMPTY' \
-d '{}' # JSON-serializable model input
```
```sh Truss theme={"system"}
truss predict --model-version DEPLOYMENT_ID -d '{}' # JSON-serializable model input
```
```javascript Node.js theme={"system"}
const fetch = require("node-fetch");
const resp = await fetch(
"https://model-{model_id}.api.baseten.co/development/predict",
{
method: "POST",
headers: { Authorization: "Api-Key EMPTY" },
body: JSON.stringify({}), // JSON-serializable model input
}
);
const data = await resp.json();
console.log(data);
```
```json Example Response theme={"system"}
// JSON-serializable output varies by model
{}
```
# Chains development
Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/development-run-remote
POST https://chain-{chain_id}.api.baseten.co/development/run_remote
Call the development deployment of a chain.
Use this endpoint to call the [development deployment](/development/chain/deploy#development) of
your chain.
```sh theme={"system"}
https://chain-{chain_id}.api.baseten.co/development/run_remote
```
### Parameters
The ID of the chain you want to call.
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
### Body
JSON-serializable chain input. The input schema corresponds to the signature
of the entrypoint's `run_remote` method: the top-level keys are the argument
names, and the values are the JSON representations of the corresponding types.
```python Python theme={"system"}
import urllib3
import os
chain_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = urllib3.request(
"POST",
f"https://chain-{chain_id}.api.baseten.co/development/run_remote",
headers={"Authorization": f"Api-Key {baseten_api_key}"},
json={}, # JSON-serializable chain input
)
print(resp.json())
```
```sh cURL theme={"system"}
curl -X POST https://chain-{chain_id}.api.baseten.co/development/run_remote \
-H 'Authorization: Api-Key EMPTY' \
-d '{}' # JSON-serializable chain input
```
```javascript Node.js theme={"system"}
const fetch = require('node-fetch');
const resp = await fetch(
'https://chain-{chain_id}.api.baseten.co/development/run_remote',
{
method: 'POST',
headers: { Authorization: 'Api-Key EMPTY' },
body: JSON.stringify({}), // JSON-serializable chain input
}
);
const data = await resp.json();
console.log(data);
```
```json Example Response theme={"system"}
// JSON-serializable output varies by chain
{}
```
# Websocket development
Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/development-websocket
Connect via WebSocket to the development deployment of a model or chain.
Use this endpoint to connect via WebSockets to the development deployment of a model or chain.
```sh theme={"system"}
wss://{entity}-{entity_id}.api.baseten.co/development/websocket
```
See [WebSockets](/development/model/websockets) for more details.
### Parameters
The type of entity you want to connect to. Either `model` or `chain`.
The ID of the model or chain you want to connect to.
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
```sh websocat theme={"system"}
websocat -H 'Authorization: Api-Key EMPTY' \
wss://{entity}-{entity_id}.api.baseten.co/development/websocket
```
# Async environment
Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/environments-async-predict
POST https://model-{model_id}.api.baseten.co/environments/{env_name}/async_predict
Use this endpoint to call the model associated with the specified environment asynchronously.
### Parameters
The ID of the model you want to call.
The name of the model's environment you want to call.
### Headers
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
### Body
There is a 256 KiB size limit to `/async_predict` request payloads.
JSON-serializable model input.
Baseten **does not** store model outputs. If `webhook_endpoint` is empty, your model must save prediction outputs so they can be accessed later.
URL of the webhook endpoint. We require that webhook endpoints use HTTPS. Both HTTP/2 and HTTP/1.1 protocols are supported.
Priority of the request. A lower value corresponds to a higher priority (e.g. requests with priority 0 are scheduled before requests of priority 1).
`priority` is between 0 and 2, inclusive.
Maximum time a request will spend in the queue before expiring.
`max_time_in_queue_seconds` must be between 10 seconds and 72 hours, inclusive.
Exponential backoff parameters used to retry the model predict request.
Number of predict request attempts.
`max_attempts` must be between 1 and 10, inclusive.
Minimum time between retries in milliseconds.
`initial_delay_ms` must be between 0 and 10,000 milliseconds, inclusive.
Maximum time between retries in milliseconds.
`max_delay_ms` must be between 0 and 60,000 milliseconds, inclusive.
### Response
The ID of the async request.
```json 201 theme={"system"}
{
"request_id": ""
}
```
### Rate limits
Two types of rate limits apply when making async requests:
* Calls to the `/async_predict` endpoint are limited to **200 requests per second**.
* Each organization is limited to **50,000 `QUEUED` or `IN_PROGRESS` async requests**, summed across all deployments.
If either limit is exceeded, subsequent `/async_predict` requests will receive a 429 status code.
To avoid hitting these rate limits, we advise:
* Implementing a backpressure mechanism, such as calling `/async_predict` with exponential backoff in response to 429 errors.
* Monitoring the [async queue size metric](/observability/metrics#async-queue-metrics). If your model is accumulating a backlog of requests, consider increasing the number of requests your model can process at once by increasing the number of max replicas or the concurrency target in your autoscaling settings.
# Async chains environment
Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/environments-async-run-remote
POST https://chain-{chain_id}.api.baseten.co/environments/{env_name}/async_run_remote
Call the chain deployment associated with a specified environment asynchronously.
```sh theme={"system"}
https://chain-{chain_id}.api.baseten.co/environments/{env_name}/async_run_remote
```
### Parameters
The ID of the chain you want to call.
The name of the chain's environment you want to call.
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
### Body
JSON-serializable chain input. The input schema corresponds to the signature
of the entrypoint's `run_remote` method: the top-level keys are the argument
names, and the values are the JSON representations of the corresponding types.
```python Python theme={"system"}
import urllib3
import os
chain_id = ""
env_name = "staging"
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = urllib3.request(
"POST",
f"https://chain-{chain_id}.api.baseten.co/environments/{env_name}/async_run_remote",
headers={"Authorization": f"Api-Key {baseten_api_key}"},
json={}, # JSON-serializable chain input
)
print(resp.json())
```
```sh cURL theme={"system"}
curl -X POST https://chain-{chain_id}.api.baseten.co/environments/{env_name}/async_run_remote \
-H 'Authorization: Api-Key EMPTY' \
-d '{}' # JSON-serializable chain input
```
```javascript Node.js theme={"system"}
const fetch = require('node-fetch');
const resp = await fetch(
'https://chain-{chain_id}.api.baseten.co/environments/{env_name}/async_run_remote',
{
method: 'POST',
headers: { Authorization: 'Api-Key EMPTY' },
body: JSON.stringify({}), // JSON-serializable chain input
}
);
const data = await resp.json();
console.log(data);
```
```json 201 theme={"system"}
{
"request_id": ""
}
```
# Environment
Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/environments-predict
POST https://model-{model_id}.api.baseten.co/environments/{env_name}/predict
Call the model deployment associated with a specified environment.
Use this endpoint to call the deployment associated with the specified [environment](/deployment/environments).
```sh theme={"system"}
https://model-{model_id}.api.baseten.co/environments/{env_name}/predict
```
### Parameters
The ID of the model you want to call.
The name of the model's environment you want to call.
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
### Body
JSON-serializable model input.
```python Python theme={"system"}
import urllib3
import os
model_id = ""
env_name = "staging"
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = urllib3.request(
"POST",
f"https://model-{model_id}.api.baseten.co/environments/{env_name}/predict",
headers={"Authorization": f"Api-Key {baseten_api_key}"},
json={}, # JSON-serializable model input
)
print(resp.json())
```
```sh cURL theme={"system"}
curl -X POST https://model-{model_id}.api.baseten.co/environments/{env_name}/predict \
-H 'Authorization: Api-Key EMPTY' \
-d '{}' # JSON-serializable model input
```
```javascript Node.js theme={"system"}
const fetch = require("node-fetch");
const resp = await fetch(
"https://model-{model_id}.api.baseten.co/environments/{env_name}/predict",
{
method: "POST",
headers: { Authorization: "Api-Key EMPTY" },
body: JSON.stringify({}), // JSON-serializable model input
}
);
const data = await resp.json();
console.log(data);
```
```json Example Response theme={"system"}
// JSON-serializable output varies by model
{}
```
# Chains environment
Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/environments-run-remote
POST https://chain-{chain_id}.api.baseten.co/environments/{env_name}/run_remote
Call the chain deployment associated with a specified environment.
```sh theme={"system"}
https://chain-{chain_id}.api.baseten.co/environments/{env_name}/run_remote
```
### Parameters
The ID of the chain you want to call.
The name of the chain's environment you want to call.
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
### Body
JSON-serializable chain input. The input schema corresponds to the signature
of the entrypoint's `run_remote` method: the top-level keys are the argument
names, and the values are the JSON representations of the corresponding types.
```python Python theme={"system"}
import urllib3
import os
chain_id = ""
env_name = "staging"
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = urllib3.request(
"POST",
f"https://chain-{chain_id}.api.baseten.co/environments/{env_name}/run_remote",
headers={"Authorization": f"Api-Key {baseten_api_key}"},
json={}, # JSON-serializable chain input
)
print(resp.json())
```
```sh cURL theme={"system"}
curl -X POST https://chain-{chain_id}.api.baseten.co/environments/{env_name}/run_remote \
-H 'Authorization: Api-Key EMPTY' \
-d '{}' # JSON-serializable chain input
```
```javascript Node.js theme={"system"}
const fetch = require('node-fetch');
const resp = await fetch(
'https://chain-{chain_id}.api.baseten.co/environments/{env_name}/run_remote',
{
method: 'POST',
headers: { Authorization: 'Api-Key EMPTY' },
body: JSON.stringify({}), // JSON-serializable chain input
}
);
const data = await resp.json();
console.log(data);
```
```json Example Response theme={"system"}
// JSON-serializable output varies by chain
{}
```
# Websocket environment
Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/environments-websocket
Connect via WebSocket to the deployment associated with an environment.
Use this endpoint to connect via WebSockets to the deployment associated with the specified [environment](/deployment/environments).
Note that `entity` here can be either `model` or `chain`, depending on whether you're using Baseten models or Chains.
```sh theme={"system"}
wss://{entity}-{entity_id}.api.baseten.co/environments/{env_name}/websocket
```
See [WebSockets](/development/model/websockets) for more details.
### Parameters
The type of entity you want to connect to. Either `model` or `chain`.
The ID of the model or chain you want to connect to.
The name of the environment you want to connect to.
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
```sh websocat theme={"system"}
websocat -H 'Authorization: Api-Key EMPTY' \
wss://{entity}-{entity_id}.api.baseten.co/environments/{env_name}/websocket
```
# Transcribe Streaming Audio
Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/streaming-transcription-api
Transcribe audio in real time over a WebSocket connection.
The streaming audio transcription endpoint is **only** compatible with **WebSockets**, not with the REST API.
To begin using the transcription endpoint, establish a connection via WebSocket. Once connected, you must first send a metadata JSON object (as a string) over the WebSocket. This metadata informs the model about the format and type of audio data it should expect.
After the metadata is sent, you can begin streaming raw audio bytes directly over the same WebSocket connection.
```sh theme={"system"}
wss://model-{model_id}.api.baseten.co/environments/production/websocket
```
### Parameters
The ID of the model you want to call.
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
### Websocket Metadata
These parameters configure the Voice Activity Detector (VAD) and allow you to tune behavior such as speech endpointing.
* **threshold** (`float`, default=`0.5`): The probability threshold for detecting speech, between 0.0 and 1.0. Frames with a probability above this value are considered speech. A higher threshold makes the VAD more selective, reducing false positives from background noise.
* **min\_silence\_duration\_ms** (`int`, default=`300`): The minimum duration of silence (in milliseconds) required to determine that speech has ended.
* **speech\_pad\_ms** (`int`, default=`0`): Padding (in milliseconds) added to both the start and end of detected speech segments to avoid cutting off words prematurely.
Parameters for controlling streaming ASR behavior.
* **encoding** (`string`, default=`"pcm_s16le"`): Audio encoding format.
* **sample\_rate** (`int`, default=`16000`): Audio sample rate in Hz. Whisper models are optimized for a sample rate of 16,000 Hz.
* **enable\_partial\_transcripts** (`boolean`, optional): If `true`, intermediate (partial) transcripts are sent over the WebSocket as audio is received. For most voice AI use cases, we recommend setting this to `false`.
* **partial\_transcript\_interval\_s** (`float`, default=`0.5`): Interval in seconds that the model waits before sending a partial transcript, if partials are enabled.
* **final\_transcript\_max\_duration\_s** (`int`, default=`30`): The maximum duration of buffered audio (in seconds) before a final transcript is forcibly returned. This value should not exceed `30`.
Parameters for controlling Whisper's behavior.
* **prompt** (`string`, optional): Optional transcription prompt.
* **audio\_language** (`string`, default=`"en"`): Language of the input audio. Set to `"auto"` for automatic detection.
* **language\_detection\_only** (`boolean`, default=`false`): If `true`, only return the automatic language detection result without transcribing.
* **language\_options** (`list[string]`, default=`[]`): List of language codes to consider for language detection, for example `["en", "zh"]`. This can improve language detection accuracy by scoping detection to the set of languages relevant to your use case. By default, we consider [all languages](https://platform.openai.com/docs/guides/speech-to-text#supported-languages) supported by the Whisper model. \[Added since v0.5.0]
* **use\_dynamic\_preprocessing** (`boolean`, default=`false`): Enables dynamic range compression to process audio with variable loudness.
* **show\_word\_timestamps** (`boolean`, default=`false`): If `true`, include word-level timestamps in the output. \[Added since v0.4.0]
* **show\_beam\_results** (`boolean`, default=`false`): If `true`, include transcriptions from all beams of beam search in the response. \[Added since v0.7.5]
Advanced parameters for controlling Whisper's sampling behavior.
* **beam\_width** (`integer`, optional): Beam search width for decoding. Controls the number of candidate sequences to maintain during beam search. \[Added since v0.6.0]
* **length\_penalty** (`float`, optional): Length penalty applied to the output. Higher values encourage longer outputs. \[Added since v0.6.0]
* **repetition\_penalty** (`float`, optional): Penalty for repeating tokens. Higher values discourage repetition. \[Added since v0.6.0]
* **beam\_search\_diversity\_rate** (`float`, optional): Controls diversity in beam search. Higher values increase diversity among beam candidates. \[Added since v0.6.0]
* **no\_repeat\_ngram\_size** (`integer`, optional): Prevents repetition of n-grams of the specified size. \[Added since v0.6.0]
Deprecated since v0.6.0: use `whisper_params.whisper_sampling_params` instead. Specifically, replace `beam_size` with `whisper_params.whisper_sampling_params.beam_width` and `length_penalty` with `whisper_params.whisper_sampling_params.length_penalty`.
```python Python theme={"system"}
import asyncio
import websockets
import sounddevice as sd
import numpy as np
import json
import os
model_id = "" # Baseten model id here
baseten_api_key = os.environ["BASETEN_API_KEY"]
# Audio config
SAMPLE_RATE = 16000
CHUNK_SIZE = 512
CHANNELS = 1
headers = {"Authorization": f"Api-Key {baseten_api_key}"}
# Metadata to send first
metadata = {
"streaming_vad_config": {
"threshold": 0.5,
"min_silence_duration_ms": 300,
"speech_pad_ms": 30
},
"streaming_params": {
"encoding": "pcm_s16le",
"sample_rate": SAMPLE_RATE,
"enable_partial_transcripts": True
},
"whisper_params": {"audio_language": "en"},
}
async def stream_microphone_audio(ws_url):
loop = asyncio.get_running_loop()
async with websockets.connect(ws_url, additional_headers=headers) as ws:
print("Connected to server")
# Send the metadata JSON blob
await ws.send(json.dumps(metadata))
print("Sent metadata to server")
send_queue = asyncio.Queue()
# Start audio stream
def audio_callback(indata, frames, time_info, status):
if status:
print(f"Audio warning: {status}")
int16_data = (indata * 32767).astype(np.int16).tobytes()
loop.call_soon_threadsafe(send_queue.put_nowait, int16_data)
with sd.InputStream(
samplerate=SAMPLE_RATE,
blocksize=CHUNK_SIZE,
channels=CHANNELS,
dtype="float32",
callback=audio_callback,
):
print("Streaming mic audio...")
async def send_audio():
while True:
chunk = await send_queue.get()
await ws.send(chunk)
async def receive_messages():
while True:
response = await ws.recv()
message = json.loads(response)
msg_type = message.get("type")
if msg_type == "transcription":
is_final = message.get("is_final")
text = " ".join(s.get("text", "") for s in message.get("segments", []))
print(f"[{'final' if is_final else 'partial'}] {text}")
else:
print(f"[{msg_type}] {message.get('body')}")
# Run send + receive tasks concurrently
await asyncio.gather(send_audio(), receive_messages())
ws_url = f"wss://model-{model_id}.api.baseten.co/environments/production/websocket"
asyncio.run(stream_microphone_audio(ws_url))
```
```json Example Response theme={"system"}
{
"type": "transcription",
"is_final": true,
"transcription_num": 4,
"language_code": "en",
"language_prob": null,
"audio_length_sec": 9.92,
"segments": [
{
"text": "That's one small step for man, one giant leap for mankind.",
"log_prob": -0.8644908666610718,
"start_time": 0,
"end_time": 9.92
}
]
}
```
***
## FAQ
### How do I handle end of audio to avoid losing the last utterance?
By default, the VAD-based endpointing only triggers a transcript when it detects a period of silence after speech. If you close the connection abruptly without signaling end-of-audio, **any speech still buffered that hasn't hit a silence boundary will be lost**.
To flush the buffer and get a final transcript for all remaining audio, send an `end_audio` control message before closing the connection:
```json theme={"system"}
{"type": "end_audio"}
```
The server will:
1. Immediately acknowledge: `{"type": "end_audio", "body": {"status": "acknowledged"}}`
2. Finish transcribing all remaining buffered audio, sending any final transcription results
3. Signal completion: `{"type": "end_audio", "body": {"status": "finished"}}`
After receiving `finished`, it is safe to close the connection.
```python Python theme={"system"}
import asyncio
import signal
import websockets
import sounddevice as sd
import numpy as np
import json
import os
SAMPLE_RATE = 16000
CHUNK_SIZE = 512 # ~32ms per chunk
headers = {"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"}
ws_url = "wss://model-{model_id}.api.baseten.co/environments/production/websocket"
metadata = {
"streaming_params": {"encoding": "pcm_s16le", "sample_rate": SAMPLE_RATE},
"whisper_params": {"audio_language": "en"},
}
async def stream_mic():
loop = asyncio.get_running_loop()
send_queue = asyncio.Queue()
stop_event = asyncio.Event()
def audio_callback(indata, frames, time_info, status):
loop.call_soon_threadsafe(
send_queue.put_nowait, (indata * 32767).astype(np.int16).tobytes()
)
# Ctrl+C sets stop_event instead of raising KeyboardInterrupt,
# so the end_audio handshake can complete cleanly before closing.
loop.add_signal_handler(signal.SIGINT, lambda: loop.call_soon_threadsafe(stop_event.set))
async with websockets.connect(ws_url, additional_headers=headers) as ws:
await ws.send(json.dumps(metadata))
print("Recording — press Ctrl+C to stop.\n")
async def send_audio():
with sd.InputStream(samplerate=SAMPLE_RATE, blocksize=CHUNK_SIZE,
channels=1, dtype="float32", callback=audio_callback):
while not stop_event.is_set():
try:
chunk = await asyncio.wait_for(send_queue.get(), timeout=0.1)
await ws.send(chunk)
except asyncio.TimeoutError:
continue
# Drain any chunks buffered after stop
while not send_queue.empty():
await ws.send(send_queue.get_nowait())
# Flush remaining speech buffered on the server
await ws.send(json.dumps({"type": "end_audio"}))
async def receive_messages():
# Receive concurrently with send_audio — VAD may trigger transcription
# results while audio is still being sent; sequential receive would miss them.
async for raw in ws:
msg = json.loads(raw)
if msg.get("type") == "transcription":
text = " ".join(s["text"] for s in msg.get("segments", []))
print(f"[{'final' if msg['is_final'] else 'partial'}] {text}")
elif msg.get("type") == "end_audio":
if msg.get("body", {}).get("status") == "finished":
break # All audio processed — safe to close
await asyncio.gather(send_audio(), receive_messages())
asyncio.run(stream_mic())
```
Do not rely on simply closing the WebSocket to flush audio. Always send `{"type": "end_audio"}` and wait for `{"status": "finished"}` before closing to ensure you receive all transcription results.
***
### How do I process multiple audio sessions without reconnecting every time?
Each WebSocket connection is a **single streaming session**. The metadata (language, VAD config, encoding, etc.) is fixed at connection time and can't be changed mid-session. Once the server sends `{"status": "finished"}` in response to `end_audio`, the session is complete and the connection will close.
To process multiple files or conversation turns, **open a new connection for each session**. To minimize reconnection latency in high-throughput scenarios, establish the next connection before the previous one has fully closed (overlapping connections):
```python Python theme={"system"}
import asyncio
import websockets
import json
import os
SAMPLE_RATE = 16000
CHUNK_SAMPLES = 512 # ~32ms per chunk — matches live mic cadence
CHUNK_SIZE = CHUNK_SAMPLES * 2 # bytes (pcm_s16le = 2 bytes/sample)
CHUNK_DURATION_S = CHUNK_SAMPLES / SAMPLE_RATE
headers = {"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"}
ws_url = "wss://model-{model_id}.api.baseten.co/environments/production/websocket"
async def transcribe_session(audio_bytes: bytes, language: str = "en") -> str:
"""Open a new connection, transcribe one audio buffer, close cleanly."""
metadata = {
"streaming_params": {"encoding": "pcm_s16le", "sample_rate": SAMPLE_RATE},
"whisper_params": {"audio_language": language},
}
transcripts = []
async with websockets.connect(ws_url, additional_headers=headers) as ws:
await ws.send(json.dumps(metadata))
async def send_audio():
for i in range(0, len(audio_bytes), CHUNK_SIZE):
chunk = audio_bytes[i : i + CHUNK_SIZE]
# VAD requires ≥ 512 samples per chunk. Zero-pad the last
# chunk if the file doesn't divide evenly.
if len(chunk) < CHUNK_SIZE:
chunk = chunk + b"\x00" * (CHUNK_SIZE - len(chunk))
await ws.send(chunk)
# Pace at real-time speed so VAD sees audio at the same
# cadence as a live mic — sending faster causes idle timeouts.
await asyncio.sleep(CHUNK_DURATION_S)
await ws.send(json.dumps({"type": "end_audio"}))
async def receive_messages():
# Receive concurrently — VAD may emit transcripts while audio
# is still being sent; sequential receive would miss them.
async for raw in ws:
message = json.loads(raw)
if message.get("type") == "transcription":
transcripts.append(
" ".join(s["text"] for s in message.get("segments", []))
)
elif message.get("type") == "end_audio":
if message.get("body", {}).get("status") == "finished":
break
await asyncio.gather(send_audio(), receive_messages())
return " ".join(transcripts)
async def process_sequential(audio_files: list[bytes]):
"""One connection per file, each opened after the previous one closes."""
for audio in audio_files:
transcript = await transcribe_session(audio)
print(f"Transcript: {transcript}")
async def process_overlapping(audio_files: list[bytes]):
"""All connections opened in parallel — wall-clock time ≈ longest file."""
results = await asyncio.gather(*[transcribe_session(a) for a in audio_files])
for transcript in results:
print(f"Transcript: {transcript}")
```
Each WebSocket connection maps to a dedicated worker on the server. Keeping connections alive unnecessarily will consume server resources. Use health check messages (`{"type": "health_check"}`) to verify a long-lived connection is still active before sending audio.
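The health-check message can double as a keepalive. A minimal sketch (the reply schema isn't specified here, so this only verifies that the server responds at all before audio is sent):

```python
import asyncio
import json

def make_health_check() -> str:
    """Build the health-check message described above."""
    return json.dumps({"type": "health_check"})

async def connection_is_alive(ws, timeout: float = 5.0) -> bool:
    """Send a health check and treat any reply within `timeout` as healthy."""
    await ws.send(make_health_check())
    try:
        await asyncio.wait_for(ws.recv(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        return False
```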
# Transcribe Pre-Recorded Audio
Source: https://docs.baseten.co/reference/inference-api/predict-endpoints/transcription-api
POST https://model-{model_id}.api.baseten.co/production/predict
Transcribe a pre-recorded audio file using a deployed transcription model.
Use this endpoint to call the [production environment](/deployment/environments) of your model.
```sh theme={"system"}
https://model-{model_id}.api.baseten.co/environments/production/predict
```
**If you are deploying this model as a chain**, call it as follows:
```sh theme={"system"}
https://chain-{chain_id}.api.baseten.co/environments/production/run_remote
```
### Parameters
The ID of the model you want to call.
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
### Body
The audio input options. You must provide one of `url`, `audio_b64`, or `audio_bytes`.
* **url** (`string`): URL of the audio file.
* **audio\_b64** (`string`): Base64-encoded audio content.
* **audio\_bytes** (`bytes`): Raw audio bytes.
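If your audio lives in a local file, one way to fill the `audio_b64` field is to base64-encode the raw bytes (a sketch; the helper names and file path are illustrative):

```python
import base64

def encode_audio_bytes(data: bytes) -> str:
    """Base64-encode raw audio bytes for the audio_b64 field."""
    return base64.b64encode(data).decode("utf-8")

def encode_audio_file(path: str) -> str:
    """Read a local audio file and return its base64 encoding."""
    with open(path, "rb") as f:
        return encode_audio_bytes(f.read())

# audio = {"audio_b64": encode_audio_file("speech.wav")}  # placeholder path
```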
Parameters for controlling Whisper's behavior.
* **prompt** (`string`, optional): Optional transcription prompt.
* **audio\_language** (`string`, default=`"en"`): Language of the input audio. Set to `"auto"` for automatic detection.
* **language\_detection\_only** (`boolean`, default=`false`): If `true`, only return the automatic language detection result without transcribing.
* **language\_options** (`list[string]`, default=`[]`): List of language codes to consider during language detection, for example `["en", "zh"]`. Scoping detection to the languages relevant to your use case can improve accuracy. By default, [all languages](https://platform.openai.com/docs/guides/speech-to-text#supported-languages) supported by the Whisper model are considered. \[Added since v0.5.0]
* **use\_dynamic\_preprocessing** (`boolean`, default=`false`): Enables dynamic range compression to process audio with variable loudness.
* **show\_word\_timestamps** (`boolean`, default=`false`): If `true`, include word-level timestamps in the output. \[Added since v0.4.0]
* **enable\_vad** (`boolean`, default=`true`): If `true`, enable audio chunking by the voice activity detection (VAD) model. If `false`, the model can only process up to 30 seconds of audio at a time. \[Added since v0.6.0]
* **show\_beam\_results** (`boolean`, default=`false`): If `true`, include transcriptions from all beams of beam search in the response. \[Added since v0.7.5]
* **enable\_chunk\_level\_language\_detection** (`boolean`, default=`false`): If `true`, language detection is performed at the chunk/segment level instead of file level. \[Added since v0.7.6]
Advanced parameters for controlling Whisper's sampling behavior.
* **beam\_width** (`integer`, optional): Beam search width for decoding. Controls the number of candidate sequences to maintain during beam search. \[Added since v0.6.0]
* **length\_penalty** (`float`, optional): Length penalty applied to the output. Higher values encourage longer outputs. \[Added since v0.6.0]
* **repetition\_penalty** (`float`, optional): Penalty for repeating tokens. Higher values discourage repetition. \[Added since v0.6.0]
* **beam\_search\_diversity\_rate** (`float`, optional): Controls diversity in beam search. Higher values increase diversity among beam candidates. \[Added since v0.6.0]
* **no\_repeat\_ngram\_size** (`integer`, optional): Prevents repetition of n-grams of the specified size. \[Added since v0.6.0]
Advanced settings for the automatic speech recognition (ASR) process.
* **beam\_size** (`integer`, default=`1`): Beam search size for decoding. Beam sizes up to 5 are supported. \[Deprecated since v0.6.0. Use `whisper_input.whisper_params.whisper_sampling_params.beam_width` instead.]
* **length\_penalty** (`float`, default=`2.0`): Length penalty applied to ASR output. Length penalty only takes effect when `beam_size` is greater than 1. \[Deprecated since v0.6.0. Use `whisper_input.whisper_params.whisper_sampling_params.length_penalty` instead.]
Parameters for controlling the voice activity detection (VAD) process.
* **max\_speech\_duration\_s** (`integer`, default=`29`): Maximum duration in seconds of a single speech segment. `max_speech_duration_s` cannot exceed 30 because the Whisper model can only accept up to 30 seconds of audio input. \[Added since v0.4.0]
* **min\_silence\_duration\_ms** (`integer`, default=`3000`): At the end of each speech chunk, the VAD waits `min_silence_duration_ms` before splitting it off as a separate segment. \[Added since v0.4.0]
* **threshold** (`float`, default=`0.5`): Speech threshold. VAD outputs a speech probability for each audio chunk; probabilities above this value are treated as speech. Tuning this parameter per dataset gives the best results, but the default of 0.5 works well for most audio. \[Added since v0.4.0]
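The deprecation notes above spell out the nesting for the newer sampling parameters. As a sketch, a request body that sets them looks like this (the audio URL and values are placeholders):

```python
# Request body exercising the whisper_sampling_params path from the
# deprecation notes above (URL and values are placeholders).
payload = {
    "whisper_input": {
        "audio": {"url": "https://example.com/audio.wav"},
        "whisper_params": {
            "audio_language": "en",
            "whisper_sampling_params": {
                "beam_width": 5,        # candidate sequences kept in beam search
                "length_penalty": 2.0,  # higher values favor longer outputs
            },
        },
    }
}
```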
```python Python theme={"system"}
import requests
import os
model_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
# Define the request payload
payload = {
"whisper_input": {
"audio": {
"url": "https://example.com/audio.wav", # Replace with actual URL # "audio_b64": "BASE64_ENCODED_AUDIO", # Uncomment if using Base64
},
"whisper_params": {
"prompt": "Optional transcription prompt",
"audio_language": "en",
}
}
}
resp = requests.post(
f"https://model-{model_id}.api.baseten.co/environments/production/predict",
headers={"Authorization": f"Api-Key {baseten_api_key}"},
json=payload
)
print(resp.json())
```
```sh cURL theme={"system"}
curl -X POST https://model-{model_id}.api.baseten.co/environments/production/predict \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"whisper_input": {
"audio": {
"url": "https://example.com/audio.mp3"
},
"whisper_params": {
"prompt": "Optional transcription prompt",
"audio_language": "en",
}
}
}'
```
```javascript Node.js theme={"system"}
// Node 18+ has a global fetch; run this file as an ES module for top-level await.
const payload = {
whisper_input: {
audio: {
url: "https://example.com/audio.mp3",
},
whisper_params: {
prompt: "Optional transcription prompt",
audio_language: "en",
},
},
};
const resp = await fetch(
"https://model-{model_id}.api.baseten.co/environments/production/predict",
{
method: "POST",
headers: {
Authorization: "Api-Key EMPTY",
"Content-Type": "application/json",
},
body: JSON.stringify(payload),
}
);
const data = await resp.json();
console.log(data);
```
```json Example Response theme={"system"}
{
"language_code": "en",
"language_prob": null,
"segments": [
{
"text": "That's one small step for man, one giant leap for mankind.",
"log_prob": -0.8644908666610718,
"start_time": 0,
"end_time": 9.92
}
]
}
```
# Async deployment
Source: https://docs.baseten.co/reference/inference-api/status-endpoints/deployment-get-async-queue-status
GET https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/async_queue_status
Use this endpoint to get the status of a published deployment's async queue.
### Parameters
The ID of the model.
The ID of the chain.
The ID of the deployment.
### Headers
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
### Response
The ID of the model.
The ID of the deployment.
The number of requests in the deployment's async queue with `QUEUED` status (i.e. awaiting processing by the model).
The number of requests in the deployment's async queue with `IN_PROGRESS` status (i.e. currently being processed by the model).
```json 200 theme={"system"}
{
"model_id": "",
"deployment_id": "",
"num_queued_requests": 12,
"num_in_progress_requests": 3
}
```
### Rate limits
Calls to the `/async_queue_status` endpoint are limited to **20 requests per second**. If this limit is exceeded, subsequent requests will receive a 429 status code.
To gracefully handle hitting this rate limit, we advise implementing a backpressure mechanism, such as calling `/async_queue_status` with exponential backoff in response to 429 errors.
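A sketch of that pattern, polling the deployment queue and backing off exponentially on 429s (`model_id`, `deployment_id`, and the delay values are placeholders):

```python
import os
import time

import requests

def backoff_delays(base: float = 0.5, cap: float = 30.0):
    """Yield exponentially growing delays: base, 2*base, 4*base, ... capped at cap."""
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)

def get_queue_status(model_id: str, deployment_id: str) -> dict:
    """Fetch the async queue status, backing off whenever the 429 limit is hit."""
    url = (
        f"https://model-{model_id}.api.baseten.co"
        f"/deployment/{deployment_id}/async_queue_status"
    )
    headers = {"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"}
    for delay in backoff_delays():
        resp = requests.get(url, headers=headers)
        if resp.status_code != 429:
            resp.raise_for_status()  # surface non-rate-limit errors
            return resp.json()
        time.sleep(delay)  # rate limited: wait, then retry with a longer delay
```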
```py Model theme={"system"}
import requests
import os
model_id = ""
deployment_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = requests.get(
f"https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/async_queue_status",
headers={"Authorization": f"Api-Key {baseten_api_key}"}
)
print(resp.json())
```
```py Chain theme={"system"}
import requests
import os
chain_id = ""
deployment_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = requests.get(
f"https://chain-{chain_id}.api.baseten.co/deployment/{deployment_id}/async_queue_status",
headers={"Authorization": f"Api-Key {baseten_api_key}"}
)
print(resp.json())
```
# Async development
Source: https://docs.baseten.co/reference/inference-api/status-endpoints/development-get-async-queue-status
GET https://model-{model_id}.api.baseten.co/development/async_queue_status
Use this endpoint to get the status of a development deployment's async queue.
### Parameters
The ID of the model.
The ID of the chain.
### Headers
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
### Response
The ID of the model.
The ID of the deployment.
The number of requests in the deployment's async queue with `QUEUED` status (i.e. awaiting processing by the model).
The number of requests in the deployment's async queue with `IN_PROGRESS` status (i.e. currently being processed by the model).
```json 200 theme={"system"}
{
"model_id": "",
"deployment_id": "",
"num_queued_requests": 12,
"num_in_progress_requests": 3
}
```
### Rate limits
Calls to the `/async_queue_status` endpoint are limited to **20 requests per second**. If this limit is exceeded, subsequent requests will receive a 429 status code.
To gracefully handle hitting this rate limit, we advise implementing a backpressure mechanism, such as calling `/async_queue_status` with exponential backoff in response to 429 errors.
```py Model theme={"system"}
import requests
import os
model_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = requests.get(
f"https://model-{model_id}.api.baseten.co/development/async_queue_status",
headers={"Authorization": f"Api-Key {baseten_api_key}"}
)
print(resp.json())
```
```py Chain theme={"system"}
import requests
import os
chain_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = requests.get(
f"https://chain-{chain_id}.api.baseten.co/development/async_queue_status",
headers={"Authorization": f"Api-Key {baseten_api_key}"}
)
print(resp.json())
```
# Async environment
Source: https://docs.baseten.co/reference/inference-api/status-endpoints/environments-get-async-queue-status
GET https://model-{model_id}.api.baseten.co/environments/{env_name}/async_queue_status
Use this endpoint to get the async queue status for a model associated with the specified environment.
### Parameters
The ID of the model.
The ID of the chain.
The name of the environment.
### Headers
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
### Response
The ID of the model.
The ID of the deployment.
The number of requests in the deployment's async queue with `QUEUED` status (i.e. awaiting processing by the model).
The number of requests in the deployment's async queue with `IN_PROGRESS` status (i.e. currently being processed by the model).
```json 200 theme={"system"}
{
"model_id": "",
"deployment_id": "",
"num_queued_requests": 12,
"num_in_progress_requests": 3
}
```
### Rate limits
Calls to the `/async_queue_status` endpoint are limited to **20 requests per second**. If this limit is exceeded, subsequent requests will receive a 429 status code.
To gracefully handle hitting this rate limit, we advise implementing a backpressure mechanism, such as calling `/async_queue_status` with exponential backoff in response to 429 errors.
```py Model theme={"system"}
import requests
import os
model_id = ""
env_name = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = requests.get(
f"https://model-{model_id}.api.baseten.co/environments/{env_name}/async_queue_status",
headers={"Authorization": f"Api-Key {baseten_api_key}"}
)
print(resp.json())
```
```py Chain theme={"system"}
import requests
import os
chain_id = ""
env_name = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = requests.get(
f"https://chain-{chain_id}.api.baseten.co/environments/{env_name}/async_queue_status",
headers={"Authorization": f"Api-Key {baseten_api_key}"}
)
print(resp.json())
```
# Async request
Source: https://docs.baseten.co/reference/inference-api/status-endpoints/get-async-request-status
GET https://model-{model_id}.api.baseten.co/async_request/{request_id}
Use this endpoint to get the status of an async request.
### Parameters
The ID of the model.
The ID of the chain.
The ID of the async request.
### Headers
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
### Response
The ID of the async request.
The ID of the model that executed the request.
The ID of the deployment that executed the request.
An enum representing the status of the request.
Available options: `QUEUED`, `IN_PROGRESS`, `SUCCEEDED`, `FAILED`, `EXPIRED`, `CANCELED`, `WEBHOOK_FAILED`
An enum representing the status of sending the predict result to the provided webhook.
Available options: `PENDING`, `SUCCEEDED`, `FAILED`, `CANCELED`, `NO_WEBHOOK_PROVIDED`
The time in UTC at which the async request was created.
The time in UTC at which the async request's status was updated.
Any errors that occurred in processing the async request. Empty if no errors occurred.
An enum representing the type of error that occurred.
Available options: `MODEL_PREDICT_ERROR`, `MODEL_PREDICT_TIMEOUT`, `MODEL_NOT_READY`, `MODEL_DOES_NOT_EXIST`, `MODEL_UNAVAILABLE`, `MODEL_INVALID_INPUT`, `ASYNC_REQUEST_NOT_SUPPORTED`, `INTERNAL_SERVER_ERROR`
A message containing details of the error that occurred.
The ID of the async request.
The ID of the chain that executed the request.
The ID of the deployment that executed the request.
An enum representing the status of the request.
Available options: `QUEUED`, `IN_PROGRESS`, `SUCCEEDED`, `FAILED`, `EXPIRED`, `CANCELED`, `WEBHOOK_FAILED`
An enum representing the status of sending the predict result to the provided webhook.
Available options: `PENDING`, `SUCCEEDED`, `FAILED`, `CANCELED`, `NO_WEBHOOK_PROVIDED`
The time in UTC at which the async request was created.
The time in UTC at which the async request's status was updated.
Any errors that occurred in processing the async request. Empty if no errors occurred.
An enum representing the type of error that occurred.
Available options: `MODEL_PREDICT_ERROR`, `MODEL_PREDICT_TIMEOUT`, `MODEL_NOT_READY`, `MODEL_DOES_NOT_EXIST`, `MODEL_UNAVAILABLE`, `MODEL_INVALID_INPUT`, `ASYNC_REQUEST_NOT_SUPPORTED`, `INTERNAL_SERVER_ERROR`
A message containing details of the error that occurred.
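When polling, the status enum above divides into terminal and in-flight values; a small helper keeps the loop condition explicit:

```python
# Statuses from the enum above that mean the request is done and its
# status will no longer change; anything else is still in flight.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "EXPIRED", "CANCELED", "WEBHOOK_FAILED"}

def is_terminal(status: str) -> bool:
    """True if the async request has reached a final state."""
    return status in TERMINAL_STATUSES
```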
```json 200 (Model) theme={"system"}
{
"request_id": "",
"model_id": "",
"deployment_id": "",
"status": "",
"webhook_status": "",
"created_at": "",
"status_at": "",
"errors": [
{
"code": "",
"message": ""
}
]
}
```
```json 200 (Chain) theme={"system"}
{
"request_id": "",
"chain_id": "",
"deployment_id": "",
"status": "",
"webhook_status": "",
"created_at": "",
"status_at": "",
"errors": [
{
"code": "",
"message": ""
}
]
}
```
### Rate limits
Calls to the get async request status endpoint are limited to **20 requests per second**. If this limit is exceeded, subsequent requests will receive a 429 status code.
To avoid hitting this rate limit, we recommend [configuring a webhook endpoint](/inference/async#configuring-the-webhook-endpoint) to receive async predict results instead of frequently polling this endpoint for async request statuses.
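A minimal webhook receiver using only the standard library. The delivery payload shape is an assumption here: the code relies only on it being JSON that includes a `request_id`, mirroring the response fields above; the port is a placeholder.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_result(body: bytes) -> dict:
    """Decode a webhook delivery body; return {} if it isn't valid JSON."""
    try:
        return json.loads(body)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return {}

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        result = parse_result(self.rfile.read(length))
        # Acknowledge immediately; do any heavy processing out of band.
        print(f"async result for request {result.get('request_id')}")
        self.send_response(200)
        self.end_headers()

# To serve: HTTPServer(("", 8000), WebhookHandler).serve_forever()
```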
```python Python (Model) theme={"system"}
import requests
import os
model_id = ""
request_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = requests.get(
f"https://model-{model_id}.api.baseten.co/async_request/{request_id}",
headers={"Authorization": f"Api-Key {baseten_api_key}"}
)
print(resp.json())
```
```python Python (Chain) theme={"system"}
import requests
import os
chain_id = ""
request_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = requests.get(
f"https://chain-{chain_id}.api.baseten.co/async_request/{request_id}",
headers={"Authorization": f"Api-Key {baseten_api_key}"}
)
print(resp.json())
```
# Get chain async request status
Source: https://docs.baseten.co/reference/inference-api/status-endpoints/get-chain-async-request-status
GET https://chain-{chain_id}.api.baseten.co/async_request/{request_id}
Use this endpoint to get the status of an async request to a chain.
### Parameters
The ID of the chain that executed the request.
The ID of the async request.
### Headers
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
### Response
The ID of the async request.
The ID of the chain that executed the request.
An enum representing the status of the request.
Available options: `QUEUED`, `IN_PROGRESS`, `SUCCEEDED`, `FAILED`, `EXPIRED`, `CANCELED`, `WEBHOOK_FAILED`
An enum representing the status of sending the predict result to the provided webhook.
Available options: `PENDING`, `SUCCEEDED`, `FAILED`, `CANCELED`, `NO_WEBHOOK_PROVIDED`
The time in UTC at which the async request was created.
The time in UTC at which the async request's status was updated.
Any errors that occurred in processing the async request. Empty if no errors occurred.
An enum representing the type of error that occurred.
Available options: `MODEL_PREDICT_ERROR`, `MODEL_PREDICT_TIMEOUT`, `MODEL_NOT_READY`, `MODEL_DOES_NOT_EXIST`, `MODEL_UNAVAILABLE`, `MODEL_INVALID_INPUT`, `ASYNC_REQUEST_NOT_SUPPORTED`, `INTERNAL_SERVER_ERROR`
A message containing details of the error that occurred.
### Rate limits
Calls to the get async request status endpoint are limited to **20 requests per second**. If this limit is exceeded, subsequent requests will receive a 429 status code.
To avoid hitting this rate limit, we recommend [configuring a webhook endpoint](/inference/async#quick-start) to receive async predict results instead of frequently polling this endpoint for async request statuses.
```python Python theme={"system"}
import requests
import os
chain_id = ""
request_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = requests.get(
f"https://chain-{chain_id}.api.baseten.co/async_request/{request_id}",
headers={"Authorization": f"Api-Key {baseten_api_key}"}
)
print(resp.json())
```
```sh cURL theme={"system"}
curl --request GET \
--url https://chain-{chain_id}.api.baseten.co/async_request/{request_id} \
--header "Authorization: Api-Key $BASETEN_API_KEY"
```
```javascript Node.js theme={"system"}
// Node 18+ has a global fetch; run this file as an ES module for top-level await.
const resp = await fetch(
"https://chain-{chain_id}.api.baseten.co/async_request/{request_id}",
{
method: "GET",
    headers: { Authorization: `Api-Key ${process.env.BASETEN_API_KEY}` },
}
);
const data = await resp.json();
console.log(data);
```
# Deployment
Source: https://docs.baseten.co/reference/inference-api/wake/deployment-wake
POST https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/wake
Wake a specific deployment of a model by deployment ID.
Use this endpoint to wake any scaled-to-zero [deployment](/deployment/deployments) of your model.
```sh theme={"system"}
https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/wake
```
### Parameters
The ID of the model you want to wake.
The ID of the specific deployment you want to wake.
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
```python Python theme={"system"}
import urllib3
import os
model_id = ""
deployment_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = urllib3.request(
"POST",
f"https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/wake",
headers={"Authorization": f"Api-Key {baseten_api_key}"},
)
print(resp.json())
```
```sh cURL theme={"system"}
curl -X POST https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/wake \
  -H "Authorization: Api-Key $BASETEN_API_KEY"
```
```javascript Node.js theme={"system"}
// Node 18+ has a global fetch; run this file as an ES module for top-level await.
const resp = await fetch(
"https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/wake",
{
method: "POST",
    headers: { Authorization: `Api-Key ${process.env.BASETEN_API_KEY}` },
}
);
const data = await resp.json();
console.log(data);
```
```json Example Response // Returns a 202 response code theme={"system"}
{}
```
# Development
Source: https://docs.baseten.co/reference/inference-api/wake/development-wake
POST https://model-{model_id}.api.baseten.co/development/wake
Wake the development deployment of a model.
Use this endpoint to wake the [development deployment](/deployment/deployments#development-deployment) of your model if it is scaled to zero.
```sh theme={"system"}
https://model-{model_id}.api.baseten.co/development/wake
```
### Parameters
The ID of the model you want to wake.
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
```python Python theme={"system"}
import urllib3
import os
model_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = urllib3.request(
"POST",
f"https://model-{model_id}.api.baseten.co/development/wake",
headers={"Authorization": f"Api-Key {baseten_api_key}"},
)
print(resp.json())
```
```sh cURL theme={"system"}
curl -X POST https://model-{model_id}.api.baseten.co/development/wake \
  -H "Authorization: Api-Key $BASETEN_API_KEY"
```
```javascript Node.js theme={"system"}
// Node 18+ has a global fetch; run this file as an ES module for top-level await.
const resp = await fetch(
"https://model-{model_id}.api.baseten.co/development/wake",
{
method: "POST",
    headers: { Authorization: `Api-Key ${process.env.BASETEN_API_KEY}` },
}
);
const data = await resp.json();
console.log(data);
```
```json Example Response // Returns a 202 response code theme={"system"}
{}
```
# Production
Source: https://docs.baseten.co/reference/inference-api/wake/production-wake
POST https://model-{model_id}.api.baseten.co/production/wake
Wake the production environment of a model.
Use this endpoint to wake the [production environment](/deployment/deployments#environments-and-promotion) of your model if it is scaled to zero.
```sh theme={"system"}
https://model-{model_id}.api.baseten.co/production/wake
```
### Parameters
The ID of the model you want to wake.
Your Baseten API key, formatted with prefix `Api-Key` (e.g. `{"Authorization": "Api-Key abcd1234.abcd1234"}`).
```python Python theme={"system"}
import urllib3
import os
model_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = urllib3.request(
"POST",
f"https://model-{model_id}.api.baseten.co/production/wake",
headers={"Authorization": f"Api-Key {baseten_api_key}"},
)
print(resp.json())
```
```sh cURL theme={"system"}
curl -X POST https://model-{model_id}.api.baseten.co/production/wake \
  -H "Authorization: Api-Key $BASETEN_API_KEY"
```
```javascript Node.js theme={"system"}
// Node 18+ has a global fetch; run this file as an ES module for top-level await.
const resp = await fetch(
"https://model-{model_id}.api.baseten.co/production/wake",
{
method: "POST",
    headers: { Authorization: `Api-Key ${process.env.BASETEN_API_KEY}` },
}
);
const data = await resp.json();
console.log(data);
```
```json Example Response // Returns a 202 response code theme={"system"}
{}
```
# Create an API key
Source: https://docs.baseten.co/reference/management-api/api-keys/creates-an-api-key
post /v1/api_keys
Creates an API key with the provided name and type. The API key is returned in the response.
# Delete an API key
Source: https://docs.baseten.co/reference/management-api/api-keys/delete-an-api-key
delete /v1/api_keys/{api_key_prefix}
Deletes an API key by prefix and returns info about the API key.
# Get all API keys
Source: https://docs.baseten.co/reference/management-api/api-keys/lists-the-users-api-keys
get /v1/api_keys
Lists all API keys your account has access to.
```json 200 theme={"system"}
{
"name": "my-api-key",
"type": "PERSONAL"
}
```
# Get billing usage summary
Source: https://docs.baseten.co/reference/management-api/billing/gets-billing-usage-summary-for-a-date-range
get /v1/billing/usage_summary
Returns billing usage data within the specified date range. Includes dedicated model serving, training, and model APIs usage. The date range must not exceed 31 days.
# Delete chains
Source: https://docs.baseten.co/reference/management-api/chains/deletes-a-chain-by-id
delete /v1/chains/{chain_id}
# By ID
Source: https://docs.baseten.co/reference/management-api/chains/gets-a-chain-by-id
get /v1/chains/{chain_id}
# All chains
Source: https://docs.baseten.co/reference/management-api/chains/gets-all-chains
get /v1/chains
# Any deployment by ID
Source: https://docs.baseten.co/reference/management-api/deployments/activate/activates-a-deployment
post /v1/models/{model_id}/deployments/{deployment_id}/activate
Activates an inactive deployment and returns the activation status.
# Activate environment deployment
Source: https://docs.baseten.co/reference/management-api/deployments/activate/activates-a-deployment-associated-with-an-environment
post /v1/models/{model_id}/environments/{env_name}/activate
Activates an inactive deployment associated with an environment and returns the activation status.
# Development deployment
Source: https://docs.baseten.co/reference/management-api/deployments/activate/activates-a-development-deployment
post /v1/models/{model_id}/deployments/development/activate
Activates an inactive development deployment and returns the activation status.
# Activate production deployment
Source: https://docs.baseten.co/reference/management-api/deployments/activate/activates-production-deployment
post /v1/models/{model_id}/deployments/production/activate
Activates an inactive production deployment and returns the activation status.
# Update chainlet environment's autoscaling settings
Source: https://docs.baseten.co/reference/management-api/deployments/autoscaling/update-a-chainlet-environments-autoscaling-settings
patch /v1/chains/{chain_id}/environments/{env_name}/chainlet_settings/autoscaling_settings
Updates a chainlet environment's autoscaling settings and returns the updated chainlet environment settings.
# Any model deployment by ID
Source: https://docs.baseten.co/reference/management-api/deployments/autoscaling/updates-a-deployments-autoscaling-settings
patch /v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings
Updates a deployment's autoscaling settings and returns the update status.
To update autoscaling settings at the environment level, use the [update environment settings](/reference/management-api/environments/update-an-environments-settings) endpoint.
# Development model deployment
Source: https://docs.baseten.co/reference/management-api/deployments/autoscaling/updates-a-development-deployments-autoscaling-settings
patch /v1/models/{model_id}/deployments/development/autoscaling_settings
Updates a development deployment's autoscaling settings and returns the update status.
To update autoscaling settings at the environment level, use the [update environment settings](/reference/management-api/environments/update-an-environments-settings) endpoint.
# Update production deployment autoscaling settings
Source: https://docs.baseten.co/reference/management-api/deployments/autoscaling/updates-production-deployment-autoscaling-settings
patch /v1/models/{model_id}/deployments/production/autoscaling_settings
Updates a production deployment's autoscaling settings and returns the update status.
# Any deployment by ID
Source: https://docs.baseten.co/reference/management-api/deployments/deactivate/deactivates-a-deployment
post /v1/models/{model_id}/deployments/{deployment_id}/deactivate
Deactivates a deployment and returns the deactivation status.
# Deactivate environment deployment
Source: https://docs.baseten.co/reference/management-api/deployments/deactivate/deactivates-a-deployment-associated-with-an-environment
post /v1/models/{model_id}/environments/{env_name}/deactivate
Deactivates a deployment associated with an environment and returns the deactivation status.
# Development deployment
Source: https://docs.baseten.co/reference/management-api/deployments/deactivate/deactivates-a-development-deployment
post /v1/models/{model_id}/deployments/development/deactivate
Deactivates a development deployment and returns the deactivation status.
# Deactivate production deployment
Source: https://docs.baseten.co/reference/management-api/deployments/deactivate/deactivates-production-deployment
post /v1/models/{model_id}/deployments/production/deactivate
Deactivates a production deployment and returns the deactivation status.
# Delete chain deployment
Source: https://docs.baseten.co/reference/management-api/deployments/deletes-a-chain-deployment-by-id
delete /v1/chains/{chain_id}/deployments/{chain_deployment_id}
# Delete model deployments
Source: https://docs.baseten.co/reference/management-api/deployments/deletes-a-models-deployment-by-id
delete /v1/models/{model_id}/deployments/{deployment_id}
Deletes a model's deployment by ID and returns the tombstone of the deployment.
# Any chain deployment by ID
Source: https://docs.baseten.co/reference/management-api/deployments/gets-a-chain-deployment-by-id
get /v1/chains/{chain_id}/deployments/{chain_deployment_id}
# Any model deployment by ID
Source: https://docs.baseten.co/reference/management-api/deployments/gets-a-models-deployment-by-id
get /v1/models/{model_id}/deployments/{deployment_id}
Gets a model's deployment by ID and returns the deployment.
# Development model deployment
Source: https://docs.baseten.co/reference/management-api/deployments/gets-a-models-development-deployment
get /v1/models/{model_id}/deployments/development
Gets a model's development deployment and returns the deployment.
# Production model deployment
Source: https://docs.baseten.co/reference/management-api/deployments/gets-a-models-production-deployment
get /v1/models/{model_id}/deployments/production
Gets a model's production deployment and returns the deployment.
# Get all chain deployments
Source: https://docs.baseten.co/reference/management-api/deployments/gets-all-chain-deployments
get /v1/chains/{chain_id}/deployments
# Get all model deployments
Source: https://docs.baseten.co/reference/management-api/deployments/gets-all-deployments-of-a-model
get /v1/models/{model_id}/deployments
# Cancel model promotion
Source: https://docs.baseten.co/reference/management-api/deployments/promote/cancel-promotion
post /v1/models/{model_id}/environments/{env_name}/cancel_promotion
Cancels an ongoing promotion to an environment and returns the cancellation status.
```json 200 theme={"system"}
{
"status": "CANCELED",
"message": "Promotion to production was successfully canceled."
}
```
```json 400 theme={"system"}
{
"code": "VALIDATION_ERROR",
"message": "Environment production has no in progress promotion."
}
```
# Force cancel rolling deployment
Source: https://docs.baseten.co/reference/management-api/deployments/promote/force-cancel-promotion
post /v1/models/{model_id}/environments/{env_name}/force_cancel_promotion
Immediately cancels an in-progress rolling promotion and triggers rollback to the previous version.
# Force roll forward promotion
Source: https://docs.baseten.co/reference/management-api/deployments/promote/force-roll-forward-promotion
post /v1/models/{model_id}/environments/{env_name}/force_roll_forward_promotion
Immediately completes the rolling promotion, shifting all traffic to the new version. This works even if the promotion is in the process of rolling back.
# Pause rolling deployment
Source: https://docs.baseten.co/reference/management-api/deployments/promote/pause-promotion
post /v1/models/{model_id}/environments/{env_name}/pause_promotion
Pauses an in-progress rolling promotion after the current step completes. No further scaling changes are made until resumed.
# Promote to chain environment
Source: https://docs.baseten.co/reference/management-api/deployments/promote/promotes-a-chain-deployment-to-an-environment
post /v1/chains/{chain_id}/environments/{env_name}/promote
Promotes an existing chain deployment to an environment and returns the promoted chain deployment.
# Promote to model environment
Source: https://docs.baseten.co/reference/management-api/deployments/promote/promotes-a-deployment-to-an-environment
post /v1/models/{model_id}/environments/{env_name}/promote
Promotes an existing deployment to an environment and returns the promoted deployment.
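As a sketch, the promotion can be issued with Python's standard library. The `deployment_id` body field is an illustrative assumption, as are the base URL and auth scheme; consult the endpoint's parameter schema for the exact request body:

```python
# Sketch: promote an existing deployment to an environment.
# The request body field `deployment_id` is an assumption for illustration.
import json
import urllib.request


def promote_request(api_key: str, model_id: str, env_name: str,
                    deployment_id: str) -> urllib.request.Request:
    """Builds (but does not send) the POST request."""
    url = (f"https://api.baseten.co/v1/models/{model_id}"
           f"/environments/{env_name}/promote")
    return urllib.request.Request(
        url,
        data=json.dumps({"deployment_id": deployment_id}).encode("utf-8"),
        headers={
            "Authorization": f"Api-Key {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = promote_request("YOUR_API_KEY", "MODEL_ID", "production", "DEPLOYMENT_ID")
# urllib.request.urlopen(req) would send the request.
```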
# Any model deployment by ID
Source: https://docs.baseten.co/reference/management-api/deployments/promote/promotes-a-deployment-to-production
post /v1/models/{model_id}/deployments/{deployment_id}/promote
Promotes an existing deployment to production and returns the same deployment.
# Development model deployment
Source: https://docs.baseten.co/reference/management-api/deployments/promote/promotes-a-development-deployment-to-production
post /v1/models/{model_id}/deployments/development/promote
Creates a new production deployment from the development deployment and returns the currently building deployment.
# Resume rolling deployment
Source: https://docs.baseten.co/reference/management-api/deployments/promote/resume-promotion
post /v1/models/{model_id}/environments/{env_name}/resume_promotion
Resumes a paused rolling promotion, continuing from where it was paused.
# Terminate deployment replica
Source: https://docs.baseten.co/reference/management-api/deployments/terminates-deployment-replica
delete /v1/models/{model_id}/deployments/{deployment_id}/replicas/{replica_id}
Terminates a deployment replica and returns the termination status.
# Create Chain environment
Source: https://docs.baseten.co/reference/management-api/environments/create-a-chain-environment
post /v1/chains/{chain_id}/environments
Creates a chain environment and returns the resulting environment.
# Create environment
Source: https://docs.baseten.co/reference/management-api/environments/create-an-environment
post /v1/models/{model_id}/environments
Creates an environment for the specified model and returns the environment.
# Get Chain environment
Source: https://docs.baseten.co/reference/management-api/environments/get-a-chain-environments-details
get /v1/chains/{chain_id}/environments/{env_name}
Gets a chain environment's details and returns the chain environment.
# Get all Chain environments
Source: https://docs.baseten.co/reference/management-api/environments/get-all-chain-environments
get /v1/chains/{chain_id}/environments
Gets all chain environments for a given chain.
# Get all environments
Source: https://docs.baseten.co/reference/management-api/environments/get-all-environments
get /v1/models/{model_id}/environments
Gets all environments for a given model.
# Get environment
Source: https://docs.baseten.co/reference/management-api/environments/get-an-environments-details
get /v1/models/{model_id}/environments/{env_name}
Gets an environment's details and returns the environment.
# Update Chain environment
Source: https://docs.baseten.co/reference/management-api/environments/update-a-chain-environments-settings
patch /v1/chains/{chain_id}/environments/{env_name}
Updates a chain environment's settings and returns the chain environment.
# Update chainlet environment's instance type
Source: https://docs.baseten.co/reference/management-api/environments/update-a-chainlet-environments-instance-type-settings
post /v1/chains/{chain_id}/environments/{env_name}/chainlet_settings/instance_types/update
Updates a chainlet environment's instance type settings. The chainlet environment setting must exist. When updated, a new chain deployment is created and deployed. It is promoted to the chain environment according to promotion settings on the environment.
# Update model environment
Source: https://docs.baseten.co/reference/management-api/environments/update-an-environments-settings
patch /v1/models/{model_id}/environments/{env_name}
Updates an environment's settings and returns the updated environment.
# All instance types
Source: https://docs.baseten.co/reference/management-api/instance-types/gets-all-instance-types
get /v1/instance_types
# Instance type prices
Source: https://docs.baseten.co/reference/management-api/instance-types/gets-instance-type-prices
get /v1/instance_type_prices
# Delete models
Source: https://docs.baseten.co/reference/management-api/models/deletes-a-model-by-id
delete /v1/models/{model_id}
# By ID
Source: https://docs.baseten.co/reference/management-api/models/gets-a-model-by-id
get /v1/models/{model_id}
# All models
Source: https://docs.baseten.co/reference/management-api/models/gets-all-models
get /v1/models
# Overview
Source: https://docs.baseten.co/reference/management-api/overview
Manage models and deployments with the Baseten management API. It supports monitoring, CI/CD, and automation at both the model and workspace levels.
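Every endpoint below authenticates with an API key. A minimal sketch, assuming the `https://api.baseten.co` base URL and the `Api-Key` authorization header:

```python
# Minimal sketch: build an authenticated GET request to the management API.
# The base URL and `Api-Key` header scheme are assumptions for illustration.
import urllib.request


def management_get(api_key: str, path: str) -> urllib.request.Request:
    """Builds (but does not send) an authenticated GET request."""
    return urllib.request.Request(
        f"https://api.baseten.co{path}",
        headers={"Authorization": f"Api-Key {api_key}"},
    )


req = management_get("YOUR_API_KEY", "/v1/models")
# urllib.request.urlopen(req).read() returns the JSON response body.
```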
## Model endpoints
| Method | Endpoint | Description |
| :----- | :-------------------------------------------------------------------------------- | :--------------- |
| `GET` | [`/v1/models`](/reference/management-api/models/gets-all-models) | Get all models |
| `GET` | [`/v1/models/{model_id}`](/reference/management-api/models/gets-a-model-by-id) | Get models by ID |
| `DEL` | [`/v1/models/{model_id}`](/reference/management-api/models/deletes-a-model-by-id) | Delete models |
## Chain endpoints
| Method | Endpoint | Description |
| :----- | :-------------------------------------------------------------------------------- | :---------------- |
| `GET` | [`/v1/chains`](/reference/management-api/chains/gets-all-chains) | Get all Chains |
| `GET` | [`/v1/chains/{chain_id}`](/reference/management-api/chains/gets-a-chain-by-id) | Get a Chain by ID |
| `DEL` | [`/v1/chains/{chain_id}`](/reference/management-api/chains/deletes-a-chain-by-id) | Delete Chains |
## Deployment endpoints
### Activate a model deployment
| Method | Endpoint | Description |
| :----- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------------------------- |
| `POST` | [`/v1/models/{model_id}/environments/{env_name}/activate`](/reference/management-api/deployments/activate/activates-a-deployment-associated-with-an-environment) | **Activate** an environment |
| `POST` | [`/v1/models/{model_id}/deployments/development/activate`](/reference/management-api/deployments/activate/activates-a-development-deployment) | **Activate** development |
| `POST` | [`/v1/models/{model_id}/deployments/{deployment_id}/activate`](/reference/management-api/deployments/activate/activates-a-deployment) | **Activate** a deployment |
### Deactivate a model deployment
| Method | Endpoint | Description |
| :----- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------------------------- |
| `POST` | [`/v1/models/{model_id}/environments/{env_name}/deactivate`](/reference/management-api/deployments/deactivate/deactivates-a-deployment-associated-with-an-environment) | **Deactivate** an environment |
| `POST` | [`/v1/models/{model_id}/deployments/development/deactivate`](/reference/management-api/deployments/deactivate/deactivates-a-development-deployment) | **Deactivate** development |
| `POST` | [`/v1/models/{model_id}/deployments/{deployment_id}/deactivate`](/reference/management-api/deployments/deactivate/deactivates-a-deployment) | **Deactivate** a deployment |
### Promote a model deployment
| Method | Endpoint | Description |
| :----- | :------------------------------------------------------------------------------------------------------------------------------------------------------- | :--------------------------------------- |
| `POST` | [`/v1/models/{model_id}/environments/{env_name}/promote`](/reference/management-api/deployments/promote/promotes-a-deployment-to-an-environment) | **Promote** to model **environment** |
| `POST` | [`/v1/models/{model_id}/environments/{env_name}/cancel_promotion`](/reference/management-api/deployments/promote/cancel-promotion) | **Cancel** a promotion to an environment |
| `POST` | [`/v1/models/{model_id}/deployments/development/promote`](/reference/management-api/deployments/promote/promotes-a-development-deployment-to-production) | **Promote** development deployment |
| `POST` | [`/v1/models/{model_id}/deployments/{deployment_id}/promote`](/reference/management-api/deployments/promote/promotes-a-deployment-to-production) | **Promote** any deployment |
### Autoscaling
| Method | Endpoint | Description |
| :------ | :------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------------------------------------------- |
| `PATCH` | [`.../deployments/development/autoscaling_settings`](/reference/management-api/deployments/autoscaling/updates-a-development-deployments-autoscaling-settings) | Updates **development's autoscaling** settings |
| `PATCH` | [`.../deployments/{deployment_id}/autoscaling_settings`](/reference/management-api/deployments/autoscaling/updates-a-deployments-autoscaling-settings) | Updates a **deployment's autoscaling** settings |
### Manage deployment endpoints
| Method | Endpoint | Description |
| :----- | :----------------------------------------------------------------------------------------------------------------------------- | :--------------------------- |
| `GET` | [`/v1/models/{model_id}/deployments`](/reference/management-api/deployments/gets-all-deployments-of-a-model) | Get all model deployments |
| `GET` | [`/v1/models/{model_id}/deployments/production`](/reference/management-api/deployments/gets-a-models-production-deployment) | Production model deployment |
| `GET` | [`/v1/models/{model_id}/deployments/development`](/reference/management-api/deployments/gets-a-models-development-deployment) | Development model deployment |
| `GET` | [`/v1/models/{model_id}/deployments/{deployment_id}`](/reference/management-api/deployments/gets-a-models-deployment-by-id) | Any model deployment by ID |
| `DEL` | [`/v1/models/{model_id}/deployments/{deployment_id}`](/reference/management-api/deployments/deletes-a-models-deployment-by-id) | Delete model deployments |
### Promote a Chain deployment
| Method | Endpoint | Description |
| :----- | :----------------------------------------------------------------------------------------------------------------------------------------------------- | :--------------------------- |
| `POST` | [`/v1/chains/{chain_id}/environments/{env_name}/promote`](/reference/management-api/deployments/promote/promotes-a-chain-deployment-to-an-environment) | Promote to chain environment |
### Autoscaling
| Method | Endpoint | Description |
| :------ | :---------------------------------------------------------------------------------------------------------------------------------------------------- | :----------------------------------------------------- |
| `PATCH` | [`.../chainlet_settings/autoscaling_settings`](/reference/management-api/deployments/autoscaling/update-a-chainlet-environments-autoscaling-settings) | **Update chainlet** environment's autoscaling settings |
### Manage Chain deployments
| Method | Endpoint | Description |
| :----- | :---------------------------------------------------------------------------------------------------------------------------------- | :--------------------------- |
| `GET` | [`/v1/chains/{chain_id}/deployments`](/reference/management-api/deployments/gets-all-chain-deployments) | Get all chain deployments |
| `GET` | [`/v1/chains/{chain_id}/deployments/{chain_deployment_id}`](/reference/management-api/deployments/gets-a-chain-deployment-by-id) | Any chain deployment by ID |
| `DEL` | [`/v1/chains/{chain_id}/deployments/{chain_deployment_id}`](/reference/management-api/deployments/deletes-a-chain-deployment-by-id) | **Delete** chain deployments |
## Environment endpoints
| Method | Endpoint | Description |
| :------ | :------------------------------------------------------------------------------------------------------------------------ | :------------------------- |
| `POST` | [`/v1/models/{model_id}/environments`](/reference/management-api/environments/create-an-environment) | Create environment |
| `GET` | [`/v1/models/{model_id}/environments`](/reference/management-api/environments/get-all-environments) | Get all environments |
| `GET`   | [`/v1/models/{model_id}/environments/{env_name}`](/reference/management-api/environments/get-an-environments-details)     | Get an environment's details |
| `PATCH` | [`/v1/models/{model_id}/environments/{env_name}`](/reference/management-api/environments/update-an-environments-settings) | Update model environment |
| Method | Endpoint | Description |
| :------ | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------ |
| `POST` | [`/v1/chains/{chain_id}/environments`](/reference/management-api/environments/create-a-chain-environment) | Create chain environment |
| `GET` | [`/v1/chains/{chain_id}/environments`](/reference/management-api/environments/get-all-chain-environments) | Get all chain environments |
| `GET` | [`/v1/chains/{chain_id}/environments/{env_name}`](/reference/management-api/environments/get-a-chain-environments-details) | Get a chain environment |
| `PATCH` | [`/v1/chains/{chain_id}/environments/{env_name}`](/reference/management-api/environments/update-a-chain-environments-settings) | Update chain environment |
| `POST` | [`/v1/chains/{chain_id}/environments/{env_name}/chainlet_settings/instance_types/update`](/reference/management-api/environments/update-a-chainlet-environments-instance-type-settings) | Update chainlet environment's instance type |
## Instance type endpoints
| Method | Endpoint | Description |
| :----- | :----------------------------------------------------------------------------------------------- | :----------------------- |
| `GET` | [`/v1/instance_types`](/reference/management-api/instance-types/gets-all-instance-types) | Get all instance types |
| `GET` | [`/v1/instance_type_prices`](/reference/management-api/instance-types/gets-instance-type-prices) | Get instance type prices |
## Team endpoints
| Method | Endpoint | Description |
| :----- | :------------------------------------------------------------- | :------------ |
| `GET` | [`/v1/teams`](/reference/management-api/teams/lists-all-teams) | Get all teams |
## Secret endpoints
| Method | Endpoint | Description |
| :----- | :------------------------------------------------------------------------------------- | :----------------------------- |
| `GET` | [`/v1/secrets`](/reference/management-api/secrets/gets-all-secrets) | Get all secrets |
| `POST` | [`/v1/secrets`](/reference/management-api/secrets/upserts-a-secret) | Create or update a secret |
| `GET` | [`/v1/teams/{team_id}/secrets`](/reference/management-api/teams/gets-all-team-secrets) | Get all team secrets |
| `POST` | [`/v1/teams/{team_id}/secrets`](/reference/management-api/teams/upserts-a-team-secret) | Create or update a team secret |
## API Key endpoints
| Method | Endpoint | Description |
| :------- | :--------------------------------------------------------------------------------------- | :-------------------- |
| `GET` | [`/v1/api_keys`](/reference/management-api/api-keys/lists-the-users-api-keys) | Get all API keys |
| `POST` | [`/v1/api_keys`](/reference/management-api/api-keys/creates-an-api-key) | Create an API key |
| `DELETE` | [`/v1/api_keys/{api_key_prefix}`](/reference/management-api/api-keys/delete-an-api-key) | Delete an API key |
| `POST` | [`/v1/teams/{team_id}/api_keys`](/reference/management-api/teams/creates-a-team-api-key) | Create a team API key |
## Training endpoints
| Method | Endpoint | Description |
| :----- | :------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------ |
| `POST` | [`/v1/teams/{team_id}/training_projects`](/reference/management-api/teams/creates-a-team-training-project) | Create a team training project |
| `GET` | [`/v1/training_projects/{training_project_id}/jobs`](/reference/training-api/list-training-jobs) | Get all training jobs |
| `GET` | [`/v1/training_projects/{training_project_id}/jobs/{training_job_id}`](/reference/training-api/get-training-job) | Get training job by ID |
| `POST` | [`/v1/training_jobs/search`](/reference/training-api/search-training-jobs) | Search training jobs |
| `POST` | [`/v1/training_projects/{training_project_id}/jobs/{training_job_id}/stop`](/reference/training-api/stop-training-job) | Stop a training job |
| `POST` | [`/v1/training_projects/{training_project_id}/jobs/{training_job_id}/recreate`](/reference/training-api/recreate-training-job) | Recreate a training job |
| `GET` | [`/v1/training_projects/{training_project_id}/jobs/{training_job_id}/checkpoints`](/reference/training-api/get-training-job-checkpoints) | Get training job checkpoints |
| `GET` | [`/v1/training_projects/{training_project_id}/jobs/{training_job_id}/checkpoint_files`](/reference/training-api/get-training-job-checkpoint-files) | Get checkpoint files |
| `GET` | [`/v1/training_projects/{training_project_id}/jobs/{training_job_id}/logs`](/reference/training-api/get-training-job-logs) | Get training job logs |
| `GET` | [`/v1/training_projects/{training_project_id}/jobs/{training_job_id}/metrics`](/reference/training-api/get-training-job-metrics) | Get training job metrics |
| `GET` | [`/v1/training_projects/{training_project_id}/jobs/{training_job_id}/download`](/reference/training-api/download-training-job) | Download training job artifacts |
# Get all secrets
Source: https://docs.baseten.co/reference/management-api/secrets/gets-all-secrets
get /v1/secrets
# Upsert a secret
Source: https://docs.baseten.co/reference/management-api/secrets/upserts-a-secret
post /v1/secrets
Creates a new secret or updates an existing secret if one with the provided name already exists. The name and creation date of the created or updated secret are returned.
# Create a team API key
Source: https://docs.baseten.co/reference/management-api/teams/creates-a-team-api-key
post /v1/teams/{team_id}/api_keys
Creates a team API key with the provided name and type. The API key is returned in the response.
# Create a team training project
Source: https://docs.baseten.co/reference/management-api/teams/creates-a-team-training-project
post /v1/teams/{team_id}/training_projects
Upserts a training project with the specified metadata for a team.
# Get all team secrets
Source: https://docs.baseten.co/reference/management-api/teams/gets-all-team-secrets
get /v1/teams/{team_id}/secrets
# List all teams
Source: https://docs.baseten.co/reference/management-api/teams/lists-all-teams
get /v1/teams
Returns a list of all teams the authenticated user has access to.
# Upsert a team secret
Source: https://docs.baseten.co/reference/management-api/teams/upserts-a-team-secret
post /v1/teams/{team_id}/secrets
Creates a new secret or updates an existing secret if one with the provided name already exists. The name and creation date of the created or updated secret are returned. The secret belongs to the specified team.
# Reference documentation
Source: https://docs.baseten.co/reference/overview
For deploying, managing, and interacting with machine learning models on Baseten.
This reference section documents our API, CLI, and Python SDK for deploying models, managing inference chains, and calling endpoints in production.
## API Reference
Baseten provides two sets of API endpoints:
* **Inference API**: for calling deployed models and chains.
* **Management API**: for managing models, workspaces, and training jobs.
## CLI Reference
The CLI provides a command-line interface for managing deployments, running local inference, and configuring Truss models.
* [Truss CLI reference](/reference/cli/truss/overview): Commands for initializing, deploying, and managing models.
* [Chains CLI reference](/reference/cli/chains/chains-cli): Commands for orchestrating multi-model workflows.
* [Training CLI reference](/reference/cli/training/training-cli): Commands for managing training jobs.
***
## SDK Reference
The Python SDK provides an abstraction for deploying models, managing deployments, and interacting with models via code.
* [Truss SDK reference](/reference/sdk/truss): Deploy and interact with Truss models using Python.
* [Chains SDK reference](/reference/sdk/chains): Build and manage inference chains programmatically.
* [Training SDK reference](/reference/sdk/training): Deploy and interact with trained models using Python.
# Chains SDK Reference
Source: https://docs.baseten.co/reference/sdk/chains
Python SDK Reference for Chains
# Chainlet classes
APIs for creating user-defined Chainlets.
### *class* `truss_chains.ChainletBase`
Base class for all chainlets.
Inheriting from this class adds validations to make sure subclasses adhere to the
chainlet pattern and facilitates remote chainlet deployment.
Refer to [the docs](/development/chain/getting-started) and this
[example chainlet](https://github.com/basetenlabs/truss/blob/main/truss-chains/truss_chains/reference_code/reference_chainlet.py)
for more guidance on how to create subclasses.
### *class* `truss_chains.ModelBase`
Base class for all standalone models.
Inheriting from this class adds validations to make sure subclasses adhere to the
truss model pattern.
### *class* `truss_chains.EngineBuilderLLMChainlet`
#### *method final async* run\_remote(llm\_input)
**Parameters:**
| Name | Type | Description |
| ----------- | ----------------------- | -------------------------- |
| `llm_input` | *EngineBuilderLLMInput* | OpenAI compatible request. |
* **Returns:**
*AsyncIterator*\[str]
### *function* `truss_chains.depends`
Sets a “symbolic marker” to indicate to the framework that a chainlet is a
dependency of another chainlet. The return value of `depends` is intended to be
used as a default argument in a chainlet’s `__init__`-method.
When deploying a chain remotely, a corresponding stub to the remote is injected in
its place. In [`run_local`](#function-truss-chains-run-local) mode an instance
of a local chainlet is injected.
Refer to [the docs](/development/chain/getting-started) and this
[example chainlet](https://github.com/basetenlabs/truss/blob/main/truss-chains/truss_chains/reference_code/reference_chainlet.py)
for more guidance on how to make one chainlet depend on another chainlet.
Despite the type annotation, this does *not* immediately provide a
chainlet instance. Only when deploying remotely or using `run_local` is a
chainlet instance provided.
**Parameters:**
| Name | Type | Default | Description |
| ------------------- | --------------------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `chainlet_cls` | *Type\[[ChainletBase](#class-truss-chains-chainletbase)]* | | The chainlet class of the dependency. |
| `retries`           | *int*                                                     | `1`     | The number of times to retry the remote chainlet in case of failures (e.g. due to transient network issues). For streaming, retries are only made if the request fails before streaming any results back. Failures mid-stream are not retried.                                               |
| `timeout_sec` | *float* | `600.0` | Timeout for the HTTP request to this chainlet. |
| `use_binary` | *bool* | `False` | Whether to send data in binary format. This can give a parsing speedup and message size reduction (\~25%) for numpy arrays. Use `NumpyArrayField` as a field type on pydantic models for integration and set this option to `True`. For simple text data, there is no significant benefit. |
| `concurrency_limit` | *int*                                                     | `300`   | The maximum number of concurrent requests to send to the remote chainlet. Excessive requests will be queued and a warning will be shown. Try to design your algorithm in a way that spreads requests evenly over time so that the default value can be used.                                 |
* **Returns:**
A “symbolic marker” to be used as a default argument in a chainlet’s
initializer.
### *function* `truss_chains.depends_context`
Sets a “symbolic marker” for injecting a context object at runtime.
Refer to [the docs](/development/chain/getting-started) and this
[example chainlet](https://github.com/basetenlabs/truss/blob/main/truss-chains/truss_chains/reference_code/reference_chainlet.py)
for more guidance on the `__init__`-signature of chainlets.
Despite the type annotation, this does *not* immediately provide a
context instance. Only when deploying remotely or using `run_local` is a
context instance provided.
* **Returns:**
A “symbolic marker” to be used as a default argument in a chainlet’s
initializer.
### *class* `truss_chains.DeploymentContext`
Bases: `pydantic.BaseModel`
Bundles config values and resources needed to instantiate Chainlets.
The context can optionally be added as a trailing argument in a Chainlet’s
`__init__` method and then used to set up the chainlet (e.g. using a secret as
an access token for downloading model weights).
**Parameters:**
| Name | Type | Default | Description |
| --------------------- | ------------------------------------------------------------------------------------------ | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `chainlet_to_service` | *Mapping\[str,[DeployedServiceDescriptor](#class-truss-chains-deployedservicedescriptor)]* | | A mapping from chainlet names to service descriptors. This is used to create RPC sessions to dependency chainlets. It contains only the chainlet services that are dependencies of the current chainlet. |
| `secrets` | *Mapping\[str,str]* | | A mapping from secret names to secret values. It contains only the secrets that are listed in `remote_config.assets.secret_keys` of the current chainlet. |
| `data_dir` | *Path\|None* | `None` | The directory where the chainlet can store and access data, e.g. for downloading model weights. |
| `environment` | *[Environment](#class-truss-chains-environment)\|None* | `None` | The environment that the chainlet is deployed in. None if the chainlet is not associated with an environment. |
#### *method* get\_baseten\_api\_key()
* **Returns:**
str
#### *method* get\_service\_descriptor(chainlet\_name)
**Parameters:**
| Name | Type | Description |
| --------------- | ----- | ------------------------- |
| `chainlet_name` | *str* | The name of the chainlet. |
* **Returns:**
[*DeployedServiceDescriptor*](#class-truss-chains-deployedservicedescriptor)
### *class* `truss_chains.Environment`
Bases: `pydantic.BaseModel`
The environment the chainlet is deployed in.
* **Parameters:**
**name** (*str*) – The name of the environment.
### *class* `truss_chains.ChainletOptions`
Bases: `pydantic.BaseModel`
**Parameters:**
| Name | Type | Default | Description |
| ------------------------ | ----------------------------------------------------- | ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `enable_b10_tracing`     | *bool*                                                | `False`                                  | Enables Baseten-internal trace data collection, which helps Baseten engineers analyze chain performance in case of issues. It is independent of any user-configured tracing instrumentation. Turning this on could add performance overhead.                     |
| `enable_debug_logs` | *bool* | `False` | Sets log level to debug in deployed server. |
| `env_variables` | *Mapping\[str,str]* | `{}` | static environment variables available to the deployed chainlet. |
| `health_checks` | *HealthChecks* | `truss.base.truss_config.HealthChecks()` | Configures health checks for the chainlet. See [guide](https://docs.baseten.co/truss/guides/custom-health-checks#chains). |
| `metadata` | *JsonValue\|None* | `None` | Arbitrary JSON object to describe chainlet. |
| `streaming_read_timeout` | *int* | `60` | Amount of time (in seconds) between each streamed chunk before a timeout is triggered. |
| `transport` | *HTTPOptions\|WebsocketOptions\|GRPCOptions\|None* | `None` | Allows customizing certain transport protocols, e.g. websocket pings. |
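For illustration, these options are attached via the `options` field of `RemoteConfig`; the specific values below are examples, not recommendations:

```python theme={"system"}
import truss_chains as chains

class MyChainlet(chains.ChainletBase):
    remote_config = chains.RemoteConfig(
        options=chains.ChainletOptions(
            enable_debug_logs=True,
            env_variables={"LOG_FORMAT": "json"},
            streaming_read_timeout=120,
        )
    )
```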
### *class* `truss_chains.RPCOptions`
Bases: `pydantic.BaseModel`
Options to customize RPCs to dependency chainlets.
**Parameters:**
| Name | Type | Default | Description |
| ------------------- | ------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `retries` | *int* | `1` | The number of times to retry the remote chainlet in case of failures (e.g. due to transient network issues). For streaming, retries are only made if the request fails before streaming any results back. Failures mid-stream are not retried. |
| `timeout_sec` | *float* | `600.0` | Timeout for the HTTP request to this chainlet. |
| `use_binary` | *bool* | `False` | Whether to send data in binary format. This can give a parsing speedup and message size reduction (\~25%) for numpy arrays. Use `NumpyArrayField` as a field type on pydantic models for integration and set this option to `True`. For simple text data, there is no significant benefit. |
| `concurrency_limit` | *int* | `300` | The maximum number of concurrent requests to send to the remote chainlet. Excessive requests will be queued and a warning will be shown. Try to design your algorithm in a way that spreads requests evenly over time so that the default value can be used. |
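RPC options are typically set when declaring a dependency with `chains.depends`, which forwards them to `RPCOptions`. A minimal sketch (the `Worker` chainlet and the specific values are illustrative):

```python theme={"system"}
import truss_chains as chains

class Worker(chains.ChainletBase):
    async def run_remote(self, value: int) -> int:
        return value * 2

class Orchestrator(chains.ChainletBase):
    def __init__(
        self,
        # Retry transient failures up to 3 times, with a 2-minute timeout.
        worker: Worker = chains.depends(Worker, retries=3, timeout_sec=120.0),
    ):
        self._worker = worker

    async def run_remote(self, value: int) -> int:
        return await self._worker.run_remote(value)
```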
### *function* `truss_chains.mark_entrypoint`
Decorator to mark a chainlet as the entrypoint of a chain.
This decorator can be applied to *one* chainlet in a source file, which simplifies the
CLI push command: only the file, not the class within it, must be specified.
Optionally, a display name for the Chain (not the Chainlet) can be set (effectively
giving a custom default value for the `name` arg of the CLI push command).
Example usage:
```python theme={"system"}
import truss_chains as chains
@chains.mark_entrypoint
class MyChainlet(ChainletBase):
...
# OR with custom Chain name.
@chains.mark_entrypoint("My Chain Name")
class MyChainlet(ChainletBase):
...
```
# Remote Configuration
These data structures specify, for each chainlet, how it is deployed remotely, e.g. its dependencies and compute resources.
### *class* `truss_chains.RemoteConfig`
Bases: `pydantic.BaseModel`
Bundles config values needed to deploy a chainlet remotely.
This is specified as a class variable for each chainlet class, e.g.:
```python theme={"system"}
import truss_chains as chains
class MyChainlet(chains.ChainletBase):
remote_config = chains.RemoteConfig(
docker_image=chains.DockerImage(
pip_requirements=["torch==2.0.1", ...]
),
compute=chains.Compute(cpu_count=2, gpu="A10G", ...),
assets=chains.Assets(secret_keys=["hf_access_token"], ...),
)
```
**Parameters:**
| Name | Type | Default |
| -------------- | -------------------------------------------------------- | -------------------------------- |
| `docker_image` | *[DockerImage](#class-truss-chains-dockerimage)* | `truss_chains.DockerImage()` |
| `compute` | *[Compute](#class-truss-chains-compute)* | `truss_chains.Compute()` |
| `assets` | *[Assets](#class-truss-chains-assets)* | `truss_chains.Assets()` |
| `name` | *str\|None* | `None` |
| `options` | *[ChainletOptions](#class-truss-chains-chainletoptions)* | `truss_chains.ChainletOptions()` |
### *class* `truss_chains.DockerImage`
Bases: `pydantic.BaseModel`
Configures the docker image in which a remote chainlet is deployed.
Any paths are relative to the source file where `DockerImage` is
defined and must be created with the helper function
[`make_abs_path_here`](#function-truss-chains-make-abs-path-here).
This allows you, for example, to organize chainlets in different (potentially nested)
modules and keep their requirements files right next to their python source files.
**Parameters:**
| Name | Type | Default | Description |
| ------------------------------- | -------------------------------------------------------------------------------------------------- | ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `base_image` | *[BasetenImage](#class-truss-chains-basetenimage)\|[CustomImage](#class-truss-chains-customimage)* | `truss_chains.BasetenImage()` | The base image used by the chainlet. Other dependencies and assets are included as additional layers on top of that image. You can choose a Baseten default image for a supported python version (e.g. `BasetenImage.PY311`), which also includes GPU drivers if needed, or provide a custom image (e.g. `CustomImage(image="python:3.11-slim")`). |
| `pip_requirements_file` | *AbsPath\|None* | `None` | **Deprecated.** Use `requirements_file` instead. Path to a file containing pip requirements. The file content is naively concatenated with `pip_requirements`. |
| `pip_requirements` | *list\[str]* | `[]` | A list of pip requirements to install. Only supported with pip-style requirements files. Cannot be used with `pyproject.toml` or `uv.lock` requirements files. |
| `apt_requirements` | *list\[str]* | `[]` | A list of apt requirements to install. |
| `requirements_file` | *AbsPath\|None* | `None` | Path to a requirements file. Supports `requirements.txt` (pip format), `pyproject.toml`, and `uv.lock`. The file type is auto-detected from the filename. For pip-style files, the content is concatenated with `pip_requirements`. For `pyproject.toml` and `uv.lock`, the file is used as-is for installing dependencies. |
| `data_dir` | *AbsPath\|None* | `None` | Data from this directory is copied into the docker image and accessible to the remote chainlet at runtime. |
| `external_package_dirs` | *list\[AbsPath]\|None* | `None` | A list of directories containing additional python packages outside the chain’s workspace dir, e.g. a shared library. This code is copied into the docker image and importable at runtime. |
| `truss_server_version_override` | *str\|None* | `None` | By default, deployed Chainlets use the truss server implementation corresponding to the truss version of the user’s CLI. To use a specific version, e.g. pinning it for exact reproducibility, the version can be overridden here. Valid versions correspond to truss releases on PyPi: [https://pypi.org/project/truss/#history](https://pypi.org/project/truss/#history), e.g. “0.9.80”. |
### *class* `truss_chains.BasetenImage`
Bases: `Enum`
Default images, curated by baseten, for different python versions. If a Chainlet
uses GPUs, drivers will be included in the image.
| Enum Member | Value |
| ----------- | ------- |
| `PY39` | *py39* |
| `PY310` | *py310* |
| `PY311` | *py311* |
| `PY312` | *py312* |
| `PY313` | *py313* |
| `PY314` | *py314* |
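The enum members are passed as the `base_image` of a `DockerImage`; for example (requirements shown are illustrative):

```python theme={"system"}
import truss_chains as chains

docker_image = chains.DockerImage(
    base_image=chains.BasetenImage.PY311,
    pip_requirements=["torch==2.0.1"],
)
```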
### *class* `truss_chains.CustomImage`
Bases: `pydantic.BaseModel`
Configures the usage of a custom image hosted on Docker Hub.
**Parameters:**
| Name | Type | Default | Description |
| ------------------------ | -------------------------- | ------- | ------------------------------------------------------------------------------------------------------ |
| `image` | *str* | | Reference to an image on Docker Hub. |
| `python_executable_path` | *str\|None* | `None` | Absolute path to python executable (if default `python` is ambiguous). |
| `docker_auth` | *DockerAuthSettings\|None* | `None` | See [corresponding truss config](/development/model/base-images#example%3A-docker-hub-authentication). |
### *class* `truss_chains.Compute`
Specifies which compute resources a chainlet has in the *remote* deployment.
Not all combinations can be exactly satisfied by available hardware; in some cases,
a more powerful machine type is chosen to make sure requirements are met, which may
over-provision resources. Refer to the
[baseten instance reference](https://docs.baseten.co/deployment/resources).
**Parameters:**
| Name | Type | Default | Description |
| --------------------- | ----------------------------- | ------- | --------------------------------------------------------------------------------------------------------------- |
| `cpu_count` | *int* | `1` | Minimum number of CPUs to allocate. |
| `memory` | *str* | `'2Gi'` | Minimum memory to allocate, e.g. “2Gi” (2 gibibytes). |
| `gpu` | *str\|Accelerator\|None* | `None` | GPU accelerator type, e.g. “A10G”, “A100”, refer to the [truss config](/deployment/resources) for more choices. |
| `gpu_count` | *int* | `1` | Number of GPUs to allocate. |
| `predict_concurrency` | *int\|Literal\['cpu\_count']* | `1` | Number of concurrent requests a single replica of a deployed chainlet handles. |
Concurrency concepts are explained in the [autoscaling guide](/deployment/autoscaling/overview#scaling-triggers).
It is important to understand the difference between predict\_concurrency and
the concurrency target (used for autoscaling, i.e. adding or removing replicas).
Furthermore, the `predict_concurrency` of a single instance is implemented in
two ways:
* Via python’s `asyncio`, if `run_remote` is an async def. This
requires that `run_remote` yields to the event loop.
* With a threadpool if it’s a synchronous function. This requires
that the threads don’t have significant CPU load (due to the GIL).
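The fields above combine into a single `Compute` spec; for example, a GPU chainlet serving several requests concurrently per replica (all values are illustrative):

```python theme={"system"}
import truss_chains as chains

compute = chains.Compute(
    cpu_count=4,
    memory="16Gi",
    gpu="A10G",
    gpu_count=1,
    # Effective only if `run_remote` is `async def` (asyncio) or, for a
    # synchronous function, if the threadpool threads are not CPU-bound.
    predict_concurrency=8,
)
```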
### *class* `truss_chains.Assets`
Specifies which assets a chainlet can access in the remote deployment.
For example, model weight caching can be used like this:
```python theme={"system"}
import truss_chains as chains
from truss.base import truss_config
mistral_cache = truss_config.ModelRepo(
repo_id="mistralai/Mistral-7B-Instruct-v0.2",
allow_patterns=["*.json", "*.safetensors", ".model"]
)
chains.Assets(cached=[mistral_cache], ...)
```
**Parameters:**
| Name | Type | Default | Description |
| --------------- | ----------------------------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `cached` | *Iterable\[ModelRepo]* | `()` | One or more `truss_config.ModelRepo` objects. |
| `secret_keys` | *Iterable\[str]* | `()` | Names of secrets stored on baseten, that the chainlet should have access to. You can manage secrets on baseten [here](https://app.baseten.co/settings/secrets). |
| `external_data` | *Iterable\[ExternalDataItem]* | `()` | Data to be downloaded from public URLs and made available in the deployment (via `context.data_dir`). |
# Core
General framework and helper functions.
### *function* `truss_chains.push`
Deploys a chain remotely (with all dependent chainlets).
**Parameters:**
| Name | Type | Default | Description |
| ----------------------- | -------------------------------- | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `entrypoint` | *Type\[ChainletT]* | | The chainlet class that serves as the entrypoint to the chain. |
| `chain_name` | *str* | | The name of the chain. |
| `publish` | *bool* | `True` | Whether to publish the chain as a published deployment (otherwise it is a draft deployment). |
| `promote` | *bool* | `True` | Whether to promote the chain to be the production deployment (this implies publishing as well). |
| `only_generate_trusses` | *bool* | `False` | Used for debugging purposes. If set to True, only the underlying truss models for the chainlets are generated in `/tmp/.chains_generated`. |
| `remote` | *str* | `'baseten'` | Name of a remote config in `.trussrc`. If not provided, you will be prompted to choose one. |
| `environment` | *str\|None* | `None` | The name of an environment to promote deployment into. |
| `progress_bar` | *Type\[progress.Progress]\|None* | `None` | Optional rich.progress.Progress if output is desired. |
| `include_git_info` | *bool* | `False` | Whether to attach git versioning info (sha, branch, tag) to deployments made from within a git repo. If set to True in .trussrc, it will always be attached. |
* **Returns:**
[*ChainService*](#class-truss-chains-remote-chainservice): A chain service
handle to the deployed chain.
### *class* `truss_chains.deployment.deployment_client.ChainService`
Handle for a deployed chain.
A `ChainService` is created and returned when using `push`. It
bundles the individual services for each chainlet in the chain, and provides
utilities to query their status, invoke the entrypoint etc.
#### *method* get\_info()
Queries the statuses of all chainlets in the chain.
* **Returns:**
List of `DeployedChainlet`, `(name, is_entrypoint, status, logs_url)`
for each chainlet.
#### *property* name *: str*
#### *method* run\_remote(json)
Invokes the entrypoint with JSON data.
**Parameters:**
| Name | Type | Description |
| ------ | ----------- | ---------------------------- |
| `json` | *JSON dict* | Input data to the entrypoint |
* **Returns:**
The JSON response.
#### *property* run\_remote\_url *: str*
URL to invoke the entrypoint.
#### *property* status\_page\_url *: str*
Link to status page on Baseten.
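Putting `push` and `ChainService` together, a deployment script might look like this (a sketch; the module name `my_chain`, the chain name, and the input payload are examples):

```python theme={"system"}
import truss_chains as chains

# Assumes `MyChainlet` is the entrypoint chainlet defined elsewhere.
from my_chain import MyChainlet

service = chains.push(MyChainlet, chain_name="my-chain")
print(service.status_page_url)  # Link to the status page on Baseten.

# Invoke the entrypoint with JSON data and print the JSON response.
response = service.run_remote({"max_value": 5})
print(response)
```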
### *function* `truss_chains.make_abs_path_here`
Helper to specify file paths relative to the *immediately calling* module.
E.g. if you have a project structure like this:
```default theme={"system"}
root/
  chain.py
  common_requirements.txt
  sub_package/
    chainlet.py
    chainlet_requirements.txt
```
In `root/sub_package/chainlet.py`, you can then point to the requirements
files like this:
```python theme={"system"}
shared = make_abs_path_here("../common_requirements.txt")
specific = make_abs_path_here("chainlet_requirements.txt")
```
This helper uses the directory of the immediately calling module as an
absolute reference point for resolving the file location. Therefore,
you MUST NOT wrap the instantiation of `make_abs_path_here` into a
function (e.g. applying decorators) or use dynamic code execution.
Ok:
```python theme={"system"}
def foo(path: AbsPath):
abs_path = path.abs_path
foo(make_abs_path_here("./somewhere"))
```
Not Ok:
```python theme={"system"}
def foo(path: str):
dangerous_value = make_abs_path_here(path).abs_path
foo("./somewhere")
```
**Parameters:**
| Name | Type | Description |
| ----------- | ----- | -------------------------- |
| `file_path` | *str* | Absolute or relative path. |
* **Returns:**
*AbsPath*
### *function* `truss_chains.run_local`
Context manager for local debug execution of a chain.
The arguments only need to be provided if the chainlets explicitly access any of the
corresponding fields of [`DeploymentContext`](#class-truss-chains-deploymentcontext).
**Parameters:**
| Name | Type | Default | Description |
| --------------------- | ------------------------------------------------------------------------------------------ | ------- | -------------------------------------------------------------- |
| `secrets` | *Mapping\[str,str]\|None* | `None` | A dict of secrets keys and values to provide to the chainlets. |
| `data_dir` | *Path\|str\|None* | `None` | Path to a directory with data files. |
| `chainlet_to_service` | *Mapping\[str,[DeployedServiceDescriptor](#class-truss-chains-deployedservicedescriptor)]* | `None` | A dict of chainlet names to service descriptors. |
Example usage (as trailing main section in a chain file):
```python theme={"system"}
import os
import truss_chains as chains
class HelloWorld(chains.ChainletBase):
...
if __name__ == "__main__":
with chains.run_local(
secrets={"some_token": os.environ["SOME_TOKEN"]},
chainlet_to_service={
"SomeChainlet": chains.DeployedServiceDescriptor(
name="SomeChainlet",
display_name="SomeChainlet",
predict_url="https://...",
options=chains.RPCOptions(),
)
},
):
hello_world_chain = HelloWorld()
result = hello_world_chain.run_remote(max_value=5)
print(result)
```
Refer to the [local debugging guide](/development/chain/localdev)
for more details.
### *class* `truss_chains.DeployedServiceDescriptor`
Bases: `pydantic.BaseModel`
Bundles values to establish an RPC session to a dependency chainlet,
specifically with `StubBase`.
**Parameters:**
| Name | Type | Default |
| -------------- | ---------------------------------------------- | ------- |
| `name` | *str* | |
| `display_name` | *str* | |
| `options` | *[RPCOptions](#class-truss-chains-rpcoptions)* | |
| `predict_url` | *str\|None* | `None` |
| `internal_url` | *InternalURL\|None* | `None` |
### *class* `truss_chains.StubBase`
Bases: `BasetenSession`, `ABC`
Base class for stubs that invoke remote chainlets.
Extends `BasetenSession` with methods for data serialization, de-serialization
and invoking other endpoints.
It is used internally for RPCs to dependency chainlets, but it can also be used
in user-code for wrapping a deployed truss model into the Chains framework. It
flexibly supports JSON and pydantic inputs and outputs. Example usage:
```python theme={"system"}
import pydantic
import truss_chains as chains
class WhisperOutput(pydantic.BaseModel):
...
class DeployedWhisper(chains.StubBase):
# Input JSON, output JSON.
async def run_remote(self, audio_b64: str) -> Any:
return await self.predict_async(
inputs={"audio": audio_b64})
# resp == {"text": ..., "language": ...}
# OR Input JSON, output pydantic model.
async def run_remote(self, audio_b64: str) -> WhisperOutput:
return await self.predict_async(
inputs={"audio": audio_b64}, output_model=WhisperOutput)
# OR Input and output are pydantic models.
async def run_remote(self, data: WhisperInput) -> WhisperOutput:
return await self.predict_async(data, output_model=WhisperOutput)
class MyChainlet(chains.ChainletBase):
def __init__(self, ..., context=chains.depends_context()):
...
self._whisper = DeployedWhisper.from_url(
WHISPER_URL,
context,
options=chains.RPCOptions(retries=3),
)
async def run_remote(self, ...):
await self._whisper.run_remote(...)
```
**Parameters:**
| Name | Type | Description |
| -------------------- | ----------------------------------------------------------------------------- | ----------------------------------------- |
| `service_descriptor` | *[DeployedServiceDescriptor](#class-truss-chains-deployedservicedescriptor)* | Contains the URL and other configuration. |
| `api_key` | *str* | A baseten API key to authorize requests. |
#### *classmethod* from\_url(predict\_url, context\_or\_api\_key, options=None)
Factory method, convenient to use in a chainlet’s `__init__` method.
**Parameters:**
| Name | Type | Description |
| -------------------- | ------------------------------------------------------------ | ------------------------------------------------------------------------------------ |
| `predict_url` | *str* | URL to predict endpoint of another chain / truss model. |
| `context_or_api_key` | *[DeploymentContext](#class-truss-chains-deploymentcontext)\|str* | Deployment context object (obtained in the chainlet’s `__init__`) or a Baseten API key. |
| `options` | *[RPCOptions](#class-truss-chains-rpcoptions)* | RPC options, e.g. retries. |
#### Invocation Methods
* `async predict_async(inputs: PydanticModel, output_model: Type[PydanticModel]) → PydanticModel`
* `async predict_async(inputs: JSON, output_model: Type[PydanticModel]) → PydanticModel`
* `async predict_async(inputs: JSON) → JSON`
* `async predict_async_stream(inputs: PydanticModel | JSON) → AsyncIterator[bytes]`
Deprecated synchronous methods:
* `predict_sync(inputs: PydanticModel, output_model: Type[PydanticModel]) → PydanticModel`
* `predict_sync(inputs: JSON, output_model: Type[PydanticModel]) → PydanticModel`
* `predict_sync(inputs: JSON) → JSON`
### *class* `truss_chains.RemoteErrorDetail`
Bases: `pydantic.BaseModel`
When a remote chainlet raises an exception, this pydantic model contains
information about the error and stack trace and is included in JSON form in the
error response.
**Parameters:**
| Name | Type |
| ----------------------- | ------------------- |
| `exception_cls_name` | *str* |
| `exception_module_name` | *str\|None* |
| `exception_message` | *str* |
| `user_stack_trace` | *list\[StackFrame]* |
#### *method* format()
Format the error for printing, similar to how Python formats exceptions
with stack traces.
* **Returns:**
str
### *class* `truss_chains.GenericRemoteException`
Bases: `Exception`
Raised when calling a remote chainlet results in an error and it is not possible
to re-raise the same exception that was raised remotely in the caller.
# Training SDK
Source: https://docs.baseten.co/reference/sdk/training
API reference for the Baseten training SDK.
## Installation
Truss includes the training SDK:
[uv](https://docs.astral.sh/uv/) is a fast Python package manager. Create a virtual environment and install Truss:
```sh theme={"system"}
uv venv && source .venv/bin/activate
uv pip install truss
```
Create a virtual environment and install Truss with pip:
```sh theme={"system"}
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade truss
```
On Windows, create a virtual environment and install Truss with pip:
```sh theme={"system"}
python3 -m venv .venv && .venv\Scripts\activate
pip install --upgrade truss
```
Define your training job in a configuration file (typically `config.py`). Import the SDK and accelerator config:
```python config.py theme={"system"}
from truss_train import definitions
from truss.base import truss_config
```
You can also import classes directly from `truss_train` (for example, `from truss_train import Compute, Runtime`).
***
## Complete example
Copy this `config.py` as a starting point for your training project. It configures [caching](/training/concepts/cache) to persist pip packages between jobs, [checkpointing](/training/concepts/checkpointing) to save model weights, and GPU compute on a single H200 node. Modify the `start_commands`, `environment_variables`, and `accelerator` fields for your use case. For more examples, see [ml-cookbook](https://github.com/basetenlabs/ml-cookbook/tree/main/examples).
```python config.py theme={"system"}
from truss_train import definitions
from truss.base import truss_config
# The Docker image your training code runs in.
BASE_IMAGE = "pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime"
# Runtime controls what happens when the container starts: which commands
# run, which secrets are injected, and whether caching and checkpointing
# are enabled.
training_runtime = definitions.Runtime(
start_commands=[
"pip install transformers datasets accelerate",
"torchrun --nproc-per-node=2 train.py",
],
environment_variables={
"HF_TOKEN": definitions.SecretReference(name="hf_access_token"),
"WANDB_API_KEY": definitions.SecretReference(name="wandb_api_key"),
},
# Cache persists pip packages and downloaded models between jobs.
cache_config=definitions.CacheConfig(enabled=True),
# Checkpointing writes model weights to $BT_CHECKPOINT_DIR for
# deployment or resuming later.
checkpointing_config=definitions.CheckpointingConfig(enabled=True),
)
# Compute defines the hardware allocated to each node.
training_compute = definitions.Compute(
node_count=1,
accelerator=truss_config.AcceleratorSpec(
accelerator=truss_config.Accelerator.H200,
count=2,
),
)
# TrainingJob combines the image, compute, and runtime into a single
# unit that Baseten provisions and runs.
training_job = definitions.TrainingJob(
image=definitions.Image(base_image=BASE_IMAGE),
compute=training_compute,
runtime=training_runtime,
)
# TrainingProject groups related jobs under one name. Pushing this
# config creates the project (or reuses it) and submits a new job.
training_project = definitions.TrainingProject(
name="llm-fine-tuning",
job=training_job,
)
```
***
## push
Submits a training job to Baseten. Every config you define with the classes below does nothing until you call `push()`.
When you call `push()`, Baseten:
1. Authenticates with your Baseten account.
2. Creates the [training project](/training/overview) if one with the given name doesn't already exist, or reuses the existing project.
3. Archives your source directory (your training script, data files, and any other local files) and uploads it.
4. Submits a new training job. Baseten provisions the hardware, pulls the container image, mounts any [BDN weights](#weightssource), extracts your source files into the container, and runs your [start\_commands](#runtime).
The job then progresses through the [training lifecycle](/training/lifecycle):
* `CREATED`: Baseten has received the training configuration.
* `DEPLOYING`: Baseten is provisioning compute resources and installing dependencies.
* `RUNNING`: Your training code is actively executing.
* `COMPLETED`: The job has finished. Checkpoints and artifacts have been saved.
* `DEPLOY_FAILED`: The job failed to deploy, likely due to a bad image or resource allocation issue.
* `FAILED`: The job encountered an error. Check the logs for details.
* `STOPPED`: The job was manually stopped.
The CLI command `truss train push config.py` performs the same steps with additional options for team selection and flag overrides.
The `push` function accepts either a file path or a `TrainingProject` object.
```python theme={"system"}
from truss_train import push
# Pass a config file path:
def push(
config: Path,
*,
remote: str = "baseten",
) -> dict
# Pass a TrainingProject object:
def push(
config: TrainingProject,
*,
remote: str = "baseten",
source_dir: Optional[Path] = None,
) -> dict
```
### Parameters
* `config`: Path to a `config.py` file or a [TrainingProject](#trainingproject) instance. When you pass a `Path`, Baseten imports the module and scans for an instance of `TrainingProject`. The module must contain exactly one.
* `remote`: Remote provider to push to. Defaults to `baseten`.
* `source_dir`: Root directory whose contents Baseten uploads as the job's working directory. Baseten archives this directory and extracts it into the container before running [start\_commands](#runtime). Only applies when `config` is a `TrainingProject`. Defaults to the current directory.
### Return value
Returns a dictionary containing the created training job. Use the `id` and `training_project.id` values to monitor the job, stream logs, and list checkpoints.
```json Output theme={"system"}
{
"id": "gvpql31",
"training_project_id": "aghi527",
"training_project": {
"id": "aghi527",
"name": "llm-fine-tuning"
},
"current_status": "TRAINING_JOB_CREATED",
"instance_type": { ... },
"name": "fine-tune-v1",
...
}
```
For example, to submit a training job programmatically, pass a `TrainingProject` object to `push()`:
```python submit_job.py theme={"system"}
from pathlib import Path
from truss.base import truss_config
from truss_train import push, definitions
project = definitions.TrainingProject(
name="llm-fine-tuning",
job=definitions.TrainingJob(
image=definitions.Image(base_image="pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime"),
compute=definitions.Compute(
accelerator=truss_config.AcceleratorSpec(
accelerator=truss_config.Accelerator.H200,
count=2,
)
),
runtime=definitions.Runtime(
start_commands=["python train.py"],
environment_variables={
"HF_TOKEN": definitions.SecretReference(name="hf_access_token"),
},
),
),
)
result = push(config=project, source_dir=Path("./training"))
print(f"Project ID: {result['training_project']['id']}")
print(f"Job ID: {result['id']}")
```
```text Output theme={"system"}
Project ID: aghi527
Job ID: gvpql31
```
### After submitting
Once `push()` returns, Baseten queues your job and begins provisioning. Use the returned job ID to track progress:
* **Stream logs:** `truss train logs --job-id <job-id> --tail`
* **Check status:** `truss train view --job-id <job-id>`
* **List checkpoints:** Use the [get training job checkpoints](/reference/training-api/get-training-job-checkpoints) API.
* **Deploy a checkpoint:** For more information, see [deploy checkpoints](#deploy-checkpoints).
For a complete working example, see the [programmatic training API recipe](https://github.com/basetenlabs/ml-cookbook/tree/main/recipes/programmatic-training-api). For `config.py`-based submission with the CLI, see the [training getting started guide](/training/getting-started).
***
## TrainingProject
Groups related training jobs under a single named project. When you [push](#push) a `TrainingProject`, Baseten creates the project if it doesn't exist, then submits the attached [TrainingJob](#trainingjob). All jobs in a project share the same [project-level cache](/training/concepts/cache) and appear together in the dashboard.
```python config.py theme={"system"}
from truss_train import definitions
project = definitions.TrainingProject(
name="llm-fine-tuning",
job=training_job,
team_name="my-team",
)
```
### Parameters
* `name`: Project name. Reusing a name adds jobs to the existing project.
* `job`: Training job to submit. Defines the container image, compute resources, runtime commands, and optional weights. For more information, see [TrainingJob](#trainingjob).
* `team_name`: Team that owns this project. Controls access and team-level cache scope.
## TrainingJob
Represents a single training run. Baseten provisions the hardware specified in [Compute](#compute), pulls the container [Image](#image), uploads your source directory, mounts any [WeightsSource](#weightssource) volumes, then executes the [Runtime](#runtime) start commands. For more information, see the [training lifecycle](/training/lifecycle).
```python config.py theme={"system"}
from truss_train import definitions, WeightsSource
from truss.base import truss_config
training_job = definitions.TrainingJob(
name="fine-tune-v1",
image=definitions.Image(base_image="pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime"),
compute=definitions.Compute(
accelerator=truss_config.AcceleratorSpec(
accelerator=truss_config.Accelerator.H200,
count=4,
)
),
runtime=definitions.Runtime(
start_commands=["chmod +x ./run.sh && ./run.sh"],
checkpointing_config=definitions.CheckpointingConfig(enabled=True),
cache_config=definitions.CacheConfig(enabled=True),
),
weights=[
WeightsSource(
source="hf://meta-llama/Llama-3.1-8B@main",
mount_location="/app/models/llama",
),
],
)
```
### Parameters
Docker image that provides the training environment, including the OS, CUDA drivers, and pre-installed libraries. For more information, see [Image](#image).
Hardware allocation for each node. Set the GPU type and count via `accelerator`, and increase `node_count` for distributed training. Defaults to `Compute()`. For more information, see [Compute](#compute).
Controls container startup: shell commands to execute, environment variables to inject, and whether to enable caching or checkpointing. Defaults to `Runtime()`. For more information, see [Runtime](#runtime).
Display name for this job in the dashboard and API responses.
Opens an rSSH tunnel so you can attach VS Code or Cursor to the running container for live debugging. For more information, see [InteractiveSession](#interactivesession).
Controls which local files Baseten uploads to the container. Use this to exclude large directories, include files from outside the root, or change the root entirely. For more information, see [Workspace](#workspace).
Model weights that BDN mirrors and mounts read-only in the container. Supports Hugging Face, S3, GCS, Azure, R2, and direct URLs. For more information, see [WeightsSource](#weightssource).
## WeightsSource
Mounts pre-trained model weights into the training container as a read-only volume. Baseten mirrors the weights through [BDN](/development/model/bdn) before provisioning compute, so the data is ready when your container starts. On subsequent jobs with the same source, BDN serves the cached copy, which avoids re-downloading.
```python config.py theme={"system"}
from truss_train import WeightsSource
WeightsSource(
source="hf://Qwen/Qwen3-0.6B",
mount_location="/app/models/Qwen/Qwen3-0.6B",
)
```
```python config.py theme={"system"}
from truss_train import WeightsSource
WeightsSource(
source="s3://my-bucket/training-data",
mount_location="/app/data/training-data",
auth={"auth_method": "CUSTOM_SECRET", "auth_secret_name": "aws_credentials"},
)
```
```python config.py theme={"system"}
from truss_train import WeightsSource
WeightsSource(
source="hf://meta-llama/Llama-3.1-8B@main",
mount_location="/app/models/llama",
allow_patterns=["*.safetensors", "config.json", "tokenizer.*"],
ignore_patterns=["*.md", "*.txt"],
)
```
### Parameters
URI with scheme prefix.
| Scheme | Example | Description |
| ------- | ----------------------------------- | --------------------- |
| `hf://` | `hf://meta-llama/Llama-3.1-8B@main` | Hugging Face Hub. |
| `s3://` | `s3://my-bucket/path/to/data` | Amazon S3. |
| `gs://` | `gs://my-bucket/path/to/data` | Google Cloud Storage. |
| `r2://` | `r2://account_id.bucket/path` | Cloudflare R2. |
For Hugging Face sources, pin to a specific revision with the `@revision` suffix (branch, tag, or commit SHA).
Absolute path where Baseten mounts the weights in the container.
Authentication configuration. See the [BDN configuration reference](/development/model/bdn#configuration-reference).
Baseten secret name for credentials.
File patterns to include during download.
File patterns to exclude during download.
## Image
Sets the Docker image that Baseten pulls to create the training container. The image provides the OS, CUDA drivers, Python version, and any pre-installed libraries your training code needs. Use a public image from Docker Hub or a private image with [DockerAuth](#dockerauth).
```python config.py theme={"system"}
image = definitions.Image(
base_image="pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime"
)
```
### Parameters
Full Docker image tag, such as `"pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime"`.
Credentials for pulling from private registries like AWS ECR or Google Container Registry. Store actual credentials as [Baseten secrets](/organization/secrets). For more information, see [DockerAuth](#dockerauth).
### DockerAuth
Provides credentials for pulling images from private Docker registries (AWS ECR, Google Container Registry, etc.). Store the actual credential values as secrets in your [Baseten workspace](/organization/secrets) and reference them with [SecretReference](#secretreference).
Authentication method.
Docker registry URL.
IAM credentials for authenticating with AWS ECR. Requires `access_key_secret_ref` and `secret_access_key_secret_ref`. For more information, see [AWSIAMDockerAuth](#awsiamdockerauth).
Service account JSON credentials for authenticating with Google Container Registry. For more information, see [GCPServiceAccountJSONDockerAuth](#gcpserviceaccountjsondockerauth).
Username/password credentials for authenticating with registries that support static credentials (Docker Hub, GHCR, NGC). Not compatible with AWS ECR or GCP Artifact Registry. For more information, see [RegistrySecretDockerAuth](#registrysecretdockerauth).
#### AWSIAMDockerAuth
Authenticates with AWS ECR using IAM credentials.
```python config.py theme={"system"}
from truss.base import truss_config
image = definitions.Image(
base_image="123456789.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
docker_auth=definitions.DockerAuth(
auth_method=truss_config.DockerAuthType.AWS_IAM,
registry="123456789.dkr.ecr.us-east-1.amazonaws.com",
aws_iam_docker_auth=definitions.AWSIAMDockerAuth(
access_key_secret_ref=definitions.SecretReference(name="aws_access_key"),
secret_access_key_secret_ref=definitions.SecretReference(name="aws_secret_access_key"),
)
)
)
```
AWS access key ID, stored as a [Baseten secret](/organization/secrets) and referenced by name.
AWS secret access key, stored as a [Baseten secret](/organization/secrets) and referenced by name.
#### GCPServiceAccountJSONDockerAuth
Authenticates with Google Container Registry using service account JSON.
```python config.py theme={"system"}
from truss.base import truss_config
image = definitions.Image(
base_image="gcr.io/my-project/my-image:latest",
docker_auth=definitions.DockerAuth(
auth_method=truss_config.DockerAuthType.GCP_SERVICE_ACCOUNT_JSON,
registry="gcr.io",
gcp_service_account_json_docker_auth=definitions.GCPServiceAccountJSONDockerAuth(
service_account_json_secret_ref=definitions.SecretReference(name="gcp_service_account_json"),
)
)
)
```
GCP service account JSON, stored as a [Baseten secret](/organization/secrets) and referenced by name.
#### RegistrySecretDockerAuth
Authenticates with registries that support static username/password credentials, including Docker Hub, GHCR, and NGC. For AWS ECR or GCP Artifact Registry, use [AWSIAMDockerAuth](#awsiamdockerauth) or [GCPServiceAccountJSONDockerAuth](#gcpserviceaccountjsondockerauth) instead.
```python config.py theme={"system"}
from truss.base import truss_config
image = definitions.Image(
base_image="your-registry/your-image:latest",
docker_auth=definitions.DockerAuth(
auth_method=truss_config.DockerAuthType.REGISTRY_SECRET,
registry="docker.io",
registry_secret_docker_auth=definitions.RegistrySecretDockerAuth(
secret_ref=definitions.SecretReference(name="my_docker_cred")
)
)
)
```
Registry credentials in `username:password` format (plaintext, not Base64-encoded), stored as a [Baseten secret](/organization/secrets) and referenced by name.
## Compute
Defines the hardware Baseten allocates for each training job. Set `node_count` above 1 for [multi-node distributed training](/training/concepts/multinode), which provisions multiple identical nodes and injects coordination environment variables (`BT_LEADER_ADDR`, `BT_NODE_RANK`, `BT_GROUP_SIZE`).
```python config.py theme={"system"}
from truss.base import truss_config
compute = definitions.Compute(
node_count=2,
cpu_count=8,
memory="64Gi",
accelerator=truss_config.AcceleratorSpec(
accelerator=truss_config.Accelerator.H200,
count=4,
)
)
```
### Parameters
Number of nodes to provision. Each node gets the full CPU, memory, and GPU allocation.
CPU cores per node.
RAM per node (for example, `"64Gi"`). Defaults to `2Gi`.
GPU type and count per node. For more information, see [AcceleratorSpec](#acceleratorspec).
### AcceleratorSpec
Selects the GPU type and count per node. The `count` determines how many GPUs are available to your training script on each node (exposed as `$BT_NUM_GPUS`).
GPU type.
Available options:
* `A10G`: NVIDIA A10G.
* `H200`: NVIDIA H200.
Number of GPUs per node.
## Runtime
Controls what happens when the training container starts. Baseten executes `start_commands` in order inside the container. Use them to install dependencies, set up data, and launch your training script. Baseten injects environment variables before the first command runs; use [SecretReference](#secretreference) for sensitive values like API keys so they aren't stored in your config file.
```python config.py theme={"system"}
runtime = definitions.Runtime(
start_commands=["chmod +x ./run.sh && ./run.sh"],
environment_variables={
"BATCH_SIZE": "32",
"WANDB_API_KEY": definitions.SecretReference(name="wandb_api_key"),
"HF_TOKEN": definitions.SecretReference(name="hf_access_token"),
},
checkpointing_config=definitions.CheckpointingConfig(enabled=True),
cache_config=definitions.CacheConfig(enabled=True),
)
```
### Parameters
Shell commands that Baseten executes sequentially when the container starts.
Key-value pairs that Baseten injects as env vars. Use [SecretReference](#secretreference) for sensitive values.
Enables writing model checkpoints to persistent storage. When enabled, Baseten mounts a volume and exports `$BT_CHECKPOINT_DIR`. Defaults to `CheckpointingConfig()`. For more information, see [CheckpointingConfig](#checkpointingconfig).
Enables a persistent read-write cache that survives across jobs for pip packages, model downloads, and preprocessed datasets. For more information, see [CacheConfig](#cacheconfig).
Downloads checkpoints from a previous job into the container before `start_commands` run. Use this to resume training or initialize weights from an earlier experiment. For more information, see [LoadCheckpointConfig](#loadcheckpointconfig).
Use `cache_config` with `enabled=True` instead.
### SecretReference
Injects a secret stored in your [Baseten workspace](/organization/secrets) as an environment variable at runtime. Baseten never writes the value to your config file or source code. Use this for API keys, tokens, and credentials.
```python config.py theme={"system"}
secret_ref = definitions.SecretReference(name="wandb_api_key")
```
Name of the secret as it appears in your workspace settings.
### CheckpointingConfig
Enables persistent checkpoint storage for the training job. When `enabled` is true, Baseten mounts a persistent volume and exports `$BT_CHECKPOINT_DIR` as an environment variable pointing to it. Your training script writes model weights, optimizer state, or any artifacts to that directory. These checkpoints survive job termination and can be [deployed to inference](/training/deployment) or [loaded into future jobs](#loadcheckpointconfig). See the [checkpointing guide](/training/concepts/checkpointing) for best practices.
```python config.py theme={"system"}
checkpointing = definitions.CheckpointingConfig(
enabled=True,
volume_size_gib=500,
)
```
Set to `true` to mount a persistent checkpoint volume.
Override the default checkpoint directory path.
Size of the checkpoint volume in GiB. Defaults to a platform-managed size.
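A training script uses the exported directory like this. A minimal sketch, assuming your framework has its own save call (the fallback path and the `state.json` payload are illustrative only):

```python
import json
import os

# BT_CHECKPOINT_DIR is set by Baseten when checkpointing is enabled;
# the fallback path is only for running this sketch locally.
checkpoint_dir = os.environ.get("BT_CHECKPOINT_DIR", "/tmp/checkpoints")
step_dir = os.path.join(checkpoint_dir, "checkpoint-100")
os.makedirs(step_dir, exist_ok=True)

# Replace this with your framework's save call,
# e.g. trainer.save_model(step_dir). Here we just record training state.
with open(os.path.join(step_dir, "state.json"), "w") as f:
    json.dump({"step": 100}, f)
```

Anything written under `$BT_CHECKPOINT_DIR` survives job termination and shows up as a named checkpoint in the dashboard.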
### CacheConfig
Enables a persistent read-write cache that survives across jobs. Use the cache for pip packages, downloaded model weights, preprocessed datasets, or any data you don't want to re-download on every run. When `enabled` is true, Baseten mounts two shared directories into the container. When `require_cache_affinity` is true (the default), Baseten schedules the job on a node that already has cached data, which avoids cold starts. See the [cache guide](/training/concepts/cache) for usage patterns.
```python config.py theme={"system"}
cache = definitions.CacheConfig(
enabled=True,
require_cache_affinity=True,
)
```
When enabled, Baseten exports two cache directories as environment variables.
| Environment variable | Description |
| ----------------------- | -------------------------------------------------------------------------------------------------------------------------------- |
| `$BT_PROJECT_CACHE_DIR` | Shared across all jobs in the same [TrainingProject](#trainingproject). Use for project-specific datasets or compiled artifacts. |
| `$BT_TEAM_CACHE_DIR` | Shared across all jobs in the same team. Use for common model weights or shared libraries. |
Set to `true` to mount persistent cache volumes.
Mount the Hugging Face cache at the legacy path for backward compatibility.
Schedule the job on a node with existing cached data when possible.
Base path where Baseten mounts cache directories. Defaults to `/root/.cache`.
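One common pattern is redirecting library caches into the mounted directories so downloads persist across jobs. A minimal sketch (`HF_HOME` is Hugging Face's standard cache variable; the fallback path matches the example mount location in the table above):

```python
import os

# Point the Hugging Face cache at the team-level cache so weights
# downloaded by one job are reused by the next.
team_cache = os.environ.get("BT_TEAM_CACHE_DIR", "/root/.cache/team_artifacts")
os.environ["HF_HOME"] = os.path.join(team_cache, "huggingface")
```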
### LoadCheckpointConfig
Downloads checkpoints from previous training jobs into the container before `start_commands` run. Use this to resume training from a saved state or to initialize weights from an earlier experiment. Baseten downloads the specified checkpoints to `download_folder` (also exported as `$BT_LOAD_CHECKPOINT_DIR`) and your training script reads them at startup. For more information, see the [loading checkpoints](/training/loading) walkthrough.
```python config.py theme={"system"}
load_config = definitions.LoadCheckpointConfig(
enabled=True,
download_folder="/tmp/loaded_checkpoints",
checkpoints=[
definitions.BasetenCheckpoint.from_latest_checkpoint(project_name="my-project"),
definitions.BasetenCheckpoint.from_named_checkpoint(
checkpoint_name="checkpoint-24",
job_id="abc123",
)
]
)
```
Set to `true` to download checkpoints before `start_commands` run.
One or more checkpoint references to download. Create references with `BasetenCheckpoint.from_latest_checkpoint()` or `BasetenCheckpoint.from_named_checkpoint()`. For more information, see [BasetenCheckpoint](#basetencheckpoint).
Directory where Baseten downloads checkpoints. Exported as `$BT_LOAD_CHECKPOINT_DIR`. Defaults to `/tmp/loaded_checkpoints`.
### BasetenCheckpoint
Creates references to checkpoints saved by previous training jobs. Pass these references to [LoadCheckpointConfig](#loadcheckpointconfig) to download checkpoint data into your container at job start. You can reference checkpoints by project name (gets the most recent), by job ID (gets the most recent from that job), or by exact checkpoint name and job ID.
```python config.py theme={"system"}
latest = definitions.BasetenCheckpoint.from_latest_checkpoint(
project_name="my-fine-tuning-project"
)
specific = definitions.BasetenCheckpoint.from_named_checkpoint(
checkpoint_name="checkpoint-100",
job_id="abc123",
)
runtime = definitions.Runtime(
start_commands=["python train.py"],
load_checkpoint_config=definitions.LoadCheckpointConfig(
enabled=True,
checkpoints=[latest, specific],
)
)
```
#### from\_latest\_checkpoint
Returns a reference to the most recent checkpoint from a project or job. At least one of `project_name` or `job_id` is required.
```python theme={"system"}
BasetenCheckpoint.from_latest_checkpoint(
project_name: Optional[str] = None,
job_id: Optional[str] = None,
)
```
Project name to get the latest checkpoint from.
Job ID to get the latest checkpoint from.
#### from\_named\_checkpoint
Returns a reference to a specific checkpoint by its name and job ID.
```python theme={"system"}
BasetenCheckpoint.from_named_checkpoint(
checkpoint_name: str,
job_id: str,
)
```
Checkpoint name.
Job ID.
## Workspace
Controls which local files Baseten uploads to the training container. By default, Baseten archives the directory containing your `config.py` (or the `source_dir` you pass to [push](#push)) and extracts it into the container's working directory. Use `Workspace` to customize this behavior: exclude large data directories, include files from outside the root, or change the root entirely.
```python config.py theme={"system"}
training_job = definitions.TrainingJob(
image=definitions.Image(base_image="pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime"),
workspace=definitions.Workspace(
exclude_dirs=["data", ".git"],
),
)
```
### Parameters
Override the root directory to archive. Defaults to the config file's parent directory.
Additional directories outside `workspace_root` to include in the upload.
Directories to exclude from the upload (for example, `"data"`, `".git"`, `"__pycache__"`).
## InteractiveSession
Opens an [rSSH tunnel](/training/interactive-sessions) to the training container so you can attach VS Code or Cursor for live debugging. The tunnel stays active for `timeout_minutes`, then closes automatically. Use `trigger` to control when the session starts: immediately on job start, only when training fails, or on-demand from the dashboard. See the [interactive sessions guide](/training/interactive-sessions) for setup details.
```python config.py theme={"system"}
from truss_train.definitions import (
InteractiveSession,
InteractiveSessionTrigger,
InteractiveSessionProvider,
InteractiveSessionAuthProvider,
)
training_job = definitions.TrainingJob(
image=definitions.Image(base_image="pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime"),
compute=definitions.Compute(
accelerator=truss_config.AcceleratorSpec(accelerator="H200", count=2),
),
runtime=definitions.Runtime(
start_commands=["chmod +x ./run.sh && ./run.sh"],
),
interactive_session=InteractiveSession(
trigger=InteractiveSessionTrigger.ON_FAILURE,
timeout_minutes=-1,
session_provider=InteractiveSessionProvider.VS_CODE,
auth_provider=InteractiveSessionAuthProvider.GITHUB,
),
)
```
### Parameters
Controls when to activate the session. Defaults to `ON_DEMAND`.
Available options:
* `ON_STARTUP`: active from job start.
* `ON_FAILURE`: activates when training exits with a non-zero code.
* `ON_DEMAND`: activates when you change the trigger on a running job.
Minutes before the session expires. Set to `-1` to extend the expiry to 10 years.
IDE for the remote tunnel. Defaults to `VS_CODE`.
Available options:
* `VS_CODE`: VS Code Remote Tunnels.
* `CURSOR`: Cursor Remote Tunnels.
Authentication provider for the device code flow. Defaults to `MICROSOFT`.
Available options:
* `GITHUB`: authenticate via GitHub.
* `MICROSOFT`: authenticate via Microsoft.
***
## Environment variables
Baseten automatically injects these environment variables into every training container. Your training script can read them to discover job metadata, locate checkpoint and cache directories, and coordinate across nodes in [multi-node jobs](/training/concepts/multinode).
### Standard variables
| Variable | Description | Example |
| -------------------------- | -------------------------------- | ------------------------------- |
| `BT_TRAINING_JOB_ID` | Training job ID. | `"gvpql31"` |
| `BT_TRAINING_PROJECT_ID` | Training project ID. | `"aghi527"` |
| `BT_TRAINING_JOB_NAME` | Training job name. | `"gpt-oss-20b-lora"` |
| `BT_TRAINING_PROJECT_NAME` | Training project name. | `"gpt-oss-finetunes"` |
| `BT_NUM_GPUS` | Number of GPUs per node. | `"4"` |
| `BT_CHECKPOINT_DIR` | Checkpoint save directory. | `"/mnt/ckpts"` |
| `BT_LOAD_CHECKPOINT_DIR` | Loaded checkpoints directory. | `"/tmp/loaded_checkpoints"` |
| `BT_PROJECT_CACHE_DIR` | Project-level cache directory. | `"/root/.cache/user_artifacts"` |
| `BT_TEAM_CACHE_DIR` | Team-level cache directory. | `"/root/.cache/team_artifacts"` |
| `BT_RW_CACHE_DIR` | Base read-write cache directory. | `"/root/.cache"` |
| `BT_RETRY_COUNT` | Job retry attempt count. | `"0"` |
### Multi-node variables
For distributed training across multiple nodes:
| Variable | Description | Example |
| ---------------- | ------------------------------ | ------------ |
| `BT_GROUP_SIZE` | Number of nodes in deployment. | `"2"` |
| `BT_LEADER_ADDR` | Leader node address. | `"10.0.0.1"` |
| `BT_NODE_RANK` | Node rank (0 for leader). | `"0"` |
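A training script can derive its `torch.distributed` settings from these variables. A minimal sketch (the defaults are assumptions that let it run outside a training job as a single node):

```python
import os

# Read Baseten's coordination variables, falling back to
# single-node values when they aren't set.
node_rank = int(os.environ.get("BT_NODE_RANK", "0"))
group_size = int(os.environ.get("BT_GROUP_SIZE", "1"))
num_gpus = int(os.environ.get("BT_NUM_GPUS", "1"))
leader_addr = os.environ.get("BT_LEADER_ADDR", "127.0.0.1")

world_size = group_size * num_gpus  # total processes across all nodes
is_leader = node_rank == 0          # rank 0 runs on the leader node
```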
***
## Deploy checkpoints
Deploys trained model checkpoints from a completed training job to Baseten's inference platform. Baseten downloads the checkpoint weights, packages them with a serving runtime, and creates a deployable model endpoint. See the [deployment guide](/training/deployment) for the full workflow.
### Deploy with CLI wizard
Deploy checkpoints interactively with the CLI wizard:
```bash theme={"system"}
truss train deploy_checkpoints --job-id <job-id>
```
The wizard guides you through selecting checkpoints and configuring deployment. Baseten automatically recognizes full fine-tune and LoRA checkpoints for LLMs and Whisper models.
The `deploy_checkpoints` command doesn't support FSDP checkpoints. Configure these manually in the Truss config.
For optimized inference with TensorRT-LLM, see [Deploy checkpoints with Engine Builder](/engines/performance-concepts/deployment-from-training-and-s3).
### Deploy with static configuration
Create a Python config file for repeatable deployments:
```bash theme={"system"}
truss train deploy_checkpoints --config <config-file>
```
## DeployCheckpointsConfig
Defines how to deploy checkpoints from a completed training job to a Baseten inference endpoint. Baseten reads the checkpoint weights, selects the correct serving backend based on the model weights format (full, LoRA, or Whisper), and provisions the specified [Compute](#compute) resources.
```python deploy_config.py theme={"system"}
from truss_train import definitions
from truss.base import truss_config
deploy_config = definitions.DeployCheckpointsConfig(
model_name="fine-tuned-llm",
checkpoint_details=definitions.CheckpointList(
base_model_id="meta-llama/Llama-3.1-8B-Instruct",
checkpoints=[
definitions.LoRACheckpoint(
training_job_id="gvpql31",
checkpoint_name="checkpoint-100",
lora_details=definitions.LoRADetails(rank=16),
)
]
),
compute=definitions.Compute(
accelerator=truss_config.AcceleratorSpec(
accelerator=truss_config.Accelerator.H200,
count=1,
)
),
)
```
### Parameters
Checkpoints to deploy, including the base model ID for LoRA and one or more checkpoint references. For more information, see [CheckpointList](#checkpointlist).
Name for the deployed model in the Baseten dashboard.
Environment variables for the inference runtime, such as API keys or serving configuration. For more information, see [DeployCheckpointsRuntime](#deploycheckpointsruntime).
GPU and memory allocation for the inference endpoint. Uses the same [Compute](#compute) configuration as training jobs.
### DeployCheckpointsRuntime
Sets environment variables for the deployed inference endpoint. Use this to inject API keys or configuration that the serving runtime needs.
Key-value pairs that Baseten injects as env vars. Use [SecretReference](#secretreference) for sensitive values.
### CheckpointList
Groups one or more checkpoints for deployment. For LoRA deployments, set `base_model_id` to the Hugging Face model ID you trained the adapters on.
Directory where Baseten downloads checkpoint files during deployment. Defaults to `/tmp/training_checkpoints`.
Hugging Face model ID for the base model. Required for LoRA deployments.
One or more [FullCheckpoint](#fullcheckpoint), [LoRACheckpoint](#loracheckpoint), or [WhisperCheckpoint](#whispercheckpoint) instances.
### Checkpoint types
Baseten supports three checkpoint types. Use the type that matches how your model was trained.
#### FullCheckpoint
Deploys a complete set of model weights from a full fine-tune.
Training job ID.
Checkpoint name.
Auto-set to `full`.
#### LoRACheckpoint
Deploys LoRA adapter weights on top of the base model you specify in [CheckpointList](#checkpointlist).
Training job ID.
Checkpoint name.
Auto-set to `lora`.
LoRA adapter configuration. Set `rank` to match the rank you used during training. Defaults to `LoRADetails()`. Valid values:
* 8, 16, 32, 64, 128, 256, 320, 512.
For more information, see [LoRADetails](#loradetails).
#### WhisperCheckpoint
Deploys fine-tuned Whisper model weights for speech-to-text inference.
Training job ID.
Checkpoint name.
Auto-set to `whisper`.
### LoRADetails
Sets the LoRA rank for adapter deployment. The rank must match the rank you set during training.
LoRA rank. Valid values: 8, 16, 32, 64, 128, 256, 320, 512.
# Truss SDK Reference
Source: https://docs.baseten.co/reference/sdk/truss
Python SDK for deploying and managing models with Truss.
## Authentication
### `truss.login(api_key: str) → None`
Authenticates with Baseten using an API key.
**Parameters:**
| Name | Type | Description |
| --------- | ----- | ---------------- |
| `api_key` | *str* | Baseten API Key. |
***
## Deploying a Model
### `truss.push(target_directory: str, **kwargs) → ModelDeployment`
Deploys a **Truss** model to Baseten.
**Parameters:**
| Name | Type | Description |
| ----------------------------------------- | ---------------- | ----------------------------------------------------------------------------------------------------------- |
| `target_directory` | *str* | Path to the Truss directory to push. |
| `remote` | *Optional\[str]* | Name of the remote in `.trussrc` to push to. |
| `model_name` | *Optional\[str]* | Temporarily override the model name for this deployment without updating `config.yaml`. |
| `promote` | *bool* | Deploy as **published** and promote to production, even if a production deployment exists. |
| `preserve_previous_production_deployment` | *bool* | Preserve the previous production deployment's **autoscaling settings** (only with `promote`). |
| `trusted` | *bool* | Grants **access to secrets** on the remote host. |
| `deployment_name` | *Optional\[str]* | Custom deployment name (must contain only alphanumeric, `.`, `-`, or `_` characters). (Requires `promote`.) |
**Returns:** [ModelDeployment](#class-truss-api-definitions-modeldeployment) – An object representing the deployed model.
***
## Model Deployment Object
### *class* `truss.api.definitions.ModelDeployment`
Represents a deployed model (returned by `truss.push()`).
**Attributes**
`model_id` → `str`: Unique ID of the deployed model.
`model_deployment_id` → `str`: Unique ID of the model deployment.
**Methods**
`wait_for_active()` → `bool`
Waits for the deployment to become **active**.
**Returns**: `True` when deployment is ready.
**Raises**: An error if deployment fails.
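Together, `login` and `push` make a minimal deploy script. A sketch only: the Truss directory path and the `BASETEN_API_KEY` environment variable name are assumptions.

```python
import os
import truss

# Authenticate with an API key read from the environment.
truss.login(api_key=os.environ["BASETEN_API_KEY"])

# Push the Truss and block until the deployment is serving traffic.
deployment = truss.push("./my-truss", promote=True)
deployment.wait_for_active()
print(deployment.model_id)
```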
# Create training job
Source: https://docs.baseten.co/reference/training-api/create-training-job
post /v1/training_projects/{training_project_id}/jobs
Creates a training job with the specified configuration.
# Create training project
Source: https://docs.baseten.co/reference/training-api/create-training-project
post /v1/training_projects
Upserts a training project with the specified metadata.
# Delete training job
Source: https://docs.baseten.co/reference/training-api/delete-training-job
delete /v1/training_projects/{training_project_id}/jobs/{training_job_id}
Deletes a training job. Stops it first if still running.
# Delete training project
Source: https://docs.baseten.co/reference/training-api/delete-training-project
delete /v1/training_projects/{training_project_id}
Deletes a training project and all associated training jobs.
# Download training job source code
Source: https://docs.baseten.co/reference/training-api/download-training-job
get /v1/training_projects/{training_project_id}/jobs/{training_job_id}/download
Get the uploaded training job source code as an S3 artifact.
# Get auth codes for training job
Source: https://docs.baseten.co/reference/training-api/get-auth-codes-for-training-job
get /v1/training_projects/{training_project_id}/jobs/{training_job_id}/auth_codes
Get authentication codes for all nodes of a training job's interactive sessions.
# Get training job
Source: https://docs.baseten.co/reference/training-api/get-training-job
get /v1/training_projects/{training_project_id}/jobs/{training_job_id}
Get the details of an existing training job.
# Get training job checkpoint files
Source: https://docs.baseten.co/reference/training-api/get-training-job-checkpoint-files
get /v1/training_projects/{training_project_id}/jobs/{training_job_id}/checkpoint_files
Get presigned URLs for all checkpoint files for a training job.
# List training job checkpoints
Source: https://docs.baseten.co/reference/training-api/get-training-job-checkpoints
get /v1/training_projects/{training_project_id}/jobs/{training_job_id}/checkpoints
Get the checkpoints for a training job.
# Get training job logs
Source: https://docs.baseten.co/reference/training-api/get-training-job-logs
post /v1/training_projects/{training_project_id}/jobs/{training_job_id}/logs
Get the logs for a training job with the provided filters.
# Get training job metrics
Source: https://docs.baseten.co/reference/training-api/get-training-job-metrics
post /v1/training_projects/{training_project_id}/jobs/{training_job_id}/metrics
Get the metrics for a training job.
# Get training project
Source: https://docs.baseten.co/reference/training-api/get-training-project
get /v1/training_projects/{training_project_id}
Get the details of an existing training project.
# Get training project cache summary
Source: https://docs.baseten.co/reference/training-api/get-training-project-cache-summary
get /v1/training_projects/{training_project_id}/cache/summary
Get the cache summary for the most recent training job in the project.
# List training projects
Source: https://docs.baseten.co/reference/training-api/get-training-projects
get /v1/training_projects
List all training projects for the organization.
# List training jobs
Source: https://docs.baseten.co/reference/training-api/list-training-jobs
get /v1/training_projects/{training_project_id}/jobs
List all training jobs for the training project.
# Overview
Source: https://docs.baseten.co/reference/training-api/overview
Programmatically manage Baseten Training resources.
The Training API manages training projects, jobs, and related resources through a RESTful interface. Use this API to:
* Monitor training job metrics and logs
* Manage training jobs
* Manage checkpoints and artifacts
## Authentication
All Training API requests require authentication with an API key:
```bash theme={"system"}
Authorization: Api-Key YOUR_API_KEY
```
## Base URL
All Training API endpoints are relative to:
```text theme={"system"}
https://api.baseten.co/v1
```
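Putting the base URL and auth header together, a request to the list-projects endpoint looks like this. A sketch using only the standard library; `BASETEN_API_KEY` is assumed to hold a workspace API key:

```python
import os
import urllib.request

api_key = os.environ.get("BASETEN_API_KEY", "YOUR_API_KEY")

# Build an authenticated request against the Training API.
request = urllib.request.Request(
    "https://api.baseten.co/v1/training_projects",
    headers={"Authorization": f"Api-Key {api_key}"},
)

# Uncomment to perform the call:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode())
```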
## Available Endpoints
### Training Projects
| Method | Endpoint | Description |
| -------- | --------------------------------------------------------------------------------------------- | -------------------------- |
| `GET` | [`/training_projects`](/reference/training-api/get-training-projects) | List all training projects |
| `GET` | [`/training_projects/{training_project_id}`](/reference/training-api/get-training-project) | Get a training project |
| `POST` | [`/training_projects`](/reference/training-api/create-training-project) | Create a training project |
| `DELETE` | [`/training_projects/{training_project_id}`](/reference/training-api/delete-training-project) | Delete a training project |
### Training Jobs
**Note: Creating training jobs via REST API is not supported at this time.**
The following endpoints use the relative base path: `/training_projects/{training_project_id}/jobs`
| Method | Endpoint | Description |
| -------- | ----------------------------------------------------------------------------------------------------- | --------------------------------- |
| `POST` | [`.../`](/reference/training-api/create-training-job) | Create a training job |
| `GET` | [`.../`](/reference/training-api/list-training-jobs) | List all jobs in a project |
| `GET` | [`.../{training_job_id}`](/reference/training-api/get-training-job) | Get a specific training job |
| `POST` | [`.../{training_job_id}/stop`](/reference/training-api/stop-training-job) | Stop a training job |
| `DELETE` | [`.../{training_job_id}`](/reference/training-api/delete-training-job) | Delete a training job |
| `POST` | [`.../{training_job_id}/recreate`](/reference/training-api/recreate-training-job) | Recreate a training job |
| `POST` | [`.../{training_job_id}/logs`](/reference/training-api/get-training-job-logs) | Get training job logs |
| `POST` | [`.../{training_job_id}/metrics`](/reference/training-api/get-training-job-metrics) | Get training job metrics |
| `GET` | [`.../{training_job_id}/checkpoints`](/reference/training-api/get-training-job-checkpoints) | List job checkpoints |
| `GET` | [`.../{training_job_id}/checkpoint_files`](/reference/training-api/get-training-job-checkpoint-files) | Get training job checkpoint files |
| `GET` | [`.../{training_job_id}/download`](/reference/training-api/download-training-job) | Download training job artifacts |
Search endpoint:
| Method | Endpoint | Description |
| ------ | ----------------------------------------------------------------------- | ------------------------------- |
| `POST` | [`/training_jobs/search`](/reference/training-api/search-training-jobs) | Search across all training jobs |
# Recreate training job
Source: https://docs.baseten.co/reference/training-api/recreate-training-job
post /v1/training_projects/{training_project_id}/jobs/{training_job_id}/recreate
Create a new training job with the same configuration as an existing training job.
# Search training jobs
Source: https://docs.baseten.co/reference/training-api/search-training-jobs
post /v1/training_jobs/search
Search training jobs for the organization.
# Stop training job
Source: https://docs.baseten.co/reference/training-api/stop-training-job
post /v1/training_projects/{training_project_id}/jobs/{training_job_id}/stop
Stop a training job.
# Truss configuration
Source: https://docs.baseten.co/reference/truss-configuration
Set your model resources, dependencies, and more
The `config.yaml` file defines how your model runs on Baseten: its dependencies,
compute resources, secrets, and runtime behavior. You specify what your model
needs; Baseten handles the infrastructure.
Every Truss includes a `config.yaml` in its root directory. Configuration is
optional; every value has a sensible default.
Common configuration tasks include:
* [Allocate GPU and memory](#resources): compute resources for your instance.
* [Declare environment variables](#environment-variables): environment variables for your model.
* [Configure concurrency](#runtime): parallel request handling.
* [Use a custom Docker image](#base-image): deploy pre-built inference servers.
If you're new to YAML, here's a quick primer.
The default config uses `[]` for empty lists and `{}` for empty dictionaries.
When adding values, the syntax changes to indented lines:
```yaml theme={"system"}
# Empty
requirements: []
secrets: {}
# With values
requirements:
- torch
- transformers
secrets:
hf_access_token: null
```
## Example
The following example shows a config file for a GPU-accelerated text generation model:
```yaml config.yaml theme={"system"}
model_name: my-llm
description: A text generation model.
requirements:
- torch
- transformers
- accelerate
resources:
cpu: "4"
memory: 16Gi
accelerator: L4
secrets:
hf_access_token: null
```
For more examples, see the
[truss-examples](https://github.com/basetenlabs/truss-examples) repository.
## Reference
The name of your model.
This is displayed in the model details page in the Baseten UI.
A description of your model.
The name of the class that defines your Truss model.
This class must implement at least a `predict` method.
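As a sketch of the minimal shape (the class name and method signatures follow the examples elsewhere on this page; the body here is a placeholder, not a real model):

```python
class Model:
    def __init__(self, **kwargs):
        # Truss passes configuration values (data_dir, secrets, etc.) as kwargs.
        self._model = None

    def load(self):
        # Optional: load weights once at startup, before traffic arrives.
        self._model = lambda text: text.upper()  # placeholder for a real model

    def predict(self, model_input):
        # Required: handle one inference request.
        return {"output": self._model(model_input["text"])}
```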
The folder containing your model class.
The folder for data files in your Truss. Access it in your model:
```python model/model.py theme={"system"}
class Model:
def __init__(self, **kwargs):
data_dir = kwargs["data_dir"]
# ...
```
The folder for custom packages in your Truss.
Place your own code here to reference in `model.py`. For example, with this project structure:
```output theme={"system"}
stable-diffusion/
packages/
package_1/
subpackage/
script.py
model/
model.py
__init__.py
config.yaml
```
Inside `model.py`, the package can be imported like this:
```python model/model.py theme={"system"}
from package_1.subpackage.script import run_script
class Model:
def __init__(self, **kwargs):
pass
def load(self):
run_script()
...
```
Use `external_package_dirs` to access custom packages located outside your Truss.
This lets multiple Trusses share the same package.
The following example shows a project structure where `shared_utils/` is outside the Truss:
```output theme={"system"}
my-model/
model/
model.py
config.yaml
shared_utils/
helpers.py
```
Specify the path in your `config.yaml`:
```yaml config.yaml theme={"system"}
external_package_dirs:
- ../shared_utils/
```
Then import the package in your `model.py`:
```python model.py theme={"system"}
from shared_utils.helpers import process_input
class Model:
def predict(self, model_input):
return process_input(model_input)
```
Key-value pairs exposed to the environment that the model executes in.
Many Python libraries can be customized using environment variables.
Do not store secret values directly in environment variables (or anywhere in
the config file). See the `secrets` field for information on properly managing
secrets.
```yaml theme={"system"}
environment_variables:
ENVIRONMENT: Staging
DB_URL: https://my_database.example.com/
```
A flexible field for additional metadata.
The entire config file is available to your model at runtime.
**Reserved keys** that Baseten interprets:
* `example_model_input`: Sample input that populates the Baseten playground.
For example, to configure a model with playground input and custom vLLM settings, use the following:
```yaml theme={"system"}
model_metadata:
example_model_input: {"prompt": "What is the meaning of life?"}
vllm_config:
tensor_parallel_size: 1
max_model_len: 4096
```
Path to a dependency file. Supports `requirements.txt`, `pyproject.toml`, and `uv.lock`.
Truss detects the format by filename. Pin versions for reproducibility.
When set to a `pyproject.toml`, Truss installs packages from `[project.dependencies]`.
When set to a `uv.lock`, a sibling `pyproject.toml` must exist in the same directory.
```yaml theme={"system"}
requirements_file: ./requirements.txt
```
```yaml theme={"system"}
requirements_file: ./pyproject.toml
```
```yaml theme={"system"}
requirements_file: ./uv.lock
```
A list of Python dependencies in [pip requirements file format](https://pip.pypa.io/en/stable/reference/requirements-file-format/).
Mutually exclusive with `requirements_file`; only one can be specified.
For example, to install pinned versions of the dependencies, use the following:
```yaml theme={"system"}
requirements:
- scikit-learn==1.0.2
- threadpoolctl==3.0.0
- joblib==1.1.0
- numpy==1.20.3
- scipy==1.7.3
```
System packages that you would typically install using `apt` on a Debian operating system.
```yaml theme={"system"}
system_packages:
- ffmpeg
- libsm6
- libxext6
```
The Python version to use.
Supported versions:
* `py39`
* `py310`
* `py311`
* `py312`
* `py313`
* `py314`
Declare secrets your model needs at runtime, such as API keys or access tokens.
Store the actual values in your [organization settings](https://app.baseten.co/settings/secrets).
Never store actual secret values in config. Use `null` as a placeholder. The key name must match the secret name in your organization.
```yaml theme={"system"}
secrets:
hf_access_token: null
```
For more information, see [Secrets](/development/model/secrets).
The path to a file containing example inputs for your model.
If true, changes to your model code are automatically reloaded without restarting the server. Useful for development.
Whether to apply library patches for improved compatibility.
## resources
The `resources` section specifies the compute resources that your model needs, including CPU, memory, and GPU resources.
You can configure resources in two ways:
**Option 1: Specify individual resource fields**
```yaml theme={"system"}
resources:
accelerator: A10G
cpu: "4"
memory: 20Gi
```
Baseten provisions the smallest instance that meets the specified constraints.
**Option 2: Specify an exact instance type**
```yaml theme={"system"}
resources:
instance_type: "A10G:4x16"
```
Using `instance_type` lets you select an exact SKU from the [instance type reference](/deployment/resources#instance-type-reference). When `instance_type` is specified, other resource fields are ignored.
CPU resources needed, expressed as either a raw number or "millicpus".
For example, `1000m` and `1` are equivalent.
Fractional CPU amounts can be requested using millicpus.
For example, `500m` is half of a CPU core.
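To illustrate the notation, here is a small helper written for this doc (not part of Truss) that converts either form to a number of cores:

```python
def parse_cpu(value: str) -> float:
    """Convert a CPU quantity like "1", "4", or "500m" into cores."""
    if value.endswith("m"):
        return int(value[:-1]) / 1000  # millicpus: thousandths of a core
    return float(value)

print(parse_cpu("1000m"))  # 1.0, same as parse_cpu("1")
print(parse_cpu("500m"))   # 0.5, half a core
```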
CPU RAM needed, expressed as a number with units.
Units include "Gi" (Gibibytes), "G" (Gigabytes), "Mi" (Mebibytes), and "M" (Megabytes).
For example, `1Gi` and `1024Mi` are equivalent.
`Gi` in `resources.memory` refers to **Gibibytes**, which are slightly larger
than **Gigabytes**.
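The difference is easy to check with a few lines of arithmetic:

```python
GIB = 1024 ** 3   # Gi: Gibibyte
GB = 1000 ** 3    # G: Gigabyte
MIB = 1024 ** 2   # Mi: Mebibyte

assert 1 * GIB == 1024 * MIB  # 1Gi and 1024Mi are equivalent
print(GIB / GB)               # a Gibibyte is about 7% larger than a Gigabyte
```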
The GPU type for your instance.
Available GPUs:
* `T4`
* `L4`
* `L40S`
* `A10G`
* `V100`
* `A100`
* `A100_40GB`
* `H100`
* `H100_40GB` ([fractional GPU details](https://www.baseten.co/blog/using-fractional-h100-gpus-for-efficient-model-serving/))
* `H200`
* `B200`
To request multiple GPUs (for example, if the weights don't fit in a single GPU), use the `:` operator:
```yaml theme={"system"}
resources:
accelerator: L4:4 # Requests 4 L4s
```
For more information, see how to [Manage resources](/deployment/resources).
The full SKU name for the instance type. When specified, `cpu`, `memory`, and `accelerator` fields are ignored.
Use this field to select an exact instance type from the [instance type reference](/deployment/resources#instance-type-reference). The format is `<GPU>:<vCPUs>x<RAM in GiB>` for GPU instances or `CPU:<vCPUs>x<RAM in GiB>` for CPU-only instances.
```yaml theme={"system"}
resources:
instance_type: "L4:4x16"
```
Examples:
* `L4:4x16`: L4 GPU with 4 vCPUs and 16 GiB RAM.
* `H100:8x80`: H100 GPU with 8 vCPUs and 80 GiB RAM (the exact specs vary by GPU type).
* `CPU:4x16`: CPU-only instance with 4 vCPUs and 16 GiB RAM.
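The SKU format above can be decomposed mechanically; this helper is written for this doc (not part of Truss) just to make the fields explicit:

```python
def parse_instance_type(sku: str):
    """Split an instance type SKU like "L4:4x16" into its parts."""
    accelerator, spec = sku.split(":")
    vcpus, ram_gib = spec.split("x")
    return accelerator, int(vcpus), int(ram_gib)

print(parse_instance_type("L4:4x16"))   # ('L4', 4, 16)
print(parse_instance_type("CPU:4x16"))  # ('CPU', 4, 16)
```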
The number of nodes for multi-node deployments. Each node gets the specified resources.
## runtime
Runtime settings for your model instance.
For example, to configure a high-throughput inference server with concurrency and health checks, use the following:
```yaml theme={"system"}
runtime:
predict_concurrency: 256
streaming_read_timeout: 120
health_checks:
restart_threshold_seconds: 600
stop_traffic_threshold_seconds: 300
```
The number of concurrent requests that can run in your model's predict method. Default is 1, meaning `predict` runs one request at a time. Increase this if your model supports parallelism.
See [Autoscaling](/deployment/autoscaling/overview#scaling-triggers) for more detail.
The timeout in seconds for streaming read operations.
If true, enables trace data export with built-in OTEL instrumentation. By default, data is collected internally by Baseten for troubleshooting. You can also export to your own systems. See the [tracing guide](/observability/tracing). May add performance overhead.
If true, sets the Truss server log level to `DEBUG` instead of `INFO`.
The transport protocol for your model. Supports `http` (default), `websocket`, and `grpc`.
```yaml theme={"system"}
runtime:
transport:
kind: websocket
ping_interval_seconds: 30
ping_timeout_seconds: 10
```
Custom health check configuration for your deployments. For details, see [Configuring health checks](/development/model/custom-health-checks#configuring-health-checks).
```yaml theme={"system"}
runtime:
health_checks:
restart_check_delay_seconds: 120
restart_threshold_seconds: 600
stop_traffic_threshold_seconds: 300
```
The delay in seconds before starting restart checks. Defaults to platform-determined value when not set.
The time in seconds after which an unhealthy instance is restarted. Defaults to platform-determined value when not set.
The time in seconds after which traffic is stopped to an unhealthy instance. Defaults to platform-determined value when not set.
## base\_image
Use `base_image` to deploy a custom Docker image. This is useful for running scripts at build time or installing complex dependencies.
For more information, see [Deploy custom Docker images](/development/model/custom-server).
For example, to use the vLLM Docker image as your base, use the following:
```yaml theme={"system"}
base_image:
image: vllm/vllm-openai:v0.7.3
python_executable_path: /usr/bin/python
# ...
```
The path to the Docker image, for example:
* `vllm/vllm-openai`
* `lmsysorg/sglang`
* `nvcr.io/nvidia/nemo:23.03`
When using image tags like `:latest`, Baseten uses a cached copy and may not reflect updates to the image. To pull a specific version, use image digests like `your-image@sha256:abc123...`.
A path to the Python executable on the image, for example `/usr/bin/python`.
```yaml theme={"system"}
base_image:
image: vllm/vllm-openai:latest
python_executable_path: /usr/bin/python
```
Authentication configuration for a private Docker registry.
```yaml theme={"system"}
base_image:
docker_auth:
auth_method: GCP_SERVICE_ACCOUNT_JSON
secret_name: gcp-service-account
registry: us-west2-docker.pkg.dev
```
For more information, see [Private Docker registries](/development/model/private-registries).
The authentication method for the private registry. Supported values:
* `GCP_SERVICE_ACCOUNT_JSON` - authenticate with a [GCP service account](https://cloud.google.com/iam/docs/service-account-overview). Add your service account JSON blob as a Truss secret.
* `AWS_IAM` - authenticate with an [AWS IAM service account](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html). Add `aws_access_key_id` and `aws_secret_access_key` to your Baseten secrets.
* `AWS_OIDC` - authenticate using AWS OIDC federation. Requires `aws_oidc_role_arn` and `aws_oidc_region`.
* `GCP_OIDC` - authenticate using GCP Workload Identity Federation. Requires `gcp_oidc_service_account` and `gcp_oidc_workload_id_provider`.
For `GCP_SERVICE_ACCOUNT_JSON`:
```yaml theme={"system"}
base_image:
docker_auth:
auth_method: GCP_SERVICE_ACCOUNT_JSON
secret_name: gcp-service-account
registry: us-east4-docker.pkg.dev
```
For `AWS_IAM`:
```yaml theme={"system"}
base_image:
docker_auth:
auth_method: AWS_IAM
registry: <account-id>.dkr.ecr.<region>.amazonaws.com
secrets:
aws_access_key_id: null
aws_secret_access_key: null
```
For `AWS_OIDC`:
```yaml theme={"system"}
base_image:
docker_auth:
auth_method: AWS_OIDC
registry: <account-id>.dkr.ecr.<region>.amazonaws.com
aws_oidc_role_arn: arn:aws:iam::123456789012:role/my-role
aws_oidc_region: us-east-1
```
For `GCP_OIDC`:
```yaml theme={"system"}
base_image:
docker_auth:
auth_method: GCP_OIDC
registry: us-east4-docker.pkg.dev
gcp_oidc_service_account: my-sa@my-project.iam.gserviceaccount.com
gcp_oidc_workload_id_provider: projects/123/locations/global/workloadIdentityPools/my-pool/providers/my-provider
```
The Truss secret that stores the credential for authentication. Required for `GCP_SERVICE_ACCOUNT_JSON`. Ensure this secret is added to the `secrets` section.
The registry to authenticate to (e.g., `us-east4-docker.pkg.dev`).
The secret name for the AWS access key ID. Only used with `AWS_IAM` auth method.
The secret name for the AWS secret access key. Only used with `AWS_IAM` auth method.
## docker\_server
Use `docker_server` to deploy a custom Docker image that has its own HTTP server, without writing a `Model` class. This is useful for deploying inference servers like vLLM or SGLang that provide their own endpoints.
See [Deploy custom Docker images](/development/model/custom-server) for usage details.
For example, to deploy vLLM serving Qwen 2.5 3B, use the following:
```yaml theme={"system"}
base_image:
image: vllm/vllm-openai:v0.7.3
docker_server:
start_command: vllm serve Qwen/Qwen2.5-3B-Instruct --enable-prefix-caching
readiness_endpoint: /health
liveness_endpoint: /health
predict_endpoint: /v1/completions
server_port: 8000
# ...
```
The command to start the server. Required when `no_build` is not true.
The port where the server runs. Port 8080 is reserved by Baseten's internal reverse proxy and cannot be used.
The endpoint for inference requests. This is mapped to Baseten's `/predict` route.
The endpoint for [readiness probes](https://kubernetes.io/docs/concepts/configuration/liveness-readiness-startup-probes/#readiness-probe). Determines when the container can accept traffic.
The endpoint for [liveness probes](https://kubernetes.io/docs/concepts/configuration/liveness-readiness-startup-probes/#liveness-probe). Determines if the container needs to be restarted.
The Linux UID to run the server process as inside the container. Use this when your base image expects a specific non-root user (for example, NVIDIA NIM containers).
The specified UID must already exist in the base image. Values `0` (root) and `60000` (platform default) are not allowed.
Baseten automatically sets ownership of `/app`, `/workspace`, the packages directory, and `$HOME` to this UID. If your server writes to other directories, ensure they are writable by this UID in your base image or via `build_commands`.
Skip the build step and deploy the base image as-is. Baseten copies the image to its container registry without running `docker build` or modifying the image in any way. Only available for [custom server deployments](/development/model/custom-server) that use `docker_server`.
When `no_build` is `true`:
* `start_command` is optional. If omitted, the image's original `ENTRYPOINT` runs.
* Environment variables and secrets are available.
* Development mode is not supported. Deploy with `truss push` (published deployments are the default).
Use this for security-hardened images (for example, Chainguard) that must remain unmodified. [Contact support](mailto:support@baseten.co) to enable no-build deployments for your organization.
```yaml config.yaml theme={"system"}
base_image:
image: your-registry/your-hardened-image:latest
docker_server:
no_build: true
server_port: 8000
predict_endpoint: /predict
readiness_endpoint: /health
liveness_endpoint: /health
```
See [No-build deployment](/development/model/custom-server#no-build-deployment) for usage details.
The `/app` directory is reserved by Baseten. Only `/app` and `/tmp` are writable in the container.
## external\_data
Use `external_data` to bundle data into your image at build time. This reduces cold-start time by making data available without downloading it at runtime.
```yaml theme={"system"}
external_data:
- url: https://my-bucket.s3.amazonaws.com/my-data.tar.gz
local_data_path: data/my-data.tar.gz
name: my-data
```
The URL to download data from.
The path on the image where the data will be downloaded to.
A name for the data, useful for readability purposes.
The download backend to use.
## build\_commands
A list of commands to run at build time.
Useful for performing one-off bash commands.
For example, to clone a GitHub repository, use the following:
```yaml theme={"system"}
build_commands:
- git clone https://github.com/comfyanonymous/ComfyUI.git
```
To install Ollama into the container at build time, use the following:
```yaml theme={"system"}
model_name: ollama-tinyllama
base_image:
image: python:3.11-slim
build_commands:
- curl -fsSL https://ollama.com/install.sh | sh
docker_server:
start_command: sh -c "ollama serve & sleep 5 && ollama pull tinyllama && wait"
readiness_endpoint: /api/tags
liveness_endpoint: /api/tags
predict_endpoint: /api/generate
server_port: 11434
resources:
cpu: "4"
memory: 8Gi
```
For more information, see [Build commands](/development/model/custom-server).
## build
The `build` section handles secret access during Docker builds.
Other build-time configuration options are:
* [`build_commands`](#build_commands): shell commands to run during build.
* [`requirements`](#requirements): Python packages to install.
* [`system_packages`](#system_packages): apt packages to install.
* [`base_image`](#base_image): custom Docker base image.
Grants access to secrets during the build.
Provide a mapping between a secret and a path on the image.
You can then access the secret in commands specified in `build_commands` by running `cat` on the file.
For example, to install a pip package from a private GitHub repository, use the following:
```yaml theme={"system"}
build_commands:
- pip install git+https://$(cat /root/my-github-access-token)@github.com/path/to-private-repo.git
build:
secret_to_path_mapping:
my-github-access-token: /root/my-github-access-token
secrets:
my-github-access-token: null
```
Under the hood, this option mounts your secret as a build secret.
The value of your secret will be secure and will not be exposed in your Docker history or logs.
## weights (Preview)
Use `weights` to configure Baseten Delivery Network (BDN) for model weight delivery with multi-tier caching. This is the recommended approach for optimizing cold starts.
```yaml theme={"system"}
weights:
- source: "hf://meta-llama/Llama-3.1-8B@main"
mount_location: "/models/llama"
allow_patterns: ["*.safetensors", "config.json"]
```
`weights` replaces the deprecated `model_cache` configuration. Use `truss migrate` to automatically convert your configuration.
URI specifying where to fetch weights from. Supported schemes:
* `hf://`: HuggingFace Hub (e.g., `hf://meta-llama/Llama-3.1-8B@main`)
* `s3://`: AWS S3 (e.g., `s3://my-bucket/models/weights`)
* `gs://`: Google Cloud Storage (e.g., `gs://my-bucket/models/weights`)
* `r2://`: Cloudflare R2 (e.g., `r2://account_id.bucket/path`)
Absolute path where weights will be mounted in your container. Must start with `/`.
Name of a Baseten secret containing credentials for private weight sources.
Authentication configuration for accessing private weight sources. Required for OIDC-based authentication. Supported `auth_method` values:
* `CUSTOM_SECRET`: use a Baseten secret (specify `auth_secret_name`).
* `AWS_OIDC`: use AWS OIDC federation (requires `aws_oidc_role_arn` and `aws_oidc_region`).
* `GCP_OIDC`: use GCP Workload Identity Federation (requires `gcp_oidc_service_account` and `gcp_oidc_workload_id_provider`).
For AWS OIDC:
```yaml theme={"system"}
weights:
- source: "s3://my-bucket/models/weights"
mount_location: "/models/weights"
auth:
auth_method: AWS_OIDC
aws_oidc_role_arn: arn:aws:iam::123456789012:role/my-role
aws_oidc_region: us-east-1
```
For GCP OIDC:
```yaml theme={"system"}
weights:
- source: "gs://my-bucket/models/weights"
mount_location: "/models/weights"
auth:
auth_method: GCP_OIDC
gcp_oidc_service_account: my-sa@my-project.iam.gserviceaccount.com
gcp_oidc_workload_id_provider: projects/123/locations/global/workloadIdentityPools/my-pool/providers/my-provider
```
File patterns to include. Uses `fnmatch`-style wildcards. Patterns like `*.safetensors` only match at the root level; use `**/*.safetensors` for recursive matching across subdirectories.
File patterns to exclude. Uses `fnmatch`-style wildcards. Patterns like `*.bin` only match at the root level; use `**/*.bin` for recursive matching across subdirectories.
For full documentation, see [Baseten Delivery Network (BDN)](/development/model/bdn).
## model\_cache (Deprecated)
`model_cache` is deprecated. Use [`weights`](#weights) instead for faster cold starts through multi-tier caching.
Use `model_cache` to bundle model weights into your image at build time, reducing cold start latency.
For example, to cache Llama 2 7B weights from Hugging Face, use the following:
```yaml theme={"system"}
model_cache:
- repo_id: NousResearch/Llama-2-7b-chat-hf
revision: main
ignore_patterns:
- "*.bin"
use_volume: true
volume_folder: llama-2-7b-chat-hf
```
Despite the name `model_cache`, multiple backends are supported, not just Hugging Face.
You can also cache weights stored on GCS, S3, or Azure.
The source path for your model weights.
For example, to cache weights from a Hugging Face repo, use the following:
```yaml theme={"system"}
model_cache:
- repo_id: madebyollin/sdxl-vae-fp16-fix
```
Or you can cache weights from buckets like GCS or S3, using the following options:
```yaml theme={"system"}
model_cache:
- repo_id: gcs://path-to-my-bucket
kind: gcs
- repo_id: s3://path-to-my-bucket
kind: s3
```
The source kind for the model cache.
Supported values: `hf` (Hugging Face), `gcs`, `s3`, `azure`.
The revision of your Hugging Face repo.
Required when `use_volume` is true for Hugging Face repos.
If true, caches model artifacts outside the container image. Recommended: `true`.
The location of the mounted folder. Required when `use_volume` is true.
For example, `volume_folder: myrepo` makes the model available under `/app/model_cache/myrepo` at runtime.
File patterns to include in the cache. Uses Unix shell-style wildcards.
By default, all paths are included.
File patterns to ignore, streamlining the caching process. Use Unix shell-style wildcards. Example: `["*.onnx", "Readme.md"]`. By default, nothing is ignored.
The secret name to use for runtime authentication (e.g., for private Hugging Face repos).
## training\_checkpoints
Configuration for deploying models from training checkpoints.
For example, to deploy a model using checkpoints from a training job, use the following:
```yaml theme={"system"}
training_checkpoints:
download_folder: /tmp/training_checkpoints
artifact_references:
- training_job_id: tr_abc123
paths:
- "checkpoint-*"
```
The folder to download the checkpoints to.
A list of artifact references to download.
The training job ID that the artifact reference belongs to.
The paths of the files to download, which can contain `*` or `?` wildcards.
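Assuming matching semantics similar to Python's `fnmatch` (an assumption; the exact matching behavior is defined by the platform), the `paths` patterns behave like this:

```python
from fnmatch import fnmatch

# Candidate artifact paths from a hypothetical training job.
paths = ["checkpoint-100", "checkpoint-200", "logs/train.log"]
matched = [p for p in paths if fnmatch(p, "checkpoint-*")]
print(matched)  # ['checkpoint-100', 'checkpoint-200']
```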
# Baseten platform status
Source: https://docs.baseten.co/status/status
Current operational status of Baseten's services.
This page automatically refreshes with real-time data from our status monitoring system.
# Basics
Source: https://docs.baseten.co/training/concepts/basics
Learn how to get up and running on Baseten Training
This page covers the essential building blocks of Baseten Training. These are
the core concepts you'll need to understand to effectively organize and execute
your training workflows.
## How Baseten Training works
Baseten Training jobs can be launched from any terminal. Jobs are created from within a directory; when you create a job, that directory is packaged and pushed to Baseten.
This lets you define your Baseten training config, scripts, code, and any other dependencies within the folder.
The folder must include a Baseten training config file, typically `config.py`. The `config.py` includes a list of `run_commands`, which can be anything from running a Python file (`python train.py`) to a bash script (`chmod +x run.sh && ./run.sh`).
If you're looking to upload more than 1GB of files, we strongly suggest
uploading your data to an object store and including a download command before
running your training code. To avoid duplicate downloads, check out our
documentation on the [cache](/training/concepts/cache). For more information
on storage options and data ingestion, see our [storage guide](/training/concepts/storage).
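A sketch of the shape such a `config.py` might take. The `run_commands` field name here is taken from the description above and is an assumption; consult the [SDK reference](/reference/sdk/training) for the exact classes and fields.

```python
# config.py: illustrative sketch only. The run_commands field name follows the
# prose above; check the training SDK reference for the current API surface.
from truss_train import definitions

runtime = definitions.Runtime(
    run_commands=[
        "chmod +x run.sh && ./run.sh",  # or, for example, "python train.py"
    ],
)
```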
## Setting up your workspace
If you'd like to start from one of our existing recipes, you can check out one of the following examples:
**Simple CPU job with raw PyTorch:**
```bash theme={"system"}
truss train init --examples mnist-pytorch
```
**More complex example that trains GPT-OSS-20b:**
```bash theme={"system"}
truss train init --examples oss-gpt-20b-axolotl
```
Your `config.py` contains all infrastructure configuration for your job, which we will cover below.
Your `run.sh` is invoked by the command that runs when the job first begins. Here you can install any Python dependencies not already included in your Docker image, and begin the execution of your code either by calling a Python file with your training code or a launch command.
## Organizing your work with `TrainingProject`s
A `TrainingProject` is a lightweight organization tool to help you group different `TrainingJob`s together.
While there are a few technical details to consider, your team can use `TrainingProject`s to facilitate collaboration and organization.
## Running a `TrainingJob`
Once you have a `TrainingProject`, the actual work of training a model happens within a **`TrainingJob`**. Each `TrainingJob` represents a single, complete execution of your training script with a specific configuration.
* **What it is:** A `TrainingJob` is the fundamental unit of execution. It bundles together:
* Your training code.
* A base `image`.
* The `compute` resources needed to run the job.
* The `runtime` configurations like startup commands and environment variables.
* **Why use it:** Each job is a self-contained, reproducible experiment. If you want to try training your model with a different learning rate, more GPUs, or a slightly modified script, you can create new `TrainingJob`s while knowing that previous ones have been persisted on Baseten.
* **Lifecycle:** A job goes through various stages, from being created (`TRAINING_JOB_CREATED`), to resources being set up (`TRAINING_JOB_DEPLOYING`), to actively running your script (`TRAINING_JOB_RUNNING`), and finally to a terminal state like `TRAINING_JOB_COMPLETED`. More details on the job lifecycle can be found on the [Lifecycle](/training/lifecycle) page.
## Compute resources
The `Compute` configuration defines the computational resources your training job will use. This includes:
* **GPU specifications** - Choose from various GPU types based on your model's requirements.
* **CPU and memory** - Configure the amount of CPU and RAM allocated to your job.
* **Node count** - For single-node or multi-node training setups.
Baseten Training supports H200, H100, and A10G GPUs. Choose your GPU type based
on your model's memory requirements and performance needs.
## Base images
Baseten provides pre-configured base images that include common ML frameworks and dependencies. These images are optimized for training workloads and include:
* Popular ML frameworks (PyTorch, VERL, Megatron, Axolotl, etc.).
* GPU drivers and CUDA support.
* Common data science libraries.
You can also use [custom or private images](/development/model/private-registries) if you have specific requirements.
## Securely integrate with external services with `SecretReference`
Successfully training a model often requires many tools and services. Baseten provides **`SecretReference`** for secure handling of secrets.
* **How to use it:** Store your secret (e.g., an API key for Weights & Biases) in your Baseten workspace with a specific name. In your job's configuration (e.g., environment variables), you refer to this secret by its name using `SecretReference`. The actual secret value is never exposed in your code.
* **How it works:** Baseten injects the secret value at runtime under the environment variable name that you specify.
```python theme={"system"}
from truss_train import definitions
runtime = definitions.Runtime(
# ... other runtime options
environment_variables={
"HF_TOKEN": definitions.SecretReference(name="hf_access_token"),
},
)
```
## Running inference on trained models
The journey from training to a usable model in Baseten typically follows this path:
1. A `TrainingJob` with checkpointing enabled produces one or more model artifacts.
2. You run `truss train deploy_checkpoint` to deploy a model from your most recent training job. You can read more about this at [Serving Trained Models](/training/deployment).
3. Once deployed, your model will be available for inference via API. See more at [Calling Your Model](/inference/calling-your-model).
## Next steps
Now that you understand the basics of Baseten Training, explore these advanced topics to optimize your training workflows:
* **[Cache](/training/concepts/cache)** - Speed up your training iterations by persisting data between jobs and avoiding expensive downloads.
* **[Checkpointing](/training/concepts/checkpointing)** - Manage model checkpoints seamlessly and avoid disk errors during training.
* **[Multinode training](/training/concepts/multinode)** - Scale your training across multiple nodes with high-speed infiniband networking.
# Cache
Source: https://docs.baseten.co/training/concepts/cache
Learn how to use the training cache to speed up your training iterations by persisting data between jobs.
The training cache enables you to persist data between training jobs. This can significantly improve iteration speed by skipping expensive downloads and data transformations.
## How to use the training cache
Set the cache configuration in your `Runtime`:
```python theme={"system"}
from truss_train import definitions
training_runtime = definitions.Runtime(
# ... other configuration options
cache_config=definitions.CacheConfig(enabled=True)
)
```
## Cache directory
By default, the cache is mounted in two locations:
* `/root/.cache/user_artifacts`, which can be accessed via the [`$BT_PROJECT_CACHE_DIR`](/reference/sdk/training#baseten-provided-environment-variables) environment variable. This cache is shared by all jobs in a project.
* `/root/.cache/team_artifacts`, which can be accessed via the [`$BT_TEAM_CACHE_DIR`](/reference/sdk/training#baseten-provided-environment-variables) environment variable. This cache is shared by all jobs for a team.
Both project and team caches are scoped to a single GPU cluster. Data cached on one cluster (for example, H100) is not available on a different cluster (for example, H200). To use the same data on multiple clusters, duplicate it to each cluster's cache.
## Hugging Face cache mount
You can mount your cache to the Hugging Face cache directory by setting `HF_HOME` to one of the provided mount points plus `/huggingface`. For example, you can set `HF_HOME=$BT_PROJECT_CACHE_DIR/huggingface` to use the project cache directory.
However, reading from the cache with multiple processes has considerable pitfalls, as Hugging Face libraries don't handle distributed filesystems well. To make this setup work reliably, configure your dataset processing to use a single process (for example, `num_proc=1`) to minimize the number of concurrent readers.
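Putting that together, a minimal sketch (the fallback path is the documented project-cache mount, used only when the environment variable is missing):

```python
import os

# Point Hugging Face at the shared project cache. Set HF_HOME before
# importing any Hugging Face libraries so they pick up the new location.
cache_root = os.environ.get("BT_PROJECT_CACHE_DIR", "/root/.cache/user_artifacts")
os.environ["HF_HOME"] = os.path.join(cache_root, "huggingface")

# Later, keep a single reader to avoid concurrent-access issues on the
# distributed filesystem:
# from datasets import load_dataset
# ds = load_dataset("my-dataset", num_proc=1)
```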
## Seeding your data and models
For multi-GPU training, seed your data before running multi-process training jobs. You can do this by splitting your code into a data loading script and a training script.
For a 400 GB Hugging Face dataset, this can save *nearly an hour* of compute time per job, since data download and preparation are already done.
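One way to structure the data loading script is a seed-once helper. This is a sketch: the `prepare` callable and the marker-file convention are illustrative, not a Baseten API.

```python
import os
from pathlib import Path

def seed_dataset(cache_root, name, prepare):
    """Run prepare(target_dir) once; later jobs find the data already cached."""
    target = Path(cache_root) / "datasets" / name
    marker = target / ".seeded"  # written only after prepare() succeeds
    if marker.exists():
        return target  # cache hit: skip download and preprocessing
    target.mkdir(parents=True, exist_ok=True)
    prepare(target)  # e.g. load_dataset(...).save_to_disk(target)
    marker.touch()
    return target
```

Run this as a single-process job with `cache_root` pointed at `$BT_PROJECT_CACHE_DIR`, then launch your multi-process training job against the cached copy.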
## Cache management
You can inspect the contents of the cache from the CLI with `truss train cache summarize `. This visibility helps you verify your code is working as expected and clean up files and artifacts you no longer need.
When you delete a project, all data in the project's training cache (`$BT_PROJECT_CACHE_DIR`) is permanently deleted with no archival or recovery option. See [Management](/training/management) for details.
# Checkpointing
Source: https://docs.baseten.co/training/concepts/checkpointing
Learn how to use Baseten's checkpointing feature to manage model checkpoints and avoid disk errors during training.
With checkpointing enabled, you can manage your model checkpoints seamlessly and avoid common training issues.
## Benefits of checkpointing
* **Avoid catastrophic out-of-disk errors**: We mount additional storage at the checkpointing directory to help avoid out-of-disk errors during your training run.
* **Maximize GPU utilization**: When checkpointing is enabled, any data written to the checkpointing directory will be uploaded to the cloud by a separate process, allowing you to maximize GPU time spent training.
* **Seamless checkpoint management**: Checkpoints are automatically uploaded to cloud storage for easy access and management.
## Enabling checkpointing
To enable checkpointing, add a `CheckpointingConfig` to the `Runtime` and set `enabled` to `True`:
```python theme={"system"}
from truss_train import definitions
training_runtime = definitions.Runtime(
# ... other configuration options
checkpointing_config=definitions.CheckpointingConfig(enabled=True)
)
```
## Using the checkpoint directory
Baseten will automatically export the [`$BT_CHECKPOINT_DIR`](/reference/sdk/training#baseten-provided-environment-variables) environment variable in your job's environment.
**Write your checkpoints to the `$BT_CHECKPOINT_DIR` directory so Baseten can automatically backup and preserve them.**
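For example, a sketch of resolving the directory in your training script (the fallback path is only for local runs outside Baseten):

```python
import os

# Baseten exports BT_CHECKPOINT_DIR when checkpointing is enabled;
# fall back to a local path so the script also runs outside Baseten.
ckpt_dir = os.environ.get("BT_CHECKPOINT_DIR", "/tmp/training_checkpoints")

# Hand it to your trainer, e.g. with transformers:
# training_args = TrainingArguments(output_dir=ckpt_dir, save_steps=100)
```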
## Serving checkpoints
Once your training is complete, you can serve your model checkpoints using Baseten's serving infrastructure. Learn more about [serving checkpoints](/training/deployment).
When you delete a job or project, all undeployed checkpoints are permanently deleted with no archival or recovery option. Deployed checkpoints aren't affected. See [Management](/training/management) for details.
# Multinode training
Source: https://docs.baseten.co/training/concepts/multinode
Learn how to configure and run multinode training jobs with Baseten Training.
Baseten Training supports multinode training over InfiniBand for distributed training across multiple nodes.
## Configuring multinode training
To deploy a multinode training job:
* Configure the `Compute` resource in your `TrainingJob` by setting the `node_count` to the number of nodes you'd like to use (e.g. 2).
```python theme={"system"}
from truss_train import definitions
compute = definitions.Compute(
node_count=2, # Use 2 nodes for multinode training
# ... other compute configuration options
)
```
## Environment variables
Make sure you've properly integrated with the
[Baseten provided environment variables](/reference/sdk/training#baseten-provided-environment-variables)
for distributed training.
## Network configuration
Baseten provides high-speed InfiniBand networking between nodes to ensure efficient communication during distributed training. This enables:
* Fast gradient synchronization.
* Efficient parameter updates.
* Low-latency communication between nodes.
## Checkpointing in multinode training
Checkpointing behavior varies across training frameworks in multinode setups. One common pattern is to use the shared cache directory that all nodes can access:
```bash theme={"system"}
# Use shared volume with job name for checkpointing
ckpt_dir="${BT_PROJECT_CACHE_DIR}/${BT_TRAINING_JOB_NAME}"
```
Then write your checkpoints to `ckpt_dir` so all nodes use the same location. For comprehensive framework-specific examples and patterns, see the [Training Cookbook](https://github.com/basetenlabs/ml-cookbook).
Keep in mind that Baseten won't back up these checkpoints, since they aren't stored in `$BT_CHECKPOINT_DIR`. Copy them there before the job ends to preserve them.
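One way to handle that final copy is a small rank-0-only helper. This is a sketch: the `NODE_RANK` environment variable and the checkpoint name are illustrative assumptions; use whatever rank variable your framework exposes.

```python
import os
import shutil

def preserve_checkpoint(shared_dir, name, rank_env="NODE_RANK"):
    """Copy a checkpoint from the shared cache into $BT_CHECKPOINT_DIR.

    Only rank 0 copies, so the same files aren't written by every node.
    """
    if int(os.environ.get(rank_env, "0")) != 0:
        return None
    src = os.path.join(shared_dir, name)
    dst = os.path.join(os.environ["BT_CHECKPOINT_DIR"], name)
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst
```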
## Common practices
When setting up multinode training:
1. **Data loading**: Ensure your data loading is properly distributed across nodes.
2. **Seeding**: Use consistent seeding across all nodes for reproducible results.
3. **Monitoring**: Monitor training metrics across all nodes to ensure balanced training.
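A minimal seeding sketch for the second point (extend with `torch.manual_seed` and `numpy.random.seed` as your frameworks require):

```python
import os
import random

def seed_everything(seed: int = 42) -> None:
    """Apply one base seed on every node so shuffles are reproducible."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # If torch is installed, also seed it:
    # torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
```

Call it once at the top of your training script on every node.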
# Storage and data ingestion
Source: https://docs.baseten.co/training/concepts/storage
Load model weights and training data into Baseten training containers through BDN, S3, Hugging Face, and GCS.
Training jobs need model weights, training datasets, and configuration files.
Baseten provides multiple ways to get data into your training container, from
cached delivery through
[Baseten Delivery Network (BDN)](/development/model/bdn) to direct downloads in
your training script.
## Load weights and data with BDN
Use the [`weights`](/reference/sdk/training#weightssource) parameter on
[`TrainingJob`](/reference/sdk/training#trainingjob) to mount model weights and
training data into your container through BDN. BDN mirrors your data once and
serves it from multi-tier caches, so subsequent jobs start faster.
BDN mirrors your weights to Baseten storage during the `CREATED` state, before any compute is provisioned.
This mirroring step is not billed.
Once the job enters `DEPLOYING`, compute billing begins. This includes the time BDN takes to mount cached weights into your container.
Cached weights mount faster than first-time downloads, reducing billable deploy time on subsequent jobs.
Each weight source specifies a remote URI and a local mount path. When your
container starts, the data is already available at the `mount_location`. No
download code needed in your training script.
### Hugging Face and S3 example
Load model weights from Hugging Face and training data from S3, mounted into the training container before your code runs:
```python config.py theme={"system"}
from truss_train import TrainingProject, TrainingJob, Image, Compute, Runtime, WeightsSource
from truss.base.truss_config import AcceleratorSpec
training_job = TrainingJob(
image=Image(base_image="pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime"),
compute=Compute(
accelerator=AcceleratorSpec(accelerator="H200", count=1),
),
runtime=Runtime(
start_commands=["python train.py"],
),
weights=[
WeightsSource(
source="hf://Qwen/Qwen3-0.6B",
mount_location="/app/models/Qwen/Qwen3-0.6B",
),
WeightsSource(
source="s3://my-bucket/training-data",
mount_location="/app/data/training-data",
),
],
)
training_project = TrainingProject(name="qwen3-finetune", job=training_job)
```
In your training script, reference the mount paths directly:
```python train.py theme={"system"}
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("/app/models/Qwen/Qwen3-0.6B")
tokenizer = AutoTokenizer.from_pretrained("/app/models/Qwen/Qwen3-0.6B")
# Training data is available at /app/data/training-data/
```
### Supported sources
BDN supports these URI schemes:
| Scheme | Example | Description |
| ------- | ----------------------------------- | --------------------- |
| `hf://` | `hf://meta-llama/Llama-3.1-8B@main` | Hugging Face Hub. |
| `s3://` | `s3://my-bucket/path/to/data` | Amazon S3. |
| `gs://` | `gs://my-bucket/path/to/data` | Google Cloud Storage. |
| `r2://` | `r2://account_id.bucket/path` | Cloudflare R2. |
For Hugging Face sources, pin to a specific revision with the `@revision` suffix (branch, tag, or commit SHA).
### Authentication
Private or gated sources require authentication.
Add an `auth` block to your `WeightsSource`:
Store a [Hugging Face token](https://huggingface.co/settings/tokens) as a [Baseten secret](/development/model/secrets):
```python theme={"system"}
WeightsSource(
source="hf://meta-llama/Llama-3.1-8B@main",
mount_location="/app/models/llama",
auth={"auth_method": "CUSTOM_SECRET", "auth_secret_name": "hf_access_token"},
)
```
Store AWS credentials as a JSON [Baseten secret](/development/model/secrets):
```python theme={"system"}
WeightsSource(
source="s3://my-bucket/training-data",
mount_location="/app/data/training-data",
auth={"auth_method": "CUSTOM_SECRET", "auth_secret_name": "aws_credentials"},
)
```
The secret value must contain `aws_access_key_id`, `aws_secret_access_key`, and `aws_region`.
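For example, a secret value with this shape (the key ID and secret shown are placeholders):

```json
{
  "aws_access_key_id": "AKIA...",
  "aws_secret_access_key": "example-secret-key",
  "aws_region": "us-east-1"
}
```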
For the full list of authentication options and source-specific configuration, see the [BDN configuration reference](/development/model/bdn#configuration-reference).
### Filtering files
Use `allow_patterns` and `ignore_patterns` to download only the files you need:
```python theme={"system"}
WeightsSource(
source="hf://meta-llama/Llama-3.1-8B@main",
mount_location="/app/models/llama",
allow_patterns=["*.safetensors", "config.json", "tokenizer.*"],
ignore_patterns=["*.md", "*.txt"],
)
```
***
## Storage types overview
Baseten Training provides three types of storage:
| Storage type | Persistence | Use case |
| ------------------------------------------------- | --------------------------- | --------------------------------------------------------------------- |
| [Training cache](/training/concepts/cache) | Persistent between jobs | Large model downloads, preprocessed datasets, shared artifacts. |
| [Checkpointing](/training/concepts/checkpointing) | Backed up to cloud storage | Model checkpoints, training artifacts you want to deploy or download. |
| Ephemeral storage | Cleared after job completes | Temporary files, intermediate outputs. |
Training cache is scoped to a single GPU cluster. Data cached on one cluster (for example, H100) is not available on a different cluster (for example, H200). To use the same data on multiple clusters, duplicate it to each cluster's cache or load it through BDN.
### Ephemeral storage
Ephemeral storage is cleared when your job completes. Use it for:
* Temporary files during training.
* Intermediate outputs that don't need to persist.
* Scratch space for data processing.
Ephemeral storage is typically limited to a few GBs and is isolated so it can't affect other containers on the same node.
## Loading data in your training script
When data isn't available through a BDN-supported URI scheme, download it directly in your training script.
This works well for datasets loaded from framework-specific libraries or custom download logic.
Use [Baseten secrets](/organization/secrets) to authenticate to your S3 bucket.
1. Add your AWS credentials as secrets in your Baseten account.
2. Reference the secrets in your job configuration:
```python theme={"system"}
from truss_train import definitions
runtime = definitions.Runtime(
environment_variables={
"AWS_ACCESS_KEY_ID": definitions.SecretReference(name="aws_access_key_id"),
"AWS_SECRET_ACCESS_KEY": definitions.SecretReference(name="aws_secret_access_key"),
},
)
```
3. Download from S3 in your training script:
```python theme={"system"}
import boto3
s3 = boto3.client('s3')
s3.download_file('my-bucket', 'training-data.tar.gz', '/path/to/local/file')
```
To avoid re-downloading large datasets on each job, download to the [training cache](/training/concepts/cache) and check if files exist before downloading.
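A sketch of that download-on-miss pattern (the `download_fn` callable stands in for your boto3 download call):

```python
import os

def ensure_cached(cache_dir, relative_path, download_fn):
    """Download into the training cache only on a cache miss."""
    target = os.path.join(cache_dir, relative_path)
    if os.path.exists(target):
        return target  # already cached by an earlier job
    os.makedirs(os.path.dirname(target), exist_ok=True)
    download_fn(target)  # e.g. s3.download_file('my-bucket', key, target)
    return target
```

For example, `ensure_cached(os.environ["BT_PROJECT_CACHE_DIR"], "datasets/training-data.tar.gz", ...)` skips the download on every job after the first.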
Reference a Hugging Face dataset in your training code:
```python theme={"system"}
from datasets import load_dataset
ds = load_dataset("your-username/your-dataset", split="train")
```
For private datasets, authenticate using a Hugging Face token stored in [Baseten secrets](/organization/secrets):
```python theme={"system"}
runtime = definitions.Runtime(
environment_variables={
"HF_TOKEN": definitions.SecretReference(name="hf_access_token"),
},
)
```
Authenticate via [Baseten secrets](/organization/secrets) and download in your training code:
```python theme={"system"}
from google.cloud import storage
client = storage.Client()
bucket = client.bucket('my-bucket')
blob = bucket.blob('training-data.tar.gz')
blob.download_to_filename('/path/to/local/file')
```
## Data size and limits
| Size | Description |
| ------ | ------------------------- |
| Small | A few GBs. |
| Medium | Up to 1 TB (most common). |
| Large | 1-10 TB. |
The default training cache is 1 TB.
[Contact support](mailto:support@baseten.co) to increase the cache size for larger datasets.
## Data security
Data transfer happens within Baseten's VPC using secure connections.
Baseten doesn't share customer data across tenants and maintains a zero data retention policy.
For self-hosted deployments, training can use storage buckets in your own AWS or GCP account.
## Storage performance
Read and write speeds vary by cluster and storage configuration:
| Storage type | Write speed | Read speed |
| -------------- | ------------------- | ------------------- |
| Node storage | 1.2-1.8 GB/s | 1.7-2.1 GB/s |
| Training cache | 340 MB/s - 1.0 GB/s | 470 MB/s - 1.6 GB/s |
For workloads with high I/O requirements or large storage requirements, [contact support](mailto:support@baseten.co).
## Next steps
* **[BDN configuration reference](/development/model/bdn#configuration-reference)**: Full list of weight source options, authentication methods, and supported URI schemes.
* **[Cache](/training/concepts/cache)**: Persist data between jobs and speed up training iterations.
* **[Checkpointing](/training/concepts/checkpointing)**: Save and manage model checkpoints during training.
* **[Multinode training](/training/concepts/multinode)**: Scale training across multiple nodes with shared cache access.
# Serving your trained model
Source: https://docs.baseten.co/training/deployment
How to deploy checkpoints from Baseten Training jobs as usable models.
Baseten Training seamlessly integrates with Baseten's model deployment capabilities. Once your `TrainingJob` has produced model checkpoints, you can deploy them as fully operational model endpoints.
**This feature works with Hugging Face-compatible LLMs**, letting you deploy fine-tuned language models directly from your training checkpoints with a single command.
For optimized inference performance with TensorRT-LLM, BEI, and the Baseten Inference Stack, see [Deploy checkpoints with Engine Builder](/engines/performance-concepts/deployment-from-training-and-s3).
To deploy checkpoints, first ensure you have a `TrainingJob` that's running with a `checkpointing_config` enabled.
```python theme={"system"}
runtime = definitions.Runtime(
start_commands=[
"/bin/sh -c './run.sh'",
],
checkpointing_config=definitions.CheckpointingConfig(
enabled=True,
),
)
```
In your training code or configuration, ensure that your checkpoints are being written to the checkpointing directory, which can be referenced via [`$BT_CHECKPOINT_DIR`](/reference/sdk/training#baseten-provided-environment-variables).
The contents of this directory are uploaded to Baseten's storage and made immediately available for deployment.
You can optionally specify a `checkpoint_path` in your `checkpointing_config` if you prefer to write to a specific directory. The default location is `/tmp/training_checkpoints`.
To deploy your checkpoint(s) as a `Deployment`, you can:
### CLI Deployment
```bash theme={"system"}
truss train deploy_checkpoints [OPTIONS]
```
**Options:**
| Option | Type | Description |
| --------------------------- | ---- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `--job-id` | TEXT | Job ID to deploy checkpoints from. If not specified, deploys from the most recent training job. |
| `--project` | TEXT | Project name or project ID. |
| `--project-id` | TEXT | Project ID. |
| `--config` | TEXT | Path to a Python file that defines a `DeployCheckpointsConfig` (see [Advanced CLI Deployment](#advanced-cli-deployment)). |
| `--dry-run` | FLAG | Generate a Truss config without deploying. Useful for inspecting or customizing the config before deployment. |
| `--truss-config-output-dir` | TEXT | Path to output the Truss config to. Defaults to `truss_configs/_`, or `truss_configs/dry_run_` when using `--dry-run`. |
| `--remote` | TEXT | Remote to use. |
This will deploy the most recent checkpoint from your training job as an inference endpoint.
### UI Deployment
You can also deploy checkpoints directly from the Baseten UI: open the dropdown menu on your completed training job and select **Deploy** on your chosen checkpoint.
### Advanced CLI Deployment
You can also:
* run `truss train deploy_checkpoints [--job-id ]` and follow the setup wizard.
* define an instance of a `DeployCheckpointsConfig` class (helpful for small changes the wizard doesn't cover) and run `truss train deploy_checkpoints --config `.
When `deploy_checkpoints` is run, `truss` will construct a deployment `config.yml` and store it on disk. By default, the config is written to `truss_configs/_`. You can control the output location with `--truss-config-output-dir`.
To inspect or customize the config before deploying, use `--dry-run` to generate the config without deploying:
```bash theme={"system"}
truss train deploy_checkpoints --job-id --dry-run
```
If you'd like to modify the resulting deployment config, you can copy it into a permanent directory and customize it as needed.
This file defines the source of truth for the deployment and can be deployed independently via `truss push`. See [deployments](../deployment/deployments) for more details.
After successful deployment, your model will be deployed on Baseten, where you can run inference requests and evaluate performance. See [Calling Your Model](/inference/calling-your-model) for more details.
To download the files you saved to the checkpointing directory, or to inspect their structure, run `truss train get_checkpoint_urls [--job-id=]` to get a JSON file containing presigned URLs for each checkpoint file.
The JSON file contains the following structure:
```json theme={"system"}
{
"timestamp": "2025-06-23T13:44:16.485905+00:00",
"job": {
"id": "03yv1l3",
"created_at": "2025-06-18T14:30:30.480Z",
"current_status": "TRAINING_JOB_COMPLETED",
"error_message": null,
"instance_type": {
"id": "H200:2x8x128x1600",
"name": "H200:2x8x128x1600 - 2 Nodes of 8 H200 GPUs, 1128 GiB VRAM, 128 vCPUs, 1600 GiB RAM",
"memory_limit_mib": 1650000,
"millicpu_limit": 127900,
"gpu_count": 8,
"gpu_type": "H200",
"gpu_memory_limit_mib": 1155072
},
"updated_at": "2025-06-18T14:30:30.510Z",
"training_project_id": "lqz9o34",
"training_project": {
"id": "lqz9o34",
"name": "checkpointing"
}
},
"checkpoint_artifacts": [
{
"url": "https://bt-training-eqwnwwp-f815d6cd-19bf-4589-bfcb-da76cd8432c0.s3.amazonaws.com/training_projects/lqz9o34/jobs/03yv1l3/rank-0/checkpoint-24/tokenizer_config.json?AWSAccessKeyId=AKIARLZO4BEQO4Q2A5NH&Signature=0vdzJf0686wNE1d9bm4%2Bw9ik5lY%3D&Expires=1751291056",
"relative_file_name": "checkpoint-24/tokenizer_config.json",
"node_rank": 0
}
...
]
}
```
**Important notes about the presigned URLs:**
* The presigned URLs expire **7 days** after generation.
* These URLs are primarily intended for **evaluation and testing purposes**, not for long-term inference deployments.
* For production deployments, consider copying the checkpoint files to your Truss model directory and downloading them in the model's `load()` function.
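For evaluation, a sketch that downloads every file in the manifest into a local directory mirroring the `rank-<node_rank>` layout (the `fetch` parameter defaults to a plain urllib download and is swappable; the manifest field names match the JSON structure above):

```python
import json
import os
import urllib.request

def download_checkpoint_files(manifest_path, dest_dir, fetch=urllib.request.urlretrieve):
    """Download each artifact from a get_checkpoint_urls JSON manifest.

    Files land at dest_dir/rank-<node_rank>/<relative_file_name>.
    """
    with open(manifest_path) as f:
        manifest = json.load(f)
    downloaded = []
    for artifact in manifest["checkpoint_artifacts"]:
        local = os.path.join(
            dest_dir, f"rank-{artifact['node_rank']}", artifact["relative_file_name"]
        )
        os.makedirs(os.path.dirname(local), exist_ok=True)
        fetch(artifact["url"], local)  # presigned URLs expire after 7 days
        downloaded.append(local)
    return downloaded
```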
## Complex and custom use cases
* Custom Model Architectures
* Weights Sharded Across Nodes
Examine the structure of your files with `truss train get_checkpoint_urls --job-id=`. If a file looks like this:
```json theme={"system"}
{
"url": "https://bt-training-eqwnwwp-f815d6cd-19bf-4589-bfcb-da76cd8432c0.s3.amazonaws.com/training_projects/lqz9o34/jobs/03yv1l3/rank-4/checkpoint-10/weights.safetensors?AWSAccessKeyId=AKIARLZO4BEQO4Q2A5NH&Signature=0vdzJf0686wNE1d9bm4%2Bw9ik5lY%3D&Expires=1751291056",
"relative_file_name": "checkpoint-10/weights.safetensors",
"node_rank": 4
}
```
In your Truss configuration, add a section like the following. Wildcards: `*` matches an arbitrary number of characters, while `?` matches exactly one.
```yaml theme={"system"}
training_checkpoints:
download_folder: /tmp/training_checkpoints
artifact_references:
- training_job_id:
paths:
- rank-*/checkpoint-10/ # Pull in all the files for checkpoint-10 across all nodes
```
When your model pod starts up, you can read the file from the path `/tmp/training_checkpoints/rank-[node-rank]/[relative_file_name]`. For the example above, the file can be read from:
```
/tmp/training_checkpoints/rank-4/checkpoint-10/weights.safetensors
```
# Get started
Source: https://docs.baseten.co/training/getting-started
Run your first training job and deploy it to production.
Baseten Training runs your training code on managed cloud GPUs. You bring your
own framework, point it at a GPU type, and submit. Baseten handles provisioning,
syncs checkpoints as they're saved, and deploys any checkpoint as a production
endpoint in one command.
This tutorial fine-tunes Qwen3-4B with LoRA on a single H100, from job
submission to calling the deployed model.
You'll set up a project directory, define your infrastructure in a configuration file, and write the training scripts that run on an H100.
## Prerequisites
* **Baseten account**: [Sign up for Baseten](https://app.baseten.co/).
* **API key**: Generate an API key from [Settings > API keys](https://app.baseten.co/settings/account/api_keys).
* **Truss**: Install Truss and log in:
[uv](https://docs.astral.sh/uv/) is a fast Python package manager.
```bash theme={"system"}
uv venv && source .venv/bin/activate
uv pip install truss
truss login
```
```bash theme={"system"}
python -m venv .venv && source .venv/bin/activate
pip install --upgrade truss
truss login
```
```bash theme={"system"}
python -m venv .venv && .venv\Scripts\activate
pip install --upgrade truss
truss login
```
## Create your training project
```bash theme={"system"}
mkdir my-training-project && cd my-training-project
```
### Write your configuration file
Your configuration file uses the `truss_train` library to define your training
infrastructure as Python objects:
* [`TrainingProject`](/reference/sdk/training#trainingproject): the top-level container for your project.
* [`TrainingJob`](/reference/sdk/training#trainingjob): a single job within a project, combining:
* [`Image`](/reference/sdk/training#image): what container to run.
* [`Compute`](/reference/sdk/training#compute): what hardware to provision.
* [`Runtime`](/reference/sdk/training#runtime): how to start training and what to persist.
This is the file Baseten reads when you submit a job. It tells the platform
which GPU to provision, which container image to use, and where to sync
checkpoints.
Create `config.py`:
```python config.py theme={"system"}
from truss_train import (
TrainingProject,
TrainingJob,
Image,
Compute,
Runtime,
CacheConfig,
CheckpointingConfig,
)
from truss.base.truss_config import AcceleratorSpec
BASE_IMAGE = "pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime"
training_runtime = Runtime(
start_commands=[
"chmod +x ./run.sh && ./run.sh",
],
cache_config=CacheConfig(enabled=True),
checkpointing_config=CheckpointingConfig(enabled=True),
)
training_compute = Compute(
accelerator=AcceleratorSpec(accelerator="H100", count=1),
)
training_job = TrainingJob(
image=Image(base_image=BASE_IMAGE),
compute=training_compute,
runtime=training_runtime,
)
training_project = TrainingProject(
name="qwen3-4b-lora-sft",
job=training_job,
)
```
`CacheConfig` avoids re-downloading models and datasets between jobs.
`CheckpointingConfig` tells Baseten to sync your saved checkpoints so you can
deploy them later.
### Write your training scripts
Create `run.sh` to install dependencies and launch training. This tutorial uses
`pip install` in the start command, but you can also pre-install dependencies in
a [custom base image](/training/concepts/basics#base-images).
```bash run.sh theme={"system"}
#!/bin/bash
set -eux
pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0" "datasets"
python train.py
```
Your `train.py` is your own training code. Baseten runs it as-is, so you can use
any framework or training loop that works locally. In this example, we'll
fine-tune [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) on the
[pirate-ultrachat-10k](https://huggingface.co/datasets/winglian/pirate-ultrachat-10k)
dataset using LoRA with [TRL](https://huggingface.co/docs/trl) (Transformer
Reinforcement Learning). The dataset teaches the model to respond in pirate
dialect, so you'll know fine-tuning worked when the deployed model starts saying
"Ahoy, matey!"
```python train.py theme={"system"}
import os
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer
MODEL_ID = "Qwen/Qwen3-4B"
DATASET_ID = "winglian/pirate-ultrachat-10k"
dataset = load_dataset(DATASET_ID, split="train")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
torch_dtype=torch.bfloat16,
device_map="auto",
use_cache=False,
)
peft_config = LoraConfig(
r=8,
lora_alpha=16,
target_modules="all-linear",
lora_dropout=0.05,
task_type="CAUSAL_LM",
)
training_args = SFTConfig(
learning_rate=2e-4,
num_train_epochs=1,
max_steps=50,
logging_steps=5,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
gradient_checkpointing=True,
max_length=1024,
warmup_ratio=0.1,
lr_scheduler_type="cosine",
save_steps=25,
bf16=True,
output_dir=os.getenv("BT_CHECKPOINT_DIR", "./checkpoints"),
)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
processing_class=tokenizer,
peft_config=peft_config,
)
trainer.train()
trainer.save_model(training_args.output_dir)
print(f"Training complete. Model saved to {training_args.output_dir}")
```
Save checkpoints to `$BT_CHECKPOINT_DIR` so Baseten can sync and deploy them.
Baseten sets this variable automatically when checkpointing is enabled.
With `save_steps=25` and `max_steps=50`, the trainer saves LoRA checkpoints at
steps 25 and 50.
## Submit your training job
Now that your project is set up, submit your training job. The CLI packages your
files, creates the training project, and starts the job on your specified GPU.
```bash theme={"system"}
truss train push config.py
```
You'll see:
```output theme={"system"}
✨ Training job successfully created!
🪵 View logs for your job via 'truss train logs --job-id --tail'
🔍 View metrics for your job via 'truss train metrics --job-id '
🌐 View job in the UI: https://app.baseten.co/training//logs/
```
Copy the `job_id` to use in the next steps.
## Monitor your training job
Tail logs in real time with the job ID from the previous step.
```bash theme={"system"}
truss train logs --job-id --tail
```
You can also view logs, metrics, and job status in the [Baseten dashboard](https://app.baseten.co/training/).
## Deploy your trained model
When training finishes, Baseten syncs your checkpoints automatically. You'll see:
```output theme={"system"}
Training complete. Model saved to /mnt/ckpts
Job has exited. Syncing checkpoints...
```
Deploy your checkpoint to Baseten's inference platform. The deployment downloads
the base model weights and serves them with your LoRA adapter using vLLM. This
step requires `hf_access_token` in [Baseten Secrets](/organization/secrets)
because the serving layer downloads the base model separately.
```bash theme={"system"}
truss train deploy_checkpoints
```
Follow the interactive prompts to select a checkpoint, name your model, and choose a GPU.
```output theme={"system"}
Fetching checkpoints for training job ...
? Use spacebar to select/deselect checkpoints to deploy.
○ .
○ checkpoint-50
❯ ○ checkpoint-25
? Enter the model name for your deployment: my-fine-tuned-model
? Select the GPU type to use for deployment: H100
? Select the number of H100 GPUs to use for deployment: 1
? Enter the huggingface secret name: hf_access_token
Successfully created model version: deployment-1
Model version ID:
```
Deploy from the [Baseten dashboard](https://app.baseten.co/training/):
1. Select your training job.
2. Open the **Checkpoints** tab and choose a checkpoint.
3. Click **Deploy** and configure your model name, instance type, and scaling settings.
### Test your deployment
Call your deployed model using the OpenAI-compatible chat format. The `model` field matches the checkpoint name you selected during deployment.
```bash theme={"system"}
export BASETEN_API_KEY="paste-your-api-key-here"
curl -X POST https://model-.api.baseten.co/v1/chat/completions \
-H "Authorization: Api-Key $BASETEN_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model": "checkpoint-25", "messages": [{"role": "user", "content": "What is the best way to learn Python programming?"}]}'
```
```python theme={"system"}
from openai import OpenAI
client = OpenAI(
api_key="YOUR-API-KEY",
base_url="https://model-w7pdg4yw.api.baseten.co/environments/production/sync/v1"
)
response = client.chat.completions.create(
model="checkpoint-50",
messages=[{"role": "user", "content": "What is the best way to learn Python programming?"}],
)
print(response.choices[0].message.content)
```
```bash theme={"system"}
truss predict --model --data '{"model": "checkpoint-25", "messages": [{"role": "user", "content": "What is the best way to learn Python programming?"}]}'
```
The fine-tuned model responds in pirate dialect, confirming that the LoRA adapter is active:
```output theme={"system"}
Ahoy there matey! Seeking knowledge of Python programming? Well, it's a
treasure trove, but it takes patience and practice to find the gold...
```
## Next steps
* [Monitor and manage training jobs](/training/management): for logs, metrics, and job lifecycle commands.
* [Training SDK reference](/reference/sdk/training): for all configuration options, including [base images](/reference/sdk/training#image), [secrets](/reference/sdk/training#secretreference), [private registries](/reference/sdk/training#dockerauth), and [`.truss_ignore` syntax](/reference/cli/training/training-cli#ignoring-files-and-folders).
* Browse the [ML Cookbook](https://github.com/basetenlabs/ml-cookbook): for framework examples and [advanced recipes](https://github.com/basetenlabs/ml-cookbook/tree/main/recipes).
# Interactive sessions (rSSH)
Source: https://docs.baseten.co/training/interactive-sessions
Connect to training containers for remote debugging and development via VS Code or Cursor Remote Tunnels.
Interactive sessions use rSSH (remote SSH) to connect your local IDE to a
training container. Unlike traditional SSH, rSSH doesn't require SSH keys, open
ports, or direct network access. Instead, it uses
[VS Code Remote Tunnels](https://code.visualstudio.com/docs/remote/tunnels) or
Cursor's equivalent. You authenticate via a device code flow through Microsoft
or GitHub, and the tunnel connects your IDE to the container securely.
Use rSSH to debug a failed training job, inspect state on a running job, or
develop interactively without resubmitting.
## Prerequisites
* **VS Code** or **Cursor** installed locally.
* The **[Remote - Tunnels](https://marketplace.visualstudio.com/items?itemName=ms-vscode.remote-server)** extension installed in your IDE.
* A **Microsoft** or **GitHub** account for device flow authentication.
## Quick start
This walkthrough uses the
[MNIST PyTorch example](https://github.com/basetenlabs/ml-cookbook/tree/main/examples/mnist-pytorch/training)
to push a training job with rSSH enabled, then connects to the container.
### 1. Clone the example
Clone the ml-cookbook and navigate to the MNIST training example:
```bash theme={"system"}
git clone https://github.com/basetenlabs/ml-cookbook.git
cd ml-cookbook/examples/mnist-pytorch/training
```
### 2. Configure and push the job
Add an interactive session to your `config.py`:
```python config.py theme={"system"}
from truss_train import TrainingProject, TrainingJob, Image, Compute, Runtime
from truss_train.definitions import (
InteractiveSession,
InteractiveSessionTrigger,
InteractiveSessionAuthProvider,
)
from truss.base.truss_config import AcceleratorSpec
training_job = TrainingJob(
image=Image(base_image="pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime"),
compute=Compute(
accelerator=AcceleratorSpec(accelerator="H200", count=1),
),
runtime=Runtime(
start_commands=["python train.py"],
),
interactive_session=InteractiveSession(
trigger=InteractiveSessionTrigger.ON_STARTUP,
auth_provider=InteractiveSessionAuthProvider.MICROSOFT, # You can also use GITHUB
),
)
training_project = TrainingProject(name="mnist-training", job=training_job)
```
Push the job:
```bash theme={"system"}
truss train push config.py
```
Once the job is running, retrieve the auth code using `truss train isession`:
```bash theme={"system"}
truss train isession --job-id <job_id>
```
The expected output will look similar to this:
```
Interactive Sessions for Job:
Replica ID Tunnel Name Auth Code Auth URL Generated At (Local)
r0 bt-session--0 AB12-CD34 https://login.microsoftonline.com/… 14:30:00 PST
```
You can also view this table in `truss train logs --job-id <job_id> --tail`, where it auto-refreshes every 30 seconds alongside your training logs.
### 3. Authenticate and connect
Connecting to the tunnel relies on the
[Remote - Tunnels](https://marketplace.visualstudio.com/items?itemName=ms-vscode.remote-server)
extension in your IDE.
1. Open the **Auth URL** from the table in your browser.
2. Enter the **Auth Code** shown in the table.
3. Connect to the tunnel in your IDE:
1) Open the command palette (`Cmd+Shift+P` on macOS, `Ctrl+Shift+P` on Windows/Linux).
2) Select **Remote-Tunnels: Connect to Tunnel**.
3) Select the tunnel named `bt-session-<job_id>-<node_rank>` (for example, `bt-session-abc123-0`).
**One-time setup:**
Cursor doesn't include the required tunnel extensions by default.
You'll need to sideload them from VS Code.
1. In **VS Code**, find the following extensions in the marketplace and download each as a VSIX file using the **Download VSIX** option on the extension page:
* [Remote Explorer](https://marketplace.visualstudio.com/items?itemName=ms-vscode.remote-explorer) by Microsoft
* [Remote - Tunnels](https://marketplace.visualstudio.com/items?itemName=ms-vscode.remote-server) by Microsoft
2. In **Cursor**, open the command palette (`Cmd+Shift+P`) and select **Extensions: Install from VSIX...**. Install both VSIX files.
3. Open the command palette and select **Remote Explorer: Focus on Remotes (Tunnels/SSH) View**.
4. Click **Sign in to tunnels registered with Microsoft** and complete the authentication flow.
**Connect to the tunnel:**
1. Open the command palette (`Cmd+Shift+P`).
2. Select **Remote-Tunnels: Connect to Tunnel**.
3. Choose **Microsoft** when prompted.
4. Select the tunnel named `bt-session-<job_id>-<node_rank>` (for example, `bt-session-abc123-0`).
Open your workspace to the desired folder path (typically `/app` or `/workspace`) to start debugging, editing your training script, or running commands.
## Trigger modes
The trigger mode controls when the rSSH session's container stays alive for interactive use.
Baseten generates the tunnel and auth code for all modes. The trigger determines the session *lifecycle*:
| Mode | When to use | Behavior |
| ------------ | -------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------- |
| `on_startup` | Develop interactively, run commands, test code while training runs. | Session is active from job start. Your `start_commands` still run alongside the session. |
| `on_failure` | Debug a failing training run. Your most common choice for production jobs. | Session activates when training exits with a non-zero exit code. The container stays alive for you to inspect the failure. |
| `on_demand` | Decide later whether you need a session. This is the default. | Session activates when you authenticate through the device code flow, or when you change the trigger on a running job. |
Auth codes appear in `truss train isession` as soon as the tunnel starts, regardless of trigger mode.
With `on_failure`, the container stays alive for interactive use only after training fails.
With `on_demand`, the container stays alive only after you authenticate or explicitly change the trigger.
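The lifecycle rules in the table can be summarized as a small decision function. This is an illustrative sketch of the documented behavior, not Baseten's implementation; only the trigger names come from the table above.

```python
def session_active(trigger: str, training_failed: bool = False,
                   authenticated: bool = False) -> bool:
    """Whether the container stays alive for interactive use.

    Illustrative only: encodes the trigger-mode table, not Baseten internals.
    """
    if trigger == "on_startup":
        return True  # active from job start
    if trigger == "on_failure":
        return training_failed  # alive only after a non-zero exit code
    if trigger == "on_demand":
        return authenticated  # alive once you complete the device code flow
    raise ValueError(f"unknown trigger: {trigger}")
```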
### Activating an on-demand session
If you pushed a job with `on_demand` (the default), activate the session by completing the device code flow: open the **Auth URL** and enter the **Auth Code** from `truss train isession`.
You can also activate the session by changing the trigger on a running job:
```bash theme={"system"}
truss train update_session --trigger on_startup
```
## Configuration
Configure interactive sessions with CLI flags or the Python SDK.
CLI flags override SDK values when both are set.
Pass `--interactive` to [`truss train push`](/reference/cli/training/training-cli#push) with a [trigger mode](#trigger-modes):
```bash theme={"system"}
truss train push config.py \
--interactive on_startup \
--interactive-timeout-minutes 120
```
Set `timeout_minutes` to `-1` to extend the session expiry to 10 years. See [Timeout and expiry](#timeout-and-expiry) for details.
See the [CLI reference](/reference/cli/training/training-cli#push) for all `push` options.
Add an [`InteractiveSession`](/reference/sdk/training#interactivesession) to the `interactive_session` field on your [`TrainingJob`](/reference/sdk/training#trainingjob):
```python config.py theme={"system"}
from truss_train import TrainingProject, TrainingJob, Image, Compute, Runtime
from truss_train.definitions import (
InteractiveSession,
InteractiveSessionTrigger,
InteractiveSessionAuthProvider,
InteractiveSessionProvider,
)
from truss.base.truss_config import AcceleratorSpec
training_job = TrainingJob(
image=Image(base_image="pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime"),
compute=Compute(
accelerator=AcceleratorSpec(accelerator="H200", count=2),
),
runtime=Runtime(
start_commands=["chmod +x ./run.sh && ./run.sh"],
),
interactive_session=InteractiveSession(
trigger=InteractiveSessionTrigger.ON_FAILURE,
timeout_minutes=480,
auth_provider=InteractiveSessionAuthProvider.MICROSOFT, # You can also use GITHUB
session_provider=InteractiveSessionProvider.VS_CODE, # You can also use CURSOR
),
)
training_project = TrainingProject(name="my-training-project", job=training_job)
```
See the [SDK reference](/reference/sdk/training#interactivesession) for all `InteractiveSession` fields.
## Session management
### View session status
Check auth codes and connection status:
```bash theme={"system"}
truss train isession --job-id
```
### Monitor with live logs
The `--tail` flag displays a live view with the session table pinned at the top and training logs streaming below:
```bash theme={"system"}
truss train logs --job-id --tail
```
## Timeout and expiry
Sessions expire based on the `timeout_minutes` setting (default: 480 minutes, or 8 hours). Set `timeout_minutes` to `-1` to extend the expiry to 10 years.
1. When the tunnel starts successfully, Baseten sets the expiry to `now + timeout_minutes`.
2. Each time the tunnel reconnects, the expiry resets to `now + timeout_minutes`.
3. When the expiry passes, the session ends and the container shuts down.
The timeout resets on tunnel reconnection, not on general IDE activity.
If you disconnect and reconnect, the timer resets.
If you stay connected but idle, the session expires after the configured timeout.
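The reset behavior can be sketched as a pure function of the last tunnel (re)connect time. This is illustrative; the 10-year extension for `-1` is approximated here as 3,650 days.

```python
from datetime import datetime, timedelta

def next_expiry(connected_at: datetime, timeout_minutes: int = 480) -> datetime:
    """Session expiry after a tunnel start or reconnect (illustrative sketch)."""
    if timeout_minutes == -1:
        return connected_at + timedelta(days=3650)  # documented "10 years"
    return connected_at + timedelta(minutes=timeout_minutes)
```

Each reconnect evaluates this with a fresh `connected_at`; staying connected but idle never does, which is why an idle session still expires after the configured timeout.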
### What happens when a session expires
When a session expires, Baseten signals the container to shut down gracefully.
Baseten doesn't hard-kill the container. It receives the signal and exits cleanly.
Baseten preserves any files you saved to `$BT_CHECKPOINT_DIR`, but you lose unsaved work in the container's local filesystem.
## Multi-node sessions
For [multi-node training jobs](/training/concepts/multinode), Baseten creates one rSSH session per node.
Each node gets its own auth code, and you connect to each node independently.
Tunnel names follow the format `bt-session-<job_id>-<node_rank>`, where `node_rank` starts at 0. For example, a 2-node job produces:
* `bt-session-abc123-0` (node 0)
* `bt-session-abc123-1` (node 1)
The `truss train isession` command displays auth codes for all nodes in a single table.
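For scripting against multi-node jobs, the naming scheme is easy to reproduce. A trivial helper matching the examples above:

```python
def tunnel_name(job_id: str, node_rank: int) -> str:
    # Matches the documented format: bt-session-<job_id>-<node_rank>.
    return f"bt-session-{job_id}-{node_rank}"
```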
# Lifecycle
Source: https://docs.baseten.co/training/lifecycle
Understanding the different states and transitions in a Baseten training job's lifecycle.
A training job in Baseten progresses through several states from creation to completion. Understanding these states helps you monitor and manage your training jobs effectively.
## Job states
| State | Description | Active | Terminal |
| ---------------------------- | --------------------------------------------------------------------------------------------------------------------------- | ------ | -------- |
| `TRAINING_JOB_CREATED` | Initial state when a job is first created. Baseten has received the training configuration and persisted it to our records. | ✅ | |
| `TRAINING_JOB_DEPLOYING` | Baseten is deploying the job, including provisioning compute resources and installing dependencies. | ✅ | |
| `TRAINING_JOB_RUNNING` | The training code is actively executing. | ✅ | |
| `TRAINING_JOB_COMPLETED` | The job has successfully finished execution. Any checkpoints or artifacts have been saved and uploaded. | | ✅ |
| `TRAINING_JOB_DEPLOY_FAILED` | The job failed to deploy. This is likely due to a bad image or a resource allocation issue. | | ✅ |
| `TRAINING_JOB_FAILED` | The job encountered an error and could not complete successfully. Check the logs for error details. | | ✅ |
| `TRAINING_JOB_STOPPED` | The job was manually stopped by a user. | | ✅ |
## State transitions
Jobs typically progress through states in the following order:
1. `TRAINING_JOB_CREATED` → `TRAINING_JOB_DEPLOYING`: Automatic transition once resources are allocated
2. `TRAINING_JOB_DEPLOYING` → `TRAINING_JOB_RUNNING`: Automatic transition once environment setup is complete
3. `TRAINING_JOB_RUNNING` → `TRAINING_JOB_COMPLETED`: Automatic transition upon successful completion
A job may enter `TRAINING_JOB_FAILED` from any state if an error occurs. Similarly, `TRAINING_JOB_STOPPED` can be entered from any active state (`DEPLOYING` or `RUNNING`) when manually stopped.
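For scripts that poll job status, the states and transitions above can be encoded as a simple map. This is an illustrative sketch of the documented behavior, not Baseten's internal state machine:

```python
TERMINAL = {
    "TRAINING_JOB_COMPLETED",
    "TRAINING_JOB_DEPLOY_FAILED",
    "TRAINING_JOB_FAILED",
    "TRAINING_JOB_STOPPED",
}

# Happy-path transitions plus the documented failure/stop edges.
TRANSITIONS = {
    "TRAINING_JOB_CREATED": {"TRAINING_JOB_DEPLOYING", "TRAINING_JOB_FAILED"},
    "TRAINING_JOB_DEPLOYING": {
        "TRAINING_JOB_RUNNING",
        "TRAINING_JOB_DEPLOY_FAILED",
        "TRAINING_JOB_FAILED",
        "TRAINING_JOB_STOPPED",
    },
    "TRAINING_JOB_RUNNING": {
        "TRAINING_JOB_COMPLETED",
        "TRAINING_JOB_FAILED",
        "TRAINING_JOB_STOPPED",
    },
}

def is_terminal(state: str) -> bool:
    return state in TERMINAL

def can_transition(src: str, dst: str) -> bool:
    return dst in TRANSITIONS.get(src, set())
```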
You can monitor these state transitions using the CLI command:
```bash theme={"system"}
truss train view # shows all active jobs
truss train view --job-id <job_id> # shows a specific job
```
Or track a specific job's progress with:
```bash theme={"system"}
truss train logs --job-id <job_id> --tail
```
# Loading checkpoints
Source: https://docs.baseten.co/training/loading
Resume training from existing checkpoints to continue where you left off.
Checkpoint loading lets you resume training from previously saved model states. When enabled, Baseten automatically downloads your specified checkpoints to the training environment before your training code starts.
**Use cases:**
* Resume failed training jobs.
* Incremental training and fine-tuning.
## Accessing downloaded checkpoints
Checkpoints are available through the `BT_LOAD_CHECKPOINT_DIR` environment variable. For single-node training, they're located in `BT_LOAD_CHECKPOINT_DIR/rank-0/`. For multi-node training, each node's checkpoints are in `BT_LOAD_CHECKPOINT_DIR/rank-<node_rank>/`.
## Checkpoint reference
Create references to checkpoints using the `BasetenCheckpoint` factory:
### From latest
```python theme={"system"}
# Load the latest checkpoint from a project
BasetenCheckpoint.from_latest_checkpoint(project_name="my-training-project")
# Load the latest checkpoint from a previous job
BasetenCheckpoint.from_latest_checkpoint(job_id="gvpql31")
```
**Parameters:**
* `project_name`: Load the latest checkpoint from the most recent job in this project.
* `job_id`: Load the latest checkpoint from this specific job.
* Both parameters: Load the latest checkpoint from that specific job in that project.
### From named
```python theme={"system"}
# Pin your starting point to a specific checkpoint
BasetenCheckpoint.from_named_checkpoint(checkpoint_name="checkpoint-20", job_id="gvpql31")
```
**Parameters:**
* `checkpoint_name`: The name of the specific checkpoint to load.
* `job_id`: The job that contains the named checkpoint.
* Both parameters together pin loading to that exact checkpoint from that specific job.
## Configuration examples
Here are practical examples of how to configure checkpoint loading in your training jobs:
### From latest
```python theme={"system"}
# Latest checkpoint from project
load_config = LoadCheckpointConfig(
enabled=True,
checkpoints=[
BasetenCheckpoint.from_latest_checkpoint(project_name="gpt-finetuning")
]
)
# Latest checkpoint from specific job
load_config = LoadCheckpointConfig(
enabled=True,
checkpoints=[
BasetenCheckpoint.from_latest_checkpoint(job_id="gvpql31")
]
)
```
### From named
```python theme={"system"}
# Specific named checkpoint
load_config = LoadCheckpointConfig(
enabled=True,
checkpoints=[
BasetenCheckpoint.from_named_checkpoint(
checkpoint_name="checkpoint-20",
job_id="gvpql31"
)
]
)
# Named checkpoint with custom download location
load_config = LoadCheckpointConfig(
enabled=True,
download_folder="/tmp/my_checkpoints",
checkpoints=[
BasetenCheckpoint.from_named_checkpoint(
checkpoint_name="checkpoint-20",
job_id="rwnojdq"
)
]
)
```
**Configuration parameters:**
* `enabled`: Set to `True` to enable checkpoint loading.
* `checkpoints`: List containing checkpoint references.
* `download_folder`: Optional custom download location (defaults to `/tmp/loaded_checkpoints`).
## Complete TrainingJob setup
```python theme={"system"}
from truss_train import LoadCheckpointConfig, BasetenCheckpoint, CheckpointingConfig, TrainingJob, Image, Runtime, TrainingProject
from truss_train.definitions import CacheConfig
# Configure checkpoint loading
load_checkpoint_config = LoadCheckpointConfig(
enabled=True,
download_folder="/tmp/loaded_checkpoints",
checkpoints=[
BasetenCheckpoint.from_latest_checkpoint(job_id="previous_job_id")
]
)
# Configure checkpointing for saving new checkpoints
checkpointing_config = CheckpointingConfig(
enabled=True,
checkpoint_path="/tmp/training_checkpoints"
)
# Create TrainingJob
job = TrainingJob(
image=Image(base_image="your-base-image"),
runtime=Runtime(
checkpointing_config=checkpointing_config,
load_checkpoint_config=load_checkpoint_config,
start_commands=["chmod +x ./run.sh && ./run.sh"],
cache_config=CacheConfig(enabled=True)
),
)
project = TrainingProject(name="my-training-project", job=job)
```
## Using checkpoints in your training code
Access loaded checkpoints using the `BT_LOAD_CHECKPOINT_DIR` environment variable:
```python theme={"system"}
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer
from transformers.trainer_utils import get_last_checkpoint
import os
def train():
checkpoint_dir = os.environ.get("BT_LOAD_CHECKPOINT_DIR")
last_checkpoint = None
if checkpoint_dir:
last_checkpoint = get_last_checkpoint(checkpoint_dir)
if last_checkpoint:
print(f"✅ Resuming from checkpoint: {last_checkpoint}")
model = AutoModelForSequenceClassification.from_pretrained(last_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
else:
print("⚠️ No checkpoint found, starting from scratch")
model = AutoModelForSequenceClassification.from_pretrained("your-base-model")
tokenizer = AutoTokenizer.from_pretrained("your-base-model")
else:
print("ℹ️ No checkpoint loading configured")
model = AutoModelForSequenceClassification.from_pretrained("your-base-model")
tokenizer = AutoTokenizer.from_pretrained("your-base-model")
training_args = TrainingArguments(
output_dir=os.environ.get("BT_CHECKPOINT_DIR", "/tmp/training_checkpoints"),
save_strategy="steps",
save_steps=1000,
load_best_model_at_end=True,
)
trainer = Trainer(model=model, args=training_args)
trainer.train(resume_from_checkpoint=last_checkpoint)
```
# Management
Source: https://docs.baseten.co/training/management
How to monitor, manage, and interact with your Baseten Training projects and jobs.
Once you've submitted training jobs, Baseten provides tools to manage your `TrainingProject`s and individual `TrainingJob`s. You can use the [CLI](/reference/cli/training/training-cli) or the [API](/reference/training-api/overview) to manage your jobs.
## `TrainingProject` management
* **Listing Projects:** To view all your training projects:
```bash theme={"system"}
truss train view
```
This command lists all `TrainingProject`s you have access to, typically showing their names and IDs, along with all active jobs.
* **Viewing Jobs within a Project:** To see all jobs associated with a specific project, use its project ID (obtained when creating the project or from `truss train view`):
```bash theme={"system"}
truss train view --project <project_id>
```
* **Deleting a `TrainingProject`:** You can delete a training project via the API, or through the dashboard.
Using the API:
```bash theme={"system"}
curl -X DELETE https://api.baseten.co/v1/training_projects/<project_id> \
-H "Authorization: Api-Key YOUR_API_KEY"
```
From the Baseten dashboard:
1. Select the training project you want to delete.
2. Type the project name (for example, `demo/qwen3-0.6b`) to confirm.
3. Select **Delete**.
When you delete a project, the following data is permanently deleted with no archival or recovery option:
* All undeployed [checkpoints](/training/concepts/checkpointing) from every job in the project
* All data in the project's [training cache](/training/concepts/cache) (`$BT_PROJECT_CACHE_DIR`)
Checkpoints that have been [deployed](/training/deployment) aren't affected.
## `TrainingJob` management
After submitting a job with `truss train push config.py`, you receive a `project_id` and `job_id`.
* **Listing Jobs:** As shown above, you can list all jobs within a project using:
```bash theme={"system"}
truss train view --project <project_id>
```
This will typically show job IDs, statuses, creation times, etc.
* **Checking Status and Retrieving Logs:** To view the logs for a specific job, you can tail them in real-time or fetch existing logs.
* To view logs for the most recently submitted job in the current context (e.g., if you just pushed a job from your current terminal directory):
```bash theme={"system"}
truss train logs --tail
```
* To view logs for a specific job using its `job-id`:
```bash theme={"system"}
truss train logs --job-id <job_id> [--tail]
```
Add `--tail` to follow the logs live.
* **Understanding Job Statuses:**
The `truss train view` and `truss train logs` commands will help you track which status a job is in. For more on the job lifecycle, see the [Lifecycle](/training/lifecycle) page.
* **Stopping a `TrainingJob`:** If you need to stop a running job, use the `stop` command with the job's project ID and job ID:
```bash theme={"system"}
truss train stop --job-id <job_id>
truss train stop --all # stops all active jobs after prompting for confirmation
```
This will transition the job to the `TRAINING_JOB_STOPPED` state.
* **Deleting a `TrainingJob`:** You can delete a training job via the API, or through the dashboard.
Using the API:
```bash theme={"system"}
curl -X DELETE https://api.baseten.co/v1/training_projects/<project_id>/jobs/<job_id> \
-H "Authorization: Api-Key YOUR_API_KEY"
```
From the Baseten dashboard:
1. Select the project containing the job.
2. Select the job you want to delete.
3. Type the job name (for example, `job-2`) to confirm.
4. Select **Delete**.
When you delete a job, all undeployed checkpoints are deleted permanently. There's no archival or recovery option. Checkpoints that have been [deployed](/training/deployment) aren't affected.
* **Understanding Job Outputs & Checkpoints:**
* The primary outputs of a successful `TrainingJob` are model **checkpoints** (if checkpointing is enabled and configured).
* These checkpoints are stored by Baseten. Refer to [Checkpointing](/training/concepts/checkpointing) for how `CheckpointingConfig` works.
* When you are ready to [deploy a model](/training/deployment), you will specify which checkpoints to use. The `model_name` you assign during deployment (via `DeployCheckpointsConfig`) becomes the identifier for this trained model version derived from your specific job's checkpoints.
* You can see the available checkpoints for a job via the [Training API](/reference/training-api/get-training-job-checkpoints).
# Training on Baseten
Source: https://docs.baseten.co/training/overview
Own your intelligence and train custom models with our developer-first training infrastructure.
Baseten provides a flexible training platform that enables you to bring your own training scripts, use the latest training techniques, and fine-tune the newest models.
Train models and serve them in production, all on one platform. Baseten automatically stores your checkpoints during training and makes them ready for deployment. No downloading weights, no re-uploading, no separate infrastructure. Your fine-tuned model goes from checkpoint to production endpoint in a single command.
```bash theme={"system"}
# Train your model
truss train push config.py
# Deploy from the checkpoint
truss train deploy_checkpoints --job-id <job_id>
```
## Train and serve on one platform
The train-to-serve workflow is seamless:
1. **Set up your training project:** Bring any framework or start with a template.
2. **Configure your training job:** Define compute, runtime, and checkpointing settings.
3. **Run on managed infrastructure:** Use H200, H100, or A10G GPUs, single-node or multi-node.
4. **Checkpoints sync automatically:** Baseten stores checkpoints as training progresses.
5. **Deploy your fine-tuned model:** Go from checkpoint to production endpoint in one command.
No infrastructure management. No manual file transfers. Bring any framework (Axolotl, TRL, VeRL, Megatron, or your own training code) and your trained model serves traffic within minutes of training completion.
## Supported frameworks
Baseten Training is framework-agnostic. Use whatever framework fits your workflow.
| Framework | Best for | Example |
| --------- | ------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------- |
| Axolotl | Configuration-driven fine-tuning with LoRA/QLoRA | [oss-gpt-20b-axolotl](https://github.com/basetenlabs/ml-cookbook/tree/main/examples/oss-gpt-20b-axolotl) |
| TRL | SFT, DPO, and GRPO with Hugging Face | [oss-gpt-20b-lora-trl](https://github.com/basetenlabs/ml-cookbook/tree/main/examples/oss-gpt-20b-lora-trl) |
| TRL | LoRA DPO fine-tuning | [qwen3-8b-lora-dpo-trl](https://github.com/basetenlabs/ml-cookbook/tree/main/examples/qwen3-8b-lora-dpo-trl) |
| VeRL | Reinforcement learning with custom rewards | [qwen3-8b-lora-verl](https://github.com/basetenlabs/ml-cookbook/tree/main/examples/qwen3-8b-lora-verl) |
| MS-Swift | Long-context and multilingual training | [qwen3-30b-mswift-multinode](https://github.com/basetenlabs/ml-cookbook/tree/main/examples/qwen3-30b-mswift-multinode) |
Browse the [ML Cookbook](https://github.com/basetenlabs/ml-cookbook) for more examples including multi-node training with FSDP and DeepSpeed.
## Key features
### Checkpoint management
Checkpoints sync automatically to Baseten storage during training. You can:
* **Deploy** any checkpoint as a production endpoint with [`truss train deploy_checkpoints`](/training/deployment).
* **Download** checkpoints for local evaluation and analysis.
* **Resume** from any checkpoint if a job fails or you want to train further.
Learn more about [checkpointing](/training/concepts/checkpointing).
### BDN weight and data loading
Load model weights and training data through [Baseten Delivery Network (BDN)](/training/concepts/storage#load-weights-and-data-with-bdn). Mount weights from Hugging Face, S3, GCS, Azure, R2, or any HTTPS URL directly into your training container with no download code needed. BDN mirrors weights before compute is provisioned and caches them for faster mounting on subsequent jobs.
Learn more about [storage and data ingestion](/training/concepts/storage).
### Persistent caching
Speed up training iterations by caching models, datasets, and preprocessed data between jobs. The cache persists across training runs, so you don't re-download 70B models every time.
Learn more about the [training cache](/training/concepts/cache).
### Multi-node training
Scale training across multiple GPU nodes with InfiniBand networking. Baseten handles node orchestration, communication setup, and environment variables. You just set `node_count` in your configuration.
Learn more about [multi-node training](/training/concepts/multinode).
### Interactive development with rSSH
Debug training jobs interactively with SSH-like access to your training containers via VS Code or Cursor Remote Tunnels.
Connect to a running or failed job, inspect state, and iterate without resubmitting.
Learn more about [interactive sessions](/training/interactive-sessions).
## Next steps
Run your first training job and deploy the result.
Production-ready examples for various frameworks and models.
## Reference
* [CLI reference](/reference/cli/training/training-cli)
* [SDK reference](/reference/sdk/training)
* [API reference](/reference/training-api/overview)
# Deployments
Source: https://docs.baseten.co/troubleshooting/deployments
Troubleshoot common problems during model deployment
## Issue: `truss push` can't find `config.yaml`
```sh theme={"system"}
[Errno 2] No such file or directory: '/Users/philipkiely/Code/demo_docs/config.yaml'
```
### Fix: set correct target directory
The directory `truss push` is looking at is not a Truss. Make sure you're giving `truss push` access to the correct directory by:
* Running `truss push` from the directory containing the Truss. You should see the file `config.yaml` when you run `ls` in your working directory.
* Or passing the target directory as an argument, such as `truss push /path/to/my-truss`.
## Issue: unexpected failure during model build
During the model build step, there can be unexpected failures from temporary circumstances, such as a network error while downloading model weights from Hugging Face or installing a Python package from PyPI.
### Fix: restart deploy from Baseten UI
First, check your model logs to determine the exact cause of the error. If it's an error during model download, package installation, or similar, you can try restarting the deploy from the model dashboard in your workspace.
***
## Autoscaling issues
Before troubleshooting, review [Autoscaling](/deployment/autoscaling/overview) for parameter details and [Traffic patterns](/deployment/autoscaling/traffic-patterns) for pattern-specific recommendations.
### Latency spikes during scaling events
**Symptoms**: TTFT (time to first token) or p95/p99 latency degrades when replicas are added or removed.
**Causes**:
* Replicas terminated while handling in-flight requests
* Cold start delays while new replicas initialize
**Solutions** (in order of priority):
1. Increase [**scale-down delay**](/deployment/autoscaling/overview#scale-down-delay) (e.g., 300s → 900s) to reduce how often replicas are removed.
2. Increase [**min replicas**](/deployment/autoscaling/overview#minimum-replicas) to reduce cold start frequency.
3. Lower [**target utilization**](/deployment/autoscaling/overview#target-utilization) to provide more headroom during scaling.
### Replicas oscillating (thrash)
**Symptoms**: Replica count bounces repeatedly (e.g., 8↔9) even with relatively stable traffic.
**Causes**: Autoscaler reacting to short-term traffic noise or internal model fluctuations.
**Solutions** (in order of priority):
1. Increase **scale-down delay**: this is the primary lever for oscillation.
2. Increase [**autoscaling window**](/deployment/autoscaling/overview#autoscaling-window) to smooth out noise.
3. Only then consider lowering **target utilization** for more headroom.
Don't use target utilization as the primary fix for thrash. Scale-down delay is more effective and doesn't waste capacity.
### Slow scale-up / "Scaling up replicas" persists
**Symptoms**: New replicas take many minutes (or longer) to become ready. The deployment shows "Scaling up replicas" for an extended period.
**Causes**:
* GPU capacity not available in your region
* Slow model initialization (large weights, slow downloads)
**Solutions**:
1. **Pre-warm** by bumping min replicas via API before expected load spikes.
2. Contact support about capacity pool availability.
3. Check if optimized images are being used (look for "streaming-enabled image" in logs).
### Model scales to zero before testing
**Symptoms**: A newly deployed model scales down to zero before you can send your first test request.
**Solution**: Set `min_replica = 1` during testing. After testing, you can set it back to 0 if you want scale-to-zero behavior.
### Async queue growing without bound
**Symptoms**: The async queue size keeps increasing and requests are not being processed fast enough.
**Cause**: Requests are arriving faster than the deployment can process them.
**Solutions**:
1. Increase [**max replicas**](/deployment/autoscaling/overview#maximum-replicas) to add more processing capacity.
2. Increase [**concurrency target**](/deployment/autoscaling/overview#concurrency-target) if your model can handle more concurrent requests.
3. Lower **target utilization** to trigger scaling earlier.
### Bill higher than expected
**Symptoms**: GPU costs are higher than anticipated, especially during low-traffic periods.
**Solutions**:
1. Raise **concurrency target** to squeeze more throughput from each replica.
2. Monitor **p95 latency** as you raise concurrency. If latency stays stable, keep raising; if it rises sharply, you've gone too far.
3. Enable **scale-to-zero** (min replicas = 0) for intermittent workloads.
4. Review your traffic patterns and adjust settings accordingly. See [Traffic patterns](/deployment/autoscaling/traffic-patterns).
### Cold starts taking too long
**Symptoms**: First request after scale-from-zero takes several minutes. Logs show extended time in model loading or container initialization.
**Causes**:
* Large model weights (10s–100s of GB)
* Slow network downloads from model registries
* Heavy initialization code in `load()` method
**Solutions**:
1. Look for "streaming-enabled image" in logs. This confirms image streaming is active.
2. Keep `min_replica ≥ 1` to avoid cold starts entirely.
3. Pre-warm before expected traffic spikes using the [autoscaling API](/reference/management-api/deployments/autoscaling/updates-a-deployments-autoscaling-settings).
See [Cold starts](/deployment/autoscaling/cold-starts) for detailed optimization strategies.
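The pre-warm step can be scripted against the management API. This is a hedged sketch: the endpoint path (`/models/<model_id>/deployments/<deployment_id>/autoscaling_settings`) and the `min_replica` payload field are assumptions here; verify them against the autoscaling update endpoint in the management API reference before use.

```python
import json
import os
import urllib.request

API_BASE = "https://api.baseten.co/v1"

def autoscaling_update(model_id: str, deployment_id: str, min_replica: int):
    """Build the (assumed) PATCH URL and payload for an autoscaling update."""
    url = f"{API_BASE}/models/{model_id}/deployments/{deployment_id}/autoscaling_settings"
    payload = {"min_replica": min_replica}
    return url, payload

def prewarm(model_id: str, deployment_id: str, min_replica: int) -> None:
    """Raise min replicas ahead of an expected traffic spike."""
    url, payload = autoscaling_update(model_id, deployment_id, min_replica)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        resp.read()  # raises on HTTP errors via urllib's default handling
```

After the spike passes, lower `min_replica` again (or back to 0 for scale-to-zero workloads) to avoid paying for idle capacity.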
### Development deployment won't scale
**Symptoms**: Development deployment stays at 1 replica regardless of traffic. Can't change autoscaling settings.
**Cause**: Development deployments have fixed autoscaling settings that cannot be modified. Max replicas is locked at 1.
**Solution**: Promote to a production deployment to enable full autoscaling. Development deployments are optimized for iteration with live reload, not traffic handling.
See [Development deployments](/deployment/autoscaling/overview#development-deployments) for the fixed settings.
### Not sure which traffic pattern I have
**Symptoms**: Unsure how to configure autoscaling because traffic behavior is unclear.
**Solution**:
1. Go to your model's **Metrics** tab in the Baseten dashboard.
2. Look at **Inference volume** and **Replicas** over the past week.
3. Identify your pattern:
| You see... | Pattern | Key settings to adjust |
| ------------------------------------------- | --------------- | ------------------------------------------- |
| Frequent small spikes returning to baseline | Noisy/jittery | Longer autoscaling window |
| Sharp jumps that stay high | Bursty | Short window, long delay, lower utilization |
| Long flat periods with occasional bursts | Batch/scheduled | Scale-to-zero, pre-warming |
| Gradual rises and falls | Smooth/steady | Higher utilization is safe |
See [Traffic patterns](/deployment/autoscaling/traffic-patterns) for detailed recommendations.
### Concurrency target misconfigured
**Symptoms**: Either unexpectedly high costs OR high latency despite having replicas available.
**Diagnosis**:
* **Too low** (common): Running many more replicas than needed. Default of 1 is conservative but expensive.
* **Too high**: Requests queue at replicas, causing latency even when replica count looks healthy.
**Solutions**:
1. Benchmark your model to find actual throughput capacity.
2. Use starting points by model type:
| Model type | Starting concurrency |
| ----------------------- | -------------------- |
| Standard Truss | 1 |
| vLLM / LLM inference | 32–128 |
| Text embeddings (TEI) | 32 |
| Image generation (SDXL) | 1 |
3. Gradually increase while monitoring p95 latency. Stop when latency rises sharply.
See [Concurrency target](/deployment/autoscaling/overview#concurrency-target) for full guidance.
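Step 3 above hinges on tracking p95 latency as you raise concurrency. A minimal offline sketch of that check, using synthetic timings to show what a healthy versus a queueing distribution looks like:

```python
import statistics

def p95(latencies_ms: list[float]) -> float:
    """95th-percentile latency from a list of per-request timings."""
    # quantiles(n=20) returns the 5th, 10th, ..., 95th percentile cut points.
    return statistics.quantiles(latencies_ms, n=20)[-1]

stable = [100, 105, 110, 102, 98, 101, 104, 99, 103, 107] * 10
queued = stable + [900, 950, 1000] * 10  # requests queueing behind a saturated replica

print(round(p95(stable)))  # stays near baseline: concurrency target is fine
print(round(p95(queued)))  # jumps sharply: concurrency target is too high
```

In practice you'd read p95 from the **Metrics** tab rather than compute it yourself; the point is the shape of the signal: a sharp jump in p95 at a given concurrency target means requests are queueing and the target should come back down.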
For detailed autoscaling configuration, see [Autoscaling](/deployment/autoscaling/overview). For pattern-specific recommendations, see [Traffic patterns](/deployment/autoscaling/traffic-patterns).
# Inference
Source: https://docs.baseten.co/troubleshooting/inference
Troubleshoot common problems during model inference
## Model I/O issues
### Error: JSONDecodeError
```
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```
This error means the model input you're passing is not valid JSON. For example, you might have left out the double quotes required for a valid JSON string:
```sh theme={"system"}
truss predict -d 'This is not a string' # Wrong
truss predict -d '"This is a string"' # Correct
```
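The distinction above comes down to whether the payload parses as JSON. You can reproduce both cases directly with Python's `json` module:

```python
import json

# With quotes, the payload is a valid JSON string:
print(json.loads('"This is a string"'))  # This is a string

# Without quotes, parsing fails at the first character:
try:
    json.loads('This is not a string')
except json.JSONDecodeError as e:
    print(e)  # Expecting value: line 1 column 1 (char 0)
```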
## Model version issues
### Error: No OracleVersion matches the given query
```
No OracleVersion matches the given query.
```
Make sure that the model ID or deployment ID you're passing is correct and that the associated model has not been deleted.
Additionally, make sure you're using the correct endpoint:
* [Production environment endpoint](/reference/inference-api/predict-endpoints/environments-predict).
* [Development deployment endpoint](/reference/inference-api/predict-endpoints/development-predict).
* [Deployment endpoint](/reference/inference-api/predict-endpoints/deployment-predict).
## Authentication issues
### Error: Service provider not found
```
ValueError: Service provider example-service-provider not found in ~/.trussrc
```
This error means your `~/.trussrc` is incomplete or incorrect. It should be formatted as follows:
```
[baseten]
remote_provider = baseten
api_key = YOUR.API_KEY
remote_url = https://app.baseten.co
```
### Error: You have to log in to perform the request
```
You have to log in to perform the request
```
This error occurs on `truss predict` when the API key in `~/.trussrc` for a given host is missing or incorrect. To fix it, update your API key in the `~/.trussrc` file.
### Error: Please check the API key you provided
```
{
"error": "please check the api-key you provided"
}
```
This error occurs when calling the model's API endpoint (with `curl` or similar) and the API key passed in the request header is invalid. Verify that you're using a valid API key, then try again.
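A common cause is sending the key with the wrong scheme: Baseten expects an `Authorization: Api-Key ...` header, not `Bearer`. The sketch below builds (but does not send) such a request; the model URL is a hypothetical placeholder.

```python
from urllib.request import Request

API_KEY = "YOUR_API_KEY"  # placeholder

req = Request(
    "https://model-abc123.api.baseten.co/production/predict",  # hypothetical model URL
    data=b'{"prompt": "hello"}',
    headers={
        "Authorization": f"Api-Key {API_KEY}",  # note the "Api-Key" scheme, not "Bearer"
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(req) would send it; an invalid key returns the error above.
```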