---
name: Baseten
description: Use when deploying AI models to production, building multi-step inference pipelines with Chains, running training jobs, or calling hosted LLMs through Model APIs. Reach for this skill when agents need to package models, configure deployments, manage autoscaling, call inference endpoints, or troubleshoot deployment issues.
metadata:
    mintlify-proj: baseten
    version: "1.0"
---

# Baseten Skill

## Product summary

Baseten is a training and inference platform for deploying AI models at scale. Agents use **Truss** (an open-source model packaging tool) to containerize models and configuration, then deploy to Baseten for autoscaling, observability, and optimized serving. For quick inference without deployment, **Model APIs** provide OpenAI-compatible endpoints for hosted models. **Chains** orchestrates multi-step pipelines where each step runs on independent hardware. Key files: `config.yaml` (model configuration), `model.py` (inference logic). CLI: `truss init`, `truss push`, `truss watch`, `truss predict`. Primary docs: https://docs.baseten.co

## When to use

- **Deploying models**: Agent is packaging a model (custom code or config-only) and pushing to production with `truss push`
- **Iterating on models**: Agent is testing changes with `truss push --watch` for live-reload development deployments
- **Calling inference**: Agent is making API requests to deployed models or Model APIs using `/predict` or `/async_predict` endpoints
- **Building pipelines**: Agent is orchestrating multi-model workflows with Chains (e.g., RAG, audio transcription, multi-step image generation)
- **Configuring resources**: Agent is setting GPU type, memory, dependencies, secrets, or autoscaling parameters in `config.yaml`
- **Troubleshooting deployments**: Agent is debugging build failures, autoscaling issues, cold starts, or unhealthy replicas
- **Training and fine-tuning**: Agent is launching training jobs and deploying checkpoints to production

## Quick reference

### Essential CLI commands

| Command | Purpose |
|---------|---------|
| `truss init` | Scaffold a new model project |
| `truss push` | Create a published (production-ready) deployment |
| `truss push --watch` | Create development deployment with live-reload |
| `truss watch` | Re-attach to existing development deployment |
| `truss predict -d '{...}'` | Test inference locally or against deployed model |
| `truss push --promote` | Deploy and promote directly to production environment |
| `truss chains push <file.py>` | Deploy a Chain pipeline |

### config.yaml essentials

```yaml
model_name: my-model
resources:
  accelerator: L4          # GPU type (L4, H100, etc.)
  cpu: "4"                 # CPU cores
  memory: 16Gi             # RAM
requirements:
  - torch
  - transformers
environment_variables:
  MY_VAR: value
secrets:
  hf_access_token: null   # Injected at runtime
```

### Inference endpoints

| Endpoint | Use case |
|----------|----------|
| `/production/predict` | Sync inference on production deployment |
| `/development/predict` | Sync inference on dev deployment |
| `/production/async_predict` | Async inference (returns request ID) |
| `/async_request/{id}` | Check async request status |
| `/production/wake` | Pre-warm deployment before traffic spike |
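
The async endpoints return a request ID immediately and require polling for the result. A minimal sketch with `requests`; the `model_input` payload wrapper, the `request_id` response field, and the terminal status values are assumptions to confirm against the inference API reference:

```python
import os
import time

import requests

model_id = "<model_id>"  # from the model dashboard
base = f"https://model-{model_id}.api.baseten.co"
headers = {"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"}

# Kick off the async request; the response carries a request ID, not the result.
start = requests.post(
    f"{base}/production/async_predict",
    headers=headers,
    json={"model_input": {"prompt": "Summarize this document."}},  # assumed payload wrapper
)
request_id = start.json()["request_id"]  # assumed response field name

# Poll the status endpoint until the request reaches a terminal state.
while True:
    status = requests.get(f"{base}/async_request/{request_id}", headers=headers).json()
    if status.get("status") in ("SUCCEEDED", "FAILED", "EXPIRED"):  # assumed status values
        print(status)
        break
    time.sleep(5)
```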

### Autoscaling parameters (defaults)

| Parameter | Default | What it controls |
|-----------|---------|------------------|
| `min_replica` | 0 | Minimum instances (0 = scale-to-zero) |
| `max_replica` | 1 | Maximum instances |
| `concurrency_target` | 1 | Requests per replica before scaling |
| `target_utilization_percentage` | 70% | Headroom before scaling triggers |
| `autoscaling_window` | 60s | Time window for traffic analysis |
| `scale_down_delay` | 900s | Wait before removing idle replicas |

## Decision guidance

### When to use config-only vs. custom code

| Scenario | Approach | Why |
|----------|----------|-----|
| Deploying standard LLM from Hugging Face | Config-only (Engine-Builder) | Engines handle optimization; no custom code needed |
| Need preprocessing/postprocessing | Custom Python code in `model.py` | Define `load()` and `predict()` methods |
| Using vLLM, SGLang, or custom Docker server | Custom server with base_image | Map routes and configure health checks |
| Multi-step workflow (RAG, transcription) | Chains | Each step runs independently with own hardware |
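
For the custom-code path, `model.py` exposes a `Model` class whose `load()` runs once per replica at startup and whose `predict()` handles each request. A minimal sketch, assuming the `torch`/`transformers` requirements from the config example above; the pipeline choice and input/output shapes are illustrative, not prescribed:

```python
# model/model.py — minimal custom Truss model (illustrative sketch)
from transformers import pipeline


class Model:
    def __init__(self, **kwargs):
        # Truss passes config, data_dir, and secrets via kwargs; secrets declared
        # in config.yaml (e.g. hf_access_token) are injected here at runtime.
        self._secrets = kwargs.get("secrets")
        self._pipeline = None

    def load(self):
        # Runs once per replica at startup: load weights here, not in predict().
        self._pipeline = pipeline("text-generation", model="gpt2")

    def predict(self, model_input: dict) -> dict:
        prompt = model_input["prompt"]
        output = self._pipeline(prompt, max_new_tokens=64)
        return {"completion": output[0]["generated_text"]}
```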

### When to use development vs. published deployments

| Scenario | Deployment type | Command |
|----------|-----------------|---------|
| Building and testing locally | Development | `truss push --watch` |
| Ready for production traffic | Published | `truss push` |
| Need full autoscaling (>1 replica) | Published | `truss push` then configure autoscaling |
| Iterating with live reload | Development | `truss watch` (re-attach) |

### When to use sync vs. async inference

| Scenario | Endpoint | Why |
|----------|----------|-----|
| Real-time API (chat, embedding) | `/predict` | Synchronous, immediate response |
| Long-running task (training, batch processing) | `/async_predict` | Returns immediately; poll status later |
| Streaming responses (token-by-token) | `/predict` with streaming | Use OpenAI SDK with `stream=True` |
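
For token-by-token streaming, the OpenAI SDK with `stream=True` works against OpenAI-compatible endpoints. A minimal sketch, assuming the Model APIs base URL `https://inference.baseten.co/v1` and a placeholder model slug (check the dashboard for both):

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://inference.baseten.co/v1",  # assumed Model APIs base URL
)

# stream=True yields chunks as tokens are generated instead of one final response.
stream = client.chat.completions.create(
    model="<model-slug>",  # placeholder: a hosted model or your OpenAI-compatible deployment
    messages=[{"role": "user", "content": "Write a haiku about autoscaling."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```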

### Concurrency target by model type

| Model type | Starting value | Reasoning |
|-----------|----------------|-----------|
| Standard Truss model | 1 | Processes one request at a time |
| vLLM / LLM inference | 32–128 | Batches requests internally |
| Text embeddings (TEI) | 32 | Batches efficiently |
| Image generation (SDXL) | 1 | Consumes all GPU memory per request |
| BEI embeddings | 96+ | High-throughput batching |

## Workflow

### Deploy a model (config-only)

1. **Initialize**: Run `truss init` and name your model.
2. **Configure**: Edit `config.yaml` with model name, GPU type, and dependencies.
3. **Test locally**: Run `truss predict -d '{"input": "test"}'` to verify.
4. **Deploy**: Run `truss push` to create a published deployment.
5. **Verify**: Check Baseten dashboard for deployment status and logs.
6. **Call inference**: Use `/production/predict` endpoint with API key.

### Iterate on a model with live reload

1. **Start watch**: Run `truss push --watch` to create development deployment.
2. **Make changes**: Edit `model.py`, `config.yaml`, or requirements.
3. **Truss detects changes**: The running deployment is patched automatically within seconds.
4. **Test**: Call `/development/predict` to verify changes.
5. **Repeat**: Continue editing and testing without redeploying.
6. **Promote**: When satisfied, run `truss push --promote` to production.

### Configure autoscaling for production

1. **Identify traffic pattern**: Review metrics dashboard (Inference volume, Replicas over time).
2. **Set replica bounds**: Choose `min_replica` (0 for scale-to-zero, ≥1 for always-on) and `max_replica` (cost ceiling).
3. **Tune concurrency**: Start with model-type defaults (e.g., 32 for vLLM), monitor p95 latency, adjust.
4. **Adjust timing**: Increase `scale_down_delay` if replicas oscillate; increase `autoscaling_window` if traffic is noisy.
5. **Apply settings**: Use UI or API to update autoscaling configuration.
6. **Monitor**: Watch metrics for latency spikes, oscillation, or unexpected costs.

### Call a deployed model

1. **Get credentials**: Retrieve model ID from dashboard and API key from settings.
2. **Construct URL**: `https://model-{model_id}.api.baseten.co/environments/production/predict`
3. **Add auth header**: `Authorization: Api-Key {BASETEN_API_KEY}`
4. **Send JSON**: POST with the model input as the JSON body (see the sketch after this list).
5. **Handle response**: Parse JSON response or stream tokens if streaming enabled.
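
A minimal sketch of that request in Python, using the URL and header format above; the input payload is illustrative and depends on what your `model.py` expects:

```python
import os

import requests

model_id = "<model_id>"  # from the model dashboard
url = f"https://model-{model_id}.api.baseten.co/environments/production/predict"

resp = requests.post(
    url,
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={"prompt": "What is Baseten?"},  # illustrative input; match your model's schema
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```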

### Build a multi-step pipeline with Chains

1. **Define chainlets**: Create Python classes inheriting from `ChainletBase` with `run_remote()` methods.
2. **Mark entrypoint**: Decorate orchestrator chainlet with `@chains.mark_entrypoint`.
3. **Inject dependencies**: Use `chains.depends()` to wire chainlets together.
4. **Deploy**: Run `truss chains push my_chain.py` to deploy all chainlets.
5. **Call**: Invoke the entrypoint via its `/production/predict` endpoint (a minimal Chain sketch follows this list).
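
A minimal two-chainlet sketch following those steps, assuming the `truss_chains` package; per-chainlet hardware and other resource configuration are omitted for brevity:

```python
import truss_chains as chains


class SayHello(chains.ChainletBase):
    # Each chainlet runs as its own deployment; defaults are used here.
    def run_remote(self, name: str) -> str:
        return f"Hello, {name}!"


@chains.mark_entrypoint
class Greeter(chains.ChainletBase):
    # depends() wires SayHello in as a remote dependency of the entrypoint.
    def __init__(self, hello: SayHello = chains.depends(SayHello)):
        self._hello = hello

    def run_remote(self, names: list[str]) -> list[str]:
        return [self._hello.run_remote(n) for n in names]
```

Deployed with `truss chains push <file.py>`, each chainlet gets its own deployment and the entrypoint is callable at `/production/predict`.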

## Common gotchas

- **config.yaml not found**: Run `truss push` from the directory containing `config.yaml`, or pass the path explicitly.
- **Development deployments don't autoscale**: Max replicas is locked at 1. Promote to production to enable full autoscaling.
- **Hot reload doesn't re-run `__init__()` or `load()`**: If you add new instance state in those methods, stop the watch session and push a fresh development deployment with `truss push --watch`.
- **Concurrency target too low**: Default of 1 is conservative but expensive. For vLLM/LLM, start at 32–128 to reduce replica count.
- **Scale-down delay too short**: Replicas oscillate when traffic briefly dips. Increase delay to 900s+ to prevent thrashing.
- **Cold starts on scale-from-zero**: Large models take minutes to load. Set `min_replica ≥ 1` for production or pre-warm with `/wake` endpoint.
- **Secrets in config.yaml**: Never hardcode API keys. Use `secrets` section and inject at runtime; Baseten encrypts them.
- **Model cache deprecated**: Use `weights` instead for faster cold starts through multi-tier caching.
- **Requirements file conflicts**: If using `requirements_file`, don't also use inline `requirements` list—pick one.
- **Async request timeouts**: The default timeout is 24 hours; poll `/async_request/{id}` periodically to track the status of long-running jobs.

## Verification checklist

Before submitting a deployment:

- [ ] `config.yaml` exists in the Truss root directory
- [ ] Model name and description are set
- [ ] GPU/CPU resources match model requirements
- [ ] All Python packages are listed in `requirements` or `requirements_file`
- [ ] Secrets are defined in `secrets` section, not hardcoded
- [ ] `model.py` has `load()` and `predict()` methods (if custom code)
- [ ] Local test passes: `truss predict -d '{...}'`
- [ ] Deployment reaches `READY` status in dashboard
- [ ] Inference endpoint responds with correct output format
- [ ] Autoscaling settings are configured for production (if applicable)
- [ ] Logs show no errors or warnings during model loading

## Resources

- **Comprehensive navigation**: https://docs.baseten.co/llms.txt
- **Truss configuration reference**: https://docs.baseten.co/reference/truss-configuration
- **Inference API reference**: https://docs.baseten.co/reference/inference-api/overview
- **Autoscaling guide**: https://docs.baseten.co/deployment/autoscaling/overview
- **Troubleshooting deployments**: https://docs.baseten.co/troubleshooting/deployments

---

> For additional documentation and navigation, see: https://docs.baseten.co/llms.txt