---
name: Baseten
description: Use when deploying AI models to production, building multi-step inference pipelines with Chains, running training jobs, or calling hosted LLMs through Model APIs. Reach for this skill when agents need to package models, configure deployments, manage autoscaling, call inference endpoints, or troubleshoot deployment issues.
metadata:
    mintlify-proj: baseten
    version: "1.0"
---

# Baseten Skill

## Product summary

Baseten is a training and inference platform for deploying AI models at scale. Agents use **Truss** (an open-source model packaging tool) to containerize models and configuration, then deploy to Baseten for autoscaling, observability, and optimized serving. For quick inference without deployment, **Model APIs** provide OpenAI-compatible endpoints for hosted models. **Chains** orchestrates multi-step pipelines where each step runs on independent hardware. Key files: `config.yaml` (model configuration), `model.py` (inference logic). CLI: `truss init`, `truss push`, `truss watch`, `truss predict`. Primary docs: https://docs.baseten.co

## When to use

- **Deploying models**: Agent is packaging a model (custom code or config-only) and pushing to production with `truss push`
- **Iterating on models**: Agent is testing changes with `truss push --watch` for live-reload development deployments
- **Calling inference**: Agent is making API requests to deployed models or Model APIs using `/predict` or `/async_predict` endpoints
- **Building pipelines**: Agent is orchestrating multi-model workflows with Chains (e.g., RAG, audio transcription, multi-step image generation)
- **Configuring resources**: Agent is setting GPU type, memory, dependencies, secrets, or autoscaling parameters in `config.yaml`
- **Troubleshooting deployments**: Agent is debugging build failures, autoscaling issues, cold starts, or unhealthy replicas
- **Training and fine-tuning**: Agent is launching training jobs and deploying checkpoints to production

## Quick reference

### Essential CLI commands

| Command | Purpose |
|---------|---------|
| `truss init` | Scaffold a new model project |
| `truss push` | Create a published (production-ready) deployment |
| `truss push --watch` | Create development deployment with live-reload |
| `truss watch` | Re-attach to existing development deployment |
| `truss predict -d '{...}'` | Test inference locally or against deployed model |
| `truss push --promote` | Deploy and promote directly to production environment |
| `truss chains push <file.py>` | Deploy a Chain pipeline |

### config.yaml essentials

```yaml
model_name: my-model
resources:
  accelerator: L4          # GPU type (L4, H100, etc.)
  cpu: "4"                 # CPU cores
  memory: 16Gi             # RAM
requirements:
  - torch
  - transformers
environment_variables:
  MY_VAR: value
secrets:
  hf_access_token: null   # Injected at runtime
```

### Inference endpoints

| Endpoint | Use case |
|----------|----------|
| `/production/predict` | Sync inference on production deployment |
| `/development/predict` | Sync inference on dev deployment |
| `/production/async_predict` | Async inference (returns request ID) |
| `/async_request/{id}` | Check async request status |
| `/production/wake` | Pre-warm deployment before traffic spike |
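
The async endpoints return a request ID immediately and require polling for the result. A minimal sketch with `requests`; the `model_input` payload wrapper, the `request_id` response field, and the terminal status values are assumptions to confirm against the inference API reference:

```python
import os
import time

import requests

model_id = "<model_id>"  # from the model dashboard
base = f"https://model-{model_id}.api.baseten.co"
headers = {"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"}

# Kick off the async request; the response carries a request ID, not the result.
start = requests.post(
    f"{base}/production/async_predict",
    headers=headers,
    json={"model_input": {"prompt": "Summarize this document."}},  # assumed payload wrapper
)
request_id = start.json()["request_id"]  # assumed response field name

# Poll the status endpoint until the request reaches a terminal state.
while True:
    status = requests.get(f"{base}/async_request/{request_id}", headers=headers).json()
    if status.get("status") in ("SUCCEEDED", "FAILED", "EXPIRED"):  # assumed status values
        print(status)
        break
    time.sleep(5)
```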

### Autoscaling parameters (defaults)

| Parameter | Default | What it controls |
|-----------|---------|------------------|
| `min_replica` | 0 | Minimum instances (0 = scale-to-zero) |
| `max_replica` | 1 | Maximum instances |
| `concurrency_target` | 1 | Requests per replica before scaling |
| `target_utilization_percentage` | 70% | Headroom before scaling triggers |
| `autoscaling_window` | 60s | Time window for traffic analysis |
| `scale_down_delay` | 900s | Wait before removing idle replicas |

## Decision guidance

### When to use config-only vs. custom code

| Scenario | Approach | Why |
|----------|----------|-----|
| Deploying standard LLM from Hugging Face | Config-only (Engine-Builder) | Engines handle optimization; no custom code needed |
| Need preprocessing/postprocessing | Custom Python code in `model.py` | Define `load()` and `predict()` methods |
| Using vLLM, SGLang, or custom Docker server | Custom server with base_image | Map routes and configure health checks |
| Multi-step workflow (RAG, transcription) | Chains | Each step runs independently with own hardware |
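
For the custom-code path, `model.py` exposes a `Model` class whose `load()` runs once per replica at startup and whose `predict()` handles each request. A minimal sketch, assuming the `torch`/`transformers` requirements from the config example above; the pipeline choice and input/output shapes are illustrative, not prescribed:

```python
# model/model.py — minimal custom Truss model (illustrative sketch)
from transformers import pipeline


class Model:
    def __init__(self, **kwargs):
        # Truss passes config, data_dir, and secrets via kwargs; secrets declared
        # in config.yaml (e.g. hf_access_token) are injected here at runtime.
        self._secrets = kwargs.get("secrets")
        self._pipeline = None

    def load(self):
        # Runs once per replica at startup: load weights here, not in predict().
        self._pipeline = pipeline("text-generation", model="gpt2")

    def predict(self, model_input: dict) -> dict:
        prompt = model_input["prompt"]
        output = self._pipeline(prompt, max_new_tokens=64)
        return {"completion": output[0]["generated_text"]}
```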

### When to use development vs. published deployments

| Scenario | Deployment type | Command |
|----------|-----------------|---------|
| Building and testing locally | Development | `truss push --watch` |
| Ready for production traffic | Published | `truss push` |
| Need full autoscaling (>1 replica) | Published | `truss push` then configure autoscaling |
| Iterating with live reload | Development | `truss watch` (re-attach) |

### When to use sync vs. async inference

| Scenario | Endpoint | Why |
|----------|----------|-----|
| Real-time API (chat, embedding) | `/predict` | Synchronous, immediate response |
| Long-running task (training, batch processing) | `/async_predict` | Returns immediately; poll status later |
| Streaming responses (token-by-token) | `/predict` with streaming | Use OpenAI SDK with `stream=True` |
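
For token-by-token streaming, the OpenAI SDK with `stream=True` works against OpenAI-compatible endpoints. A minimal sketch, assuming the Model APIs base URL `https://inference.baseten.co/v1` and a placeholder model slug (check the dashboard for both):

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://inference.baseten.co/v1",  # assumed Model APIs base URL
)

# stream=True yields chunks as tokens are generated instead of one final response.
stream = client.chat.completions.create(
    model="<model-slug>",  # placeholder: a hosted model or your OpenAI-compatible deployment
    messages=[{"role": "user", "content": "Write a haiku about autoscaling."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```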

### Concurrency target by model type

| Model type | Starting value | Reasoning |
|-----------|----------------|-----------|
| Standard Truss model | 1 | Processes one request at a time |
| vLLM / LLM inference | 32–128 | Batches requests internally |
| Text embeddings (TEI) | 32 | Batches efficiently |
| Image generation (SDXL) | 1 | Consumes all GPU memory per request |
| BEI embeddings | 96+ | High-throughput batching |

## Workflow

### Deploy a model (config-only)

1. **Initialize**: Run `truss init` and name your model.
2. **Configure**: Edit `config.yaml` with model name, GPU type, and dependencies.
3. **Test locally**: Run `truss predict -d '{"input": "test"}'` to verify.
4. **Deploy**: Run `truss push` to create a published deployment.
5. **Verify**: Check Baseten dashboard for deployment status and logs.
6. **Call inference**: Use `/production/predict` endpoint with API key.

### Iterate on a model with live reload

1. **Start watch**: Run `truss push --watch` to create development deployment.
2. **Make changes**: Edit `model.py`, `config.yaml`, or requirements.
3. **Truss detects changes**: The running deployment is patched automatically within seconds.
4. **Test**: Call `/development/predict` to verify changes.
5. **Repeat**: Continue editing and testing without redeploying.
6. **Promote**: When satisfied, run `truss push --promote` to production.

### Configure autoscaling for production

1. **Identify traffic pattern**: Review metrics dashboard (Inference volume, Replicas over time).
2. **Set replica bounds**: Choose `min_replica` (0 for scale-to-zero, ≥1 for always-on) and `max_replica` (cost ceiling).
3. **Tune concurrency**: Start with model-type defaults (e.g., 32 for vLLM), monitor p95 latency, adjust.
4. **Adjust timing**: Increase `scale_down_delay` if replicas oscillate; increase `autoscaling_window` if traffic is noisy.
5. **Apply settings**: Use UI or API to update autoscaling configuration.
6. **Monitor**: Watch metrics for latency spikes, oscillation, or unexpected costs.

### Call a deployed model

1. **Get credentials**: Retrieve model ID from dashboard and API key from settings.
2. **Construct URL**: `https://model-{model_id}.api.baseten.co/environments/production/predict`
3. **Add auth header**: `Authorization: Api-Key {BASETEN_API_KEY}`
4. **Send JSON**: POST with the model input as the JSON body (see the sketch after this list).
5. **Handle response**: Parse JSON response or stream tokens if streaming enabled.
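
A minimal sketch of that request in Python, using the URL and header format above; the input payload is illustrative and depends on what your `model.py` expects:

```python
import os

import requests

model_id = "<model_id>"  # from the model dashboard
url = f"https://model-{model_id}.api.baseten.co/environments/production/predict"

resp = requests.post(
    url,
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={"prompt": "What is Baseten?"},  # illustrative input; match your model's schema
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```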

### Build a multi-step pipeline with Chains

1. **Define chainlets**: Create Python classes inheriting from `ChainletBase` with `run_remote()` methods.
2. **Mark entrypoint**: Decorate orchestrator chainlet with `@chains.mark_entrypoint`.
3. **Inject dependencies**: Use `chains.depends()` to wire chainlets together.
4. **Deploy**: Run `truss chains push my_chain.py` to deploy all chainlets.
5. **Call**: Invoke the entrypoint via its `/production/predict` endpoint (a minimal Chain sketch follows this list).
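
A minimal two-chainlet sketch following those steps, assuming the `truss_chains` package; per-chainlet hardware and other resource configuration are omitted for brevity:

```python
import truss_chains as chains


class SayHello(chains.ChainletBase):
    # Each chainlet runs as its own deployment; defaults are used here.
    def run_remote(self, name: str) -> str:
        return f"Hello, {name}!"


@chains.mark_entrypoint
class Greeter(chains.ChainletBase):
    # depends() wires SayHello in as a remote dependency of the entrypoint.
    def __init__(self, hello: SayHello = chains.depends(SayHello)):
        self._hello = hello

    def run_remote(self, names: list[str]) -> list[str]:
        return [self._hello.run_remote(n) for n in names]
```

Deployed with `truss chains push <file.py>`, each chainlet gets its own deployment and the entrypoint is callable at `/production/predict`.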

## Common gotchas

- **config.yaml not found**: Run `truss push` from the directory containing `config.yaml`, or pass the path explicitly.
- **Development deployments don't autoscale**: Max replicas is locked at 1. Promote to production to enable full autoscaling.
- **Hot reload doesn't re-run `__init__()` or `load()`**: If you add new instance state in those methods, stop the watch session and push a fresh development deployment with `truss push --watch`.
- **Concurrency target too low**: Default of 1 is conservative but expensive. For vLLM/LLM, start at 32–128 to reduce replica count.
- **Scale-down delay too short**: Replicas oscillate when traffic briefly dips. Increase delay to 900s+ to prevent thrashing.
- **Cold starts on scale-from-zero**: Large models take minutes to load. Set `min_replica ≥ 1` for production or pre-warm with `/wake` endpoint.
- **Secrets in config.yaml**: Never hardcode API keys. Use `secrets` section and inject at runtime; Baseten encrypts them.
- **Model cache deprecated**: Use `weights` instead for faster cold starts through multi-tier caching.
- **Requirements file conflicts**: If using `requirements_file`, don't also use inline `requirements` list—pick one.
- **Async request timeouts**: The default timeout is 24 hours; poll `/async_request/{id}` periodically to track the status of long-running jobs.

## Verification checklist

Before submitting a deployment:

- [ ] `config.yaml` exists in the Truss root directory
- [ ] Model name and description are set
- [ ] GPU/CPU resources match model requirements
- [ ] All Python packages are listed in `requirements` or `requirements_file`
- [ ] Secrets are defined in `secrets` section, not hardcoded
- [ ] `model.py` has `load()` and `predict()` methods (if custom code)
- [ ] Local test passes: `truss predict -d '{...}'`
- [ ] Deployment reaches `READY` status in dashboard
- [ ] Inference endpoint responds with correct output format
- [ ] Autoscaling settings are configured for production (if applicable)
- [ ] Logs show no errors or warnings during model loading

## Resources

- **Comprehensive navigation**: https://docs.baseten.co/llms.txt
- **Truss configuration reference**: https://docs.baseten.co/reference/truss-configuration
- **Inference API reference**: https://docs.baseten.co/reference/inference-api/overview
- **Autoscaling guide**: https://docs.baseten.co/deployment/autoscaling/overview
- **Troubleshooting deployments**: https://docs.baseten.co/troubleshooting/deployments

---

> For additional documentation and navigation, see: https://docs.baseten.co/llms.txt