Inference on Baseten is designed for flexibility, efficiency, and scalability. Models can be served synchronously, asynchronously, or via streaming to meet different performance and latency needs.
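For example, a synchronous call is a single HTTPS request that blocks until the model returns its full output, while a streaming call reads the response incrementally as it is generated. The sketch below uses Python's `requests` library; the model ID, endpoint URL pattern, and the `{"prompt": ...}` payload are illustrative placeholders, so check your deployment's dashboard for its actual invocation URL and expected input schema.

```python
import os
import requests

# Placeholder model ID; your deployment's invocation URL is shown in the
# Baseten dashboard and may differ from this pattern.
MODEL_ID = "your_model_id"
API_KEY = os.environ["BASETEN_API_KEY"]
URL = f"https://model-{MODEL_ID}.api.baseten.co/production/predict"
HEADERS = {"Authorization": f"Api-key {API_KEY}"}

# Synchronous: block until the full response is ready.
resp = requests.post(URL, headers=HEADERS, json={"prompt": "Hello!"})
resp.raise_for_status()
print(resp.json())

# Streaming: consume output as the model produces it (assumes the
# deployed model is written to yield a streamed response).
with requests.post(URL, headers=HEADERS, json={"prompt": "Hello!"}, stream=True) as stream:
    for chunk in stream.iter_content(chunk_size=None):
        print(chunk.decode("utf-8"), end="", flush=True)
```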
Baseten supports a range of input and output formats, including structured JSON, binary files, and function calls, making it adaptable to different workloads.
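Because requests and responses typically travel as JSON, binary payloads such as images or audio are commonly base64-encoded on the way in and decoded on the way out. A minimal sketch, assuming a hypothetical model whose input schema accepts `task` and `image` fields (both names, like the endpoint itself, are placeholders, not a fixed Baseten contract):

```python
import base64
import requests

URL = "https://model-your_model_id.api.baseten.co/production/predict"  # placeholder
HEADERS = {"Authorization": "Api-key YOUR_API_KEY"}  # placeholder

# Structured input: plain JSON fields.
payload = {"task": "caption", "max_tokens": 64}

# Binary input: base64-encode the file so it can travel inside JSON.
with open("cat.png", "rb") as f:
    payload["image"] = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(URL, headers=HEADERS, json=payload)
resp.raise_for_status()

# Binary output comes back the same way: decode the base64 field.
result = resp.json()
if "image" in result:
    with open("output.png", "wb") as f:
        f.write(base64.b64decode(result["image"]))
```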