Inference on Baseten is designed for flexibility, efficiency, and scalability. Models can be served synchronously, asynchronously, or via streaming to meet different performance and latency needs.

  • Synchronous inference is ideal for low-latency, real-time responses.
  • Asynchronous inference handles long-running tasks efficiently without blocking client resources.
  • Streaming inference delivers partial results as they become available, reducing perceived response time (all three modes are sketched below).

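As a rough illustration, the sketch below calls a deployed model in each of the three modes using Python's `requests` library. The model ID, API key, input schema, and webhook URL are placeholders, and whether a model honors a `stream` flag depends on how that particular model is implemented; treat this as a sketch under those assumptions rather than a copy-paste recipe.

```python
import requests

MODEL_ID = "abc123"        # hypothetical model ID; substitute your own
API_KEY = "YOUR_API_KEY"   # placeholder API key
HEADERS = {"Authorization": f"Api-Key {API_KEY}"}

# 1. Synchronous inference: block until the full response is ready.
resp = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/predict",
    headers=HEADERS,
    json={"prompt": "Summarize the benefits of async inference."},
)
print(resp.json())

# 2. Streaming inference: consume partial output as the model emits it.
# The "stream" input field is an assumption; the model itself must be
# written to produce a streamed response.
with requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/predict",
    headers=HEADERS,
    json={"prompt": "Write a short story.", "stream": True},
    stream=True,
) as resp:
    for chunk in resp.iter_content(chunk_size=None):
        print(chunk.decode("utf-8"), end="", flush=True)

# 3. Asynchronous inference: enqueue the request and return immediately;
# results are delivered later, e.g., to a webhook you host.
resp = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/async_predict",
    headers=HEADERS,
    json={
        "model_input": {"prompt": "Transcribe this hour-long recording."},
        "webhook_endpoint": "https://example.com/webhook",  # hypothetical receiver
    },
)
print(resp.json())  # returns a request ID to correlate with the webhook result
```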
Baseten supports various input and output formats, including structured data, binary files, and function calls, making it adaptable to different workloads.
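For example, binary files such as audio or images are commonly base64-encoded into the JSON request body. The sketch below assumes a hypothetical transcription model whose input schema (`audio`, `language`) is invented for illustration; the actual field names depend on the model's implementation.

```python
import base64
import requests

MODEL_ID = "abc123"        # hypothetical model ID
API_KEY = "YOUR_API_KEY"   # placeholder API key

# Encode a local binary file so it can travel inside a JSON payload.
with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={"audio": audio_b64, "language": "en"},  # illustrative schema only
)
print(resp.json())
```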