Baseten gives you three ways to run inference, each suited to a different stage of a project. You can start with a hosted model, deploy your own when you need control, and compose multiple models into a pipeline when the problem demands it.

Choose your approach

Model APIs are the fastest path to inference. You call a hosted open-source model through an OpenAI-compatible endpoint. There’s no deployment step, no GPU selection, and no scaling configuration. If the model you need is in the supported list, you can make your first call in under a minute.

Self-deployed models give you dedicated GPUs and full control over the serving stack. You point Baseten at a model on Hugging Face, choose a GPU, and truss push builds an optimized container with an API endpoint. For models that need custom preprocessing, postprocessing, or architectures that the config-only path doesn’t support, you write a Python Model class with your own inference logic. Self-deployed models support engine selection, autoscaling, and environment promotion.

Chains let you orchestrate multi-step inference across independent services. Each step in a Chain runs on its own hardware with its own scaling rules. A Chain can call self-deployed models, external APIs, or any Python code. Use Chains when your workflow involves multiple models (like a RAG pipeline with retrieval and generation) or when different steps need different hardware (like CPU for preprocessing and GPU for inference).

These three approaches aren’t mutually exclusive. Many projects start with a Model API call during prototyping, move to a self-deployed model for customization, and eventually wrap the model in a Chain as the system grows.
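The custom-code path for self-deployed models centers on a model.py file. A minimal sketch of the shape Truss expects (the load/predict contract is Truss's documented interface; the upper-casing "model" here is a placeholder standing in for real weights and inference):

```python
# model.py -- the entry point Truss looks for in a custom-code deployment.
# The Model class contract (load, then predict per request) follows Truss's
# interface; the "inference" below is a trivial placeholder.

class Model:
    def __init__(self, **kwargs):
        # Truss passes configuration and secrets via kwargs; keep what you need.
        self._config = kwargs.get("config", {})
        self._model = None

    def load(self):
        # Runs once per replica at startup: load weights, tokenizers, clients.
        # Placeholder: a "model" that just upper-cases its input.
        self._model = lambda text: text.upper()

    def predict(self, model_input: dict) -> dict:
        # Runs per request. Custom pre- and postprocessing also live here.
        text = model_input["prompt"]
        return {"output": self._model(text)}
```

Locally you can exercise the class directly (`Model().load()` then `predict(...)`) before pushing, since it is plain Python with no serving dependencies.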

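Conceptually, a Chain is a graph of steps where each step is an independent service with its own hardware and scaling. The plain-Python sketch below illustrates that structure for a RAG-style pipeline; it is not the truss_chains SDK, and the step classes and method names are illustrative assumptions:

```python
# Structural sketch of a Chain: each step would be an independent service
# in production, called over the network rather than in-process.

class RetrieveStep:
    # In a real Chain this step could run on cheap CPU hardware.
    def run(self, query: str) -> list[str]:
        corpus = {"gpus": "GPUs accelerate inference.", "rag": "RAG adds retrieval."}
        return [text for key, text in corpus.items() if key in query.lower()]

class GenerateStep:
    # In a real Chain this step would run on a GPU replica.
    def run(self, query: str, context: list[str]) -> str:
        return f"Answer to {query!r} using {len(context)} retrieved passage(s)."

class RagChain:
    # The entry point composes the steps; each keeps its own scaling rules.
    def __init__(self):
        self.retrieve = RetrieveStep()
        self.generate = GenerateStep()

    def run(self, query: str) -> str:
        return self.generate.run(query, self.retrieve.run(query))
```

The design point is the split itself: because retrieval and generation are separate services, the CPU-bound step can scale independently of the GPU-bound one.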
The development cycle

Self-deployed models and Chains share the same iteration workflow. You push a development deployment, make changes with live reload, and publish when you’re ready for production traffic.
  1. Push to development. Run truss push --watch to create a development deployment. This is a single-replica instance with live reload enabled, designed for fast iteration rather than production traffic.
  2. Iterate with live reload. Run truss watch to start a file watcher that syncs local changes to your development deployment in seconds, without rebuilding the container. You edit code, save, and see the result in the deployment logs.
  3. Publish to production. Run truss push --publish to create an immutable, production-ready deployment with full autoscaling. Promote it to an environment for a stable endpoint URL that doesn’t change between versions.
Development deployments have slightly lower performance than published deployments and are limited to one replica. They exist to give you a fast feedback loop, not to serve real traffic.