When you run truss push, Baseten creates a deployment: a running instance of your model on GPU infrastructure with an API endpoint. This page explains how deployments are managed, versioned, and scaled.
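For example, the config.yaml that truss push reads might look like the minimal sketch below; the model name and dependency list are illustrative, not required values.

```yaml
# config.yaml -- a minimal Truss configuration. Running truss push with
# this file creates a new deployment. Name and dependencies are illustrative.
model_name: my-model
python_version: py311
requirements:
  - torch
resources:
  accelerator: L4   # instance type; see Resources below
  use_gpu: true
```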

Deployments

A deployment is a single version of your model running on specific hardware. Every truss push creates a new deployment. You can have multiple deployments of the same model running simultaneously, which is how you test new versions without affecting production traffic. Deployments can be deactivated to stop serving (and stop incurring cost) or deleted permanently when no longer needed.

Environments

As your model matures, you need a way to manage releases. Environments provide stable endpoints that persist across deployments. A typical setup has a development environment for testing and a production environment for live traffic. Each environment maintains its own autoscaling settings, metrics, and endpoint URL. When a new deployment is ready, you promote it to an environment, and traffic shifts to the new version without changing the endpoint your application calls.

Resources

Every deployment runs on a specific instance type that defines its GPU, CPU, and memory allocation. Choosing the right instance balances inference speed against cost. You set the instance type in your config.yaml before deployment, or adjust it later in the dashboard. Smaller models run well on an L4 (24 GB VRAM), while large LLMs may need A100s or H100s with tensor parallelism across multiple GPUs.
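As a sketch, assuming the standard Truss config keys, the resources block in config.yaml for the two cases might look like this. The :4 count suffix for requesting multiple GPUs per replica is an assumption about the accelerator syntax, so confirm it against the Truss reference for your version.

```yaml
# Small model: a single L4 (24 GB VRAM) is usually enough.
resources:
  accelerator: L4
  use_gpu: true
  cpu: "4"
  memory: 16Gi
---
# Large LLM: request multiple GPUs per replica for tensor parallelism.
# The :4 count suffix is an assumed syntax for four GPUs per replica;
# confirm against the Truss reference for your version.
resources:
  accelerator: H100:4
  use_gpu: true
```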

Autoscaling

You don't manage replicas manually. Autoscaling adjusts the number of running instances based on incoming traffic. You configure a minimum and maximum replica count, a concurrency target, and a scale-down delay. When traffic drops, replicas scale down (optionally to zero, eliminating all cost). When traffic spikes, new replicas come up within seconds. Cold start optimization and network acceleration keep response times fast even when scaling from zero.
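These settings are configured per deployment or environment in the dashboard or API rather than in a single file; the YAML below only illustrates the knobs involved, with hypothetical field names.

```yaml
# Illustrative autoscaling settings. Field names are assumptions, not a
# literal Baseten schema; actual values are set per deployment or environment.
autoscaling:
  min_replicas: 0         # scale to zero when idle: no replicas, no cost
  max_replicas: 8         # upper bound during traffic spikes
  concurrency_target: 4   # in-flight requests per replica before scaling out
  scale_down_delay: 300   # seconds of low traffic before removing a replica
```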