Model resources

Allocate GPUs and other resources to your deployed models
This doc exists to help you answer two questions:
  1. Vertical scale: What is the least expensive instance type that can reliably and performantly run your model?
  2. Horizontal scale: How can you handle variable traffic in a cost-effective way while providing desired response times?

Configuring model resources

Baseten gives you flexibility in configuring model resources to match your unique requirements. Through vertical scaling (increasing instance size) and horizontal scaling (increasing replica count), you can handle even the most demanding production workloads.
The model resources panel shows essential resource information

In your Truss

If you know your model resource requirements and are using Truss to package your model, you can specify your model resources directly in your model's config.yaml.
Resources from config.yaml are only read when creating a new model, not when creating a new version of an existing model or updating a draft model.
Here's an example config:
resources:
  cpu: "3"
  memory: 14Gi
  use_gpu: true
  accelerator: A10G
This config specifies the minimum resource requirements for the model. When deployed, the model is placed on the least expensive instance that meets or exceeds those requirements. In this example, the model would run on an A10Gx4x16 instance (one A10G GPU, 4 vCPUs, and 16 GiB of RAM), slightly exceeding the minimum vCPU and RAM requirements.
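The instance-selection logic described above can be sketched as follows. This is an illustrative model, not Baseten's implementation: the instance catalog and hourly prices below are hypothetical placeholders, and only the selection rule (cheapest instance meeting every minimum) comes from the doc.

```python
# Sketch of "least expensive instance that meets or exceeds requirements."
# Catalog entries and prices are made up for illustration.

REQUIREMENTS = {"cpu": 3, "memory_gib": 14, "accelerator": "A10G"}

# (name, vCPUs, RAM in GiB, GPU, hypothetical hourly price)
INSTANCES = [
    ("A10Gx4x16", 4, 16, "A10G", 1.21),
    ("A10Gx8x32", 8, 32, "A10G", 1.59),
    ("T4x4x16",   4, 16, "T4",   0.60),
]

def pick_instance(req, instances):
    """Return the cheapest instance satisfying all minimum requirements."""
    candidates = [
        inst for inst in instances
        if inst[1] >= req["cpu"]                # enough vCPUs
        and inst[2] >= req["memory_gib"]        # enough RAM
        and inst[3] == req["accelerator"]       # requested GPU attached
    ]
    return min(candidates, key=lambda inst: inst[4], default=None)

print(pick_instance(REQUIREMENTS, INSTANCES)[0])  # A10Gx4x16
```

Note that the T4 instance is cheaper but is excluded because the config explicitly requests an A10G accelerator.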

From the model page

In the bottom-left corner of a model's page, you can view the resources allocated to the model. To modify the model's resources, click "Update resources" and a modal will appear.
The resource configuration modal gives you options and expected costs

Vertical scale: instance type

In the modal, you can select the right instance type for your model. If your model does not require a GPU to run inference, turn off the "Add a GPU" toggle to see instance types without attached GPUs, which are substantially less expensive.
If you do need a GPU, you can pick between the T4 and the A10G. The A10G offers more compute and memory (24 GiB vs. the T4's 16 GiB), so the right choice depends on your model's requirements.
Your instance type also sets your CPU count and RAM. How much you need of each depends on your model and its inputs and outputs. Multiple instance sizes are available for both T4 and A10G GPUs.
If your model has multiple active versions, updating the model resources for one version will update all other active versions to use the same resources.

Horizontal scale: autoscaling replicas

When the traffic to your model is variable, autoscaling adds and removes replicas—identical copies of your instance—to horizontally scale your model and handle more concurrent requests when demand is high and save infrastructure costs when demand is low.

Scale to zero

If you're just testing your model or anticipate light and inconsistent traffic, scale to zero can save you substantial amounts of money.
Note that when a model is scaled to zero and receives a request, there is a delay as the model must be allocated resources before it can start processing the request. This delay is called the cold start time.
To prevent frequent cold starts, you can adjust the scaling delay to keep your model alive for up to an hour after it spins up. That way, during a burst of traffic, only the first request deals with the cold start time. If your autoscaling config has a max replica count greater than one, use the three settings explained in the next section instead of just the scaling delay setting.
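The effect of the keep-alive window can be illustrated with a short simulation. This is a simplification, not Baseten's implementation: it assumes a single replica that scales to zero whenever the gap since the previous request exceeds the scaling delay.

```python
# Illustration of how a keep-alive window reduces cold starts: a request
# incurs a cold start only if the replica has been idle longer than the
# scaling delay since the previous request. (Simplified single-replica model.)

def count_cold_starts(request_times, scaling_delay):
    """request_times: sorted timestamps in seconds; scaling_delay: seconds."""
    cold = 0
    last = None
    for t in request_times:
        if last is None or t - last > scaling_delay:
            cold += 1  # replica had scaled to zero before this request
        last = t       # each served request resets the keep-alive window
    return cold

# A burst of traffic, a long lull, then another burst
bursts = [0, 30, 60, 4000, 4030]
print(count_cold_starts(bursts, scaling_delay=600))  # 2
```

With a 600-second scaling delay, only the first request of each burst pays the cold start time; every request within the window hits a warm replica.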

Autoscaling to multiple replicas

Autoscaling to multiple replicas lets you handle more concurrent requests
All three configuration options apply when scaling beyond a single replica:
  • Scaling delay: The time the autoscaler waits before spinning replicas up or down.
  • Additional scale down delay: The additional time the autoscaler waits before spinning down a replica.
  • Concurrency target: The number of concurrent requests you want each replica to be responsible for handling.
These three settings, in combination, let you respond flexibly to your specific traffic patterns. The autoscaler only provisions new replicas when the moving average of requests per replica exceeds the concurrency target for the duration of the scaling delay. It waits the additional scale down delay before spinning down replicas, preventing cold starts after a brief lull in traffic.
For example, with:
  • Replica range: 1-3
  • Scaling delay: 300 seconds
  • Additional scale down delay: 600 seconds
  • Concurrency target: 3
The model hums along on a single replica during light traffic. When there are consistently four or more concurrent requests (more than three per replica) across a 5-minute (300-second) window, a second replica spins up. If concurrent requests climb to seven or more (again exceeding three per replica), a third replica spins up. Beyond that, no more replicas are added no matter how much traffic arrives, as the maximum replica count has been reached. As traffic falls, the third replica spins down once concurrent requests stay at or below six for a 15-minute (300 + 600 second) window, and the second replica spins down once concurrent requests stay at or below three for another 15-minute window, leaving a single replica.

Model resources and usage-based billing

Model resources are billed according to our usage-based pricing. You can view your usage and bills in your workspace billing tab.
To save money, you can:
  • Make sure your model is running on the least expensive instance type that can reliably handle inference.
  • Set the minimum replicas to zero to allow scale-to-zero when your model is not in use.
  • Deactivate model versions when they are not needed, as inactive versions can't consume resources.
If you have any questions about your bill or the potential cost of operating a model, please contact us.