Model resources

Allocate GPUs and other resources to your deployed models
This doc exists to help you answer two questions:
  1. Vertical scale: What is the least expensive instance type that can reliably and performantly run your model?
  2. Horizontal scale: How can you handle variable traffic cost-effectively while providing your desired response times?

Configuring model resources

Baseten gives you flexibility in configuring model resources to match your unique requirements. Through vertical scaling (increasing instance size) and horizontal scaling (increasing replica count), you can handle even the most demanding production workloads.
The model resources panel shows essential resource information

In your Truss

If you know your model resource requirements and are using Truss to package your model, you can specify your model resources directly in your model's config.yaml.
Resources from config.yaml are only read when creating a new model, not when creating a new version of an existing model or updating a draft model.
Here's an example config:
resources:
  cpu: "3"
  memory: 14Gi
  use_gpu: true
  accelerator: A10G
This config specifies the minimum resource requirements for the model. When deployed, the model will be placed on the least expensive instance that meets or exceeds those requirements. In this example, the model would be run on an A10Gx4x16 instance, providing the configured A10 GPU and slightly exceeding the minimum vCPU and RAM requirements.
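Conceptually, this placement logic can be sketched as a filter-then-minimize over an instance catalog. The instance names and hourly prices below are made up for illustration and are not Baseten's actual catalog or pricing:

```python
# Hypothetical instance catalog: (name, vCPUs, memory GiB, GPU, hourly cost).
# Values are illustrative only.
INSTANCES = [
    ("CPUx4x16", 4, 16, None, 0.20),
    ("A10Gx4x16", 4, 16, "A10G", 1.21),
    ("A10Gx8x32", 8, 32, "A10G", 1.40),
]

def pick_instance(cpu, memory_gib, accelerator=None):
    """Return the cheapest instance that meets or exceeds the requirements."""
    candidates = [
        inst for inst in INSTANCES
        if inst[1] >= cpu and inst[2] >= memory_gib and inst[3] == accelerator
    ]
    if not candidates:
        raise ValueError("no instance satisfies the requirements")
    return min(candidates, key=lambda inst: inst[4])
```

With the example config above (3 vCPUs, 14 GiB, A10G), both A10G instances qualify, and the cheaper A10Gx4x16 wins.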

From the model page

In the bottom-left corner of a model's page, you can view the resources allocated to the model. To modify the model's resources, click "Update resources" and a modal will appear.
The resource configuration modal gives you options and expected costs

Vertical scale: instance type

In the modal, you can select the right instance type for your model. If your model does not require a GPU to run inference, turn off the "Add a GPU" toggle to see instance types without attached GPUs, which are substantially less expensive.
If you do need a GPU, you can pick between the T4 and A10. This choice depends on your model requirements. Here's a resource comparing the T4 and A10 GPUs.
Your instance type also sets your CPU count and RAM. How much you need of each depends on your model and its inputs and outputs. Multiple instance sizes are available for both T4 and A10 GPUs.
If your model has multiple active versions, updating the model resources for one version will update all other active versions to use the same resources.

Horizontal scale: autoscaling replicas

When traffic to your model is variable, autoscaling horizontally scales your model by adding and removing replicas (identical copies of your instance), letting you handle more concurrent requests when demand is high and save on infrastructure costs when demand is low.

Scale to zero

If you're just testing your model or anticipate light and inconsistent traffic, scale to zero can save you substantial amounts of money. Scale to zero is enabled by default for new models deployed to Baseten.
To reduce the frequency of cold starts, you can adjust the scaling delay to keep your model alive for up to an hour after it receives its most recent request.

Cold starts

When a model is scaled to zero and receives a request, the model server needs to scale back up before inference can happen. This delay is called the "cold start" time.

Network accelerator

We developed a network accelerator to speed up model loads from common model artifact stores, including HuggingFace, CloudFront, S3, and OpenAI. Our accelerator employs byte range downloads in the background to maximize the parallelism of downloads. If you prefer to disable this network acceleration for your Baseten workspace, please contact our support team at [email protected].
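The byte-range idea can be illustrated with standard HTTP `Range` headers. This is a minimal sketch of the general technique, not Baseten's implementation; the chunk size and worker count are arbitrary placeholders:

```python
import concurrent.futures
import urllib.request

def byte_ranges(total_size, chunk_size):
    """Split a file of total_size bytes into inclusive (start, end) ranges."""
    return [
        (start, min(start + chunk_size, total_size) - 1)
        for start in range(0, total_size, chunk_size)
    ]

def fetch_range(url, start, end):
    """Download one chunk of the file via an HTTP Range request."""
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def parallel_download(url, total_size, chunk_size=8 * 1024 * 1024, workers=8):
    """Fetch all chunks concurrently, then reassemble them in order."""
    ranges = byte_ranges(total_size, chunk_size)
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(lambda r: fetch_range(url, *r), ranges)
    return b"".join(chunks)
```

Because each range is independent, the chunks can download in parallel and saturate available bandwidth, which is why this approach speeds up large model weight downloads.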

Cold start pods

To shorten cold start times, we spin up dedicated pods that accelerate model loading; these pods are not counted toward your ordinary model resources. You may see these pods in your metrics and see logs from them formatted as coldboost: <log record>.

Autoscaling to multiple replicas

Autoscaling to multiple replicas lets you handle more concurrent requests
All three configuration options apply when scaling beyond a single replica:
  • Scaling delay (default: 20 min): The time the autoscaler waits before spinning replicas up or down.
  • Additional scale down delay (default: 0 min): The additional time the autoscaler waits before spinning down a replica.
  • Concurrency target (default: 3 requests): The number of concurrent requests you want each replica to be responsible for handling.
These three settings, in combination, allow flexible response to your specific traffic patterns. Draft models respect model-level scaling delay for scale-to-zero but cannot scale past one replica.
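The core replica math behind the concurrency target can be sketched as in-flight requests divided by the target, bounded by your replica limits. This is an illustrative simplification (the scaling delays govern *when* a change is applied, which this sketch omits), and the default bounds here are assumptions:

```python
import math

def desired_replicas(concurrent_requests, concurrency_target=3,
                     min_replicas=0, max_replicas=10):
    """Replicas needed so each handles at most concurrency_target requests."""
    if concurrent_requests == 0:
        # With no traffic, fall back to the floor (zero enables scale to zero).
        return max(min_replicas, 0)
    needed = math.ceil(concurrent_requests / concurrency_target)
    return min(max(needed, min_replicas, 1), max_replicas)
```

For example, 10 concurrent requests with the default target of 3 calls for 4 replicas, while a burst of 100 requests is capped at the maximum replica count.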

Model resources and usage-based billing

Model resources are billed according to our usage-based pricing. You can view your usage and bills in your workspace billing tab.
To save money, you can:
  • Make sure your model is running on the least expensive instance type that can reliably handle your model's invocations.
  • Set the minimum replicas to zero to allow scale-to-zero when your model is not in use.
  • Deactivate model versions when they are not needed, as inactive versions don't consume resources.
If you have any questions about your bill or the potential cost of operating a model, please contact us.