Model resources

View and adjust the compute resources allocated to your deployed model.
Model resources are only configurable in workspaces with paid plans. For free workspaces, all models are deployed on a free instance with 1 vCPU and 2 GiB of memory.
By default, models are deployed on a single instance with 1 vCPU and 2 GiB of memory. This configuration is sufficient for testing most models and serving small workloads, but for larger models and robust production workloads, you can configure additional resources.
The model resources panel shows essential resource information.
Model resources include instance type, GPU, and replica range. Any change to model resources is applied to all active versions of that model.
To configure model resources, click "Update resources" in the bottom left corner to launch the configuration modal.
The resource configuration modal gives you options and expected costs.

Understanding model resources

Baseten gives you flexibility in configuring model resources to match your unique requirements. Through vertical scaling (increasing instance size) and horizontal scaling (increasing replica count), you can handle even the most demanding production workloads.

Vertical scaling

Choosing a more powerful instance type lets you scale vertically, handling more load. If invoking your model is computationally expensive, or you have high sustained traffic, you need vertical scaling.
Instance sizing is based on two factors:
  • vCPUs determine the processing power available to your deployed model.
  • Memory determines the amount of data the model can access while running. Memory is measured in GiB (gibibytes: 2^30 bytes), which are about 7% larger than GB (gigabytes: 10^9 bytes).
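That roughly-7% figure follows directly from the two definitions. A quick check (the constants below are just the standard definitions of GiB and GB):

```python
# GiB (2**30 bytes) vs GB (10**9 bytes): how much larger is a GiB?
GIB = 2**30  # gibibyte
GB = 10**9   # gigabyte

ratio = GIB / GB
print(f"1 GiB = {ratio:.4f} GB (about {100 * (ratio - 1):.1f}% larger)")
# prints: 1 GiB = 1.0737 GB (about 7.4% larger)
```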

GPU inference

Model inference can run on a CPU or GPU depending on the model. For models that require GPU inference, you can select an instance with GPU resources. To do so, toggle on "Enable a GPU" and select an appropriate instance from the instance list.
GPU-equipped instances measure GPU and CPU memory separately.
If you're unsure how to configure your model resources to handle anticipated traffic, please contact us.

Horizontal scaling (Autoscaling)

Scaling parameters let you scale horizontally, handling more traffic. If you have spikes of usage, you need horizontal scaling.
Horizontal scaling is handled by autoscaling replicas. A replica is a single instance of a server behind a model deployment. Multiple replicas run at the same time and share the load of requests to a model. Deployments have a minimum of one replica (the original deployment), but can dynamically scale up to the configured maximum number of replicas to handle increased traffic. When the load decreases, the extra replicas are spun down to reduce your infrastructure costs.
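The scale-up/scale-down behavior described above can be sketched as a simple policy: pick enough replicas to keep per-replica load near a target, clamped to the configured range. This is an illustration only; the function, its inputs, and the parameter names are assumptions for the sketch, not Baseten's actual autoscaling implementation.

```python
import math

def desired_replicas(in_flight_requests: int,
                     target_per_replica: int,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Hypothetical policy: enough replicas to keep load per replica
    near the target, clamped to the configured minimum and maximum."""
    needed = math.ceil(in_flight_requests / target_per_replica) if in_flight_requests else 0
    return max(min_replicas, min(needed, max_replicas))

print(desired_replicas(0, 10, 1, 4))    # idle: scaled down to the minimum -> 1
print(desired_replicas(25, 10, 1, 4))   # moderate load -> 3
print(desired_replicas(100, 10, 1, 4))  # traffic spike, clamped at the maximum -> 4
```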

Usage-based billing

Additional compute allocated to your models is billed according to our usage-based pricing plan. You're only billed for active versions of your models, not pre-trained models.
A couple of notes on cost savings:
  • While prices are listed per hour for convenience, usage-based billing is calculated by the minute. So if you only need to spin something up for fifteen minutes, you won't be billed for the rest of the hour.
  • You can save money by deactivating model versions when they are not needed, as inactive versions don't consume resources.

How model resource billing is calculated

Let's say, for the sake of easy math, that you're using an instance that costs 10 cents per hour. Your model has two active versions (a primary and alternate version) and will run for a 720-hour average month. Multiplying these together:
$0.10/hour × 2 versions × 720 hours = $144

How autoscaling affects bills

However, you also have autoscaling replicas set up to handle spikes in traffic. Most of the time, the model only requires a single replica, which is why the estimated cost only assumes the minimum configured replica count.
Let's imagine that during the month, the primary version of the model has 100 hours of elevated traffic that require four replicas to serve. In that case, the bill would be:
$0.10/hour × 2 versions × 720 hours = $144
+ $0.10/hour × 1 version × 3 extra replicas × 100 hours = $30
= $174
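The whole worked example can be reproduced in a few lines (the rate, version count, hours, and replica figures are the ones from this example, not real prices):

```python
rate = 0.10        # dollars per instance-hour (example price)
versions = 2       # active versions, each running its 1 minimum replica
month_hours = 720  # average month

base = rate * versions * month_hours  # baseline cost for the month: $144
burst = rate * 1 * 3 * 100            # 3 extra replicas on one version for 100 hours: $30
total = base + burst
print(f"${total:.0f}")                # prints: $174
```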
If you have any questions about your bill or the potential cost of operating a model, please contact us.