Model resources
View and adjust the compute resources allocated to your deployed model.
Your model deployment is initially provisioned with enough resources to support building, testing, and sharing an application. Default resources vary slightly from model to model.

Scaling parameters

Scaling parameters let you scale horizontally to handle more traffic. If many users are calling your model at the same time, you need horizontal scaling.
A replica is a single instance of a server behind a model deployment. Multiple replicas run at the same time and share the load of incoming requests against a model. A deployment has a minimum of one replica (the original deployment) but can dynamically scale up to its maximum number of replicas to handle increased traffic. When the load decreases, the extra replicas are spun down to reduce infrastructure costs.
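The scaling behavior described above can be sketched roughly as follows. This is an illustrative Python snippet, not Baseten's actual autoscaler; the function name and its inputs (current concurrent requests, per-replica parallelism, and the configured replica bounds) are assumptions made for the example.

```python
import math


def desired_replicas(concurrent_requests: int,
                     parallelism: int,
                     min_replicas: int = 1,
                     max_replicas: int = 4) -> int:
    """Pick a replica count that can absorb the current load.

    Each replica handles `parallelism` requests at once, so the raw need is
    ceil(requests / parallelism), clamped to the configured minimum and
    maximum replica counts.
    """
    needed = math.ceil(concurrent_requests / max(parallelism, 1))
    return max(min_replicas, min(needed, max_replicas))


# With parallelism=2 and a maximum of 4 replicas:
print(desired_replicas(1, 2))   # 1 -- light traffic stays at the minimum
print(desired_replicas(7, 2))   # 4 -- heavy traffic scales out to the max
```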
Parallelism specifies how many requests a single replica can process concurrently. Processing requests in parallel reduces latency because each request spends less time waiting in the task queue.
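To see why parallelism reduces queue time, consider this toy simulation. It is purely illustrative (the one-second "inference" sleep and the thread pool stand in for real model serving) and is not how Baseten schedules requests internally.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(i: int) -> int:
    time.sleep(1)  # stand-in for one second of model inference
    return i


def serve(num_requests: int, parallelism: int) -> float:
    """Run `num_requests` through a pool of `parallelism` workers and
    return the total wall-clock time."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        list(pool.map(handle_request, range(num_requests)))
    return time.time() - start


# 8 one-second requests: ~8s when handled one at a time,
# ~2s when the replica can process 4 requests concurrently.
print(serve(8, parallelism=1))
print(serve(8, parallelism=4))
```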

Model resources

Model resources let you scale vertically to handle more processing per request. If each request involves a difficult computation, you need vertical scaling.
CPU cores determine the processing power available to your deployed model. Thanks to Baseten's cluster architecture, you don't need to provision full CPU cores. While you probably shouldn't provision 1/1000th of a CPU core, the architecture allows for that level of granularity.
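Because the platform is Kubernetes-based, fractional CPU is conventionally expressed in millicores, where 1000m equals one full core (so 1m is the 1/1000th of a core mentioned above). The spec below is a hypothetical illustration of that notation only, not a literal Baseten configuration.

```python
# Hypothetical resource request, illustrating millicore notation only.
resources = {
    "cpu": "500m",      # half a CPU core (500/1000 millicores)
    "memory": "512Mi",  # 512 mebibytes
}
```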
Memory is like the RAM on your computer and determines the amount of data the model can access while running. Memory is measured in mebibytes (MiB, 2^20 bytes), which are about 5% larger than megabytes (MB, 10^6 bytes).
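The roughly 5% difference follows directly from the definitions:

```python
mebibyte = 2 ** 20  # 1,048,576 bytes (MiB)
megabyte = 10 ** 6  # 1,000,000 bytes (MB)

# A mebibyte is about 4.9% larger than a megabyte.
print(f"{(mebibyte / megabyte - 1):.1%}")  # 4.9%
```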
By default, model deployments don't have access to a GPU or other hardware accelerator. For some types of computation, a GPU can be orders of magnitude faster than a CPU. If your model is specifically optimized to run on a GPU, one can be added to your deployment.
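If your model does run on a GPU, a common pattern (shown here with PyTorch as an assumed example framework, not a Baseten requirement) is to fall back to CPU when no accelerator is available, so the same code works whether or not a GPU is attached to the deployment:

```python
import torch

# Use the GPU when the deployment has one attached, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Linear(16, 4).to(device)   # toy model for illustration
batch = torch.randn(8, 16, device=device)   # inputs must live on the same device
with torch.no_grad():
    output = model(batch)
print(output.shape, device)                 # torch.Size([8, 4]) cuda/cpu
```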
Baseten's Kubernetes-based architecture is highly scalable for demanding models and high-traffic applications. If you need more resources for your deployed model, please contact us.