Every ML model served on Baseten runs on an “instance,” meaning a set of hardware allocated to the model server. You can choose how much hardware is allocated to your model server by choosing an instance type for your model.
Picking the right model server instance is all about making smart tradeoffs:
- If you don’t allocate enough resources, model deployment and inference will be slow or may fail altogether.
- If you pick an instance that’s too big, you pay for capacity your model never uses.
This document will help you navigate that tradeoff and pick the appropriate instance when deploying your model.
- Instance: a fixed set of hardware for running ML model inference.
- vCPU: virtual CPU cores for general computing tasks.
- RAM: memory for the CPU.
- GPU: the graphics card for ML inference tasks.
- VRAM: specialized memory attached to the GPU.
Setting model server resources
There are two ways to specify model server resources:
- Before initial deployment in your Truss.
- After initial deployment in the Baseten UI.
In your Truss
You can specify resources in your Truss. You must configure these resources before running truss push on the Truss for the first time; any changes to the resources field after the first deployment will be ignored.
Here’s an example for Stable Diffusion XL:
resources:
  accelerator: A10G
  cpu: "4"
  memory: 16Gi
  use_gpu: true
On deployment, your model will be assigned the smallest and cheapest available instance type that satisfies the resource constraints. For example, for resources.cpu, a Truss that specifies "4" will be assigned a 4-core instance, while specifying "8" will yield an 8-core instance.
The Gi suffix in resources.memory stands for gibibytes, which are slightly larger than gigabytes: 1 GiB is 2^30 bytes, about 7% more than 1 GB (10^9 bytes).
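To make the "smallest and cheapest instance that satisfies the constraints" rule concrete, here is a minimal Python sketch of one way such a selection could work. It is not Baseten's actual implementation: the InstanceType class, the CATALOG entries, their names and prices, and the pick_instance and parse_memory_gib helpers are all hypothetical and exist only for illustration. The sketch also assumes memory is always given with a Gi suffix.

```python
from dataclasses import dataclass
from typing import Optional

GIB = 2**30  # bytes in one gibibyte
GB = 10**9   # bytes in one gigabyte


@dataclass(frozen=True)
class InstanceType:
    name: str                   # hypothetical label, not a real instance name
    vcpus: int
    memory_gib: int
    accelerator: Optional[str]  # None means CPU-only
    price: float                # made-up relative price


# Hypothetical catalog for illustration only; real instance types differ.
CATALOG = [
    InstanceType("cpu-1x2", 1, 2, None, 1.0),
    InstanceType("cpu-4x16", 4, 16, None, 4.0),
    InstanceType("a10g-4x16", 4, 16, "A10G", 20.0),
    InstanceType("a10g-8x32", 8, 32, "A10G", 25.0),
]


def parse_memory_gib(memory: str) -> int:
    """Parse a value such as "16Gi" into whole gibibytes (only "Gi" is handled)."""
    if not memory.endswith("Gi"):
        raise ValueError(f"unsupported memory value: {memory!r}")
    return int(memory[:-2])


def pick_instance(resources: dict, catalog=CATALOG) -> InstanceType:
    """Return the cheapest catalog entry that meets every requested resource."""
    cpu = int(resources["cpu"])
    memory_gib = parse_memory_gib(resources["memory"])
    accelerator = resources.get("accelerator")  # None selects CPU-only entries
    candidates = [
        inst for inst in catalog
        if inst.vcpus >= cpu
        and inst.memory_gib >= memory_gib
        and inst.accelerator == accelerator
    ]
    if not candidates:
        raise ValueError("no instance type satisfies the requested resources")
    return min(candidates, key=lambda inst: inst.price)


# The Stable Diffusion XL resources from the example config.
sdxl = {"accelerator": "A10G", "cpu": "4", "memory": "16Gi", "use_gpu": True}
print(pick_instance(sdxl).name)                  # a10g-4x16
print(pick_instance({**sdxl, "cpu": "8"}).name)  # a10g-8x32
print(f"16 GiB = {16 * GIB / GB:.2f} GB")        # 16 GiB = 17.18 GB
```

In this sketch, raising cpu from "4" to "8" moves the selection to the next larger entry, mirroring the 4-core versus 8-core example above. The last line shows why the Gi suffix matters: 16 GiB is a little over 17 GB.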
In the model dashboard
After the model has been deployed, the only way to update the instance type it uses is in the model dashboard on Baseten.
For more information on picking the right model resources, see the instance type reference.