Instance type reference
Choosing the right resources for your model inference workload requires carefully balancing performance and cost. This page lists every instance type currently available on Baseten to help you pick the best fit for serving your model.
CPU-only instance reference
Instances with no GPU start at $0.00058 per minute.
Available instance types
Instance | Cost/minute | vCPU | RAM |
---|---|---|---|
1x2 | $0.00058 | 1 | 2 GiB |
1x4 | $0.00086 | 1 | 4 GiB |
2x8 | $0.00173 | 2 | 8 GiB |
4x16 | $0.00346 | 4 | 16 GiB |
8x32 | $0.00691 | 8 | 32 GiB |
16x64 | $0.01382 | 16 | 64 GiB |
What can it run?
CPU-only instances are a cost-effective way to run inference on a variety of models like:
- Many `transformers` pipeline models, such as the text classification pipeline from the Truss quickstart, run well on the smallest instance, the 1x2 (see the sketch after this list).
- Smaller extractive question answering models like LayoutLM Document QA run well on the midsize 4x16 instance.
- Many text embedding models, like this sentence transformers model, don't need a GPU to run. Pick a 4x16 or larger instance for best performance, especially when creating embeddings for a larger corpus of text.
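For a concrete picture of these workloads, here is a minimal sketch using the Hugging Face `transformers` and `sentence-transformers` libraries; the model choices are illustrative defaults, not Baseten-specific recommendations:

```python
from transformers import pipeline
from sentence_transformers import SentenceTransformer

# A default text classification pipeline is small enough for the 1x2 instance.
classifier = pipeline("text-classification")
print(classifier("Deploying this model was painless."))

# Embedding models also run CPU-only; a 4x16 instance handles larger corpora.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(["First document.", "Second document."])
print(embeddings.shape)  # (2, 384) for this model
```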
GPU instance reference
Available instance types
Instance | Cost/minute | vCPU | RAM | GPU | VRAM |
---|---|---|---|---|---|
T4x4x16 | $0.01052 | 4 | 16 GiB | NVIDIA T4 | 16 GiB |
T4x8x32 | $0.01504 | 8 | 32 GiB | NVIDIA T4 | 16 GiB |
T4x16x64 | $0.02408 | 16 | 64 GiB | NVIDIA T4 | 16 GiB |
A10Gx4x16 | $0.02012 | 4 | 16 GiB | NVIDIA A10 | 24 GiB |
A10Gx8x32 | $0.02424 | 8 | 32 GiB | NVIDIA A10 | 24 GiB |
A10Gx16x64 | $0.03248 | 16 | 64 GiB | NVIDIA A10 | 24 GiB |
A10G:2x24x96 | $0.05672 | 24 | 96 GiB | 2 NVIDIA A10s | 48 GiB |
A10G:4x48x192 | $0.11344 | 48 | 192 GiB | 4 NVIDIA A10s | 96 GiB |
A10G:8x192x768 | $0.32576 | 192 | 768 GiB | 8 NVIDIA A10s | 192 GiB |
V100x8x61 | $0.06120 | 8 | 61 GiB | NVIDIA V100 | 16 GiB |
A100x12x144 | $0.10240 | 12 | 144 GiB | 1 NVIDIA A100 | 80 GiB |
A100:2x24x288 | $0.20480 | 24 | 288 GiB | 2 NVIDIA A100s | 160 GiB |
A100:3x36x432 | $0.30720 | 36 | 432 GiB | 3 NVIDIA A100s | 240 GiB |
A100:4x48x576 | $0.40960 | 48 | 576 GiB | 4 NVIDIA A100s | 320 GiB |
A100:5x60x720 | $0.51200 | 60 | 720 GiB | 5 NVIDIA A100s | 400 GiB |
A100:6x72x864 | $0.61440 | 72 | 864 GiB | 6 NVIDIA A100s | 480 GiB |
A100:7x84x1008 | $0.71680 | 84 | 1008 GiB | 7 NVIDIA A100s | 560 GiB |
A100:8x96x1152 | $0.81920 | 96 | 1152 GiB | 8 NVIDIA A100s | 640 GiB |
NVIDIA T4
Instances with an NVIDIA T4 GPU start at $0.01052 per minute.
GPU specs
The T4 is a Turing-series GPU with:
- 2,560 CUDA cores
- 320 Tensor cores
- 16 GiB VRAM
Available instance types
Instance | Cost/minute | vCPU | RAM | GPU | VRAM |
---|---|---|---|---|---|
T4x4x16 | $0.01052 | 4 | 16 GiB | NVIDIA T4 | 16 GiB |
T4x8x32 | $0.01504 | 8 | 32 GiB | NVIDIA T4 | 16 GiB |
T4x16x64 | $0.02408 | 16 | 64 GiB | NVIDIA T4 | 16 GiB |
What can it run?
T4-equipped instances can run inference for models like:
- Whisper, transcribing 5 minutes of audio in 31.4 seconds with Whisper small.
- While the T4’s 16 GiB of VRAM is insufficient for 7-billion-parameter LLMs in fp16, it can run smaller 3-billion-parameter models like StableLM (see the sizing sketch after this list).
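The VRAM ceiling above is mostly arithmetic: fp16 weights take 2 bytes per parameter, before any space for the KV cache and activations. A back-of-envelope sizing check (the 25% headroom factor is an assumption for illustration, not a measured figure):

```python
def fits_in_vram(params_billion: float, vram_gib: float,
                 bytes_per_param: int = 2, overhead: float = 1.25) -> bool:
    """Rough check: fp16 weights plus ~25% headroom for the KV cache
    and activations (the overhead factor is an assumption)."""
    weights_gib = params_billion * 1e9 * bytes_per_param / 2**30
    return weights_gib * overhead <= vram_gib

print(fits_in_vram(7, 16))  # False: ~13 GiB of weights leaves too little headroom
print(fits_in_vram(3, 16))  # True: a 3B model like StableLM fits comfortably
```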
NVIDIA A10
Instances with the NVIDIA A10 GPU start at $0.02012 per minute.
GPU specs
The A10 is an Ampere-series GPU with:
- 9,216 CUDA cores
- 288 Tensor cores
- 24 GiB VRAM
- 600 GB/s memory bandwidth
This lets the card reach 125 teraFLOPS of fp16 compute, the most common precision for serving large language models.
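Memory bandwidth matters because single-stream LLM decoding is typically bandwidth-bound: each generated token has to read every weight once, so bandwidth divided by weight size gives a rough ceiling on generation speed. A sketch under those simplifying assumptions (real throughput also depends on batching, kernels, and KV cache traffic):

```python
def max_tokens_per_second(params_billion: float, bandwidth_gb_s: float,
                          bytes_per_param: int = 2) -> float:
    """Upper bound for single-stream decode: each token reads all fp16
    weights once, so speed <= bandwidth / total weight bytes."""
    weight_gb = params_billion * bytes_per_param  # billions of params * bytes
    return bandwidth_gb_s / weight_gb

# A 7B fp16 model on an A10 (600 GB/s): at most ~43 tokens/second per stream.
print(round(max_tokens_per_second(7, 600), 1))  # 42.9
```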
Available instance types
Instance | Cost/minute | vCPU | RAM | GPU | VRAM |
---|---|---|---|---|---|
A10Gx4x16 | $0.02012 | 4 | 16 GiB | NVIDIA A10 | 24 GiB |
A10Gx8x32 | $0.02424 | 8 | 32 GiB | NVIDIA A10 | 24 GiB |
A10Gx16x64 | $0.03248 | 16 | 64 GiB | NVIDIA A10 | 24 GiB |
A10G:2x24x96 | $0.05672 | 24 | 96 GiB | 2 NVIDIA A10s | 48 GiB |
A10G:4x48x192 | $0.11344 | 48 | 192 GiB | 4 NVIDIA A10s | 96 GiB |
A10G:8x192x768 | $0.32576 | 192 | 768 GiB | 8 NVIDIA A10s | 192 GiB |
What can it run?
Single A10s are great for running 7-billion-parameter LLMs, and multi-A10 instances can work together to run larger models.
A10-equipped instances can run inference for models like:
- Most 7-billion-parameter LLMs, such as Llama-2-chat 7B, in fp16 precision.
- Stable Diffusion in 1.77 seconds for 50 steps and Stable Diffusion XL in 6 seconds for 20 steps (a minimal invocation is sketched after this list).
- Whisper, transcribing 5 minutes of audio in 23.9 seconds with Whisper small.
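As a reference point for the Stable Diffusion entry above, a minimal sketch using the `diffusers` library (model ID and prompt are illustrative):

```python
import torch
from diffusers import StableDiffusionPipeline

# fp16 weights keep the pipeline well within the A10's 24 GiB of VRAM.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("an astronaut riding a horse", num_inference_steps=50).images[0]
image.save("astronaut.png")
```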
NVIDIA V100
Instances with the NVIDIA V100 GPU start at $0.06120 per minute.
GPU specs
The V100 is a Volta-series GPU with 16 GiB of VRAM.
Available instance types
Instance | Cost/minute | vCPU | RAM | GPU | VRAM |
---|---|---|---|---|---|
V100x8x61 | $0.06120 | 8 | 61 GiB | NVIDIA V100 | 16 GiB |
NVIDIA A100
A100s are not enabled by default. Reach out to support@baseten.co to get A100s enabled for your workspace.
Instances with the NVIDIA A100 GPU start at $0.10240 per minute.
GPU specs
The A100 is an Ampere-series GPU with:
- 6,912 CUDA cores
- 432 Tensor cores
- 80 GiB VRAM
- 1,935 GB/s memory bandwidth
This lets the card reach 312 teraFLOPS of fp16 compute, the most common precision for serving large language models.
Available instance types
Instance | Cost/minute | vCPU | RAM | GPU | VRAM |
---|---|---|---|---|---|
A100x12x144 | $0.10240 | 12 | 144 GiB | 1 NVIDIA A100 | 80 GiB |
A100:2x24x288 | $0.20480 | 24 | 288 GiB | 2 NVIDIA A100s | 160 GiB |
A100:3x36x432 | $0.30720 | 36 | 432 GiB | 3 NVIDIA A100s | 240 GiB |
A100:4x48x576 | $0.40960 | 48 | 576 GiB | 4 NVIDIA A100s | 320 GiB |
A100:5x60x720 | $0.51200 | 60 | 720 GiB | 5 NVIDIA A100s | 400 GiB |
A100:6x72x864 | $0.61440 | 72 | 864 GiB | 6 NVIDIA A100s | 480 GiB |
A100:7x84x1008 | $0.71680 | 84 | 1008 GiB | 7 NVIDIA A100s | 560 GiB |
A100:8x96x1152 | $0.81920 | 96 | 1152 GiB | 8 NVIDIA A100s | 640 GiB |
What can it run?
A100s are the largest and most powerful GPUs currently available on Baseten. They’re great for large language models, high-performance image generation, and other demanding tasks.
A100-equipped instances can run inference for models like:
- Most 13-billion-parameter LLMs, such as Llama-2-chat 13B, in fp16 precision.
- Stable Diffusion in 0.89 seconds for 50 steps and Stable Diffusion XL in 1.92 seconds for 20 steps (with `torch.compile` and max autotune).
- Most 70-billion-parameter LLMs, such as Llama-2-chat 70B, in fp16 precision, on 2 A100s (see the sharding sketch after this list).
- The 180-billion-parameter LLM Falcon 180B, in fp16 precision, on 5 A100s.
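Running a 70B model on an A100:2 instance means sharding the weights across both GPUs. One common approach is `transformers` with `device_map="auto"`, which uses Accelerate to place layers automatically; a sketch, assuming access to the gated Llama 2 checkpoint on Hugging Face:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" shards ~130 GiB of fp16 weights across the two A100s'
# combined 160 GiB of VRAM on an A100:2 instance.
model_id = "meta-llama/Llama-2-70b-chat-hf"  # gated checkpoint; illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Which instance type should I pick?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```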