Instance type reference
Specs and recommendations for every instance type on Baseten
Choosing the right resources for your model inference workload requires carefully balancing performance and cost. This page lists every instance type currently available on Baseten to help you pick the best fit for serving your model.
CPU-only instance reference
Instances with no GPU start at $0.00058 per minute.
Available instance types
Instance | Cost/minute | vCPU | RAM |
---|---|---|---|
1×2 | $0.00058 | 1 | 2 GiB |
1×4 | $0.00086 | 1 | 4 GiB |
2×8 | $0.00173 | 2 | 8 GiB |
4×16 | $0.00346 | 4 | 16 GiB |
8×32 | $0.00691 | 8 | 32 GiB |
16×64 | $0.01382 | 16 | 64 GiB |
What can it run?
CPU-only instances are a cost-effective way to run inference on a variety of models like:
- Many `transformers` pipeline models, such as the Text classification pipeline from the Truss quickstart, run well on the smallest instance, the 1x2 (see the sketch after this list).
- Smaller extractive question answering models like LayoutLM Document QA run well on the midsize 4x16 instance.
- Many text embedding models, like this sentence transformers model, don’t need a GPU to run. Pick a 4x16 or larger instance for best performance, especially when creating embeddings for a larger corpus of text.
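As a concrete example, the sketch below runs a `transformers` text classification pipeline entirely on CPU, a workload that fits on the 1x2 instance. This is a minimal sketch: the checkpoint name is just one common small classification model, and any similarly sized pipeline behaves the same way.

```python
# Minimal CPU-only inference with a transformers pipeline.
# The checkpoint is illustrative; any small text classification
# model runs similarly on a 1x2 instance.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=-1,  # -1 pins the pipeline to CPU
)

print(classifier("This instance type is a great fit for my model."))
# [{'label': 'POSITIVE', 'score': 0.99...}]
```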
GPU instance reference
Available instance types
Instance | Cost/minute | vCPU | RAM | GPU | VRAM |
---|---|---|---|---|---|
T4x4x16 | $0.01052 | 4 | 16 GiB | NVIDIA T4 | 16 GiB |
T4x8x32 | $0.01504 | 8 | 32 GiB | NVIDIA T4 | 16 GiB |
T4x16x64 | $0.02408 | 16 | 64 GiB | NVIDIA T4 | 16 GiB |
L4x4x16 | $0.01414 | 4 | 16 GiB | NVIDIA L4 | 24 GiB |
L4:2x24x96 | $0.04002 | 24 | 96 GiB | 2 NVIDIA L4s | 48 GiB |
L4:4x48x192 | $0.08003 | 48 | 192 GiB | 4 NVIDIA L4s | 96 GiB |
A10Gx4x16 | $0.02012 | 4 | 16 GiB | NVIDIA A10G | 24 GiB |
A10Gx8x32 | $0.02424 | 8 | 32 GiB | NVIDIA A10G | 24 GiB |
A10Gx16x64 | $0.03248 | 16 | 64 GiB | NVIDIA A10G | 24 GiB |
A10G:2x24x96 | $0.05672 | 24 | 96 GiB | 2 NVIDIA A10Gs | 48 GiB |
A10G:4x48x192 | $0.11344 | 48 | 192 GiB | 4 NVIDIA A10Gs | 96 GiB |
A10G:8x192x768 | $0.32576 | 192 | 768 GiB | 8 NVIDIA A10Gs | 192 GiB |
V100x8x61 | $0.06120 | 8 | 61 GiB | NVIDIA V100 | 16 GiB |
A100x12x144 | $0.10240 | 12 | 144 GiB | 1 NVIDIA A100 | 80 GiB |
A100:2x24x288 | $0.20480 | 24 | 288 GiB | 2 NVIDIA A100s | 160 GiB |
A100:3x36x432 | $0.30720 | 36 | 432 GiB | 3 NVIDIA A100s | 240 GiB |
A100:4x48x576 | $0.40960 | 48 | 576 GiB | 4 NVIDIA A100s | 320 GiB |
A100:5x60x720 | $0.51200 | 60 | 720 GiB | 5 NVIDIA A100s | 400 GiB |
A100:6x72x864 | $0.61440 | 72 | 864 GiB | 6 NVIDIA A100s | 480 GiB |
A100:7x84x1008 | $0.71680 | 84 | 1008 GiB | 7 NVIDIA A100s | 560 GiB |
A100:8x96x1152 | $0.81920 | 96 | 1152 GiB | 8 NVIDIA A100s | 640 GiB |
H100x26x234 | $0.16640 | 26 | 234 GiB | 1 NVIDIA H100 | 80 GiB |
H100:2x52x468 | $0.33280 | 52 | 468 GiB | 2 NVIDIA H100s | 160 GiB |
H100:4x104x936 | $0.66560 | 104 | 936 GiB | 4 NVIDIA H100s | 320 GiB |
H100:8x208x1872 | $1.33120 | 208 | 1872 GiB | 8 NVIDIA H100s | 640 GiB |
H100MIG:3gx13x117 | $0.08250 | 13 | 117 GiB | Fractional NVIDIA H100 | 40 GiB |
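When scanning this table, a useful first filter is whether a model’s weights fit in a given instance’s VRAM. The sketch below is a back-of-the-envelope estimate only: it counts weight memory and ignores activations, the KV cache, and framework overhead, so treat the result as a floor.

```python
# Rough VRAM floor: parameter count times bytes per parameter.
# Ignores activations, KV cache, and framework overhead.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_gib(params_billions: float, precision: str = "fp16") -> float:
    """Approximate GiB needed just to hold the model weights."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 2**30

print(f"7B fp16:  {weight_gib(7):.1f} GiB")   # ~13 GiB: fits a 24 GiB A10G or L4
print(f"70B fp16: {weight_gib(70):.1f} GiB")  # ~130 GiB: needs 2 A100s or 2 H100s
```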
NVIDIA T4
Instances with an NVIDIA T4 GPU start at $0.01052 per minute.
GPU specs
The T4 is a Turing-series GPU with:
- 2,560 CUDA cores
- 320 Tensor cores
- 16 GiB VRAM
Available instance types
Instance | Cost/minute | vCPU | RAM | GPU | VRAM |
---|---|---|---|---|---|
T4x4x16 | $0.01052 | 4 | 16 GiB | NVIDIA T4 | 16 GiB |
T4x8x32 | $0.01504 | 8 | 32 GiB | NVIDIA T4 | 16 GiB |
T4x16x64 | $0.02408 | 16 | 64 GiB | NVIDIA T4 | 16 GiB |
What can it run?
T4-equipped instances can run inference for models like:
- Whisper, transcribing 5 minutes of audio in 31.4 seconds with Whisper small (sketched after this list).
- While the T4’s 16 GiB of VRAM is insufficient for 7-billion-parameter LLMs, it can run smaller 3-billion-parameter models like StableLM.
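As a sketch of the Whisper workload above, the open source `whisper` package (installed via `pip install openai-whisper`) loads the small checkpoint in roughly 2 GB of VRAM, comfortably within the T4’s 16 GiB; the audio path here is a placeholder.

```python
# Transcription with Whisper small; "audio.mp3" is a placeholder path.
import whisper

model = whisper.load_model("small")  # ~2 GB of VRAM, well within a T4
result = model.transcribe("audio.mp3")
print(result["text"])
```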
NVIDIA L4
Instances with an NVIDIA L4 GPU start at $0.01414 per minute.
GPU specs
The L4 is an Ada Lovelace-series GPU with:
- 7,680 CUDA cores
- 240 Tensor cores
- 24 GiB VRAM
- 300 GiB/s Memory bandwidth
This enables the card to reach 121 teraFLOPS in fp16 operations, the most common quantization for large language models.
Available instance types
Instance | Cost/minute | vCPU | RAM | GPU | VRAM |
---|---|---|---|---|---|
L4x4x16 | $0.01414 | 4 | 16 GiB | NVIDIA L4 | 24 GiB |
L4:2x24x96 | $0.04002 | 24 | 96 GiB | 2 NVIDIA L4s | 48 GiB |
L4:4x48x192 | $0.08003 | 48 | 192 GiB | 4 NVIDIA L4s | 96 GiB |
What can it run?
The L4 is a great choice for running inference on models like Stable Diffusion XL, but its limited memory bandwidth makes it a poor fit for LLMs.
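The bandwidth limit matters because single-stream LLM decoding is memory-bound: each generated token reads every weight once, so tokens per second is capped near memory bandwidth divided by model size. The sketch below computes that theoretical ceiling; real throughput is lower, and batching changes the picture.

```python
# Roofline-style ceiling for single-stream LLM decoding:
# throughput <= memory bandwidth / bytes of weights read per token.
def max_tokens_per_sec(bandwidth_gib_s: float, params_billions: float,
                       bytes_per_param: float = 2.0) -> float:
    model_gib = params_billions * 1e9 * bytes_per_param / 2**30
    return bandwidth_gib_s / model_gib

print(f"L4:   ~{max_tokens_per_sec(300, 7):.0f} tokens/s ceiling for a 7B fp16 model")
print(f"A10G: ~{max_tokens_per_sec(600, 7):.0f} tokens/s ceiling for a 7B fp16 model")
```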
NVIDIA A10G
Instances with the NVIDIA A10G GPU start at $0.02012 per minute.
GPU specs
The A10G is an Ampere-series GPU with:
- 9,216 CUDA cores
- 288 Tensor cores
- 24 GiB VRAM
- 600 GiB/s Memory bandwidth
This enables the card to reach 70 teraFLOPS in fp16 operations, the most common quantization for large language models.
Available instance types
Instance | Cost/minute | vCPU | RAM | GPU | VRAM |
---|---|---|---|---|---|
A10Gx4x16 | $0.02012 | 4 | 16 GiB | NVIDIA A10G | 24 GiB |
A10Gx8x32 | $0.02424 | 8 | 32 GiB | NVIDIA A10G | 24 GiB |
A10Gx16x64 | $0.03248 | 16 | 64 GiB | NVIDIA A10G | 24 GiB |
A10G:2x24x96 | $0.05672 | 24 | 96 GiB | 2 NVIDIA A10Gs | 48 GiB |
A10G:4x48x192 | $0.11344 | 48 | 192 GiB | 4 NVIDIA A10Gs | 96 GiB |
A10G:8x192x768 | $0.32576 | 192 | 768 GiB | 8 NVIDIA A10Gs | 192 GiB |
What can it run?
Single A10Gs are great for running 7-billion-parameter LLMs, and multi-A10G instances can work together to run larger models.
A10G-equipped instances can run inference for models like:
- Most 7-billion-parameter LLMs, such as Mistral 7B, in fp16 precision (see the sketch after this list).
- Stable Diffusion in 1.77 seconds for 50 steps and Stable Diffusion XL in 6 seconds for 20 steps.
- Whisper, transcribing 5 minutes of audio in 23.9 seconds with Whisper small.
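As a sketch of the first item above, a 7B model in fp16 needs roughly 13 GiB for weights, leaving headroom on a 24 GiB A10G for the KV cache. This assumes the `transformers` and `accelerate` packages are installed; the checkpoint shown is one example of a 7B model.

```python
# Loading and running a 7B LLM in fp16 on a single 24 GiB A10G.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # one example 7B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("What GPU should I use?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```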
NVIDIA V100
Instances with the NVIDIA V100 GPU start at $0.06120 per minute.
GPU specs
The V100 is a Volta-series GPU with 16 GiB of VRAM.
Available instance types
Instance | Cost/minute | vCPU | RAM | GPU | VRAM |
---|---|---|---|---|---|
V100x8x61 | $0.06120 | 8 | 61 GiB | NVIDIA V100 | 16 GiB |
NVIDIA A100
Instances with the NVIDIA A100 GPU start at $0.10240 per minute.
GPU specs
The A100 is an Ampere-series GPU with:
- 6,912 CUDA cores
- 432 Tensor cores
- 80 GiB VRAM
- 1,935 GiB/s Memory bandwidth
This enables the card to reach 312 teraFLOPS in fp16 operations, the most common quantization for large language models.
Available instance types
Instance | Cost/minute | vCPU | RAM | GPU | VRAM |
---|---|---|---|---|---|
A100x12x144 | $0.10240 | 12 | 144 GiB | 1 NVIDIA A100 | 80 GiB |
A100:2x24x288 | $0.20480 | 24 | 288 GiB | 2 NVIDIA A100s | 160 GiB |
A100:3x36x432 | $0.30720 | 36 | 432 GiB | 3 NVIDIA A100s | 240 GiB |
A100:4x48x576 | $0.40960 | 48 | 576 GiB | 4 NVIDIA A100s | 320 GiB |
A100:5x60x720 | $0.51200 | 60 | 720 GiB | 5 NVIDIA A100s | 400 GiB |
A100:6x72x864 | $0.61440 | 72 | 864 GiB | 6 NVIDIA A100s | 480 GiB |
A100:7x84x1008 | $0.71680 | 84 | 1008 GiB | 7 NVIDIA A100s | 560 GiB |
A100:8x96x1152 | $0.81920 | 96 | 1152 GiB | 8 NVIDIA A100s | 640 GiB |
What can it run?
A100s are the second most powerful GPUs currently available on Baseten. They’re great for large language models, high-performance image generation, and other demanding tasks.
A100-equipped instances can run inference for models like:
- Mixtral 8x7B on a single A100 in int8 precision.
- Stable Diffusion in 0.89 seconds for 50 steps and Stable Diffusion XL in 1.92 seconds for 20 steps (with `torch.compile` and max autotune).
- Most 70-billion-parameter LLMs, such as Llama-2-chat 70B, in fp16 precision, on 2 A100s (sketched after this list).
- The 180-billion-parameter LLM Falcon 180B, in fp16 precision, on 5 A100s.
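For the multi-GPU entries above, one common approach is `accelerate`-style automatic device mapping, which shards a 70B fp16 model (roughly 130 GiB of weights) across the 160 GiB of combined VRAM on 2 A100s. A sketch, assuming `transformers` plus `accelerate` and access to the gated Llama checkpoint (any 70B model works the same way):

```python
# Sharding a 70B fp16 model across both GPUs of an A100:2 instance.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-chat-hf"  # gated; swap in any 70B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # splits layers across all visible GPUs
)
```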
NVIDIA H100
Instances with the NVIDIA H100 GPU start at $0.16640 per minute.
GPU specs
The H100 is a Hopper-series GPU with:
- 16,896 CUDA cores
- 640 Tensor cores
- 80 GiB VRAM
- 3.35 TB/s Memory bandwidth
This enables the card to reach 990 teraFLOPS in fp16 operations, the most common quantization for large language models.
Available instance types
Instance | Cost/minute | vCPU | RAM | GPU | VRAM |
---|---|---|---|---|---|
H100x26x234 | $0.16640 | 26 | 234 GiB | 1 NVIDIA H100 | 80 GiB |
H100:2x52x468 | $0.33280 | 52 | 468 GiB | 2 NVIDIA H100s | 160 GiB |
H100:4x104x936 | $0.66560 | 104 | 936 GiB | 4 NVIDIA H100s | 320 GiB |
H100:8x208x1872 | $1.33120 | 208 | 1872 GiB | 8 NVIDIA H100s | 640 GiB |
What can it run?
H100s are the most powerful GPUs currently available on Baseten. They’re great for large language models, high-performance image generation, and other demanding tasks.
H100-equipped instances can run inference for models like:
- Mixtral 8x7B on a single H100 in fp16 precision.
- 20 steps of Stable Diffusion XL in 1.31 seconds (sketched after this list).
- Most 70-billion-parameter LLMs, such as Llama-2-chat 70B, in fp16 precision, on 2 H100s.
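A minimal 20-step SDXL generation with the `diffusers` library looks like the sketch below. This plain pipeline does not reproduce the optimizations behind the benchmark number above, so expect it to be slower than 1.31 seconds out of the box.

```python
# 20-step Stable Diffusion XL generation in fp16 with diffusers.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

image = pipe("an astronaut riding a horse", num_inference_steps=20).images[0]
image.save("astronaut.png")
```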
NVIDIA H100mig
Instances with the NVIDIA H100mig GPU start at $0.08250 per minute.
GPU specs
The H100mig family of instances runs on a fractional share of an H100 GPU using NVIDIA’s Multi-Instance GPU (MIG) virtualization technology. Currently, we support a single instance type, H100MIG:3gx13x117, with access to 1/2 the memory and 3/7 the compute of a full H100. This results in:
- 7,242 CUDA cores
- 40 GiB VRAM
- 1.675 TB/s Memory bandwidth
Available instance types
Instance | Cost/minute | vCPU | RAM | GPU | VRAM |
---|---|---|---|---|---|
H100MIG:3gx13x117 | $0.08250 | 13 | 117 GiB | Fractional NVIDIA H100 | 40 GiB |
What can it run?
H100mig provides access to the same state-of-the-art AI inference architecture as the H100 in a smaller package. Based on our benchmarks, it can achieve higher throughput than a single A100 GPU at a lower cost per minute.