Choosing the right resources for your model inference workload requires carefully balancing performance and cost. This page lists every instance type currently available on Baseten to help you pick the best fit for serving your model.

CPU-only instance reference

Instances with no GPU start at $0.00058 per minute.

Available instance types

| Instance | Cost/minute | vCPU | RAM |
| --- | --- | --- | --- |
| 1x2 | $0.00058 | 1 | 2 GiB |
| 1x4 | $0.00086 | 1 | 4 GiB |
| 2x8 | $0.00173 | 2 | 8 GiB |
| 4x16 | $0.00346 | 4 | 16 GiB |
| 8x32 | $0.00691 | 8 | 32 GiB |
| 16x64 | $0.01382 | 16 | 64 GiB |

What can it run?

CPU-only instances are a cost-effective way to run inference on a wide variety of smaller models.
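
Because prices are quoted per minute, projecting spend is simple arithmetic: multiply the per-minute rate by expected uptime. Here is a minimal sketch, using rates copied from the tables on this page; it assumes the instance runs continuously, so it overestimates cost for workloads that scale down.

```python
# Estimate the cost of keeping an instance up for a given duration.
# Rates are per-minute prices copied from the tables on this page.

PRICE_PER_MINUTE = {
    "1x2": 0.00058,          # 1 vCPU, 2 GiB RAM
    "16x64": 0.01382,        # 16 vCPU, 64 GiB RAM
    "A10Gx4x16": 0.02012,    # 1 NVIDIA A10G, 4 vCPU, 16 GiB RAM
    "H100x26x234": 0.16640,  # 1 NVIDIA H100, 26 vCPU, 234 GiB RAM
}

def cost_usd(instance: str, hours: float) -> float:
    """USD cost of `hours` of continuous uptime on `instance`."""
    return PRICE_PER_MINUTE[instance] * 60 * hours

print(f"1x2, one month:   ${cost_usd('1x2', 24 * 30):.2f}")    # $25.06
print(f"H100, 8-hour day: ${cost_usd('H100x26x234', 8):.2f}")  # $79.87
```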

GPU instance reference

Available instance types

| Instance | Cost/minute | vCPU | RAM | GPU | VRAM |
| --- | --- | --- | --- | --- | --- |
| T4x4x16 | $0.01052 | 4 | 16 GiB | NVIDIA T4 | 16 GiB |
| T4x8x32 | $0.01504 | 8 | 32 GiB | NVIDIA T4 | 16 GiB |
| T4x16x64 | $0.02408 | 16 | 64 GiB | NVIDIA T4 | 16 GiB |
| L4x4x16 | $0.01414 | 4 | 16 GiB | NVIDIA L4 | 24 GiB |
| L4:2x24x96 | $0.04002 | 24 | 96 GiB | 2 NVIDIA L4s | 48 GiB |
| L4:4x48x192 | $0.08003 | 48 | 192 GiB | 4 NVIDIA L4s | 96 GiB |
| A10Gx4x16 | $0.02012 | 4 | 16 GiB | NVIDIA A10G | 24 GiB |
| A10Gx8x32 | $0.02424 | 8 | 32 GiB | NVIDIA A10G | 24 GiB |
| A10Gx16x64 | $0.03248 | 16 | 64 GiB | NVIDIA A10G | 24 GiB |
| A10G:2x24x96 | $0.05672 | 24 | 96 GiB | 2 NVIDIA A10Gs | 48 GiB |
| A10G:4x48x192 | $0.11344 | 48 | 192 GiB | 4 NVIDIA A10Gs | 96 GiB |
| A10G:8x192x768 | $0.32576 | 192 | 768 GiB | 8 NVIDIA A10Gs | 192 GiB |
| V100x8x61 | $0.06120 | 16 | 61 GiB | NVIDIA V100 | 16 GiB |
| A100x12x144 | $0.10240 | 12 | 144 GiB | 1 NVIDIA A100 | 80 GiB |
| A100:2x24x288 | $0.20480 | 24 | 288 GiB | 2 NVIDIA A100s | 160 GiB |
| A100:3x36x432 | $0.30720 | 36 | 432 GiB | 3 NVIDIA A100s | 240 GiB |
| A100:4x48x576 | $0.40960 | 48 | 576 GiB | 4 NVIDIA A100s | 320 GiB |
| A100:5x60x720 | $0.51200 | 60 | 720 GiB | 5 NVIDIA A100s | 400 GiB |
| A100:6x72x864 | $0.61440 | 72 | 864 GiB | 6 NVIDIA A100s | 480 GiB |
| A100:7x84x1008 | $0.71680 | 84 | 1008 GiB | 7 NVIDIA A100s | 560 GiB |
| A100:8x96x1152 | $0.81920 | 96 | 1152 GiB | 8 NVIDIA A100s | 640 GiB |
| H100x26x234 | $0.16640 | 26 | 234 GiB | 1 NVIDIA H100 | 80 GiB |
| H100:2x52x468 | $0.33280 | 52 | 468 GiB | 2 NVIDIA H100s | 160 GiB |
| H100:4x104x936 | $0.66560 | 104 | 936 GiB | 4 NVIDIA H100s | 320 GiB |
| H100:8x208x1872 | $1.33120 | 208 | 1872 GiB | 8 NVIDIA H100s | 640 GiB |
| H100MIG:3gx13x117 | $0.08250 | 13 | 117 GiB | Fractional NVIDIA H100 | 40 GiB |
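
One note on naming, inferred from the table above: each name encodes the instance's resources as GPU[:gpu_count]xvCPUxRAM, so A10G:2x24x96 is two A10Gs with 24 vCPUs and 96 GiB of RAM. Below is a hypothetical helper that splits names apart; CPU-only names like 1x2 omit the GPU segment, and H100MIG:3gx13x117 embeds a MIG profile rather than a GPU count, so neither is handled here.

```python
import re

# Hypothetical helper: split a GPU instance name like "A10G:2x24x96" into
# its parts. Convention inferred from the tables on this page:
# GPU[:count]xVCPUxRAM. CPU-only names ("1x2") and the fractional
# H100MIG:3gx13x117 (whose ":3g" is a MIG profile, not a GPU count)
# are intentionally out of scope and raise ValueError.
NAME = re.compile(r"^(?P<gpu>[A-Za-z0-9]+?)(?::(?P<count>\d+))?x(?P<vcpu>\d+)x(?P<ram>\d+)$")

def parse_instance(name: str) -> dict:
    match = NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized instance name: {name}")
    return {
        "gpu": match["gpu"],
        "gpu_count": int(match["count"] or 1),
        "vcpus": int(match["vcpu"]),
        "ram_gib": int(match["ram"]),
    }

print(parse_instance("T4x4x16"))
# {'gpu': 'T4', 'gpu_count': 1, 'vcpus': 4, 'ram_gib': 16}
print(parse_instance("A100:8x96x1152"))
# {'gpu': 'A100', 'gpu_count': 8, 'vcpus': 96, 'ram_gib': 1152}
```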

NVIDIA T4

Instances with an NVIDIA T4 GPU start at $0.01052 per minute.

GPU specs

The T4 is a Turing-series GPU with:

  • 2,560 CUDA cores
  • 320 Tensor cores
  • 16 GiB VRAM

Available instance types

| Instance | Cost/minute | vCPU | RAM | GPU | VRAM |
| --- | --- | --- | --- | --- | --- |
| T4x4x16 | $0.01052 | 4 | 16 GiB | NVIDIA T4 | 16 GiB |
| T4x8x32 | $0.01504 | 8 | 32 GiB | NVIDIA T4 | 16 GiB |
| T4x16x64 | $0.02408 | 16 | 64 GiB | NVIDIA T4 | 16 GiB |

What can it run?

T4-equipped instances can run inference for models like:

  • Whisper, transcribing 5 minutes of audio in 31.4 seconds with Whisper small.
  • While the T4’s 16 GiB of VRAM is insufficient for 7-billion-parameter LLMs, it can run smaller 3B-parameter models like StableLM (see the sketch below).
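
The VRAM claim above is easy to sanity-check: in fp16, weights take about two bytes per parameter, and serving needs headroom beyond the weights for activations and the KV cache. A rough sketch:

```python
# Rough VRAM check: fp16 weights take ~2 bytes per parameter, and
# inference needs extra headroom for activations and the KV cache,
# so weights alone should sit comfortably under total VRAM.

GIB = 1024**3

def fp16_weights_gib(params_billions: float) -> float:
    """Approximate size of a model's fp16 weights, in GiB."""
    return params_billions * 1e9 * 2 / GIB

print(f"3B model: {fp16_weights_gib(3):.1f} GiB")  # ~5.6 GiB -> fits a 16 GiB T4
print(f"7B model: {fp16_weights_gib(7):.1f} GiB")  # ~13.0 GiB -> too little headroom
                                                   # on a T4; use a 24 GiB+ GPU
```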

NVIDIA L4

Instances with an NVIDIA L4 GPU start at $0.01414 per minute.

GPU specs

The L4 is an Ada Lovelace-series GPU with:

  • 7,680 CUDA cores
  • 240 Tensor cores
  • 24 GiB VRAM
  • 300 GiB/s Memory bandwidth

This enables the card to reach 121 teraFLOPS in fp16 operations, the most common precision for large language models.

Available instance types

| Instance | Cost/minute | vCPU | RAM | GPU | VRAM |
| --- | --- | --- | --- | --- | --- |
| L4x4x16 | $0.01414 | 4 | 16 GiB | NVIDIA L4 | 24 GiB |
| L4:2x24x96 | $0.04002 | 24 | 96 GiB | 2 NVIDIA L4s | 48 GiB |
| L4:4x48x192 | $0.08003 | 48 | 192 GiB | 4 NVIDIA L4s | 96 GiB |

What can it run?

The L4 is a great choice for running inference on models like Stable Diffusion XL, but its limited memory bandwidth makes it a poor fit for LLMs.
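
That bandwidth limitation is straightforward to quantify. When generating text one token at a time, each new token reads roughly every weight from memory once, so memory bandwidth sets a ceiling on decode speed. A back-of-the-envelope sketch using the bandwidth figures on this page (it ignores KV-cache reads, batching, and kernel overheads):

```python
# Ceiling on single-stream LLM decode speed: each generated token streams
# (roughly) all weights through memory once, so
#   tokens/sec <= memory_bandwidth / weight_bytes.

GIB = 1024**3

def decode_ceiling_tokens_per_sec(bandwidth_gib_s: float,
                                  params_billions: float) -> float:
    fp16_weights_gib = params_billions * 1e9 * 2 / GIB  # ~2 bytes/param in fp16
    return bandwidth_gib_s / fp16_weights_gib

# A 7B fp16 model (~13 GiB of weights):
print(f"L4   (300 GiB/s): ~{decode_ceiling_tokens_per_sec(300, 7):.0f} tokens/s")  # ~23
print(f"A10G (600 GiB/s): ~{decode_ceiling_tokens_per_sec(600, 7):.0f} tokens/s")  # ~46
```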

NVIDIA A10G

Instances with the NVIDIA A10G GPU start at $0.02012 per minute.

GPU specs

The A10G is an Ampere-series GPU with:

  • 9,216 CUDA cores
  • 288 Tensor cores
  • 24 GiB VRAM
  • 600 GiB/s Memory bandwidth

This enables the card to reach 70 teraFLOPS in fp16 operations, the most common precision for large language models.

Available instance types

| Instance | Cost/minute | vCPU | RAM | GPU | VRAM |
| --- | --- | --- | --- | --- | --- |
| A10Gx4x16 | $0.02012 | 4 | 16 GiB | NVIDIA A10G | 24 GiB |
| A10Gx8x32 | $0.02424 | 8 | 32 GiB | NVIDIA A10G | 24 GiB |
| A10Gx16x64 | $0.03248 | 16 | 64 GiB | NVIDIA A10G | 24 GiB |
| A10G:2x24x96 | $0.05672 | 24 | 96 GiB | 2 NVIDIA A10Gs | 48 GiB |
| A10G:4x48x192 | $0.11344 | 48 | 192 GiB | 4 NVIDIA A10Gs | 96 GiB |
| A10G:8x192x768 | $0.32576 | 192 | 768 GiB | 8 NVIDIA A10Gs | 192 GiB |

What can it run?

Single A10Gs are great for running 7-billion-parameter LLMs, and multi-A10G instances can work together to serve even larger models.

NVIDIA V100

Instances with the NVIDIA V100 GPU start at $0.06120 per minute.

GPU specs

The V100 is a Volta-series GPU with 16 GiB of VRAM.

Available instance types

| Instance | Cost/minute | vCPU | RAM | GPU | VRAM |
| --- | --- | --- | --- | --- | --- |
| V100x8x61 | $0.06120 | 16 | 61 GiB | NVIDIA V100 | 16 GiB |

NVIDIA A100

Instances with the NVIDIA A100 GPU start at $0.10240 per minute.

GPU specs

The A100 is an Ampere-series GPU with:

  • 6,912 CUDA cores
  • 432 Tensor cores
  • 80 GiB VRAM
  • 1,935 GiB/s Memory bandwidth

This enables the card to reach 312 teraFLOPS in fp16 operations, the most common precision for large language models.

Available instance types

| Instance | Cost/minute | vCPU | RAM | GPU | VRAM |
| --- | --- | --- | --- | --- | --- |
| A100x12x144 | $0.10240 | 12 | 144 GiB | 1 NVIDIA A100 | 80 GiB |
| A100:2x24x288 | $0.20480 | 24 | 288 GiB | 2 NVIDIA A100s | 160 GiB |
| A100:3x36x432 | $0.30720 | 36 | 432 GiB | 3 NVIDIA A100s | 240 GiB |
| A100:4x48x576 | $0.40960 | 48 | 576 GiB | 4 NVIDIA A100s | 320 GiB |
| A100:5x60x720 | $0.51200 | 60 | 720 GiB | 5 NVIDIA A100s | 400 GiB |
| A100:6x72x864 | $0.61440 | 72 | 864 GiB | 6 NVIDIA A100s | 480 GiB |
| A100:7x84x1008 | $0.71680 | 84 | 1008 GiB | 7 NVIDIA A100s | 560 GiB |
| A100:8x96x1152 | $0.81920 | 96 | 1152 GiB | 8 NVIDIA A100s | 640 GiB |

What can it run?

A100s are the second most powerful GPUs currently available on Baseten, behind only the H100. They’re great for large language models, high-performance image generation, and other demanding tasks.

NVIDIA H100

Instances with the NVIDIA H100 GPU start at $0.16640 per minute.

GPU specs

The H100 is a Hopper-series GPU with:

  • 16,896 CUDA cores
  • 528 Tensor cores
  • 80 GiB VRAM
  • 3.35 TB/s Memory bandwidth

This enables the card to reach 990 teraFLOPS in fp16 operations, the most common precision for large language models.

Available instance types

| Instance | Cost/minute | vCPU | RAM | GPU | VRAM |
| --- | --- | --- | --- | --- | --- |
| H100x26x234 | $0.16640 | 26 | 234 GiB | 1 NVIDIA H100 | 80 GiB |
| H100:2x52x468 | $0.33280 | 52 | 468 GiB | 2 NVIDIA H100s | 160 GiB |
| H100:4x104x936 | $0.66560 | 104 | 936 GiB | 4 NVIDIA H100s | 320 GiB |
| H100:8x208x1872 | $1.33120 | 208 | 1872 GiB | 8 NVIDIA H100s | 640 GiB |

What can it run?

H100s are the most powerful GPUs currently available on Baseten. They’re great for large language models, high-performance image generation, and other demanding tasks.

NVIDIA H100mig

Instances with the NVIDIA H100mig GPU start at $0.08250 per minute.

GPU specs

The H100mig family of instances runs on a fractional share of an H100 GPU using NVIDIA’s Multi-Instance GPU (MIG) virtualization technology. Currently, we support a single instance type, H100MIG:3gx13x117, with access to half the memory and 3/7 of the compute of a full H100. This results in:

  • 7,242 CUDA cores
  • 40 GiB VRAM
  • 1.675 TB/s Memory bandwidth
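
These figures follow directly from the full-H100 specs in the previous section; a quick arithmetic check:

```python
# Sanity-check the fractional specs against the full H100 listed above.
full_cuda_cores = 16_896
full_vram_gib = 80
full_bandwidth_tb_s = 3.35

print(full_cuda_cores * 3 / 7)   # ~7241.1 -> listed as 7,242
print(full_vram_gib / 2)         # 40.0 GiB
print(full_bandwidth_tb_s / 2)   # 1.675 TB/s
```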

Available instance types

| Instance | Cost/minute | vCPU | RAM | GPU | VRAM |
| --- | --- | --- | --- | --- | --- |
| H100MIG:3gx13x117 | $0.08250 | 13 | 117 GiB | Fractional NVIDIA H100 | 40 GiB |

What can it run?

H100mig provides access to the same state-of-the-art AI inference architecture as the H100 in a smaller package. Based on our benchmarks, it can achieve higher throughput than a single A100 GPU at a lower cost per minute ($0.08250 vs. $0.10240).