Every AI/ML model on Baseten runs on an instance, a dedicated set of hardware allocated to the model server. Selecting the right instance type ensures optimal performance while controlling compute costs.

  • Insufficient resources → Slow inference or failures.
  • Excess resources → Higher costs without added benefit.

Instance type resource components

  • Instance → The allocated hardware for inference.
  • vCPU → Virtual CPU cores for general computing.
  • RAM → Memory available to the CPU.
  • GPU → Specialized hardware for accelerated ML workloads.
  • VRAM → Dedicated GPU memory for model execution.

Configuring model resources

Resources can be defined before deployment in Truss or adjusted later via the Baseten UI.

Defining resources in Truss

Define resource requirements in config.yaml before running truss push. Changes to the resources section of config.yaml after deployment are ignored; update resources for a deployed model through the Baseten UI instead.

Example (Stable Diffusion XL):

config.yaml
resources:
  accelerator: A10G
  cpu: "4"
  memory: 16Gi
  use_gpu: true

Baseten provisions the smallest instance that meets the specified constraints:

  • **cpu: "3" or "4"** → Maps to a 4-core instance.
  • **cpu: "5" to "8"** → Maps to an 8-core instance.

Gi in resources.memory refers to gibibytes (2^30 bytes), which are slightly larger than gigabytes (10^9 bytes); 16Gi is roughly 17.2 GB.
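
As a minimal sketch (the values below are illustrative, not taken from the example above), a request for 6 vCPUs and 24 Gi of memory would be provisioned on an 8-core instance with at least 24 GiB of RAM, since that is the smallest shape satisfying both constraints:

config.yaml
resources:
  cpu: "6"        # falls in the "5" to "8" range, so an 8-core instance is provisioned
  memory: 24Gi    # 24 GiB ≈ 25.8 GB; the instance must provide at least this much RAM
  use_gpu: false  # CPU-only deployment, no accelerator requested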

Updating resources in the Baseten UI

Once deployed, resource configurations can only be updated through the Baseten UI.

For a list of available instance types, see the instance type reference.


Instance Type Reference

Specs, pricing, and example workloads for every Baseten instance type.

Choosing the right instance for model inference means balancing performance and cost. This page lists all available instance types on Baseten to help you deploy and serve models effectively.

CPU-only Instances

Cost-effective options for lighter workloads. No GPU.

  • Starts at: $0.00058/min
  • Best for: Transformers pipelines, small QA models, text embeddings
| Instance | $/min    | vCPU | RAM    |
|----------|----------|------|--------|
| 1x2      | $0.00058 | 1    | 2 GiB  |
| 1x4      | $0.00086 | 1    | 4 GiB  |
| 2x8      | $0.00173 | 2    | 8 GiB  |
| 4x16     | $0.00346 | 4    | 16 GiB |
| 8x32     | $0.00691 | 8    | 32 GiB |
| 16x64    | $0.01382 | 16   | 64 GiB |

Example workloads:

  • 1x2: Text classification (e.g., Truss quickstart)
  • 4x16: LayoutLM Document QA
  • 4x16+: Sentence Transformers embeddings on larger corpora
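
As a sketch, a CPU-only config for an embeddings model that should land on the 4x16 instance could look like the following; the values are assumptions to adjust per workload, using the same fields as the config.yaml example above:

config.yaml
resources:
  cpu: "4"        # maps to a 4-core instance
  memory: 16Gi    # 4x16 provides 16 GiB of RAM
  use_gpu: false  # no accelerator: stays on a CPU-only instance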

GPU Instances

Accelerated inference for LLMs, diffusion models, and Whisper.

| Instance          | $/min    | vCPU | RAM      | GPU                     | VRAM    |
|-------------------|----------|------|----------|-------------------------|---------|
| T4x4x16           | $0.01052 | 4    | 16 GiB   | NVIDIA T4               | 16 GiB  |
| T4x8x32           | $0.01504 | 8    | 32 GiB   | NVIDIA T4               | 16 GiB  |
| T4x16x64          | $0.02408 | 16   | 64 GiB   | NVIDIA T4               | 16 GiB  |
| L4x4x16           | $0.01414 | 4    | 16 GiB   | NVIDIA L4               | 24 GiB  |
| L4:2x24x96        | $0.04002 | 24   | 96 GiB   | 2 NVIDIA L4s            | 48 GiB  |
| L4:4x48x192       | $0.08003 | 48   | 192 GiB  | 4 NVIDIA L4s            | 96 GiB  |
| A10Gx4x16         | $0.02012 | 4    | 16 GiB   | NVIDIA A10G             | 24 GiB  |
| A10Gx8x32         | $0.02424 | 8    | 32 GiB   | NVIDIA A10G             | 24 GiB  |
| A10Gx16x64        | $0.03248 | 16   | 64 GiB   | NVIDIA A10G             | 24 GiB  |
| A10G:2x24x96      | $0.05672 | 24   | 96 GiB   | 2 NVIDIA A10Gs          | 48 GiB  |
| A10G:4x48x192     | $0.11344 | 48   | 192 GiB  | 4 NVIDIA A10Gs          | 96 GiB  |
| A10G:8x192x768    | $0.32576 | 192  | 768 GiB  | 8 NVIDIA A10Gs          | 192 GiB |
| V100x16x61        | $0.06120 | 16   | 61 GiB   | NVIDIA V100             | 16 GiB  |
| A100x12x144       | $0.10240 | 12   | 144 GiB  | 1 NVIDIA A100           | 80 GiB  |
| A100:2x24x288     | $0.20480 | 24   | 288 GiB  | 2 NVIDIA A100s          | 160 GiB |
| A100:3x36x432     | $0.30720 | 36   | 432 GiB  | 3 NVIDIA A100s          | 240 GiB |
| A100:4x48x576     | $0.40960 | 48   | 576 GiB  | 4 NVIDIA A100s          | 320 GiB |
| A100:5x60x720     | $0.51200 | 60   | 720 GiB  | 5 NVIDIA A100s          | 400 GiB |
| A100:6x72x864     | $0.61440 | 72   | 864 GiB  | 6 NVIDIA A100s          | 480 GiB |
| A100:7x84x1008    | $0.71680 | 84   | 1008 GiB | 7 NVIDIA A100s          | 560 GiB |
| A100:8x96x1152    | $0.81920 | 96   | 1152 GiB | 8 NVIDIA A100s          | 640 GiB |
| H100x26x234       | $0.16640 | 26   | 234 GiB  | 1 NVIDIA H100           | 80 GiB  |
| H100:2x52x468     | $0.33280 | 52   | 468 GiB  | 2 NVIDIA H100s          | 160 GiB |
| H100:4x104x936    | $0.66560 | 104  | 936 GiB  | 4 NVIDIA H100s          | 320 GiB |
| H100:8x208x1872   | $1.33120 | 208  | 1872 GiB | 8 NVIDIA H100s          | 640 GiB |
| H100MIG:3gx13x117 | $0.08250 | 13   | 117 GiB  | Fractional NVIDIA H100  | 40 GiB  |
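
As a sketch, the resources block below should land on the single-A100 instance (A100x12x144), the smallest type that provides an A100 plus the requested vCPUs and memory. It uses the same fields as the A10G example earlier; the count-suffix syntax for multi-GPU instances mentioned in the comment is an assumption to verify against the Truss configuration reference:

config.yaml
resources:
  accelerator: A100   # single A100; multi-GPU types are assumed to use a count suffix, e.g. A100:2
  cpu: "8"
  memory: 64Gi
  use_gpu: true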

GPU Details & Workloads

T4

Turing-series GPU

  • 2,560 CUDA / 320 Tensor cores
  • 16 GiB VRAM
  • Best for: Whisper, small LLMs like StableLM 3B

L4

Ada Lovelace-series GPU

  • 7,680 CUDA / 240 Tensor cores
  • 24 GiB VRAM, 300 GiB/s
  • 121 TFLOPS (fp16)
  • Best for: Stable Diffusion XL
  • Limit: Not well suited to LLM inference due to its relatively low memory bandwidth

A10G

Ampere-series GPU

  • 9,216 CUDA / 288 Tensor cores
  • 24 GiB VRAM, 600 GiB/s
  • 70 TFLOPS (fp16)
  • Best for: Mistral 7B, Whisper, Stable Diffusion/SDXL

V100

Volta-series GPU

  • 16 GiB VRAM
  • Best for: Legacy workloads needing V100-specific support

A100

Ampere-series GPU

  • 6,912 CUDA / 432 Tensor cores
  • 80 GiB VRAM, 1.94 TB/s
  • 312 TFLOPS (fp16)
  • Best for: Mixtral, Llama 2 70B (2 A100s), Falcon 180B (5 A100s), SDXL

H100

Hopper-series GPU

  • 16,896 CUDA / 640 Tensor cores
  • 80 GiB VRAM, 3.35 TB/s
  • 990 TFLOPS (fp16)
  • Best for: Mixtral 8x7B, Llama 2 70B (2×H100), SDXL

H100MIG

Fractional H100 (3/7 compute, ½ memory)

  • 7,242 CUDA cores, 40 GiB VRAM
  • 1.675 TB/s bandwidth
  • Best for: Efficient LLM inference at lower cost than A100