Quantization trades precision for speed and memory efficiency. This guide covers Baseten’s supported formats, hardware requirements, and model-specific recommendations.

Quantization options

Quantization type availability depends on the engine and GPU.

Engine support

| Quantization | BIS-LLM | Engine-Builder-LLM | BEI |
| --- | --- | --- | --- |
| FP8 | βœ… | βœ… | βœ… |
| FP8_KV | βœ… | βœ… | ❌ |
| FP4 | βœ… | βœ… | βœ… |
| FP4_KV | βœ… | βœ… | ❌ |
| FP4_MLP_ONLY | βœ… | βœ… | βœ… |

GPU support

| GPU type | FP8 | FP8_KV | FP4 | FP4_KV | FP4_MLP_ONLY |
| --- | --- | --- | --- | --- | --- |
| L4 | βœ… | βœ… | ❌ | ❌ | ❌ |
| H100 | βœ… | βœ… | ❌ | ❌ | ❌ |
| H200 | βœ… | βœ… | ❌ | ❌ | ❌ |
| B200 | βœ… | βœ… | βœ… | βœ… | βœ… |

Model recommendations

Some model families have specific quantization requirements that affect accuracy.

Qwen2 models

Qwen2 models experience quality degradation with FP8_KV, so use regular FP8 instead. Increase the calibration size (calib_size) to 1024 or greater for better accuracy.
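As a minimal sketch (assuming the Engine-Builder-LLM layout shown in the configuration examples below), a Qwen2 deployment might look like:

```yaml
# Sketch for a Qwen2 model: plain FP8 rather than FP8_KV, with a larger calibration set.
trt_llm:
  build:
    base_model: decoder
    quantization_type: fp8        # avoid FP8_KV for Qwen2
    quantization_config:
      calib_size: 1024            # 1024 or greater for better accuracy
```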

Llama models

Llama variants work well with FP8_KV and standard calibration sizes (1024-1536). For B200 deployments, use FP4_MLP_ONLY for the best balance of speed and quality.
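For illustration, a hedged sketch for a Llama deployment, assuming the FP8_KV type is written in the same lowercase style as the fp8 value used in the configuration examples below:

```yaml
# Sketch for a Llama model: FP8 with a quantized KV cache and a standard calibration size.
trt_llm:
  build:
    base_model: decoder
    quantization_type: fp8_kv     # assumed lowercase spelling of FP8_KV
    quantization_config:
      calib_size: 1536            # upper end of the standard 1024-1536 range
```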

BEI models (embeddings)

Use FP8 for embedding models built on causal (decoder-based) architectures. Skip quantization for smaller models, since the overhead isn't worth the minimal benefit, and BERT-based models don't support quantization. BEI doesn't support FP8_KV.

Calibration

Quantization requires calibration data to determine optimal scaling factors. Larger models generally need more calibration samples.

Calibration datasets

The default dataset is cnn_dailymail (general news text). For specialized models, or for fine-tunes tied to a specific chat template, use a domain-specific dataset when available. To use a custom dataset, reference its Hugging Face name under calib_dataset and make sure the dataset has a train split with a text column.
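For example, a sketch pointing at a hypothetical Hugging Face dataset (your-org/your-domain-dataset is a placeholder, not a real dataset):

```yaml
quantization_config:
  calib_size: 1024
  calib_dataset: "your-org/your-domain-dataset"   # placeholder name; needs a train split with a text column
  calib_max_seq_length: 1024
```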

Calibration configuration

```yaml
quantization_config:
  calib_size: 768                     # Number of samples
  calib_dataset: "cnn_dailymail"      # Dataset name
  calib_max_seq_length: 1024          # Max sequence length
```

Increase calib_size for larger models. Use domain-specific datasets when available for better accuracy on specialized tasks.

Hardware requirements

FP4 quantization requires B200 GPUs. FP8 runs on L4 and above.

| Quantization | Minimum GPU | Recommended GPU | Memory reduction |
| --- | --- | --- | --- |
| FP16/BF16 | A100 | H100 | None |
| FP8 | L4 | H100 | ~50% |
| FP8_KV | L4 | H100 | ~60% |
| FP4 | B200 | B200 | ~75% |
| FP4_KV | B200 | B200 | ~80% |
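As a hedged sketch of a B200 deployment, assuming the FP4 type is spelled fp4, mirroring the lowercase fp8 value in the configuration examples below:

```yaml
# Sketch: FP4 weights on a B200, using the BIS-LLM layout shown below.
trt_llm:
  inference_stack: v2
  build:
    quantization_type: fp4          # assumed lowercase spelling of FP4; requires B200
    quantization_config:
      calib_size: 1024
```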

Configuration examples

Engine-Builder-LLM:

```yaml
trt_llm:
  build:
    base_model: decoder
    quantization_type: fp8
    quantization_config:
      calib_size: 1024
```

BIS-LLM:

```yaml
trt_llm:
  inference_stack: v2
  build:
    quantization_type: fp8
    quantization_config:
      calib_size: 1024
  runtime:
    max_seq_len: 32768
```

BEI:

```yaml
trt_llm:
  build:
    base_model: encoder
    quantization_type: fp8
    max_num_tokens: 16384
```

Set quantization_type in the build section and add quantization_config to customize calibration. BIS-LLM uses inference_stack: v2 while Engine-Builder-LLM uses base_model: decoder.

Best practices

When to use quantization

Use FP8 for production deployments to achieve cost-effective scaling. For memory-constrained environments, FP8_KV or FP4 variants provide additional memory reduction. Quantization becomes essential for models over 15B parameters where memory and cost savings are significant.

When to avoid quantization

Skip quantization when maximum accuracy is critical. Use FP16/BF16 instead. Small models under 8B parameters see minimal benefit from quantization. BEI-Bert models don’t support quantization at all. During research and development, FP16 provides faster iteration without calibration overhead.
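As a minimal sketch, assuming that omitting quantization_type leaves the model in its native FP16/BF16 precision and skips the calibration step entirely:

```yaml
trt_llm:
  build:
    base_model: decoder    # no quantization_type set; assumed to keep native FP16/BF16 weights
```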

Optimization tips

Use calibration datasets that match your domain for best accuracy. Test quantized models with your specific data before production deployment. Monitor the accuracy vs. performance trade-off and consider your hardware constraints when selecting quantization type.

Further reading