Baseten Training supports distributed training across multiple nodes, with InfiniBand networking between them.

Configuring Multinode Training

To deploy a multinode training job:
  • Configure the Compute resource in your TrainingJob by setting the node_count to the number of nodes you’d like to use (e.g. 2).
from truss_train import definitions

compute = definitions.Compute(
    node_count=2,  # Use 2 nodes for multinode training
    # ... other compute configuration options
)

Environment Variables

Make sure your training code reads the Baseten-provided environment variables for distributed training.
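
As a rough sketch, here is how a PyTorch training script might wire those values into torch.distributed. The BT_LEADER_ADDR, BT_NODE_RANK, and BT_NUM_NODES names below are hypothetical placeholders for illustration; substitute the actual variables documented in the environment variables reference, and adjust the port and launcher details to your setup.

import os

import torch
import torch.distributed as dist

# Hypothetical variable names -- replace with the actual Baseten-provided ones.
leader_addr = os.environ.get("BT_LEADER_ADDR", "localhost")
node_rank = int(os.environ.get("BT_NODE_RANK", "0"))
num_nodes = int(os.environ.get("BT_NUM_NODES", "1"))

gpus_per_node = torch.cuda.device_count()
local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # per-process index on this node, set by your launcher

# Translate node-level values into the global rank and world size NCCL expects.
world_size = num_nodes * gpus_per_node
global_rank = node_rank * gpus_per_node + local_rank

dist.init_process_group(
    backend="nccl",  # NCCL communicates over InfiniBand between nodes when available
    init_method=f"tcp://{leader_addr}:29500",  # placeholder port
    rank=global_rank,
    world_size=world_size,
)
torch.cuda.set_device(local_rank)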

Network Configuration

Baseten provides high-speed InfiniBand networking between nodes to ensure efficient communication during distributed training. This enables:
  • Fast gradient synchronization
  • Efficient parameter updates
  • Low-latency communication between nodes
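
If you want to confirm that NCCL is actually communicating over InfiniBand, one generic (not Baseten-specific) option is to enable NCCL's debug logging before the process group is initialized; the startup logs then report the selected transport (for example "NET/IB" versus "NET/Socket").

import os

# Must be set before dist.init_process_group so NCCL picks it up at initialization.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # limit log volume to the init and network subsystems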

Checkpointing in Multinode Training

Checkpointing behavior varies across training frameworks in multinode setups. One common pattern is to use the shared cache directory that all nodes can access:
# Use shared volume with job name for checkpointing
ckpt_dir="${BT_RW_CACHE_DIR}/${BT_TRAINING_JOB_NAME}"
Then write your checkpoints to ckpt_dir so that every node uses the same location. For comprehensive framework-specific examples and patterns, see the Training Cookbook. Keep in mind that checkpoints written here are not backed up by Baseten, since they are not stored in $BT_CHECKPOINT_DIR; copy them there at some point so they are preserved.
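
A minimal sketch of that pattern in PyTorch, assuming the process group is already initialized and a single rank is responsible for writing (sharded-checkpoint frameworks such as DeepSpeed or FSDP have their own conventions, which the Training Cookbook covers):

import os
import shutil

import torch
import torch.distributed as dist

# Shared, node-visible checkpoint directory keyed by the job name.
ckpt_dir = os.path.join(os.environ["BT_RW_CACHE_DIR"], os.environ["BT_TRAINING_JOB_NAME"])
os.makedirs(ckpt_dir, exist_ok=True)

def save_checkpoint(model, step):
    # Only global rank 0 writes, so nodes don't clobber each other's files.
    if dist.get_rank() == 0:
        path = os.path.join(ckpt_dir, f"step_{step}.pt")
        torch.save(model.state_dict(), path)
        # Copy into $BT_CHECKPOINT_DIR so Baseten backs the checkpoint up.
        shutil.copy2(path, os.environ["BT_CHECKPOINT_DIR"])
    dist.barrier()  # keep all ranks in step before resuming training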

Common Practices

When setting up multinode training:
  1. Data Loading: Ensure your data loading is properly distributed across nodes, for example with a distributed-aware sampler (see the sketch after this list)
  2. Seeding: Use consistent seeding across all nodes for reproducible results
  3. Monitoring: Monitor training metrics across all nodes to ensure balanced training
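
The first two points often come together in PyTorch as a DistributedSampler plus a fixed seed; a rough sketch, assuming the process group is already initialized so the sampler can infer the rank and world size:

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def build_loader(dataset, batch_size, epoch, seed=42):
    # Each rank gets a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset, shuffle=True, seed=seed)
    # set_epoch reshuffles each epoch while keeping all ranks consistent.
    sampler.set_epoch(epoch)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

torch.manual_seed(42)  # identical seed on every rank for reproducible initialization
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))  # toy data for illustration
loader = build_loader(dataset, batch_size=32, epoch=0)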