Baseten Training supports multinode training over InfiniBand, enabling distributed training across multiple nodes.

Configuring Multinode Training

To deploy a multinode training job:
  • Configure the Compute resource in your TrainingJob by setting node_count to the number of nodes you'd like to use (e.g., 2).
from truss_train import definitions

compute = definitions.Compute(
    node_count=2,  # Use 2 nodes for multinode training
    # ... other compute configuration options
)

Environment Variables

Make sure your training code reads the Baseten-provided environment variables for distributed training.
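
The exact variable names Baseten injects are documented separately; as a sketch, the torchrun-style names below (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are the common convention and are assumed here for illustration:

```python
import os

def read_dist_env():
    """Read torchrun-style distributed-training environment variables.

    MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE follow the common torchrun
    convention; substitute the names your platform actually provides.
    """
    return {
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
        "rank": int(os.environ.get("RANK", "0")),          # global rank of this process
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),  # total process count
    }
```

A launcher such as torchrun typically consumes these same values to initialize the process group.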

Network Configuration

Baseten provides high-speed InfiniBand networking between nodes to ensure efficient communication during distributed training. This enables:
  • Fast gradient synchronization
  • Efficient parameter updates
  • Low-latency communication between nodes
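
Gradient synchronization in data-parallel training is an all-reduce: every node contributes its local gradient and receives the element-wise mean. In practice a collective library (e.g., NCCL) performs this over InfiniBand; the pure-Python toy below only illustrates the arithmetic:

```python
def allreduce_mean(per_node_grads):
    """Toy all-reduce: average a list of per-node gradient vectors
    element-wise, returning the result every node would receive.
    Real jobs use a collective library over the interconnect."""
    n = len(per_node_grads)
    length = len(per_node_grads[0])
    return [sum(g[i] for g in per_node_grads) / n for i in range(length)]
```

After the all-reduce, every node holds identical averaged gradients, so parameter updates stay in sync across the cluster.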

Best Practices

When setting up multinode training:
  1. Data Loading: Ensure your data loading is properly distributed across nodes
  2. Seeding: Use consistent seeding across all nodes for reproducible results
  3. Monitoring: Monitor training metrics across all nodes to ensure balanced training
  4. Checkpointing: Enable checkpointing to save model state across the distributed setup
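
The data-loading and seeding practices above can be sketched together: with a shared seed, every node computes the same global shuffle, then takes a disjoint strided slice of the indices (the same idea as PyTorch's DistributedSampler). The function below is a minimal stdlib-only illustration, not a Baseten API:

```python
import random

def shard_indices(num_samples, world_size, rank, seed, epoch):
    """Deterministically shuffle and shard dataset indices per node.

    Because every node uses the same seed, the global shuffle is
    identical everywhere, so the per-rank strided slices partition
    the dataset with no overlap or gaps.
    """
    rng = random.Random(seed + epoch)   # same on every node; varies per epoch
    indices = list(range(num_samples))
    rng.shuffle(indices)
    return indices[rank::world_size]    # this node's disjoint slice
```

Each node passes its own rank but the same seed and epoch, and together the shards cover the dataset exactly once.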