Configuring Multinode Training
To deploy a multinode training job:
- Configure the Compute resource in your TrainingJob by setting node_count to the number of nodes you'd like to use (e.g. 2).
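As a sketch, the job definition might look like the following. The module path and class names here (truss_train, definitions.TrainingJob, definitions.Compute, the accelerator string) are assumptions for illustration, not a confirmed API surface; check the Baseten SDK reference for the exact fields.

```python
# Hypothetical sketch -- names and fields are assumptions, not
# confirmed Baseten SDK API. Verify against the SDK reference.
from truss_train import definitions

training_job = definitions.TrainingJob(
    compute=definitions.Compute(
        node_count=2,  # number of nodes for the multinode job
    ),
    runtime=definitions.Runtime(
        # Launch the training script on each node, e.g. via torchrun.
        start_commands=["torchrun train.py"],
    ),
)
```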
Environment Variables
Make sure you've properly integrated with the Baseten-provided environment variables for distributed training.
Network Configuration
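For illustration, a training script might collect its distributed settings from the environment before initializing its process group. The variable names below (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are the common torchrun-style conventions and are assumed here; consult Baseten's environment-variable documentation for the exact names it injects.

```python
import os

def distributed_config() -> dict:
    """Collect distributed-training settings from the environment.

    Variable names are assumed torchrun-style conventions; substitute
    the names Baseten actually injects on each node.
    """
    return {
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
        "rank": int(os.environ.get("RANK", "0")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
    }
```

A launcher would typically pass these values to its framework's process-group initialization (for example, torch.distributed.init_process_group).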
Baseten provides high-speed InfiniBand networking between nodes to ensure efficient communication during distributed training. This enables:
- Fast gradient synchronization
- Efficient parameter updates
- Low-latency communication between nodes
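Conceptually, gradient synchronization is an all-reduce: after each backward pass, every node ends up holding the average of all nodes' gradients. The toy single-process simulation below illustrates the operation itself; in a real job the collective runs over InfiniBand via a library such as NCCL.

```python
from typing import List

def allreduce_mean(per_node_grads: List[List[float]]) -> List[float]:
    """Average gradients element-wise across nodes.

    Toy single-process stand-in for the all-reduce collective that the
    InfiniBand fabric accelerates in a real multinode job.
    """
    world_size = len(per_node_grads)
    dim = len(per_node_grads[0])
    return [
        sum(node[i] for node in per_node_grads) / world_size
        for i in range(dim)
    ]
```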
Best Practices
When setting up multinode training:
- Data Loading: Ensure your data loading is properly distributed across nodes
- Seeding: Use consistent seeding across all nodes for reproducible results
- Monitoring: Monitor training metrics across all nodes to ensure balanced training
- Checkpointing: Enable checkpointing to save model state across the distributed setup
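A minimal stdlib sketch of three of these practices, assuming each node knows its rank and the world size. In practice you would use your framework's distributed sampler and checkpoint utilities; the helper names here are illustrative.

```python
import random
from typing import List

def seed_everything(seed: int) -> None:
    """Use the same seed on every node so initialization is reproducible."""
    random.seed(seed)

def shard_indices(num_samples: int, rank: int, world_size: int) -> List[int]:
    """Strided shard of dataset indices so each node sees disjoint data."""
    return list(range(rank, num_samples, world_size))

def should_checkpoint(rank: int) -> bool:
    """Write checkpoints from a single rank to avoid clobbering files."""
    return rank == 0
```

Sharding by rank keeps nodes from duplicating work, while checkpointing from rank 0 alone avoids concurrent writes to the same model state.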