Configuring Multinode Training
To deploy a multinode training job:
- Configure the `Compute` resource in your `TrainingJob` by setting the `node_count` to the number of nodes you'd like to use (e.g. 2).
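As a rough illustration of the shape such a configuration takes, here is a minimal sketch. The `Compute` and `TrainingJob` names come from the docs above, but the dataclass layout and the `accelerator` field are illustrative stand-ins, not the real Baseten SDK:

```python
from dataclasses import dataclass

# Illustrative stand-ins for the Compute and TrainingJob resources
# described above; the actual Baseten SDK types may differ.
@dataclass
class Compute:
    node_count: int = 1
    accelerator: str = "H100"  # hypothetical field name

@dataclass
class TrainingJob:
    compute: Compute

# Request two nodes for a multinode run.
job = TrainingJob(compute=Compute(node_count=2))
print(job.compute.node_count)  # 2
```

Consult the Baseten SDK reference for the exact fields your version accepts.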
Environment Variables
Make sure you’ve properly integrated with the Baseten-provided environment variables for distributed training.

Network Configuration
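In practice, integrating means reading the rendezvous details from the environment and feeding them into your framework's distributed init. The variable names below (`BT_LEADER_ADDR`, `BT_LEADER_PORT`, `BT_NODE_RANK`, `BT_NUM_NODES`) are placeholders, not confirmed names; check the Baseten docs for the variables your job actually receives:

```python
import os

def dist_config() -> dict:
    # Hypothetical variable names -- consult the Baseten docs for the
    # actual environment variables injected into your training job.
    return {
        "master_addr": os.environ.get("BT_LEADER_ADDR", "localhost"),
        "master_port": int(os.environ.get("BT_LEADER_PORT", "29500")),
        "node_rank": int(os.environ.get("BT_NODE_RANK", "0")),
        "num_nodes": int(os.environ.get("BT_NUM_NODES", "1")),
    }

cfg = dist_config()
# Typical use: pass these into your launcher or init call, e.g.
# torchrun --master_addr=... --master_port=... --node_rank=...
```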
Baseten provides high-speed InfiniBand networking between nodes to ensure efficient communication during distributed training. This enables:
- Fast gradient synchronization
- Efficient parameter updates
- Low-latency communication between nodes
Checkpointing in Multinode Training
Checkpointing behavior varies across training frameworks in multinode setups. One common pattern is to point the checkpoint directory (e.g. `ckpt_dir`) at the shared cache directory that all nodes can access. This ensures all nodes write to the same checkpoint location. For comprehensive framework-specific examples and patterns, see the Training Cookbook.
Keep in mind that these checkpoints will not be backed up by Baseten, since they are not stored in `$BT_CHECKPOINT_DIR`. Make sure to copy them there before the job finishes so they are preserved.
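A minimal sketch of that copy step, assuming a shared checkpoint directory whose path you supply (only `BT_CHECKPOINT_DIR` comes from the docs above; the function name and layout are illustrative):

```python
import os
import shutil

def backup_checkpoints(shared_ckpt_dir: str) -> str:
    """Copy checkpoints from the shared cache into $BT_CHECKPOINT_DIR
    so Baseten backs them up. Run this from a single node (e.g. rank 0)
    to avoid duplicate copies."""
    backup_root = os.environ["BT_CHECKPOINT_DIR"]
    dest = os.path.join(backup_root, os.path.basename(shared_ckpt_dir))
    # dirs_exist_ok lets repeated backups overwrite the previous copy.
    shutil.copytree(shared_ckpt_dir, dest, dirs_exist_ok=True)
    return dest
```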
Common Practices
When setting up multinode training:
- Data Loading: Ensure your data loading is properly distributed across nodes
- Seeding: Use consistent seeding across all nodes for reproducible results
- Monitoring: Monitor training metrics across all nodes to ensure balanced training
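The first two points can be sketched in a framework-agnostic way: shard sample indices disjointly by node rank, and seed every node with the same value. No Baseten or framework API is assumed here; the function names are illustrative:

```python
import random

def shard_indices(num_samples: int, node_rank: int, num_nodes: int) -> list:
    # Each node takes a disjoint, strided slice of the dataset so no
    # sample is processed twice within an epoch.
    return list(range(node_rank, num_samples, num_nodes))

def seed_everything(seed: int) -> None:
    # Seed identically on every node so initialization and shuffling
    # order are reproducible across the cluster.
    random.seed(seed)

# Two nodes splitting 10 samples:
seed_everything(42)
shards = [shard_indices(10, rank, 2) for rank in range(2)]
print(shards)  # [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]
```

In real jobs, frameworks typically provide this sharding for you (e.g. a distributed sampler); the sketch just shows the invariant to check for: shards are disjoint and together cover the dataset.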