Baseten Training supports distributed training across multiple nodes, with InfiniBand networking between them.

Configuring Multinode Training

To deploy a multinode training job:
  • Configure the Compute resource in your TrainingJob by setting the node_count to the number of nodes you’d like to use (e.g. 2).
from truss_train import definitions

compute = definitions.Compute(
    node_count=2,  # Use 2 nodes for multinode training
    # ... other compute configuration options
)

Environment Variables

Make sure your training code reads the Baseten-provided environment variables for distributed training.
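
As a rough sketch, here is how a PyTorch training script might wire those values into torch.distributed. The BT_LEADER_ADDR, BT_NODE_RANK, and BT_NUM_NODES names below are hypothetical placeholders for illustration; substitute the actual variables documented in the environment variables reference, and adjust the port and launcher details to your setup.

import os

import torch
import torch.distributed as dist

# Hypothetical variable names -- replace with the actual Baseten-provided ones.
leader_addr = os.environ.get("BT_LEADER_ADDR", "localhost")
node_rank = int(os.environ.get("BT_NODE_RANK", "0"))
num_nodes = int(os.environ.get("BT_NUM_NODES", "1"))

gpus_per_node = torch.cuda.device_count()
local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # per-process index on this node, set by your launcher

# Translate node-level values into the global rank and world size NCCL expects.
world_size = num_nodes * gpus_per_node
global_rank = node_rank * gpus_per_node + local_rank

dist.init_process_group(
    backend="nccl",  # NCCL communicates over InfiniBand between nodes when available
    init_method=f"tcp://{leader_addr}:29500",  # placeholder port
    rank=global_rank,
    world_size=world_size,
)
torch.cuda.set_device(local_rank)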

Network Configuration

Baseten provides high-speed InfiniBand networking between nodes to ensure efficient communication during distributed training. This enables:
  • Fast gradient synchronization
  • Efficient parameter updates
  • Low-latency communication between nodes
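
If you want to confirm that NCCL is actually communicating over InfiniBand, one generic (not Baseten-specific) option is to enable NCCL's debug logging before the process group is initialized; the startup logs then report the selected transport (for example "NET/IB" versus "NET/Socket").

import os

# Must be set before dist.init_process_group so NCCL picks it up at initialization.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # limit log volume to the init and network subsystems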

Checkpointing in Multinode Training

Checkpointing behavior varies across training frameworks in multinode setups. One common pattern is to use the shared cache directory that all nodes can access:
# Use shared volume with job name for checkpointing
ckpt_dir="${BT_RW_CACHE_DIR}/${BT_TRAINING_JOB_NAME}"
Then write your checkpoints to ckpt_dir so that every node uses the same location. For comprehensive framework-specific examples and patterns, see the Training Cookbook. Keep in mind that checkpoints written here are not backed up by Baseten, since they are not stored in $BT_CHECKPOINT_DIR; copy them there at some point so they are preserved.
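
A minimal sketch of that pattern in PyTorch, assuming the process group is already initialized and a single rank is responsible for writing (sharded-checkpoint frameworks such as DeepSpeed or FSDP have their own conventions, which the Training Cookbook covers):

import os
import shutil

import torch
import torch.distributed as dist

# Shared, node-visible checkpoint directory keyed by the job name.
ckpt_dir = os.path.join(os.environ["BT_RW_CACHE_DIR"], os.environ["BT_TRAINING_JOB_NAME"])
os.makedirs(ckpt_dir, exist_ok=True)

def save_checkpoint(model, step):
    # Only global rank 0 writes, so nodes don't clobber each other's files.
    if dist.get_rank() == 0:
        path = os.path.join(ckpt_dir, f"step_{step}.pt")
        torch.save(model.state_dict(), path)
        # Copy into $BT_CHECKPOINT_DIR so Baseten backs the checkpoint up.
        shutil.copy2(path, os.environ["BT_CHECKPOINT_DIR"])
    dist.barrier()  # keep all ranks in step before resuming training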

Common Practices

When setting up multinode training:
  1. Data Loading: Ensure your data loading is properly distributed across nodes, for example with a distributed-aware sampler (see the sketch after this list)
  2. Seeding: Use consistent seeding across all nodes for reproducible results
  3. Monitoring: Monitor training metrics across all nodes to ensure balanced training
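
The first two points often come together in PyTorch as a DistributedSampler plus a fixed seed; a rough sketch, assuming the process group is already initialized so the sampler can infer the rank and world size:

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def build_loader(dataset, batch_size, epoch, seed=42):
    # Each rank gets a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset, shuffle=True, seed=seed)
    # set_epoch reshuffles each epoch while keeping all ranks consistent.
    sampler.set_epoch(epoch)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

torch.manual_seed(42)  # identical seed on every rank for reproducible initialization
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))  # toy data for illustration
loader = build_loader(dataset, batch_size=32, epoch=0)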