Parallelization Guide
Serial Training
Before going to parallelization, we first take a look at the single GPU/CPU training.
Here is the training script train.py:
from iann.trainer import Trainer
# Define the trainer
trainer = Trainer(
model="painn",
config={"device": "cuda",
'output_dir': 'output',
'output_log': 'output.log',
'output_model': 'model.pt'},
distributed=False,
)
# Train the model
trainer.train("dataset.traj")
To run training on a local machine in serial, you can run the script on command line:
# Run on a local machine
python train.py
or you can submit it to a SLURM workload manager.
To run training on a single GPU:
#!/bin/bash
#SBATCH -N 1 # Number of nodes
#SBATCH -C gpu # Use GPU nodes
#SBATCH -q debug # Use regular/debug queue
#SBATCH -t 00:30:00 # Time limit
#SBATCH -A m2997 # Your account
#SBATCH --gpus-per-node=1 # GPUs per node
#SBATCH --ntasks-per-node=1 # Number of tasks per node
#SBATCH --cpus-per-task=1 # Number of CPU cores per task
module load your_modules
python train.py
To run training on a single CPU:
#!/bin/bash
#SBATCH -N 1 # Number of nodes
#SBATCH -C cpu # Use CPU nodes
#SBATCH -q debug # Use regular/debug queue
#SBATCH -t 00:30:00 # Time limit
#SBATCH -A m2997 # Your account
#SBATCH --ntasks-per-node=1 # Number of tasks per node
#SBATCH --cpus-per-task=128 # Number of CPU cores per task
module load your_modules
python train.py
Parallel Training
This guide covers how to use IANN with distributed training to parallelize the training process for speeding up the training. IANN supports distributed training using PyTorch’s Distributed Data Parallel (DDP). This allows training on multiple GPUs efficiently.
Multi-GPU Training: submit to multiple GPUs and multiple nodes (in SLURM Workload Manager)
#!/bin/bash
#SBATCH -N 2 # Number of nodes
#SBATCH -C gpu # Use GPU nodes
#SBATCH -q debug # Use regular/debug queue
#SBATCH -t 00:30:00 # Time limit
#SBATCH -A m2997 # Your account
#SBATCH --gpus-per-node=4 # GPUs per node
#SBATCH --ntasks-per-node=4 # Number of tasks per node
#SBATCH --cpus-per-task=1 # Number of CPUs per task
module load your_modules
export GPUS_PER_NODE=$SLURM_GPUS_ON_NODE
export NNODES=$SLURM_NNODES
srun -N $NNODES -n $((NNODES*GPUS_PER_NODE)) python train.py
Multi-CPU Training: submit to multiple CPUs and multiple nodes (in SLURM Workload Manager)
#!/bin/bash
#SBATCH -N 2 # Number of nodes
#SBATCH -C cpu # Use CPU nodes
#SBATCH -q debug # Use regular/debug queue
#SBATCH -t 00:30:00 # Time limit
#SBATCH -A m2997 # Your account
#SBATCH --ntasks-per-node=1 # Number of tasks per node
#SBATCH --cpus-per-task=128 # Number of CPUs per task
module load your_modules
export GPUS_PER_NODE=$SLURM_GPUS_ON_NODE
export NNODES=$SLURM_NNODES
srun -N $NNODES -n $((NNODES*GPUS_PER_NODE)) python train.py
To directly run training on multiple GPUs/CPUs on a local machine:
# Run on N GPUs/CPUs
python -m torch.distributed.launch --nproc_per_node=N train.py
Or use the following command by torchrun:
# Run on N GPUs/CPUs
torchrun --nproc_per_node=N train.py
Note
nproc_per_node defines the number of local CPU or GPU workers. Setup device="cpu" or device="cuda" to make device selection in your config.toml.
Examples of parallel training on NERSC and S3DF
Here is an example of how to run multi-GPU training on NERSC:
#!/bin/bash
#SBATCH -N 2 # Number of nodes
#SBATCH -C gpu # Use GPU nodes
#SBATCH -q debug # Use regular/debug queue
#SBATCH -t 00:30:00 # Time limit
#SBATCH -A m2997 # Your account
#SBATCH --gpus-per-node=4 # GPUs per node
#SBATCH --ntasks-per-node=4 # Number of tasks per node
#SBATCH --cpus-per-task=1 # Number of CPUs per task
# Load the environments, such as:
export PYTHONPATH=/pscratch/sd/c/changzhi/softwares/IANN_v3/IANN/:$PYTHONPATH
module purge
module load PrgEnv-nvidia; module load openmpi
# Vender bugs fixed on nersc for multiple nodes
export FI_CXI_RDZV_GET_MIN=0 # vender bugs fixed on nersc for multiple nodes
export FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD=16777216 # vender bugs fixed on nersc
# GPUs per node and number of nodes
export GPUS_PER_NODE=$SLURM_GPUS_ON_NODE
export NNODES=$SLURM_NNODES
# Run the training script on multiple GPUs/CPUs
srun -N $NNODES -n $((NNODES*GPUS_PER_NODE)) python train.py
Here is an example of how to run multi-GPU training on S3DF:
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --nodes=2
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --gpus-per-node=1
#SBATCH --time=00:30:00
#SBATCH --partition=ampere
#SBATCH --account=suncat:normal
# Load the environments, such as:
conda activate /sdf/home/c/changzhi/softwares/anoconda3/envs/painn
export PYTHONPATH=/sdf/home/c/changzhi/changzhi/softwares/IANN_v3/IANN:$PYTHONPATH
# GPUs per node and number of nodes
export GPUS_PER_NODE=$SLURM_GPUS_ON_NODE
export NNODES=$SLURM_NNODES
# Run the training script on multiple GPUs/CPUs
srun -N $NNODES -n $((NNODES*GPUS_PER_NODE)) python train.py
Note
The srun command is used to run the training script on multiple GPUs/CPUs. It is a wrapper around the mpirun command for multiple GPUs/CPUs parallelization.
For multi-GPUs/multi-CPUs mode, Trainer.train() will call mp.spawn() to launch world_size workers using the process_function.
The parallelization parameters would be automatically obtained from the SLURM environment variables.
Parallelization Configuration
When using DDP, keep in mind that these config parameters are vital, which are automatically obtained from the SLURM environment variables when using SLURM Workload Manager:
# DDP-specific settings
batch_size = 32 # This is per GPU
dist_timeout = 600 # Timeout in seconds
master_port = 12356 # Master port for distributed training
Performance Optimization
Batch Size
Set batch_size per GPU by
batch_size = 32Total batch size = batch_size x num_gpus
Adjust based on GPU memory
Data Loading
Use multiple workers per GPU
Enable pin_memory for faster data transfer
Consider using persistent workers
Communication
Use
dist_timeoutto set the timeout for communicationUse
master_portto set the master port for distributed trainingIf port conflicts, you can change the port by
master_port = 12356
Common Issues
Gradient Strides Warning
You may see a warning about gradient strides not matching bucket view strides
This is not an error and typically doesn’t affect training
Can be safely ignored in most cases
Memory Issues
Reduce batch size if OOM errors occur
Monitor GPU memory usage
Consider gradient checkpointing for large models
Communication Errors
Check network connectivity between nodes
Verify NCCL installation
Adjust timeout values if needed
Best Practices
Scaling
Start with a small number of GPUs
Monitor scaling efficiency
Adjust batch size and learning rate accordingly
Monitoring
Use tools like nvidia-smi to monitor GPU usage
Check communication overhead
Profile training for bottlenecks
Debugging
Run with a single GPU first
Enable debug logging if needed
Check for synchronization issues
For more advanced usage and troubleshooting, see the API Reference reference.