Parallelization Guide

Serial Training

Before going to parallelization, we first take a look at the single GPU/CPU training.

Here is the training script train.py:

from iann.trainer import Trainer

# Define the trainer
trainer = Trainer(
   model="painn",
   config={"device": "cuda",
           'output_dir': 'output',
           'output_log': 'output.log',
           'output_model': 'model.pt'},
   distributed=False,
   )

# Train the model
trainer.train("dataset.traj")

To run training on a local machine in serial, you can run the script on command line:

# Run on a local machine
python train.py

or you can submit it to a SLURM workload manager.

To run training on a single GPU:

#!/bin/bash
#SBATCH -N 1                   # Number of nodes
#SBATCH -C gpu                 # Use GPU nodes
#SBATCH -q debug               # Use regular/debug queue
#SBATCH -t 00:30:00            # Time limit
#SBATCH -A m2997               # Your account
#SBATCH --gpus-per-node=1      # GPUs per node
#SBATCH --ntasks-per-node=1    # Number of tasks per node
#SBATCH --cpus-per-task=1      # Number of CPU cores per task

module load your_modules

python train.py

To run training on a single CPU:

#!/bin/bash
#SBATCH -N 1                   # Number of nodes
#SBATCH -C cpu                 # Use CPU nodes
#SBATCH -q debug               # Use regular/debug queue
#SBATCH -t 00:30:00            # Time limit
#SBATCH -A m2997               # Your account
#SBATCH --ntasks-per-node=1    # Number of tasks per node
#SBATCH --cpus-per-task=128      # Number of CPU cores per task

module load your_modules

python train.py

Parallel Training

This guide covers how to use IANN with distributed training to parallelize the training process for speeding up the training. IANN supports distributed training using PyTorch’s Distributed Data Parallel (DDP). This allows training on multiple GPUs efficiently.

Multi-GPU Training: submit to multiple GPUs and multiple nodes (in SLURM Workload Manager)

#!/bin/bash
#SBATCH -N 2                   # Number of nodes
#SBATCH -C gpu                 # Use GPU nodes
#SBATCH -q debug               # Use regular/debug queue
#SBATCH -t 00:30:00            # Time limit
#SBATCH -A m2997               # Your account
#SBATCH --gpus-per-node=4      # GPUs per node
#SBATCH --ntasks-per-node=4    # Number of tasks per node
#SBATCH --cpus-per-task=1      # Number of CPUs per task

module load your_modules

export GPUS_PER_NODE=$SLURM_GPUS_ON_NODE
export NNODES=$SLURM_NNODES

srun -N $NNODES -n $((NNODES*GPUS_PER_NODE)) python train.py

Multi-CPU Training: submit to multiple CPUs and multiple nodes (in SLURM Workload Manager)

#!/bin/bash
#SBATCH -N 2                   # Number of nodes
#SBATCH -C cpu                 # Use CPU nodes
#SBATCH -q debug               # Use regular/debug queue
#SBATCH -t 00:30:00            # Time limit
#SBATCH -A m2997               # Your account
#SBATCH --ntasks-per-node=1    # Number of tasks per node
#SBATCH --cpus-per-task=128    # Number of CPUs per task

module load your_modules

export GPUS_PER_NODE=$SLURM_GPUS_ON_NODE
export NNODES=$SLURM_NNODES

srun -N $NNODES -n $((NNODES*GPUS_PER_NODE)) python train.py

To directly run training on multiple GPUs/CPUs on a local machine:

# Run on N GPUs/CPUs
python -m torch.distributed.launch --nproc_per_node=N train.py

Or use the following command by torchrun:

# Run on N GPUs/CPUs
torchrun --nproc_per_node=N train.py

Note

nproc_per_node defines the number of local CPU or GPU workers. Setup device="cpu" or device="cuda" to make device selection in your config.toml.

Examples of parallel training on NERSC and S3DF

Here is an example of how to run multi-GPU training on NERSC:

#!/bin/bash
#SBATCH -N 2                   # Number of nodes
#SBATCH -C gpu                 # Use GPU nodes
#SBATCH -q debug               # Use regular/debug queue
#SBATCH -t 00:30:00            # Time limit
#SBATCH -A m2997               # Your account
#SBATCH --gpus-per-node=4      # GPUs per node
#SBATCH --ntasks-per-node=4    # Number of tasks per node
#SBATCH --cpus-per-task=1      # Number of CPUs per task

# Load the environments, such as:
export PYTHONPATH=/pscratch/sd/c/changzhi/softwares/IANN_v3/IANN/:$PYTHONPATH
module purge
module load PrgEnv-nvidia; module load openmpi

# Vender bugs fixed on nersc for multiple nodes
export FI_CXI_RDZV_GET_MIN=0 # vender bugs fixed on nersc for multiple nodes
export FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD=16777216 # vender bugs fixed on nersc

# GPUs per node and number of nodes
export GPUS_PER_NODE=$SLURM_GPUS_ON_NODE
export NNODES=$SLURM_NNODES

# Run the training script on multiple GPUs/CPUs
srun -N $NNODES -n $((NNODES*GPUS_PER_NODE)) python train.py

Here is an example of how to run multi-GPU training on S3DF:

#!/bin/bash
#SBATCH --job-name=train
#SBATCH --nodes=2
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --gpus-per-node=1
#SBATCH --time=00:30:00
#SBATCH --partition=ampere
#SBATCH --account=suncat:normal

# Load the environments, such as:
conda activate /sdf/home/c/changzhi/softwares/anoconda3/envs/painn
export PYTHONPATH=/sdf/home/c/changzhi/changzhi/softwares/IANN_v3/IANN:$PYTHONPATH

# GPUs per node and number of nodes
export GPUS_PER_NODE=$SLURM_GPUS_ON_NODE
export NNODES=$SLURM_NNODES

# Run the training script on multiple GPUs/CPUs
srun -N $NNODES -n $((NNODES*GPUS_PER_NODE)) python train.py

Note

The srun command is used to run the training script on multiple GPUs/CPUs. It is a wrapper around the mpirun command for multiple GPUs/CPUs parallelization. For multi-GPUs/multi-CPUs mode, Trainer.train() will call mp.spawn() to launch world_size workers using the process_function. The parallelization parameters would be automatically obtained from the SLURM environment variables.

Parallelization Configuration

When using DDP, keep in mind that these config parameters are vital, which are automatically obtained from the SLURM environment variables when using SLURM Workload Manager:

# DDP-specific settings
batch_size = 32  # This is per GPU
dist_timeout = 600  # Timeout in seconds
master_port = 12356 # Master port for distributed training

Performance Optimization

Batch Size
- Set batch_size per GPU by batch_size = 32
- Total batch size = batch_size x num_gpus
- Adjust based on GPU memory
Data Loading
- Use multiple workers per GPU
- Enable pin_memory for faster data transfer
- Consider using persistent workers
Communication
- Use dist_timeout to set the timeout for communication
- Use master_port to set the master port for distributed training
- If port conflicts, you can change the port by master_port = 12356

Common Issues

Gradient Strides Warning
- You may see a warning about gradient strides not matching bucket view strides
- This is not an error and typically doesn’t affect training
- Can be safely ignored in most cases
Memory Issues
- Reduce batch size if OOM errors occur
- Monitor GPU memory usage
- Consider gradient checkpointing for large models
Communication Errors
- Check network connectivity between nodes
- Verify NCCL installation
- Adjust timeout values if needed

Best Practices

Scaling
- Start with a small number of GPUs
- Monitor scaling efficiency
- Adjust batch size and learning rate accordingly
Monitoring
- Use tools like nvidia-smi to monitor GPU usage
- Check communication overhead
- Profile training for bottlenecks
Debugging
- Run with a single GPU first
- Enable debug logging if needed
- Check for synchronization issues

For more advanced usage and troubleshooting, see the API Reference reference.