Troubleshooting Guide
This guide covers common issues and their solutions when using IANN.
Training Issues
Memory Issues
Problem: Out of Memory (OOM) errors during training
Solutions:
Reduce batch size
Use gradient checkpointing
Enable mixed precision training
Clear GPU cache between runs
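The memory-saving options above can be combined in a single training step. The following is a minimal sketch in plain PyTorch, not IANN's own training loop: TinyNet, the tensor shapes, and the batch size of 8 are hypothetical placeholders, and mixed precision is only switched on when a CUDA device is present.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class TinyNet(nn.Module):
    """Hypothetical stand-in model; replace with your own network."""

    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(16, 64), nn.SiLU())
        self.block2 = nn.Sequential(nn.Linear(64, 64), nn.SiLU())
        self.head = nn.Linear(64, 1)

    def forward(self, x):
        x = self.block1(x)
        # Gradient checkpointing: block2 activations are recomputed during
        # the backward pass instead of being kept in memory.
        x = checkpoint(self.block2, x, use_reentrant=False)
        return self.head(x)


device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # mixed precision only on GPU in this sketch

model = TinyNet().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Newer PyTorch releases also expose this class as torch.amp.GradScaler.
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

# A small batch (8 samples here) is the first thing to try on OOM errors.
x = torch.randn(8, 16, device=device)
y = torch.randn(8, 1, device=device)

optimizer.zero_grad(set_to_none=True)
with torch.autocast(device_type=device, enabled=use_amp):
    loss = nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()  # scales the loss only when AMP is enabled
scaler.step(optimizer)
scaler.update()

if device == "cuda":
    # Release cached blocks held by PyTorch's allocator between runs.
    torch.cuda.empty_cache()
```

Note that torch.cuda.empty_cache() only returns unused cached memory to the driver; it does not free tensors that are still referenced.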
Training Instability
Problem: Loss becomes NaN or training diverges
Solutions:
Reduce learning rate
Enable gradient clipping
Check data normalization
Verify input data quality
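As an illustration of these checks, the sketch below standardizes the inputs, fails fast on NaN or Inf values, and clips the gradient norm before the optimizer step. It uses plain PyTorch with a hypothetical linear model and random data; the max_norm of 1.0 and the learning rate are assumed values to tune for your own setup.

```python
import torch
from torch import nn


def check_finite(name, tensor):
    """Raise early if a tensor contains NaN or Inf values."""
    if not torch.isfinite(tensor).all():
        raise ValueError(f"{name} contains NaN or Inf values")


model = nn.Linear(4, 1)  # hypothetical stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(32, 4) * 100.0  # deliberately badly scaled features
y = torch.randn(32, 1)

# Standardize each feature to zero mean and unit variance.
x = (x - x.mean(dim=0)) / (x.std(dim=0) + 1e-8)
check_finite("inputs", x)
check_finite("targets", y)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
check_finite("loss", loss)
loss.backward()

# Rescale gradients in place so their global L2 norm is at most max_norm.
# The return value is the norm measured before clipping.
max_norm = 1.0
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
optimizer.step()

print(f"grad norm before clipping: {norm_before.item():.4f}, after: {norm_after.item():.4f}")
```

Logging the pre-clipping norm over time is a cheap way to see whether divergence is preceded by gradient spikes.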
Slow Training
Problem: Training is slower than expected
Solutions:
Increase batch size if memory allows
Use multiple workers for data loading
Enable mixed precision training
Profile training to identify bottlenecks
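Before changing any settings, it helps to know whether the time goes into waiting for data or into computation. The sketch below is one rough way to measure that with the standard library only; TimedLoader and slow_batches are hypothetical helpers written for this example, not part of IANN or PyTorch.

```python
import time


class TimedLoader:
    """Wrap any iterable and accumulate the time spent waiting for each batch."""

    def __init__(self, loader):
        self.loader = loader
        self.wait_seconds = 0.0

    def __iter__(self):
        iterator = iter(self.loader)
        while True:
            start = time.perf_counter()
            try:
                batch = next(iterator)
            except StopIteration:
                return
            self.wait_seconds += time.perf_counter() - start
            yield batch


def slow_batches(n_batches, delay):
    """Simulated data source that takes `delay` seconds per batch."""
    for index in range(n_batches):
        time.sleep(delay)
        yield index


loader = TimedLoader(slow_batches(5, 0.01))
compute_seconds = 0.0
n_batches = 0
for batch in loader:
    start = time.perf_counter()
    _ = sum(range(1000))  # placeholder for the real training step
    compute_seconds += time.perf_counter() - start
    n_batches += 1

print(f"waiting on data: {loader.wait_seconds:.3f} s, computing: {compute_seconds:.3f} s")
```

If the waiting time dominates, raising num_workers on torch.utils.data.DataLoader is the usual first step; if compute dominates, a larger batch size or mixed precision is more likely to help. GPU work is asynchronous, so for accurate compute timings on CUDA call torch.cuda.synchronize() before reading the clock.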
DDP Issues
Gradient Strides Warning
Problem: Warning about gradient strides not matching bucket view strides
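Solution: This is a known PyTorch DDP warning, not an error. It does not change the computed gradients, so training accuracy is unaffected, although PyTorch notes that the stride mismatch may reduce performance.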
Communication Errors
Problem: DDP communication failures
Solutions:
Check network connectivity
Verify NCCL installation
Increase DDP timeout
Check GPU compatibility
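A sketch of how verbose NCCL logging and a longer timeout might be set up when you control process-group initialization yourself. NCCL_DEBUG, TORCH_DISTRIBUTED_DEBUG, and the timeout argument of torch.distributed.init_process_group are standard NCCL and PyTorch settings; the helper names and the timeout values are assumptions made for this example, and IANN may expose its own option for this.

```python
import os
from datetime import timedelta


def configure_ddp_debug(timeout_minutes=60):
    """Enable verbose communication logs and build init_process_group kwargs."""
    # Keep values the user has already exported; only fill in missing ones.
    os.environ.setdefault("NCCL_DEBUG", "INFO")
    os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
    return {"backend": "nccl", "timeout": timedelta(minutes=timeout_minutes)}


def init_distributed(timeout_minutes=60):
    """Initialize the process group; call this once per rank under torchrun."""
    import torch.distributed as dist  # imported lazily so the helper above stays lightweight

    dist.init_process_group(**configure_ddp_debug(timeout_minutes))


kwargs = configure_ddp_debug(timeout_minutes=30)
print(kwargs)
```

The environment variables must be set before the process group is created, which is why they are applied in the same helper that builds the arguments.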
Synchronization Issues
Problem: Models on different GPUs become desynchronized
Solutions:
Use consistent random seeds
Check data loading order
Verify batch size consistency
Monitor gradient norms
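Two small helpers can make desynchronization easier to detect: one that seeds the random number generators, and one that reduces a model's parameters to a single number that should agree on every rank. This is a sketch in plain PyTorch; seed_everything and param_checksum are hypothetical names, and the linear layer stands in for your model.

```python
import random

import torch
from torch import nn


def seed_everything(seed):
    """Seed Python's and PyTorch's random number generators."""
    random.seed(seed)
    torch.manual_seed(seed)  # seeds the CPU and all CUDA generators


def param_checksum(model):
    """Sum of all parameters in float64; should be identical on every rank."""
    return sum(p.detach().double().sum().item() for p in model.parameters())


seed_everything(0)
first = param_checksum(nn.Linear(8, 4))

seed_everything(0)
second = param_checksum(nn.Linear(8, 4))

print(first, second)  # the two values match because the seed was reset
```

Under DDP, each rank could collect the checksums with torch.distributed.all_gather_object and compare them after a fixed number of steps. If you use DistributedSampler, calling set_epoch at the start of every epoch keeps the shuffling order consistent across ranks.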
Prediction Issues
Incorrect Predictions
Problem: Model predictions are inaccurate
Solutions:
Verify model loading
Check input data normalization
Ensure cutoff radius matches training
Validate atomic numbers
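The last two checks are easy to automate before running a prediction. The sketch below is a standalone helper, not an IANN function: validate_inputs and its arguments are hypothetical, and the cutoff values and species lists have to come from your own training configuration.

```python
def validate_inputs(atomic_numbers, inference_cutoff, training_cutoff,
                    trained_species=None, tol=1e-6):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []

    # Atomic numbers must be integers in the range of known elements (1 to 118).
    invalid = [z for z in atomic_numbers
               if not (isinstance(z, int) and 1 <= z <= 118)]
    if invalid:
        problems.append(f"invalid atomic numbers: {invalid}")

    # Elements the model never saw during training give unreliable predictions.
    if trained_species is not None:
        valid = [z for z in atomic_numbers if z not in invalid]
        unseen = sorted(set(valid) - set(trained_species))
        if unseen:
            problems.append(f"species not present in training data: {unseen}")

    # The neighbor cutoff used at inference must match the training cutoff.
    if abs(inference_cutoff - training_cutoff) > tol:
        problems.append(
            f"cutoff mismatch: inference {inference_cutoff} vs training {training_cutoff}"
        )
    return problems


# Water-like input with a matching cutoff: no problems reported.
ok_report = validate_inputs([1, 8, 1], 5.0, 5.0, trained_species=[1, 8])
# Invalid Z = 0, unseen iron (Z = 26), and a cutoff mismatch.
bad_report = validate_inputs([1, 0, 26], 6.0, 5.0, trained_species=[1, 8])
print(ok_report)
print(bad_report)
```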
Performance Issues
Problem: Slow prediction speed
Solutions:
Use batch processing
Enable CUDA if available
Optimize data loading
Consider model quantization
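Batch processing and GPU use can be sketched as follows for a generic PyTorch module. predict_in_batches is a hypothetical helper, and it assumes the inputs fit in a single tensor that can be sliced along the first dimension; graph-structured inputs would need the batching utilities of your data pipeline instead.

```python
import torch
from torch import nn


def predict_in_batches(model, inputs, batch_size=256, device=None):
    """Run inference in fixed-size batches without tracking gradients."""
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    outputs = []
    with torch.no_grad():  # no autograd bookkeeping during prediction
        for start in range(0, len(inputs), batch_size):
            batch = inputs[start:start + batch_size].to(device)
            outputs.append(model(batch).cpu())
    return torch.cat(outputs)


model = nn.Linear(3, 1)  # hypothetical stand-in model
samples = torch.randn(10, 3)
preds = predict_in_batches(model, samples, batch_size=4)
print(preds.shape)
```

Larger batches amortize the per-call overhead, so increase batch_size until memory becomes the limit.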
LAMMPS Integration
Model Loading
Problem: LAMMPS fails to load the model
Solutions:
Verify model export format
Check file permissions
Ensure correct LAMMPS version
Validate model compatibility
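Basic file problems can be ruled out before looking at export formats or LAMMPS versions. The sketch below checks that the model file exists, is readable, and is not empty; check_model_file and the file names are hypothetical, and the check says nothing about whether the contents are a valid exported model.

```python
import os
import tempfile


def check_model_file(path):
    """Return 'ok' or a short description of the first problem found."""
    if not os.path.isfile(path):
        return "missing: no regular file at this path"
    if not os.access(path, os.R_OK):
        return "unreadable: current user lacks read permission"
    if os.path.getsize(path) == 0:
        return "empty: file has zero bytes, the export may have failed"
    return "ok"


# Demonstration on temporary files; point the function at your exported model instead.
with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, "model.pt")  # hypothetical file name
    with open(good_path, "wb") as handle:
        handle.write(b"\x00" * 16)
    empty_path = os.path.join(tmp, "empty.pt")
    open(empty_path, "wb").close()
    results = {
        "good": check_model_file(good_path),
        "empty": check_model_file(empty_path),
        "missing": check_model_file(os.path.join(tmp, "nope.pt")),
    }

print(results)
```

Run the check as the same user and on the same node that launches LAMMPS, since permissions and mounted paths can differ between login and compute nodes.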
Energy/Force Issues
Problem: Incorrect energies or forces in LAMMPS
Solutions:
Check unit conversion
Verify cutoff radius
Validate energy/force scaling
Test with simple systems
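A simple-system test can look like the sketch below: a two-atom Lennard-Jones pair where the analytic force is compared against a finite difference of the energy, followed by a unit conversion. The Lennard-Jones function is only a stand-in for the model's energy, and the argon-like parameters are illustrative. LAMMPS "metal" units use eV and Angstrom, while "real" units use kcal/mol and Angstrom, so a mismatch shows up as a factor of about 23.06.

```python
EV_TO_KCAL_PER_MOL = 23.060548  # 1 eV expressed in kcal/mol


def lj_energy(r, epsilon=0.0104, sigma=3.4):
    """Lennard-Jones pair energy in eV for a separation r in Angstrom."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def lj_force(r, epsilon=0.0104, sigma=3.4):
    """Analytic radial force F = -dE/dr in eV/Angstrom."""
    sr6 = (sigma / r) ** 6
    return 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r


def finite_difference_force(energy_fn, r, h=1e-5):
    """Central-difference estimate of -dE/dr."""
    return -(energy_fn(r + h) - energy_fn(r - h)) / (2.0 * h)


r = 3.8  # separation in Angstrom
analytic = lj_force(r)
numeric = finite_difference_force(lj_energy, r)
print(f"analytic force: {analytic:.8f} eV/A, finite difference: {numeric:.8f} eV/A")

# The same energy in the units expected by `units real` in LAMMPS.
energy_ev = lj_energy(r)
energy_kcal = energy_ev * EV_TO_KCAL_PER_MOL
print(f"energy: {energy_ev:.6f} eV = {energy_kcal:.6f} kcal/mol")
```

Replacing lj_energy with a call to the trained model for the same dimer gives a quick consistency check: forces that disagree with the finite difference of the energy point to a sign, scaling, or unit error rather than to a poorly trained model.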
Performance Problems
Problem: Slow MD simulations
Solutions:
Optimize neighbor list settings
Adjust communication settings
Use appropriate parallelization
Profile simulation
General Tips
Debugging
Enable debug logging
Use smaller test cases
Check intermediate outputs
Monitor memory usage
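These debugging steps can be sketched with the Python standard library alone. The logger name "iann.debug" and the summarize helper are assumptions made for this example; IANN may provide its own logging configuration.

```python
import logging
import tracemalloc

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
)
logger = logging.getLogger("iann.debug")  # hypothetical logger name


def summarize(name, values):
    """Log and return basic statistics for an intermediate output."""
    values = list(values)
    stats = {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }
    logger.debug("%s: %s", name, stats)
    return stats


tracemalloc.start()  # tracks Python (host) allocations, not GPU memory

energies = [0.1 * i for i in range(1000)]  # small test case
stats = summarize("energies", energies)

current_bytes, peak_bytes = tracemalloc.get_traced_memory()
logger.debug("host memory: current=%d bytes, peak=%d bytes", current_bytes, peak_bytes)
tracemalloc.stop()
```

For GPU memory, torch.cuda.memory_allocated() and torch.cuda.max_memory_allocated() report the current and peak usage of the PyTorch allocator.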
Performance Optimization
Profile your code
Use appropriate batch sizes
Enable mixed precision
Optimize data loading
Best Practices
Keep track of model versions
Document configuration changes
Use version control
Run tests regularly
References
For more specific issues or if you need additional help, please:
Check the GitHub issues page
Review the API documentation
Contact the maintainers
Maintainers
IANN is maintained by Dr. Changzhi Ai (changzhi@stanford.edu) at the SUNCAT Center, Stanford University and SLAC, under the supervision of Dr. Johannes Voss and Dr. Frank Abild-Pedersen.