HPC-GVCW: Accelerating SNP Detection in Major Crops through High-Performance Computing and Parallelized Workflows

The exponential increase in sequencing data and the need for rapid processing make efficient SNP (single-nucleotide polymorphism) detection challenging in major crops such as rice, sorghum, maize and soybean. SNP detection underpins trait discovery and genomic selection, so faster and more efficient workflows are needed. Researchers from King Abdullah University of Science and Technology, University of Arizona, Huazhong Agricultural University, International Rice Research Institute, Cold Spring Harbor Laboratory, Crop Science Research Center and the USDA-ARS introduce HPC-GVCW, an open-source high-performance computing genome variant calling workflow designed to address these challenges. The workflow uses high-performance computing (HPC) architectures and a novel algorithm called "Genome Index splitter" to parallelize genome processing and significantly reduce execution times. By dividing genomes into smaller chunks that can be processed in parallel, HPC-GVCW calls SNPs from large datasets, such as resequenced rice accessions, in a fraction of the time previously required.
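The chunking idea can be illustrated with a minimal sketch. This is not the authors' "Genome Index splitter" algorithm, whose details are not given here; it is a generic fixed-size windowing scheme, and the function names, toy chromosome lengths and chunk size are all hypothetical. Each resulting interval could, in principle, be handed to an independent variant-calling job.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[str, int, int]  # (chromosome, start, end), 1-based inclusive


def split_into_chunks(chrom_lengths: Dict[str, int], chunk_size: int) -> List[Chunk]:
    """Divide each chromosome into consecutive windows of at most chunk_size bases.

    Illustrative only: a plain fixed-size split, not the published algorithm.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: List[Chunk] = []
    for chrom, length in chrom_lengths.items():
        start = 1
        while start <= length:
            end = min(start + chunk_size - 1, length)
            chunks.append((chrom, start, end))
            start = end + 1
    return chunks


def format_interval(chunk: Chunk) -> str:
    """Render a chunk as a 'chrom:start-end' interval string."""
    chrom, start, end = chunk
    return f"{chrom}:{start}-{end}"


if __name__ == "__main__":
    # Toy chromosome lengths (hypothetical, far smaller than any real genome).
    toy_genome = {"chr1": 2500, "chr2": 1000}
    for chunk in split_into_chunks(toy_genome, chunk_size=1000):
        print(format_interval(chunk))
```

Because the windows do not overlap and together cover every base, per-chunk results can be concatenated in order after the parallel jobs finish.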

The paper demonstrates the utility of HPC-GVCW by running it on a dataset comprising 25 crop genomes and calling millions of SNPs for rice, sorghum, maize and soybean. Analysis of these SNPs reveals a substantial number of novel variations, particularly in a 16-genome rice reference panel, which could significantly contribute to molecular-assisted selection breeding programs and functional genomics studies. The authors emphasize the reproducibility of their results through containerization and highlight the scalability of HPC-GVCW across various computational platforms. They also discuss future directions, including integrating machine learning-based tools like "DeepVariant" into the workflow and expanding pan-genome strategies to other species, with the goal of enhancing genetic studies and breeding programs in major crops. Overall, HPC-GVCW represents a significant advancement in SNP detection, offering a faster and more scalable solution for exploring genetic diversity in crop plants.

Our high-performance computing (HPC) workflow was specifically crafted to expedite variant identification in key crop species, enabling scientists to dedicate their efforts primarily to downstream analysis utilizing pan-genomic variant data. The vast potential and ample opportunities within this dataset warrant thorough exploration. While the logical structure and demonstration of the workflow were straightforward, its implementation on the HPC platform presented numerous obstacles. By collaborating closely with a team of computer scientists, we successfully navigated these challenges. Such interdisciplinary collaborations will become increasingly vital in the future. - Zhou

We are excited that our HPC workflow can be widely adapted to rapidly call SNPs on multiple crop species on genomes that range over 10 fold in size (e.g. rice GS = 400 Mb, sorghum GS = 700 Mb, and maize GS = 2400 Mb). - Wing

Reference:

Zhou Y, Kathiresan N, Yu Z, Rivera LF, Yang Y, Thimma M, Manickam K, Chebotarov D, Mauleon R, Chougule K, Wei S, Gao T, Green CD, Zuccolo A, Xie W, Ware D, Zhang J, McNally KL, Wing RA. A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset. BMC Biol. 2024 Jan 25;22(1):13. PMID: 38273258. doi: 10.1186/s12915-024-01820-5.

Image 1: From left to right, Yong Zhou, Nagarajan Kathiresan, Zhichao Yu, Luis Rivera Serna, Doreen Ware, Jianwei Zhang, Kenneth McNally, Rod A. Wing.