ParaFold: Paralleling AlphaFold for Large-Scale Predictions

Bozitao Zhong (Shanghai Jiao Tong University), Xiaoming Su (Shanghai Jiao Tong University), Minhua Wen (Shanghai Jiao Tong University), Si-Cheng Zuo (Shanghai Jiao Tong University), Liang Hong (Shanghai Jiao Tong University), James Lin (Shanghai Jiao Tong University)
January 11, 2022
Cited by 33

Abstract

AlphaFold, developed by DeepMind, predicts protein structures from amino acid sequences at or near experimental resolution, solving the 50-year-old protein folding challenge and making it possible to transform large-scale genomics data into protein structures. AlphaFold will also shift the scientific research model from low throughput to high throughput. The overall AlphaFold prediction process consists of two stages: 1) multiple sequence alignment (MSA) construction on CPUs and 2) model inference on GPUs. The first stage uses CPUs only and can take hours for a single protein because of the large database sizes and I/O bottlenecks. GPUs remain idle during this stage, which lowers GPU utilization and restricts the capacity for large-scale structure prediction. We therefore propose ParaFold, an open-source parallel version of AlphaFold for high-throughput protein structure prediction. ParaFold separates the CPU and GPU parts to enable large-scale structure prediction and to improve GPU utilization. It also reduces CPU and GPU runtime through two optimizations that do not compromise prediction quality: multi-threaded parallelism on CPUs and optimized JAX compilation on GPUs. We evaluated ParaFold on three datasets of different protein lengths. We demonstrated its large-scale prediction capability by running model 1 inference for ∼20,000 small proteins in 5.4 hours on one NVIDIA DGX-2. With the CPU/GPU separation and the JAX compilation optimization, the total GPU runtime was 5.4 hours, compared with 1,352.6 hours when using AlphaFold, a 99.7% reduction in GPU runtime. In this case (∼20,000 proteins, each 50 residues long), ParaFold greatly increased the number of structures a GPU can predict per day, achieving a 250× speedup over AlphaFold.
ParaFold offers a rapid and effective approach to high-throughput structure prediction, harnessing supercomputers to deliver predictions in less time and at lower cost. The development of ParaFold will greatly speed up high-throughput studies and make protein “structure-omics” feasible.
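
The CPU/GPU separation described in the abstract can be illustrated with a minimal Python sketch. This is not ParaFold's implementation, and the abstract does not say how the two stages are scheduled. The sketch assumes a simple producer/consumer layout in which a pool of CPU workers prepares features while a single consumer runs inference as soon as each feature set is ready. The names `build_features`, `run_inference`, and `predict_all` are hypothetical stand-ins, and neither stage does any real alignment or structure prediction here.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue
from threading import Thread


def build_features(name, sequence):
    """Hypothetical stand-in for the CPU-only stage (MSA construction)."""
    return {"name": name, "length": len(sequence)}


def run_inference(features):
    """Hypothetical stand-in for the GPU stage (model inference)."""
    return f"{features['name']}: predicted {features['length']} residues"


def predict_all(sequences, cpu_workers=4):
    """Run the CPU stage in a worker pool and feed finished features
    to one inference consumer, so the two stages can overlap."""
    ready = Queue()   # features waiting for inference
    results = {}

    def gpu_consumer():
        # Pull feature sets until the None sentinel arrives.
        while True:
            item = ready.get()
            if item is None:
                break
            results[item["name"]] = run_inference(item)

    consumer = Thread(target=gpu_consumer)
    consumer.start()

    with ThreadPoolExecutor(max_workers=cpu_workers) as pool:
        futures = [pool.submit(build_features, n, s) for n, s in sequences.items()]
        for future in futures:
            ready.put(future.result())

    ready.put(None)   # signal that no more features will arrive
    consumer.join()
    return results
```

Calling `predict_all({"a": "MKV", "b": "GG"})` returns one result string per input name. The point of the layout is that the inference consumer never waits on feature construction for proteins whose features are already queued, which is the kind of idle time the abstract attributes to running both stages in sequence.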

