Bozitao Zhong

Shanghai Jiao Tong University

ORCID: 0000-0001-9363-6099

Publishes on Protein Structure and Dynamics, Machine Learning in Bioinformatics, RNA and protein synthesis mechanisms. 58 papers and 404 citations.

58 Publications
404 Total Citations

Top publications by citations

Pretrainable geometric graph neural network for antibody affinity maturation
Huiyu Cai, Zuobai Zhang, Mingkai Wang et al.|Nature Communications|2024
Cited by 42|Open Access

Increasing the binding affinity of an antibody to its target antigen is a crucial task in antibody therapeutics development. This paper presents a pretrainable geometric graph neural network, GearBind, and explores its potential in in silico affinity maturation. Leveraging multi-relational graph construction, multi-level geometric message passing and contrastive pretraining on mass-scale, unlabeled protein structural data, GearBind outperforms previous state-of-the-art approaches on SKEMPI and an independent test set. A powerful ensemble model based on GearBind is then derived and used to successfully enhance the binding of two antibodies with distinct formats and target antigens. ELISA EC50 values of the designed antibody mutants are decreased by up to 17 fold, and KD values by up to 6.1 fold. These promising results underscore the utility of geometric deep learning and effective pretraining in macromolecule interaction modeling tasks.

ProSST: Protein Language Modeling with Quantized Structure and Disentangled Attention
Mingchen Li, Pan Tan, Xinzhu Ma et al.|bioRxiv (Cold Spring Harbor Laboratory)|2024
Cited by 35|Open Access

Protein language models (PLMs) have shown remarkable capabilities in various protein function prediction tasks. However, while protein function is intricately tied to structure, most existing PLMs do not incorporate protein structure information. To address this issue, we introduce ProSST, a Transformer-based protein language model that seamlessly integrates both protein sequences and structures. ProSST incorporates a structure quantization module and a Transformer architecture with disentangled attention. The structure quantization module translates a 3D protein structure into a sequence of discrete tokens by first serializing the protein structure into residue-level local structures and then embedding them into a dense vector space. These vectors are then quantized into discrete structure tokens by a pre-trained clustering model. These tokens serve as an effective protein structure representation. Furthermore, ProSST explicitly learns the relationship between protein residue token sequences and structure token sequences through sequence-structure disentangled attention. We pre-train ProSST on millions of protein structures using a masked language model objective, enabling it to learn comprehensive contextual representations of proteins. To evaluate ProSST, we conduct extensive experiments on zero-shot mutation effect prediction and several supervised downstream tasks, where ProSST achieves state-of-the-art performance among all baselines. Our code and pretrained models are publicly available.
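
The quantization step described in this abstract, mapping dense per-residue vectors to discrete structure tokens with a clustering model, can be illustrated with a nearest-centroid lookup. This is only a minimal sketch under stated assumptions: the abstract does not specify the clustering algorithm, embedding dimension, or codebook size, so the 2D vectors, the three centroids, and the `quantize` helper below are all hypothetical and are not taken from the ProSST code.

```python
import math

def quantize(embeddings, centroids):
    """Map each embedding vector to the index of its nearest centroid.

    Hypothetical helper: stands in for a pre-trained clustering model
    by using plain Euclidean nearest-centroid assignment.
    """
    tokens = []
    for vec in embeddings:
        distances = [math.dist(vec, c) for c in centroids]
        tokens.append(distances.index(min(distances)))
    return tokens

# Made-up codebook of three centroids in a 2D embedding space.
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]

# Made-up per-residue embeddings of local structures.
embeddings = [(0.1, -0.1), (0.9, 1.2), (0.2, 1.8), (0.8, 0.9)]

structure_tokens = quantize(embeddings, centroids)
print(structure_tokens)  # [0, 1, 2, 1]
```

Each residue ends up with one integer token, so a structure becomes a token sequence of the same length as the amino acid sequence, which is what lets a language model consume the two side by side.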

ParaFold: Paralleling AlphaFold for Large-Scale Predictions
Bozitao Zhong, Xiaoming Su, Minhua Wen et al.|Unknown|2022
Cited by 33

AlphaFold, developed by DeepMind, predicts protein structures from the amino acid sequence at or near experimental resolution, solving the 50-year-old protein folding challenge and enabling large-scale genomics data to be transformed into protein structures. AlphaFold will also greatly change the scientific research model from a low-throughput to a high-throughput manner. The overall AlphaFold prediction process consists of two stages: 1) MSA construction on CPUs and 2) model inference on GPUs. In the first stage, AlphaFold uses CPUs only, taking up to hours for MSA construction of a single protein due to the large database sizes and I/O bottlenecks. GPUs remain idle during this stage, resulting in low GPU utilization and restricting the capacity for large-scale structure predictions. Therefore, we proposed “ParaFold”, an open-source parallel version of AlphaFold for high-throughput protein structure predictions. ParaFold separates the CPU and GPU parts to enable large-scale structure predictions and to improve GPU utilization. ParaFold also effectively reduces the CPU and GPU runtime with two optimizations without compromising the quality of prediction results: using multi-threaded parallelism on CPUs and using optimized JAX compilation on GPUs. We evaluated ParaFold with three datasets of different protein lengths. We showed the large-scale structure prediction capability by running model 1 inference of ~20,000 small proteins in 5.4 hours on one NVIDIA DGX-2. With the CPU/GPU separation and JAX compile optimization, the total GPU runtime was reduced to 5.4 hours, compared with 1,352.6 hours when using AlphaFold, achieving a 99.7% GPU runtime reduction. ParaFold largely increased the protein structure prediction capacity of GPU per day, achieving a 250X speedup over AlphaFold in this case (~20,000 proteins of the same 50 residues).
ParaFold offers a rapid and effective approach for high-throughput structure predictions, leveraging AlphaFold's predictive power on supercomputers in less time and at a lower cost. The development of ParaFold will greatly speed up high-throughput studies and render protein “structure-omics” feasible.
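
The CPU/GPU separation described in this abstract can be pictured as a two-stage pipeline: all CPU-side feature construction runs first, in parallel, and the accelerator stage then consumes the precomputed features back to back. The following is a minimal sketch of that idea only, not ParaFold's implementation; `build_msa`, `run_inference`, and `predict_all` are hypothetical stand-ins that do no real alignment or inference.

```python
from concurrent.futures import ThreadPoolExecutor

def build_msa(sequence: str) -> dict:
    """Hypothetical stand-in for the CPU-bound MSA/feature stage."""
    return {"sequence": sequence, "length": len(sequence)}

def run_inference(features: dict) -> str:
    """Hypothetical stand-in for the GPU-bound model inference stage."""
    return f"structure[{features['sequence']}]"

def predict_all(sequences, cpu_workers=4):
    # Stage 1: build features for every sequence in parallel on CPU threads.
    # pool.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=cpu_workers) as pool:
        features = list(pool.map(build_msa, sequences))
    # Stage 2: run inference over the precomputed features one after another,
    # so the accelerator never sits idle waiting for MSA construction.
    return [run_inference(f) for f in features]

results = predict_all(["MKT", "GAVL", "PPG"])
print(results)
```

Because the two stages share nothing but the feature dictionaries, they could equally be run as separate jobs on different machines, which is the property that makes the separation useful for batch workloads.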

Mechanism of zinc ejection by disulfiram in nonstructural protein 5A
Ashfaq Ur Rehman, Guodong Zhen, Bozitao Zhong et al.|Physical Chemistry Chemical Physics|2021
Cited by 31

Hepatitis C virus (HCV) is a notorious member of the Flaviviridae family of enveloped, positive-strand RNA viruses. Non-structural protein 5A (NS5A) plays a key role in HCV replication and assembly. NS5A is a multi-domain protein which includes an N-terminal amphipathic membrane anchoring alpha helix, a highly structured domain-1, and two intrinsically disordered domains 2-3. The highly structured domain-1 contains a zinc finger (Zf)-site, and binding of zinc stabilizes the overall structure, while ejection of this zinc from the Zf-site destabilizes the overall structure. Therefore, NS5A is an attractive target for anti-HCV therapy by disulfiram, through ejection of zinc from the Zf-site. However, the zinc ejection mechanism is poorly understood. To disclose this mechanism based on three different states, A-state (NS5A protein), B-state (NS5A + Zn), and C-state (NS5A + Zn + disulfiram), we have performed molecular dynamics (MD) simulation in tandem with DFT calculations in the current study. The MD results indicate that disulfiram triggers Zn ejection from the Zf-site predominantly through altering the overall conformation ensemble. On the other hand, the DFT assessment demonstrates that the Zn adopts a tetrahedral configuration at the Zf-site with four Cys residues, which indicates a stable protein structure morphology. Disulfiram binding induces major conformational changes at the Zf-site, introduces new interactions of Cys39 with disulfiram, and further weakens the interaction of this residue with Zn, causing ejection of zinc from the Zf-site. The proposed mechanism elucidates the therapeutic potential of disulfiram and offers theoretical guidance for the advancement of drug candidates.

Precise Generation of Conformational Ensembles for Intrinsically Disordered Proteins via Fine-tuned Diffusion Models
Junjie Zhu, Zhengxin Li, Bo Zhang et al.|bioRxiv (Cold Spring Harbor Laboratory)|2024
Cited by 29|Open Access

Intrinsically disordered proteins (IDPs) play pivotal roles in various biological functions and are closely linked to many human diseases including cancer, diabetes and Alzheimer's disease. Structural investigations of IDPs typically involve a combination of molecular dynamics (MD) simulations and experimental data to correct for intrinsic biases in simulation methods. However, these simulations are hindered by their high computational cost and a scarcity of experimental data, severely limiting their applicability. Despite the recent advancements in structure prediction for structured proteins, understanding the conformational properties of IDPs remains challenging, partly due to the poor conservation of disordered protein sequences and limited experimental characterization. Here, we introduce IDPFold, a method capable of generating conformational ensembles for IDPs directly from their sequences using fine-tuned diffusion models. IDPFold bypasses the need for Multiple Sequence Alignments (MSA) or experimental data, achieving accurate predictions of ensemble properties across numerous IDPs. By sampling conformations at the backbone level, IDPFold provides more detailed structural features and more precise property estimation compared to other state-of-the-art methods. IDPFold is ready to be used to elucidate the sequence-disorder-function paradigm of IDPs.