ProSST: Protein Language Modeling with Quantized Structure and Disentangled Attention

Mingchen Li; Pan Tan; Xinzhu Ma; Bozitao Zhong; Huiqun Yu; Ziyi Zhou; Wanli Ouyang; Bingxin Zhou; Liang Hong; Yang Tan

doi:10.1101/2024.04.15.589672


Mingchen Li (Beijing Academy of Artificial Intelligence), Pan Tan (Shanghai Jiao Tong University), Xinzhu Ma (Beijing Academy of Artificial Intelligence), Bozitao Zhong (Shanghai Jiao Tong University), Huiqun Yu (East China University of Science and Technology), Ziyi Zhou (Shanghai Jiao Tong University), Wanli Ouyang (Beijing Academy of Artificial Intelligence), Bingxin Zhou (Shanghai Jiao Tong University), Liang Hong (Shanghai Jiao Tong University), Yang Tan (Beijing Academy of Artificial Intelligence)

bioRxiv (Cold Spring Harbor Laboratory)

April 17, 2024




Abstract

Protein language models (PLMs) have shown remarkable capabilities in various protein function prediction tasks. However, while protein function is intricately tied to structure, most existing PLMs do not incorporate protein structure information. To address this issue, we introduce ProSST, a Transformer-based protein language model that integrates both protein sequences and structures. ProSST incorporates a structure quantization module and a Transformer architecture with disentangled attention. The structure quantization module translates a 3D protein structure into a sequence of discrete tokens by first serializing the protein structure into residue-level local structures and then embedding them into a dense vector space. These vectors are then quantized into discrete structure tokens by a pre-trained clustering model. These tokens serve as an effective protein structure representation. Furthermore, ProSST explicitly learns the relationship between protein residue token sequences and structure token sequences through sequence-structure disentangled attention. We pre-train ProSST on millions of protein structures using a masked language model objective, enabling it to learn comprehensive contextual representations of proteins. To evaluate ProSST, we conduct extensive experiments on zero-shot mutation effect prediction and several supervised downstream tasks, where ProSST achieves state-of-the-art performance among all baselines. Our code and pretrained models are publicly available.
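The quantization step described in the abstract can be illustrated with a minimal sketch. It assumes the pre-trained clustering model behaves like a k-means codebook, so each residue's local-structure embedding is mapped to the index of its nearest centroid. The function name `quantize`, the toy codebook, and the toy embeddings are hypothetical and are not taken from the ProSST code; the structure encoder that would produce the embeddings is not shown.

```python
import numpy as np


def quantize(embeddings: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map each embedding to the index of its nearest centroid.

    embeddings: (n_residues, dim) local-structure vectors (assumed given).
    centroids:  (n_tokens, dim) codebook from a clustering model (assumed k-means-like).
    Returns an integer array of shape (n_residues,) with one structure token per residue.
    """
    # Squared Euclidean distance between every embedding and every centroid.
    diffs = embeddings[:, None, :] - centroids[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)  # shape (n_residues, n_tokens)
    return dists.argmin(axis=1)


# Toy codebook: 4 structure tokens in a 4-dimensional embedding space.
centroids = np.eye(4)

# Toy embeddings: slightly perturbed copies of centroids 3, 0, 2, 3.
embeddings = centroids[[3, 0, 2, 3]] + 0.05

structure_tokens = quantize(embeddings, centroids)
print(structure_tokens.tolist())
```

In this toy setup each embedding lies close to the centroid it was copied from, so the resulting token sequence recovers the indices used to build it. A real codebook would be much larger and fitted on embeddings of many local structures.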
