Sirui Liu

Nanjing Agricultural University

ORCID: 0000-0002-9369-6291

Publishes on Genomics and Chromatin Dynamics, Protein Structure and Dynamics, Epigenetics and DNA Methylation. 79 papers and 1.9k citations.

Top publications by citations

Unraveling the functional dark matter through global metagenomics
Cited by 193|Open Access

Metagenomes encode an enormous diversity of proteins, reflecting a multiplicity of functions and activities [1,2]. Exploration of this vast sequence space has been limited to a comparative analysis against reference microbial genomes and protein families derived from those genomes. Here, to examine the scale of yet untapped functional diversity beyond what is currently possible through the lens of reference genomes, we develop a computational approach to generate reference-free protein families from the sequence space in metagenomes. We analyse 26,931 metagenomes and identify 1.17 billion protein sequences longer than 35 amino acids with no similarity to any sequences from 102,491 reference genomes or the Pfam database [3]. Using massively parallel graph-based clustering, we group these proteins into 106,198 novel sequence clusters with more than 100 members, doubling the number of protein families obtained from the reference genomes clustered using the same approach. We annotate these families on the basis of their taxonomic, habitat, geographical and gene neighbourhood distributions and, where sufficient sequence diversity is available, predict protein three-dimensional models, revealing novel structures. Overall, our results uncover an enormously diverse functional space, highlighting the importance of further exploring the microbial functional dark matter.

Protein sequence design by conformational landscape optimization
Christoffer Norn, Basile I. M. Wicky, David Juergens et al.|Proceedings of the National Academy of Sciences|2021
Cited by 151|Open Access

The protein design problem is to identify an amino acid sequence that folds to a desired structure. Given Anfinsen's thermodynamic hypothesis of folding, this can be recast as finding an amino acid sequence for which the desired structure is the lowest energy state. As this calculation involves not only all possible amino acid sequences but also, all possible structures, most current approaches focus instead on the more tractable problem of finding the lowest-energy amino acid sequence for the desired structure, often checking by protein structure prediction in a second step that the desired structure is indeed the lowest-energy conformation for the designed sequence, and typically discarding a large fraction of designed sequences for which this is not the case. Here, we show that by backpropagating gradients through the transform-restrained Rosetta (trRosetta) structure prediction network from the desired structure to the input amino acid sequence, we can directly optimize over all possible amino acid sequences and all possible structures in a single calculation. We find that trRosetta calculations, which consider the full conformational landscape, can be more effective than Rosetta single-point energy estimations in predicting folding and stability of de novo designed proteins. We compare sequence design by conformational landscape optimization with the standard energy-based sequence design methodology in Rosetta and show that the former can result in energy landscapes with fewer alternative energy minima. We show further that more funneled energy landscapes can be designed by combining the strengths of the two approaches: the low-resolution trRosetta model serves to disfavor alternative states, and the high-resolution Rosetta model serves to create a deep energy minimum at the design target structure.

From 1D sequence to 3D chromatin dynamics and cellular functions: a phase separation perspective
Sirui Liu, Ling Zhang, Hui Quan et al.|Nucleic Acids Research|2018
Cited by 74|Open Access

The high-order chromatin structure plays a non-negligible role in gene regulation. However, the mechanism, especially the sequence dependence for the formation of varied chromatin structures in different cells remains to be elucidated. As the nucleotide distributions in human and mouse genomes are highly uneven, we identified CGI (CpG island) forest and prairie genomic domains based on CGI densities of a species, dividing the genome into two sequentially, epigenetically, and transcriptionally distinct regions. These two megabase-sized domains also spatially segregate to different extents in different cell types. Forests and prairies show enhanced segregation from each other in development, differentiation, and senescence, meanwhile the multi-scale forest-prairie spatial intermingling is cell-type specific and increases in differentiation, helping to define cell identity. We propose that the phase separation of the 1D mosaic sequence in space serves as a potential driving force, and together with cell type specific epigenetic marks and transcription factors, shapes the chromatin structure in different cell types. The mosaicity in genome of different species in terms of forests and prairies could relate to observations in their biological processes like development and aging. In this way, we provide a bottoms-up theory to explain the chromatin structural and epigenetic changes in different processes.