Matthis Ebel

BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS, and TSEBRA

Lars Gabriel, Tomáš Brůna, Katharina J. Hoff et al.|Genome Research|2024

Cited by 537 | Open Access

Gene prediction has remained an active area of bioinformatics research for a long time. Still, gene prediction in large eukaryotic genomes presents a challenge that must be addressed by new algorithms. The amount and significance of the evidence available from transcriptomes and proteomes vary across genomes, between genes, and even along a single gene. User-friendly and accurate annotation pipelines that can cope with such data heterogeneity are needed. The previously developed annotation pipelines BRAKER1 and BRAKER2 use RNA-seq or protein data, respectively, but not both. A further significant performance improvement integrating all three data types was made by the recently released GeneMark-ETP. We here present the BRAKER3 pipeline that builds on GeneMark-ETP and AUGUSTUS, and further improves accuracy using the TSEBRA combiner. BRAKER3 annotates protein-coding genes in eukaryotic genomes using both short-read RNA-seq and a large protein database, along with statistical models learned iteratively and specifically for the target genome. We benchmarked the new pipeline on genomes of 11 species under an assumed level of relatedness of the target species proteome to available proteomes. BRAKER3 outperforms BRAKER1 and BRAKER2. The average transcript-level F1-score is increased by about 20 percentage points on average, whereas the difference is most pronounced for species with large and complex genomes. BRAKER3 also outperforms other existing tools, MAKER2, Funannotate, and FINDER. The code of BRAKER3 is available on GitHub and as a ready-to-run Docker container for execution with Docker or Singularity. Overall, BRAKER3 is an accurate, easy-to-use tool for eukaryotic genome annotation.

BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS and TSEBRA

Lars Gabriel, Tomáš Brůna, Katharina J. Hoff et al.|bioRxiv (Cold Spring Harbor Laboratory)|2023

Cited by 236 | Open Access

Gene prediction has remained an active area of bioinformatics research for a long time. Still, gene prediction in large eukaryotic genomes presents a challenge that must be addressed by new algorithms. The amount and significance of the evidence available from transcriptomes and proteomes vary across genomes, between genes and even along a single gene. User-friendly and accurate annotation pipelines that can cope with such data heterogeneity are needed. The previously developed annotation pipelines BRAKER1 and BRAKER2 use RNA-seq or protein data, respectively, but not both. A further significant performance improvement was made by the recently released GeneMark-ETP integrating all three data types. We here present the BRAKER3 pipeline that builds on GeneMark-ETP and AUGUSTUS and further improves accuracy using the TSEBRA combiner. BRAKER3 annotates protein-coding genes in eukaryotic genomes using both short-read RNA-seq and a large protein database, along with statistical models learned iteratively and specifically for the target genome. We benchmarked the new pipeline on genomes of 11 species under assumed level of relatedness of the target species proteome to available proteomes. BRAKER3 outperformed BRAKER1 and BRAKER2. The average transcript-level F1-score was increased by ~20 percentage points on average, while the difference was most pronounced for species with large and complex genomes. BRAKER3 also outperformed other existing tools, MAKER2, Funannotate and FINDER. The code of BRAKER3 is available on GitHub and as a ready-to-run Docker container for execution with Docker or Singularity. Overall, BRAKER3 is an accurate, easy-to-use tool for eukaryotic genome annotation.

Galba: genome annotation with miniprot and AUGUSTUS

Tomáš Brůna, Heng Li, Joseph Guhlin et al.|BMC Bioinformatics|2023

Cited by 101 | Open Access

BACKGROUND: The Earth Biogenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes. RESULTS: Various gene annotation tools have been developed but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a land plant. GALBA is fully open source and available as a docker image for easy execution with Singularity in high-performance computing environments. CONCLUSIONS: Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.

Classifying sex with volume-matched brain MRI

Matthis Ebel, Martin Domín, Nicola Neumann et al.|Neuroimage Reports|2023

Cited by 11 | Open Access

Sex differences in the size of specific brain structures have been extensively studied, but careful and reproducible statistical hypothesis testing to identify them produced overall small effect sizes and differences in brains of males and females. On the other hand, multivariate statistical or machine learning methods that analyze MR images of the whole brain have reported respectable accuracies for the task of distinguishing brains of males from brains of females. However, most existing studies lacked a careful control for brain volume differences between sexes and, if done, their accuracy often declined to 70% or below. This raises questions about the relevance of accuracies achieved without careful control of overall volume. We examined how accurately sex can be classified from gray matter properties of the human brain when matching on overall brain volume. We tested how robust machine learning classifiers are when predicting cross-cohort, i.e., when they are used on a different cohort than they were trained on. Furthermore, we studied how their accuracy depends on the size of the training set and attempted to identify brain regions relevant for successful classification. MRI data was used from two population-based data sets of 3298 mostly older adults from the Study of Health in Pomerania (SHIP) and 399 mostly younger adults from the Human Connectome Project (HCP), respectively. We benchmarked two multivariate methods, logistic regression and a 3D convolutional neural network. We show that male and female brains of the same intracranial volume can be distinguished with >92% accuracy with logistic regression on a dataset of 1166 matched individuals. The same model also reached 85% accuracy on a different cohort without retraining. The accuracy for both methods increased with the training cohort size up to and beyond 3000 individuals, suggesting that classifiers trained on smaller cohorts likely have an accuracy disadvantage. We found no single outstanding brain region necessary for successful classification, but important features appear rather distributed across the brain.

GALBA: Genome Annotation with Miniprot and AUGUSTUS

Tomáš Brůna, Heng Li, Joseph Guhlin et al.|bioRxiv (Cold Spring Harbor Laboratory)|2023

Cited by 5 | Open Access

The Earth Biogenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes. Various gene annotation tools have been developed but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a previously unannotated land plant. GALBA is fully open source and available as a docker image for easy execution with Singularity in high-performance computing environments. Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.


Top publications by citations