Leandro Lima

A fast and agnostic method for bacterial genome-wide association studies: Bridging the gap between k-mers and genetic events

Magali Jaillard, Leandro Lima, Maud Tournoud et al.|PLoS Genetics|2018

Cited by 225 | Open Access

Genome-wide association study (GWAS) methods applied to bacterial genomes have shown promising results for genetic marker discovery or detailed assessment of marker effect. Recently, alignment-free methods based on k-mer composition have proven their ability to explore the accessory genome. However, they lead to redundant descriptions and results which are sometimes hard to interpret. Here we introduce DBGWAS, an extended k-mer-based GWAS method producing interpretable genetic variants associated with distinct phenotypes. Relying on compacted De Bruijn graphs (cDBG), our method gathers cDBG nodes, identified by the association model, into subgraphs defined from their neighbourhood in the initial cDBG. DBGWAS is alignment-free and only requires a set of contigs and phenotypes. In particular, it does not require prior annotation or reference genomes. It produces subgraphs representing phenotype-associated genetic variants such as local polymorphisms and mobile genetic elements (MGE). It offers a graphical framework which helps interpret GWAS results. Importantly, it is also computationally efficient: experiments took one hour and a half on average. We validated our method using antibiotic resistance phenotypes for three bacterial species. DBGWAS recovered known resistance determinants such as mutations in core genes in Mycobacterium tuberculosis, and genes acquired by horizontal transfer in Staphylococcus aureus and Pseudomonas aeruginosa, along with their MGE context. It also enabled us to formulate new hypotheses involving genetic variants not yet described in the antibiotic resistance literature. An open-source tool implementing DBGWAS is available at https://gitlab.com/leoisl/dbgwas.
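The idea of a compacted De Bruijn graph mentioned in the abstract above can be illustrated with a small sketch. This is not the DBGWAS implementation: the function names (`kmers`, `build_dbg`, `compact`) and the example contigs are hypothetical, and the sketch assumes forward-strand k-mers only (no reverse complements) and an alphabet of A, C, G and T.

```python
def kmers(seq, k):
    """Return every overlapping k-mer of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_dbg(contigs, k):
    """Build a node-centric De Bruijn graph: nodes are k-mers, and an edge
    joins two nodes whenever they overlap by k-1 characters."""
    nodes = set()
    for contig in contigs:
        nodes.update(kmers(contig, k))
    succ = {n: [n[1:] + b for b in "ACGT" if n[1:] + b in nodes] for n in nodes}
    pred = {n: [b + n[:-1] for b in "ACGT" if b + n[:-1] in nodes] for n in nodes}
    return nodes, succ, pred


def compact(nodes, succ, pred):
    """Merge every non-branching path of k-mers into a single sequence (unitig)."""
    visited = set()
    unitigs = []

    def starts_unitig(n):
        # n continues an existing path only if it has exactly one predecessor
        # and that predecessor has n as its only successor.
        return not (len(pred[n]) == 1 and len(succ[pred[n][0]]) == 1)

    def walk(n):
        path, cur = n, n
        visited.add(n)
        while len(succ[cur]) == 1:
            nxt = succ[cur][0]
            if len(pred[nxt]) != 1 or nxt in visited:
                break
            path += nxt[-1]
            visited.add(nxt)
            cur = nxt
        return path

    for n in sorted(nodes):
        if n not in visited and starts_unitig(n):
            unitigs.append(walk(n))
    # Nodes left over at this point lie on isolated cycles with no start node.
    for n in sorted(nodes):
        if n not in visited:
            unitigs.append(walk(n))
    return unitigs


# Two hypothetical contigs that differ by a single base (G versus C).
nodes, succ, pred = build_dbg(["TTACGGATCC", "TTACGCATCC"], k=4)
unitigs = compact(nodes, succ, pred)
```

With these two contigs the compaction yields four unitigs: a shared prefix (`TTACG`), one branch per allele (`ACGGATC` and `ACGCATC`), and a shared suffix (`ATCC`). This bubble shape is the kind of local polymorphism that a graph neighbourhood can make visible.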

Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences

Grace A. Blackwell, Martin Hunt, Kerri M. Malone et al.|PLoS Biology|2021

Cited by 147 | Open Access

The open sharing of genomic data provides an incredibly rich resource for the study of bacterial evolution and function and even anthropogenic activities such as the widespread use of antimicrobials. However, these data consist of genomes assembled with different tools and levels of quality checking, and of large volumes of completely unprocessed raw sequence data. In both cases, considerable computational effort is required before biological questions can be addressed. Here, we assembled and characterised 661,405 bacterial genomes retrieved from the European Nucleotide Archive (ENA) in November of 2018 using a uniform standardised approach. Of these, 311,006 did not previously have an assembly. We produced a searchable COmpact Bit-sliced Signature (COBS) index, facilitating the easy interrogation of the entire dataset for a specific sequence (e.g., gene, mutation, or plasmid). Additional MinHash and pp-sketch indices support genome-wide comparisons and estimations of genomic distance. Combined, this resource will allow data to be easily subset and searched, phylogenetic relationships between genomes to be quickly elucidated, and hypotheses rapidly generated and tested. We believe that this combination of uniform processing and variety of search/filter functionalities will make this a resource of very wide utility. In terms of diversity within the data, a breakdown of the 639,981 high-quality genomes emphasised the uneven species composition of the ENA/public databases, with just 20 of the total 2,336 species making up 90% of the genomes. The overrepresented species tend to be acute/common human pathogens, aligning with research priorities at different levels from individual interests to funding bodies and national and global public health agencies.
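As a rough illustration of how a MinHash index can support the estimation of genomic distance, the sketch below estimates the Jaccard similarity of two k-mer sets from bottom-s sketches. It is a minimal sketch under stated assumptions, not the indexing code used for this resource: the function names, the choice of BLAKE2b as the hash, and the toy sequences are all hypothetical.

```python
import hashlib


def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an arbitrary choice here)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k, s):
    """Bottom-s sketch: the s smallest hash values among the sequence's k-mers."""
    return sorted(hash_kmer(x) for x in kmer_set(seq, k))[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate Jaccard similarity from two bottom-s sketches."""
    # The s smallest hashes of the merged sketches form a bottom-s sketch of the union.
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


# Toy sequences sharing 4 of 8 distinct 3-mers; s exceeds the set sizes,
# so the estimate coincides with the exact Jaccard similarity.
a = sketch("ACGTTGCA", k=3, s=100)
b = sketch("ACGTTGGA", k=3, s=100)
similarity = jaccard_estimate(a, b, s=100)
```

In practice s is much smaller than the number of k-mers in a genome, so the result is an approximation whose error shrinks as s grows.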

AllTheBacteria – all bacterial genomes assembled, available, and searchable

Martin Hunt, Leandro Lima, D.P. Anderson et al.|bioRxiv (Cold Spring Harbor Laboratory)|2024

Cited by 76 | Open Access

The bacterial sequence data publicly available via the global DNA archives is a vast potential source of information on the evolution of bacteria. However, most of this sequence data is unassembled or, where assembled, was produced with no consistent assembler or quality control. Although this data has great potential, these inconsistencies make it unsuitable for large-scale analyses, and inaccessible for most researchers to reuse. Therefore, in our previous effort, we released a uniformly assembled set of 661,405 genomes, consisting of all publicly available whole genome sequenced bacterial isolate data up to a cutoff of November 2018, enriched with various search indexes to make the data easier to sort and use. In this study, we first extend the dataset up to August 2024 with the same consistent assembly pipeline, more than tripling the number of genomes available. We also expand the scope of the dataset beyond genomes, as we begin a global collaborative project to generate annotations, species-specific analyses, evolutionary data, new search indices, and protein structural data. Our collaboration is therefore grass-roots, driven by the needs of different research communities within microbiology. In this paper, we describe the project as of release 2024-08, comprising 2,440,377 assemblies. All 2.4 million genomes have been uniformly reprocessed for quality criteria and to give taxonomic abundance estimates with respect to the GTDB phylogeny. We further enrich the dataset with sequence annotations from Bakta, antimicrobial resistance predictions from AMRFinderPlus, and AlphaFold2 protein structure predictions for the 17.7M hypothetical proteins. By applying an evolution-informed compression approach, the full set of genomes is just 130Gb: a reduction of approximately 23x compared to compressing individual assemblies.
To make the resource as accessible as possible, we also provide multiple search indexes, a method for alignment to the full dataset, and cloud-based access to all the genomes. The AllTheBacteria data (https://allthebacteria.org/) has already been independently used in multiple other analyses – our goal is to make this a self-sustaining community-driven resource, which increases the accessibility and reuse of bacterial genomes for a large range of purposes.

Innate immune response is differentially dysregulated between bipolar disease and schizophrenia

Angélica de Baumont, Mariana Maschietto, Leandro Lima et al.|Schizophrenia Research|2014

Cited by 72 | Open Access

Schizophrenia (SZ) and bipolar disorder (BD) are severe psychiatric conditions with a neurodevelopmental component. Genetic findings indicate the existence of an overlap in genetic susceptibility across the disorders. Also, imaging studies provide evidence for a shared neurobiological basis, contributing to a dimensional diagnostic approach. This study aimed to identify the molecular mechanisms that differentiate SZ and BD patients from healthy controls but also that distinguish both from healthy individuals. Comparison of gene expression profiling in post-mortem brains of both disorders and healthy controls (30 cases), followed by a further comparison between 29 BD and 29 SZ, revealed 28 differentially expressed genes. These genes were used in co-expression analyses that revealed the pairs CCR1/SERPINA1, CCR5/HCST, C1QA/CD68, CCR5/S100A11 and SERPINA1/TLR1 as presenting the most significant difference in co-expression between SZ and BD. Next, a protein-protein interaction (PPI) network using the 28 differentially expressed genes as seeds revealed CASP4, TYROBP, CCR1, SERPINA1, CCR5 and C1QA as having a central role in the manifestation of the diseases. Both co-expression and network topological analyses pointed to genes related to microglia functions. Based on these data, we suggest that differences between SZ and BD are due to genes involved with response to stimulus, defense response, immune system process and response to stress biological processes, all having a role in the communication of environmental factors to the cells and associated with microglia.

Pandora: nucleotide-resolution bacterial pan-genomics with reference graphs

Rachel Colquhoun, Michael B. Hall, Leandro Lima et al.|Genome Biology|2021

Cited by 65 | Open Access

We present pandora, a novel pan-genome graph structure and algorithms for identifying variants across the full bacterial pan-genome. As much bacterial adaptability hinges on the accessory genome, methods which analyze SNPs in just the core genome have unsatisfactory limitations. Pandora approximates a sequenced genome as a recombinant of references, detects novel variation and pan-genotypes multiple samples. Using a reference graph of 578 Escherichia coli genomes, we compare 20 diverse isolates. Pandora recovers more rare SNPs than single-reference-based tools, is significantly better than picking the closest RefSeq reference, and provides a stable framework for analyzing diverse samples without reference bias.

Top publications by citations