A fast and agnostic method for bacterial genome-wide association studies: Bridging the gap between k-mers and genetic events
Genome-wide association study (GWAS) methods applied to bacterial genomes have shown promising results for genetic marker discovery or detailed assessment of marker effect. Recently, alignment-free methods based on k-mer composition have proven their ability to explore the accessory genome. However, they lead to redundant descriptions and results which are sometimes hard to interpret. Here we introduce DBGWAS, an extended k-mer-based GWAS method producing interpretable genetic variants associated with distinct phenotypes. Relying on compacted De Bruijn graphs (cDBG), our method gathers cDBG nodes, identified by the association model, into subgraphs defined from their neighbourhood in the initial cDBG. DBGWAS is alignment-free and only requires a set of contigs and phenotypes. In particular, it does not require prior annotation or reference genomes. It produces subgraphs representing phenotype-associated genetic variants such as local polymorphisms and mobile genetic elements (MGE). It offers a graphical framework which helps interpret GWAS results. Importantly, it is also computationally efficient: experiments took an hour and a half on average. We validated our method using antibiotic resistance phenotypes for three bacterial species. DBGWAS recovered known resistance determinants such as mutations in core genes in Mycobacterium tuberculosis, and genes acquired by horizontal transfer in Staphylococcus aureus and Pseudomonas aeruginosa, along with their MGE context. It also enabled us to formulate new hypotheses involving genetic variants not yet described in the antibiotic resistance literature. An open-source tool implementing DBGWAS is available at https://gitlab.com/leoisl/dbgwas.
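To make the compacted De Bruijn graph idea concrete, here is a minimal Python sketch. It is not the DBGWAS implementation: it assumes toy single-contig genomes, a very small k, and it ignores reverse complements, which real tools typically handle. The function names (`kmers`, `compact`) and the toy sequences are hypothetical. The sketch builds the k-mer graph, merges non-branching paths into unitigs (the cDBG nodes), and records in which genomes each unitig is present.

```python
def kmers(seq, k):
    """Return the list of k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def successors(node, nodes):
    """Nodes whose (k-1)-prefix equals the (k-1)-suffix of `node`."""
    return [node[1:] + b for b in "ACGT" if node[1:] + b in nodes]

def predecessors(node, nodes):
    """Nodes whose (k-1)-suffix equals the (k-1)-prefix of `node`."""
    return [b + node[:-1] for b in "ACGT" if b + node[:-1] in nodes]

def compact(nodes):
    """Merge every non-branching path of the De Bruijn graph into one unitig."""
    def starts_unitig(n):
        # A unitig starts wherever the path cannot be extended backwards
        # without ambiguity.
        preds = predecessors(n, nodes)
        return len(preds) != 1 or len(successors(preds[0], nodes)) != 1

    def walk(start, visited):
        path, cur = start, start
        visited.add(start)
        while True:
            succ = successors(cur, nodes)
            if len(succ) != 1:
                break  # dead end or branching: the unitig stops here
            nxt = succ[0]
            if nxt in visited or len(predecessors(nxt, nodes)) != 1:
                break  # converging paths or a closed cycle
            path += nxt[-1]
            visited.add(nxt)
            cur = nxt
        return path

    visited, unitigs = set(), []
    ordered = sorted(nodes)
    for n in ordered:  # linear paths
        if starts_unitig(n):
            unitigs.append(walk(n, visited))
    for n in ordered:  # leftover isolated cycles
        if n not in visited:
            unitigs.append(walk(n, visited))
    return unitigs

# Toy input (hypothetical): two genomes that differ by a single substitution.
k = 3
genomes = ["ACGTTGCA", "ACGATGCA"]
nodes = {km for g in genomes for km in kmers(g, k)}
unitigs = compact(nodes)

# Presence/absence pattern of each unitig across genomes: the kind of
# variable an association model could then test against a phenotype.
presence = {
    u: [int(set(kmers(u, k)) <= set(kmers(g, k))) for g in genomes]
    for u in unitigs
}
```

On this toy input the substitution appears as a "bubble": two shared flanking unitigs and two alternative unitigs, each present in only one genome. Grouping significant nodes with their graph neighbourhood, as the abstract describes, would operate on a structure of this kind.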
Large-scale machine learning for metagenomics sequence classification
MOTIVATION: Metagenomics characterizes the taxonomic diversity of microbial communities by sequencing DNA directly from an environmental sample. One of the main challenges in metagenomics data analysis is the binning step, where each sequenced read is assigned to a taxonomic clade. Because of the large volume of metagenomics datasets, binning methods need fast and accurate algorithms that can operate with reasonable computing requirements. While standard alignment-based methods provide state-of-the-art performance, compositional approaches that assign a taxonomic class to a DNA read based on the k-mers it contains have the potential to provide faster solutions. RESULTS: We propose a new rank-flexible machine learning-based compositional approach for taxonomic assignment of metagenomics reads and show that it benefits from increasing the number of fragments sampled from reference genomes to tune its parameters, up to a coverage of about 10, and from increasing the k-mer size to about 12. Tuning the method involves training machine learning models on about 10^8 samples in 10^7 dimensions, which is out of reach of standard software but can be done efficiently with modern implementations for large-scale machine learning. The resulting method is competitive in terms of accuracy with well-established alignment and composition-based tools for problems involving a small to moderate number of candidate species and for reasonable amounts of sequencing errors. We show, however, that machine learning-based compositional approaches are still limited in their ability to deal with problems involving a greater number of species and are more sensitive to sequencing errors. We finally show that the new method outperforms the state-of-the-art in its ability to classify reads from species of lineage absent from the reference database and confirm that compositional approaches achieve faster prediction times, with a gain of 2-17 times with respect to the BWA-MEM short read mapper, depending on the number of candidate species and the level of sequencing noise. AVAILABILITY AND IMPLEMENTATION: Data and code are available at http://cbio.ensmp.fr/largescalemetagenomics CONTACT: pierre.mahe@biomerieux.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
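As an illustration of what a compositional representation looks like, the following Python sketch maps a read to a sparse, normalised k-mer profile and scores it with one linear model per taxon. It is a sketch under assumptions, not the authors' pipeline: the base-4 indexing, the one-model-per-taxon scoring, the weights, and the names `kmer_index`, `kmer_profile` and `classify` are all hypothetical. Note that a plain k-mer index space has 4^k entries, i.e. 16,777,216 for k = 12, which is of the same order as the 10^7 dimensions quoted in the abstract; whether the authors' feature space is exactly this one is an assumption.

```python
from collections import Counter

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_index(kmer):
    """Encode a k-mer as an integer in [0, 4**k) by reading it in base 4."""
    idx = 0
    for base in kmer:
        idx = idx * 4 + CODE[base]
    return idx

def kmer_profile(read, k):
    """Sparse normalised k-mer profile {index: relative frequency}.

    k-mers containing characters other than A, C, G, T are skipped.
    """
    counts = Counter()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if all(b in CODE for b in kmer):
            counts[kmer_index(kmer)] += 1
    total = sum(counts.values())
    return {idx: c / total for idx, c in counts.items()} if total else {}

def classify(read, k, weights_by_taxon):
    """Return the taxon whose linear model gives the highest score."""
    profile = kmer_profile(read, k)
    scores = {
        taxon: sum(w.get(idx, 0.0) * x for idx, x in profile.items())
        for taxon, w in weights_by_taxon.items()
    }
    return max(scores, key=scores.get)

# Toy usage with made-up weights (hypothetical, for illustration only).
profile = kmer_profile("ACGTACGT", 4)
weights = {"taxon_A": {kmer_index("ACGT"): 1.0},
           "taxon_B": {kmer_index("CGTA"): 1.0}}
predicted = classify("ACGTACGT", 4, weights)
```

Prediction only requires one pass over the read and a sparse dot product per candidate taxon, which is consistent with the speed advantage over alignment that the abstract reports.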
Predicting bacterial resistance from whole-genome sequences using k-mers and stability selection
Pierre Mahé, Maud Tournoud|BMC Bioinformatics|2018
BACKGROUND: Several studies demonstrated the feasibility of predicting bacterial antibiotic resistance phenotypes from whole-genome sequences, the prediction process usually amounting to detecting the presence of genes involved in antibiotic resistance mechanisms, or of specific mutations, previously identified from a training panel of strains, within these genes. We address the problem from the supervised statistical learning perspective, not relying on prior information about such resistance factors. We rely on a k-mer based genotyping scheme and a logistic regression model, thereby combining several k-mers into a probabilistic model. To identify a small yet predictive set of k-mers, we rely on the stability selection approach (Meinshausen et al., J R Stat Soc Ser B 72:417-73, 2010), which consists in penalizing logistic regression models with a Lasso penalty, coupled with extensive resampling procedures. RESULTS: Using public datasets, we applied the resulting classifiers to two bacterial species and achieved predictive performance equivalent to state of the art. The models are extremely sparse, involving 1 to 8 k-mers per antibiotic, hence are remarkably easy and fast to evaluate on new genomes (from raw reads to assemblies). CONCLUSION: Our proof of concept therefore demonstrates that stability selection is a powerful approach to investigate bacterial genotype-phenotype relationships.
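The combination of a Lasso-penalised logistic regression with resampling can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' procedure: it uses a single penalty value `lam` rather than a range of penalties, a basic proximal gradient solver, half-size subsamples, and a toy presence/absence matrix in which the first k-mer alone determines the phenotype. All names, parameter values and data are assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(v, t):
    """Proximal operator of the L1 penalty: shrink v towards zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def lasso_logistic(X, y, lam, steps=200, lr=0.5):
    """L1-penalised logistic regression fitted by proximal gradient descent.

    The intercept b is not penalised.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
        b -= lr * grad_b
        w = [soft_threshold(wj - lr * gj, lr * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b

def stability_selection(X, y, lam, n_resamples=30, threshold=0.8, seed=0):
    """Keep features with a non-zero weight in >= `threshold` of the resamples."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_resamples):
        idx = rng.sample(range(n), n // 2)  # subsample half of the genomes
        w, _ = lasso_logistic([X[i] for i in idx], [y[i] for i in idx], lam)
        for j, wj in enumerate(w):
            if wj != 0.0:
                counts[j] += 1
    freq = [c / n_resamples for c in counts]
    return [j for j, f in enumerate(freq) if f >= threshold], freq

# Toy data (hypothetical): 40 genomes, 4 k-mer presence/absence features.
# Feature 0 equals the phenotype; the others are balanced across phenotypes.
y = [i % 2 for i in range(40)]
X = [[i % 2, (i // 2) % 2, (i // 4) % 2, (i // 8) % 2] for i in range(40)]
selected, freq = stability_selection(X, y, lam=0.1)
```

Features that are only selected by chance in a few subsamples get a low selection frequency and are discarded, which is what keeps the final model small.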
Age at Which HIV Infection Can Be Detected in Infants
Maud Tournoud, René Écochard|JAIDS Journal of Acquired Immune Deficiency Syndromes|2006
OBJECTIVE: In St. Petersburg, Russia, we sought to describe the characteristics of active high-risk injection drug users (IDUs), to evaluate the associations between behavioral and demographic characteristics and HIV-1 infection, and to describe 3 discrete recruitment methods. METHODS: Active high-risk IDUs were recruited in 3 ways: through street outreach, at facilities serving IDUs, and by network-based chain referral. Recruits were screened, counseled, and tested for HIV-1. Sociodemographic and behavioral data were collected. HIV-1 prevalence was analyzed as a function of sociodemographic and behavioral variables. RESULTS: During the 10-month recruitment period, data from 900 participants were collected: median age was 24 years, and in the previous month, 96% used heroin and 75% shared needles with others. The baseline HIV prevalence was 30% (95% confidence interval [CI]: 27 to 33). Recruitment through social networks was the most productive strategy. HIV-positive individuals were younger, but none of the other sociodemographic or behavioral characteristics differed significantly by HIV status. CONCLUSIONS: The estimated HIV prevalence of 30% places St. Petersburg among the worst IDU-concentrated epidemics in Europe. Recruitment through network-based chain referral is a useful method for recruiting active IDUs. Sociodemographic and behavioral links to prevalent HIV infection remain to be elucidated.
A fast and agnostic method for bacterial genome-wide association studies: bridging the gap between kmers and genetic events
Magali Jaillard, Leandro Lima, Maud Tournoud et al.|bioRxiv (Cold Spring Harbor Laboratory)|2018
Motivation: Genome-wide association study (GWAS) methods applied to bacterial genomes have shown promising results for genetic marker discovery or fine-assessment of marker effect. Recently, alignment-free methods based on kmer composition have proven their ability to explore the accessory genome. However, they lead to redundant descriptions and results which are hard to interpret. Methods: Here, we introduce DBGWAS, an extended kmer-based GWAS method producing interpretable genetic variants associated with phenotypes. Relying on compacted De Bruijn graphs (cDBG), our method gathers cDBG nodes identified by the association model into subgraphs defined from their neighbourhood in the initial cDBG. DBGWAS is fast, alignment-free and only requires a set of contigs and phenotypes. It produces annotated subgraphs representing local polymorphisms as well as mobile genetic elements (MGE) and offers a graphical framework to interpret GWAS results. Results: We validated our method using antibiotic resistance phenotypes for three bacterial species. DBGWAS recovered known resistance determinants such as mutations in core genes in Mycobacterium tuberculosis and genes acquired by horizontal transfer in Staphylococcus aureus and Pseudomonas aeruginosa, along with their MGE context. It also enabled us to formulate new hypotheses involving genetic variants not yet described in the antibiotic resistance literature. Conclusion: Our novel method proved its efficiency to retrieve any type of phenotype-associated genetic variant without prior knowledge. All experiments were computed in less than two hours and produced a compact set of meaningful subgraphs, thereby outperforming other GWAS approaches and facilitating the interpretation of the results. Availability: Open-source tool available at https://gitlab.com/leoisl/dbgwas