Fast and accurate de novo genome assembly from long uncorrected readsThe assembly of long reads from Pacific Biosciences and Oxford Nanopore Technologies typically requires resource-intensive error-correction and consensus-generation steps to obtain high-quality assemblies. We show that the error-correction step can be omitted and that high-quality consensus sequences can be generated efficiently with a SIMD-accelerated, partial-order alignment-based, stand-alone consensus module called Racon. Based on tests with PacBio and Oxford Nanopore data sets, we show that Racon coupled with miniasm enables consensus genomes with similar or better quality than state-of-the-art methods while being an order of magnitude faster.
Comprehensive Characterization of Cancer Driver Genes and MutationsIdentifying molecular cancer drivers is critical for precision oncology. Multiple advanced algorithms to identify drivers now exist, but systematic attempts to combine and optimize them on large datasets are few. We report a PanCancer and PanSoftware analysis spanning 9,423 tumor exomes (comprising all 33 of The Cancer Genome Atlas projects) and using 26 computational tools to catalog driver genes and mutations. We identify 299 driver genes with implications regarding their anatomical sites and cancer/cell types. Sequence- and structure-based analyses identified >3,400 putative missense driver mutations supported by multiple lines of evidence. Experimental validation confirmed 60%-85% of predicted mutations as likely drivers. We found that >300 MSI tumors are associated with high PD-1/PD-L1, and 57% of tumors analyzed harbor putative clinically actionable events. Our study represents the most comprehensive discovery of cancer genes and mutations to date and will serve as a blueprint for future biological and clinical endeavors.
Statistical Methods for Detecting Differentially Abundant Features in Clinical Metagenomic SamplesNumerous studies are currently underway to characterize the microbial communities inhabiting our world. These studies aim to dramatically expand our understanding of the microbial biosphere and, more importantly, hope to reveal the secrets of the complex symbiotic relationship between us and our commensal bacterial microflora. An important prerequisite for such discoveries are computational tools that are able to rapidly and accurately compare large datasets generated from complex bacterial communities to identify features that distinguish them.We present a statistical method for comparing clinical metagenomic samples from two treatment populations on the basis of count data (e.g. as obtained through sequencing) to detect differentially abundant features. Our method, Metastats, employs the false discovery rate to improve specificity in high-complexity environments, and separately handles sparsely-sampled features using Fisher's exact test. Under a variety of simulations, we show that Metastats performs well compared to previously used methods, and significantly outperforms other methods for features with sparse counts. We demonstrate the utility of our method on several datasets including a 16S rRNA survey of obese and lean human gut microbiomes, COG functional profiles of infant and mature gut microbiomes, and bacterial and viral metabolic subsystem data inferred from random sequencing of 85 metagenomes. The application of our method to the obesity dataset reveals differences between obese and lean subjects not reported in the original study. For the COG and subsystem datasets, we provide the first statistically rigorous assessment of the differences between these populations. The methods described in this paper are the first to address clinical metagenomic datasets comprising samples from multiple subjects. Our methods are robust across datasets of varied complexity and sampling level. While designed for metagenomic applications, our software can also be applied to digital gene expression studies (e.g. SAGE). A web server implementation of our methods and freely available source code can be found at http://metastats.cbcb.umd.edu/.
LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasetsAndreas Wilm, Pauline Aw, Denis Bertrand et al.|Nucleic Acids Research|2012 The study of cell-population heterogeneity in a range of biological systems, from viruses to bacterial isolates to tumor samples, has been transformed by recent advances in sequencing throughput. While the high-coverage afforded can be used, in principle, to identify very rare variants in a population, existing ad hoc approaches frequently fail to distinguish true variants from sequencing errors. We report a method (LoFreq) that models sequencing run-specific error rates to accurately call variants occurring in <0.05% of a population. Using simulated and real datasets (viral, bacterial and human), we show that LoFreq has near-perfect specificity, with significantly improved sensitivity compared with existing methods and can efficiently analyze deep Illumina sequencing datasets without resorting to approximations or heuristics. We also present experimental validation for LoFreq on two different platforms (Fluidigm and Sequenom) and its application to call rare somatic variants from exome sequencing datasets for gastric cancer. Source code and executables for LoFreq are freely available at http://sourceforge.net/projects/lofreq/.
The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus)In the early 1990s an outbreak of papaya ringspot virus (PRSV) in the papaya groves in the Puna district of Hawaii caused severe damage to an important crop. Since then, the planting of two transgenic cultivars resistant to the virus — called 'SunUp' and 'Rainbow' — has helped to maintain yields. SunUp is a transgenic red-fleshed fruit that expresses the coat protein gene of a mild mutant of PRSV, conferring resistance via post-transcriptional gene silencing. Rainbow is a yellow-fleshed (and therefore more popular) F1 hybrid bred from SunUp. Now the draft genome sequence of the SunUp strain of papaya has been determined — a first for a commercial virus-resistant transgenic fruit tree. Comparison of this plant genome to those of Arabidopsis and others sheds light on the evolution of qualities such as biosynthesis, starch deposition, control of photosynthesis and pathways for creating the volatile compounds that contribute to the characteristic flavour of papaya. On the cover, the disease-free transgenic Rainbow and the severely infected, stunted and dying non-transgenic Sunrise grow in adjoining plots. Researchers from Hawaii and an international consortium have produced a draft genome assembly for 'SunUp', the first commercial virus-resistant transgenic fruit tree. Comparison of this plant genome to those of Arabidopsis and others sheds light on evolution of characteristics such as biosynthesis, starch deposition, control of photosynthesis and pathways for creating volatile compounds. Papaya, a fruit crop cultivated in tropical and subtropical regions, is known for its nutritional benefits and medicinal applications. Here we report a 3× draft genome sequence of ‘SunUp’ papaya, the first commercial virus-resistant transgenic fruit tree1 to be sequenced. The papaya genome is three times the size of the Arabidopsis genome, but contains fewer genes, including significantly fewer disease-resistance gene analogues. Comparison of the five sequenced genomes suggests a minimal angiosperm gene set of 13,311. A lack of recent genome duplication, atypical of other angiosperm genomes sequenced so far2,3,4,5, may account for the smaller papaya gene number in most functional groups. Nonetheless, striking amplifications in gene number within particular functional groups suggest roles in the evolution of tree-like habit, deposition and remobilization of starch reserves, attraction of seed dispersal agents, and adaptation to tropical daylengths. Transgenesis at three locations is closely associated with chloroplast insertions into the nuclear genome, and with topoisomerase I recognition sites. Papaya offers numerous advantages as a system for fruit-tree functional genomics, and this draft genome sequence provides the foundation for revealing the basis of Carica’s distinguishing morpho-physiological, medicinal and nutritional properties.