The Inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novoWe present a method (the Inferelator) for deriving genome-wide transcriptional regulatory interactions, and apply the method to predict a large portion of the regulatory network of the archaeon Halobacterium NRC-1. The Inferelator uses regression and variable selection to identify transcriptional influences on genes based on the integration of genome annotation and expression data. The learned network successfully predicted Halobacterium's global expression under novel perturbations with predictive power similar to that seen over training data. Several specific regulatory predictions were experimentally tested and verified.
An Integrated Pipeline for de Novo Assembly of Microbial GenomesRemarkable advances in DNA sequencing technology have created a need for de novo genome assembly methods tailored to work with the new sequencing data types. Many such methods have been published in recent years, but assembling raw sequence data to obtain a draft genome has remained a complex, multi-step process, involving several stages of sequence data cleaning, error correction, assembly, and quality control. Successful application of these steps usually requires intimate knowledge of a diverse set of algorithms and software. We present an assembly pipeline called A5 (Andrew And Aaron's Awesome Assembly pipeline) that simplifies the entire genome assembly process by automating these stages, by integrating several previously published algorithms with new algorithms for quality control and automated assembly parameter selection. We demonstrate that A5 can produce assemblies of quality comparable to a leading assembly algorithm, SOAPdenovo, without any prior knowledge of the particular genome being assembled and without the extensive parameter tuning required by the other assembly algorithm. In particular, the assemblies produced by A5 exhibit 50% or more reduction in broken protein coding sequences relative to SOAPdenovo assemblies. The A5 pipeline can also assemble Illumina sequence data from libraries constructed by the Nextera (transposon-catalyzed) protocol, which have markedly different characteristics to mechanically sheared libraries. Finally, A5 has modest compute requirements, and can assemble a typical bacterial genome on current desktop or laptop computer hardware in under two hours, depending on depth of coverage.
A Predictive Model for Transcriptional Control of Physiology in a Free Living CellGenome sequence of <i>Haloarcula marismortui</i>: A halophilic archaeon from the Dead SeaWe report the complete sequence of the 4,274,642-bp genome of Haloarcula marismortui, a halophilic archaeal isolate from the Dead Sea. The genome is organized into nine circular replicons of varying G+C compositions ranging from 54% to 62%. Comparison of the genome architectures of Halobacterium sp. NRC-1 and H. marismortui suggests a common ancestor for the two organisms and a genome of significantly reduced size in the former. Both of these halophilic archaea use the same strategy of high surface negative charge of folded proteins as means to circumvent the salting-out phenomenon in a hypersaline cytoplasm. A multitiered annotation approach, including primary sequence similarities, protein family signatures, structure prediction, and a protein function association network, has assigned putative functions for at least 58% of the 4242 predicted proteins, a far larger number than is usually achieved in most newly sequenced microorganisms. Among these assigned functions were genes encoding six opsins, 19 MCP and/or HAMP domain signal transducers, and an unusually large number of environmental response regulators-nearly five times as many as those encoded in Halobacterium sp. NRC-1--suggesting H. marismortui is significantly more physiologically capable of exploiting diverse environments. In comparing the physiologies of the two halophilic archaea, in addition to the expected extensive similarity, we discovered several differences in their metabolic strategies and physiological responses such as distinct pathways for arginine breakdown in each halophile. Finally, as expected from the larger genome, H. marismortui encodes many more functions and seems to have fewer nutritional requirements for survival than does Halobacterium sp. NRC-1.
Evaluation of Algorithm Performance in ChIP-Seq Peak DetectionNext-generation DNA sequencing coupled with chromatin immunoprecipitation (ChIP-seq) is revolutionizing our ability to interrogate whole genome protein-DNA interactions. Identification of protein binding sites from ChIP-seq data has required novel computational tools, distinct from those used for the analysis of ChIP-Chip experiments. The growing popularity of ChIP-seq spurred the development of many different analytical programs (at last count, we noted 31 open source methods), each with some purported advantage. Given that the literature is dense and empirical benchmarking challenging, selecting an appropriate method for ChIP-seq analysis has become a daunting task. Herein we compare the performance of eleven different peak calling programs on common empirical, transcription factor datasets and measure their sensitivity, accuracy and usability. Our analysis provides an unbiased critical assessment of available technologies, and should assist researchers in choosing a suitable tool for handling ChIP-seq data.