Publishes on Genomics and Phylogenetic Studies, RNA and protein synthesis mechanisms, Machine Learning in Bioinformatics. 140 papers and 29.1k citations.
We have developed three computer programs for comparisons of protein and DNA sequences. They can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity. The FASTA program is a more sensitive derivative of the FASTP program, which can be used to search protein or DNA sequence data bases and can compare a protein sequence to a DNA sequence data base by translating the DNA data base as it is searched. FASTA includes an additional step in the calculation of the initial pairwise similarity score that allows multiple regions of similarity to be joined to increase the score of related sequences. The RDF2 program can be used to evaluate the significance of similarity scores using a shuffling method that preserves local sequence composition. The LFASTA program can display all the regions of local similarity between two sequences with scores greater than a threshold, using the same scoring parameters and a similar alignment algorithm; these local similarities can be displayed as a "graphic matrix" plot or as individual alignments. In addition, these programs have been generalized to allow comparison of DNA or protein sequences based on a variety of alternative scoring matrices.
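The shuffling approach to significance described for RDF2 can be illustrated with a small sketch. This is not the RDF2 code: the Smith-Waterman scorer with fixed match, mismatch, and gap values, the window size of 10, and the function names `local_score`, `window_shuffle`, and `shuffle_z_score` are all assumptions made for illustration. The idea is to shuffle one sequence within short windows, so that local composition is preserved, and compare the observed score with the distribution of shuffled scores.

```python
import random
import statistics


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman with a linear gap penalty).

    The scoring values are illustrative, not those of any FASTA program.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def window_shuffle(seq, window, rng):
    """Shuffle residues inside consecutive windows, keeping local composition."""
    out = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)
        out.extend(chunk)
    return "".join(out)


def shuffle_z_score(query, target, n_shuffles=100, window=10, seed=0):
    """Compare the observed score with scores against window-shuffled targets."""
    rng = random.Random(seed)
    observed = local_score(query, target)
    shuffled = [
        local_score(query, window_shuffle(target, window, rng))
        for _ in range(n_shuffles)
    ]
    mean = statistics.mean(shuffled)
    sd = statistics.pstdev(shuffled)
    if sd > 0:
        z = (observed - mean) / sd
    else:
        # Degenerate case: every shuffle gave the same score.
        z = 0.0 if observed == mean else float("inf")
    return observed, mean, sd, z
```

A call such as `shuffle_z_score(query, target)` returns the observed score together with the mean, standard deviation, and z-score from the shuffled comparisons; a large z-score suggests the similarity is not explained by local composition alone.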
An algorithm was developed which facilitates the search for similarities between newly determined amino acid sequences and sequences already available in databases. Because of the algorithm's efficiency on many microcomputers, sensitive protein database searches may now become a routine procedure for molecular biologists. The method efficiently identifies regions of similar sequence and then scores the aligned identical and differing residues in those regions by means of an amino acid replaceability matrix. This matrix increases sensitivity by giving high scores to those amino acid replacements that occur frequently in evolution. The algorithm has been implemented in a computer program designed to search protein databases very rapidly. For example, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC).
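The two stages described in this abstract, rapid location of similar regions followed by rescoring with a replaceability matrix, can be sketched roughly as follows. This is a simplified illustration rather than the published program: the word length `ktup=2`, the diagonal-counting heuristic, the residue groups in `toy_score`, and all function names are assumptions. A real search would use an evolution-derived matrix in place of the toy scores.

```python
from collections import defaultdict

# Hypothetical residue groups standing in for a real replaceability matrix.
CONSERVATIVE_GROUPS = ("ILVM", "FYW", "KRH", "DE", "ST", "NQ")


def toy_score(a, b):
    """Toy pair score: identity 5, same group 2, anything else -2."""
    if a == b:
        return 5
    if any(a in g and b in g for g in CONSERVATIVE_GROUPS):
        return 2
    return -2


def build_ktup_index(query, ktup=2):
    """Map every k-residue word in the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    return index


def best_diagonal(query, subject, ktup=2):
    """Find the diagonal (query pos minus subject pos) with most word hits."""
    index = build_ktup_index(query, ktup)
    hits = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            hits[i - j] += 1
    if not hits:
        return None, 0
    # Break ties in favour of diagonals closer to the main diagonal.
    diag = max(hits, key=lambda d: (hits[d], -abs(d)))
    return diag, hits[diag]


def rescore_diagonal(query, subject, diag, score_pair=toy_score):
    """Best-scoring ungapped segment along one diagonal."""
    i, j = (diag, 0) if diag >= 0 else (0, -diag)
    best = running = 0
    while i < len(query) and j < len(subject):
        running = max(0, running + score_pair(query[i], subject[j]))
        best = max(best, running)
        i += 1
        j += 1
    return best
```

The word lookup keeps the first stage close to linear in the length of the library sequence, and only the most promising diagonal is rescored with the more expensive pairwise scores, which is the kind of trade-off that makes a search of this sort fast on small machines.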
Sequence similarity searching, typically with BLAST, is the most widely used and most reliable strategy for characterizing newly determined sequences. Sequence similarity searches can identify "homologous" proteins or genes by detecting excess similarity: statistically significant similarity that reflects common ancestry. This unit provides an overview of the inference of homology from significant similarity, and introduces other units in this chapter that provide more details on effective strategies for identifying homologs.
Glutathione transferase (GT; EC 2.5.1.18) mRNA levels were measured in human liver samples by using mouse and human cDNA clones that encode class-mu and class-alpha GT. Although all the RNA samples examined contained class-alpha GT mRNA, class-mu GT mRNA was found only in individuals whose peripheral leukocytes expressed GT activity on the substrate trans-stilbene oxide. The mouse class-mu cDNA clone was used to identify a human class-mu GT cDNA clone, lambda GTH411. The amino acid sequence of the GT encoded by lambda GTH411 is identical with the 23 residues determined for the human liver GT-mu isoenzyme and shares 76-81% identity with mouse and rat class-mu GT isoenzymes. The mouse and human class-mu GT cDNA inserts hybridize with multiple BamHI and EcoRI restriction fragments in the human genome. One of these hybridizing fragments is missing in the DNA of individuals who lack GT activity on trans-stilbene oxide. Hybridizations with nonoverlapping subfragments of lambda GTH411 suggest that there are at least three class-mu genes in the human genome. One of these genes appears to be deleted in individuals lacking GT activity on trans-stilbene oxide.