Protein Interactions

Charlotte M. Deane (University of California, Los Angeles), Łukasz Salwiński (University of California, Los Angeles), Ioannis Xénarios (University of California, Los Angeles), David Eisenberg (Howard Hughes Medical Institute)
Molecular & Cellular Proteomics
May 1, 2002

Abstract

High throughput methods for detecting protein interactions require assessment of their accuracy. We present two forms of computational assessment. The first method is the expression profile reliability (EPR) index. The EPR index estimates the biologically relevant fraction of protein interactions detected in a high throughput screen. It does so by comparing the RNA expression profiles for the proteins whose interactions are found in the screen with expression profiles for known interacting and non-interacting pairs of proteins. The second form of assessment is the paralogous verification method (PVM). This method judges an interaction likely if the putatively interacting pair has paralogs that also interact. In contrast to the EPR index, which evaluates datasets of interactions, PVM scores individual interactions. On a test set, PVM identifies correctly 40% of true interactions with a false positive rate of ∼1%. EPR and PVM were applied to the Database of Interacting Proteins (DIP), a large and diverse collection of protein-protein interactions that contains over 8000 Saccharomyces cerevisiae pairwise protein interactions. Using these two methods, we estimate that ∼50% of them are reliable, and with the aid of PVM we identify confidently 3003 of them. Web servers for both the PVM and EPR methods are available on the DIP website (dip.doe-mbi.ucla.edu/Services.cgi).

One thrust of post-genomic biology is the study of the networks of protein interactions that control the lives of cells and organisms. These networks have been reconstructed by detecting pairwise interactions of proteins. To store and manage this information in a systematic way, databases have been created (1. Bader G.D. Donaldson I. Wolting C. Ouellette B.F. Pawson T. Hogue C.W. BIND - The biomolecular interaction network database. Nucleic Acids Res. 2001; 29: 242-245; 2. Xenarios I. Rice D.W. Salwinski L. Baron M.K. Marcotte E.M. Eisenberg D. DIP: the database of interacting proteins. Nucleic Acids Res. 2000; 28: 289-291). These databases provide centralized access to curated experimental data. They have also emerged as resources for the investigation of the large scale properties of biological networks, in particular their functional and evolutionary aspects (3. Jeong H. Mason S.P. Barabási A.L. Oltvai Z.N. Lethality and centrality in protein networks. Nature. 2001; 411: 41-42).
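The EPR idea summarized in the abstract can be illustrated with a small sketch. It assumes that the distribution of expression distances observed for a screen is a mixture of a reference distribution for interacting pairs and one for non-interacting pairs, and that the mixing fraction is estimated by least squares. The function name `epr_index`, the use of pre-binned histograms, and the closed-form linear fit are illustrative assumptions, not the procedure used by the authors or by the DIP web server.

```python
def epr_index(observed, interacting, noninteracting):
    """Estimate the fraction alpha of true interactions in a screen.

    Assumed model (hypothetical sketch): for every histogram bin,
        observed ~= alpha * interacting + (1 - alpha) * noninteracting
    All three arguments are normalized histograms over the same bins.
    """
    if not (len(observed) == len(interacting) == len(noninteracting)):
        raise ValueError("histograms must share the same bins")

    # Difference between the two reference distributions, bin by bin.
    diff = [i - n for i, n in zip(interacting, noninteracting)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        raise ValueError("reference distributions are identical")

    # Closed-form least-squares solution for the single parameter alpha.
    numer = sum((o - n) * d
                for o, n, d in zip(observed, noninteracting, diff))

    # A fraction must lie between 0 and 1, so clip the raw estimate.
    return min(1.0, max(0.0, numer / denom))


if __name__ == "__main__":
    # Toy reference histograms (made-up numbers, three bins each).
    interacting = [0.5, 0.3, 0.2]
    noninteracting = [0.1, 0.3, 0.6]
    # Synthetic "screen" built as a 30% / 70% mixture of the references.
    screen = [0.3 * i + 0.7 * n
              for i, n in zip(interacting, noninteracting)]
    print(epr_index(screen, interacting, noninteracting))  # close to 0.3
```

Because the estimate comes from whole distributions, it characterizes a dataset rather than any single pair, which matches the contrast with PVM drawn in the abstract.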
In this paper we explore the usefulness of the Database of Interacting Proteins (DIP)¹ for assessing the reliability of protein interaction measurements. Until two years ago, when high throughput screens of protein interaction were developed, the information within interaction databases was collected from the small scale screens in hundreds of individual research papers. The biological relevance of each interaction had often been investigated thoroughly, sometimes with a repertoire of experimental techniques and often with multiple controls (4. Xenarios I. Eisenberg D. Protein interaction databases. Curr. Opin. Biotechnol. 2001; 12: 334-339; 5. Golemis E.A. Protein-Protein Interactions: A Molecular Cloning Manual. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York 2001). These independent, often repeated observations, coupled with controls and curation in the peer-review process, enhanced the reliability of the published data.

¹ The abbreviations used are: DIP, Database of Interacting Proteins; EPR, expression profile reliability; IST, interaction sequence tag; PVM, paralogous verification method; Y2H, yeast 2 hybrid; GY2H, genome-wide Y2H; YPD, Yeast Protein Database.

In the past two years, high throughput, genome-wide detections of protein interactions by yeast two hybrid (Y2H) and mass spectrometric analysis of protein complexes have tremendously increased the experimental coverage. The new methods can rapidly generate more information than was collected by traditional means in more than a decade (6. Fromont-Racine M. Mayes A.E. Brunet-Simon A. Rain J.C. Colley A. Dix I. Decourty L. Joly N. Richard F. Beggs J.D. Legrain P. Genome-wide protein interaction screens reveal functional networks involving Sm-like proteins. Yeast. 2000; 17: 95-110 – 10. Newman J.R. Wolf E. Kim P.S. From the cover: A computationally directed screen identifying interacting coiled coils from Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. U. S. A. 2000; 97: 13203-13208). However, the large size of such datasets makes it impractical to verify individual interactions by the same methods used previously in small scale experiments (11. Walhout A.J. Vidal M. High-throughput yeast two-hybrid assays for large-scale protein interaction mapping. Methods. 2001; 24: 297-306; 12. Hazbun T.R. Fields S. Networking proteins in yeast. Proc. Natl. Acad. Sci. U. S. A. 2001; 98: 4277-4278). The question then arises: do these new, high throughput methods of detecting interactions provide information as reliable as the small scale experiments? Verifying the interactions from these high throughput methods is vital (11 – 15. Schwikowski B. Uetz P. Fields S. A network of protein-protein interactions in yeast. Nat. Biotechnol. 2000; 18: 1257-1261), because only then can the large and small scale data be combined into one self-consistent interaction network useful for further studies.

To address these issues we have analyzed the complete set of 8063 protein-protein interactions identified in yeast, Saccharomyces cerevisiae, that are described in DIP as of November 2001. We demonstrate that the subset of interactions obtained through the high throughput Y2H screens differs in several respects from the subset based only on the small scale or multiple, redundant experiments.
Most notably, analysis of the coexpression profiles of the interacting partners leads to the conclusion that, overall, only about 30% of the high throughput dataset possesses the same characteristic mRNA expression features as the dataset based on the small scale experiments. To further pinpoint the interactions within the dataset that are likely to be correct, interactions were analyzed between protein pairs that are paralogs of the tested proteins. This resulted in the identification of ∼1400 interactions likely to be correct. A reliable, self-consistent set of interactions totaling ∼3000 is extracted when these ∼1400 are combined with the small experiment datasets and with interactions verified by more than one experiment.

The protein-protein interaction datasets analyzed in this work are listed in Table I. They are all, except for the RND sets, subsets of the S. cerevisiae protein-protein interaction network (DIP-YEAST; 8063 distinct interactions) extracted from the DIP database on November 19, 2001. The INT set contains all the interactions determined by one or more small scale experiments (a small scale experiment being defined as one described in a published article listing no more than 100 distinct protein-protein interactions), whereas sets EC2 and EC3 contain interactions determined by at least two or three independent experiments, respectively. The GY2H set contains all the interactions reported in high throughput protein-protein interaction screens (6. Fromont-Racine M. Mayes A.E. Brunet-Simon A. Rain J.C. Colley A. Dix I. Decourty L. Joly N. Richard F. Beggs J.D. Legrain P. Genome-wide protein interaction screens reveal functional networks involving Sm-like proteins. Yeast. 2000; 17: 95-110 – 8. Uetz P. Giot L. Cagney G. Mansfield T.A. Judson R.S. Knight J.R. Lockshon D. Narayan V. Srinivasan M. Pochart P. Qureshi-Emili A. Li Y. Godwin B. Conover D. Kalbfleisch T. Vijayadamodar G. Yang M. Johnston M. Fields S. Rothberg J.M. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature. 2000; 403: 623-627; 16. Ito T. Tashiro K. Muta S. Ozawa R. Chiba T. Nishizawa M. Yamamoto K. Kuhara S. Sakaki Y. Toward a protein-protein interaction map of the budding yeast: a comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc. Natl. Acad. Sci. U. S. A. 2000; 97: 1143-1147; 17. Fromont-Racine M. Rain J.C. Legrain P. Toward a functional analysis of the yeast genome through exhaustive two-hybrid screens. Nat. Genet. 1997; 16: 277-282), and GY2H′ is a subset of GY2H that excludes interactions occurring only in the ITO1 set. The ITO1, ITO2, … ITO8 sets are subsets of GY2H that contain all the interactions reported by Ito et al. (7. Ito T. Chiba T. Ozawa R. Yoshida M. Hattori M. Sakaki Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. U. S. A. 2001; 98: 4569-4574) as identified by at least 1, 2, … 8 interaction sequence tags (ISTs) in a genome-wide Y2H protein-protein interaction screen. These datasets (ITO1, etc.) contain fewer interactions than the numbers reported in the original paper because of some redundancy of the original dataset (interactions were reported in both directions, P-P′ and P′-P). Also, some of the open reading frames could not be traced unambiguously to a unique SWISS-PROT, PIR (Protein Information Resource), or interaction of data reported by and interactions from each of the datasets by PVM as with reported by and G. D. expression in the of yeast cells to 2000; interactions from each of the datasets by PVM as correct. in a new The sets were by protein-protein pairs from the yeast genome that are not present in They are by the non-interacting pairs than of the true interactions interacting partners when by a of two the of interacting partners for each protein within the S.
cerevisiae genome in the Uetz P. an of protein 2001; Scholar, A.J. R. N. Vidal M. Protein interaction in C. proteins in 2000; Scholar). Proteins have been to and in the Yeast Protein Database P. P. C. M. and of the an for protein Acids Res. 2001; 29: Scholar, J.D. P. C. C. M. The yeast database and database comprehensive resources for the and of protein Acids Res. 2000; 28: Scholar). is as the biological involving the protein and as the or of the The are and a large of proteins are with more than one or The functional and if one were collected for all the S. cerevisiae open reading frames from the We a if the two interacting proteins one or more in a to et al. B. Uetz P. Fields S. A network of protein-protein interactions in yeast.Nat. Biotechnol. 2000; 18: 1257-1261Google Scholar). The that one could two proteins to a was all possible pairs of proteins in a The expression profile reliability index (EPR) was extracted from the interaction datasets by the least by 2 of the method E. C. S. A. S. A. D. for and Scholar) and a of the to only with at least were in the was of the for the individual in each of the The of the was a with datasets as described in Press, Scholar). The expression between proteins A and was to is a of the expression of protein the as reported by and G. D. expression in the of yeast cells to 2000; Scholar). The is over a set of distinct the data by et al. G. D. expression in the of yeast cells to 2000; Scholar). The paralogous verification method interacting pairs the of paralogous interactions. were collected by and a new of protein database Acids Res. 1997; Scholar). open reading of S. cerevisiae as a sequence the database of S. 
The were the and the to in the To at the of were and the and were The set of known protein-protein interactions in budding yeast as in DIP on November contains distinct interactions between proteins of these interactions were detected by small scale experiments described in more than research The is from independent high throughput Y2H screens of the datasets that the of detected interactions obtained in the as as between of these datasets and the set from the small scale interaction is This by T.R. Fields S. Networking proteins in yeast.Proc. Natl. Acad. Sci. U. S. A. 2001; 98: 4277-4278Google Scholar, P. J.M. interaction a Genet. 2001; 17: Scholar, Uetz P. an of protein 2001; Scholar, B. Uetz P. Fields S. A network of protein-protein interactions in yeast.Nat. Biotechnol. 2000; 18: 1257-1261Google Scholar, P. L. Genome-wide protein interaction two-hybrid 2000; Scholar), is the of the present are possible for the of the of yeast in the of and the of in experiments. high throughput protein-protein interaction such as Y2H methods, the of identifying partners by protein-protein interactions. the partners that can that are in to one in the because of distinct or expression at the these can to the of false (interactions that be detected the or false interactions biological we on the two identifying the fraction of false within the high throughput datasets and identifying true We this by the properties of these datasets with of the set of biologically relevant interactions extracted from the DIP The of this is that, by the of size and this dataset the features of biologically relevant protein-protein interactions and can be used to the of interaction We by of functional we can between two interacting S. cerevisiae proteins in this we the interacting pairs into all EC3 and EC2 are datasets with than or to three or two the respectively, and INT is the set of interactions in at least one small scale experiment. A of the subsets is 2 the of and as by the P. P. C. M. 
and of the an for protein Acids Res. 2001; 29: Scholar, J.D. P. C. C. M. The yeast database and database comprehensive resources for the and of protein Acids Res. 2000; 28: Scholar) for the The the It that if we two proteins at from the set with known the of of pairs in The between the and this is large in all These can be with of We that of the pairs one or more with the found by Fields and B. Uetz P. Fields S. A network of protein-protein interactions in yeast.Nat. Biotechnol. 2000; 18: 1257-1261Google Scholar) in a analysis of published interactions of S. cerevisiae proteins. of was also tested in a by Vidal and H. Vidal M. between and data from Saccharomyces Genet. 2001; 29: Scholar) interacting pairs were found within the same expression These expression are to to functional H. Vidal M. between and data from Saccharomyces Genet. 2001; 29: Scholar, the Opin. Genet. Scholar, E.A. expression and 2000; Scholar, R. D. M. expression data with protein-protein Res. 12: Scholar). with the of the functional based on the expression was than in 2 that the INT set and the EC2 and EC3 sets than the set. The of of within the data could in because of the large of interactions between and proteins B. Uetz P. Fields S. A network of protein-protein interactions in yeast.Nat. Biotechnol. 2000; 18: 1257-1261Google these are as are of proteins between these through the two for the of 2000; Scholar). The INT dataset because of a between functional and protein interactions described in the small scale studies. However, if we pairs of proteins from as to the set, a of is This to a of multiple and possible in both It also be that the in these have been from proteins experimental and as such are to However, when we the for the set of proteins are to the described or is the least of the three This is not as an interaction between two proteins does not that an it that are in a functional the between functional could be biologically et al. B. Uetz P. Fields S. 
A network of protein-protein interactions in yeast.Nat. Biotechnol. 2000; 18: 1257-1261Google Scholar) found that are a large of interactions between the of protein and protein in the assessment of an individual of or not be be to the between the of the proteins. The of and within the dataset than the and EC3 datasets that small scale more reliable than high throughput this for which the reliability of a dataset and the reliability of interaction. we two computational methods that mRNA expression data and sequence respectively, to reliability of the high throughput datasets and to identify protein-protein interactions that are likely to be correct. of the two methods is in It has been that to be in a H. Vidal M. between and data from Saccharomyces Genet. 2001; 29: E.A. expression and 2000; Scholar). we this to the of datasets of interacting proteins. we a between the expression of the for the of an interacting pair we a dataset of protein interactions by the fraction of pairs each of This is the of the EPR index method on the of the of expression for several sets of protein interaction data. The the for sets of protein that it is the with the The INT is for the small scale dataset and is to have the and are as from a We the INT set to be a set of interacting proteins and the set to be of non-interacting proteins. On the of the we a that the of a dataset of protein interactions. To so we that the profile of the GY2H set to be between the interacting and non-interacting The this that the Y2H experiments in two of protein-protein the true positive relevant interactions) from the interacting and false from the non-interacting The of expression obtained for an experimental set, is then described by and are the expression for the interacting and non-interacting protein and the expression profile reliability index, to the fraction of the true in the experimental The can be obtained as the of expression for all protein pairs within a because the genome is of size for S. 
and be by the non-interacting The can be by the of the expression for all the reliable interactions present in The to be as the set of interactions described in is in the of obtained in a that not on the expression of the interacting it can be as a of the protein-protein interaction set, with to the expression of the interacting proteins. A of the GY2H dataset to the described by to the The is as for the GY2H that of the reported pairs in this set in false To verify that the of the experimental subsets of the GY2H to of were as reported by Ito et al. (7.Ito T. Chiba T. Ozawa R. Yoshida M. Hattori M. Sakaki Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome.Proc. Natl. Acad. Sci. U. S. A. 2001; 98: 4569-4574Google Scholar). Ito et al. (7.Ito T. Chiba T. Ozawa R. Yoshida M. Hattori M. Sakaki Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome.Proc. Natl. Acad. Sci. U. S. A. 2001; 98: 4569-4574Google Scholar) created these sets by identifying interactions with at least … 8 as ITO1 to the of the as by with increased This that the EPR index can be used to the of large scale protein-protein interaction and to the fraction of pairs that is However, the on rapidly with dataset the of EPR in to large in a new The high throughput Y2H screens can be by the least reliable protein pairs that only in the ITO1 set. In the reliability of the GY2H′ set to as by the EPR index However, this reliability at the of size by The reliability of a protein interaction can be by the of paralogous interactions. The for this is that if two proteins are paralogs then the proteins that are to with are often also This is to the of by Vidal and A.J. Vidal M. Yeast two-hybrid and protein interaction for yeast and 2000; 17: Scholar). 
To a interaction between a pair of and all the paralogs of and are and the of interactions in DIP between these two the interaction to are This is the PVM To the of this method to identify true interactions and false interactions, the on datasets of interacting proteins be with the on datasets of non-interacting proteins. We the datasets of non-interacting proteins computationally because of the in such a set from within the The three sets of protein interactions described were used as the non-interacting these sets not be of interactions, the be small sets of protein interactions were used as true interaction sets, the and EC3 sets The EC2 and EC3 sets are than the INT set and can be used by PVM are not as datasets for EPR, because the in is large for such small datasets The of the PVM method can be by a known as a characteristic in It that a that false is to of the true interactions. the method high a This of in the of paralogs of some proteins. interactions if the and EC2 sets are to only pairs at least one of each of the pairs has more than one an in of is The is not by the of paralogs is because of both the of experimental data in a of a of paralogous interactions. is possible of in PVM because of the identification of interactions in Y2H experiments. if and are paralogs of and a it is possible that in only the and interactions However, Y2H interactions between and and and as as the true interactions and The rate of PVM of that this is However, as with computational the be in the of data such as or of the proteins. The characteristic also that the of the is that a than a high that an interaction if a reliability interaction as Y2H R. A. H. a in Res. 2001; has paralogs a of it can be or by for a paralogous interaction. It is that PVM can only be used in the proteins in the interaction have In S. cerevisiae of the proteins have paralogs This of paralogs to be et al. analysis of functional and evolutionary Natl. Acad. Sci. U. S. A. 
Scholar) found that of the genome has and of the proteins within the database T.A. B. The new in of proteins from complete Acids Res. 2001; 29: Scholar) are found to have EPR can the of an interaction dataset the of individual interactions. that, the in the expression of interacting and non-interacting sets as by the in the mRNA can over a large of and with one it is not possible to the of the expression profiles as a of protein-protein interactions of the profiles an of the of biologically relevant interactions within a set. PVM, on the is to the of individual protein-protein interactions. However, it can also estimate the of biologically relevant interactions within a This is based on the that in the subsets of and INT with ∼50% of the interactions are identified by PVM PVM identify ∼50% of the biologically relevant interactions within The of true interactions within a set be as the by In the set only of the interactions that could the of true interactions is of the subset with This that of interactions are an rate for the of This with the EPR of in Table The of PVM to identify the true interactions within a dataset means that it can also be used to the of a by means of the of identified interactions. The Ito et al. (7.Ito T. Chiba T. Ozawa R. Yoshida M. Hattori M. Sakaki Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome.Proc. Natl. Acad. Sci. U. S. A. 2001; 98: 4569-4574Google Scholar) subsets described and were PVM, and it was found that as the of independent of the interactions increased from to 8 the of the dataset identified as by PVM increased as the EPR index The of PVM can also be by the EPR of the subset of by It that this dataset within experimental the INT set are about interactions within the dataset identified in the genome-wide Y2H These interactions that were reported by Ito et al. (7.Ito T. Chiba T. Ozawa R. Yoshida M. Hattori M. Sakaki Y. 
A comprehensive two-hybrid analysis to explore the yeast protein interactome.Proc. Natl. Acad. Sci. U. S. A. 2001; 98: 4569-4574Google Scholar) as based on only these interactions are to contain false M. Legrain P. Yeast and Acids Res. Scholar) the in and demonstrate that contain a of true and the method such as PVM is to identify at least some of them. A subset of the interactions to be can be identified by the PVM INT and EC2 sets this a of 3003 interactions. This set is as the and is available on the DIP website of the interactions are identified by PVM and as such could not be by 2 that this set of interactions has a of that is to the sets to be and The of interactions to be based on the EPR index of is PVM is to identify putatively interactions with high it is with the of INT and EC2 to from all interactions, which are to be by We and for
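The paralogous verification idea, which judges an interaction likely if the putatively interacting pair has paralogs that also interact, can be sketched as a simple count over a database of known interactions. The function name `pvm_score`, the dictionary of paralogs, and the set-of-pairs representation below are hypothetical choices made for illustration. How paralogs are detected, and what count is required to accept an interaction, are not specified here and are left to the caller.

```python
def pvm_score(a, b, paralogs, known):
    """Count known interactions between the paralog families of a and b.

    Hypothetical sketch:
      paralogs -- dict mapping a protein to an iterable of its paralogs
      known    -- set of frozenset pairs, one per known interaction
    The queried pair (a, b) itself is excluded from the count.
    """
    # Each family is the protein itself plus its listed paralogs.
    family_a = set(paralogs.get(a, ())) | {a}
    family_b = set(paralogs.get(b, ())) | {b}
    query = frozenset((a, b))

    # Collect distinct supporting pairs so that none is counted twice.
    support = set()
    for x in family_a:
        for y in family_b:
            pair = frozenset((x, y))
            if pair != query and pair in known:
                support.add(pair)
    return len(support)


if __name__ == "__main__":
    # Toy data with made-up protein names.
    paralogs = {"A": ["A2"], "B": ["B2"]}
    known = {
        frozenset(("A", "B")),
        frozenset(("A2", "B2")),
        frozenset(("A", "B2")),
        frozenset(("C", "D")),
    }
    print(pvm_score("A", "B", paralogs, known))  # two supporting pairs
    print(pvm_score("C", "D", paralogs, known))  # no paralogs, no support
```

A pair whose proteins have no paralogs always scores zero in this sketch, which reflects the limitation that the method can only be applied when the proteins in the interaction have paralogs.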

