Indexing by latent semantic analysis|Scott Deerwester, Susan Dumais, George W. Furnas et al.|Journal of the American Society for Information Science|1990 A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100-item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising. © 1990 John Wiley & Sons, Inc.
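The pipeline described in the abstract above (truncated SVD, pseudo-document queries, cosine thresholding) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy matrix, the choice of k = 2 instead of ca. 100 factors, the 0.5 threshold, and the function names `lsi_fit`, `fold_in_query`, `cosine_to_docs`, and `retrieve` are all assumptions made for the example. The query is folded in as q^T U_k S_k^{-1} and compared with the rows of V_k; other scalings of the document vectors are possible.

```python
import numpy as np

def lsi_fit(X, k):
    """Truncated SVD of a term-by-document matrix X (terms x docs)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep only the k largest factors; rows of V_k are the document vectors.
    return U[:, :k], s[:k], Vt[:k, :].T

def fold_in_query(q, U_k, s_k):
    """Represent a query term vector as a pseudo-document: q^T U_k S_k^{-1}."""
    return (q @ U_k) / s_k

def cosine_to_docs(q_hat, doc_vecs):
    """Cosine between the pseudo-document and every document vector."""
    num = doc_vecs @ q_hat
    den = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_hat)
    return num / den

def retrieve(q, U_k, s_k, doc_vecs, threshold=0.5):
    """Return indices of documents whose cosine exceeds the threshold."""
    sims = cosine_to_docs(fold_in_query(q, U_k, s_k), doc_vecs)
    return [int(i) for i in np.flatnonzero(sims > threshold)], sims

# Hypothetical toy data: rows are terms, columns are documents.
terms = ["human", "interface", "computer", "graph", "tree"]
X = np.array([
    [1, 1, 0, 0],  # human
    [1, 0, 0, 0],  # interface
    [1, 1, 0, 0],  # computer
    [0, 0, 1, 1],  # graph
    [0, 0, 1, 1],  # tree
], dtype=float)

U_k, s_k, doc_vecs = lsi_fit(X, k=2)

# Single-term query "interface"; document 1 does not contain that term.
q = np.zeros(len(terms))
q[terms.index("interface")] = 1.0
hits, sims = retrieve(q, U_k, s_k, doc_vecs)
print(hits, np.round(sims, 3))
```

In this toy case document 1 never contains "interface", yet it is returned because it shares a latent factor with document 0 through "human" and "computer". That is the kind of higher-order association the abstract refers to.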
Support vector machines|Marti A. Hearst, Susan Dumais, E. Osuna et al.|IEEE Intelligent Systems and their Applications|1998 My first exposure to Support Vector Machines came this spring when I heard Sue Dumais present impressive results on text categorization using this analysis technique. This issue's collection of essays should help familiarize our readers with this interesting new racehorse in the Machine Learning stable. Bernhard Schölkopf, in an introductory overview, points out that a particular advantage of SVMs over other learning algorithms is that they can be analyzed theoretically using concepts from computational learning theory, and at the same time can achieve good performance when applied to real problems. Examples of these real-world applications are provided by Sue Dumais, who describes the aforementioned text-categorization problem, yielding the best results to date on the Reuters collection, and Edgar Osuna, who presents strong results on application to face detection. Our fourth author, John Platt, gives us a practical guide and a new technique for implementing the algorithm efficiently.
A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.
The vocabulary problem in human-system communication. In almost all computer applications, users must enter correct words for the desired objects or actions. For success without extensive training, or in first-tries for new targets, the system must recognize terms that will be chosen spontaneously. We studied spontaneous word choice for objects in five application-related domains, and found the variability to be surprisingly large. In every case two people favored the same term with probability <0.20. Simulations show how this fundamental property of language limits the success of various design methodologies for vocabulary-driven interaction. For example, the popular approach in which access is via one designer's favorite single word will result in 80-90 percent failure rates in many common situations. An optimal strategy, unlimited aliasing, is derived and shown to be capable of several-fold improvements.
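The arithmetic behind the figures in the abstract above can be illustrated with a small sketch. The term distribution is invented for illustration and is not data from the study. The model is also deliberately simplified: it looks at a single object and ignores collisions between objects that share a name. Under those assumptions, the chance that two people pick the same term is the sum of squared term probabilities, and a system that accepts the m most popular terms succeeds with the sum of their probabilities. The function names `agreement_probability` and `alias_success` are hypothetical.

```python
def agreement_probability(term_probs):
    """Probability that two independent users choose the same term."""
    return sum(p * p for p in term_probs)

def alias_success(term_probs, m):
    """Success rate when the system accepts the m most popular terms."""
    top = sorted(term_probs, reverse=True)[:m]
    return sum(top)

# Hypothetical distribution of spontaneous names for one object:
# ten moderately popular terms plus twelve rare ones (sums to 1.0).
term_probs = [0.15, 0.12, 0.10, 0.08, 0.07, 0.06, 0.05, 0.05, 0.04, 0.04]
term_probs += [0.02] * 12

agree = agreement_probability(term_probs)
single = alias_success(term_probs, 1)   # one designer-chosen keyword
five = alias_success(term_probs, 5)     # five aliases per object
full = alias_success(term_probs, len(term_probs))  # every observed alias

print(f"two users agree:       {agree:.3f}")
print(f"single keyword works:  {single:.2f}")
print(f"five aliases work:     {five:.2f}")
print(f"all aliases work:      {full:.2f}")
```

With this made-up distribution the single-keyword design fails about 85 percent of the time, while accepting more aliases raises the success rate several-fold. That matches the direction of the result the abstract reports, not its measured values.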
Using Linear Algebra for Intelligent Information Retrieval. Currently, most approaches to retrieving textual materials from scientific databases depend on a lexical match between words in users' requests and those in or assigned to documents in a database. Because of the tremendous diversity in the words people use to describe the same document, lexical methods are necessarily incomplete and imprecise. Using the singular value decomposition (SVD), one can take advantage of the implicit higher-order structure in the association of terms with documents by determining the SVD of large sparse term by document matrices. Terms and documents represented by 200-300 of the largest singular vectors are then matched against user queries. We call this retrieval method Latent Semantic Indexing (LSI) because the subspace represents important associative relationships between terms and documents that are not evident in individual documents. LSI is a completely automatic yet intelligent indexing method, widely applicable, and a promising way to improve users' access to many kinds of textual materials, or to documents and services for which textual descriptions are available. A survey of the computational requirements for managing LSI-encoded databases as well as current and future applications of LSI is presented. Key words: indexing, information, latent, matrices, retrieval, semantic, singular value decomposition, sparse, updating.