Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval

Xuerui Wang (University of Massachusetts Amherst), Andrew McCallum (University of Massachusetts Amherst), Xing Wei (University of Massachusetts Amherst)
October 1, 2007
Cited by 486 | Open Access

Abstract

Most topic models, such as latent Dirichlet allocation, rely on the bag-of-words assumption. However, word order and phrases are often critical to capturing the meaning of text in many text mining tasks. This paper presents topical n-grams, a topic model that discovers topics as well as topical phrases. The probabilistic model generates words in their textual order by, for each word, first sampling a topic, then sampling its status as a unigram or bigram, and then sampling the word from a topic-specific unigram or bigram distribution. Thus our model can treat "white house" as a phrase with a special meaning in the 'politics' topic, but not in the 'real estate' topic. Successive bigrams form longer phrases. We present experiments showing meaningful phrases and more interpretable topics from the NIPS data and improved information retrieval performance on a TREC collection.
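
The generative process described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the abstract does not say how the distributions are parameterized, so the symmetric Dirichlet priors (`alpha`, `beta`, `delta`), the fixed `bigram_prob` used to draw the unigram/bigram status, the toy vocabulary, and all function names are assumptions made for this sketch. The second function illustrates how successive bigram statuses could chain words into longer phrases.

```python
import random


def sample_dirichlet(rng, alpha, size):
    """Draw one symmetric Dirichlet(alpha) vector via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(rng, probs):
    """Draw an index from a discrete distribution given as a list of weights."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(vocab, num_topics, length, seed=0,
                      alpha=1.0, beta=0.5, delta=0.5, bigram_prob=0.3):
    """Generate (word, topic, status) triples in textual order.

    status 0 = unigram, status 1 = bigram with the previous word.
    All priors and bigram_prob are assumed values for illustration only.
    """
    rng = random.Random(seed)
    size = len(vocab)
    theta = sample_dirichlet(rng, alpha, num_topics)           # topic mixture
    phi = [sample_dirichlet(rng, beta, size)                   # per-topic unigram
           for _ in range(num_topics)]
    sigma = {}  # (topic, previous word index) -> bigram distribution, built lazily

    tokens = []
    prev = None
    for _ in range(length):
        z = sample_index(rng, theta)                           # 1. sample a topic
        # 2. sample unigram/bigram status; the first word has no predecessor
        x = 0 if prev is None else int(rng.random() < bigram_prob)
        if x == 1:                                             # 3. sample the word
            key = (z, prev)
            if key not in sigma:
                sigma[key] = sample_dirichlet(rng, delta, size)
            w = sample_index(rng, sigma[key])
        else:
            w = sample_index(rng, phi[z])
        tokens.append((vocab[w], z, x))
        prev = w
    return tokens


def merge_phrases(tokens):
    """Join each bigram-status word onto the phrase that precedes it."""
    phrases = []
    for word, _topic, status in tokens:
        if status == 1 and phrases:
            phrases[-1] = phrases[-1] + " " + word
        else:
            phrases.append(word)
    return phrases


if __name__ == "__main__":
    toy_vocab = ["white", "house", "senate", "mortgage", "vote", "rent"]
    doc = generate_document(toy_vocab, num_topics=2, length=12, seed=1)
    print(merge_phrases(doc))
```

Because the bigram distribution is keyed on both the topic and the previous word, the same word pair can be likely under one topic and unlikely under another, which is the behavior the "white house" example points to.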

