An approach to indexing and clustering news stories using continuous language models

Authors:
Richard Bache; Fabio Crestani
Affiliations:
University of Glasgow, Glasgow, Scotland; University of Lugano, Lugano, Switzerland
Venue:
NLDB'10 Proceedings of the Natural language processing and information systems, and 15th international conference on Applications of natural language to information systems
Year:
2010

Abstract

Within the vocabulary used in a set of news stories, a minority of terms will be topic-specific, in that they occur largely or solely within those stories belonging to a common event. When applying unsupervised learning techniques such as clustering, it is useful to determine which words are event-specific and which topic they relate to. Continuous language models are used to model the generation of news stories over time, and from these models two measures are derived: bendiness, which indicates whether a word is event-specific, and shape distance, which indicates whether two terms are likely to relate to the same topic. These measures are used to construct a new clustering technique that identifies and characterises the underlying events within the news stream.
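
The abstract names the two measures without defining them. The following is a minimal Python sketch of how measures of this kind could behave; it is not the authors' method. It swaps the continuous language models for a plain empirical cumulative distribution of each term's occurrence times. Under that assumption, bendiness is taken to be the largest gap between a term's cumulative curve and the straight line expected if its occurrences were spread evenly over the collection period, and shape distance is the largest gap between two terms' curves. The function names, the `grid` parameter, and both definitions are hypothetical.

```python
from bisect import bisect_right


def cumulative_curve(timestamps, start, end, grid=50):
    """Fraction of a term's occurrences seen by each of grid+1 evenly spaced times.

    ASSUMPTION: an empirical cumulative distribution stands in for the
    continuous language model described in the abstract.
    """
    ts = sorted(timestamps)
    if not ts:
        raise ValueError("term has no occurrences")
    step = (end - start) / grid
    return [bisect_right(ts, start + step * i) / len(ts) for i in range(grid + 1)]


def bendiness(timestamps, start, end, grid=50):
    """Largest deviation of the cumulative curve from a straight line.

    ASSUMPTION: illustrative definition only. A term spread evenly over
    time scores near 0; a term concentrated in a short burst scores high.
    """
    curve = cumulative_curve(timestamps, start, end, grid)
    return max(abs(c - i / grid) for i, c in enumerate(curve))


def shape_distance(ts_a, ts_b, start, end, grid=50):
    """Largest gap between two terms' cumulative curves.

    ASSUMPTION: illustrative definition only. Terms that rise at the same
    time have a small distance and may belong to the same event.
    """
    a = cumulative_curve(ts_a, start, end, grid)
    b = cumulative_curve(ts_b, start, end, grid)
    return max(abs(x - y) for x, y in zip(a, b))
```

With timestamps measured in days over a 100-day window, a term that occurs only on days 40 to 44 receives a much higher bendiness under this sketch than a term that occurs once every day, and two terms whose bursts overlap are closer in shape distance than a bursty term and an evenly spread one.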