Combining statistical data analysis techniques to extract topical keyword classes from corpora

Authors: Mathias Rossignol; Pascale Sébillot
Affiliation: Irisa, Campus de Beaulieu, 35042 Rennes Cedex, France (both authors)
Venue: Intelligent Data Analysis
Year: 2005

Abstract

We present an unsupervised method for generating, from a textual corpus, classes of keywords, that is, words whose occurrences in a text are strongly connected with the presence of a given topic. Each class is associated with one of the main topics of the corpus and can be used to detect the presence of that topic in any paragraph of the corpus by a simple keyword co-occurrence criterion. The classes are extracted from the textual data fully automatically, without requiring any a priori linguistic knowledge or any assumptions about the topics to search for. The algorithms we have developed yield satisfactory and directly usable results despite the amount of noise inherent in textual data, thanks to a combination of several data analysis techniques. On a corpus of archives from the French monthly newspaper Le Monde Diplomatique, we obtain 40 classes of about 30 words each that accurately characterize precise topics and allow us to detect their occurrences with a precision of 85% and a recall of 65%.
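The abstract does not spell out the keyword co-occurrence criterion, so the following is only a minimal sketch of how such a detection step and its evaluation might look. It assumes that a topic is detected in a paragraph when at least `min_keywords` distinct keywords of its class occur there; the threshold, the function names, and the toy keyword classes are all hypothetical and are not taken from the paper.

```python
import re


def detect_topics(paragraph, keyword_classes, min_keywords=2):
    """Return the topics whose keyword class has at least `min_keywords`
    distinct keywords occurring in the paragraph.

    The threshold is an illustrative assumption, not the paper's criterion.
    """
    tokens = set(re.findall(r"\w+", paragraph.lower()))
    return {
        topic
        for topic, keywords in keyword_classes.items()
        if len(tokens & keywords) >= min_keywords
    }


def precision_recall(predicted, gold):
    """Compute precision and recall over sets of (paragraph_id, topic) pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy keyword classes invented for this example (not results from the paper).
keyword_classes = {
    "economy": {"market", "inflation", "bank", "trade"},
    "health": {"hospital", "vaccine", "disease", "doctor"},
}

if __name__ == "__main__":
    text = "The central bank raised rates as inflation hit the market."
    print(detect_topics(text, keyword_classes))
```

Requiring several distinct keywords rather than a single one is one plausible way to make detection robust to a stray ambiguous word; the actual criterion and threshold used by the authors may differ.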