Automatic text summarization based on word-clusters and ranking algorithms

Authors:
Massih R. Amini;Nicolas Usunier;Patrick Gallinari
Affiliations:
Computer Science Laboratory of Paris 6, Paris, France;Computer Science Laboratory of Paris 6, Paris, France;Computer Science Laboratory of Paris 6, Paris, France
Venue:
ECIR'05 Proceedings of the 27th European conference on Advances in Information Retrieval Research
Year:
2005

Abstract

This paper investigates a new approach to Single Document Summarization (SDS) based on a Machine Learning ranking algorithm. Using machine learning techniques for this task makes it possible to adapt summaries to the user's needs and to the characteristics of the corpus. These desirable properties have motivated an increasing amount of work in this field over the last few years. Most approaches generate summaries by extracting text spans (sentences in our case) and adopt the classification framework, which consists of training a classifier to discriminate between relevant and irrelevant spans of a document. A set of features is first used to produce a vector of scores for each sentence in a given document, and a classifier is then trained to make a global combination of these scores. We believe that the classification criterion used to train a classifier is not well suited to SDS, and we propose an original framework based on ranking for this task. A ranking algorithm also combines the scores of different features, but its criterion tends to reduce the relative misordering of sentences within a document. The features we use here are either based on the state of the art or built upon word-clusters. These clusters are groups of words that often co-occur with each other, and they can serve to expand a query or to enrich the representation of the sentences of the documents. We analyze the performance of our ranking algorithm on two data sets: the Computation and Language (cmp_lg) collection of TIPSTER SUMMAC and the WIPO collection. We compare it with several non-learning baseline systems and with a reference trainable summarizer based on the classification framework. The experiments show that the learning algorithms perform better than the non-learning systems, and that the ranking algorithm outperforms the classifier. The performance gap between the two learning algorithms depends on the nature of the datasets. We explain this by the different separability hypotheses that the two learning algorithms make about the data.
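
The ranking criterion described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the loss or the optimizer, so the pairwise exponential loss, the plain gradient descent, the function names (`score`, `pairwise_loss`, `train_ranker`, `summarize`) and the toy feature values are all assumptions made for illustration. The sketch learns a linear combination of per-sentence feature scores so that, within each document, relevant sentences score higher than irrelevant ones.

```python
import math

def score(weights, features):
    """Linear combination of one sentence's feature scores."""
    return sum(w * f for w, f in zip(weights, features))

def split_pairs(sentences, labels):
    """Separate a document's sentences into relevant (1) and irrelevant (0)."""
    pos = [s for s, y in zip(sentences, labels) if y == 1]
    neg = [s for s, y in zip(sentences, labels) if y == 0]
    return pos, neg

def pairwise_loss(weights, sentences, labels):
    """Mean exponential loss over (relevant, irrelevant) pairs of one document.

    Assumed loss: a pair contributes more than 1 when it is misordered.
    """
    pos, neg = split_pairs(sentences, labels)
    if not pos or not neg:
        return 0.0
    total = sum(math.exp(score(weights, n) - score(weights, p))
                for p in pos for n in neg)
    return total / (len(pos) * len(neg))

def train_ranker(documents, n_features, lr=0.1, epochs=200):
    """Gradient descent on the summed per-document pairwise loss."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for sentences, labels in documents:
            pos, neg = split_pairs(sentences, labels)
            if not pos or not neg:
                continue
            norm = len(pos) * len(neg)
            for p in pos:
                for n in neg:
                    e = math.exp(score(weights, n) - score(weights, p))
                    for k in range(n_features):
                        grad[k] += e * (n[k] - p[k]) / norm
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights

def summarize(weights, sentences, k):
    """Indices of the k top-ranked sentences, returned in document order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(weights, sentences[i]), reverse=True)
    return sorted(ranked[:k])

# Hypothetical toy data: each sentence is a vector of two feature scores.
toy_documents = [
    ([[0.9, 0.1], [0.2, 0.3], [0.8, 0.2]], [1, 0, 1]),
    ([[0.5, 0.5], [0.1, 0.6], [0.05, 0.1]], [1, 0, 0]),
]
weights = train_ranker(toy_documents, n_features=2)
summary = summarize(weights, toy_documents[0][0], k=2)
```

Because the loss is computed over pairs of sentences from the same document, only the relative order within a document matters. A classifier, by contrast, is trained to place all relevant sentences of all documents on one side of a single decision threshold.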