Combining optimal clustering and Hidden Markov models for extractive summarization

  • Authors:
  • Pascale Fung; Grace Ngai; Chi-Shun Cheung

  • Affiliations:
  • Hong Kong University of Science & Technology (HKUST), Clear Water Bay, Hong Kong; Hong Kong Polytechnic University, Kowloon, Hong Kong; Hong Kong University of Science & Technology (HKUST), Clear Water Bay, Hong Kong

  • Venue:
  • MultiSumQA '03: Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering - Volume 12
  • Year:
  • 2003

Abstract

We propose Hidden Markov models with unsupervised training for extractive summarization. Extractive summarization selects salient sentences from documents to be included in a summary. Unsupervised clustering combined with heuristics is a popular approach because no annotated data is required. However, conventional clustering methods such as K-means do not take text cohesion into consideration. Probabilistic methods are more rigorous and robust, but they usually require supervised training with annotated data. Our method incorporates unsupervised training and clustering into a probabilistic framework. Clustering is done by modified K-means (MKM), a method that yields better clusters than conventional K-means. Text cohesion is modeled by the transition probabilities of an HMM, and term distribution is modeled by the emission probabilities. The final decoding process tags sentences in a text with theme class labels. Parameter training is carried out by the segmental K-means (SKM) algorithm. The output of our system can be used to extract salient sentences for summaries, or for topic detection. Content-based evaluation shows that our method outperforms an existing extractive summarizer by 22.8% in terms of relative similarity, and outperforms a baseline summarizer that selects the top N sentences as salient by 46.3%.
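
The abstract outlines a pipeline of theme clustering, HMM construction, and decoding. The sketch below illustrates that general idea only, not the paper's exact method: it substitutes plain K-means for MKM, a softmax over centroid distances for the term-distribution emission model, and simple count-based transitions in place of SKM parameter training. All function names, parameters, and the toy sentences are illustrative assumptions.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    def fit_theme_hmm(sentences, n_themes=3, smoothing=1.0):
        # Sentence vectors and theme clusters (plain K-means stands in for the paper's MKM).
        X = TfidfVectorizer().fit_transform(sentences).toarray()
        km = KMeans(n_clusters=n_themes, n_init=10, random_state=0).fit(X)
        labels = km.labels_
        # Transition probabilities estimated from the observed label sequence (text cohesion).
        trans = np.full((n_themes, n_themes), smoothing)
        for a, b in zip(labels[:-1], labels[1:]):
            trans[a, b] += 1.0
        trans /= trans.sum(axis=1, keepdims=True)
        # Emission log-probabilities: softmax over negative distance to each theme centroid
        # (a simple stand-in for a term-distribution emission model).
        dists = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
        emit_log = -dists - np.log(np.exp(-dists).sum(axis=1, keepdims=True))
        return emit_log, np.log(trans)

    def viterbi(emit_log, log_trans):
        # Standard Viterbi decoding: one theme class label per sentence.
        T, K = emit_log.shape
        delta = -np.log(K) + emit_log[0]          # uniform initial distribution
        back = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + log_trans + emit_log[t][None, :]
            back[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0)
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]

    sentences = [
        "The central bank raised interest rates again.",
        "Analysts expect inflation to slow next quarter.",
        "In sports, the home team clinched the title.",
        "Thousands of fans joined the victory parade.",
    ]
    emit_log, log_trans = fit_theme_hmm(sentences, n_themes=2)
    print(viterbi(emit_log, log_trans))   # e.g. [0, 0, 1, 1]: a theme label per sentence

In the paper's setting, the decoded theme labels would then feed sentence selection for the summary or serve as input to topic detection.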