Combining optimal clustering and Hidden Markov models for extractive summarization

  • Authors:
  • Pascale Fung; Grace Ngai; Chi-Shun Cheung

  • Affiliations:
  • Hong Kong University of Science & Technology (HKUST), Clear Water Bay, Hong Kong; Hong Kong Polytechnic University, Kowloon, Hong Kong; Hong Kong University of Science & Technology (HKUST), Clear Water Bay, Hong Kong

  • Venue:
  • MultiSumQA '03: Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering - Volume 12
  • Year:
  • 2003

Abstract

We propose Hidden Markov models with unsupervised training for extractive summarization. Extractive summarization selects salient sentences from documents to be included in a summary. Unsupervised clustering combined with heuristics is a popular approach because no annotated data is required. However, conventional clustering methods such as K-means do not take text cohesion into consideration. Probabilistic methods are more rigorous and robust, but they usually require supervised training with annotated data. Our method incorporates unsupervised training and clustering into a probabilistic framework. Clustering is done by modified K-means (MKM), a method that yields better clusters than conventional K-means. Text cohesion is modeled by the transition probabilities of an HMM, and term distribution is modeled by the emission probabilities. The final decoding process tags sentences in a text with theme class labels. Parameter training is carried out by the segmental K-means (SKM) algorithm. The output of our system can be used to extract salient sentences for summaries, or for topic detection. Content-based evaluation shows that our method outperforms an existing extractive summarizer by 22.8% in terms of relative similarity, and outperforms a baseline summarizer that selects the top N sentences as salient by 46.3%.
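
The abstract outlines a pipeline of theme clustering, HMM construction, and decoding. The sketch below illustrates that general idea only, not the paper's exact method: it substitutes plain K-means for MKM, a softmax over centroid distances for the term-distribution emission model, and simple count-based transitions in place of SKM parameter training. All function names, parameters, and the toy sentences are illustrative assumptions.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    def fit_theme_hmm(sentences, n_themes=3, smoothing=1.0):
        # Sentence vectors and theme clusters (plain K-means stands in for the paper's MKM).
        X = TfidfVectorizer().fit_transform(sentences).toarray()
        km = KMeans(n_clusters=n_themes, n_init=10, random_state=0).fit(X)
        labels = km.labels_
        # Transition probabilities estimated from the observed label sequence (text cohesion).
        trans = np.full((n_themes, n_themes), smoothing)
        for a, b in zip(labels[:-1], labels[1:]):
            trans[a, b] += 1.0
        trans /= trans.sum(axis=1, keepdims=True)
        # Emission log-probabilities: softmax over negative distance to each theme centroid
        # (a simple stand-in for a term-distribution emission model).
        dists = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
        emit_log = -dists - np.log(np.exp(-dists).sum(axis=1, keepdims=True))
        return emit_log, np.log(trans)

    def viterbi(emit_log, log_trans):
        # Standard Viterbi decoding: one theme class label per sentence.
        T, K = emit_log.shape
        delta = -np.log(K) + emit_log[0]          # uniform initial distribution
        back = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + log_trans + emit_log[t][None, :]
            back[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0)
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]

    sentences = [
        "The central bank raised interest rates again.",
        "Analysts expect inflation to slow next quarter.",
        "In sports, the home team clinched the title.",
        "Thousands of fans joined the victory parade.",
    ]
    emit_log, log_trans = fit_theme_hmm(sentences, n_themes=2)
    print(viterbi(emit_log, log_trans))   # e.g. [0, 0, 1, 1]: a theme label per sentence

In the paper's setting, the decoded theme labels would then feed sentence selection for the summary or serve as input to topic detection.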