Cross-lingual training of summarization systems using annotated corpora in a foreign language

  • Authors: Marina Litvak; Mark Last
  • Affiliations: Sami Shamoon Academic College of Engineering, Beer-Sheva, Israel 84100; Ben Gurion University of the Negev, Beer-Sheva, Israel 84105
  • Venue: Information Retrieval
  • Year: 2013

Abstract

The increasing trend of cross-border globalization and acculturation requires text summarization techniques to work equally well for multiple languages. However, only some automated summarization methods can be defined as "language-independent," i.e., not based on any language-specific knowledge. Such methods can be used for multilingual summarization, defined in Mani (Automatic summarization. Natural language processing. John Benjamins Publishing Company, Amsterdam, 2001) as "processing several languages, with a summary in the same language as input", but their performance is usually unsatisfactory because they exclude language-specific knowledge. Moreover, supervised machine learning approaches need training corpora in multiple languages; such corpora are usually unavailable for rare languages, and creating them is an expensive and labor-intensive process. In this article, we describe cross-lingual methods for training an extractive single-document text summarizer called MUSE (MUltilingual Sentence Extractor), a supervised approach based on the linear optimization of a rich set of sentence ranking measures using a Genetic Algorithm. We evaluated MUSE's performance on documents in three languages (English, Hebrew, and Arabic) using several training scenarios. Summarization quality was measured using ROUGE-1 and ROUGE-2 Recall metrics. The results of the extensive comparative analysis showed that MUSE performed better than the best known multilingual approach (TextRank) in all three languages. Moreover, our experimental results suggest that using the same sentence ranking model across languages yields reasonable summarization quality while saving the end-user considerable annotation effort. On the other hand, using parallel corpora generated by machine translation tools may improve the performance of a MUSE model trained on a foreign language. Comparative evaluation of an alternative optimization technique, Multiple Linear Regression, justifies the use of a Genetic Algorithm.
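To make the two technical ingredients named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes whitespace tokenization and a single reference summary, and the names `ngrams`, `rouge_n_recall`, and `rank_sentences` are hypothetical helpers introduced only for illustration. It shows (a) ROUGE-N Recall as clipped n-gram overlap divided by the number of n-grams in the reference, and (b) ranking sentences by a weighted linear combination of per-sentence feature values, which is the kind of model whose weights a Genetic Algorithm could search over.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N Recall: clipped n-gram overlap / n-grams in the reference.

    Assumption: lowercased whitespace tokenization, one reference summary.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as it occurs
    # in the candidate (clipping).
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def rank_sentences(feature_rows, weights):
    """Return sentence indices sorted by descending weighted feature sum.

    feature_rows[i] holds the (hypothetical) ranking-measure values of
    sentence i; weights is the linear model being optimized.
    """
    scores = [sum(w * f for w, f in zip(weights, row)) for row in feature_rows]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In a training loop under these assumptions, each candidate weight vector would be scored by building an extract from the top-ranked sentences and computing `rouge_n_recall` against the annotated summary; the optimizer would keep the weight vectors with the highest recall.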