Cross-lingual training of summarization systems using annotated corpora in a foreign language

Authors:
Marina Litvak; Mark Last
Affiliations:
Sami Shamoon Academic College of Engineering, Beer-Sheva, Israel 84100; Ben Gurion University of the Negev, Beer-Sheva, Israel 84105
Venue:
Information Retrieval
Year:
2013

Abstract

The increasing trend of cross-border globalization and acculturation requires text summarization techniques to work equally well for multiple languages. However, only some automated summarization methods can be defined as "language-independent," i.e., not based on any language-specific knowledge. Such methods can be used for multilingual summarization, defined in Mani (Automatic summarization. Natural language processing. John Benjamins Publishing Company, Amsterdam, 2001) as "processing several languages, with a summary in the same language as input", but their performance is usually unsatisfactory because they exclude language-specific knowledge. Moreover, supervised machine learning approaches need training corpora in multiple languages; such corpora are usually unavailable for rare languages, and creating them is a very expensive and labor-intensive process. In this article, we describe cross-lingual methods for training an extractive single-document text summarizer called MUSE (MUltilingual Sentence Extractor), a supervised approach based on the linear optimization of a rich set of sentence ranking measures using a Genetic Algorithm. We evaluated MUSE's performance on documents in three different languages (English, Hebrew, and Arabic) using several training scenarios. The summarization quality was measured using ROUGE-1 and ROUGE-2 Recall metrics. The results of the extensive comparative analysis showed that MUSE performed better than the best known multilingual approach (TextRank) in all three languages. Moreover, our experimental results suggest that using the same sentence ranking model across languages yields reasonable summarization quality while saving the end-user considerable annotation effort. On the other hand, using parallel corpora generated by machine translation tools may improve the performance of a MUSE model trained on a foreign language. A comparative evaluation of an alternative optimization technique, Multiple Linear Regression, justifies the use of a Genetic Algorithm.
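
The two technical ingredients named in the abstract, a weighted linear combination of sentence ranking measures and ROUGE-N Recall, can be illustrated with a minimal Python sketch. This is not the MUSE implementation: the function names, the whitespace tokenizer, and the two-feature rows in the usage example are assumptions made for illustration only, and the Genetic Algorithm that would search for the weights is not shown. The sketch only indicates how a candidate weight vector could be turned into an extract and then scored against a reference summary.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N Recall: clipped n-gram overlap divided by reference n-grams.

    Assumption: lowercased whitespace tokenization, no stemming or
    stop-word removal.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def sentence_scores(feature_rows, weights):
    """Score each sentence as a weighted linear combination of its features."""
    return [sum(w * f for w, f in zip(weights, row)) for row in feature_rows]


def extract_summary(sentences, feature_rows, weights, k=1):
    """Pick the k top-scoring sentences and return them in document order."""
    scores = sentence_scores(feature_rows, weights)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:k]))


if __name__ == "__main__":
    # Hypothetical toy document; each row holds two made-up feature values.
    sentences = ["the cat sat on the mat", "it was sunny", "the cat slept"]
    feature_rows = [[1.0, 0.2], [0.5, 0.9], [0.3, 0.1]]
    weights = [0.8, 0.2]  # a candidate weight vector an optimizer might propose
    summary = extract_summary(sentences, feature_rows, weights, k=1)
    print(summary)
    print(rouge_n_recall(summary, "the cat sat", n=1))
```

Under these assumptions, an optimizer such as a Genetic Algorithm would treat `weights` as the individual being evolved and use the ROUGE recall of the resulting extracts, averaged over an annotated training corpus, as the fitness value.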