Unsupervised synthesis of multilingual Wikipedia articles

Authors:
Chen Yuncong; Pascale Fung
Affiliations:
The Hong Kong University of Science and Technology; The Hong Kong University of Science and Technology
Venue:
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Year:
2010

Abstract

We propose an unsupervised approach to automatically synthesizing Wikipedia articles in multiple languages. Taking an existing high-quality version of an entry as a content guideline, we extract keywords from it and use their translations to query the monolingual web in the target language. Candidate excerpts or sentences are selected by an iterative ranking function and then synthesized into a complete article that closely resembles the reference version. We evaluate 16 English and Chinese articles across 5 domains to show that the algorithm is domain-independent. Both subjective evaluations by native Chinese readers and ROUGE-L scores computed against standard reference articles demonstrate that the synthesized articles outperform existing Chinese versions and MT texts in both content richness and readability. In practice, our method can generate prototype texts for Wikipedia that facilitate later human authoring.
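
The ROUGE-L metric mentioned in the abstract scores a candidate text by its longest common subsequence (LCS) with a reference text. The sketch below is only an illustration of that general idea, not the evaluation setup used in the paper: the function names `lcs_length` and `rouge_l`, the whitespace tokenization, and the default `beta=1.0` are all assumptions made here. Chinese text in particular would need a proper word segmenter before scoring.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-score between two strings.

    Assumption: tokens are obtained by whitespace splitting, which is
    a simplification chosen for this sketch.
    """
    cand = candidate.split()
    ref = reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    b2 = beta * beta
    return (1 + b2) * precision * recall / (recall + b2 * precision)


if __name__ == "__main__":
    score = rouge_l("the cat sat on the mat", "the cat is on the mat")
    print(f"ROUGE-L F-score: {score:.3f}")
```

With `beta=1.0` the score is the harmonic mean of LCS-based precision and recall; a larger `beta` weights recall more heavily, which may suit comparisons against a longer reference article.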