Automatic extraction of translations from web-based bilingual materials

Authors:
Qibo Zhu; Diana Inkpen; Ash Asudeh
Affiliations:
Statistics Canada, Ottawa, Canada and Institute of Cognitive Science, Carleton University, Ottawa, Canada; School of Information Technology & Engineering, University of Ottawa, Ottawa, Canada; Institute of Cognitive Science, Carleton University, Ottawa, Canada and School of Linguistics and Applied Language Studies, Carleton University, Ottawa, Canada
Venue:
Machine Translation
Year:
2007

Abstract

This paper describes the framework of the StatCan Daily Translation Extraction System (SDTES), a computer system that maps and compares web-based translation texts of Statistics Canada (StatCan) news releases in the StatCan publication The Daily. The goal is to extract translations for translation memory systems, translation terminology building, cross-language information retrieval, and corpus-based machine translation systems. Three years of officially published statistical news release texts at http://www.statcan.ca were collected to compose the StatCan Daily data bank. The English and French texts in this collection were first roughly aligned using the Gale-Church statistical algorithm. Boundary markers of text segments and paragraphs were then adjusted, and the Gale-Church algorithm was run a second time to obtain a finer-grained alignment of text segments. To detect misaligned areas of texts and to prevent mismatched translation pairs from being selected, key textual and structural properties of the mapped texts were automatically identified and used as anchoring features for comparison and misalignment detection. The proposed method has been tested with web-based bilingual materials from five other Canadian government websites. Results show that the SDTES model is efficient in extracting translations from published government texts and accurate in identifying mismatched translations. With parameters tuned, the text-mapping part can be used to align corpus data collected from official government websites, and the text-comparing component can be applied in prepublication translation quality control and in evaluating the results of statistical machine translation systems.
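
To make the length-based alignment step concrete, here is a minimal Python sketch in the spirit of the Gale-Church algorithm named in the abstract. It is not the SDTES implementation. The function names (`match_cost`, `align`), the prior probabilities, and the constants `C` and `S2` are assumptions for illustration; the numeric values are those commonly quoted for the Gale-Church method and would need tuning for a given corpus. The sketch scores candidate pairings of segments by how well their character lengths agree and picks the cheapest overall pairing by dynamic programming.

```python
import math

# Alignment types (source segments, target segments) and their priors.
# Values commonly quoted for Gale-Church; assumed here, not taken from SDTES.
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}
C = 1.0    # assumed expected target characters per source character
S2 = 6.8   # assumed variance of that ratio


def match_cost(len_src, len_tgt, prior):
    """Negative log probability of pairing spans with these character lengths."""
    mean = (len_src + len_tgt / C) / 2.0
    delta = 0.0 if mean == 0 else (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-sided tail probability of a standard normal at |delta|.
    tail = max(math.erfc(abs(delta) / math.sqrt(2.0)), 1e-300)
    return -math.log(prior) - math.log(tail)


def align(src, tgt):
    """Return beads as (source_indices, target_indices), in document order."""
    n, m = len(src), len(tgt)
    src_len = [len(s) for s in src]
    tgt_len = [len(t) for t in tgt]
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for (di, dj), prior in PRIORS.items():
                if i < di or j < dj or cost[i - di][j - dj] == inf:
                    continue
                step = match_cost(sum(src_len[i - di:i]),
                                  sum(tgt_len[j - dj:j]), prior)
                total = cost[i - di][j - dj] + step
                if total < cost[i][j]:
                    cost[i][j] = total
                    back[i][j] = (di, dj)
    # Walk back from the end to recover the cheapest sequence of beads.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    return beads[::-1]


english = ["Retail sales rose in March.",
           "Gains were widespread across the provinces.",
           "Prices were unchanged."]
french = ["Les ventes au détail ont augmenté en mars.",
          "Les hausses ont été généralisées dans l'ensemble des provinces.",
          "Les prix sont demeurés inchangés."]
print(align(english, french))
# [([0], [0]), ([1], [1]), ([2], [2])]
```

Because the score depends only on segment lengths, an aligner of this kind can pair segments that are not translations of each other when their lengths happen to agree, which is why the abstract describes a separate comparison step based on anchoring features.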