This paper describes the framework of the StatCan Daily Translation Extraction System (SDTES), a computer system that maps and compares web-based translations of Statistics Canada (StatCan) news releases published in the StatCan publication The Daily. The goal is to extract translations for use in translation memory systems, translation terminology building, cross-language information retrieval, and corpus-based machine translation systems. Three years of officially published statistical news releases from http://www.statcan.ca were collected to form the StatCan Daily data bank. The English and French texts in this collection were first roughly aligned using the Gale-Church statistical algorithm. The boundary markers of text segments and paragraphs were then adjusted, and the Gale-Church algorithm was run a second time to produce a finer-grained text segment alignment. To detect misaligned areas of text and to prevent mismatched translation pairs from being selected, key textual and structural properties of the mapped texts were automatically identified and used as anchoring features for comparison and misalignment detection. The proposed method was also tested on web-based bilingual materials from five other Canadian government websites. Results show that the SDTES model is very efficient in extracting translations from published government texts and very accurate in identifying mismatched translations. With tuned parameters, the text-mapping component can be used to align corpus data collected from official government websites, and the text-comparing component can be applied to prepublication translation quality control and to evaluating the output of statistical machine translation systems.
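The length-based alignment step mentioned in the abstract can be illustrated with a minimal sketch in the style of the Gale-Church algorithm. This is not the SDTES implementation: the function names (`length_cost`, `align_lengths`) are hypothetical, the constants `C`, `S2` and the `PRIORS` table are the values commonly quoted for Gale-Church and should be treated as assumptions, and the input is simply a list of segment lengths in characters for each language.

```python
import math

# Assumed constants, as commonly quoted for Gale-Church (not taken from SDTES).
C = 1.0    # expected target characters per source character
S2 = 6.8   # variance of the length difference per character
# Prior probability of each bead type: (source segments, target segments).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}


def length_cost(len_src, len_tgt):
    """Negative log probability of observing this length mismatch."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / C) / 2.0
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-sided tail of the standard normal: 2 * (1 - Phi(|delta|)).
    tail = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(tail, 1e-300))  # guard against log(0)


def align_lengths(src_lens, tgt_lens):
    """Return the lowest-cost list of beads as (source indices, target indices)."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for (di, dj), prior in PRIORS.items():
                if i < di or j < dj or cost[i - di][j - dj] == inf:
                    continue
                step = -math.log(prior) + length_cost(
                    sum(src_lens[i - di:i]), sum(tgt_lens[j - dj:j]))
                total = cost[i - di][j - dj] + step
                if total < cost[i][j]:
                    cost[i][j] = total
                    back[i][j] = (di, dj)
    # Walk the back-pointers from the end of both texts to the start.
    beads, i, j = [], n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        beads.append((tuple(range(i - di, i)), tuple(range(j - dj, j))))
        i, j = i - di, j - dj
    return beads[::-1]


if __name__ == "__main__":
    # Two short source segments that were merged into one target segment.
    print(align_lengths([30, 30, 40], [62, 41]))
```

In a two-pass setup like the one the abstract describes, a routine of this kind could be applied first to paragraph lengths and then, within each aligned paragraph pair, to the lengths of the finer text segments. How SDTES adjusts boundary markers between the passes is not specified here.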