Building parallel corpora by automatic title alignment using length-based and text-based approaches

Authors:
Christopher C. Yang; Kar Wing Li
Affiliations:
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Ho Sin Hang Engineering Building, Shatin, NT, Hong Kong; Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Ho Sin Hang Engineering Building, Shatin, NT, Hong Kong
Venue:
Information Processing and Management: an International Journal
Year:
2004

Abstract

Cross-lingual semantic interoperability has drawn significant attention in recent digital library and World Wide Web research as the volume of information in languages other than English has grown exponentially. Cross-lingual information retrieval (CLIR) across European languages, such as English, Spanish, and French, has been widely explored; however, CLIR between European and Oriental languages is still at an early stage. To cross language boundaries, the corpus-based approach is a promising way to overcome the limitations of the knowledge-based and controlled-vocabulary approaches, but collecting parallel corpora between European and Oriental languages is not an easy task. Length-based and text-based approaches are the two major approaches to aligning parallel documents. In this paper, we investigate several techniques that use these approaches and compare their performance in aligning the English and Chinese titles of parallel documents available on the Web.