Aligning and using an English-Inuktitut parallel corpus

Authors:
Joel Martin;Howard Johnson;Benoit Farley;Anna Maclachlan
Affiliations:
Institute for Information Technology, Canada;Institute for Information Technology, Canada;Institute for Information Technology, Canada;Institute for Information Technology, Canada
Venue:
HLT-NAACL-PARALLEL '03 Proceedings of the HLT-NAACL 2003 Workshop on Building and using parallel texts: data driven machine translation and beyond - Volume 3
Year:
2003

Abstract

A parallel corpus of texts in English and in Inuktitut, an Inuit language, is presented. These texts are from the Nunavut Hansards. The parallel texts are processed in two phases: sentence alignment and word correspondence. Our sentence alignment technique achieves a precision of 91.4% and a recall of 92.3%. Our word correspondence technique aims to provide the broadest-coverage collection of reliable pairs of Inuktitut and English morphemes for dictionary expansion. For an agglutinative language like Inuktitut, this entails considering substrings, not simply whole words. We employ a Pointwise Mutual Information (PMI) method and attain a coverage of 72.3% of English words and a precision of 87%.
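
The substring-based PMI idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give their exact PMI formulation, thresholds, or substring handling, so the function names (`substrings`, `pmi_table`), the `min_len` parameter, and the choice to count co-occurrence once per aligned sentence pair are all assumptions made here. The toy data uses invented placeholder strings, not real Inuktitut.

```python
import math
from collections import Counter


def substrings(word, min_len=2):
    """Return all contiguous substrings of `word` with at least `min_len` characters."""
    return {
        word[i:j]
        for i in range(len(word))
        for j in range(i + min_len, len(word) + 1)
    }


def pmi_table(pairs, min_len=2):
    """Score (English word, target-side substring) pairs by PMI.

    `pairs` is a list of (english_sentence, target_sentence) strings that are
    assumed to be already sentence-aligned. Counts are presence/absence per
    sentence pair (an assumption of this sketch), so
    PMI(e, s) = log2( c(e, s) * N / (c(e) * c(s)) ).
    """
    n = len(pairs)
    en_count, sub_count, joint = Counter(), Counter(), Counter()
    for en, target in pairs:
        en_items = set(en.lower().split())
        sub_items = set()
        for word in target.lower().split():
            sub_items |= substrings(word, min_len)
        en_count.update(en_items)
        sub_count.update(sub_items)
        joint.update((e, s) for e in en_items for s in sub_items)
    return {
        (e, s): math.log2(c * n / (en_count[e] * sub_count[s]))
        for (e, s), c in joint.items()
    }


# Toy aligned pairs with made-up target-side strings (placeholders only).
toy_pairs = [
    ("the house", "takaxx"),
    ("a house", "zztaka"),
    ("the river", "siluxx"),
    ("a river", "zzsilu"),
]
scores = pmi_table(toy_pairs)
print(scores[("house", "taka")])  # substring shared by both "house" sentences
```

In this toy setting the substring `taka` co-occurs with `house` in every sentence pair where either appears, so it scores higher than a substring such as `xx` that is spread evenly across unrelated sentences. A real system would also need frequency thresholds and a way to choose among overlapping substrings, neither of which is specified in the abstract.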