Sentence alignment of Hungarian-English parallel corpora using a hybrid algorithm

Authors:
Krisztina Tóth;Richárd Farkas;András Kocsor
Affiliations:
Hungarian Academy of Sciences and University of Szeged, Szeged, Hungary;Hungarian Academy of Sciences and University of Szeged, Szeged, Hungary;Hungarian Academy of Sciences and University of Szeged, Szeged, Hungary
Venue:
Acta Cybernetica
Year:
2008

Abstract

We present an efficient hybrid method for aligning sentences with their translations in a parallel bilingual corpus. The new algorithm combines a length-based method with an anchor-matching method that uses Named Entity recognition, pairing the speed of length-based models with the accuracy of anchor-finding methods. Because cognates can be found only with extremely low accuracy for the Hungarian-English language pair, we instead adopt a novel approach that uses Named Entity recognition. Thanks to its well-selected anchors, the method outperformed the two best sentence alignment algorithms published so far for the Hungarian-English language pair.
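
The general idea of combining a length-based cost with anchor matches can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the cost function, the Named Entity recognizer, or the search procedure, so everything below is an assumption made for illustration. In particular, `anchors` is a hypothetical stand-in for Named Entity recognition (it treats numbers and non-sentence-initial capitalized tokens as anchors), the length cost is a plain log-ratio of character counts, and `anchor_bonus` and `skip_penalty` are arbitrary values.

```python
import math
import re


def anchors(sentence):
    """Hypothetical anchor extractor, a crude stand-in for NE recognition:
    numbers and capitalized tokens that are not sentence-initial."""
    tokens = re.findall(r"\w+", sentence)
    return {
        tok for pos, tok in enumerate(tokens)
        if tok.isdigit() or (pos > 0 and tok[0].isupper())
    }


def group_anchors(sentences):
    """Union of the anchors of every sentence in a group."""
    found = set()
    for s in sentences:
        found |= anchors(s)
    return found


def length_cost(src_chars, tgt_chars):
    """Assumed length cost: absolute log-ratio of character counts."""
    return abs(math.log((src_chars + 1) / (tgt_chars + 1)))


def align(src, tgt, anchor_bonus=1.0, skip_penalty=3.0):
    """Dynamic-programming alignment over 1-1, 1-0, 0-1, 2-1 and 1-2 groups.
    Each shared anchor lowers the cost of pairing two groups."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                if di == 0 or dj == 0:
                    step = skip_penalty  # sentence left without a partner
                else:
                    s_group, t_group = src[i:ni], tgt[j:nj]
                    shared = group_anchors(s_group) & group_anchors(t_group)
                    step = length_cost(sum(map(len, s_group)),
                                       sum(map(len, t_group)))
                    step -= anchor_bonus * len(shared)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the aligned groups.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


hu = ["Kovács János 1998 óta Szegeden él.", "Szereti a zenét."]
en = ["John Kovács has lived in Szeged since 1998.", "He likes music."]
print(align(hu, en))
```

In this toy example only the number 1998 matches as an anchor, because the inflected form "Szegeden" differs from "Szeged" on the surface. That kind of mismatch is consistent with the abstract's remark that surface-level cognate matching works poorly for this language pair, and it is one reason a real system would need a proper Named Entity recognizer rather than the string heuristic assumed here.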