Automatic construction of domain-specific dictionaries on sparse parallel corpora in the Nordic languages

Authors:
Sumithra Velupillai;Hercules Dalianis
Affiliations:
DSV/KTH-Stockholm University, Sweden;DSV/KTH-Stockholm University, Sweden and Euroling AB, Stockholm, Sweden
Venue:
MMIES '08 Proceedings of the Workshop on Multi-source Multilingual Information Extraction and Summarization
Year:
2008

Citing 4
Cited 0

A systematic comparison of various statistical alignment models

Computational Linguistics
Combining clues for word alignment

EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1
Extracting parallel sub-sentential fragments from non-parallel corpora

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Word alignment for languages with scarce resources

ParaText '05 Proceedings of the ACL Workshop on Building and Using Parallel Texts

Quantified Score

Hi-index	0.00

Visualization

Abstract

Hallå Norden is a web site with information regarding mobility between the Nordic countries in five different languages; Swedish, Danish, Norwegian, Icelandic and Finnish. We wanted to create a Nordic cross-language dictionary for the use in a cross-language search engine for Hallå Norden. The entire set of texts on the web site was treated as one multilingual parallel corpus. From this we extracted parallel corpora for each language pair. The corpora were very sparse, containing on average less than 80 000 words per language pair. We have used the Uplug word alignment system (Tiedemann 2003a), for the creation of the dictionaries. The results gave on average 213 new dictionary words (frequency 3) per language pair. The average error rate was 16 percent. Different combinations with Finnish had a higher error rate, 33 percent, whereas the error rate for the remaining language pairs only yielded on average 9 percent errors. The high error rate for Finnish is possibly due to the fact that the Finnish language belongs to a different language family. Although the corpora were very sparse the word alignment results for the combinations of Swedish, Danish, Norwegian and Icelandic were surprisingly good compared to other experiments with larger corpora.