Automatic construction of domain-specific dictionaries on sparse parallel corpora in the Nordic languages

  • Authors:
  • Sumithra Velupillai; Hercules Dalianis

  • Affiliations:
  • DSV/KTH-Stockholm University, Sweden; DSV/KTH-Stockholm University, Sweden and Euroling AB, Stockholm, Sweden

  • Venue:
  • MMIES '08 Proceedings of the Workshop on Multi-source Multilingual Information Extraction and Summarization
  • Year:
  • 2008

Abstract

Hallå Norden is a web site with information about mobility between the Nordic countries in five languages: Swedish, Danish, Norwegian, Icelandic and Finnish. We wanted to create a Nordic cross-language dictionary for use in a cross-language search engine for Hallå Norden. The entire set of texts on the web site was treated as one multilingual parallel corpus, from which we extracted a parallel corpus for each language pair. The corpora were very sparse, containing on average fewer than 80 000 words per language pair. We used the Uplug word alignment system (Tiedemann 2003a) to create the dictionaries. The results gave on average 213 new dictionary words (frequency 3) per language pair. The average error rate was 16 percent. Language pairs involving Finnish had a higher error rate, 33 percent, whereas the remaining language pairs averaged only 9 percent. The high error rate for Finnish is possibly due to the fact that Finnish belongs to a different language family. Although the corpora were very sparse, the word alignment results for the combinations of Swedish, Danish, Norwegian and Icelandic were surprisingly good compared to other experiments with larger corpora.
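
As a rough illustration of the setup described in the abstract, the sketch below enumerates the language pairs that five languages give rise to and keeps only word-pair candidates seen often enough. It is a minimal sketch under stated assumptions, not the authors' pipeline and not the Uplug API: the language codes, the function names `language_pairs` and `filter_candidates`, the `min_freq` parameter, the treatment of pairs as unordered, and the toy word pairs are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Assumed language codes for the five Hallå Norden languages.
LANGUAGES = ["sv", "da", "no", "is", "fi"]


def language_pairs(languages):
    """Return every unordered pair of languages (an assumption; the
    abstract does not say whether pairs were treated as directed)."""
    return list(combinations(languages, 2))


def filter_candidates(aligned_pairs, min_freq):
    """Count aligned (source, target) word pairs and keep those that
    occur at least min_freq times. min_freq is a hypothetical knob."""
    counts = Counter(aligned_pairs)
    return {pair: n for pair, n in counts.items() if n >= min_freq}


pairs = language_pairs(LANGUAGES)

# Toy Swedish-Danish alignments, invented purely for illustration.
toy_alignments = (
    [("arbete", "arbejde")] * 3
    + [("skatt", "skat")] * 4
    + [("flytta", "bolig")]  # a one-off misalignment
)
kept = filter_candidates(toy_alignments, min_freq=3)
```

With five languages this yields ten unordered pairs, four of which involve Finnish. In the toy data the one-off misalignment falls below the threshold and is dropped, which is the general motivation for a frequency cut-off when corpora are sparse.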