Resource selection for domain-specific cross-lingual IR
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Domain adaptation of translation models for multilingual applications
Domain adaptation of translation models for multilingual applications
Hi-index | 0.00 |
Automatic subset selection from a parallel corpus significantly cross-lingual information retrieval (CLIR) performance, in addition to increasing its efficiency. Our selection method extracts relevant training data by incorporating additional criteria (i.e. estimated corpus quality, taxonomy projection and size) in addition to lexical-based criteria. The challenge lies in combining these criteria using a meaningful scoring function that can be used for ranking parallel sentence candidates. We choose weighted geometric mean for its soft-AND properties, and we optimize criteria weights by wrapping the CLIR task in an optimization shell. Due to the indeterminate nature of the search space convexity properties, we have explored continuous reactive tabu search (CRTS), a global optimization method. We use a large parallel corpus in the medical domain to examine the effect of adaptation criteria and their combination on CLIR performance. In our experiments, 100 selected sentences yield 90% of the performance obtained with 5,000 times more in-domain parallel sentences. Our optimized criteria weights considerably outperform the uniform distribution baseline, as well as lexical similarity adaptation.