A statistical approach to crosslingual natural language tasks

Authors:
David Pinto;Jorge Civera;Alberto Barrón-Cedeño;Alfons Juan;Paolo Rosso
Affiliations:
Facultad de Ciencias de la Computación, Benemérita Universidad Autónoma de Puebla, Mexico and Departamento de Sistemas Informáticos y Computación, Universidad Politéc ...;Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Spain;Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Spain;Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Spain;Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Spain
Venue:
Journal of Algorithms
Year:
2009


Abstract

The existence of huge volumes of documents written in multiple languages on the Internet calls for novel algorithmic approaches to handle information of this kind. However, most crosslingual natural language processing (NLP) tasks take a decoupled approach in which monolingual NLP techniques are applied along with an independent translation process. This two-step approach is highly sensitive to translation errors and, more generally, to the accumulation of errors across steps. To address this problem, we propose a direct probabilistic crosslingual NLP system that integrates both steps, translation and the specific NLP task, into a single one. To implement this integrated approach, we use the statistical IBM 1 word alignment model (M1). The M1 model may exhibit non-monotonic behaviour when aligning words from a sentence in a source language to words from a sentence in a different, target language, as happens for languages with different word order. In English, for instance, adjectives appear before nouns, whereas in Spanish the order is the opposite. The successful experimental results reported for three different tasks (text classification, information retrieval and plagiarism analysis) highlight the benefits of the statistical integrated approach proposed in this work.
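
To make the role of the M1 model more concrete, the following is a minimal sketch of IBM Model 1 trained with EM and then used to score a target-language text directly against a source-language text, with no separate translation step. It is an illustration under stated assumptions, not the authors' implementation: the function names `train_ibm1` and `likelihood`, the `<null>` token, and the toy English-Spanish corpus are all hypothetical, and the task-specific models used in the paper are not described in the abstract.

```python
from collections import defaultdict

NULL = "<null>"  # hypothetical empty source word that target words may align to


def train_ibm1(pairs, iterations=10):
    """Estimate t(target_word | source_word) with EM (IBM Model 1).

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    Returns a dict-like table keyed by (target_word, source_word).
    """
    target_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialisation

    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f, e)
        total = defaultdict(float)  # expected counts c(e)
        for src, tgt in pairs:
            src = [NULL] + src
            for f in tgt:
                # E-step: posterior over which source word generated f.
                # Word positions are ignored, so any word order is allowed.
                z = sum(t[(f, e)] for e in src)
                for e in src:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise expected counts per source word.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


def likelihood(t, src, tgt):
    """p(tgt | src) under Model 1, up to a length-dependent constant."""
    src = [NULL] + src
    p = 1.0
    for f in tgt:
        p *= sum(t[(f, e)] for e in src) / len(src)
    return p


# Toy parallel corpus (assumed for illustration only).
pairs = [
    (["the", "house"], ["la", "casa"]),
    (["the", "book"], ["el", "libro"]),
    (["a", "house"], ["una", "casa"]),
]
t = train_ibm1(pairs)

# Rank two Spanish candidates against an English query in a single step.
candidates = [["casa"], ["libro"]]
ranked = sorted(candidates, key=lambda d: likelihood(t, ["house"], d), reverse=True)
```

Because Model 1 sums over every source position with equal weight, the score does not depend on word order, which is why differences such as adjective-noun ordering between English and Spanish do not need special handling in this sketch. The same `likelihood` score could, under the same assumptions, serve as a class-conditional score for classification or as a similarity score for plagiarism analysis.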