The huge volume of documents written in multiple languages on the Internet calls for novel algorithmic approaches to handle information of this kind. However, most crosslingual natural language processing (NLP) tasks take a decoupled approach in which monolingual NLP techniques are applied alongside an independent translation process. This two-step approach is too sensitive to translation errors and, more generally, to the cumulative effect of errors. To solve this problem, we propose a direct probabilistic crosslingual NLP system that integrates both steps, translation and the specific NLP task, into a single one. To carry out this integrated approach to crosslingual tasks, we use the statistical IBM Model 1 word alignment model (M1). The M1 model can produce non-monotonic alignments between the words of a sentence in a source language and the words of a sentence in a different, target language. This occurs for language pairs with different word order. In English, for instance, adjectives usually precede nouns, whereas in Spanish they usually follow them. The successful experimental results reported for three different tasks (text classification, information retrieval and plagiarism analysis) highlight the benefits of the integrated statistical approach proposed in this work.
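As an illustration of the kind of model involved, the following is a minimal sketch of IBM Model 1 trained with EM on a toy Spanish-English corpus, followed by a position-independent score of a source word sequence against a target one. It is not the system described above: the function names `train_m1` and `m1_log_score`, the toy sentence pairs, the `floor` value for unseen word pairs, and the omission of the NULL word and of the sentence-length term are all assumptions made for brevity.

```python
import math
from collections import defaultdict

# Toy parallel corpus (hypothetical): (source tokens, target tokens).
PAIRS = [
    (["la", "casa"], ["the", "house"]),
    (["la", "casa", "blanca"], ["the", "white", "house"]),
    (["la", "flor"], ["the", "flower"]),
    (["flor", "blanca"], ["white", "flower"]),
]

def train_m1(pairs, iterations=20):
    """Estimate t[(f, e)] = p(f | e) with EM (simplified: no NULL word)."""
    # Uniform start over co-occurring pairs; the values are unnormalized,
    # which is harmless because the first E-step only uses ratios.
    t = {(f, e): 1.0 for src, tgt in pairs for f in src for e in tgt}
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of f aligned to e
        total = defaultdict(float)   # expected count of e being aligned
        for src, tgt in pairs:
            for f in src:
                z = sum(t[(f, e)] for e in tgt)
                for e in tgt:
                    post = t[(f, e)] / z   # posterior that f aligns to e
                    count[(f, e)] += post
                    total[e] += post
        # M-step: renormalize so that sum over f of p(f | e) is 1.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t

def m1_log_score(src, tgt, t, floor=1e-9):
    """log of prod_j (1/|tgt|) * sum_i p(src_j | tgt_i).

    Word positions play no role, so reordering either side leaves the
    score unchanged. `floor` is an assumed smoothing value for unseen pairs.
    """
    return sum(
        math.log(sum(t.get((f, e), floor) for e in tgt) / len(tgt))
        for f in src
    )

t = train_m1(PAIRS)
```

Because the score ignores word positions, `m1_log_score(["casa", "blanca"], ["white", "house"], t)` equals the score with either side reordered (up to floating-point rounding), which is the non-monotonic behaviour mentioned above. In a crosslingual retrieval setting, one could rank target-language documents by such a score for a source-language query, though how the score is combined with each task is not specified here.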