Language detection and tracking in multilingual documents using weak estimators

Authors:
Aleksander Stensby;B. John Oommen;Ole-Christoffer Granmo
Affiliations:
Dept. of ICT, University of Agder, Grimstad, Norway;Dept. of ICT, University of Agder, Grimstad, Norway and School of Computer Science, Carleton University, Ottawa, Canada;Dept. of ICT, University of Agder, Grimstad, Norway
Venue:
SSPR&SPR'10 Proceedings of the 2010 joint IAPR international conference on Structural, syntactic, and statistical pattern recognition
Year:
2010

Citing 5
Cited 2

The automatic identification of languages using linguistic recognition signals

Unsupervised segmentation of words using prior distributions of morph length and frequency

ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
Unsupervised discovery of morphemes

MPL '02 Proceedings of the ACL-02 workshop on Morphological and phonological learning - Volume 6
Stochastic learning-based weak estimation of multinomial random variables and its applications to pattern recognition in non-stationary environments

Pattern Recognition
Language identification in multi-lingual web-documents

NLDB'06 Proceedings of the 11th international conference on Applications of Natural Language to Information Systems

Tracking the preferences of users using weak estimators

AI'11 Proceedings of the 24th international conference on Advances in Artificial Intelligence
A stochastic search on the line-based solution to discretized estimation

IEA/AIE'12 Proceedings of the 25th international conference on Industrial Engineering and Other Applications of Applied Intelligent Systems: advanced research in applied artificial intelligence

Abstract

This paper deals with the extremely complicated problem of language detection and tracking in real-life electronic applications (for example, Word-of-Mouth (WoM) applications), where various segments of the text are written in different languages. The difficulties in solving the problem are manifold. First, the analyst has no knowledge of when one language stops and the next starts. Further, the features used for any one language (for example, the n-grams) will not be valid for recognizing another. Finally, and most importantly, in most real-life applications, such as WoM, the fragments of text available before a switch are so small that classification using traditional estimation methods becomes almost meaningless. Earlier, the authors of [10] had argued that, for a variety of problems, the use of strong estimators (i.e., estimators that converge with probability 1) is sub-optimal. In this vein, we propose to solve the current problem using novel estimators that are pertinent to non-stationary environments. The classification results, which involve as many as 8 languages, demonstrate that our proposed methodology is both powerful and efficient.
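The abstract does not spell out the estimator itself, so the following is only a minimal sketch of the general idea of a weak estimator for a multinomial distribution in a non-stationary stream. It assumes a multiplicative update in which every probability is shrunk by a factor `lam` and the freed mass goes to the symbol just observed. The names `slwe_update`, `track`, and `lam`, as well as the value 0.9, are hypothetical choices made for illustration and are not taken from the paper.

```python
def slwe_update(probs, observed, lam=0.9):
    """One weak-estimator step (illustrative assumption, not the paper's code).

    Every component is scaled by lam, and the remaining mass (1 - lam)
    is added to the component of the observed symbol, so the estimates
    still sum to 1 and never settle permanently on old data.
    """
    updated = [lam * p for p in probs]
    updated[observed] += 1.0 - lam
    return updated


def track(stream, num_symbols, lam=0.9):
    """Run the update over a stream of symbol indices, returning every estimate."""
    probs = [1.0 / num_symbols] * num_symbols  # start from a uniform guess
    history = []
    for symbol in stream:
        probs = slwe_update(probs, symbol, lam)
        history.append(probs)
    return history


if __name__ == "__main__":
    # Toy stream: the source emits symbol 0 for a while, then switches to symbol 1.
    stream = [0] * 200 + [1] * 200
    history = track(stream, num_symbols=2)
    print("before switch:", history[199])
    print("after switch: ", history[-1])
```

Because old observations are discounted geometrically, the estimate follows the switch from one source to the other instead of averaging over the whole stream, which is the property a non-stationary setting such as language switching calls for. A smaller `lam` reacts faster to a switch but gives noisier estimates.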