A high performance centroid-based classification approach for language identification

Authors:
Hidayet Takçı;Tunga Güngör
Affiliations:
Department of Computer Engineering, GYTE, Kocaeli 41400, Turkey;Department of Computer Engineering, Boğaziçi University, İstanbul 34342, Turkey
Venue:
Pattern Recognition Letters
Year:
2012

Abstract

Centroid-based classification is a machine learning approach used in the text classification domain. The main advantage of centroid-based classifiers is their efficiency in both the training stage and the classification stage. However, their success rate can be lower than that of other classifiers if good centroid values are not used. In this paper, we apply the centroid-based classification method to the language identification problem, which can be considered a sub-problem of text classification. We propose a novel method, called inverse class frequency, that increases the quality of the centroid values by updating the classical values. We also use a feature set formed of individual characters rather than words or n-gram sequences to decrease the training and classification times. The experiments were performed on the ECI/MCI corpus, and the method was compared with other methods and previous studies. The results show that the proposed approach yields high success rates and works very efficiently for language identification.
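
To make the idea in the abstract concrete, the following is a minimal sketch of a character-level centroid classifier whose centroid values are reweighted by an inverse class frequency (ICF) factor. It is not the authors' implementation. The abstract does not give the ICF formula, so the weight log(1 + N / cf) used here (N classes, cf the number of classes whose centroid contains the character) is an assumption, as are the function names `char_profile`, `train_centroids`, `cosine`, and `identify`, and the choice of cosine similarity.

```python
import math
from collections import Counter, defaultdict


def char_profile(text):
    """Relative frequency of each alphabetic character in a text."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def train_centroids(samples):
    """Build one ICF-weighted centroid per language.

    samples maps a language name to a list of training texts.
    Returns (centroids, icf). The ICF formula is an assumed form,
    not taken from the paper.
    """
    centroids = {}
    for lang, texts in samples.items():
        summed = defaultdict(float)
        for text in texts:
            for ch, freq in char_profile(text).items():
                summed[ch] += freq
        # Classical centroid: mean character frequency over the class.
        centroids[lang] = {ch: v / len(texts) for ch, v in summed.items()}

    n_classes = len(centroids)
    # Number of classes in which each character occurs at all.
    class_freq = Counter(ch for c in centroids.values() for ch in c)
    # Assumed ICF weight: characters found in few classes count more.
    icf = {ch: math.log(1 + n_classes / cf) for ch, cf in class_freq.items()}

    weighted = {
        lang: {ch: v * icf[ch] for ch, v in c.items()}
        for lang, c in centroids.items()
    }
    return weighted, icf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(text, centroids, icf):
    """Return the language whose centroid is most similar to the text."""
    # Characters never seen in training get weight 0.
    profile = {ch: v * icf.get(ch, 0.0) for ch, v in char_profile(text).items()}
    return max(centroids, key=lambda lang: cosine(profile, centroids[lang]))
```

With training texts grouped by language, `train_centroids(samples)` returns the weighted centroids and the ICF table, and `identify(text, centroids, icf)` picks the closest centroid. Because the feature set is the alphabet rather than a word or n-gram vocabulary, each centroid has only a few dozen entries, which is consistent with the abstract's point about short training and classification times.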