Centroid-based classification is a machine learning approach used in the text classification domain. The main advantage of centroid-based classifiers is their efficiency in both the training stage and the classification stage. However, their success rate can be lower than that of other classifiers if good centroid values are not used. In this paper, we apply the centroid-based classification method to the language identification problem, which can be considered a sub-problem of text classification. We propose a novel method, named inverse class frequency, that improves the quality of the centroid values by updating the classical values. We also use a feature set formed of individual characters, rather than words or n-gram sequences, to decrease the training and classification times. The experiments were performed on the ECI/MCI corpus, and the method was compared with other methods and previous studies. The results showed that the proposed approach yields high success rates and works very efficiently for language identification.
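The abstract does not state the formulas used, so the following is only a minimal sketch of how a character-level centroid classifier with an inverse-class-frequency weight might look. The function names, the weight `log(1 + C / cf)` (where `C` is the number of classes and `cf` is the number of classes whose centroid contains the character), and the use of cosine similarity are all assumptions made for illustration, not the authors' definitions.

```python
import math
from collections import Counter, defaultdict


def char_vector(text):
    """Relative frequency of each non-whitespace character in the text."""
    counts = Counter(ch for ch in text.lower() if not ch.isspace())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def train_centroids(samples):
    """Build one weighted centroid per language.

    samples maps a language label to a list of training texts.
    The ICF weight log(1 + C / cf) is an assumed form, not the paper's.
    """
    centroids = {}
    for lang, texts in samples.items():
        acc = defaultdict(float)
        for text in texts:
            for ch, freq in char_vector(text).items():
                acc[ch] += freq
        # Classical centroid: mean of the document vectors of the class.
        centroids[lang] = {ch: v / len(texts) for ch, v in acc.items()}

    n_classes = len(centroids)
    # Number of classes whose centroid contains each character.
    class_freq = Counter(ch for c in centroids.values() for ch in c)
    icf = {ch: math.log(1 + n_classes / cf) for ch, cf in class_freq.items()}

    # Update the classical values: characters seen in few classes weigh more.
    return {
        lang: {ch: v * icf[ch] for ch, v in c.items()}
        for lang, c in centroids.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, centroids):
    """Return the language whose centroid is most similar to the text."""
    vec = char_vector(text)
    return max(centroids, key=lambda lang: cosine(vec, centroids[lang]))
```

Because the feature set is limited to single characters, each vector holds at most as many entries as the alphabet, so training is one pass over the text and classification is one similarity computation per language. This is consistent with the efficiency argument in the abstract, although the sketch makes no claim about the reported success rates.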