A Novel Multilingual Text Categorization System using Latent Semantic Indexing

Authors:
Chung-Hong Lee;Hsin-Chang Yang;Sheng-Min Ma
Affiliations:
National Kaohsiung University of Applied Sciences, Taiwan;Chang Jung University, Taiwan;National Kaohsiung University of Applied Sciences, Taiwan
Venue:
ICICIC '06 Proceedings of the First International Conference on Innovative Computing, Information and Control - Volume 2
Year:
2006

Abstract

Latent Semantic Indexing (LSI) is a well-known technique in information retrieval, particularly for handling polysemy and synonymy. LSI uses singular value decomposition (SVD) to decompose the original term-document matrix into a triplet of lower-dimensional matrices. These matrices approximate the original matrix and capture the latent semantic relations between terms. In this paper, we propose a novel method for multilingual text categorization using LSI. The centroid of each class is calculated in the SVD-reduced space, and a similarity threshold for categorization is predefined for each centroid. Test documents whose similarity to a centroid exceeds the threshold are labeled "Positive" (relevant); all others are labeled "Negative" (non-relevant). Experimental results indicate that LSI categorizes multilingual text well in terms of precision, recall, and F1: with our algorithm, the F1 measure averages 70% and precision can reach 80%.
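
The pipeline the abstract describes (truncated SVD, class centroids in the reduced space, threshold-based labeling) can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the toy matrix, the choice of k = 2, the cosine similarity measure, the standard LSI fold-in projection, the 0.5 threshold, and all function names are assumptions that do not come from the paper.

```python
import numpy as np

def lsi_fit(term_doc, k):
    """Rank-k truncated SVD of a (terms x documents) matrix."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def fold_in(doc_vec, U_k, s_k):
    """Map a term-space vector into the k-dim latent space: S_k^-1 U_k^T d."""
    return (U_k.T @ doc_vec) / s_k

def cosine(a, b):
    """Cosine similarity (assumed here; the abstract does not name the measure)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def categorize(doc_vec, U_k, s_k, centroid, threshold):
    """Label a document against a single class centroid."""
    sim = cosine(fold_in(doc_vec, U_k, s_k), centroid)
    return "Positive" if sim > threshold else "Negative"

# Toy term-document matrix (6 terms x 5 documents); the values are made up.
# Documents 0-2 use terms 0-2 (class A); documents 3-4 use terms 3-5.
X = np.array([
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
], dtype=float)

U_k, s_k, Vt_k = lsi_fit(X, k=2)

# Columns of Vt_k are the training documents expressed in the latent space,
# so the class A centroid is the mean of its first three columns.
centroid_a = Vt_k[:, :3].mean(axis=1)
THRESHOLD_A = 0.5  # assumed value; the paper predefines one threshold per centroid

test_on_topic = np.array([0, 1, 0, 0, 0, 0], dtype=float)   # term 1 only
test_off_topic = np.array([0, 0, 0, 1, 1, 1], dtype=float)  # terms 3-5 only

print(categorize(test_on_topic, U_k, s_k, centroid_a, THRESHOLD_A))
print(categorize(test_off_topic, U_k, s_k, centroid_a, THRESHOLD_A))
```

On this toy data the first test document is labeled "Positive" and the second "Negative". Note that the first document shares only one term with the class A training documents, yet it still lands on the class A direction in the latent space, which is the co-occurrence effect LSI is meant to capture.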