A comparative study of TF*IDF, LSI and multi-words for text classification

Authors:
Wen Zhang;Taketoshi Yoshida;Xijin Tang
Affiliations:
Laboratory for Internet Software Technologies, Institute of Software, Chinese Academy of Sciences, Beijing 100190, PR China;School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa 923-1292, Japan;Institute of Systems Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, PR China
Venue:
Expert Systems with Applications: An International Journal
Year:
2011

Abstract

One of the main themes in text mining is text representation, which is fundamental and indispensable for text-based intelligent information processing. Generally, text representation includes two tasks: indexing and weighting. This paper comparatively studies TF*IDF, LSI and multi-words for text representation. We evaluated the three methods in information retrieval and text categorization on a Chinese and an English document collection. Experimental results demonstrate that in text categorization, LSI performs better than the other methods on both document collections. LSI also produced the best performance in retrieving English documents. This outcome shows that LSI has both favorable semantic and statistical quality, and it differs from the claim that LSI cannot produce discriminative power for indexing.
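
To make the two tasks named in the abstract concrete, the following is a minimal sketch of indexing and weighting with TF*IDF. It is not the authors' implementation: the abstract does not state which TF*IDF variant was used, so the common form tf * log(N / df) is assumed here, and the whitespace tokenizer, the function name `tfidf`, and the sample documents are all hypothetical choices for illustration. Whitespace splitting would not be suitable for Chinese text.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: tf * log(N / df), where tf is the raw term count,
    N is the number of documents, and df is the number of documents
    containing the term. This variant is an assumption, not the paper's.
    """
    # Indexing step: decide which terms represent each document.
    # Lowercased whitespace tokens are a simplifying assumption.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: count each term once per document.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Weighting step: scale term frequency by inverse document frequency.
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


# Hypothetical sample documents, for illustration only.
docs = [
    "text mining uses text representation",
    "latent semantic indexing reduces dimensions",
    "text categorization assigns labels",
]
weights = tfidf(docs)
```

Under this weighting, a term that occurs in every document receives weight zero, while a term confined to one document receives the largest inverse document frequency factor.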