Text classification based on multi-word with support vector machine

Authors:
Wen Zhang;Taketoshi Yoshida;Xijin Tang
Affiliations:
School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Ashahidai, Tatsunokuchi, Ishikawa 923-1292, Japan;School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Ashahidai, Tatsunokuchi, Ishikawa 923-1292, Japan;Institute of Systems Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, PR China
Venue:
Knowledge-Based Systems
Year:
2008

Abstract

Text representation is one of the main themes underpinning text mining: the task is to find appropriate terms for converting documents into numerical vectors. Recently, much effort has been invested in enriching text representation under the vector space model (VSM) to improve the performance of text mining techniques such as text classification and text clustering. This paper investigates how effective multi-words are for text representation, as measured by text classification performance. First, a practical method is proposed for extracting multi-words from documents based on their syntactical structure. Second, two strategies, general concept representation and subtopic representation, are presented for representing documents with the extracted multi-words. In particular, a dynamic k-mismatch is proposed to determine the presence of a long multi-word that is a subtopic of a document's content. Finally, we carried out a series of experiments on classifying the Reuters-21578 documents using the multi-word representations. The representation in individual words, which has the largest feature set dimension and uses no linguistic preprocessing, serves as the baseline. Moreover, the linear kernel and the non-linear polynomial kernel of support vector machines (SVM) are compared to investigate the effect of kernel type on classification performance. Index terms with low information gain (IG) are removed from the feature set at different percentages to observe the robustness of each classification method. Our experiments demonstrate that, among the multi-word representations, subtopic representation outperforms general concept representation, and that the linear kernel outperforms the non-linear kernel of SVM in classifying the Reuters data. The choice of representation strategy affects classification performance more than the choice of SVM kernel does. Furthermore, the representation using individual words outperforms any representation using multi-words. This is consistent with the prevailing view on the role of linguistic preprocessing of document features when using SVM for text classification.
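
The IG-based pruning step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each document is a set of terms, treats a term as a binary present/absent feature, and uses the standard definition of information gain (class entropy minus the expected class entropy after splitting on the term). The function names (entropy, information_gain, select_terms), the keep_fraction parameter, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in the document'.

    docs is a list of term sets and labels the matching class labels
    (an assumed input format, not one taken from the paper).
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def select_terms(docs, labels, keep_fraction):
    """Rank the vocabulary by IG and keep the top keep_fraction of it."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    keep = max(1, int(round(len(ranked) * keep_fraction)))
    return ranked[:keep]


# Toy data, invented for illustration only.
docs = [{"oil", "price"}, {"oil", "barrel"},
        {"wheat", "price"}, {"wheat", "crop"}]
labels = ["crude", "crude", "grain", "grain"]
kept = select_terms(docs, labels, keep_fraction=0.4)
```

In this toy example "oil" and "wheat" separate the two classes perfectly, so they rank highest, while "price" occurs equally in both classes and carries no information. Varying keep_fraction mirrors the idea of removing low-IG index terms at different percentages before training a classifier.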