Free-gram phrase identification for modeling Chinese text

Authors:
Xi Peng;Zhang Yi;Xiao-Yong Wei;De-Zhong Peng;Yong-Sheng Sang
Affiliations:
Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu, 610065, China (all authors)
Venue:
Information Processing Letters
Year:
2013

Abstract

The vector space model built on a bag of phrases plays an important role in modeling Chinese text. However, the conventional way of identifying free-length phrases, scanning with fixed-length grams, is costly. To address this problem, we propose a novel approach to key phrase identification that is capable of identifying phrases of all lengths and thus improves the coding efficiency and discrimination of the data representation. In the proposed method, we first convert each document into a context graph, a directed graph that encapsulates the statistical and positional information of all the 2-word strings in the document. We treat every transmission path in the graph as a hypothesis for a phrase, and select the corresponding phrase as a candidate phrase if the hypothesis is valid in the original document. Finally, we selectively divide some of the complex candidate phrases into sub-phrases to improve the coding efficiency, resulting in a set of phrases for codebook construction. Experiments on both balanced and unbalanced datasets show that the codebooks generated by our approach are more efficient than those generated by conventional methods (one syntactical method and three statistical methods are investigated). Furthermore, the data representation created by our approach demonstrates higher discrimination than those created by conventional methods in classification tasks.
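
The first two steps of the abstract (building a context graph from 2-word strings, then checking path hypotheses against the document) can be illustrated with a minimal sketch. It is not the authors' implementation. Several things here are assumptions made for illustration: the document is already segmented into words, a hypothesis counts as "valid" when the word sequence occurs contiguously at least `min_count` times, and the names `build_context_graph`, `candidate_phrases`, `min_count`, and `max_len` are hypothetical. The final step of dividing complex candidates into sub-phrases is not shown.

```python
from collections import defaultdict


def build_context_graph(tokens):
    """Directed graph over words: graph[a][b] lists every position
    where the 2-word string (a, b) starts in the document."""
    graph = defaultdict(lambda: defaultdict(list))
    for i in range(len(tokens) - 1):
        graph[tokens[i]][tokens[i + 1]].append(i)
    return graph


def candidate_phrases(tokens, min_count=2, max_len=6):
    """Grow paths in the context graph and keep those that really occur
    contiguously in the document at least min_count times (assumed rule)."""
    graph = build_context_graph(tokens)
    found = {}

    # Seed the search with every sufficiently frequent edge.
    stack = []
    for a, successors in graph.items():
        for b, positions in successors.items():
            if len(positions) >= min_count:
                stack.append(((a, b), positions))

    while stack:
        phrase, starts = stack.pop()
        found[phrase] = len(starts)
        if len(phrase) >= max_len:
            continue
        last = phrase[-1]
        for nxt, edge_positions in graph.get(last, {}).items():
            edge_set = set(edge_positions)
            # The longer path is valid only where the edge (last, nxt)
            # starts exactly at the final word of an existing occurrence.
            kept = [s for s in starts if s + len(phrase) - 1 in edge_set]
            if len(kept) >= min_count:
                stack.append((phrase + (nxt,), kept))
    return found


if __name__ == "__main__":
    # A toy, pre-segmented document (segmentation is assumed to be done).
    doc = "机器 学习 方法 和 机器 学习 模型 和 机器 学习 方法".split()
    for phrase, count in sorted(candidate_phrases(doc).items()):
        print("".join(phrase), count)
```

Because each path is extended one edge at a time and pruned by the stored positions, phrases of different lengths come out of a single pass over the graph rather than from a separate scan for each fixed gram size.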