Domain-specific Chinese word segmentation using suffix tree and mutual information

Authors:
Daniel Zeng;Donghua Wei;Michael Chau;Feiyue Wang
Affiliations:
Chinese Academy of Sciences, Institute of Automation, Beijing, China and The University of Arizona, Tucson, USA;Chinese Academy of Sciences, Institute of Automation, Beijing, China;The University of Hong Kong, Hong Kong, China;Chinese Academy of Sciences, Institute of Automation, Beijing, China and The University of Arizona, Tucson, USA
Venue:
Information Systems Frontiers
Year:
2011

Abstract

As the amount of online Chinese content grows, there is a critical need for effective Chinese word segmentation approaches to support Web computing applications in a range of domains, including terrorism informatics. Most existing Chinese word segmentation approaches are either statistics-based or dictionary-based. Purely statistical methods have lower precision, while purely dictionary-based methods cannot handle new words that are not in the dictionary. In this paper, we propose a hybrid method that avoids the limitations of both types of approaches. By combining a suffix tree and mutual information (MI) with the dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training is available. It can also identify new words through MI-based token merging and dictionary updating. In addition, with the proposed Improved Bigram method, IASeg can process N-grams. To evaluate the performance of our segmenter, we compare it with two well-known systems, the Hylanda segmenter and the ICTCLAS segmenter, using a terrorism-centric corpus and a general corpus. The experimental results show that IASeg outperforms the benchmarks in both precision and recall on the domain-specific corpus and achieves comparable performance on the general corpus.
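
The idea of MI-based token merging mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula or merging procedure used by IASeg, so everything below is an assumption made for illustration: the score is standard pointwise mutual information, log(p(xy) / (p(x) p(y))), estimated from raw counts; the merge is a simple greedy left-to-right pass; and the names `pmi_table`, `merge_by_pmi`, and `threshold` are hypothetical. No suffix tree or dictionary is used here.

```python
import math
from collections import Counter


def pmi_table(tokenized_sentences):
    """Estimate pointwise mutual information for adjacent token pairs.

    Assumed formula (not taken from the paper): log(p(xy) / (p(x) * p(y))),
    with probabilities estimated from raw unigram and bigram counts.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in tokenized_sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores


def merge_by_pmi(tokens, scores, threshold):
    """Greedily merge adjacent tokens, left to right, when their PMI
    exceeds the (hypothetical) threshold. Unseen pairs are never merged."""
    merged = []
    i = 0
    while i < len(tokens):
        pair_score = float("-inf")
        if i + 1 < len(tokens):
            pair_score = scores.get((tokens[i], tokens[i + 1]), float("-inf"))
        if pair_score > threshold:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy usage: tokens that co-occur more often than chance are merged
# into one candidate word, which could then be added to a dictionary.
corpus = [["a", "b", "c"], ["a", "b", "d"], ["c", "d", "a", "b"]]
scores = pmi_table(corpus)
candidate = merge_by_pmi(["c", "d", "a", "b"], scores, threshold=1.4)
```

In a hybrid setting such as the one the abstract describes, the initial tokens would come from a dictionary-based pass and the merged tokens would be candidate new words; how IASeg actually chooses thresholds or updates its dictionary is not stated in this record.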