A survey of Chinese text similarity computation

Authors:
Xiuhong Wang;Shiguang Ju;Shengli Wu
Affiliations:
Jiangsu University, Zhenjiang, China;Jiangsu University, Zhenjiang, China;University of Ulster, Northern Ireland, UK
Venue:
AIRS'08 Proceedings of the 4th Asia information retrieval conference on Information retrieval technology
Year:
2008



Abstract

Chinese text has no natural delimiter between words. Moreover, Chinese is a semotactic language with complicated structures that focus on semantics. These differences from Western languages make Chinese word segmentation more difficult and pose further challenges for Chinese natural language understanding. Computing Chinese text similarity with high precision and recall at low cost is an important but challenging task that researchers have studied for a long time. In this paper, we examine existing Chinese text similarity measures, including statistics-based and semantics-based measures. Our work provides insight into the advantages and disadvantages of each method, including the tradeoffs between effectiveness and efficiency. Directions for future work are also discussed.
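
As an illustration of what a statistics-based measure can look like, the following minimal sketch computes cosine similarity over overlapping character bigrams, one common way to sidestep word segmentation. It is not taken from the paper: the function names (char_bigrams, cosine_similarity, text_similarity) and the choice of bigrams as features are assumptions made purely for this example.

```python
import math
from collections import Counter


def char_bigrams(text: str) -> Counter:
    """Count overlapping character bigrams, ignoring whitespace.

    Assumption for this sketch: bigrams stand in for segmented words.
    """
    chars = [c for c in text if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text has no bigrams
    return dot / (norm_a * norm_b)


def text_similarity(s1: str, s2: str) -> float:
    """Hypothetical helper: bigram-based cosine similarity of two texts."""
    return cosine_similarity(char_bigrams(s1), char_bigrams(s2))
```

A measure of this kind is cheap to compute but sees only surface overlap, so two texts that express the same meaning with different characters would score low. That is the sort of effectiveness versus efficiency tradeoff the abstract refers to.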