Combining an order-semisensitive text similarity and closest fit approach to textual missing values in knowledge discovery

Authors:
Yi Feng;Zhaohui Wu;Zhongmei Zhou
Affiliations:
College of Computer Science, Zhejiang University, Hangzhou, P.R. China;College of Computer Science, Zhejiang University, Hangzhou, P.R. China;College of Computer Science, Zhejiang University, Hangzhou, P.R. China
Venue:
KES'05 Proceedings of the 9th international conference on Knowledge-Based Intelligent Information and Engineering Systems - Volume Part II
Year:
2005

Abstract

Textual information is now ubiquitous, which makes it an important resource for knowledge discovery. In real-life applications, however, effective use of textual material is often hampered by incomplete data. In this paper, we apply a closest fit approach to missing values in textual attributes. To evaluate the closeness of texts in this setting, we take an order perspective on text similarity and propose a hybrid order-semisensitive measure, M-similarity, to capture the proximity of texts. This measure combines single item matching, maximum sequence matching, and potential matching, and strikes a balance between use of sequence information and efficiency. We incorporate M-similarity into two closest fit methods for missing values in textual attributes and evaluate them on Traditional Chinese Medicine (TCM) data sets. Experimental results illustrate the effectiveness of these methods with M-similarity.
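
The abstract does not give the definition of M-similarity, so the following Python sketch is only an illustration of the general idea, not the authors' measure. It assumes a stand-in similarity that mixes an order-insensitive part (Dice overlap of token sets) with an order-sensitive part (the longest shared contiguous token run), and it omits potential matching entirely. The function names, the weight `w_order`, and the record fields `text` and `value` are all hypothetical. The closest fit step copies the missing value from the most similar complete record.

```python
from difflib import SequenceMatcher


def item_overlap(a, b):
    """Order-insensitive part: Dice coefficient over the two token sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def longest_run(a, b):
    """Order-sensitive part: longest contiguous token run shared by a and b,
    normalised by the length of the shorter token list."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / min(len(a), len(b))


def hybrid_similarity(a, b, w_order=0.5):
    """Stand-in for an order-semisensitive measure (NOT the paper's M-similarity).
    w_order is a hypothetical weight on the order-sensitive part."""
    return (1 - w_order) * item_overlap(a, b) + w_order * longest_run(a, b)


def closest_fit_fill(records, known="text", target="value", w_order=0.5):
    """Fill each missing `target` with the value from the complete record whose
    `known` text is most similar. Field names are hypothetical."""
    complete = [r for r in records if r.get(target) is not None]
    filled = []
    for record in records:
        record = dict(record)  # copy so the input list is left unchanged
        if record.get(target) is None and complete:
            tokens = record[known].split()
            best = max(
                complete,
                key=lambda c: hybrid_similarity(tokens, c[known].split(), w_order),
            )
            record[target] = best[target]
        filled.append(record)
    return filled


if __name__ == "__main__":
    records = [
        {"text": "a b c d", "value": "X"},
        {"text": "d c b a", "value": "Y"},
        {"text": "a b c e", "value": None},
    ]
    print(closest_fit_fill(records)[2]["value"])  # prints X
```

In this toy input, both complete records share the same three tokens with the incomplete one, so a purely set-based comparison cannot separate them. The order-sensitive term raises the score of the first record to 0.75 against 0.5 for the reversed one, which is why its value is chosen.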