Similarity Model and Term Association For Document Categorization

Authors:
Huaizhong Kou;Georges Gardarin
Venue:
DEXA '02 Proceedings of the 13th International Workshop on Database and Expert Systems Applications
Year:
2002

Citing 0
Cited 4

A fuzzy clustering approach for finding similar documents using a novel similarity measure

Expert Systems with Applications: An International Journal
A new approach on search for similar documents with multiple categories using fuzzy clustering

Expert Systems with Applications: An International Journal
Navigating among search results: an information content approach

WISE'07 Proceedings of the 8th international conference on Web information systems engineering
A ConceptLink graph for text structure mining

ACSC '09 Proceedings of the Thirty-Second Australasian Conference on Computer Science - Volume 91

Quantified Score

Hi-index	0.00

Abstract

Both Euclidean distance-based and cosine-based similarity models are widely used as measures of document similarity in the information retrieval and document categorization communities. Both models rest on the assumption that term vectors are orthogonal, that is, that terms are independent of one another. This assumption does not hold in practice, and as a result these models ignore term associations. In the context of document categorization, we analyze the properties of the term-document space, the term-category space, and the category-document space. Then, without assuming term independence, we propose a new mathematical model to estimate the association between terms and define a similarity model of documents based on it. We make as much use as possible of the existing category membership represented by the corpus, with the objective of improving categorization performance. Experiments were conducted with a k-NN classifier over the Reuters-5178 corpus. The empirical results show that using term association can improve the effectiveness of a categorization system and that the proposed similarity model outperforms models that ignore term association.
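
To illustrate why the orthogonality assumption matters, here is a minimal sketch that compares plain cosine similarity with a generalized cosine that weights term pairs by an association matrix. This is not the model proposed in the paper, whose definition the abstract does not give. The function names, the three-term vocabulary, and the association value 0.8 are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Plain cosine similarity; treats every pair of distinct terms as unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assoc_similarity(u, v, assoc):
    """Generalized cosine: u^T A v / sqrt((u^T A u) * (v^T A v)).

    `assoc` is a symmetric term-by-term association matrix A (hypothetical here).
    With A equal to the identity matrix this reduces to plain cosine similarity.
    """
    def bilinear(x, y):
        n = len(x)
        return sum(x[i] * assoc[i][j] * y[j] for i in range(n) for j in range(n))

    denom = math.sqrt(bilinear(u, u) * bilinear(v, v))
    return bilinear(u, v) / denom if denom else 0.0


# Hypothetical vocabulary: ["car", "automobile", "banana"]
doc_a = [1, 0, 0]  # mentions only "car"
doc_b = [0, 1, 0]  # mentions only "automobile"

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Assumed association: "car" and "automobile" are strongly related (0.8).
assoc = [[1.0, 0.8, 0.0], [0.8, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(cosine(doc_a, doc_b))                      # no shared term, so similarity is zero
print(assoc_similarity(doc_a, doc_b, identity))  # identity matrix gives the same result
print(assoc_similarity(doc_a, doc_b, assoc))     # associated terms now contribute
```

Under the orthogonality assumption the two documents share no term and score zero, while the association-weighted variant gives them credit for related terms. How the association values are estimated from the corpus is the contribution of the paper and is not reproduced here.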