On Effective Conceptual Indexing and Similarity Search in Text Data

Authors:
Charu C. Aggarwal; Philip S. Yu
Affiliations:
-;-
Venue:
ICDM '01 Proceedings of the 2001 IEEE International Conference on Data Mining
Year:
2001

Citing 0
Cited 8

A novel document similarity measure based on earth mover's distance
Information Sciences: an International Journal

An automated system for web portal personalization
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases

Using text classification and multiple concepts to answer e-mails
Expert Systems with Applications: An International Journal

A hybrid computational model for an automated image descriptor for visually impaired users
Computers in Human Behavior

Document similarity search based on generic summaries
AIRS'05 Proceedings of the Second Asia conference on Asia Information Retrieval Technology

Extract salient words with wordrank for effective similarity search in text data
WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering

On enhancing the performance of spam mail filtering system using semantic enrichment
AI'04 Proceedings of the 17th Australian joint conference on Advances in Artificial Intelligence

A conceptual representation of documents and queries for information retrieval systems by using light ontologies
Expert Systems with Applications: An International Journal

Abstract

Similarity search in text has proven to be an interesting problem from the qualitative perspective because of inherent redundancies and ambiguities in textual descriptions. The methods used in search engines in order to retrieve documents most similar to user-defined sets of keywords are not applicable to targets which are medium to large size documents, because of even greater noise effects stemming from the presence of a large number of words unrelated to the overall topic in the document. The inverted representation is the dominant method for indexing text, but it is not as suitable for document-to-document similarity search, as for short user-queries. One way of improving the quality of similarity search is Latent Semantic Indexing (LSI), which maps the documents from the original set of words to a concept space. Unfortunately, LSI maps the data into a domain in which it is not possible to provide effective indexing techniques. In this paper, we investigate new ways of providing conceptual search among documents by creating a representation in terms of conceptual word-chains. This technique also allows effective indexing techniques so that similarity queries can be performed on large collections of documents by accessing a small amount of data. We demonstrate that our scheme outperforms standard textual similarity search on the inverted representation both in terms of quality and search efficiency.
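
To make the baseline in the abstract concrete, the following is a minimal sketch of document-to-document similarity search over an inverted representation. It is an illustration only, not the authors' word-chain method: the function names (`build_inverted_index`, `similar_documents`), the whitespace tokenizer, the raw term-frequency weights, and the cosine measure are all assumptions chosen for brevity. The `lists_read` counter hints at why long target documents are costly in this setting: every distinct word in the target triggers a postings-list access.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each word to a postings list of (doc_id, term_frequency)."""
    index = defaultdict(list)
    norms = []  # Euclidean norm of each document's term-frequency vector
    for doc_id, text in enumerate(docs):
        counts = Counter(text.lower().split())  # assumed tokenizer: whitespace
        for word, tf in counts.items():
            index[word].append((doc_id, tf))
        norms.append(math.sqrt(sum(tf * tf for tf in counts.values())))
    return index, norms


def similar_documents(target, index, norms):
    """Rank indexed documents by cosine similarity to the target text.

    Returns (ranked, lists_read), where ranked is a list of
    (similarity, doc_id) pairs in descending order of similarity and
    lists_read is the number of postings lists that were accessed.
    """
    counts = Counter(target.lower().split())
    target_norm = math.sqrt(sum(tf * tf for tf in counts.values()))
    scores = defaultdict(float)
    lists_read = 0
    for word, tf in counts.items():
        if word in index:
            lists_read += 1  # one postings list per distinct target word
            for doc_id, doc_tf in index[word]:
                scores[doc_id] += tf * doc_tf
    ranked = sorted(
        ((dot / (target_norm * norms[doc_id]), doc_id)
         for doc_id, dot in scores.items()),
        reverse=True,
    )
    return ranked, lists_read


# Hypothetical toy collection, used only to exercise the sketch.
docs = [
    "latent semantic indexing maps documents to concepts",
    "inverted index for keyword search",
    "cooking pasta with tomato sauce",
]
index, norms = build_inverted_index(docs)
ranked, lists_read = similar_documents(
    "keyword search with an inverted index", index, norms
)
```

In this toy run the second document ranks first, but the unrelated word "with" still pulls the cooking document into the candidate set, a small-scale version of the noise effect the abstract describes for medium to large target documents.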