Dynamic user-defined similarity searching in semi-structured text retrieval

Authors:
Filippo Geraci;Marco Pellegrini
Affiliations:
Istituto di Informatica e Telematica, CNR, Pisa (Italy);Istituto di Informatica e Telematica, CNR, Pisa (Italy)
Venue:
Proceedings of the 3rd international conference on Scalable information systems
Year:
2008

Abstract

Modern text retrieval systems often provide a similarity search utility that allows the user to efficiently find a fixed number h of documents in the data set that are most similar to a given query (here a query is either a simple sequence of keywords or a full document). We consider the case of a textual database made of semi-structured documents. For example, in a corpus of bibliographic records, each record may be structured into three fields: title, authors and abstract, where each field is unstructured free text. Each field, in turn, may be modelled with a specific vector space. The problem becomes more complex when we also allow users to associate with each vector space, at query time, a weight influencing its contribution to the overall dynamic aggregated and weighted similarity. We investigate the use of metric k-center clustering to prune the search space at query time. We investigate embedding the weights in the data structure so that users can customize queries without any data replication. The validity of our approach is demonstrated experimentally by showing significant quality/time performance improvements over two state-of-the-art methods. We also speed up the pre-processing time by a factor of at least thirty with respect to a method based on k-means clustering.
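
The abstract does not spell out the algorithm, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes that each field is a small dense vector, that the per-field distance is the cosine distance, that the aggregated distance is a weighted sum of the per-field distances, and that the clustering uses the greedy furthest-point-first heuristic for k-center. It also assumes that clusters are built once with uniform weights while the user's weights are applied only at query time. All function and parameter names (`weighted_distance`, `k_center`, `search`, `probes`) are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 1.0
    return 1.0 - dot / (nu * nv)

def weighted_distance(x, y, weights):
    """Weighted sum of per-field distances; x and y map field name -> vector."""
    return sum(w * cosine_distance(x[f], y[f]) for f, w in weights.items())

def k_center(docs, k, weights):
    """Greedy furthest-point-first clustering (assumed heuristic).

    Returns (centers, assign): centers holds document indices, and
    assign[i] is the position in centers of the center closest to doc i.
    """
    centers = [0]  # arbitrary first center
    dist = [weighted_distance(d, docs[0], weights) for d in docs]
    assign = [0] * len(docs)
    while len(centers) < min(k, len(docs)):
        # The next center is the document furthest from all current centers.
        nxt = max(range(len(docs)), key=lambda i: dist[i])
        centers.append(nxt)
        for i, d in enumerate(docs):
            dn = weighted_distance(d, docs[nxt], weights)
            if dn < dist[i]:
                dist[i] = dn
                assign[i] = len(centers) - 1
    return centers, assign

def search(query, docs, centers, assign, weights, h, probes=1):
    """Scan only the `probes` clusters whose centers are nearest to the query
    under the user's weights, and return the indices of the h closest docs."""
    order = sorted(
        range(len(centers)),
        key=lambda c: weighted_distance(query, docs[centers[c]], weights),
    )
    visit = set(order[:probes])
    candidates = [i for i in range(len(docs)) if assign[i] in visit]
    candidates.sort(key=lambda i: weighted_distance(query, docs[i], weights))
    return candidates[:h]

# Toy corpus: three fields per record, each a 2-dimensional vector.
docs = [
    {"title": [1.0, 0.0], "authors": [1.0, 0.0], "abstract": [1.0, 0.0]},
    {"title": [0.9, 0.1], "authors": [1.0, 0.0], "abstract": [1.0, 0.0]},
    {"title": [0.0, 1.0], "authors": [0.0, 1.0], "abstract": [0.0, 1.0]},
    {"title": [0.1, 0.9], "authors": [0.0, 1.0], "abstract": [0.0, 1.0]},
]
uniform = {"title": 1 / 3, "authors": 1 / 3, "abstract": 1 / 3}
centers, assign = k_center(docs, 2, uniform)

# The same query, answered under two different user-chosen weightings.
query = {"title": [0.0, 1.0], "authors": [1.0, 0.0], "abstract": [1.0, 0.0]}
title_only = {"title": 1.0, "authors": 0.0, "abstract": 0.0}
authors_only = {"title": 0.0, "authors": 1.0, "abstract": 0.0}
print(search(query, docs, centers, assign, title_only, h=1))
print(search(query, docs, centers, assign, authors_only, h=2))
```

In this sketch the clusters are computed once and reused for every weighting, which mirrors the stated goal of avoiding data replication. The price is that pruning is approximate: a cluster built under uniform weights may not contain the true nearest neighbours under a very skewed weighting, so `probes` trades speed for quality.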