Set-based vector model: An efficient approach for correlation-based ranking

Authors:
Bruno Pôssas;Nivio Ziviani;Wagner Meira, Jr.;Berthier Ribeiro-Neto
Affiliations:
Federal University of Minas Gerais, MG, Brazil;Federal University of Minas Gerais, MG, Brazil;Federal University of Minas Gerais, MG, Brazil;Federal University of Minas Gerais, Brazil and Akwan Information Technologies, MG, Brazil
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2005


Abstract

This work presents a new approach for ranking documents in the vector space model. The novelty is twofold. First, patterns of term co-occurrence are taken into account and processed efficiently. Second, term weights are generated using a data mining technique, association rules. The result is a new ranking mechanism called the set-based vector model. The components of the model are no longer index terms but index termsets, where a termset is a set of index terms. Termsets capture the intuition that semantically related terms appear close to each other in a document, and they can be obtained efficiently by limiting the computation to small passages of text. Once termsets have been computed, the ranking is calculated as a function of each termset's frequency in the document and its scarcity in the document collection. Experimental results show that the set-based vector model improves average precision for all collections and query types evaluated, while keeping computational costs small. For the 2-gigabyte TREC-8 collection, the set-based vector model yields gains in average precision of 14.7% and 16.4% for disjunctive and conjunctive queries, respectively, relative to the standard vector space model. These gains increase to 24.9% and 30.0%, respectively, when proximity information is taken into account. Query processing times are longer but, on average, still comparable to those obtained with the standard vector model (increases in processing time ranged from 30% to 300%). These results suggest that the set-based vector model provides a correlation-based ranking formula that is effective with general collections and computationally practical.
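
The idea of ranking by termset frequency and scarcity can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not state. The weight `(1 + log f) * log(1 + N / d)` is a tf-idf style analogue chosen for illustration. Passages are fixed-size, non-overlapping windows. Termsets are enumerated by brute force over all subsets of the query terms rather than mined with association rules. No vector normalization is applied. The names `passages`, `termset_frequency`, `rank`, and `passage_size` are hypothetical.

```python
import math
from itertools import combinations


def passages(tokens, size):
    """Split a token list into consecutive, non-overlapping passages (as sets)."""
    return [set(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def termset_frequency(termset, doc_passages):
    """Count the passages that contain every term of the termset."""
    return sum(1 for p in doc_passages if termset <= p)


def rank(query_terms, docs, passage_size=4):
    """Score documents by summing the weights of the query termsets they contain.

    Assumed weight (not taken from the abstract): (1 + log f) * log(1 + N / d),
    where f is the termset frequency in the document, d is the number of
    documents containing the termset, and N is the collection size.
    """
    n_docs = len(docs)
    split = [passages(tokens, passage_size) for tokens in docs]
    terms = sorted(set(query_terms))
    scores = [0.0] * n_docs
    # Brute-force enumeration of every non-empty subset of the query terms.
    for k in range(1, len(terms) + 1):
        for combo in combinations(terms, k):
            termset = frozenset(combo)
            freqs = [termset_frequency(termset, ps) for ps in split]
            doc_count = sum(1 for f in freqs if f > 0)
            if doc_count == 0:
                continue  # termset never occurs within a single passage
            scarcity = math.log(1 + n_docs / doc_count)
            for d, f in enumerate(freqs):
                if f > 0:
                    scores[d] += (1 + math.log(f)) * scarcity
    order = sorted(range(n_docs), key=lambda d: scores[d], reverse=True)
    return order, scores


# Toy collection: document 0 has both query terms inside one passage,
# document 1 has them in different passages, document 2 has neither.
docs = [
    "information retrieval ranks documents".split(),
    "information about trees and more on retrieval".split(),
    "cooking recipes".split(),
]
order, scores = rank(["information", "retrieval"], docs)
print(order, scores)
```

In this toy collection, document 0 ranks first because it alone contains the two-term termset within a single passage. Document 1 contains both terms, but in different passages, so it scores only on the single-term termsets. This is the co-occurrence effect that a purely term-based vector model would not capture.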