CiteData: a new multi-faceted dataset for evaluating personalized search performance

Authors:
Abhay Harpale;Yiming Yang;Siddharth Gopal;Daqing He;Zhen Yue
Affiliations:
Carnegie Mellon University, Pittsburgh, PA, USA;Carnegie Mellon University, Pittsburgh, PA, USA;Carnegie Mellon University, Pittsburgh, PA, USA;University of Pittsburgh, Pittsburgh, PA, USA;University of Pittsburgh, Pittsburgh, PA, USA
Venue:
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Year:
2010

Citing 17
Cited 2

A study of thresholding strategies for text categorization

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Topic-sensitive PageRank

Proceedings of the 11th international conference on World Wide Web
Novelty and redundancy detection in adaptive filtering

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Personalized web search by mapping user queries to categories

Proceedings of the eleventh international conference on Information and knowledge management
Latent Class Models for Collaborative Filtering

IJCAI '99 Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence
Scaling personalized web search

WWW '03 Proceedings of the 12th international conference on World Wide Web
Modeling annotated data

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Collaborative filtering via gaussian probabilistic latent semantic analysis

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
RCV1: A New Benchmark Collection for Text Categorization Research

The Journal of Machine Learning Research
Unifying collaborative and content-based filtering

ICML '04 Proceedings of the twenty-first international conference on Machine learning
Ontology-based personalized search and browsing

Web Intelligence and Agent Systems
A Bayesian approach toward active learning for collaborative filtering

UAI '04 Proceedings of the 20th conference on Uncertainty in artificial intelligence
Document classification through interactive supervision of document and term labels

PKDD '04 Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases
Relevance weighting for query independent evidence

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Test theory for assessing IR test collections

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Personalized active learning for collaborative filtering

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Exploring folksonomy for personalized search

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval

Efficient personalized pagerank with accuracy assurance

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
A fuzzy-summary-based approach to faceted search in relational databases

ADBIS'12 Proceedings of the 16th East European conference on Advances in Databases and Information Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Personalized search systems have evolved to utilize heterogeneous features including document hyperlinks, category labels in various taxonomies and social tags in addition to free-text of the documents. Consequently, classifiers, PageRank algorithms and Collaborative Filtering methods are often used as intermediate steps in such personalized retrieval systems. Thorough comparative evaluation of such complex systems has been difficult due to the lack of appropriate publicly available datasets that provide such diverse feature sets. To remedy the situation, we have created CiteData, a new dataset for benchmark evaluations of personalized search performance, that will be made publicly accessible. CiteData is a collection of academic articles extracted from CiteULike and CiteSeer repositories, with rich feature sets such as authors, author-affiliations, topic labels, social tags and citation information. We further supplement it with personalized queries and relevance judgments which were obtained from volunteer users. This paper starts with a discussion of the design criteria and characteristics of the CiteData dataset in comparison with current benchmark datasets, followed by a set of task-oriented empirical evaluations of popular algorithms in statistical classification, collaborative filtering and link analysis as intermediate steps for personalized search. Our results show significant performance improvement of personalized approaches, over that of unpersonalized approaches. We also observe that a meta personalized search engine that leverages information from multiple sources of features performs better than algorithms that use only one of the constituent source of features.