Efficient Jaccard-based diversity analysis of large document collections

Authors:
Fan Deng;Stefan Siersdorfer;Sergej Zerr
Affiliations:
L3S Research Center, Hannover, Germany;L3S Research Center, Hannover, Germany;L3S Research Center, Hannover, Germany
Venue:
Proceedings of the 21st ACM international conference on Information and knowledge management
Year:
2012

Abstract

We propose two efficient algorithms for exploring topic diversity in large document corpora such as user-generated content on the social web, bibliographic data, or other web repositories. Analyzing diversity is useful for obtaining insights into knowledge evolution, trends, periodicities, and topic heterogeneity of such collections. Calculating diversity statistics requires averaging over the similarity of all object pairs, which is computationally prohibitive for large corpora. Our proposed algorithms overcome the quadratic complexity of the average pairwise similarity computation and allow for constant-time (depending on dataset properties) or linear-time approximation with probabilistic guarantees. We show examples of diversity-based studies on large samples from corpora such as the social photo sharing site Flickr, the DBLP bibliography, and US Census data.
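To make the cost argument in the abstract concrete, the following is a minimal sketch, not the authors' algorithms. It contrasts the exact average pairwise Jaccard similarity, which needs a quadratic number of comparisons, with plain uniform sampling of document pairs, where the number of sampled pairs is chosen from Hoeffding's inequality. The function names, the representation of documents as Python sets of terms, and the default values of `epsilon` and `delta` are all assumptions made for illustration.

```python
import itertools
import math
import random


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (1.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def exact_average_similarity(docs):
    """Average Jaccard similarity over all unordered pairs: n*(n-1)/2 comparisons."""
    pairs = list(itertools.combinations(docs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def hoeffding_sample_size(epsilon, delta):
    """Pairs to sample so that |estimate - true mean| <= epsilon holds with
    probability at least 1 - delta (Hoeffding bound for values in [0, 1])."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def sampled_average_similarity(docs, epsilon=0.05, delta=0.05, seed=0):
    """Estimate the average pairwise similarity from uniformly sampled pairs.

    Illustrative assumption: plain pair sampling, not the paper's method.
    The number of comparisons depends only on epsilon and delta, not on len(docs).
    """
    rng = random.Random(seed)
    m = hoeffding_sample_size(epsilon, delta)
    total = 0.0
    for _ in range(m):
        a, b = rng.sample(docs, 2)  # two distinct documents, chosen uniformly
        total += jaccard(a, b)
    return total / m
```

With `epsilon = delta = 0.05` the sketch samples 738 pairs regardless of collection size, whereas the exact average over 100,000 documents would need roughly 5 billion comparisons. The actual guarantees and running times of the paper's two algorithms may differ from this simple bound.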