We propose two efficient algorithms for exploring topic diversity in large document corpora such as user-generated content on the social web, bibliographic data, or other web repositories. Analyzing diversity yields insights into the knowledge evolution, trends, periodicities, and topic heterogeneity of such collections. Calculating diversity statistics requires averaging the similarity over all object pairs, which is computationally prohibitive for large corpora. Our algorithms overcome the quadratic complexity of the average pairwise similarity computation and allow for constant-time (depending on dataset properties) or linear-time approximation with probabilistic guarantees. We show examples of diversity-based studies on large samples from corpora such as the social photo-sharing site Flickr, the DBLP bibliography, and US Census data.
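
To make the cost argument concrete, the following is a minimal sketch of the general idea of replacing the quadratic average with a sampled estimate. It is not the algorithms proposed here. It assumes documents are represented as token sets compared by Jaccard similarity, and it uses plain uniform pair sampling with a Hoeffding bound as one possible source of a probabilistic guarantee. All function names and the parameters `epsilon` and `delta` are hypothetical choices for illustration.

```python
import math
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (taken as 1.0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def exact_avg_similarity(docs):
    """Exact average over all unordered pairs: O(n^2) similarity evaluations."""
    pairs = list(combinations(docs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def hoeffding_sample_size(epsilon, delta):
    """Number of sampled pairs so that |estimate - mean| <= epsilon holds with
    probability >= 1 - delta, for similarities in [0, 1] (Hoeffding bound)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def sampled_avg_similarity(docs, epsilon=0.05, delta=0.05, seed=0):
    """Monte Carlo estimate of the average pairwise similarity.

    The number of similarity evaluations depends only on epsilon and delta,
    not on the number of documents.
    """
    rng = random.Random(seed)
    m = hoeffding_sample_size(epsilon, delta)
    total = 0.0
    for _ in range(m):
        # Draw two distinct document indices uniformly at random.
        i, j = rng.sample(range(len(docs)), 2)
        total += jaccard(docs[i], docs[j])
    return total / m


if __name__ == "__main__":
    # Synthetic corpus (an assumption for illustration): 300 documents,
    # each a random set of 10 tokens drawn from a vocabulary of 50.
    gen = random.Random(42)
    corpus = [set(gen.sample(range(50), 10)) for _ in range(300)]
    print("exact   :", exact_avg_similarity(corpus))
    print("sampled :", sampled_avg_similarity(corpus))
```

With `epsilon = delta = 0.05` the sketch evaluates 738 pairs regardless of corpus size, whereas the exact average over 300 documents already needs 44,850 evaluations. The bound is distribution-free and therefore conservative; a method that exploits dataset properties could do with fewer samples.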