Exploratory analysis of highly heterogeneous document collections

Authors:
Arun S. Maiya;John P. Thompson;Francisco Loaiza-Lemos;Robert M. Rolfe
Affiliations:
Institute for Defense Analyses, Alexandria, VA, USA;Institute for Defense Analyses, Alexandria, VA, USA;Institute for Defense Analyses, Alexandria, VA, USA;Institute for Defense Analyses, Alexandria, VA, USA
Venue:
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2013

Abstract

We present an effective multifaceted system for exploratory analysis of highly heterogeneous document collections. Our system is based on intelligently tagging individual documents in a purely automated fashion and exploiting these tags in a powerful faceted browsing framework. The tagging strategies employed include both unsupervised and supervised approaches based on machine learning and natural language processing. As one of our key tagging strategies, we introduce the KERA algorithm (Keyword Extraction for Reports and Articles). KERA extracts topic-representative terms from individual documents in a purely unsupervised fashion and is shown to be significantly more effective than state-of-the-art methods. Finally, we evaluate our system on its ability to help users locate documents pertaining to military critical technologies buried deep in a large heterogeneous sea of information.
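
To make the idea of faceted browsing over automatically assigned tags concrete, here is a minimal sketch in Python. It is not the authors' system and does not implement KERA. The sample documents, the facet names (`topic`, `type`), and the helpers `filter_docs` and `facet_counts` are all hypothetical, chosen only to show how tags grouped by facet can narrow a collection and summarize what remains.

```python
from collections import Counter

# Hypothetical documents with automatically assigned tags, grouped by facet.
# The facet names and tag values are illustrative assumptions only.
docs = [
    {"id": "d1", "tags": {"topic": {"radar", "sensors"}, "type": {"report"}}},
    {"id": "d2", "tags": {"topic": {"radar"}, "type": {"article"}}},
    {"id": "d3", "tags": {"topic": {"propulsion"}, "type": {"report"}}},
]


def filter_docs(docs, selections):
    """Keep documents that carry every selected tag in every selected facet."""
    return [
        d for d in docs
        if all(wanted <= d["tags"].get(facet, set())
               for facet, wanted in selections.items())
    ]


def facet_counts(docs, facet):
    """Count how many documents carry each tag value within one facet."""
    return Counter(tag for d in docs for tag in d["tags"].get(facet, set()))


# Narrow the collection to documents tagged with the topic "radar",
# then summarize the remaining documents by their "type" facet.
hits = filter_docs(docs, {"topic": {"radar"}})
print([d["id"] for d in hits])
print(facet_counts(hits, "type"))
```

In a browsing interface of this kind, the per-facet counts would typically be shown next to each tag so that a user can see how many documents each further selection would leave.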