Interesting-phrase mining for ad-hoc text analytics

Authors:
Srikanta Bedathur;Klaus Berberich;Jens Dittrich;Nikos Mamoulis;Gerhard Weikum
Affiliations:
Max-Planck-Institut für Informatik, Saarbrücken, Germany;Max-Planck-Institut für Informatik, Saarbrücken, Germany;Saarland University, Saarbrücken, Germany;Max-Planck-Institut für Informatik, Saarbrücken, Germany;Max-Planck-Institut für Informatik, Saarbrücken, Germany
Venue:
Proceedings of the VLDB Endowment
Year:
2010

Abstract

Large text corpora with news, customer mail and reports, or Web 2.0 contributions offer great potential for enhancing business-intelligence applications. We propose a framework for performing text analytics on such data in a versatile, efficient, and scalable manner. While much of the prior literature has emphasized mining keywords or tags in blogs or social-tagging communities, we emphasize the analysis of interesting phrases. These include named entities, important quotations, market slogans, and other multi-word phrases that are prominent in a dynamically derived ad-hoc subset of the corpus, for example, phrases that are frequent in the subset but relatively infrequent in the overall corpus. We develop preprocessing and indexing methods for phrases, paired with new search techniques for finding the top-k most interesting phrases in ad-hoc subsets of the corpus. Our framework is evaluated using a large-scale real-world corpus of New York Times news articles.
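
To make the notion of an interesting phrase concrete, the following is a minimal brute-force sketch in Python. It is not the preprocessing, indexing, or search method the abstract refers to. It assumes that interestingness can be approximated by the ratio of a phrase's document frequency in the ad-hoc subset to its document frequency in the whole corpus, which is one plausible reading of "frequent in the subset but relatively infrequent in the overall corpus". The function names, the `min_support` threshold, the `max_len` limit, and the toy documents are all hypothetical and chosen only for illustration.

```python
from collections import Counter
import heapq


def phrases(tokens, max_len=3):
    """Yield every word n-gram of length 2..max_len as a tuple."""
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def doc_freq(docs, max_len=3):
    """Count, for each phrase, the number of documents that contain it."""
    df = Counter()
    for text in docs:
        # set() so that a phrase is counted at most once per document
        df.update(set(phrases(text.lower().split(), max_len)))
    return df


def top_k_interesting(corpus, subset_ids, k=5, min_support=2, max_len=3):
    """Return up to k (score, subset_df, phrase) tuples, best first.

    Assumed score: subset document frequency / corpus document frequency.
    min_support (an assumption) filters out phrases that occur in only
    one or two subset documents and would trivially reach a score of 1.0.
    """
    corpus_df = doc_freq(corpus, max_len)
    subset_df = doc_freq([corpus[i] for i in subset_ids], max_len)
    scored = (
        (subset_df[p] / corpus_df[p], subset_df[p], p)
        for p in subset_df
        if subset_df[p] >= min_support
    )
    return heapq.nlargest(k, scored)


# Toy corpus; the ad-hoc subset is documents 0 and 1.
corpus = [
    "the central bank raised interest rates",
    "the central bank kept interest rates unchanged",
    "mortgage interest rates fell",
    "the home team won the cup final",
]
for score, support, phrase in top_k_interesting(corpus, [0, 1], k=4):
    print(f"{score:.2f}  {support}  {' '.join(phrase)}")
```

In this toy run, "central bank" outranks "interest rates" because the latter also appears outside the subset. The sketch rescans every document on each call, which is precisely the cost that phrase indexing and dedicated top-k search are meant to avoid on a large corpus.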