Building re-usable dictionary repositories for real-world text mining

  • Authors:
  • Shantanu Godbole;Indrajit Bhattacharya;Ajay Gupta;Ashish Verma

  • Affiliations:
  • IBM Research - India, New Delhi, India;Indian Institute of Science, Bangalore, India;IBM Research - India, New Delhi, India;IBM Research - India, New Delhi, India

  • Venue:
  • CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

Text mining, though still a nascent industry, has been growing quickly along with the awareness of the importance of unstructured data in business analytics, customer retention and extension, social media, and legal applications. There has been a recent increase in the number of commercial text mining product and service offerings, but successful or wide-spread deployments are rare, mainly due to a dependence on the expertise and skill of practitioners. Accordingly, there is a growing need for re-usable repositories for text mining. In this paper, we focus on dictionary-based text mining and its role in enabling practitioners in understanding and analyzing large text datasets. We motivate and define the problem of exploratory dictionary construction for capturing concepts of interest, and propose a framework for efficient construction, tuning, and re-use of these dictionaries across datasets. The construction framework offers a range of interaction modes to the user to quickly build concept dictionaries over large datasets. We also show how to adapt one or more dictionaries across domains and tasks, thereby enabling reuse of knowledge and effort in industrial practice. We present results and case studies on real-life CRM analytics datasets, where such repositories and tooling significantly cut down practitioner time and effort for dictionary-based text mining.