A research case study: Difficulties and recommendations when using a textual data mining tool

  • Authors:
  • Abeer A. Al-Hassan;Faleh Alshameri;Edgar H. Sibley

  • Affiliations:
  • -;-;-

  • Venue:
  • Information and Management
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

Although many interesting results have been reported by researchers using numeric data mining methods, there are still questions that need answering before textual data mining tools will be considered generally useful due to the effort needed to learn and use them. In 2011, we generated a dataset from the legal statements (mainly privacy policy and terms of use) on the websites of 475 of the US Fortune 500 Companies and used it as input to see what we could detect about the organizational relationships between the companies by using a textual data mining tool. We hoped to find that the tool would cluster similar corporations into the same industrial sector, as validated by the company's self-reported North American Industry Classification System code (NAICS). Unfortunately, this proved only marginally successful, leading us to ask why and to pose our research question: What problems occur when a data-mining tool is used to analyze large textual datasets that are unstructured, complex, duplicative, and contain many homonyms and synonyms? In analyzing our large dataset we learned a great deal about the problem and fortunately, after significant effort, determined how to ''massage'' the raw dataset to improve the process and learn how the tool can be better used in research situations. We also found that NAICS, as self-reported by companies, are of dubious value to a researcher-a matter briefly discussed.