A research case study: Difficulties and recommendations when using a textual data mining tool

Authors:
Abeer A. Al-Hassan;Faleh Alshameri;Edgar H. Sibley
Venue:
Information and Management
Year:
2013

Abstract

Although researchers using numeric data mining methods have reported many interesting results, questions remain to be answered before textual data mining tools will be considered generally useful, given the effort needed to learn and use them. In 2011, we generated a dataset from the legal statements (mainly privacy policies and terms of use) on the websites of 475 of the US Fortune 500 companies and used it as input to a textual data mining tool to see what we could detect about the organizational relationships between the companies. We hoped that the tool would cluster similar corporations into the same industrial sector, as validated by each company's self-reported North American Industry Classification System (NAICS) code. Unfortunately, this proved only marginally successful, leading us to ask why and to pose our research question: What problems occur when a data mining tool is used to analyze large textual datasets that are unstructured, complex, duplicative, and full of homonyms and synonyms? In analyzing our large dataset we learned a great deal about the problem and, after significant effort, determined how to "massage" the raw dataset to improve the process and learned how the tool can be better used in research situations. We also found that NAICS codes, as self-reported by companies, are of dubious value to a researcher, a matter we discuss briefly.