Clustering and classification of maintenance logs using text data mining

Authors:
Brett Edwards;Michael Zatorsky;Richi Nayak
Affiliations:
Queensland University of Technology, Brisbane, Queensland;Queensland University of Technology, Brisbane, Queensland;Queensland University of Technology, Brisbane, Queensland
Venue:
AusDM '08 Proceedings of the 7th Australasian Data Mining Conference - Volume 87
Year:
2008

Abstract

Spreadsheet applications allow data to be stored with low development overheads, but also with low data quality. Reporting on data from such sources is difficult using traditional techniques. This case study uses text data mining techniques to analyse 12 years of data from dam pump station maintenance logs stored as free text in a spreadsheet application. The goal was to classify the data as scheduled maintenance or unscheduled repair jobs. Data preparation steps required to transform the data into a format appropriate for text data mining are discussed. The data is then mined by calculating term weights, to which clustering techniques are applied. Clustering identified some groups that contained relatively homogeneous types of jobs. Training a classification model to learn the cluster groups allowed those jobs to be identified in unseen data. However, clustering did not provide a clear overall distinction between scheduled and unscheduled jobs. With some manual analysis to code a target variable for a subset of the data, classification models were trained to predict the target variable based on text features. This was achieved with a moderate level of accuracy.
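
The pipeline outlined in the abstract (term weighting, clustering, then classifying unseen entries into the learned groups) can be illustrated with a minimal sketch. The abstract does not name the weighting scheme, clustering algorithm, or classifier used, so everything below is an assumption made for illustration: TF-IDF weights, a small cosine-based k-means, and a nearest-centroid classifier. The log entries are invented examples, not data from the study, and all function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and keep purely alphabetic tokens (a deliberately crude cleaner).
    return [t for t in text.lower().split() if t.isalpha()]

def normalize(vec):
    # Scale a sparse vector (dict of term -> weight) to unit length.
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def fit_idf(docs):
    # Inverse document frequency; the "+ 1.0" smoothing is an assumption.
    n = len(docs)
    df = Counter(term for d in docs for term in set(tokenize(d)))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}

def vectorize(text, idf):
    # TF-IDF weights for one entry; terms unseen during fitting are ignored.
    tf = Counter(t for t in tokenize(text) if t in idf)
    return normalize({t: f * idf[t] for t, f in tf.items()})

def cosine(a, b):
    # Dot product of two unit-length sparse vectors.
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def nearest(vec, centroids):
    # Index of the centroid with the highest cosine similarity.
    return max(range(len(centroids)), key=lambda k: cosine(vec, centroids[k]))

def centroid(vectors):
    # Normalized sum of the member vectors.
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return normalize(dict(total))

def kmeans(vectors, seeds, iterations=10):
    # Cosine-based k-means; `seeds` are indices of the initial centroids.
    centroids = [vectors[i] for i in seeds]
    labels = []
    for _ in range(iterations):
        labels = [nearest(v, centroids) for v in vectors]
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = centroid(members)
    return labels, centroids

def predict(text, idf, centroids):
    # Nearest-centroid classification of an unseen log entry.
    return nearest(vectorize(text, idf), centroids)

# Invented maintenance-log entries (hypothetical, not from the study).
logs = [
    "monthly service pump oil change",
    "monthly service pump filter check",
    "pump motor failed replaced bearing",
    "motor failed emergency repair bearing",
]
idf = fit_idf(logs)
vectors = [vectorize(entry, idf) for entry in logs]
labels, centroids = kmeans(vectors, seeds=[0, 2])
print(labels)
print(predict("bearing failed on motor", idf, centroids))
```

With real data the number of clusters and the seed choice would need tuning, and, as the abstract notes, the resulting clusters need not line up with the scheduled versus unscheduled distinction; a supervised model trained on manually coded labels would replace the nearest-centroid step in that case.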