Computer assisted processing of large unstructured document sets: a case study in the construction industry

  • Authors: John McKechnie; Sameh Shaaban; Stephen Lockley

  • Affiliations: Construction Informatics, Newcastle Upon Tyne, UK; Construction Informatics, Newcastle Upon Tyne, UK; Construction Informatics, Newcastle Upon Tyne, UK

  • Venue: DocEng '01: Proceedings of the 2001 ACM Symposium on Document Engineering
  • Year: 2001

Abstract

Construction is one of the most information-intensive industries; professionals typically access the industry's information resources on a daily basis. The major constraints on the future development of a formally encoded knowledge base are fragmented information sources and the lack of comprehensive classification schemes. In agreement with earlier research, and drawing on over twenty years of practical experience, we have found that manually categorising a large collection of documents is error-prone, time-consuming and expensive, and produces inconsistent results. Attempts over recent years to automate this using state-of-the-art categorisation techniques have also proven wanting, due to the shallow internal representation in the document set. In this paper we describe an approach that overcomes this problem by combining the benefits of automated categorisation with efficient and effective use of human judgement. We present a tool based on this philosophy that uses machine learning, information retrieval and information visualisation techniques to help bibliographers analyse the document collection. By analysing the content of the unstructured documents, the tool suggests keywords, subject headings and candidate documents to include under each subject heading. This greatly increases the speed at which bibliographers can process documents, improves the accuracy of their work and results in a categorisation system that reflects the terminology and relationships held in the actual knowledge base. This work is now being applied to enhance one of the market-leading retrieval products in the construction industry.
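
The abstract does not say how the tool derives its suggestions. As a rough illustration only, the sketch below shows one common way a system could propose keywords and candidate documents for a bibliographer to accept or reject: TF-IDF term weighting and simple term overlap. The technique, the function names (`suggest_keywords`, `candidate_documents`), the stopword list and the sample documents are all assumptions made for this example and are not taken from the paper.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "for", "to", "on", "is", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def suggest_keywords(docs, top_n=3):
    """Return the top_n TF-IDF terms of each document as keyword suggestions."""
    token_lists = [tokenize(d) for d in docs]
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(docs)
    suggestions = []
    for tokens in token_lists:
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        suggestions.append(ranked[:top_n])
    return suggestions


def candidate_documents(docs, heading):
    """Return indices of documents sharing terms with a heading, best match first."""
    heading_terms = set(tokenize(heading))
    hits = [(len(heading_terms & set(tokenize(d))), i) for i, d in enumerate(docs)]
    ranked = sorted(hits, key=lambda h: (-h[0], h[1]))
    return [i for overlap, i in ranked if overlap > 0]


if __name__ == "__main__":
    # Made-up sample documents, for illustration only.
    docs = [
        "concrete slab curing in cold weather",
        "timber frame fire resistance",
        "concrete beam reinforcement design",
    ]
    print(suggest_keywords(docs))
    print(candidate_documents(docs, "concrete"))
```

In a workflow like the one the abstract describes, output of this kind would only be a suggestion: the bibliographer keeps the final decision on which keywords and documents belong under a heading.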