Multilayer SOM with tree-structured data for efficient document retrieval and plagiarism detection

  • Authors:
  • Tommy W. S. Chow; M. K. M. Rahman

  • Affiliations:
  • Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong; Department of Electrical and Electronic Engineering, United International University, Dhaka, Bangladesh and Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong

  • Venue:
  • IEEE Transactions on Neural Networks
  • Year:
  • 2009

Abstract

This paper proposes a new document retrieval (DR) and plagiarism detection (PD) system based on a multilayer self-organizing map (MLSOM). Each document is modeled by a rich tree-structured representation, and a SOM-based system serves as a computationally efficient solution. Instead of relying on keywords or individual lines, the proposed scheme uses a full document as the query for retrieval and PD. The tree-structured representation organizes document features hierarchically at the document, page, and paragraph levels. It can therefore reflect underlying context that is difficult to capture from the word-frequency information in current use. We show that the tree-structured data is effective for DR and PD. To handle the tree-structured representation efficiently, we use the MLSOM algorithm, which the authors previously developed for image retrieval. In this study, it serves as an effective clustering algorithm. Using the MLSOM, local matching techniques are developed for comparing text documents. Two novel MLSOM-based PD methods are proposed. Detailed simulations are conducted, and the experimental results corroborate that the proposed approach is computationally efficient and accurate for DR and PD.
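
The document, page, and paragraph hierarchy described in the abstract can be pictured with a minimal Python sketch. Everything in it is an illustrative assumption rather than the authors' implementation: the `Node` class, the `build_tree` and `word_counts` helpers, the use of a form feed to separate pages and a blank line to separate paragraphs, and the plain word-count vector used as the feature at each level. The abstract does not specify the actual features, and the MLSOM training step is not shown.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the (hypothetical) document tree."""
    level: str                      # "document", "page", or "paragraph"
    features: list                  # word-count vector over the vocabulary
    children: list = field(default_factory=list)


def word_counts(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[w] for w in vocab]


def build_tree(text, vocab):
    """Build a document -> pages -> paragraphs tree.

    Assumption: pages are separated by a form feed ("\f") and
    paragraphs by a blank line. Real documents may need other rules.
    """
    pages = []
    for page_text in text.split("\f"):
        paragraphs = [
            Node("paragraph", word_counts(p, vocab))
            for p in page_text.split("\n\n")
            if p.strip()
        ]
        pages.append(Node("page", word_counts(page_text, vocab), paragraphs))
    return Node("document", word_counts(text, vocab), pages)


# Toy example: two pages, the first with two paragraphs.
vocab = ["som", "tree", "query"]
doc = "som tree\n\ntree query\fquery query som"
tree = build_tree(doc, vocab)
```

In a tree like this, the root summarizes the whole document while the lower levels retain where in the document each word occurs, which is the kind of local context a single flat word-frequency vector loses.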