Content-based hierarchical document organization using multi-layer hybrid network and tree-structured features

  • Authors:
  • M. K. M. Rahman;Tommy W. S. Chow

  • Affiliations:
  • Dept. of Electronic Engineering, City University of Hong Kong, G6409, Tat Chee Avenue, Kowloon Tong, Hong Kong;Dept. of Electronic Engineering, City University of Hong Kong, G6409, Tat Chee Avenue, Kowloon Tong, Hong Kong

  • Venue:
  • Expert Systems with Applications: An International Journal
  • Year:
  • 2010

Quantified Score

Hi-index 12.05

Abstract

Automatically organizing documents into a hierarchical tree is needed in many real applications. In this work, we focus on content-based document organization through a hierarchical tree, which can be viewed as a classification problem. We propose a new document representation to improve classification accuracy, and we develop a new hybrid neural network model to handle this representation. In our approach, a document is represented by a tree structure that has a superior capability for encoding document characteristics. Compared with traditional feature representations, which encode only the global characteristics of a document, the proposed approach encodes both global and local characteristics through a hierarchical tree. Unlike traditional representations, the tree representation reflects the spatial organization of words across the pages and paragraphs of a document, which helps to encode the semantics of the document better. Processing a hierarchical tree is itself a challenging task in terms of computational complexity. For this task, we developed a hybrid neural network model composed of a self-organizing map (SOM) and a multilayer perceptron (MLP). Experimental results corroborate that our approach is efficient and effective in registering documents into an organized tree compared with other approaches.
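
The tree-structured representation described in the abstract can be illustrated with a minimal Python sketch. It builds a three-level tree (document, pages, paragraphs) in which the root holds global word counts and the lower nodes hold local ones. Everything specific here is an assumption made for illustration and is not taken from the paper: the names `Node`, `term_frequencies` and `build_document_tree`, the use of raw term counts as node features, the form-feed character as page separator, and blank lines as paragraph separators. The SOM and MLP stages are not shown.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the document tree (hypothetical structure)."""
    level: str                      # "document", "page" or "paragraph"
    features: list                  # term counts over a fixed vocabulary
    children: list = field(default_factory=list)


def term_frequencies(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocabulary]


def build_document_tree(text, vocabulary):
    """Build a document -> pages -> paragraphs tree.

    Assumption: pages are separated by form feeds and paragraphs by
    blank lines. The root encodes global characteristics; page and
    paragraph nodes encode local ones.
    """
    root = Node("document", term_frequencies(text, vocabulary))
    for page in text.split("\f"):
        if not page.strip():
            continue
        page_node = Node("page", term_frequencies(page, vocabulary))
        for paragraph in re.split(r"\n\s*\n", page):
            if paragraph.strip():
                page_node.children.append(
                    Node("paragraph", term_frequencies(paragraph, vocabulary))
                )
        root.children.append(page_node)
    return root
```

In this sketch, two documents with identical global word counts can still differ at the page or paragraph level, which is the kind of local information a flat feature vector would lose.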