Structure-based document model with discrete wavelet transforms and its application to document classification

  • Authors:
  • Supphachai Thaicharoen;Tom Altman;Krzysztof J. Cios

  • Affiliations:
  • University of Colorado Denver, Denver, CO;University of Colorado Denver, Denver, CO;Virginia Commonwealth University, Richmond, VA

  • Venue:
  • AusDM '08 Proceedings of the 7th Australasian Data Mining Conference - Volume 87
  • Year:
  • 2008

Abstract

Term signal is an existing text representation that depicts a term as a vector of its occurrence frequencies in a number of user-defined partitions of a document. Although term signal augments the traditional vector space model with patterns of term occurrences, its division of a document does not follow the document's actual logical structure. In this paper, we propose a novel document model, termed Structure-Based Document Model with Discrete Wavelet Transforms (SDMDWT), that exploits the structural information of documents and mathematical transforms for document representation. The proposed SDMDWT model enhances the existing term signal concept by additionally taking a document's structural information into consideration during document division. We evaluated the proposed model on standard data sets from two different domains, WebKB 4-Universities and TREC Genomics 2005, using binary classification with Support Vector Machines. The experimental results show that using our SDMDWT model for document representation yields promising improvements in classification performance over existing document models.
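
The term signal idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (term_signal, structural_term_signal, haar_dwt), the equal-width partitioning rule, the use of one signal component per logical section, and the choice of the Haar wavelet are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def term_signal(tokens, term, n_partitions=8):
    """Count occurrences of `term` in each of n equal-width partitions.

    Assumption: partitions are equal-width slices of the token sequence.
    """
    signal = [0] * n_partitions
    for i, tok in enumerate(tokens):
        if tok == term:
            # Map token position i to its partition index.
            signal[i * n_partitions // len(tokens)] += 1
    return signal


def structural_term_signal(sections, term):
    """Hypothetical structure-based variant: one component per logical
    section (e.g. title, abstract, body), given as lists of tokens."""
    return [section.count(term) for section in sections]


def haar_dwt(signal):
    """Full orthonormal Haar decomposition of a power-of-two-length signal.

    Returns [approximation, coarsest detail, ..., finest details].
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = [float(x) for x in signal]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        # Coarser detail coefficients go in front of finer ones.
        details = detail + details
    return approx + details


tokens = "a b a c a b c c".split()
signal = term_signal(tokens, "a", n_partitions=4)  # [1, 1, 1, 0]
coefficients = haar_dwt(signal)
```

In this sketch the wavelet coefficients, rather than the raw per-partition counts, would serve as the features for a term. A signal built from logical sections may not have a power-of-two length, so it would need padding or a different transform before haar_dwt could be applied; how the paper handles this is not stated in the abstract.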