A semi-structured document model for text mining

  • Authors:
  • Yang Jianwu;Chen Xiaoou

  • Affiliations:
  • National Key Laboratory for Text Processing, Institute of Computer Science and Technology, Peking University, Beijing 100871, P.R. China

  • Venue:
  • Journal of Computer Science and Technology
  • Year:
  • 2002

Abstract

A semi-structured document carries more structural information than an ordinary document, and the relations among semi-structured documents can also be exploited. To take advantage of the structure and link information in semi-structured documents for better mining, this paper presents a structured link vector model (SLVM), in which each document is represented by a vector whose elements are determined by terms, document structure and neighboring documents. For brevity and clarity, text mining based on SLVM is described using the K-means procedure, in terms of its two core steps: calculating document similarity and calculating the cluster center. In the experiments, clustering based on SLVM performs significantly better than clustering based on a conventional vector space model, with the F value increasing from 0.65-0.73 to 0.82-0.86.
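
The abstract does not give the SLVM formulas, so the following is only a minimal sketch of the general idea, not the authors' definition. It assumes a document is a dictionary of structural elements plus a list of linked neighbors, that each element contributes weighted term counts, and that neighbors contribute a damped copy of their own vectors. The names `ELEMENT_WEIGHTS`, `LINK_WEIGHT`, `structured_vector` and `slvm_like_vector` are hypothetical, as are the weight values. Cosine similarity and the mean vector stand in for the two K-means steps named above (document similarity and cluster center).

```python
import math
from collections import Counter

# Hypothetical weights chosen for illustration; they are not from the paper.
ELEMENT_WEIGHTS = {"title": 2.0, "body": 1.0}
LINK_WEIGHT = 0.5


def structured_vector(doc):
    """Term counts weighted by the structural element each term occurs in."""
    vec = Counter()
    for element, text in doc["elements"].items():
        weight = ELEMENT_WEIGHTS.get(element, 1.0)
        for term in text.lower().split():
            vec[term] += weight
    return vec


def slvm_like_vector(doc_id, docs):
    """Add a damped copy of each linked neighbor's structured vector."""
    vec = Counter(structured_vector(docs[doc_id]))
    for neighbor_id in docs[doc_id]["links"]:
        for term, weight in structured_vector(docs[neighbor_id]).items():
            vec[term] += LINK_WEIGHT * weight
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Cluster center as the component-wise mean of the member vectors."""
    if not vectors:
        return {}
    total = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


# Toy corpus (invented for this example): document "a" links to document "b".
docs = {
    "a": {"elements": {"title": "text mining", "body": "clustering documents"},
          "links": ["b"]},
    "b": {"elements": {"title": "link analysis", "body": "web documents"},
          "links": []},
}
va = slvm_like_vector("a", docs)
vb = slvm_like_vector("b", docs)
```

In this toy corpus, folding the linked neighbor into the vector for "a" raises its cosine similarity to "b" compared with using structure-weighted terms alone, which is the kind of effect a link-aware model is meant to capture. A full K-means loop would alternate between assigning each document to the most similar `centroid` and recomputing the centroids.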