Clustering DTDs: an interactive two-level approach

  • Authors:
  • Aoying Zhou;Weining Qian;Hailei Qian;Long Zhang;Yuqi Liang;Wen Jin

  • Affiliations:
  • Department of Computer Science, Laboratory for Intelligent Information Processing, Fudan University, Shanghai 200433, P.R. China (Aoying Zhou, Weining Qian, Hailei Qian, Long Zhang, Yuqi Liang);Department of Computer Science, Simon Fraser University, Canada (Wen Jin)

  • Venue:
  • Journal of Computer Science and Technology
  • Year:
  • 2002

Abstract

XML (eXtensible Markup Language) is a standard widely applied in data representation and data exchange. However, DTD (Document Type Definition), an important concept in XML, is not taken full advantage of in current applications. In this paper, a new method for clustering DTDs is presented, which can be used in XML document clustering. The two-level method clusters the elements in DTDs and the DTDs themselves separately. Element clustering forms the first level and provides element clusters, which are generalizations of relevant elements. DTD clustering utilizes the generalized information and forms the second level of the whole clustering process. The two-level method has the following advantages: 1) it takes into consideration both the content and the structure within DTDs; 2) the generalized information about elements is more useful than the separated words in the vector model; 3) it facilitates the search for outliers. The experiments show that this method is able to categorize the relevant DTDs effectively.
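
The two-level idea can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the similarity measures or the clustering procedure, so the sketch assumes a simplified DTD representation (a dict mapping each element name to its child element names), Jaccard similarity, and single-link grouping with a fixed threshold. All function names, the example DTDs, and the threshold values are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def threshold_clusters(items, sim, threshold):
    """Single-link grouping: items whose similarity reaches the threshold are merged."""
    parent = {x: x for x in items}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        if sim(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for x in items:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


def cluster_dtds(dtds, elem_threshold=0.5, dtd_threshold=0.5):
    """Assumed two-level scheme: cluster elements first, then cluster DTDs."""
    # Level 1: describe each element by its own name plus its child names,
    # so that both content (names) and structure (nesting) affect similarity.
    features = {}
    for dtd_name, elements in dtds.items():
        for elem, children in elements.items():
            features[(dtd_name, elem)] = {elem.lower()} | {c.lower() for c in children}
    elem_clusters = threshold_clusters(
        list(features),
        lambda a, b: jaccard(features[a], features[b]),
        elem_threshold,
    )
    cluster_of = {e: i for i, group in enumerate(elem_clusters) for e in group}

    # Level 2: represent each DTD by the set of element clusters it touches,
    # then group DTDs whose cluster sets overlap enough.
    profile = {
        dtd_name: {cluster_of[(dtd_name, elem)] for elem in elements}
        for dtd_name, elements in dtds.items()
    }
    return threshold_clusters(
        list(profile),
        lambda a, b: jaccard(profile[a], profile[b]),
        dtd_threshold,
    )


# Hypothetical toy input: element name -> list of child element names.
example_dtds = {
    "book": {"book": ["title", "author"], "title": [], "author": ["name"], "name": []},
    "novel": {"novel": ["title", "author"], "title": [], "author": ["name"], "name": []},
    "order": {"order": ["item", "price"], "item": [], "price": []},
}
clusters = cluster_dtds(example_dtds)
print(sorted(sorted(group) for group in clusters))
```

In this toy run, "book" and "novel" end up in one cluster because their elements fall into the same element clusters, while "order" is left alone. Under these assumptions, a DTD that remains in a singleton cluster could be treated as a candidate outlier.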