An integrated system of mining HTML texts and filtering structured documents

  • Authors:
  • Bo-Hyun Yun;Myung-Eun Lim;Soo-Hyun Park

  • Affiliations:
  • Dept. of Human Information Processing, Electronics and Telecommunications Research Institute, Daejon, Korea;Dept. of Human Information Processing, Electronics and Telecommunications Research Institute, Daejon, Korea;School of Business IT, Kookmin University, Seoul, Korea

  • Venue:
  • PAKDD'03 Proceedings of the 7th Pacific-Asia conference on Advances in knowledge discovery and data mining
  • Year:
  • 2003

Abstract

This paper presents a method for mining HTML documents into structured documents and for filtering those structured documents using both slot weighting and token weighting. The goal of the mining algorithm is to find slot-token patterns in HTML documents. User interests in structured document filtering are expressed in terms of both slots and tokens. Our preference computation algorithm applies vector similarity and Bayesian probability to filter structured documents. The experimental results show that it is important to consider hyperlinking and unlabelling when mining HTML texts, and that slot and token weighting can enhance the performance of structured document filtering.
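
As a rough illustration of how slot weighting and token weighting might combine in a vector-similarity preference score, consider the sketch below. It is not the authors' algorithm: the abstract gives no formulas, so the data layout (a document as a mapping from slot name to token list), the names `cosine` and `preference`, and the weighted-average combination rule are all assumptions made for this example. The Bayesian component mentioned in the abstract is left out.

```python
import math
from collections import Counter


def cosine(token_weights, doc_tokens):
    """Cosine similarity between a profile's token weights and a slot's token counts."""
    counts = Counter(doc_tokens)
    dot = sum(w * counts.get(tok, 0) for tok, w in token_weights.items())
    norm_p = math.sqrt(sum(w * w for w in token_weights.values()))
    norm_d = math.sqrt(sum(c * c for c in counts.values()))
    if norm_p == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_p * norm_d)


def preference(profile, document):
    """Slot-weighted average of per-slot cosine similarities (hypothetical rule).

    profile:  {slot: (slot_weight, {token: token_weight})}
    document: {slot: [token, ...]}
    """
    total_weight = sum(slot_weight for slot_weight, _ in profile.values())
    if total_weight == 0:
        return 0.0
    score = 0.0
    for slot, (slot_weight, token_weights) in profile.items():
        score += slot_weight * cosine(token_weights, document.get(slot, []))
    return score / total_weight


# Hypothetical user profile: the "title" slot counts twice as much as "body".
profile = {
    "title": (2.0, {"mining": 1.0}),
    "body": (1.0, {"filtering": 1.0}),
}
on_topic = {"title": ["mining"], "body": ["filtering"]}
off_topic = {"title": ["sports"], "body": ["filtering"]}

print(preference(profile, on_topic))
print(preference(profile, off_topic))
```

In this sketch a document that matches the profile only in the lower-weighted "body" slot scores well below one that also matches in "title", which is the kind of effect slot weighting is meant to capture. A filter would then keep documents whose score exceeds some threshold chosen by the user or tuned on held-out data.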