Bottom-up discovery of clusters of maximal ranges in HTML trees for search engines results extraction

  • Authors:
  • Dominik Flejter; Roman Hryniewiecki

  • Affiliations:
  • Poznan University of Economics, Department of Information Systems, Poznan, Poland; Poznan University of Economics, Department of Information Systems, Poznan, Poland

  • Venue:
  • BIS'07 Proceedings of the 10th international conference on Business information systems

  • Year:
  • 2007

Abstract

Unsupervised detection of HTML records is an important step in many Web content mining applications. In this paper we propose a method for bottom-up discovery of clusters of maximal, non-agglomerative, similar HTML ranges in a nested-set representation of the HTML tree. We then demonstrate its applicability to record detection in search engine results. To measure performance, we evaluated several distance assessment strategies and prepared two test collections: one containing results pages from almost 60 global and country-specific search engines, and one containing almost 100 methodically generated complex HTML trees with preset properties. The empirical study shows that our method performs well and successfully detects most clusters of search result ranges.
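
The abstract refers to a nested-set representation of the HTML tree without giving details. As a rough illustration only, and not the authors' implementation, the sketch below shows the generic nested-set idea: each element receives a left and a right number from a depth-first walk, so that a descendant's interval lies strictly inside its ancestor's interval. The names NestedSetBuilder, nested_set and is_descendant are hypothetical, and the handling of void and unclosed tags is a simplifying assumption.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed subset of HTML void elements).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NestedSetBuilder(HTMLParser):
    """Hypothetical helper: numbers elements with nested-set left/right values."""

    def __init__(self):
        super().__init__()
        self.counter = 0   # running depth-first visit counter
        self.stack = []    # indices (into self.nodes) of currently open elements
        self.nodes = []    # one dict per element: tag, left, right

    def _open(self, tag):
        self.counter += 1
        self.nodes.append({"tag": tag, "left": self.counter, "right": None})
        self.stack.append(len(self.nodes) - 1)

    def _close(self):
        self.counter += 1
        self.nodes[self.stack.pop()]["right"] = self.counter

    def handle_starttag(self, tag, attrs):
        self._open(tag)
        if tag in VOID_TAGS:
            self._close()  # void elements are leaves

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/> is also treated as a leaf.
        self._open(tag)
        self._close()

    def handle_endtag(self, tag):
        # Ignore end tags for void elements and stray end tags.
        if tag in VOID_TAGS:
            return
        if not any(self.nodes[i]["tag"] == tag for i in self.stack):
            return
        # Close unclosed children until the matching element is closed.
        while self.stack:
            idx = self.stack[-1]
            self._close()
            if self.nodes[idx]["tag"] == tag:
                break

    def close(self):
        super().close()
        while self.stack:  # close anything left open at end of input
            self._close()


def nested_set(html):
    """Return a list of (tag, left, right) tuples in document order."""
    builder = NestedSetBuilder()
    builder.feed(html)
    builder.close()
    return [(n["tag"], n["left"], n["right"]) for n in builder.nodes]


def is_descendant(node, ancestor):
    """True if node's interval lies strictly inside ancestor's interval."""
    return ancestor[1] < node[1] and node[2] < ancestor[2]


rows = nested_set("<ul><li><a>x</a></li><li><a>y</a></li></ul>")
```

In this encoding, containment checks reduce to integer comparisons on the left and right numbers, which is one reason such a representation can be convenient when comparing candidate ranges of sibling subtrees.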