Automated internal web page clustering for improved data extraction

  • Authors: Cornelia Győrödi, Robert Győrödi, Mihai Cornea, George Pecherle

  • Affiliations: University of Oradea (all authors)

  • Venue: Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics
  • Year: 2012

Abstract

This paper presents an algorithm for detecting repeating patterns in the DOM tree of a web page. These patterns let us cluster the content within a page and extract more relevant structured data. The resulting DOM structure can also be used to mine other web pages that have a similar structure and are one hop away from the initially targeted page. The clusters group elements that are similar in structure rather than in content, and the clustering is performed within a single page. This in-page approach is what distinguishes our algorithm from similar techniques that operate on entire web pages.
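
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of in-page structural clustering, not the authors' method. It assumes that two sibling subtrees count as "similar in structure" when their nested tag sequences are identical, and it ignores text and attributes entirely. The names `Node`, `TreeBuilder`, `signature`, `repeating_clusters`, and the `min_repeats` parameter are hypothetical and introduced purely for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have children (a partial list, sufficient for the sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A DOM element reduced to its tag name and child elements."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tag-only tree; text and attributes are deliberately dropped."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def signature(node):
    """Structural fingerprint of a subtree: tag names only, no content."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def repeating_clusters(html, min_repeats=2):
    """Return (parent tag, child signature, count) for repeated sibling subtrees."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    clusters = []
    stack = [builder.root]
    while stack:
        parent = stack.pop()
        groups = defaultdict(list)
        for child in parent.children:
            groups[signature(child)].append(child)
        for sig, members in groups.items():
            if len(members) >= min_repeats:
                clusters.append((parent.tag, sig, len(members)))
        stack.extend(parent.children)
    return clusters


sample = """
<ul>
  <li><a>One</a><span>1</span></li>
  <li><a>Two</a><span>2</span></li>
  <li><a>Three</a><span>3</span></li>
</ul>
<p>footer</p>
"""
clusters = repeating_clusters(sample)
print(clusters)
```

In this sample the three list items differ in content but share one structure, so they form a single cluster under their parent. Exact signature matching is the simplest possible similarity test; real pages usually contain small structural variations between records, so a practical implementation would likely need a tolerant comparison such as a tree edit distance.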