Bottom-up discovery of clusters of maximal ranges in HTML trees for search engines results extraction

Authors:
Dominik Flejter; Roman Hryniewiecki
Affiliations:
Poznan University of Economics, Department of Information Systems, Poznan, Poland; Poznan University of Economics, Department of Information Systems, Poznan, Poland
Venue:
BIS'07 Proceedings of the 10th international conference on Business information systems
Year:
2007

Abstract

Unsupervised detection of HTML records is an important step in many Web content mining applications. In this paper we propose a method for bottom-up discovery of clusters of maximal, non-agglomerative, similar HTML ranges in a nested-set representation of the HTML tree. We then demonstrate its applicability to record detection in search engine results. To measure performance, we evaluated several distance assessment strategies and prepared two test collections: one containing result pages from almost 60 global and country-specific search engines, and one containing almost 100 methodically generated complex HTML trees with pre-set properties. An empirical study shows that our method performs well and successfully detects most clusters of search result ranges.
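
The abstract does not spell out the nested-set representation, so the following is only a minimal sketch of the general nested-set idea, not the authors' implementation. It assumes well-formed markup in which every element is explicitly closed; the names NestedSetBuilder, nested_set and is_descendant are hypothetical and chosen for illustration. Each element receives a (left, right) pair from a single counter, so that a node's descendants are exactly the nodes whose pairs fall strictly inside its own.

```python
from html.parser import HTMLParser


class NestedSetBuilder(HTMLParser):
    """Assign nested-set (left, right) numbers to every element.

    Assumption: the input is well formed and every element is
    explicitly closed; no recovery from broken markup is attempted.
    """

    def __init__(self):
        super().__init__()
        self.counter = 0   # shared counter for left and right numbers
        self.stack = []    # indices into self.nodes of currently open elements
        self.nodes = []    # one dict per element, in document order

    def handle_starttag(self, tag, attrs):
        self.counter += 1
        self.nodes.append({"tag": tag, "left": self.counter, "right": None})
        self.stack.append(len(self.nodes) - 1)

    def handle_endtag(self, tag):
        if not self.stack:
            return  # stray closing tag: ignored in this sketch
        self.counter += 1
        self.nodes[self.stack.pop()]["right"] = self.counter


def nested_set(html):
    """Return the list of elements with their (left, right) numbers."""
    builder = NestedSetBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes


def is_descendant(node, ancestor):
    """A node lies below an ancestor iff its interval is strictly inside."""
    return ancestor["left"] < node["left"] and node["right"] < ancestor["right"]


nodes = nested_set("<ul><li><a>x</a></li><li><a>y</a></li></ul>")
for n in nodes:
    print(n["tag"], n["left"], n["right"])
```

In this encoding, a run of consecutive sibling subtrees occupies a contiguous interval of numbers (here the two li subtrees cover 2 to 5 and 6 to 9), which is one plausible way a "range" of adjacent nodes could be addressed and compared. How the paper actually defines ranges, their maximality, and the distance between them is not stated in the abstract.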