Unsupervised detection of HTML records is an important step in many Web content mining applications. In this paper we propose a method for bottom-up discovery of clusters of maximal, non-agglomerative similar HTML ranges in a nested set representation of the HTML tree. We then demonstrate its applicability to record detection in search engine results. To measure performance, we evaluated several distance assessment strategies and prepared two test collections: one containing result pages from almost 60 global and country-specific search engines, and one containing almost 100 methodically generated complex HTML trees with pre-set properties. The empirical study shows that our method performs well and successfully detects most clusters of search result ranges.
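
The abstract does not spell out how the nested set representation is built, so the following is only a minimal sketch of the general nested set model applied to HTML: each element receives a (left, right) pair from a depth-first numbering, and one element contains another exactly when its interval encloses the other's. The names `NestedSetBuilder`, `nested_set`, `is_descendant`, and `VOID_TAGS` are hypothetical, and the handling of void and unclosed tags is an assumption made for illustration, not the paper's procedure.

```python
from html.parser import HTMLParser

# Assumption: a small set of void elements that never receive an end tag.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class NestedSetBuilder(HTMLParser):
    """Assigns nested set (left, right) numbers to HTML elements."""

    def __init__(self):
        super().__init__()
        self.counter = 0
        self.stack = []   # currently open elements
        self.nodes = []   # all elements in document order

    def handle_starttag(self, tag, attrs):
        self.counter += 1
        node = {"tag": tag, "left": self.counter, "right": None,
                "depth": len(self.stack)}
        self.nodes.append(node)
        if tag in VOID_TAGS:
            # Void elements open and close immediately.
            self.counter += 1
            node["right"] = self.counter
        else:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Ignore stray end tags that match no open element.
        if not any(n["tag"] == tag for n in self.stack):
            return
        # Close the nearest matching element and anything left open inside it.
        while self.stack:
            node = self.stack.pop()
            self.counter += 1
            node["right"] = self.counter
            if node["tag"] == tag:
                break

    def finish(self):
        # Close elements that were never explicitly closed.
        while self.stack:
            node = self.stack.pop()
            self.counter += 1
            node["right"] = self.counter


def nested_set(html):
    """Return a list of (tag, left, right, depth) tuples in document order."""
    builder = NestedSetBuilder()
    builder.feed(html)
    builder.close()
    builder.finish()
    return [(n["tag"], n["left"], n["right"], n["depth"]) for n in builder.nodes]


def is_descendant(inner, outer):
    """True if `inner` lies strictly inside the interval of `outer`."""
    return outer[1] < inner[1] and inner[2] < outer[2]


if __name__ == "__main__":
    for row in nested_set("<ul><li><a>x</a></li><li><a>y</a></li></ul>"):
        print(row)
```

In this encoding a contiguous run of sibling subtrees corresponds to a contiguous interval of numbers, which is one plausible reason such a representation suits the comparison of HTML ranges; the similarity measures and clustering steps themselves are not shown here.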