An approach to designing a Universal Web Wrapper has been under implementation for years. A related problem is automatically selecting web pages and then extracting the content of interest from them. We propose an algorithm that clusters pages on the basis of their structure. Because pages within a cluster are highly similar in structure, it is easier to categorise them and to extract any particular section of a page. The algorithm uses only structural features, which leads to a complexity of O(log n). An evaluation of the algorithm illustrates its precision and efficiency.
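To make the idea of structure-only clustering concrete, the following is a minimal sketch in Python. It is not the algorithm proposed here, whose details the abstract does not give, and it makes no attempt to reach the stated O(log n) complexity. It assumes that a page's structure can be summarised as the set of root-to-element tag paths, that Jaccard similarity is an acceptable measure, and that a greedy single pass with a fixed threshold is good enough for illustration. The names `TagPathCollector`, `tag_paths`, `jaccard`, `cluster_by_structure` and the threshold value 0.7 are all hypothetical.

```python
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in one HTML page."""

    # Void elements never get a closing tag, so they are not pushed on the stack.
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.paths = set()  # e.g. "html/body/div/table"

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the structural signature of a page (text content is ignored)."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_by_structure(pages, threshold=0.7):
    """Greedy single pass (assumed strategy): a page joins the first cluster
    whose representative signature is at least `threshold` similar."""
    clusters = []  # list of (representative_paths, [page names])
    for name, html in pages.items():
        paths = tag_paths(html)
        for representative, members in clusters:
            if jaccard(representative, paths) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((paths, [name]))
    return [members for _, members in clusters]
```

Given two product-style pages that share a `div`/`table` layout and one article-style page built from `p` and `ul`, this sketch would place the two product pages in one cluster and the article page in another, because only the tag paths are compared and the text is ignored.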