Exploiting tree structure of a web page for clustering

Authors:
Bhaskar Biswas;Karan Jain;Vipul Mittal;K. K. Shukla
Affiliations:
Department of Computer Engineering, Institute of Technology, Banaras Hindu University, Varanasi 221005, India.;Department of Computer Engineering, Institute of Technology, Banaras Hindu University, Varanasi 221005, India.;Department of Computer Engineering, Institute of Technology, Banaras Hindu University, Varanasi 221005, India.;Department of Computer Engineering, Institute of Technology, Banaras Hindu University, Varanasi 221005, India
Venue:
International Journal of Knowledge and Web Intelligence
Year:
2009

Abstract

An approach to designing a Universal Web Wrapper has been under implementation for years. One issue associated with it is the automated selection of web pages and the subsequent extraction of the content of interest. We propose an algorithm that clusters pages on the basis of their structure. Because of the high degree of similarity among these pages, it is easier to categorise them and to extract any particular section of a page. The algorithm uses only structural factors, leading to a complexity of O(log n). An evaluation of the algorithm illustrates its precision and efficiency.
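
The abstract does not describe the algorithm itself, so the following is only an illustrative sketch of what clustering pages by structure can look like. It is not the authors' method and makes no attempt to reach the O(log n) complexity claimed above. It assumes that a page's structure can be summarised as the set of root-to-node tag paths in its HTML tree, that two pages are compared with Jaccard similarity over those sets, and that a greedy single-pass grouping with a hypothetical `threshold` of 0.5 is adequate. The names `TagPathCollector`, `tag_paths`, `jaccard` and `cluster_by_structure` are introduced here for illustration only.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.paths = set()  # e.g. {"html", "html/body", "html/body/div"}

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths for one page (text and attributes ignored)."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_by_structure(pages, threshold=0.5):
    """Greedy single pass: a page joins the first cluster whose representative
    (the cluster's first page) is at least `threshold` similar to it."""
    clusters = []  # list of (representative_paths, member_indices)
    for index, html in enumerate(pages):
        paths = tag_paths(html)
        for representative, members in clusters:
            if jaccard(representative, paths) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((paths, [index]))
    return [members for _, members in clusters]


pages = [
    "<html><body><div><p>a</p></div></body></html>",
    "<html><body><div><p>b</p><p>c</p></div></body></html>",
    "<html><body><table><tr><td>x</td></tr></table></body></html>",
]
print(cluster_by_structure(pages))  # [[0, 1], [2]]
```

The first two pages share every tag path and fall into one cluster, while the table-based page shares only `html` and `html/body` with them (similarity 2/7) and starts a cluster of its own. Only tag structure is consulted, never the text content, which matches the abstract's emphasis on structural factors.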