Automated internal web page clustering for improved data extraction

Authors:
Cornelia Győrödi;Robert Győrödi;Mihai Cornea;George Pecherle
Affiliations:
University of Oradea;University of Oradea;University of Oradea;University of Oradea
Venue:
Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics
Year:
2012

Abstract

This paper presents an algorithm for detecting repeating patterns in the DOM tree of a web page. Grouping these patterns lets us cluster the content within a single page and extract more relevant structured data. The detected DOM structure can then be reused to mine other web pages that are similar in structure and one hop away from the initially targeted page. Clusters are formed by similarity of structure, not of content, and the clustering is performed within a page. This in-page approach is what differentiates our algorithm from similar technologies that operate on entire web pages.
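
The abstract does not spell out the algorithm itself, so the following is only a minimal sketch of the general idea of in-page structural clustering, not the authors' method. It assumes that a "repeating pattern" can be approximated by sibling subtrees whose tag structure is identical, ignoring text and attributes. The names `Node`, `TreeBuilder`, `signature`, and `repeating_patterns` are hypothetical and introduced here for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have children (a partial list, enough for this sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A DOM-like element node that keeps only the tag and the children."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified element tree; text and attributes are dropped."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def signature(node):
    """Structural fingerprint of a subtree: tag names only, no content."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def repeating_patterns(html, min_repeats=2):
    """Return (parent_tag, signature, count) for each group of structurally
    identical sibling subtrees found inside a single page."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    clusters = []
    stack = [builder.root]
    while stack:
        parent = stack.pop()
        groups = defaultdict(list)
        for child in parent.children:
            groups[signature(child)].append(child)
        for sig, members in groups.items():
            # Assumption: skip childless elements, which carry no structure.
            if len(members) >= min_repeats and members[0].children:
                clusters.append((parent.tag, sig, len(members)))
        stack.extend(parent.children)
    return clusters
```

Because the signature ignores text, two list items with different link text still fall into the same cluster, which matches the abstract's point that clusters are similar in structure, not in content. Under the same assumption, a signature found on one page could be compared against the signatures computed for a linked page to locate the corresponding records there.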