WSEAS Transactions on Computers
In this paper, we present an algorithm for detecting repeating patterns in the DOM tree of a web page. These patterns let us cluster the content within a page and extract more relevant structured data. The detected DOM structure can then be used to mine other web pages that are similar in structure and one hop away from the initially targeted page. The clusters are similar in structure, not in content, and our method is based on in-page clustering. This is what distinguishes our algorithm from related techniques that operate on entire web pages.
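As a rough illustration of what in-page structural clustering can look like, the following Python sketch groups sibling DOM nodes whose subtrees have the same tag structure, ignoring text content. It is not the algorithm presented in the paper: the names `Node`, `TreeBuilder`, `signature`, and `repeating_patterns` are hypothetical, and exact signature matching is a simplifying assumption (a practical method would likely tolerate small structural differences).

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have children or a closing tag (partial list, assumption).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """Minimal DOM node: tag name, parent link, and child elements only."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tag-only tree; text and attributes are deliberately ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def signature(node):
    """Structural fingerprint of a subtree, e.g. 'li(a())'."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def repeating_patterns(html, min_repeats=2):
    """Return (parent_tag, child_signature, count) for repeated sibling shapes."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    patterns = []
    stack = [builder.root]
    while stack:
        parent = stack.pop()
        counts = defaultdict(int)
        for child in parent.children:
            counts[signature(child)] += 1
            stack.append(child)
        for sig, count in counts.items():
            if count >= min_repeats:
                patterns.append((parent.tag, sig, count))
    return patterns


sample = "<ul><li><a>x</a></li><li><a>y</a></li><li><b>z</b></li></ul>"
result = repeating_patterns(sample)
print(result)
```

In the sample, the two list items that wrap a link share one signature and form a cluster, while the third item, which wraps a `b` element, does not. Because only tags are compared, the grouping reflects structure rather than content, and a pattern found this way could in principle be reused as a template on structurally similar linked pages.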