The paths more taken: matching DOM trees to search logs for accurate webpage clustering

Authors: Deepayan Chakrabarti; Rupesh Mehta
Affiliations: Yahoo! Research, Sunnyvale, CA, USA; Yahoo! Labs, Bangalore, India
Venue: Proceedings of the 19th international conference on World wide web
Year: 2010

Abstract

An unsupervised clustering of the webpages on a website is a primary requirement for most wrapper induction and automated data extraction methods. Since page content can vary drastically across pages of one cluster (e.g., all product pages on amazon.com), traditional clustering methods typically use some distance function between the DOM trees representing a pair of webpages. However, without knowing which portions of the DOM tree are "important," such distance functions might discriminate between similar pages based on trivial features (e.g., differing numbers of reviews on two product pages), or group together distinct types of pages based on superficial features present in the DOM trees of both (e.g., a matching footer or copyright notice), leading to poor clustering performance. We propose using search logs to automatically find paths in the DOM trees that mark out important portions of pages, e.g., the product title in a product page. Such paths are identified via a global analysis of the entire website, whereby search data for popular pages can be used to infer good paths even for other pages that receive little or no search traffic. The webpages on the website are then clustered using these "key" paths. Our algorithm requires only the search queries and the webpages clicked in response to them; there is no need for human input, and it does not need to be told which portion of a webpage the user found interesting. The resulting clusterings achieve an adjusted Rand score of over 0.9 on half of the websites (a score of 1 indicating a perfect clustering), and 59% better scores on average than competing algorithms. Besides leading to refined clusterings, these key paths can be useful in the wrapper induction process itself, as shown by the high degree of match between the key paths and the manually identified paths used in existing wrappers for these sites (90% average precision).
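
The abstract reports clustering quality as an adjusted Rand score, where 1 indicates a perfect clustering. As a reading aid, here is a minimal sketch of how that score is commonly computed from two labelings of the same pages, using the standard pair-counting formula. It is not taken from the paper: the function name `adjusted_rand_index` and the toy page labels are assumptions made for illustration only.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items.

    Illustrative helper (hypothetical name), following the standard
    pair-counting definition; not the paper's evaluation code.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one item: the labelings trivially agree

    # Pairs of items placed together in both labelings.
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values()
    )
    # Pairs placed together in each labeling separately.
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (together_both - expected) / (maximum - expected)


# Toy example with made-up page types (assumed data, not from the paper).
manual = ["product", "product", "review", "review"]
predicted = ["c1", "c1", "c2", "c2"]
score = adjusted_rand_index(manual, predicted)  # same grouping, renamed labels
```

Because the score depends only on which pages are grouped together, cluster names do not need to match between the manual and predicted labelings. Values near 0 indicate agreement no better than chance, and negative values indicate agreement worse than chance.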