As we may perceive: finding the boundaries of compound documents on the web

Authors:
Pavel Dmitriev
Affiliations:
Cornell University, Ithaca, NY, USA
Venue:
Proceedings of the 17th international conference on World Wide Web
Year:
2008

Citing 5
Cited 4

Defining logical domains in a web site

HYPERTEXT '00 Proceedings of the eleventh ACM on Hypertext and hypermedia
Untangling compound documents on the web

Proceedings of the fourteenth ACM conference on Hypertext and hypermedia
As we may perceive: inferring logical documents from hypertext

Proceedings of the sixteenth ACM conference on Hypertext and hypermedia
Mining Generalized Graph Patterns Based on User Examples

ICDM '06 Proceedings of the Sixth International Conference on Data Mining
Finding the boundaries of compound documents on the web

Automatically assessing resource quality for educational digital libraries

Proceedings of the 3rd workshop on Information credibility on the web
Automatically characterizing resource quality for educational digital libraries

Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Web-site boundary detection

ICDM'10 Proceedings of the 10th industrial conference on Advances in data mining: applications and theoretical aspects
Incremental web-site boundary detection using random walks

MLDM'11 Proceedings of the 7th international conference on Machine learning and data mining in pattern recognition

Abstract

This paper considers the problem of identifying compound documents (cDocs) on the Web: groups of web pages that, in aggregate, constitute semantically coherent information entities. Examples of cDocs are a news article consisting of several HTML pages, or a set of pages describing the specifications, price, and reviews of a digital camera. Being able to identify cDocs would be useful in many applications, including web and intranet search, user navigation, automated collection generation, and information extraction. In the past, several heuristic approaches have been proposed to identify cDocs [1][5]. However, heuristics fail to capture the variety of types, styles, and goals of information on the web, and do not account for the fact that the definition of a cDoc often depends on the context. This paper presents an experimental evaluation of three machine learning-based algorithms for cDoc discovery. These algorithms are responsive to the varying structure of cDocs and adaptive to their application-specific nature. Building on our previous work [4], this paper proposes a different scenario for discovering cDocs and, in this new setting, compares the local machine-learned clustering algorithm from [4] with a global, purely graph-based approach [3] and a Conditional Markov Network approach previously applied to the noun coreference task [6]. The results show that the approach of [4] outperforms the other algorithms, suggesting that global relational characteristics of web sites are too noisy for cDoc identification purposes.