In recent years, many Web algorithms have been developed that operate on information units other than individual web pages, such as segments of a page or aggregations of pages into web communities. Such logical information units improve a variety of web algorithms and provide the building blocks for constructing organized information spaces such as digital libraries. In this paper, we focus on one type of logical information unit, the "compound document". We argue that the ability to identify compound documents can improve information retrieval, automatic metadata generation, and navigation on the Web. We propose a unified framework for identifying the boundaries of compound documents that combines structural and content features of the constituent web pages. The framework combines machine learning and clustering algorithms, with the former supervising the latter. We also propose a new method for evaluating the quality of clusterings, based on a user behavior model. Experiments on a collection of educational web sites show that our approach can reliably identify most of the compound documents on these sites.
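As a rough illustration of the general idea of a learned model supervising a clustering step, the sketch below scores each pair of pages with a logistic function over one content feature and two structural features, then merges pairs whose score passes a threshold. It is not the framework from the paper: the toy site in `PAGES`, the three features, the values of `WEIGHTS` and `BIAS` (fixed by hand here as a stand-in for parameters that would be learned from labeled page pairs), and the threshold-based merging are all assumptions made for illustration.

```python
import math
from itertools import combinations
from posixpath import dirname

# Hypothetical toy site: each page has outgoing links and a set of terms.
PAGES = {
    "/cs101/index.html": {"links": {"/cs101/lec1.html", "/cs101/lec2.html"},
                          "terms": {"cs101", "lecture", "syllabus"}},
    "/cs101/lec1.html": {"links": {"/cs101/index.html"},
                         "terms": {"cs101", "lecture", "sorting"}},
    "/cs101/lec2.html": {"links": {"/cs101/index.html"},
                         "terms": {"cs101", "lecture", "graphs"}},
    "/people/smith.html": {"links": {"/cs101/index.html"},
                           "terms": {"smith", "faculty", "office"}},
}

# Assumed parameters, set by hand. In a supervised setting they would be
# fit on page pairs labeled "same compound document" or "different".
WEIGHTS = (2.0, 1.5, 3.0)  # same directory, directly linked, term overlap
BIAS = -3.0


def pair_features(a, b, pages):
    """Two structural features and one content feature for a page pair."""
    same_dir = 1.0 if dirname(a) == dirname(b) else 0.0
    linked = 1.0 if b in pages[a]["links"] or a in pages[b]["links"] else 0.0
    terms_a, terms_b = pages[a]["terms"], pages[b]["terms"]
    union = terms_a | terms_b
    overlap = len(terms_a & terms_b) / len(union) if union else 0.0  # Jaccard
    return (same_dir, linked, overlap)


def same_document_probability(a, b, pages):
    """Logistic score that two pages belong to the same compound document."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, pair_features(a, b, pages)))
    return 1.0 / (1.0 + math.exp(-z))


def cluster_pages(pages, threshold=0.5):
    """Merge every pair scoring at or above the threshold (union-find)."""
    parent = {url: url for url in pages}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for a, b in combinations(sorted(pages), 2):
        if same_document_probability(a, b, pages) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for url in pages:
        groups.setdefault(find(url), set()).add(url)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    for cluster in cluster_pages(PAGES):
        print(cluster)
```

On this toy input the three `/cs101/` pages end up in one cluster and the faculty page stays alone, even though it links to the course index. With these assumed weights, a single link without a shared directory or shared terms is not enough to merge two pages.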