Untangling compound documents on the web

Authors:
Nadav Eiron;Kevin S. McCurley
Affiliations:
IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA
Venue:
Proceedings of the fourteenth ACM conference on Hypertext and hypermedia
Year:
2003

Abstract

Most text analysis is designed to deal with the concept of a "document", namely a cohesive presentation of thought on a unifying subject. By contrast, individual nodes on the World Wide Web tend to have a much smaller granularity than text documents. We claim that the notions of "document" and "web node" are not synonymous, and that authors often deploy a single document as a collection of URLs, which we call a "compound document". In this paper we present new techniques for identifying and working with such compound documents, along with the results of some large-scale studies of such web documents. The primary motivation for this work is that information retrieval techniques are better suited to working on documents than on individual hypertext nodes.