Elimination of junk document surrogate candidates through pattern recognition

Authors:
Eunyee Koh;Daniel Caruso;Andruid Kerne;Ricardo Gutierrez-Osuna
Affiliations:
Texas A&M University, College Station, TX;Texas A&M University, College Station, TX;Texas A&M University, College Station, TX;Texas A&M University, College Station, TX
Venue:
Proceedings of the 2007 ACM symposium on Document engineering
Year:
2007

Abstract

A surrogate is an object that stands for a document and enables navigation to that document. Hypermedia is often represented with textual surrogates, even though studies have shown that image and text surrogates facilitate the formation of mental models and overall understanding. Surrogates may be formed by breaking a document down into a set of smaller elements, each of which is a surrogate candidate. When these candidates are extracted from an HTML document, relevant information may appear alongside less useful junk material, such as navigation bars and advertisements. This paper develops a pattern-recognition-based approach for eliminating junk while building the set of surrogate candidates. The approach defines features on candidate elements and uses classification algorithms to make selection decisions based on these features. To define features on surrogate candidates, we introduce the Document Surrogate Model (DSM), a streamlined, Document Object Model (DOM)-like representation of semantic structure. Using a quadratic classifier, we were able to eliminate junk surrogate candidates with an average classification rate of 80%. With this technique, semiautonomous agents can be developed to generate surrogate collections for users more effectively. We end by describing a new approach for hypermedia and the semantic web, which uses the DSM to define value-added surrogates for a document.
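
As an illustration of the kind of quadratic classifier the abstract mentions, the following is a minimal sketch, not the authors' implementation. It fits one Gaussian per class and picks the class with the larger quadratic discriminant score. The two features (`link_ratio`, the fraction of a candidate's words that sit inside links, and `word_count`), the training values, and all function names are hypothetical, chosen only to make the example run; the abstract does not list the features actually used.

```python
import math

def fit_gaussian(rows):
    """Estimate the mean and 2x2 sample covariance of two-feature rows."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def discriminant(x, mean, cov, prior):
    """Score g(x) = log(prior) - 0.5*log|S| - 0.5*(x-m)^T S^-1 (x-m)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    maha = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.log(prior) - 0.5 * math.log(det) - 0.5 * maha

def train(samples):
    """Map each label to (mean, covariance, prior) from labeled feature rows."""
    total = sum(len(rows) for rows in samples.values())
    return {label: (*fit_gaussian(rows), len(rows) / total)
            for label, rows in samples.items()}

def classify(model, x):
    """Return the label whose discriminant score is highest for x."""
    return max(model, key=lambda label: discriminant(x, *model[label]))

# Hypothetical training data: (link_ratio, word_count) per surrogate candidate.
model = train({
    "junk": [(0.9, 3), (0.8, 5), (1.0, 2), (0.7, 6), (0.95, 4)],
    "content": [(0.1, 40), (0.0, 80), (0.2, 25), (0.05, 60), (0.15, 35)],
})

print(classify(model, (0.85, 4)))   # a short, link-heavy candidate
print(classify(model, (0.1, 50)))   # a longer, mostly plain-text candidate
```

Because each class keeps its own covariance matrix, the decision boundary is quadratic rather than linear, which is what distinguishes this classifier from a linear discriminant.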