Bootstrapping Semantic Annotation for Content-Rich HTML Documents

Authors:
Saikat Mukherjee;I. V. Ramakrishnan;Amarjeet Singh
Affiliations:
State University of New York at Stony Brook;State University of New York at Stony Brook;State University of New York at Stony Brook
Venue:
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Year:
2005


Abstract

An enormous amount of semantic data is still being encoded in HTML documents. Identifying and annotating the semantic concepts implicit in such documents makes them directly amenable to Semantic Web processing. In this paper we describe a highly automated technique for annotating HTML documents, especially template-based, content-rich documents that each contain many different semantic concepts. Starting with a (small) seed of hand-labeled instances of semantic concepts in a set of HTML documents, we bootstrap an annotation process that automatically identifies unlabeled concept instances present in other documents. The bootstrapping technique exploits the observation that semantically related items in content-rich documents exhibit consistency in presentation style and spatial locality. It uses this observation to learn a statistical model that accurately identifies different semantic concepts in HTML documents drawn from a variety of Web sources. We also present experimental results on the effectiveness of the technique.
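
The abstract does not spell out the statistical model, so the following is only a rough sketch of what a seed-driven bootstrapping loop could look like, not the authors' method. It assumes each page item is reduced to a hypothetical `(style, position)` pair, uses a tiny naive Bayes classifier as a stand-in model, and accepts a predicted label only when its confidence reaches an assumed threshold. All names (`features`, `train`, `predict`, `bootstrap`) and the bucketing of positions are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def features(item):
    """Hypothetical features: a presentation-style token and a coarse position bucket."""
    style, position = item
    return [f"style={style}", f"pos={position // 3}"]


def train(labeled):
    """Collect per-concept feature counts (a stand-in for the paper's statistical model)."""
    counts = defaultdict(Counter)
    totals = Counter()
    for item, concept in labeled:
        totals[concept] += 1
        for feat in features(item):
            counts[concept][feat] += 1
    return counts, totals


def predict(model, item):
    """Return the most likely concept and its normalized probability."""
    counts, totals = model
    n = sum(totals.values())
    scores = {}
    for concept in totals:
        logp = math.log(totals[concept] / n)
        for feat in features(item):
            # Add-one smoothing over a binary present/absent outcome.
            logp += math.log((counts[concept][feat] + 1) / (totals[concept] + 2))
        scores[concept] = logp
    top = max(scores.values())
    probs = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(probs.values())
    best = max(probs, key=probs.get)
    return best, probs[best] / z


def bootstrap(seed, unlabeled, threshold=0.7, max_rounds=5):
    """Grow the labeled set from a small seed by accepting confident predictions."""
    labeled = list(seed)
    pending = list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)
        accepted, rejected = [], []
        for item in pending:
            concept, confidence = predict(model, item)
            if confidence >= threshold:
                accepted.append((item, concept))
            else:
                rejected.append(item)
        if not accepted:
            break  # nothing new was learned in this round
        labeled.extend(accepted)
        pending = rejected
    return labeled, pending


seed = [(("h1.bold", 0), "headline"), (("p.small", 4), "summary")]
unlabeled = [("h1.bold", 1), ("p.small", 5), ("h1.bold", 4)]
labeled, pending = bootstrap(seed, unlabeled)
```

In this toy run, items that share both style and position bucket with a seed instance are labeled, while an item whose style and location point to different concepts stays unlabeled. A real system would need richer features derived from the page's structure and a properly validated model.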