Recognizing Ontology-Applicable Multiple-Record Web Documents

Authors:
David W. Embley;Yiu-Kai Ng;Li Xu
Venue:
ER '01 Proceedings of the 20th International Conference on Conceptual Modeling: Conceptual Modeling
Year:
2001

Abstract

Automatically recognizing which Web documents are "of interest" for some specified application is non-trivial. As a step toward solving this problem, we propose a technique for recognizing which multiple-record Web documents apply to an ontologically specified application. Given the values and kinds of values recognized by an ontological specification in an unstructured Web document, we apply three heuristics: (1) a density heuristic that measures the percent of the document that appears to apply to an application ontology, (2) an expected-value heuristic that compares the number and kind of values found in a document to the number and kind expected by the application ontology, and (3) a grouping heuristic that considers whether the values of the document appear to be grouped as application-ontology records. Then, based on machine-learned rules over these heuristic measurements, we determine whether a Web document is applicable for a given ontology. Our experimental results show that we have been able to achieve over 90% for both recall and precision, with an F-measure of about 95%.
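
As an illustration only, the following Python sketch shows one way the density and expected-value measurements described in the abstract might be computed, along with the standard F-measure. It is not the authors' implementation. The regular-expression recognizers, the expected counts, the choice of character coverage for density, and the choice of cosine similarity for the expected-value comparison are all assumptions made for this example. The grouping heuristic and the machine-learned decision rules are not shown.

```python
import math
import re

# Hypothetical toy "ontology": one regex recognizer per kind of value.
# A real application ontology would be far richer; these are assumptions.
RECOGNIZERS = {
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
    "price": re.compile(r"\$\d[\d,]*"),
    "make": re.compile(r"\b(?:Ford|Honda|Toyota)\b"),
}

# Hypothetical expected number of values of each kind per record.
EXPECTED_PER_RECORD = {"year": 1.0, "price": 1.0, "make": 1.0}


def find_matches(text):
    """Return {kind: [(start, end), ...]} for every recognizer."""
    return {
        kind: [m.span() for m in pattern.finditer(text)]
        for kind, pattern in RECOGNIZERS.items()
    }


def density(text, matches):
    """Fraction of characters covered by at least one recognized value."""
    covered = set()
    for spans in matches.values():
        for start, end in spans:
            covered.update(range(start, end))
    return len(covered) / len(text) if text else 0.0


def expected_value_similarity(matches):
    """Cosine similarity between observed and expected value counts."""
    kinds = sorted(EXPECTED_PER_RECORD)
    observed = [len(matches.get(k, [])) for k in kinds]
    expected = [EXPECTED_PER_RECORD[k] for k in kinds]
    dot = sum(o * e for o, e in zip(observed, expected))
    norm = math.sqrt(sum(o * o for o in observed)) * math.sqrt(
        sum(e * e for e in expected)
    )
    return dot / norm if norm else 0.0


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


if __name__ == "__main__":
    doc = "1998 Ford $4,500. 2001 Honda $9,000."
    found = find_matches(doc)
    print("density:", density(doc, found))
    print("expected-value similarity:", expected_value_similarity(found))
```

In a fuller pipeline, measurements like these (together with a grouping score) would form the feature vector passed to a learned classifier that decides whether the document applies to the ontology.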