Automatically recognizing which Web documents are "of interest" for some specified application is non-trivial. As a step toward solving this problem, we propose a technique for recognizing which multiple-record Web documents apply to an ontologically specified application. Given the values and kinds of values recognized by an ontological specification in an unstructured Web document, we apply three heuristics: (1) a density heuristic that measures the percentage of the document that appears to apply to an application ontology, (2) an expected-value heuristic that compares the number and kind of values found in a document to the number and kind expected by the application ontology, and (3) a grouping heuristic that considers whether the values of the document appear to be grouped as application-ontology records. Then, based on machine-learned rules over these heuristic measurements, we determine whether a Web document is applicable for a given ontology. Our experimental results show that we have been able to achieve over 90% for both recall and precision, with an F-measure of about 95%.
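
To make the first two heuristics and the reported F-measure concrete, the following is a minimal sketch and not the authors' implementation. It assumes that recognized values are available as character-offset spans, that the expected-value comparison can be approximated by cosine similarity over per-kind counts, and that the F-measure is the usual harmonic mean of precision and recall. The function names `density`, `expected_value_similarity`, and `f_measure` are hypothetical, chosen for illustration only.

```python
from math import sqrt


def density(doc_length, matched_spans):
    """Fraction of the document's characters covered by recognized values.

    matched_spans is a list of (start, end) character offsets, end exclusive.
    Overlapping spans are counted once. (Assumed representation.)
    """
    covered = 0
    last_end = 0
    for start, end in sorted(matched_spans):
        start = max(start, last_end)  # skip the part already counted
        if end > start:
            covered += end - start
            last_end = end
    return covered / doc_length if doc_length else 0.0


def expected_value_similarity(observed, expected):
    """Cosine similarity between observed and expected counts per kind of value.

    Cosine similarity is an assumption made for this sketch; the abstract
    only says that observed and expected numbers and kinds are compared.
    """
    kinds = set(observed) | set(expected)
    dot = sum(observed.get(k, 0) * expected.get(k, 0) for k in kinds)
    norm_obs = sqrt(sum(v * v for v in observed.values()))
    norm_exp = sqrt(sum(v * v for v in expected.values()))
    return dot / (norm_obs * norm_exp) if norm_obs and norm_exp else 0.0


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0
```

Under these assumptions, a 100-character document with recognized values at offsets 0-10, 5-15, and 50-60 has a density of 0.25, and precision and recall of 0.95 each give an F-measure of 0.95. In the approach described above, measurements like these would serve as input features for the machine-learned rules rather than as a decision on their own.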