Categorisation of web documents using extraction ontologies

Authors:
Li Xu; David W. Embley
Affiliations:
Department of Computer Science, University of Arizona South, 1140 N Colombo Ave., Sierra Vista, AZ 85635, USA; Department of Computer Science, Brigham Young University, Provo, Utah, USA
Venue:
International Journal of Metadata, Semantics and Ontologies
Year:
2008

Abstract

Automatically recognising which HTML documents on the Web contain items of interest for a user is non-trivial. As a step toward solving this problem, we propose an approach based on information-extraction ontologies. Given HTML documents, tables, and forms, our document recognition system extracts expected ontological vocabulary (keywords and keyword phrases) and expected ontological instance data (particular values for ontological concepts). We then use machine-learned rules over this extracted information to determine whether an HTML document contains items of interest. Experimental results show that our ontological approach to categorisation works well, having achieved F-measures above 90% for all applications we tried.
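
The pipeline described in the abstract (extract ontological vocabulary and instance data, then apply a rule over the extracted information) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy car-advertisement ontology `CAR_AD_ONTOLOGY`, the regular expressions, the feature names, and the thresholds in `contains_items_of_interest`, which is a hand-written stand-in for a rule that the described system would learn from training data.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical toy ontology: keyword patterns (vocabulary) and
# value patterns (instance data) for a car-advertisement application.
CAR_AD_ONTOLOGY = {
    "keywords": [r"\bmileage\b", r"\bmiles\b", r"\basking\b", r"\bautomatic\b"],
    "values": {
        "Year": r"\b(?:19|20)\d{2}\b",
        "Price": r"\$\s?\d{1,3}(?:,\d{3})*",
        "Make": r"\b(?:Ford|Honda|Toyota|Nissan)\b",
    },
}


def extract_features(html, ontology):
    """Count keyword matches and per-concept value matches in the page text."""
    text = html_to_text(html)
    n_words = max(len(text.split()), 1)  # avoid division by zero on empty pages
    keyword_hits = sum(
        len(re.findall(p, text, re.IGNORECASE)) for p in ontology["keywords"]
    )
    value_hits = {
        concept: len(re.findall(p, text, re.IGNORECASE))
        for concept, p in ontology["values"].items()
    }
    return {
        "keyword_density": keyword_hits / n_words,
        "concepts_found": sum(1 for n in value_hits.values() if n > 0),
        "value_hits": value_hits,
    }


def contains_items_of_interest(features):
    """Hand-written placeholder for a machine-learned rule (thresholds are assumed)."""
    return features["concepts_found"] >= 2 and features["keyword_density"] > 0.02


ad_page = "<html><body><p>2004 Honda Civic, automatic, 62,000 miles. Asking $5,500.</p></body></html>"
other_page = "<html><body><p>Our department offers courses in databases and ontologies.</p></body></html>"

ad_features = extract_features(ad_page, CAR_AD_ONTOLOGY)
other_features = extract_features(other_page, CAR_AD_ONTOLOGY)
```

In this sketch, `contains_items_of_interest(ad_features)` accepts the advertisement page and `contains_items_of_interest(other_features)` rejects the unrelated page. In a learned setting, the fixed thresholds would be replaced by rules induced from labelled example documents.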