OntoMiner: automated metadata and instance mining from news websites

Authors:
Hasan Davulcu;Srinivas Vadrevu;Saravanakumar Nagarajan
Affiliations:
Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287-8809, USA.;Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287-8809, USA.;Convera Corporation, 1808 Aston Avenue, Carlsbad, CA-92008, USA
Venue:
International Journal of Web and Grid Services
Year:
2005

Citing 25
Cited 1

Template-based wrappers in the TSIMMIS system

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
Information storage and retrieval

Information storage and retrieval
A scalable comparison-shopping agent for the World-Wide Web

AGENTS '97 Proceedings of the first international conference on Autonomous agents
Extracting schema from semistructured data

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Ontology-based extraction and structuring of information from data-rich unstructured documents

Proceedings of the seventh international conference on Information and knowledge management
Learning to extract symbolic knowledge from the World Wide Web

AAAI '98/IAAI '98 Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence
A hierarchical approach to wrapper induction

Proceedings of the third annual conference on Autonomous Agents
DTD inference for views of XML data

PODS '00 Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
XTRACT: a system for extracting document type descriptors from XML documents

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
A machine learning based approach for table detection on the web

Proceedings of the 11th international conference on World Wide Web
Learning to Match the Schemas of Data Sources: A Multistrategy Approach

Machine Learning
Ontology Learning and Its Application to Automated Terminology Translation

IEEE Intelligent Systems
WebOQL: Restructuring Documents, Databases, and Webs

ICDE '98 Proceedings of the Fourteenth International Conference on Data Engineering
RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Proceedings of the 27th International Conference on Very Large Data Bases
Four Steps Towards the Widespread Adoption of a Semantic Web

ISWC '02 Proceedings of the First International Semantic Web Conference on The Semantic Web
PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment

Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence
Efficiently mining frequent trees in a forest

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
SemTag and seeker: bootstrapping the semantic web via automated semantic annotation

WWW '03 Proceedings of the 12th international conference on World Wide Web
Representing and reasoning about mappings between domain models

Eighteenth national conference on Artificial intelligence
Extracting structured data from Web pages

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
HTML Page Analysis Based on Visual Cues

ICDAR '01 Proceedings of the Sixth International Conference on Document Analysis and Recognition
Reverse Engineering for Web Data: From Visual to Semantic Structures

ICDE '02 Proceedings of the 18th International Conference on Data Engineering
Web-scale information extraction in knowitall: (preliminary results)

Proceedings of the 13th international conference on World Wide Web
Mining tables from large scale HTML texts

COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 1
Using the structure of Web sites for automatic segmentation of tables

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data

Incremental Ontology-Based Extraction and Alignment in Semi-structured Documents

DEXA '09 Proceedings of the 20th International Conference on Database and Expert Systems Applications

Quantified Score

Hi-index	0.00

Visualization

Abstract

RDF/XML has been widely recognised as the standard for annotating online web documents and for transforming the HTML web into the so-called Semantic Web. In order to enable widespread usability of the Semantic Web, there is a need to bootstrap large, rich and up-to-date domain ontologies that organise the most relevant concepts, their relationships and instances. In this paper, we present automated techniques for bootstrapping and populating specialised domain ontologies by organising and mining a set of relevant overlapping websites. We develop algorithms that detect and utilise HTML regularities in the web documents to turn them into hierarchical semantic structures encoded as XML. Next, we present tree-mining algorithms that identify key domain concepts and their taxonomical relationships. We also extract semi-structured concept instances annotated with their labels whenever they are available. We also report experimental evaluation for the news, travel and shopping domains to demonstrate the efficacy of our algorithms.