Towards a wrapper-driven ontology-based framework for knowledge extraction

Authors:
Jigui Sun;Xi Bai;Zehai Li;Haiyan Che;Huawen Liu
Affiliations:
College of Computer Science and Technology, Jilin University, Changchun, China and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Changchun, China (all five authors)
Venue:
KSEM'07 Proceedings of the 2nd international conference on Knowledge science, engineering and management
Year:
2007

Citing 14

Suffix arrays: a new method for on-line string searches
SIAM Journal on Computing

Information extraction
Communications of the ACM

Database techniques for the World-Wide Web: a survey
ACM SIGMOD Record

A flexible learning system for wrapping tables and lists in HTML documents
Proceedings of the 11th international conference on World Wide Web

Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval
ECML '98 Proceedings of the 10th European Conference on Machine Learning

Table extraction using conditional random fields
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval

Web data extraction based on partial tree alignment
WWW '05 Proceedings of the 14th international conference on World Wide Web

Annotating information structures in Chinese texts using HowNet
CLPW '00 Proceedings of the second workshop on Chinese language processing: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 12

L-tree match: a new data extraction model and algorithm for huge text stream with noises
Journal of Computer Science and Technology

Web wrapper validation
APWeb'03 Proceedings of the 5th Asia-Pacific web conference on Web technologies and applications

WetDL: a web information extraction language
ADVIS'04 Proceedings of the Third international conference on Advances in Information Systems

Ontology-driven information extraction with ontosyphon
ISWC'06 Proceedings of the 5th international conference on The Semantic Web

Towards knowledge acquisition from information extraction
ISWC'06 Proceedings of the 5th international conference on The Semantic Web

Integrating data from the web by machine-learning tree-pattern queries
ODBASE'06/OTM'06 Proceedings of the 2006 Confederated international conference on On the Move to Meaningful Internet Systems: CoopIS, DOA, GADA, and ODBASE - Volume Part I

Cited 2

Free-text search versus complex web forms
ECIR'11 Proceedings of the 33rd European conference on Advances in information retrieval

Free-text search over complex web forms
IRFC'11 Proceedings of the Second international conference on Multidisciplinary information retrieval facility

Quantified Score

Hi-index	0.00

Abstract

Because Web resources are formatted in diverse ways for human viewing, the accuracy of information extraction is not satisfactory, and information extracted by traditional techniques is inconvenient for users to query. This paper proposes WebKER, a wrapper-driven system that extracts knowledge from Chinese Web pages based on domain ontologies. Wrappers are first learned using suffix arrays. A novel approach based on HowNet then automatically aligns the raw data extracted by the wrappers. Knowledge is then generated and described with Resource Description Framework (RDF) statements and, after being merged, is added to the Knowledge Base (KB). A prototype of WebKER has been implemented. The experiments report the performance of the system and compare querying information stored in the KB with querying information extracted by traditional techniques; the results indicate the superiority of the system. In addition, the evaluation of the outstanding wrapper and the method for merging knowledge are also presented.
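The abstract does not say how suffix arrays are used to learn wrappers. As a rough illustration only, the sketch below builds a suffix array over a page's token sequence and uses it to find the longest token run that repeats, which is one plausible way to spot the record template in a list-like page. The function names, the `TEXT` placeholder token, and the sample page are all assumptions made for this example; they are not taken from WebKER.

```python
from typing import List, Sequence, Tuple


def build_suffix_array(tokens: Sequence[str]) -> List[int]:
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction that sorts the suffixes directly; adequate for a
    short token sequence, not for large inputs.
    """
    return sorted(range(len(tokens)), key=lambda i: list(tokens[i:]))


def longest_repeated_run(tokens: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Find the longest token run that occurs at least twice.

    Returns the run and the two start positions where it was found.
    Overlapping occurrences are not filtered out in this sketch.
    """
    sa = build_suffix_array(tokens)
    best_len = 0
    best_pos: List[int] = []
    # Suffixes that share a long prefix end up adjacent in the sorted order,
    # so comparing neighbours is enough to find the longest repeat.
    for a, b in zip(sa, sa[1:]):
        k = 0
        while (a + k < len(tokens) and b + k < len(tokens)
               and tokens[a + k] == tokens[b + k]):
            k += 1
        if k > best_len:
            best_len = k
            best_pos = sorted((a, b))
    if not best_pos:
        return [], []
    start = best_pos[0]
    return list(tokens[start:start + best_len]), best_pos


# Hypothetical tokenized page: tags kept as-is, text nodes abstracted to TEXT.
page = ["<ul>",
        "<li>", "<b>", "TEXT", "</b>", "TEXT", "</li>",
        "<li>", "<b>", "TEXT", "</b>", "TEXT", "</li>",
        "</ul>"]

pattern, positions = longest_repeated_run(page)
print(pattern, positions)
```

In this toy page the repeated run corresponds to one list item, so its tag sequence could serve as a candidate extraction template. A real wrapper learner would also need to handle noise, optional fields, and nested records, which this sketch does not attempt.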