Record extraction based on user feedback and document selection

Authors:
Jianwei Zhang;Yoshiharu Ishikawa;Hiroyuki Kitagawa
Affiliations:
Department of Computer Science, Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, Ibaraki, Japan;Information Technology Center, Nagoya University, Nagoya, Aichi, Japan;Department of Computer Science, Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, Ibaraki, Japan and Center for Computational Sciences, University of Tsukuba, ...
Venue:
APWeb/WAIM'07 Proceedings of the joint 9th Asia-Pacific web and 8th international conference on web-age information management conference on Advances in data and web management
Year:
2007

Abstract

In recent years, research on record extraction from large document collections has become popular. However, some problems remain. First, when a large document collection is the target of information extraction, the process usually becomes very expensive. Second, the extracted records may not pertain to the topic the user is interested in. To address these problems, we propose a method that efficiently extracts those records whose topics agree with the user's interest. To improve the efficiency of the information extraction system, our method identifies the documents from which useful records are likely to be extracted. We use user feedback on extraction results to find topic-related documents and records. Our experiments show that our system achieves high extraction accuracy across different extraction targets.
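
The abstract does not spell out the selection algorithm, so the following is only an illustrative sketch of the general idea of feedback-driven document selection, not the authors' method. It assumes a simple term-weight profile built from records the user accepted or rejected, and ranks unprocessed documents by their average profile weight so that extraction runs only on the most promising ones. All function names (`build_profile`, `score_document`, `select_documents`) and the scoring scheme are hypothetical.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_profile(accepted_records, rejected_records):
    """Build a term-weight profile from user feedback (hypothetical scheme).

    Terms in accepted records gain weight; terms in rejected records lose it.
    """
    profile = Counter()
    for record in accepted_records:
        profile.update(tokenize(record))
    for record in rejected_records:
        profile.subtract(tokenize(record))
    return profile


def score_document(document, profile):
    """Average profile weight over the document's tokens (0.0 if empty)."""
    tokens = tokenize(document)
    if not tokens:
        return 0.0
    return sum(profile[t] for t in tokens) / len(tokens)


def select_documents(documents, profile, k):
    """Return up to k documents with positive scores, best first."""
    scored = [(score_document(d, profile), d) for d in documents]
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]
```

In this sketch, the expensive extraction step would be applied only to the documents returned by `select_documents`, and the profile would be rebuilt after each round of user feedback on the newly extracted records.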