Testbed for information extraction from deep web

Authors:
Yasuhiro Yamada;Nick Craswell;Tetsuya Nakatoh;Sachio Hirokawa
Affiliations:
Kyushu University, Fukuoka, Japan;CSIRO Mathematical and Information Sciences, Canberra, Australia;Kyushu University, Fukuoka, Japan;Kyushu University, Fukuoka, Japan
Venue:
Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters
Year:
2004



Abstract

Search results generated by searchable databases are served dynamically, and together they are far larger than the collection of static documents on the Web. These results pages have been referred to as the Deep Web. To integrate results from different searchable databases, we need to extract the target data from their results pages. We propose a testbed for information extraction from search results. We chose 100 databases at random from 114,540 pages with search forms, so the selected databases show good variety. From these we selected 51 databases whose results pages include URLs, and manually identified the target information to be extracted. We also suggest evaluation measures for comparing extraction methods, and methods for extending the target data.
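
The abstract does not say which evaluation measures the testbed uses. As an illustration only, the sketch below assumes set-based precision and recall, a common way to compare an extractor's output against manually identified target data. The function name `precision_recall` and the example URLs are hypothetical and do not come from the paper.

```python
def precision_recall(extracted, target):
    """Compare extracted items against manually identified target items.

    Assumption: items are compared by exact match as sets; the paper's
    own measures may differ.
    """
    extracted_set = set(extracted)
    target_set = set(target)
    correct = extracted_set & target_set
    precision = len(correct) / len(extracted_set) if extracted_set else 0.0
    recall = len(correct) / len(target_set) if target_set else 0.0
    return precision, recall


# Hypothetical results page: URLs labeled by hand vs. URLs an extractor returned.
target_urls = [
    "http://example.org/a",
    "http://example.org/b",
    "http://example.org/c",
]
extracted_urls = [
    "http://example.org/a",
    "http://example.org/b",
    "http://example.org/ad",  # a spurious item, e.g. an advertisement link
]

precision, recall = precision_recall(extracted_urls, target_urls)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three extracted URLs are correct and two of the three target URLs are found, so both measures come out at two thirds. Averaging such per-database scores over the 51 labeled databases would be one way to compare extraction methods.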