Extracting data records from query result pages based on visual features

Authors:
Daiyue Weng;Jun Hong;David A. Bell
Affiliations:
School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast, UK;School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast, UK;School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast, UK
Venue:
BNCOD'11 Proceedings of the 28th British national conference on Advances in databases
Year:
2011

Citing 18
Cited 0

IEPAD: information extraction based on pattern discovery

Proceedings of the 10th international conference on World Wide Web
RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Proceedings of the 27th International Conference on Very Large Data Bases
Data extraction and label assignment for web databases

WWW '03 Proceedings of the 12th international conference on World Wide Web
Extracting structured data from Web pages

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Mining data records in Web pages

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Fully automatic wrapper generation for search engines

WWW '05 Proceedings of the 14th international conference on World Wide Web
Web data extraction based on partial tree alignment

WWW '05 Proceedings of the 14th international conference on World Wide Web
ViPER: augmenting automatic information extraction with visual perceptions

Proceedings of the 14th ACM international conference on Information and knowledge management
Simultaneous record detection and attribute labeling in web data extraction

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Automatic extraction of dynamic record sections from search engine result pages

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Structured Data Extraction from the Web Based on Partial Tree Alignment

IEEE Transactions on Knowledge and Data Engineering
Towards domain-independent information extraction from web tables

Proceedings of the 16th international conference on World Wide Web
Instance-based schema matching for web databases by domain-specific query probing

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Extracting data records from the web using tag path clustering

Proceedings of the 18th international conference on World wide web
Table extraction using spatial reasoning on the CSS2 visual box model

AAAI'06 proceedings of the 21st national conference on Artificial intelligence - Volume 2
ViDE: A Vision-Based Approach for Deep Web Data Extraction

IEEE Transactions on Knowledge and Data Engineering
Extracting content structure for web pages based on visual representation

APWeb'03 Proceedings of the 5th Asia-Pacific web conference on Web technologies and applications
NET – a system for extracting web data from flat and nested data records

WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering

Quantified Score

Hi-index	0.00

Visualization

Abstract

Web databases contain a large amount of structured data which are accessible via their query interfaces only. Query results are presented in dynamically generated web pages, usually in the form of data records, for human use. The problem of automatically extracting data records from query result pages is critical for web data integration applications, such as comparison shopping sites, meta-search engines, etc. A number of approaches to query result extraction have been proposed. As the structures of web pages become more complex, these approaches start to fail. Query result pages usually also contain other types of information in addition to query results, e.g., advertisements, navigation bar, etc. Most of the existing approaches do not remove such irrelevant contents which may affect the accuracy of data record extraction. We have observed that query results are usually displayed in regular visual patterns and terms used in a query often re-appear in query results. We propose a novel approach that makes use of visual features and query terms to identify the data section and extract data records from it. We also use several content and visual features of visual blocks in a data section to filter out noisy blocks. The results of our experiments on a large set of query result pages in different domains show that our proposed approach is highly effective.