ViDE: A Vision-Based Approach for Deep Web Data Extraction

Authors:
Wei Liu; Xiaofeng Meng; Weiyi Meng
Affiliations:
Renmin University of China, Beijing; Renmin University of China, Beijing; Binghamton University, Binghamton
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2010

Abstract

Deep Web content is accessed through queries submitted to Web databases, and the returned data records are wrapped in dynamically generated Web pages (called deep Web pages in this paper). Extracting structured data from deep Web pages is a challenging problem because of the intricate underlying structure of such pages. A large number of techniques have been proposed to address this problem, but all of them have inherent limitations because they are Web-page-programming-language-dependent. Because the Web is a popular two-dimensional medium, the contents of Web pages are always displayed in a regular layout for users to browse. This motivates us to seek a different way of extracting deep Web data, one that overcomes the limitations of previous work by exploiting common visual features of deep Web pages. In this paper, we propose a novel vision-based approach that is Web-page-programming-language-independent. The approach primarily uses the visual features of deep Web pages to perform deep Web data extraction, including data record extraction and data item extraction. We also propose a new evaluation measure, revision, to capture the amount of human effort needed to produce perfect extraction. Our experiments on a large set of Web databases show that the proposed vision-based approach is highly effective for deep Web data extraction.