A learning classifier-based approach to aligning data items and labels

Authors:
Neil Anderson; Jun Hong
Affiliations:
School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, UK; School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, UK
Venue:
BNCOD'13 Proceedings of the 29th British National conference on Big Data
Year:
2013

Abstract

Web databases are now pervasive. Query result pages are dynamically generated from these databases in response to user-submitted queries. A query result page contains a number of data records, each of which consists of data items and their labels. In this paper, we focus on the data alignment problem, in which individual data items and labels from different data records on a query result page are aligned into separate columns, each representing a group of semantically similar data items or labels from these data records. We present a new approach to the data alignment problem, in which learning classifiers are trained using supervised learning to align data items and labels. Previous approaches to this problem have relied on heuristics and manually crafted rules, which are difficult to adapt to new page layouts and designs. In contrast, we are motivated to develop learning classifiers that can be easily adapted. We have implemented the proposed learning classifier-based approach in a software prototype, rAligner, and our experimental results have shown that the approach is highly effective.
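
The data alignment problem described in the abstract can be illustrated with a small sketch. It is not the rAligner implementation: the abstract does not say which features or classifiers the authors use, so the feature set, the nearest-centroid classifier, the column names, and the training examples below are all assumptions chosen for illustration. The sketch trains on a few labeled items and then places each item of each record into the column predicted for it.

```python
from collections import defaultdict

# Hypothetical training set: (text, column) pairs labeled by hand.
TRAINING = [
    ("Price:", "price_label"),
    ("Cost:", "price_label"),
    ("$12.99", "price"),
    ("£7.50", "price"),
    ("The Art of Computer Programming", "title"),
    ("Introduction to Information Retrieval", "title"),
]


def features(text):
    """Assumed hand-picked features for one data item or label."""
    stripped = text.strip()
    length = max(len(stripped), 1)
    return (
        sum(ch.isdigit() for ch in stripped) / length,        # fraction of digits
        1.0 if any(ch in "$£€" for ch in stripped) else 0.0,  # currency symbol present
        1.0 if stripped.endswith(":") else 0.0,               # trailing colon, typical of labels
        min(len(stripped.split()), 10) / 10.0,                # token count, capped at 10
    )


def train(examples):
    """Supervised step: compute the mean feature vector of each column."""
    sums = {}
    counts = defaultdict(int)
    for text, column in examples:
        vector = features(text)
        if column not in sums:
            sums[column] = [0.0] * len(vector)
        for i, value in enumerate(vector):
            sums[column][i] += value
        counts[column] += 1
    return {c: tuple(v / counts[c] for v in s) for c, s in sums.items()}


def classify(model, text):
    """Return the column whose centroid is closest to the item's features."""
    vector = features(text)
    return min(
        model,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(vector, model[c])),
    )


def align(model, records):
    """Place every item of every record into its predicted column."""
    columns = sorted(model)
    table = []
    for record in records:
        row = {column: None for column in columns}
        for item in record:
            row[classify(model, item)] = item
        table.append(row)
    return columns, table
```

Given records whose items appear in different orders, such as `["Database Systems Concepts", "Price:", "$45.00"]` and `["Price:", "€19.95", "Mining the Web"]`, `align(train(TRAINING), records)` puts both titles in the same column regardless of position. Adapting such a sketch to a new page layout would mean supplying new labeled examples rather than rewriting rules, which is the kind of adaptability the abstract argues for.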