Wikipedia driven autonomous label assignment in wrapper induced tables with missing column names

Authors:
Mohammad Shafkat Amin;Anupam Bhattacharjee;Hasan Jamil
Affiliations:
Wayne State University;Wayne State University;Wayne State University
Venue:
Proceedings of the 2010 ACM Symposium on Applied Computing
Year:
2010

Abstract

As the volume of information available on the Internet grows exponentially, most of it will have to be processed and digested by computers before it becomes useful for human consumption. Unfortunately, most web content is currently designed for direct human consumption: it assumes that a human reader will decipher the information in context and connect any missing dots. In particular, information presented in tabular form is often not accompanied by descriptive titles or column names analogous to the attribute names of database tables. While such omissions are rarely an issue for humans, they make extraction truly hard for autonomous systems, in which a machine is expected to understand the meaning of a table and extract the right information in the context of a query. The task is even more difficult when the information needed is distributed across the globe and involves semantic heterogeneity. In this paper, our goal is to address the problem of interpreting tables with missing column names by developing a method that assigns attribute names to an arbitrary table extracted from the web in a fully autonomous manner. We propose a novel approach that leverages Wikipedia, for the first time, for column name discovery for the purpose of table annotation. We show that this improves the likelihood of capturing the context and interpretation of the table accurately and of producing a semantically meaningful query response.