Schema extraction for tabular data on the web

Authors:
Marco D. Adelfio; Hanan Samet
Affiliations:
Center for Automation Research, Institute for Advanced Computer Studies, Department of Computer Science, University of Maryland, College Park, MD; Center for Automation Research, Institute for Advanced Computer Studies, Department of Computer Science, University of Maryland, College Park, MD
Venue:
Proceedings of the VLDB Endowment
Year:
2013

Abstract

Tabular data is an abundant source of information on the Web, but remains mostly isolated from the latter's interconnections since tables lack links and computer-accessible descriptions of their structure. In other words, the schemas of these tables -- attribute names, values, data types, etc. -- are not explicitly stored as table metadata. Consequently, the structure that these tables contain is not accessible to the crawlers that power search engines and thus not accessible to user search queries. We address this lack of structure with a new method for leveraging the principles of table construction in order to extract table schemas. Discovering the schema by which a table is constructed is achieved by harnessing the similarities and differences of nearby table rows through the use of a novel set of features and a feature processing scheme. The schemas of these data tables are determined using a classification technique based on conditional random fields in combination with a novel feature encoding method called logarithmic binning, which is specifically designed for the data table extraction task. Our method provides considerable improvement over the well-known WebTables schema extraction method. In contrast with previous work that focuses on extracting individual relations, our method excels at correctly interpreting full tables, thereby being capable of handling general tables such as those found in spreadsheets, instead of being restricted to HTML tables as is the case with the WebTables method. We also extract additional schema characteristics, such as row groupings, which are important for supporting information retrieval tasks on tabular data.
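The abstract names logarithmic binning as its feature encoding but does not define it. The sketch below is an illustration only. It assumes the idea is to bucket per-row cell counts on a log scale, so that rows of different widths produce comparable discrete features for a sequence classifier such as a conditional random field. The names `log_bin`, `encode_row_feature`, and `is_numeric`, and the exact bin formula, are hypothetical and are not taken from the paper.

```python
def log_bin(count: int) -> int:
    """Map a non-negative count to a log-scale bin.

    Assumed scheme: 0 -> 0, 1 -> 1, 2-3 -> 2, 4-7 -> 3, 8-15 -> 4, ...
    For positive integers, int.bit_length() equals floor(log2(n)) + 1.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    return count.bit_length()


def is_numeric(cell: str) -> bool:
    """Hypothetical cell-level test: does the cell look like a plain number?"""
    stripped = cell.replace(",", "").replace(".", "", 1)
    return stripped.isdigit()


def encode_row_feature(cells, predicate):
    """Encode one row-level feature as a pair of bins.

    The pair holds the bin for the number of cells that satisfy the
    predicate and the bin for the number that do not.
    """
    matches = sum(1 for cell in cells if predicate(cell))
    non_matches = len(cells) - matches
    return (log_bin(matches), log_bin(non_matches))


header_row = ["Year", "Population", "Area"]
data_row = ["2010", "5773552", "12407"]
wide_row = ["Total", "1", "2", "3", "4", "5"]

header_feature = encode_row_feature(header_row, is_numeric)
data_feature = encode_row_feature(data_row, is_numeric)
wide_feature = encode_row_feature(wide_row, is_numeric)
```

Under these assumptions the header row and the data row receive clearly different encodings, which is the kind of row-to-row contrast the abstract says the method exploits. A real system would compute many such features per row and pass the resulting sequence to the classifier.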