Learning table extraction from examples

Authors:
Ashwin Tengli;Yiming Yang;Nian Li Ma
Affiliations:
Carnegie Mellon University, Pittsburgh, PA;Carnegie Mellon University, Pittsburgh, PA;Carnegie Mellon University, Pittsburgh, PA
Venue:
COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Year:
2004

Abstract

Information extraction from tables in web pages is a challenging problem due to the diverse nature of table formats and the vocabulary variants in attribute names. This paper presents a new approach to automated table extraction that exploits formatting cues in semi-structured HTML tables, learns lexical variants from training examples, and uses a vector space model to deal with non-exact matches among labels. We conducted experiments with this method on a set of tables collected from 157 university web sites and obtained an information extraction performance of 91.4% in the F1-measure, showing the effectiveness of the combined use of structural table parsing and example-based label learning.
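
The abstract does not say how the vector space model is built, so the following is only a minimal sketch of the general idea: a table label is matched to a known attribute by cosine similarity between term-frequency vectors. The function names (`tokenize`, `cosine`, `best_attribute`), the `threshold` value, and the example attributes and labels are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(label):
    """Lowercase a label and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", label.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_attribute(label, examples, threshold=0.3):
    """Return the attribute whose example labels are most similar to `label`.

    `examples` maps an attribute name to label variants seen in training.
    Returns None when no attribute reaches `threshold` (an assumed cutoff).
    """
    query = Counter(tokenize(label))
    best_name, best_score = None, 0.0
    for name, variants in examples.items():
        # Pool all known variants of an attribute into one vector.
        profile = Counter(tok for v in variants for tok in tokenize(v))
        score = cosine(query, profile)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Hypothetical training examples: attribute name -> observed label variants.
examples = {
    "tuition": ["tuition", "tuition fees", "tuition and fees"],
    "housing": ["room and board", "housing", "room"],
}

print(best_attribute("Tuition & Fees (per year)", examples))
print(best_attribute("Room/Board", examples))
print(best_attribute("Parking", examples))
```

In this sketch, a label that shares no tokens with any learned variant falls below the threshold and is left unmatched, which is one simple way to handle non-exact matches without forcing every label onto an attribute.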