Towards domain-independent information extraction from web tables

Authors:
Wolfgang Gatterbauer;Paul Bohunsky;Marcus Herzog;Bernhard Krüpl;Bernhard Pollak
Affiliations:
Vienna University of Technology, Vienna, Austria;Vienna University of Technology, Vienna, Austria;Vienna University of Technology, Vienna, Austria;Vienna University of Technology, Vienna, Austria;Vienna University of Technology, Vienna, Austria
Venue:
Proceedings of the 16th international conference on World Wide Web
Year:
2007


Abstract

Traditionally, information extraction from web tables has focused on small, more or less homogeneous corpora, often based on assumptions about the use of <table> tags. The multitude of different HTML implementations of web tables makes these approaches difficult to scale. In this paper, we approach the problem of domain-independent information extraction from web tables by shifting our attention from the tree-based representation of webpages to a variation of the two-dimensional visual box model used by web browsers to display the information on the screen. The topological and style information thereby obtained allows us to fill the gap created by missing domain-specific knowledge about content and table templates. We believe that, in a future step, this approach can become the basis for a new way of large-scale knowledge acquisition from the current "Visual Web."
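
To illustrate the general idea of working from rendered box positions rather than from the tag tree, the following is a minimal sketch and not the authors' algorithm. It assumes that each piece of content has already been reduced to a rectangle with screen coordinates; the `Box` type and the `bands` and `to_grid` helpers are hypothetical names introduced only for this example. Boxes whose vertical extents overlap are treated as one row, and boxes whose horizontal extents overlap as one column, regardless of which HTML elements produced them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A rendered rectangle (hypothetical type, coordinates in pixels)."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def bands(boxes, lo, hi):
    """Group boxes into bands of overlapping [lo, hi) intervals along one axis."""
    groups = []
    band_end = None
    for box in sorted(boxes, key=lo):
        if band_end is None or lo(box) >= band_end:
            # No overlap with the current band: start a new one.
            groups.append([box])
            band_end = hi(box)
        else:
            # Overlaps the current band: join it and extend its end if needed.
            groups[-1].append(box)
            band_end = max(band_end, hi(box))
    return groups


def to_grid(boxes):
    """Arrange boxes in a row/column grid using only their positions."""
    rows = bands(boxes, lambda b: b.top, lambda b: b.bottom)
    cols = bands(boxes, lambda b: b.left, lambda b: b.right)
    row_of = {b: r for r, group in enumerate(rows) for b in group}
    col_of = {b: c for c, group in enumerate(cols) for b in group}
    grid = [[None] * len(cols) for _ in rows]
    for b in boxes:
        grid[row_of[b]][col_of[b]] = b.text
    return grid


if __name__ == "__main__":
    # Four boxes laid out as a 2x2 table; no <table> markup is assumed.
    boxes = [
        Box("Model", 0, 0, 50, 20),
        Box("Price", 60, 0, 100, 20),
        Box("A1", 0, 25, 50, 45),
        Box("$10", 60, 25, 100, 45),
    ]
    for row in to_grid(boxes):
        print(row)
```

Because the grouping relies only on geometry, the same sketch would apply whether the page lays out its cells with table markup, nested block elements, or absolute positioning. A realistic system would also need to handle spanning cells, nested tables, and style cues, which this example ignores.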