A machine learning based approach for table detection on the web

Authors:
Yalin Wang; Jianying Hu
Affiliations:
Univ. of Washington, Seattle, WA; Avaya Labs Research, Basking Ridge, NJ
Venue:
Proceedings of the 11th international conference on World Wide Web
Year:
2002

Abstract

Tables are a commonly used presentation scheme, especially for describing relational information. However, table understanding remains an open problem. In this paper, we consider the problem of table detection in web documents. Its potential applications include web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. We describe a machine learning based approach to classify each given table entity as either genuine or non-genuine. Various features reflecting the layout as well as the content characteristics of tables are studied.

To facilitate the training and evaluation of our table classifier, we designed a novel web document table ground truthing protocol and used it to build a large table ground truth database. The database consists of 1,393 HTML files collected from hundreds of different web sites and contains 11,477 leaf TABLE elements, of which 1,740 are genuine tables. Experiments were conducted using cross validation, and an F-measure of 95.89% was achieved.
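Two terms in the abstract can be made concrete with a short sketch. The code below is illustrative only and is not the authors' implementation. It assumes that a "leaf TABLE element" is a TABLE element containing no nested TABLE, and that the reported F-measure is the balanced form (the harmonic mean of precision and recall); the abstract states neither definition explicitly. The names LeafTableCounter, count_leaf_tables, and f_measure are hypothetical.

```python
from html.parser import HTMLParser


class LeafTableCounter(HTMLParser):
    """Counts <table> elements that contain no nested <table> (assumed definition of 'leaf')."""

    def __init__(self):
        super().__init__()
        # One flag per currently open <table>; set to True once a nested table is seen.
        self._stack = []
        self.leaf_count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TABLE> and <table> both arrive as "table".
        if tag == "table":
            if self._stack:
                self._stack[-1] = True  # the enclosing table is not a leaf
            self._stack.append(False)

    def handle_endtag(self, tag):
        if tag == "table" and self._stack:
            has_nested = self._stack.pop()
            if not has_nested:
                self.leaf_count += 1


def count_leaf_tables(html):
    """Return the number of leaf TABLE elements in an HTML string."""
    parser = LeafTableCounter()
    parser.feed(html)
    parser.close()
    return parser.leaf_count


def f_measure(true_pos, false_pos, false_neg):
    """Balanced F-measure (assumed form): harmonic mean of precision and recall."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, count_leaf_tables would be applied to each HTML file to enumerate the candidate tables, and f_measure would be computed from the classifier's counts of correctly detected, falsely detected, and missed genuine tables in each cross-validation fold.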