Using the structure of Web sites for automatic segmentation of tables

Authors:
Kristina Lerman;Lise Getoor;Steven Minton;Craig Knoblock
Affiliations:
USC Information Sciences Institute, Marina del Rey, CA;University of Maryland, College Park, MD;Fetch Technologies, Manhattan Beach, CA;USC Information Sciences Institute, Marina del Rey, CA
Venue:
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Year:
2004

Abstract

Many Web sites, especially those that dynamically generate HTML pages to display the results of a user's query, present information in the form of lists or tables. Current tools that allow applications to programmatically extract this information rely heavily on user input, often in the form of labeled extracted records. The sheer size and rate of growth of the Web make any solution that relies primarily on user input infeasible in the long term. Fortunately, many Web sites contain much explicit and implicit structure, both in layout and content, that we can exploit for the purpose of information extraction. This paper describes an approach to automatic extraction and segmentation of records from Web tables. Automatic methods do not require any user input, but rely solely on the layout and content of the Web source. Our approach relies on the common structure of many Web sites, which present information as a list or a table, with a link in each entry leading to a detail page containing additional information about that item. We describe two algorithms that use redundancies in the content of table and detail pages to aid in information extraction. The first algorithm encodes additional information provided by detail pages as constraints and finds the segmentation by solving a constraint satisfaction problem. The second algorithm uses probabilistic inference to find the record segmentation. We show how each approach can exploit the Web site structure in a general, domain-independent manner, and we demonstrate the effectiveness of each algorithm on a set of twelve Web sites.
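
The constraint-based idea can be illustrated with a small sketch. It is not the paper's implementation: the function name `segment`, the two constraints, the sample data, and the use of plain backtracking search are all assumptions made for illustration. Each text extract from the table page must be assigned to a record whose detail page also contains that extract, and record indices may never decrease along the table.

```python
from typing import List, Optional, Set


def segment(extracts: List[str], detail_pages: List[Set[str]]) -> Optional[List[int]]:
    """Assign each table extract to a record index, or return None if impossible.

    Illustrative constraints (assumed, not taken from the paper):
      1. an extract may only be assigned to a record whose detail page contains it;
      2. record indices never decrease along the table, so each record is a
         contiguous run of extracts in detail-page order.
    """
    n, k = len(extracts), len(detail_pages)

    def backtrack(i: int, current: int, assignment: List[int]) -> Optional[List[int]]:
        if i == n:
            return assignment  # every extract has been placed
        # Try the current record first, then any later record.
        for r in range(current, k):
            if extracts[i] in detail_pages[r]:
                result = backtrack(i + 1, r, assignment + [r])
                if result is not None:
                    return result
        return None  # no consistent assignment from this state

    return backtrack(0, 0, [])


# Hypothetical data: a two-row table and the text found on each row's detail page.
extracts = ["John Smith", "New York", "555-1234", "Jane Doe", "New York", "555-9876"]
detail_pages = [
    {"John Smith", "New York", "555-1234", "john@example.com"},
    {"Jane Doe", "New York", "555-9876"},
]

print(segment(extracts, detail_pages))  # [0, 0, 0, 1, 1, 1]
```

Note that "New York" appears on both detail pages, so content alone does not decide its record; the ordering constraint resolves the ambiguity. Real pages are noisier than this toy input, and an inconsistent input simply yields `None` here.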