Joint unsupervised structure discovery and information extraction

Authors:
Eli Cortez;Daniel Oliveira;Altigran S. da Silva;Edleno S. de Moura;Alberto H.F. Laender
Affiliations:
Universidade Federal do Amazonas, Manaus, Brazil;Universidade Federal do Amazonas, Manaus, Brazil;Universidade Federal do Amazonas, Manaus, Brazil;Universidade Federal do Amazonas, Manaus, Brazil;Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Venue:
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Year:
2011

Abstract

In this paper we present JUDIE (Joint Unsupervised Structure Discovery and Information Extraction), a new method for automatically extracting semi-structured data records that appear as continuous text (e.g., bibliographic citations, postal addresses, and classified ads) with no explicit delimiters between them. While state-of-the-art Information Extraction methods require the user to manually supply the structure of the data records as a training step, JUDIE is capable of detecting the structure of each individual record being extracted without any user assistance. This is accomplished by a novel Structure Discovery algorithm that, given a sequence of labels representing attributes assigned to potential values, groups these labels into individual records by looking for frequent patterns of label repetitions in the given sequence. We also show how to integrate this algorithm into the information extraction process by means of successive refinement steps that alternate information extraction and structure discovery. Through an extensive experimental evaluation with different datasets in distinct domains, we compare JUDIE with state-of-the-art information extraction methods and conclude that, even without any user intervention, it is able to achieve high-quality results on the tasks of discovering the structure of the records and extracting information from them.
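
To make the idea of grouping a label sequence into records more concrete, the following is a minimal sketch and not the Structure Discovery algorithm from the paper, whose details the abstract does not give. It assumes a much simpler heuristic: each record contains a given attribute at most once, so seeing a label that already occurs in the current record is taken as the start of a new record. The function name `split_records` and the sample labels are hypothetical and chosen only for illustration.

```python
from typing import List


def split_records(labels: List[str]) -> List[List[str]]:
    """Group a flat sequence of attribute labels into records.

    Simplifying assumption (not taken from the paper): an attribute
    appears at most once per record, so a repeated label marks the
    boundary between two records.
    """
    records: List[List[str]] = []
    current: List[str] = []
    for label in labels:
        if label in current:
            # The label repeats, so close the current record and start a new one.
            records.append(current)
            current = []
        current.append(label)
    if current:
        # Flush the last, still-open record.
        records.append(current)
    return records


# Hypothetical label sequence for two bibliographic citations.
sequence = ["author", "title", "year", "author", "title", "venue", "year"]
for record in split_records(sequence):
    print(record)
```

On this sample input the sketch yields one record with author, title, and year, and a second with author, title, venue, and year. A method such as the one described above would additionally have to cope with labeling errors and with attributes that legitimately repeat inside a record, which this heuristic ignores.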