ONDUX: on-demand unsupervised learning for information extraction

Authors:
Eli Cortez;Altigran S. da Silva;Marcos André Gonçalves;Edleno S. de Moura
Affiliations:
Universidade Federal do Amazonas, Manaus, Brazil;Universidade Federal do Amazonas, Manaus, Brazil;Universidade Federal de Minas Gerais, Belo Horizonte, Brazil;Universidade Federal do Amazonas, Manaus, Brazil
Venue:
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Year:
2010

Citing 15
Cited 7

Probabilistic reasoning in intelligent systems: networks of plausible inference

Probabilistic reasoning in intelligent systems: networks of plausible inference
Automatic segmentation of text into structured records

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Transductive Inference for Text Classification using Support Vector Machines

ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
Information Extraction with HMM Structures Learned by Stochastic Optimization

Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence
Mining reference tables for automatic text segmentation

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Integrating Unstructured Data into Relational Databases

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Answering Imprecise Queries over Autonomous Web Databases

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Information extraction from research papers using conditional random fields

Information Processing and Management: an International Journal
LABRADOR: Efficiently publishing relational databases on the web by using keyword-based query interfaces

Information Processing and Management: an International Journal
FLUX-CIM: flexible unsupervised extraction of citation metadata

Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Context-aware wrapping: synchronized data extraction

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Information Extraction

Foundations and Trends in Databases
A flexible approach for extracting metadata from bibliographic citations

Journal of the American Society for Information Science and Technology
Reinforcement learning: a survey

Journal of Artificial Intelligence Research

Unsupervised strategies for information extraction by text segmentation

Proceedings of the Fourth SIGMOD PhD Workshop on Innovative Database Research
A probabilistic approach for automatically filling form-based web interfaces

Proceedings of the VLDB Endowment
Harvesting relational tables from lists on the web

The VLDB Journal — The International Journal on Very Large Data Bases
Joint unsupervised structure discovery and information extraction

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Semi-supervised multi-task learning of structured prediction models for web information extraction

Proceedings of the 20th ACM international conference on Information and knowledge management
Web-based closed-domain data extraction on online advertisements

Information Systems
Exploiting a proximity-based positional model to improve the quality of information extraction by text segmentation

ADC '13 Proceedings of the Twenty-Fourth Australasian Database Conference - Volume 137

Quantified Score

Hi-index	0.00

Visualization

Abstract

Information extraction by text segmentation (IETS) applies to cases in which data values of interest are organized in implicit semi-structured records available in textual sources (e.g. postal addresses, bibliographic information, ads). It is an important practical problem that has been frequently addressed in the recent literature. In this paper we introduce ONDUX (On Demand Unsupervised Information Extraction), a new unsupervised probabilistic approach for IETS. As other unsupervised IETS approaches, ONDUX relies on information available on pre-existing data to associate segments in the input string with attributes of a given domain. Unlike other approaches, we rely on very effective matching strategies instead of explicit learning strategies. The effectiveness of this matching strategy is also exploited to disambiguate the extraction of certain attributes through a reinforcement step that explores sequencing and positioning of attribute values directly learned on-demand from test data, with no previous human-driven training, a feature unique to ONDUX. This assigns to ONDUX a high degree of flexibility and results in superior effectiveness, as demonstrated by the experimental evaluation we report with textual sources from different domains, in which ONDUX is compared with a state-of-art IETS approach.