Enabling information extraction by inference of regular expressions from sample entities

Authors:
Falk Brauer; Robert Rieger; Adrian Mocan; Wojciech M. Barczynski
Affiliations:
SAP AG, Dresden, Germany; SAP AG, Dresden, Germany; SAP AG, Dresden, Germany; SAP AG, Dresden, Germany
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

Abstract

Regular expressions are the dominant technique for extracting business-relevant entities (e.g., invoice numbers or product names) from text data (e.g., invoices), since these entity types often follow a strict underlying syntactical pattern. However, constructing regular expressions that guarantee high recall and precision by hand is tedious and requires expert knowledge. In this paper, we propose an approach that automatically infers regular expressions from a set of (positive) sample entities, which in turn can be derived either from enterprise databases (e.g., a product catalog) or from annotated documents (e.g., historical invoices). The main innovation of our approach is that it learns effective regular expressions that a user can easily interpret and modify. The effectiveness is obtained by a novel method that weights dependent entity features of different granularity (i.e., at the character and token level) against each other and selects the most suitable ones to form a regular expression.
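
To make the idea of inferring a regular expression from positive samples concrete, the following is a minimal sketch. It is not the method proposed in the paper: it uses character-level features only, applies no feature weighting and no token-level features, and assumes that all samples share the same sequence of character classes. The function names (`char_class`, `signature`, `infer_regex`) and the invoice-number samples are hypothetical and chosen for illustration.

```python
import re


def char_class(ch: str) -> str:
    """Map a character to a coarse class (an assumed, simplified feature set)."""
    if "0" <= ch <= "9":
        return "[0-9]"
    if "A" <= ch <= "Z":
        return "[A-Z]"
    if "a" <= ch <= "z":
        return "[a-z]"
    # Any other character is kept as an escaped literal.
    return re.escape(ch)


def signature(sample: str) -> list:
    """Run-length encode a sample as [[class, run_length], ...]."""
    runs = []
    for ch in sample:
        cls = char_class(ch)
        if runs and runs[-1][0] == cls:
            runs[-1][1] += 1
        else:
            runs.append([cls, 1])
    return runs


def infer_regex(samples: list) -> str:
    """Generalize positive samples into one regex with length quantifiers."""
    sigs = [signature(s) for s in samples]
    shape = [cls for cls, _ in sigs[0]]
    # Simplifying assumption: every sample has the same class sequence.
    if any([cls for cls, _ in sig] != shape for sig in sigs):
        raise ValueError("samples do not share one character-class sequence")

    parts = []
    for i, cls in enumerate(shape):
        lengths = [sig[i][1] for sig in sigs]
        lo, hi = min(lengths), max(lengths)
        if lo == hi == 1:
            quant = ""
        elif lo == hi:
            quant = "{%d}" % lo
        else:
            quant = "{%d,%d}" % (lo, hi)
        parts.append(cls + quant)
    return "".join(parts)


if __name__ == "__main__":
    # Hypothetical invoice numbers, e.g. taken from an enterprise database.
    pattern = infer_regex(["INV-2011-0042", "INV-2010-117", "INV-2009-5"])
    print(pattern)
    print(bool(re.fullmatch(pattern, "INV-2012-9")))
```

A pattern built this way stays readable, so a user can still inspect and adjust it by hand. It also tends to over-generalize (the literal prefix `INV` becomes three arbitrary uppercase letters), which illustrates why choosing between features of different granularity matters.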