Regular expression learning for information extraction

Authors:
Yunyao Li; Rajasekar Krishnamurthy; Sriram Raghavan; Shivakumar Vaithyanathan; H. V. Jagadish
Affiliations:
IBM Almaden Research Center, San Jose, CA; IBM Almaden Research Center, San Jose, CA; IBM Almaden Research Center, San Jose, CA; IBM Almaden Research Center, San Jose, CA; University of Michigan, Ann Arbor, MI
Venue:
EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Year:
2008

Abstract

Regular expressions have served as the dominant workhorse of practical information extraction for several years. However, there has been little work on reducing the manual effort involved in building high-quality, complex regular expressions for information extraction tasks. In this paper, we propose ReLIE, a novel transformation-based algorithm for learning such complex regular expressions. We evaluate the performance of our algorithm on multiple datasets and compare it against the conditional random field (CRF) algorithm. We show that ReLIE, in addition to being an order of magnitude faster, outperforms CRF under conditions of limited training data and cross-domain data. Finally, we show how the accuracy of CRF can be improved by using features extracted by ReLIE.
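
The abstract does not spell out how a transformation-based learner operates, so the following is only a rough illustration of the general idea and not the ReLIE algorithm itself. It assumes a greedy hill-climb: start from a broad hand-written regex, try a small set of candidate rewrites, and keep the one that most improves F1 against labeled examples. The function names `f1` and `learn`, the rewrite list, and the toy text are all hypothetical.

```python
import re

def f1(pattern, text, gold):
    """F1 of the distinct strings matched by `pattern`, scored against `gold`."""
    found = {m.group(0) for m in re.finditer(pattern, text)}
    true_pos = len(found & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(found)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)

def learn(initial, rewrites, text, gold):
    """Greedy search: each round, apply the single rewrite with the best F1.

    `rewrites` is a list of (old_fragment, new_fragment) pairs; this simple
    string substitution is an assumption made for the sketch.
    """
    best, best_score = initial, f1(initial, text, gold)
    while True:
        candidates = [best.replace(old, new, 1)
                      for old, new in rewrites if old in best]
        scored = [(f1(c, text, gold), c) for c in candidates]
        if not scored:
            break
        top_score, top = max(scored)
        if top_score <= best_score:  # no rewrite improves F1: stop
            break
        best, best_score = top, top_score
    return best, best_score

# Toy data: two phone numbers (the gold set) plus two distractors.
text = "Call 555-1234 or 555-9876. Order 1234567-99 shipped 2008-10-25."
gold = {"555-1234", "555-9876"}

# Hypothetical rewrites that tighten an unbounded quantifier.
rewrites = [(r"\d+-", r"\d{3}-"), (r"-\d+", r"-\d{4}")]

initial = r"\d+-\d+"
learned, score = learn(initial, rewrites, text, gold)
print(learned, score)  # prints \d+-\d{4} 1.0
```

On this toy input the broad starting pattern also matches the order number and part of the date, so its F1 is about 0.67; restricting the second quantifier removes both false positives, after which no further rewrite helps and the search stops.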