Improving recall of regular expressions for information extraction

Authors:
Karin Murthy;Deepak P.;Prasad M. Deshpande
Affiliations:
IBM Research - India, Bangalore, India;IBM Research - India, Bangalore, India;IBM Research - India, Bangalore, India
Venue:
WISE'12 Proceedings of the 13th international conference on Web Information Systems Engineering
Year:
2012


Abstract

Learning or writing regular expressions that identify instances of a specific concept in text documents with high precision and recall is challenging. Improving the precision of an initial regular expression is relatively easy: one identifies the false positives it covers and tweaks the expression to exclude them. Improving recall is harder because, in the absence of tools for finding missed instances, false negatives can only be identified by manually analyzing all documents. We focus on partially automating the discovery of missing instances while soliciting minimal user feedback. We present a technique for identifying good generalizations of a regular expression that improve recall while retaining high precision. We empirically demonstrate the effectiveness of the proposed technique compared to existing methods and show results for a variety of tasks, such as identifying dates, phone numbers, product names, and course numbers, on real-world datasets.
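
To make the precision/recall trade-off concrete, the following is a minimal sketch of how candidate generalizations of a regular expression could be scored against labelled instances. It is not the technique proposed in the paper. The toy text, the gold set, the three patterns, and the helper name `precision_recall` are all hypothetical and chosen only for illustration.

```python
import re


def precision_recall(pattern, text, gold):
    """Score `pattern` on `text` against `gold`, a set of correct instance strings."""
    found = {m.group(0) for m in re.finditer(pattern, text)}
    true_pos = found & gold
    precision = len(true_pos) / len(found) if found else 0.0
    recall = len(true_pos) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy document with hand-labelled phone numbers (assumption).
TEXT = "Call 555-1234 or 555 9876 before 2012-11-28. Fax: 555.4321"
GOLD = {"555-1234", "555 9876", "555.4321"}

# Initial expression: precise, but it only accepts a hyphen as separator.
INITIAL = r"\b\d{3}-\d{4}\b"
# Candidate generalization: also accepts a dot or a space as separator.
GENERALIZED = r"\b\d{3}[-. ]\d{4}\b"
# Over-generalization: any two digit runs joined by a separator.
TOO_GENERAL = r"\d+[-. ]\d+"

if __name__ == "__main__":
    for name, pattern in [
        ("initial", INITIAL),
        ("generalized", GENERALIZED),
        ("too general", TOO_GENERAL),
    ]:
        p, r = precision_recall(pattern, TEXT, GOLD)
        print(f"{name:12s} precision={p:.2f} recall={r:.2f}")
```

On this toy input, the initial expression finds one of the three phone numbers and produces no false positives. The moderate generalization recovers all three without matching the date, while the overly general pattern also picks up part of the date and loses precision. In practice the gold set is exactly what is missing, which is why the abstract frames the problem around soliciting minimal user feedback.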