AND '10 Proceedings of the fourth workshop on Analytics for noisy unstructured text data
Learning or writing regular expressions that identify instances of a specific concept in text documents with high precision and recall is challenging. Improving the precision of an initial regular expression is relatively easy: one inspects the false positives it matches and tweaks the expression to exclude them. Improving recall is much harder, because without tools to surface missing instances, false negatives can only be found by manually analyzing all documents. We focus on partially automating the discovery of missing instances by soliciting minimal user feedback. We present a technique to identify good generalizations of a regular expression that improve recall while retaining high precision. We empirically demonstrate the effectiveness of the proposed technique compared to existing methods and show results for a variety of tasks, such as identifying dates, phone numbers, product names, and course numbers, on real-world datasets.
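The precision/recall trade-off described above can be illustrated with a minimal sketch. This is not the technique proposed in the paper: the toy text, the gold set, the three patterns, and the `evaluate` helper are all hypothetical, chosen only to show how a generalization of a regular expression can raise recall, and how an overly broad one starts to cost precision.

```python
import re


def evaluate(pattern, text, gold):
    """Return (precision, recall) of a regex's matches against a gold set of strings."""
    found = set(re.findall(pattern, text))
    true_pos = found & gold
    precision = len(true_pos) / len(found) if found else 0.0
    recall = len(true_pos) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data (assumption, not taken from the paper's datasets).
TEXT = "Call 555-1234 or 555 9876 today; order no. 12-3456 ships 555.4321."
GOLD = {"555-1234", "555 9876", "555.4321"}

# Initial expression: precise, but it only accepts a hyphen as separator.
INITIAL = r"\b\d{3}-\d{4}\b"
# One possible generalization: allow hyphen, dot, or space as separator.
GENERALIZED = r"\b\d{3}[-. ]\d{4}\b"
# An over-generalization: any digit runs joined by one non-digit character.
OVERGENERAL = r"\d+\D\d+"

for name, pattern in [("initial", INITIAL),
                      ("generalized", GENERALIZED),
                      ("overgeneral", OVERGENERAL)]:
    p, r = evaluate(pattern, TEXT, GOLD)
    print(f"{name:12s} precision={p:.2f} recall={r:.2f}")
```

Note that computing recall here requires the gold set, which in practice is exactly what is missing; the abstract's point is that user feedback on candidate generalizations can stand in for exhaustive manual labeling.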