Clustering based approach to learning regular expressions over large alphabet for noisy unstructured text

Authors:
Rohit Babbar;Nidhi Singh
Affiliations:
Chennai Mathematical Institute, Chennai, India;IBM India Software Labs, Bangalore, India
Venue:
AND '10 Proceedings of the fourth workshop on Analytics for noisy unstructured text data
Year:
2010

Abstract

Regular expressions have been used for information extraction tasks in a variety of domains. The alphabet of a regular expression can consist either of the tokens relevant to the entity of interest or of individual characters, in which case the alphabet becomes very large. Noise in unstructured text documents, combined with this larger alphabet, poses a significant challenge both for entity extraction and for algorithmically learning complex regular expressions. In this paper, we present a novel algorithm for regular expression learning that clusters similar matches to obtain the corresponding regular expressions, identifies and eliminates noisy clusters, and finally uses a weighted disjunction of the most promising candidate regular expressions to obtain the final expression. The experimental results show high precision and recall for this final expression, which supports the applicability of our approach to entity extraction tasks of practical importance.
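
The three stages named in the abstract (cluster similar matches, discard noisy clusters, combine the remaining candidates by disjunction) can be illustrated with a minimal Python sketch. This is not the authors' algorithm, whose details the abstract does not give. Every specific below is an assumption made for illustration: the function names `char_class`, `generalize`, and `learn_regex`, the character-class generalization, treating examples with an identical generalized pattern as one cluster, the `min_support` noise threshold, and using cluster size as the weight that orders the alternatives.

```python
import re
from collections import Counter
from itertools import groupby


def char_class(ch):
    """Map a character to a coarse regex class (an illustrative choice)."""
    if "0" <= ch <= "9":
        return r"\d"
    if ch.isascii() and ch.isalpha():
        return "[A-Za-z]"
    return re.escape(ch)  # keep punctuation and other characters literal


def generalize(text):
    """Turn one example match into a regex by collapsing runs of one class."""
    parts = []
    for cls, run in groupby(text, key=char_class):
        n = len(list(run))
        parts.append(cls if n == 1 else f"{cls}{{{n}}}")
    return "".join(parts)


def learn_regex(examples, min_support=2):
    """Hypothetical learner: cluster, drop small clusters, join by disjunction."""
    # Clustering (assumed): examples with the same generalized pattern
    # form one cluster; the Counter value is the cluster size.
    clusters = Counter(generalize(e) for e in examples)
    # Noise elimination (assumed): clusters below min_support are discarded.
    kept = [(p, n) for p, n in clusters.most_common() if n >= min_support]
    if not kept:
        return None
    # Disjunction (assumed weighting): larger clusters are listed first.
    return "|".join(f"(?:{p})" for p, _ in kept)


# Example usage: four clean phone-like strings and one noisy string.
examples = ["555-1234", "555-9876", "5551234", "5559876", "555-12e4"]
pattern = learn_regex(examples)
print(pattern)
print(bool(re.fullmatch(pattern, "555-0000")))
```

In this sketch the noisy example `555-12e4` forms a cluster of size one and is dropped, so the learned expression does not match it. A real system would need a similarity measure that tolerates small differences between matches, which exact pattern equality does not.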