A semi-supervised algorithm for pattern discovery in information extraction from textual data

Authors:
Tianhao Wu; William M. Pottenger
Affiliations:
Computer Science and Engineering, Lehigh University; Computer Science and Engineering, Lehigh University
Venue:
PAKDD'03 Proceedings of the 7th Pacific-Asia conference on Advances in knowledge discovery and data mining
Year:
2003

Abstract

In this article we present a semi-supervised algorithm for pattern discovery in information extraction from textual data. The patterns that are discovered take the form of regular expressions that generate regular languages. We term our approach 'semi-supervised' because it requires significantly less effort to develop a training set than other approaches. From the training data our algorithm automatically generates regular expressions that can be used on previously unseen data for information extraction. Our experiments show that the algorithm has good testing performance on many features that are important in the fight against terrorism.
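
To make the idea of learning regular expressions from training data concrete, here is a minimal toy sketch in Python. It is not the authors' algorithm, which the abstract does not describe in detail. It assumes a hypothetical "age" feature, a handful of hand-labeled training segments, and a deliberately naive generalization rule (digit runs become `\d+`, all other tokens stay literal). The names `generalize`, `learn_pattern`, and `training_segments` are illustrative only.

```python
import re


def generalize(token: str) -> str:
    """Map a token to a regex fragment (assumed rule, for illustration only).

    Digit-only tokens become \\d+; every other token is kept as an escaped literal.
    """
    return r"\d+" if token.isdigit() else re.escape(token.lower())


def learn_pattern(examples: list[str]) -> re.Pattern:
    """Build one regex as an alternation of the generalized training segments."""
    alternatives: list[str] = []
    for segment in examples:
        fragment = r"\s+".join(generalize(tok) for tok in segment.split())
        if fragment not in alternatives:  # skip segments that generalize identically
            alternatives.append(fragment)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


# Hypothetical labeled segments for an "age" feature.
training_segments = ["25 years old", "31 years old", "age 40"]
pattern = learn_pattern(training_segments)

# Apply the learned pattern to previously unseen text.
unseen = "The suspect, age 19, was seen with a man about 52 years old."
matches = pattern.findall(unseen)
print(matches)
```

The point of the sketch is only the workflow the abstract names: a small labeled set goes in, a regular expression comes out, and that expression is then matched against text that was not in the training set. A realistic learner would need a far richer generalization step than the single digit rule assumed here.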