TEG: a hybrid approach to information extraction

Authors:
Benjamin Rosenfeld;Ronen Feldman;Moshe Fresko;Jonathan Schler;Yonatan Aumann
Affiliations:
Bar-Ilan University, Ramat Gan, ISRAEL;Bar-Ilan University, Ramat Gan, ISRAEL;Bar-Ilan University, Ramat Gan, ISRAEL;Bar-Ilan University, Ramat Gan, ISRAEL;Bar-Ilan University, Ramat Gan, ISRAEL
Venue:
Proceedings of the thirteenth ACM international conference on Information and knowledge management
Year:
2004

Abstract

This paper describes a hybrid statistical and knowledge-based information extraction model that extracts entities and relations at the sentence level. The model attempts to retain and improve the high accuracy levels of knowledge-based systems while drastically reducing the amount of manual labor by relying on statistics drawn from a training corpus. The implementation of the model, called TEG (Trainable Extraction Grammar), can be adapted to any IE domain by writing a suitable set of rules in an extraction language based on SCFG (Stochastic Context-Free Grammar) and training them on an annotated corpus. The system does not contain any purely linguistic components, such as a PoS tagger or parser. We demonstrate the performance of the system on several named entity extraction and relation extraction tasks. The experiments show that our hybrid approach outperforms both purely statistical and purely knowledge-based systems, while requiring orders of magnitude less manual rule writing and a smaller amount of training data. The improvement in accuracy is slight for the named entity extraction task and more pronounced for relation extraction.
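
The abstract does not give TEG's rule syntax or its training procedure, so the following is only a minimal sketch of the general idea of training hand-written stochastic grammar rules from an annotated corpus. It assumes the simplest possible scheme, relative-frequency estimation over counted rule applications; the function name, the nonterminals (`Person`, `Acquisition`), and the toy observations are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def estimate_rule_probabilities(rule_uses):
    """Relative-frequency estimate of SCFG rule probabilities.

    rule_uses: iterable of (lhs, rhs) pairs, one per rule application
    observed in an annotated corpus; rhs is a tuple of symbols.
    Returns {lhs: {rhs: probability}}, where the probabilities of all
    expansions of the same left-hand side sum to 1.
    """
    counts = Counter(rule_uses)

    # Total number of times each left-hand side was expanded.
    lhs_totals = defaultdict(int)
    for (lhs, _rhs), n in counts.items():
        lhs_totals[lhs] += n

    # Probability of a rule = its count / count of its left-hand side.
    probs = defaultdict(dict)
    for (lhs, rhs), n in counts.items():
        probs[lhs][rhs] = n / lhs_totals[lhs]
    return dict(probs)


# Hypothetical rule applications (illustrative only, not from the paper).
observed = [
    ("Person", ("FirstName", "LastName")),
    ("Person", ("FirstName", "LastName")),
    ("Person", ("Title", "LastName")),
    ("Acquisition", ("Company", "acquired", "Company")),
]

probs = estimate_rule_probabilities(observed)
for lhs, expansions in probs.items():
    for rhs, p in expansions.items():
        print(f"{lhs} -> {' '.join(rhs)}  [{p:.3f}]")
```

In this sketch the rule writer supplies only the grammar structure, and the corpus supplies the weights, which is the division of labor the abstract describes; an actual system would also need a parser to recover which rules apply to each annotated sentence.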