TEG—a hybrid approach to information extraction

Authors:
Ronen Feldman;Benjamin Rosenfeld;Moshe Fresko
Affiliations:
Computer Science Department, Bar-Ilan University, 52900, Ramat Gan, Israel;Computer Science Department, Bar-Ilan University, 52900, Ramat Gan, Israel;Computer Science Department, Bar-Ilan University, 52900, Ramat Gan, Israel
Venue:
Knowledge and Information Systems
Year:
2006

Abstract

This paper describes a hybrid statistical and knowledge-based information extraction model that extracts entities and relations at the sentence level. The model attempts to retain and improve on the high accuracy of knowledge-based systems while drastically reducing the amount of manual labour by relying on statistics drawn from a training corpus. The implementation of the model, called TEG (trainable extraction grammar), can be adapted to any IE domain by writing a suitable set of rules in an SCFG (stochastic context-free grammar)-based extraction language and training them on an annotated corpus. The system does not contain any purely linguistic components, such as a PoS tagger or shallow parser, but allows external linguistic components to be used if necessary. We demonstrate the performance of the system on several named entity extraction and relation extraction tasks. The experiments show that our hybrid approach outperforms both purely statistical and purely knowledge-based systems, while requiring orders of magnitude less manual rule writing and smaller amounts of training data. We also demonstrate the robustness of the system when the training data are of poor quality.
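The idea of training hand-written SCFG rules on an annotated corpus can be illustrated with a minimal sketch. It assumes the textbook relative-frequency estimate for stochastic grammars, P(A -> rhs) = count(A -> rhs) / count(A); the abstract does not specify TEG's actual training procedure or rule syntax, so the rule names, the sample data and the function `estimate_rule_probabilities` below are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical rule uses, as if read off parses of an annotated corpus.
# Each item is (left-hand side, right-hand side); the names are illustrative
# only and are not taken from TEG.
observed_rule_uses = [
    ("Person", ("FirstName", "LastName")),
    ("Person", ("FirstName", "LastName")),
    ("Person", ("Title", "LastName")),
    ("Person", ("LastName",)),
    ("Title", ("Dr.",)),
    ("Title", ("Mr.",)),
]


def estimate_rule_probabilities(rule_uses):
    """Relative-frequency estimate: P(A -> rhs) = count(A -> rhs) / count(A)."""
    rule_counts = Counter(rule_uses)

    # Total number of times each nonterminal was expanded.
    lhs_totals = defaultdict(int)
    for (lhs, _rhs), count in rule_counts.items():
        lhs_totals[lhs] += count

    # Normalise each rule count by the total for its left-hand side.
    return {
        (lhs, rhs): count / lhs_totals[lhs]
        for (lhs, rhs), count in rule_counts.items()
    }


rule_probabilities = estimate_rule_probabilities(observed_rule_uses)

for (lhs, rhs), probability in sorted(rule_probabilities.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{probability:.2f}]")
```

In this sketch the rule set itself is written by hand and only the probabilities come from data, which mirrors the division of labour the abstract describes between manual rule writing and corpus statistics.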