SystemT: a system for declarative information extraction

Authors:
Rajasekar Krishnamurthy;Yunyao Li;Sriram Raghavan;Frederick Reiss;Shivakumar Vaithyanathan;Huaiyu Zhu
Affiliations:
IBM Almaden Research Center;IBM Almaden Research Center;IBM Almaden Research Center;IBM Almaden Research Center;IBM Almaden Research Center;IBM Almaden Research Center
Venue:
ACM SIGMOD Record
Year:
2009


Abstract

As applications within and outside the enterprise encounter increasing volumes of unstructured data, there has been renewed interest in information extraction (IE), the discipline concerned with extracting structured information from unstructured text. Classical IE techniques developed by the NLP community were based on cascading grammars and regular expressions. However, due to the inherent limitations of grammar-based extraction, these techniques are unable to (i) scale to large data sets and (ii) support the expressivity requirements of complex information tasks. At the IBM Almaden Research Center, we are developing SystemT, an IE system that addresses these limitations by adopting an algebraic approach. By leveraging well-understood database concepts such as declarative queries and cost-based optimization, SystemT enables scalable execution of complex information extraction tasks. In this paper, we motivate the SystemT approach to information extraction. We describe our extraction algebra and demonstrate the effectiveness of our optimization techniques in providing orders-of-magnitude reductions in the running time of complex extraction tasks.