An Algebraic Approach to Rule-Based Information Extraction

Authors:
Frederick Reiss;Sriram Raghavan;Rajasekar Krishnamurthy;Huaiyu Zhu;Shivakumar Vaithyanathan
Affiliations:
IBM Almaden Research Center, San Jose, CA, USA. frreiss@us.ibm.com;IBM Almaden Research Center, San Jose, CA, USA. rsriram@us.ibm.com;IBM Almaden Research Center, San Jose, CA, USA. rajase@us.ibm.com;IBM Almaden Research Center, San Jose, CA, USA. huaiyu@us.ibm.com;IBM Almaden Research Center, San Jose, CA, USA. shiv@us.ibm.com
Venue:
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Year:
2008

Citing 0
Cited 33

Harvesting, searching, and ranking knowledge on the web: invited talk

Proceedings of the Second ACM International Conference on Web Search and Data Mining
Information Extraction

Foundations and Trends in Databases
SystemT: a system for declarative information extraction

ACM SIGMOD Record
Building query optimizers for information extraction: the SQoUT project

ACM SIGMOD Record
The YAGO-NAGA approach to knowledge discovery

ACM SIGMOD Record
SOFIE: a self-organizing framework for information extraction

Proceedings of the 18th international conference on World wide web
Efficiently incorporating user feedback into information extraction and integration programs

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Uncertainty management in rule-based information extraction systems

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Optimizing complex extraction programs over evolving text data

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Enabling enterprise mashups over unstructured text feeds with InfoSphere MashupHub and SystemT

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
RankIE: document retrieval on ranked entity graphs

Proceedings of the VLDB Endowment
Data-oriented content query system: searching for data into text on the web

Proceedings of the third ACM international conference on Web search and data mining
From information to knowledge: harvesting entities and relationships from web sources

Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Understanding queries in a search database system

Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
SystemT: an algebraic approach to declarative information extraction

ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
Find your advisor: robust knowledge gathering from the web

Proceedings of the 13th International Workshop on the Web and Databases
Clustering based approach to learning regular expressions over large alphabet for noisy unstructured text

AND '10 Proceedings of the fourth workshop on Analytics for noisy unstructured text data
Automatic rule refinement for information extraction

Proceedings of the VLDB Endowment
Querying probabilistic information extraction

Proceedings of the VLDB Endowment
Scalable knowledge harvesting with high precision and high recall

Proceedings of the fourth ACM international conference on Web search and data mining
Rewrite rules for search database systems

Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Hybrid in-database inference for declarative information extraction

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
The SystemT IDE: an integrated development environment for information extraction rules

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
SystemT: a declarative information extraction system

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Systems Demonstrations
Facilitating pattern discovery for relation extraction with semantic-signature-based clustering

Proceedings of the 20th ACM international conference on Information and knowledge management
Just-in-time information extraction using extraction views

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Towards efficient named-entity rule induction for customizability

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
A performance comparison of parallel DBMSs and MapReduce on large-scale text analytics

Proceedings of the 16th International Conference on Extending Database Technology
Spanners: a formal framework for information extraction

Proceedings of the 32nd symposium on Principles of database systems
Provenance-based dictionary refinement in information extraction

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Concept adjustment for description logics

Proceedings of the seventh international conference on Knowledge capture
Efficient parsing-based search over structured data

Proceedings of the 22nd ACM international conference on information & knowledge management
When speed has a price: fast information extraction using approximate algorithms

Proceedings of the VLDB Endowment

Abstract

Traditional approaches to rule-based information extraction (IE) have primarily been based on regular expression grammars. However, these grammar-based systems have difficulty scaling to large data sets and large numbers of rules. Inspired by traditional database research, we propose an algebraic approach to rule-based IE that addresses these scalability issues through query optimization. The operators of our algebra are motivated by our experience in building several rule-based extraction programs over diverse data sets. We present the operators of our algebra and propose several optimization strategies motivated by the text-specific characteristics of our operators. Finally, we validate the potential benefits of our approach through extensive experiments over real-world blog data.
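
The abstract does not list the operators or the optimization strategies, so the following is only a minimal sketch of the general idea, under stated assumptions: extraction rules are composed from operators over character spans, and two plans that produce the same result can differ in cost. Every name here (`Span`, `regex_extract`, `dictionary_extract`, `follows_join`, the phone pattern, the name dictionary) is hypothetical and chosen for illustration; none is taken from the paper.

```python
import re
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    begin: int  # inclusive character offset
    end: int    # exclusive character offset


# Hypothetical rule inputs, not from the paper.
PHONE = re.compile(r"\d{3}-\d{4}")
NAMES = ("Alice", "Bob")

# Counts how often the (assumed expensive) regex operator runs.
stats = {"regex_calls": 0}


def regex_extract(text: str) -> List[Span]:
    """Extraction operator: one span per regex match."""
    stats["regex_calls"] += 1
    return [Span(m.start(), m.end()) for m in PHONE.finditer(text)]


def dictionary_extract(text: str) -> List[Span]:
    """Extraction operator: one span per occurrence of a dictionary entry."""
    spans = []
    for name in NAMES:
        start = text.find(name)
        while start != -1:
            spans.append(Span(start, start + len(name)))
            start = text.find(name, start + 1)
    return sorted(spans)


def follows_join(left: List[Span], right: List[Span],
                 max_gap: int) -> List[Tuple[Span, Span]]:
    """Join operator: pairs where the right span starts 0..max_gap chars after the left ends."""
    return [(l, r) for l in left for r in right
            if 0 <= r.begin - l.end <= max_gap]


def naive_plan(text: str, max_gap: int = 20) -> List[Tuple[Span, Span]]:
    """Evaluate both extraction operators unconditionally, then join."""
    return follows_join(dictionary_extract(text), regex_extract(text), max_gap)


def optimized_plan(text: str, max_gap: int = 20) -> List[Tuple[Span, Span]]:
    """Same result, but skip the regex when the dictionary side is empty."""
    names = dictionary_extract(text)
    if not names:
        return []  # a join with an empty input is empty
    return follows_join(names, regex_extract(text), max_gap)
```

Both plans return the same pairs for any input, because a join with an empty input is always empty; the second plan simply avoids running the regex on documents that contain no dictionary match. This is the kind of equivalence-preserving rewrite a query optimizer can choose between, though the strategies actually proposed in the paper may differ.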