An Algebraic Approach to Rule-Based Information Extraction

  • Authors:
  • Frederick Reiss; Sriram Raghavan; Rajasekar Krishnamurthy; Huaiyu Zhu; Shivakumar Vaithyanathan

  • Affiliations:
  • IBM Almaden Research Center, San Jose, CA, USA. frreiss@us.ibm.com; IBM Almaden Research Center, San Jose, CA, USA. rsriram@us.ibm.com; IBM Almaden Research Center, San Jose, CA, USA. rajase@us.ibm.com; IBM Almaden Research Center, San Jose, CA, USA. huaiyu@us.ibm.com; IBM Almaden Research Center, San Jose, CA, USA. shiv@us.ibm.com

  • Venue:
  • ICDE '08: Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
  • Year:
  • 2008

Abstract

Traditional approaches to rule-based information extraction (IE) have primarily been based on regular expression grammars. However, these grammar-based systems have difficulty scaling to large data sets and large numbers of rules. Inspired by traditional database research, we propose an algebraic approach to rule-based IE that addresses these scalability issues through query optimization. The operators of our algebra are motivated by our experience in building several rule-based extraction programs over diverse data sets. We present the operators of our algebra and propose several optimization strategies motivated by the text-specific characteristics of our operators. Finally, we validate the potential benefits of our approach through extensive experiments over real-world blog data.