Integrating geometrical and linguistic analysis for email signature block parsing

  • Authors: Hao Chen; Jianying Hu; Richard W. Sproat
  • Affiliations: Univ. of California at Berkeley, Berkeley; Lucent Technologies Bell Labs, Murray Hill, NJ; AT&T Labs—Research, Florham Park, NJ
  • Venue: ACM Transactions on Information Systems (TOIS)
  • Year: 1999

Abstract

The signature block is a common structured component found in email messages. Accurate identification and analysis of signature blocks is important in many multimedia messaging and information retrieval applications such as email text-to-speech rendering, automatic construction of personal address databases, and interactive message retrieval. It is also a very challenging task, because signature blocks often appear in complex two-dimensional layouts which are guided only by loose conventions. Traditional text analysis methods designed to deal with sequential text cannot handle two-dimensional structures, while the highly unconstrained nature of signature blocks makes the application of two-dimensional grammars very difficult. In this article, we describe an algorithm for signature block analysis which combines two-dimensional structural segmentation with one-dimensional grammatical constraints. The information obtained from both layout and linguistic analysis is integrated in the form of weighted finite-state transducers. The algorithm is currently implemented as a component in a preprocessing system for email text-to-speech rendering.
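
The idea of combining a two-dimensional layout split with one-dimensional linguistic checks can be illustrated with a toy sketch. This is not the article's algorithm, and it uses no finite-state transducer library: it replaces the weighted transducers with plain additive costs, where the cheapest candidate wins, which is how path weights behave in the tropical semiring. The function names, the regular expressions, the `min_gap` threshold, and the `split_penalty` value are all assumptions made for illustration.

```python
import re

# Hypothetical one-dimensional "grammar": a segment is well formed if it
# matches one of these toy field patterns.
FIELD_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "phone": re.compile(r"^(Tel|Phone|Fax)?:?\s*\+?[\d\s().-]{7,}$", re.I),
    "name": re.compile(r"^[A-Z][a-z]+( [A-Z]\.)?( [A-Z][a-z]+)+$"),
}


def split_columns(lines, min_gap=3):
    """Split a text block at interior vertical gutters: runs of at least
    min_gap character columns that are blank on every line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    blank = [all(row[c] == " " for row in padded) for c in range(width)]

    bounds, left, c = [], 0, 0
    while c < width:
        if not blank[c]:
            c += 1
            continue
        start = c
        while c < width and blank[c]:
            c += 1
        # Only gutters strictly inside the block separate two columns.
        if c - start >= min_gap and start > 0 and c < width:
            bounds.append((left, start))
            left = c
    bounds.append((left, width))

    return [
        [row[a:b].strip() for row in padded if row[a:b].strip()]
        for a, b in bounds
    ]


def linguistic_cost(segment):
    """Cost 0 if the segment parses as a known field, 1 otherwise."""
    return 0.0 if any(p.match(segment) for p in FIELD_PATTERNS.values()) else 1.0


def best_linearization(lines, split_penalty=0.5):
    """Score row-wise and column-wise readings; return (cost, segments)."""
    rows = [" ".join(line.split()) for line in lines if line.strip()]
    columns = split_columns(lines)
    by_column = [seg for col in columns for seg in col]

    candidates = [
        # (layout cost, segments in reading order)
        (0.0, rows),
        (split_penalty * (len(columns) - 1), by_column),
    ]
    scored = [
        (layout + sum(linguistic_cost(s) for s in segs), segs)
        for layout, segs in candidates
    ]
    return min(scored, key=lambda item: item[0])


block = [
    "Jane Doe        jane@example.com",
    "Example Corp    Tel: 555 123 4567",
]
cost, segments = best_linearization(block)
print(cost, segments)
```

In this example the row-wise reading mixes two fields on each line, so neither line matches a field pattern, while the column-wise reading yields four clean fields at the price of one layout split. A real system would need a far richer grammar and a principled way to weight layout evidence against linguistic evidence.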