Machine Learning for Information Extraction in Informal Domains

  • Authors: Dayne Freitag
  • Affiliations: Justsystem Pittsburgh Research Center, 4616 Henry Street, Pittsburgh, PA 15213, USA. dayne@justresearch.com
  • Venue: Machine Learning - Special issue on information retrieval
  • Year: 2000

Abstract

We consider the problem of learning to perform information extraction in domains where linguistic processing is problematic, such as Usenet posts, email, and finger plan files. In place of syntactic and semantic information, other sources of information can be used, such as term frequency, typography, formatting, and mark-up. We describe four learning approaches to this problem, each drawn from a different paradigm: a rote learner, a term-space learner based on Naive Bayes, an approach using grammatical induction, and a relational rule learner. Experiments on 14 information extraction problems defined over four diverse document collections demonstrate the effectiveness of these approaches. Finally, we describe a multistrategy approach which combines these learners and yields performance competitive with or better than the best of them. This technique is modular and flexible, and could find application in other machine learning problems.