On privacy preservation in text and document-based active learning for named entity recognition

  • Authors: Fredrik Olsson

  • Affiliations: Swedish Institute of Computer Science, Kista, Sweden

  • Venue: Proceedings of the ACM first international workshop on Privacy and anonymity for very large databases

  • Year: 2009

Abstract

Preserving the privacy of persons mentioned in text requires the ability to automatically recognize and identify names. Named entity recognition is a mature field, and most current approaches are based on supervised machine learning techniques. Such learning requires labeled examples on which to train; training examples are usually provided to the learner in the form of annotated corpora. Creating and annotating corpora is a tedious, meticulous, and error-prone process; obtaining good training examples is a hard task in itself. This paper describes the development and in-depth empirical investigation of a method, called BootMark, for bootstrapping the marking up of named entities in textual documents. Experimental results show that, to produce a named entity recognizer with a given performance, BootMark requires a human annotator to manually annotate fewer documents than would be needed if the documents forming the basis for the recognizer were randomly drawn from the same corpus. The investigation further indicates that the primary gain obtained by BootMark compared to passive learning is in terms of higher recall. Thus, it is argued, the recognizers are suitable for use in privacy preservation applications.