Detecting errors within a corpus using anomaly detection

  • Authors: Eleazar Eskin
  • Affiliations: Columbia University
  • Venue: NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference
  • Year: 2000

Abstract

We present a method for automatically detecting errors in a manually marked corpus using anomaly detection. Anomaly detection is a method for determining which elements of a large data set do not conform to the whole. This method fits a probability distribution over the data and applies a statistical test to detect anomalous elements. In the corpus error detection problem, anomalous elements are typically marking errors. We present the results of applying this method to the tagged portion of the Penn Treebank corpus.
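The general idea of fitting a distribution and flagging low-probability elements can be illustrated with a minimal sketch. This is not the paper's model or statistical test, which the abstract does not specify. As a simplified stand-in, it assumes a smoothed relative-frequency estimate of P(tag | word) and a fixed probability cutoff. The function name `flag_anomalies`, the parameters `threshold` and `alpha`, and the toy corpus are all hypothetical, chosen for illustration only.

```python
from collections import Counter, defaultdict


def flag_anomalies(tagged_tokens, threshold=0.05, alpha=0.1):
    """Return indices of (word, tag) pairs with low smoothed P(tag | word).

    Assumption: a simple add-alpha relative-frequency model and a fixed
    cutoff stand in for the distribution and test used in the paper.
    """
    tag_counts_by_word = defaultdict(Counter)
    tagset = set()

    # Fit step: count how often each word carries each tag.
    for word, tag in tagged_tokens:
        tag_counts_by_word[word][tag] += 1
        tagset.add(tag)

    flagged = []
    # Detection step: flag tokens whose tag is improbable for that word.
    for i, (word, tag) in enumerate(tagged_tokens):
        counts = tag_counts_by_word[word]
        total = sum(counts.values())
        p = (counts[tag] + alpha) / (total + alpha * len(tagset))
        if p < threshold:
            flagged.append(i)
    return flagged


# Toy corpus (invented): "the" is tagged DT fifty times and NN once.
corpus = (
    [("the", "DT")] * 50
    + [("the", "NN")]
    + [("dog", "NN")] * 10
    + [("runs", "VBZ")] * 10
)
print(flag_anomalies(corpus))  # [50]
```

In this toy corpus the single ("the", "NN") token receives a smoothed probability of roughly 0.02 and is the only one flagged. A flagged token is a candidate for manual review rather than a confirmed marking error, since rare but correct taggings also fall below the cutoff.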