Detecting errors within a corpus using anomaly detection

Authors: Eleazar Eskin
Affiliations: Columbia University
Venue: NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference
Year: 2000

Abstract

We present a method for automatically detecting errors in a manually marked corpus using anomaly detection. Anomaly detection is a method for determining which elements of a large data set do not conform to the whole. This method fits a probability distribution over the data and applies a statistical test to detect anomalous elements. In the corpus error detection problem, anomalous elements are typically marking errors. We present the results of applying this method to the tagged portion of the Penn Treebank corpus.
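
The fit-then-test idea in the abstract can be illustrated with a minimal sketch. This is not the paper's model: it swaps in a much simpler stand-in, an add-alpha smoothed estimate of P(tag | word) computed from the corpus itself, and a fixed probability cutoff in place of a proper statistical test. The function name `flag_anomalous_tags`, the `threshold` and `alpha` parameters, and the toy corpus are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def flag_anomalous_tags(tagged_tokens, threshold=0.05, alpha=1.0):
    """Return indices of tokens whose tag is improbable given the word.

    Simplified stand-in for anomaly detection over a tagged corpus:
    1. fit a smoothed distribution P(tag | word) over the whole corpus;
    2. flag every token whose assigned tag falls below `threshold`.
    """
    tagset = sorted({tag for _, tag in tagged_tokens})

    # Step 1: count how often each word carries each tag.
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1

    # Step 2: score each token and keep the improbable ones.
    flagged = []
    for i, (word, tag) in enumerate(tagged_tokens):
        total = sum(counts[word].values())
        # Add-alpha smoothing so rare words are not flagged automatically.
        p = (counts[word][tag] + alpha) / (total + alpha * len(tagset))
        if p < threshold:
            flagged.append(i)
    return flagged


# Toy corpus (fabricated for illustration): "the" is tagged DT 50 times
# and NN once; the single NN tagging is the planted marking error.
corpus = [("the", "DT")] * 50 + [("dog", "NN")] * 30 + [("the", "NN")]
suspects = flag_anomalous_tags(corpus)
```

In this toy corpus only the last token, ("the", "NN"), is flagged. A flagged token is a candidate for manual review rather than a confirmed error, since a rare but correct tagging looks the same to this test.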