Detecting errors in corpora using support vector machines

Authors:
Tetsuji Nakagawa; Yuji Matsumoto
Affiliations:
Nara Institute of Science and Technology, Ikoma, Nara, Japan; Nara Institute of Science and Technology, Ikoma, Nara, Japan
Venue:
COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1
Year:
2002

Abstract

Corpus-based research relies on human-annotated corpora, yet a non-negligible number of errors is often said to remain even in frequently used corpora such as the Penn Treebank. Detecting errors in annotated corpora is therefore important for corpus-based natural language processing. In this paper, we propose a method for detecting errors in corpora using support vector machines (SVMs). The method is based on the idea of extracting exceptional elements that violate consistency. We use SVMs to assign a weight to each element and to find errors in a POS-tagged corpus. We apply the method to English and Japanese POS-tagged corpora and achieve high precision in detecting errors.
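
The idea of weighting each element and flagging the exceptional ones can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which weight is used, so the sketch assumes the weight is the dual coefficient that an SVM assigns to each training example. It also swaps in a squared-hinge (L2-loss) linear SVM trained by dual coordinate descent, chosen only because its dual weights are unique and unbounded, which makes the toy ranking easy to read. The function name `train_dual_weights`, the three features, and the nine-token data set are all hypothetical.

```python
# Hypothetical toy sketch: rank training examples by their SVM dual weight.
# Assumption: a squared-hinge (L2-loss) linear SVM trained by dual coordinate
# descent stands in for the SVM used in the paper.

def train_dual_weights(X, y, C=1.0, epochs=200):
    """Return (alpha, w): one dual weight per example and the primal weights."""
    n, d = len(X), len(X[0])
    alpha = [0.0] * n
    w = [0.0] * d
    diag = 1.0 / (2.0 * C)  # extra diagonal term of the L2-loss dual
    for _ in range(epochs):
        for i in range(n):
            xi, yi = X[i], y[i]
            score = sum(wj * xj for wj, xj in zip(w, xi))
            # Gradient and curvature of the dual objective along alpha[i].
            grad = yi * score - 1.0 + diag * alpha[i]
            curvature = sum(xj * xj for xj in xi) + diag
            # Newton step on one coordinate, projected onto alpha[i] >= 0.
            new_alpha = max(alpha[i] - grad / curvature, 0.0)
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                w = [wj + delta * yi * xj for wj, xj in zip(w, xi)]
                alpha[i] = new_alpha
    return alpha, w


# Features: [previous word is "the", previous word is "to", constant bias].
AFTER_THE = [1.0, 0.0, 1.0]
AFTER_TO = [0.0, 1.0, 1.0]

# Labels: +1 = tagged as noun, -1 = tagged as verb.
# The last token follows "the" but is tagged as a verb, which is
# inconsistent with the four other tokens in the same context.
X = [AFTER_THE] * 4 + [AFTER_TO] * 4 + [AFTER_THE]
y = [+1] * 4 + [-1] * 4 + [-1]

alpha, w = train_dual_weights(X, y)

# Examples with the largest weights are the candidates for manual inspection.
ranked = sorted(range(len(X)), key=lambda i: alpha[i], reverse=True)
for i in ranked[:3]:
    print(i, round(alpha[i], 3))
```

In this toy corpus the inconsistently tagged token (index 8) receives the largest weight, so it is listed first. A large weight only marks an example as hard to fit, so in practice such a ranking would serve as a shortlist for review rather than as a verdict.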