Corpus-based research relies on human-annotated corpora, yet a non-negligible number of errors are often said to remain even in frequently used corpora such as the Penn Treebank. Detecting errors in annotated corpora is therefore important for corpus-based natural language processing. In this paper, we propose a method for detecting errors in corpora using support vector machines (SVMs). The method is based on the idea of extracting exceptional elements that violate consistency. We use SVMs to assign a weight to each element and to find errors in a POS-tagged corpus. We apply the method to English and Japanese POS-tagged corpora and achieve high precision in detecting errors.
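
The abstract does not say how the per-element weight is computed. As a rough illustration only, the sketch below assumes the weight is the dual coefficient (alpha) that a soft-margin SVM assigns to each training example, and treats examples whose alpha reaches the upper bound C as candidates that conflict with the rest of the data. Everything here is hypothetical and not taken from the paper: the function name `train_dual_svm`, the linear kernel without a bias term, the dual coordinate descent solver, and the toy 2-D vectors standing in for real POS-tagging features.

```python
from typing import List, Sequence, Tuple


def train_dual_svm(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    C: float = 1.0,
    epochs: int = 200,
) -> Tuple[List[float], List[float]]:
    """Dual coordinate descent for a linear hinge-loss SVM (no bias term).

    Returns (w, alpha); alpha[i] in [0, C] is the weight of training example i.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    alpha = [0.0] * n
    for _ in range(epochs):
        for i in range(n):
            q = sum(v * v for v in X[i])  # squared norm of x_i
            if q == 0.0:
                continue
            # Gradient of the dual objective with respect to alpha[i].
            g = y[i] * sum(wj * xj for wj, xj in zip(w, X[i])) - 1.0
            # Newton step on this coordinate, clipped to the box [0, C].
            new_alpha = min(max(alpha[i] - g / q, 0.0), C)
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                for j in range(d):
                    w[j] += delta * y[i] * X[i][j]
                alpha[i] = new_alpha
    return w, alpha


# Toy data (an assumption for illustration): label +1 on the right, -1 on the left.
X = [
    (1, 0), (1, 1), (1, -1), (3, 0), (3, 1),            # labelled +1
    (-1, 0), (-1, 1), (-1, -1), (-3, 0), (-3, -1),      # labelled -1
    (2, 0),                                             # labelled -1: inconsistent
]
y = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
NOISY = 10             # index of the deliberately mislabelled example
FAR = [3, 4, 8, 9]     # clean examples well outside the margin
C = 1.0

w, alpha = train_dual_svm(X, y, C=C)

# Examples whose weight sits at the upper bound C are candidates for review.
suspects = [i for i, a in enumerate(alpha) if a >= C - 1e-9]
print("alpha:", [round(a, 3) for a in alpha])
print("candidates for review:", suspects)
```

In this sketch the mislabelled example ends up with the maximum weight, while clean examples far from the decision boundary get weight zero. Correctly labelled examples that lie on the margin can also reach the bound, so the candidate list is best read as a shortlist for manual checking rather than a list of confirmed errors.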