Reducing Noise in Labels and Features for a Real World Dataset: Application of NLP Corpus Annotation Methods

Authors:
Rebecca J. Passonneau;Cynthia Rudin;Axinia Radeva;Zhi An Liu
Affiliations:
Columbia University, New York, USA NY 10027;Columbia University, New York, USA NY 10027;Columbia University, New York, USA NY 10027;Columbia University, New York, USA NY 10027
Venue:
CICLing '09 Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing
Year:
2009

Abstract

This paper illustrates how a combination of information extraction, machine learning, and NLP corpus annotation practice was applied to the problem of ranking the vulnerability of structures (service boxes, manholes) in the Manhattan electrical grid. By adapting NLP corpus annotation methods to the task of knowledge transfer from domain experts, we compensated for the lack of operational definitions of components of the model, such as "serious event". The machine learning depended on the ticket classes, but ticket classification was not the end goal. Rather, our rule-based document classification determines both the labels of examples and their feature representations. Changes in our classification of events led to improvements in our model, as reflected in the AUC scores for the full ranked list of over 51K structures. The improvements for the very top of the ranked list, which is of most importance for prioritizing work on the electrical grid, affected one in every four or five structures.