Improving classification accuracy using automatically extracted training data

Authors:
Ariel Fuxman; Anitha Kannan; Andrew B. Goldberg; Rakesh Agrawal; Panayiotis Tsaparas; John Shafer
Affiliations:
Microsoft Research, Mountain View, CA, USA; Microsoft Research, Mountain View, CA, USA; Univ. of Wisconsin-Madison, Madison, WI, USA; Microsoft Research, Mountain View, CA, USA; Microsoft Research, Mountain View, CA, USA; Microsoft Research, Mountain View, CA, USA
Venue:
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2009

Abstract

Classification is a core task in knowledge discovery and data mining, and substantial research effort has gone into developing sophisticated classification models. In a parallel thread, recent work from the NLP community suggests that for tasks such as natural language disambiguation, even a simple algorithm can outperform a sophisticated one if it is provided with large quantities of high-quality training data. In those applications, training data occurs naturally in text corpora, and high-quality training sets running into billions of words have reportedly been used. We explore how to apply the lessons from the NLP community to KDD tasks. Specifically, we investigate how to identify data sources that can yield training data at low cost, and we study whether the quantity of automatically extracted training data can compensate for its lower quality. We carry out this investigation for the specific task of inferring whether a search query has commercial intent. We mine toolbar and click logs to extract queries from sites that are predominantly commercial (e.g., Amazon) and non-commercial (e.g., Wikipedia). We compare the accuracy obtained using such training data against that obtained using manually labeled training data. Our results show that automatically extracted training data can yield large accuracy gains at much lower cost.
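
The extraction idea in the abstract, labeling a query by the kind of site it led to, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the seed host lists, the `(query, clicked_url)` log format, the function names, and the majority-vote rule for resolving conflicting clicks are all assumptions made for illustration. The abstract names only Amazon and Wikipedia as example sites and does not describe the log schema.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Hypothetical seed lists; the abstract gives Amazon and Wikipedia only as examples.
COMMERCIAL_HOSTS = {"amazon.com"}
NON_COMMERCIAL_HOSTS = {"wikipedia.org"}


def site_label(url):
    """Return 'commercial', 'non-commercial', or None for a site not on a seed list."""
    host = (urlparse(url).hostname or "").lower()
    for seed in COMMERCIAL_HOSTS:
        if host == seed or host.endswith("." + seed):
            return "commercial"
    for seed in NON_COMMERCIAL_HOSTS:
        if host == seed or host.endswith("." + seed):
            return "non-commercial"
    return None


def extract_training_data(click_log):
    """Label each query by the majority label of the seed sites it led to.

    click_log is an iterable of (query, clicked_url) pairs (an assumed format).
    Queries with no clicks on seed sites, or with a tied vote, are left out.
    """
    votes = defaultdict(Counter)
    for query, url in click_log:
        label = site_label(url)
        if label is not None:
            votes[query.strip().lower()][label] += 1

    labeled = {}
    for query, counts in votes.items():
        ranked = counts.most_common(2)
        # Keep the query only when one label strictly wins the vote.
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            labeled[query] = ranked[0][0]
    return labeled


# Made-up log entries for illustration only.
sample_log = [
    ("digital camera", "https://www.amazon.com/s?k=digital+camera"),
    ("digital camera", "https://www.amazon.com/dp/EXAMPLE"),
    ("history of photography", "https://en.wikipedia.org/wiki/History_of_photography"),
    ("weather today", "https://example.com/weather"),
]
training_data = extract_training_data(sample_log)
```

Labels produced this way are noisy by construction, since a click on a commercial site does not guarantee commercial intent. That noise is the lower quality the abstract refers to when it asks whether quantity can compensate for it.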