Classification is a core task in knowledge discovery and data mining, and substantial research effort has gone into developing sophisticated classification models. In a parallel thread, recent work from the NLP community suggests that for tasks such as natural language disambiguation, even a simple algorithm can outperform a sophisticated one if it is given large quantities of high-quality training data. In those applications, training data occurs naturally in text corpora, and high-quality training sets running into billions of words have reportedly been used. We explore how to apply these lessons from the NLP community to KDD tasks. Specifically, we investigate how to identify data sources that can yield training data at low cost, and we study whether the quantity of automatically extracted training data can compensate for its lower quality. We carry out this investigation for the specific task of inferring whether a search query has commercial intent. We mine toolbar and click logs to extract queries from sites that are predominantly commercial (e.g., Amazon) and non-commercial (e.g., Wikipedia). We compare the accuracy obtained using such training data against that obtained using manually labeled training data. Our results show that automatically extracted training data can yield large accuracy gains at much lower cost.
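As a rough illustration of the extraction step described above, the following Python sketch assigns noisy labels to queries according to the sites users clicked on. It is not the authors' pipeline: the log format (pairs of query and clicked URL), the seed site lists beyond the two examples named in the abstract, the helper names `site_of` and `label_queries`, and the `min_clicks` and `min_purity` thresholds are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Hypothetical seed lists; the abstract names only Amazon and Wikipedia as examples.
COMMERCIAL_SITES = {"amazon.com"}
NONCOMMERCIAL_SITES = {"wikipedia.org"}


def site_of(url):
    """Reduce a URL to its last two host labels (a crude stand-in for the site name)."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def label_queries(click_log, min_clicks=2, min_purity=0.9):
    """Assign noisy labels to queries from (query, clicked_url) pairs.

    A query is kept only if it has at least `min_clicks` clicks on seed sites
    and at least `min_purity` of those clicks agree on one label. Both
    thresholds are illustrative assumptions, not values from the paper.
    """
    votes = defaultdict(Counter)
    for query, url in click_log:
        key = query.strip().lower()
        site = site_of(url)
        if site in COMMERCIAL_SITES:
            votes[key]["commercial"] += 1
        elif site in NONCOMMERCIAL_SITES:
            votes[key]["noncommercial"] += 1

    labels = {}
    for query, counts in votes.items():
        label, top = counts.most_common(1)[0]
        total = sum(counts.values())
        if total >= min_clicks and top / total >= min_purity:
            labels[query] = label
    return labels
```

Queries whose clicks split between the two kinds of site are dropped rather than guessed at, which trades some quantity for label quality. The resulting dictionary could then serve as training data for any standard classifier and be compared against a manually labeled set.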