Learning term-weighting functions for similarity measures

Authors:
Wen-tau Yih
Affiliations:
Microsoft Research, Redmond, WA
Venue:
EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2
Year:
2009

Citing 16
Cited 4

MetaCost: a general method for making classifiers cost-sensitive

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Neural Networks for Pattern Recognition

Neural Networks for Pattern Recognition
Combining Statistical Learning with a Knowledge-Based Approach - A Case Study in Intensive Care Monitoring

ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
Adaptive duplicate detection using learnable string similarity measures

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Automatic retrieval and clustering of similar words

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 2
Learning a Similarity Metric Discriminatively, with Application to Face Verification

CVPR '05 Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1 - Volume 01
Learning to rank using gradient descent

ICML '05 Proceedings of the 22nd international conference on Machine learning
Finding advertising keywords on web pages

Proceedings of the 15th international conference on World Wide Web
A web-based kernel function for measuring the similarity of short text snippets

Proceedings of the 15th international conference on World Wide Web
Generating query substitutions

Proceedings of the 15th international conference on World Wide Web
Introduction to Information Retrieval

Introduction to Information Retrieval
Get another label? improving data quality and data mining using multiple, noisy labelers

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Corpus-based and knowledge-based measures of text semantic similarity

AAAI'06 Proceedings of the 21st national conference on Artificial intelligence - Volume 1
Improving similarity measures for short segments of text

AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence - Volume 2
Similarity measures for short segments of text

ECIR'07 Proceedings of the 29th European conference on IR research
Here or there: preference judgments for relevance

ECIR'08 Proceedings of the IR research, 30th European conference on Advances in information retrieval

Adaptive near-duplicate detection via similarity learning

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Learning similarity function for rare queries

Proceedings of the fourth ACM international conference on Web search and data mining
A supervised method of feature weighting for measuring semantic relatedness

Canadian AI'11 Proceedings of the 24th Canadian conference on Advances in artificial intelligence
An efficient Particle Swarm Optimization approach to cluster short texts

Information Sciences: an International Journal

Quantified Score

Hi-index	0.00

Visualization

Abstract

Measuring the similarity between two texts is a fundamental problem in many NLP and IR applications. Among the existing approaches, the cosine measure of the term vectors representing the original texts has been widely used, where the score of each term is often determined by a TFIDF formula. Despite its simplicity, the quality of such cosine similarity measure is usually domain dependent and decided by the choice of the term-weighting function. In this paper, we propose a novel framework that learns the term-weighting function. Given the labeled pairs of texts as training data, the learning procedure tunes the model parameters by minimizing the specified loss function of the similarity score. Compared to traditional TFIDF term-weighting schemes, our approach shows a significant improvement on tasks such as judging the quality of query suggestions and filtering irrelevant ads for online advertising.