Similarity measures for short segments of text

Authors:
Donald Metzler; Susan Dumais; Christopher Meek
Affiliations:
University of Massachusetts, Amherst, MA; Microsoft Research, Redmond, WA; Microsoft Research, Redmond, WA
Venue:
ECIR'07 Proceedings of the 29th European conference on IR research
Year:
2007

Abstract

Measuring the similarity between documents and queries has been extensively studied in information retrieval. However, there are a growing number of tasks that require computing the similarity between two very short segments of text. These tasks include query reformulation, sponsored search, and image retrieval. Standard text similarity measures perform poorly on such tasks because of data sparseness and the lack of context. In this work, we study this problem from an information retrieval perspective, focusing on text representations and similarity measures. We examine a range of similarity measures, including purely lexical measures, stemming, and language modeling-based measures. We formally evaluate and analyze the methods on a query-query similarity task using 363,822 queries from a web search log. Our analysis provides insights into the strengths and weaknesses of each method, including important tradeoffs between effectiveness and efficiency.
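The sparseness problem mentioned above can be illustrated with a small sketch. This is not one of the measures evaluated in the paper: the function name `jaccard`, the whitespace tokenization, and the example queries are assumptions chosen for illustration only. It computes term-set overlap, one simple kind of purely lexical measure, and shows how two related short queries can share no terms at all.

```python
def jaccard(a: str, b: str) -> float:
    """Term-set Jaccard overlap between two whitespace-tokenized strings.

    Illustrative only: the tokenization and the choice of Jaccard are
    assumptions, not the measures defined in the paper.
    """
    terms_a = set(a.lower().split())
    terms_b = set(b.lower().split())
    union = terms_a | terms_b
    if not union:
        # Both inputs are empty; treat as no evidence of similarity.
        return 0.0
    return len(terms_a & terms_b) / len(union)


# Related in meaning, but no shared terms, so the lexical score is zero.
related_but_disjoint = jaccard("nyc hotels", "new york lodging")

# Shares two of four distinct terms.
partial_overlap = jaccard("cheap flights", "cheap flights to boston")
```

With only two or three terms per query, a single vocabulary mismatch drives the score to zero, which is the kind of failure that motivates richer representations than raw term overlap.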