Measuring semantic similarity between words by removing noise and redundancy in web snippets

  • Authors:
  • Zheng Xu; Xiangfeng Luo; Jie Yu; Weimin Xu

  • Affiliations:
  • School of Computer Engineering and Science, High Performance Computing Center, Shanghai University, Shanghai, 200072, China (all four authors)

  • Venue:
  • Concurrency and Computation: Practice & Experience
  • Year:
  • 2011

Abstract

Semantic similarity measures play important roles in many Web-related tasks such as Web browsing and query suggestion. Because taxonomy-based methods cannot deal with continually emerging words, Web-based methods have recently been proposed to solve this problem. Because of the noise and redundancy hidden in Web data, however, robustness and accuracy remain challenges. In this paper, we propose a method integrating page counts and snippets returned by Web search engines. Then, semantic snippets and the number of search results are used to remove noise and redundancy in the Web snippets (a ‘Web snippet’ includes the title, summary, and URL of a Web page returned by a search engine). After that, a method integrating page counts, semantic snippets, and the number of already displayed search results is proposed. The proposed method does not need any human-annotated knowledge (e.g., ontologies) and can easily be applied to Web-related tasks (e.g., query suggestion). A correlation coefficient of 0.851 against the Rubenstein–Goodenough benchmark dataset shows that the proposed method outperforms the existing Web-based methods by a wide margin. Moreover, the proposed semantic similarity measure significantly improves the quality of query suggestion compared with some page-count-based methods. Copyright © 2011 John Wiley & Sons, Ltd.
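
The abstract does not give the paper's formulas, so the following is only a minimal sketch of the two generic ingredients it mentions: a similarity score computed from search-engine page counts, and a correlation coefficient used to compare scores with human ratings. It assumes a Jaccard coefficient over page counts as the similarity score and Pearson correlation as the evaluation measure. Neither choice is confirmed by the abstract, and the names `web_jaccard` and `pearson` are hypothetical. It does not implement the snippet-based noise and redundancy removal that the paper proposes.

```python
from math import sqrt


def web_jaccard(count_p, count_q, count_pq):
    """Jaccard coefficient over page counts (an assumed baseline, not the paper's measure).

    count_p  -- number of results for the query "P"
    count_q  -- number of results for the query "Q"
    count_pq -- number of results for the conjunctive query "P AND Q"
    """
    if count_pq == 0:
        return 0.0
    return count_pq / (count_p + count_q - count_pq)


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

In an evaluation of this kind, `web_jaccard` (or any other measure) would be applied to each word pair in a benchmark, and `pearson` would compare the resulting scores with the human ratings for the same pairs. Page counts would come from a search engine; none are supplied here.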