Using word clusters to detect similar web documents

Authors:
Jonathan Koberstein;Yiu-Kai Ng
Affiliations:
Computer Science Department, Brigham Young University, Provo, UT;Computer Science Department, Brigham Young University, Provo, UT
Venue:
KSEM'06 Proceedings of the First international conference on Knowledge Science, Engineering and Management
Year:
2006

Abstract

Detecting exact matches in Web documents is relatively easy; detecting similar content in distinct Web documents that use different words and sentence structures is much more difficult. A reliable tool for determining the degree of similarity between any two Web documents could help filter or retain Web documents with similar content. Most methods for detecting similarity between documents rely on some kind of textual fingerprinting or on searching for exactly matched substrings. This may not be sufficient, because changing the sentence structure or replacing words with synonyms can cause sentences with similar or identical content to be treated as different. In this paper, we develop a sentence-based Fuzzy Set Information Retrieval (IR) approach that uses word clusters, which capture the similarity between different words, to discover similar documents. Our approach has the advantage of detecting documents with similar, but not necessarily identical, sentences based on fuzzy-word sets. We consider three fuzzy-word clustering techniques, the correlation cluster, the association cluster, and the metric cluster, each of which generates word-to-word correlation values. Experimental results show that, by adopting the metric cluster, our similarity detection approach achieves a high accuracy rate in detecting similar documents and improves on previous Fuzzy Set IR approaches based solely on the correlation cluster.
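
The idea of comparing sentences through fuzzy-word sets can be illustrated with a minimal Python sketch. It is not taken from the paper: the abstract gives no formulas, so the sketch assumes a common textbook fuzzy-set IR formulation in which a word's membership in a sentence is one minus the product of (1 - correlation) over the words of that sentence. The names `CORRELATION`, `corr`, `membership`, and `sentence_similarity` are hypothetical, and the correlation values are invented for illustration; in the approach described above they would come from a correlation, association, or metric cluster.

```python
from math import prod

# Hypothetical word-to-word correlation factors in [0, 1] (made-up values).
# In practice these would be produced by one of the word-clustering techniques.
CORRELATION = {
    ("car", "automobile"): 0.9,
    ("fast", "quick"): 0.8,
}


def corr(a, b):
    """Symmetric lookup; a word is treated as fully correlated with itself."""
    if a == b:
        return 1.0
    return CORRELATION.get((a, b), CORRELATION.get((b, a), 0.0))


def membership(word, sentence):
    """Assumed fuzzy degree to which `word` belongs to `sentence` (a word list)."""
    return 1.0 - prod(1.0 - corr(word, other) for other in sentence)


def sentence_similarity(s1, s2):
    """Average membership of the words of s1 in s2 (asymmetric by design)."""
    if not s1:
        return 0.0
    return sum(membership(w, s2) for w in s1) / len(s1)


if __name__ == "__main__":
    a = ["fast", "car"]
    b = ["quick", "automobile"]
    print(sentence_similarity(a, b))
```

Under these assumed values, the two sentences share no exact word yet still receive a high similarity score, which is the behavior an exact-substring or fingerprint comparison would miss.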