Utilizing phrase-similarity measures for detecting and clustering informative RSS news articles

Authors:
Maria Soledad Pera;Yiu-Kai Ng
Affiliations:
Computer Science Department, Brigham Young University, Provo, UT, USA;Computer Science Department, Brigham Young University, Provo, UT, USA (corresponding author, e-mail: ng@cs.byu.edu)
Venue:
Integrated Computer-Aided Engineering
Year:
2008

Abstract

As the number of RSS news feeds on the Internet continues to increase, it becomes necessary to minimize the workload of users, who must otherwise scan through a huge number of news articles to find related articles of interest, a tedious and often impossible task. To solve this problem, we present a novel approach, called InFRSS, which consists of a correlation-based phrase matching (CPM) model and a fuzzy compatibility clustering (FCC) model. CPM detects RSS news articles containing phrases that are identical or semantically alike, and determines the degree of similarity of any two articles. FCC identifies and clusters non-redundant, closely related RSS news articles based on their degrees of similarity and a fuzzy compatibility relation. Experimental results show that (i) our CPM model, which matches bigrams and trigrams in RSS news articles, outperforms other phrase/keyword-matching approaches and (ii) our FCC model generates high-quality clusters and outperforms other well-known clustering techniques.
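
The abstract does not give the CPM or FCC formulas, so the following is only a rough sketch of the general idea of scoring article pairs by shared bigrams and trigrams and then grouping articles whose score passes a threshold. Everything in it is an assumption made for illustration: the names `ngrams`, `phrase_similarity`, and `cluster`, the use of Jaccard overlap in place of the correlation-based score, and the grouping by connected components above a cutoff `alpha` in place of the fuzzy compatibility relation. It matches exact phrases only and does not handle semantically alike phrases or redundancy removal.

```python
from itertools import combinations


def ngrams(text, n):
    """Return the set of word n-grams in a lowercased, whitespace-tokenized text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def phrase_similarity(a, b):
    """Jaccard overlap of bigrams and trigrams (a stand-in, NOT the paper's CPM)."""
    pa = ngrams(a, 2) | ngrams(a, 3)
    pb = ngrams(b, 2) | ngrams(b, 3)
    if not pa or not pb:
        return 0.0
    return len(pa & pb) / len(pa | pb)


def cluster(articles, alpha):
    """Group article indices linked by similarity >= alpha.

    Uses union-find to take connected components of the thresholded
    similarity graph (an assumed simplification, NOT the paper's FCC).
    """
    parent = list(range(len(articles)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(articles)), 2):
        if phrase_similarity(articles[i], articles[j]) >= alpha:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(articles)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

For example, calling `cluster` with a cutoff of 0.5 on the three headlines "stock markets fall sharply today", "stock markets fall sharply again", and "local team wins the final" places the first two in one group and the third on its own, since the first pair shares five of nine distinct phrases and the third shares none.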