Web Search Clustering and Labeling with Hidden Topics

Authors:
Cam-Tu Nguyen;Xuan-Hieu Phan;Susumu Horiguchi;Thu-Trang Nguyen;Quang-Thuy Ha
Affiliations:
Tohoku University;Tohoku University;Tohoku University;Vietnam National University;Vietnam National University
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2009

Citing 24
Cited 8

Scatter/Gather: a cluster-based approach to browsing large document collections

SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
Real life information retrieval: a study of user queries on the Web

ACM SIGIR Forum
Foundations of statistical natural language processing

Foundations of statistical natural language processing
Grouper: a dynamic clustering interface to Web search results

WWW '99 Proceedings of the eighth international conference on World Wide Web
Bringing order to the Web: automatically categorizing search results

Proceedings of the SIGCHI conference on Human Factors in Computing Systems
Text categorization by boosting automatically extracted concepts

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Latent dirichlet allocation

The Journal of Machine Learning Research
Entity-based cross-document coreferencing using the Vector Space Model

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
Learning to cluster web search results

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
A personalized search engine based on web-snippet hierarchical clustering

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
A web-based kernel function for measuring the similarity of short text snippets

Proceedings of the 15th international conference on World Wide Web
Dynamic topic models

ICML '06 Proceedings of the 23rd international conference on Machine learning
Automatically labeling hierarchical clusters

dg.o '06 Proceedings of the 2006 international conference on Digital government research
Measuring semantic similarity between words using web search engines

Proceedings of the 16th international conference on World Wide Web
Identifying Document Topics Using the Wikipedia Category Network

WI '06 Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence
Clustering short texts using wikipedia

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Automatic labeling of multinomial topic models

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Learning to classify short and sparse text & web with hidden topics from large-scale data collections

Proceedings of the 17th international conference on World Wide Web
Enhancing text clustering by leveraging Wikipedia semantics

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval

ICDM '07 Proceedings of the 2007 Seventh IEEE International Conference on Data Mining
Improving similarity measures for short segments of text

AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence - Volume 2
Computing semantic relatedness using Wikipedia-based explicit semantic analysis

IJCAI'07 Proceedings of the 20th international joint conference on Artifical intelligence
The design, implementation, and use of the Ngram statistics package

CICLing'03 Proceedings of the 4th international conference on Computational linguistics and intelligent text processing
Cluster generation and cluster labelling for web snippets: a fast and accurate hierarchical solution

SPIRE'06 Proceedings of the 13th international conference on String Processing and Information Retrieval

Inducing word senses to improve web search result clustering

EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
A feature-word-topic model for image annotation

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Clustering web search results with maximum spanning trees

AI*IA'11 Proceedings of the 12th international conference on Artificial intelligence around man and beyond
Transferring topical knowledge from auxiliary long texts for short text clustering

Proceedings of the 20th ACM international conference on Information and knowledge management
A novel approach for clustering sentiments in Chinese blogs based on graph similarity

Computers & Mathematics with Applications
Improving Vietnamese web page clustering by combining neighbors' content and using iterative feature selection

Proceedings of the Third Symposium on Information and Communication Technology
Navigating the topical structure of academic search results via the Wikipedia category network

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
A feature-word-topic model for image annotation and retrieval

ACM Transactions on the Web (TWEB)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Web search clustering is a solution to reorganize search results (also called “snippets”) in a more convenient way for browsing. There are three key requirements for such post-retrieval clustering systems: (1) the clustering algorithm should group similar documents together; (2) clusters should be labeled with descriptive phrases; and (3) the clustering system should provide high-quality clustering without downloading the whole Web page. This article introduces a novel framework for clustering Web search results in Vietnamese which targets the three above issues. The main motivation is that by enriching short snippets with hidden topics from huge resources of documents on the Internet, it is able to cluster and label such snippets effectively in a topic-oriented manner without concerning whole Web pages. Our approach is based on recent successful topic analysis models, such as Probabilistic-Latent Semantic Analysis, or Latent Dirichlet Allocation. The underlying idea of the framework is that we collect a very large external data collection called “universal dataset,” and then build a clustering system on both the original snippets and a rich set of hidden topics discovered from the universal data collection. This can be seen as a richer representation of snippets to be clustered. We carry out careful evaluation of our method and show that our method can yield impressive clustering quality.