Web Snippet Clustering Based on Text Enrichment with Concept Hierarchy

Authors:
Supakpong Jinarat;Choochart Haruechaiyasak;Arnon Rungsawang
Affiliations:
Massive Information & Knowledge Engineering, Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand 10900;Human Language Technology Laboratory (HLT), National Electronics and Computer Technology Center (NECTEC), Pathumthani, Thailand 12120;Massive Information & Knowledge Engineering, Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand 10900
Venue:
ICONIP '09 Proceedings of the 16th International Conference on Neural Information Processing: Part II
Year:
2009

Abstract

Clustering the web snippets returned by a search engine makes the results easier for users to browse and navigate. Because web snippets are extremely short, many traditional clustering techniques that adopt the bag-of-words model often yield unsatisfactory results. In this paper, we propose a text enrichment method for improving the performance of web snippet clustering. The main idea is to expand the original snippets with related conceptual terms. We use the Open Directory Project (ODP), a human-organized web taxonomy, to provide the concept hierarchy of web contents. Using a test data set of 240 queries, we ran experiments with two clustering techniques: K-means clustering as the non-overlapping approach and Suffix Tree Clustering (STC) as the overlapping approach. With the proposed text enrichment method, K-means clustering improved overall performance by up to 15.51% on the F1 measure. Suffix Tree Clustering with text enrichment improved performance by up to 53.71%.
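The enrichment idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `TOY_HIERARCHY` mapping, the category names, and the helper functions `tokenize`, `enrich`, and `cosine` are all hypothetical and stand in for a lookup against real ODP data. The sketch only shows why appending concept terms can help a bag-of-words model: two snippets with no words in common gain overlap once each is expanded with the categories of its terms.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy concept hierarchy: term -> path of ODP-style categories.
# These entries are illustrative assumptions, not taken from the real ODP.
TOY_HIERARCHY = {
    "engine": ["Recreation", "Autos"],
    "sedan": ["Recreation", "Autos"],
    "habitat": ["Science", "Biology"],
}

PUNCTUATION = ".,;:!?()\"'"


def tokenize(text):
    """Lowercase the text and split it into punctuation-free word tokens."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def enrich(snippet, hierarchy=TOY_HIERARCHY):
    """Return the snippet tokens followed by the concept terms of each token."""
    tokens = tokenize(snippet)
    concepts = []
    for tok in tokens:
        concepts.extend(c.lower() for c in hierarchy.get(tok, []))
    return tokens + concepts


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two token lists under a bag-of-words model."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


snippet_1 = "Engine repair tips"
snippet_2 = "Sedan dealers nearby"

raw_similarity = cosine(tokenize(snippet_1), tokenize(snippet_2))
enriched_similarity = cosine(enrich(snippet_1), enrich(snippet_2))
print(raw_similarity, enriched_similarity)
```

In this toy case the raw snippets share no terms, so their similarity is zero, while the enriched versions share the two category terms. The enriched token lists could then be passed to any clustering algorithm, such as K-means or STC, in place of the original snippets.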