Improving Vietnamese web page clustering by combining neighbors' content and using iterative feature selection

Authors:
Le Viet Hung;Nguyen Thi Kim Anh;Nguyen Hai Dang
Affiliations:
Hanoi University of Science and Technology;Hanoi University of Science and Technology;Hanoi University of Science and Technology
Venue:
Proceedings of the Third Symposium on Information and Communication Technology
Year:
2012

Citing 12
Cited 0

Scatter/Gather: a cluster-based approach to browsing large document collections

SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
Enhanced hypertext categorization using hyperlinks

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
A practical hypertext catergorization method using links and incrementally available class information

SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Evaluating contents-link coupled web page clustering for web search results

Proceedings of the eleventh international conference on Information and knowledge management
Hypertext Categorization using Hyperlink Patterns and Meta Data

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Graph-based text classification: learn from your neighbors

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
A neighborhood-based approach for clustering of linked document collections

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
A novel feature selection algorithm for text categorization

Expert Systems with Applications: An International Journal
An Iterative Hybrid Filter-Wrapper Approach to Feature Selection for Document Clustering

Canadian AI '09 Proceedings of the 22nd Canadian Conference on Artificial Intelligence: Advances in Artificial Intelligence
Web Search Clustering and Labeling with Hidden Topics

ACM Transactions on Asian Language Information Processing (TALIP)
Boilerplate detection using shallow text features

Proceedings of the third ACM international conference on Web search and data mining
K-means based approaches to clustering nodes in annotated graphs

ISMIS'11 Proceedings of the 19th international conference on Foundations of intelligent systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Web page clustering is a fundamental technique to offer a solution for data management, information locating and its interpretation of Web data and to facilitate users for navigation, discrimination and understanding. Most existing clustering algorithms can't adapt well to Web page clustering directly in terms of efficiency and effectiveness due to the problems of high dimensionality and data sparseness. Furthermore, the uncontrolled nature of web content presents additional challenges to web page clustering, whereas the interconnected characteristic of hypertext can provide useful information for the process. To address this problem, we propose a new Web page clustering method with combining neighbors' content to overcome data sparseness and using Iterative Feature Selection to remove noisy and redundant features and to improve the performance of clustering algorithm. Experimental results show that the proposed method significantly improves the performance of the Vietnamese web page clustering with a relatively small number of good descriptive features for web pages.