Link-based similarity measures for the classification of Web documents

Authors:
Pável Calado;Marco Cristo;Marcos André Gonçalves;Edleno S. de Moura;Berthier Ribeiro-Neto;Nivio Ziviani
Affiliations:
Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil;Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil;Department of Computer Science, Virginia Tech, Blacksburg, VA;Department of Computer Science, Federal University of Amazonas, Manaus, AM, Brazil;Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil and Akwan Information Technologies, Belo Horizonte, MG, Brazil;Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil
Venue:
Journal of the American Society for Information Science and Technology
Year:
2006

Citing 0
Cited 21

A comparative study of citations and links in document classification
Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries

FLUX-CIM: flexible unsupervised extraction of citation metadata
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries

Automatic patent classification using citation network information: an experimental study in nanotechnology
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries

User-assisted similarity estimation for searching related web pages
Proceedings of the eighteenth conference on Hypertext and hypermedia

Refining Pairwise Similarity Matrix for Cluster Ensemble Problem with Cluster Relations
DS '08 Proceedings of the 11th International Conference on Discovery Science

Intelligent hybrid approach to false identity detection
Proceedings of the 12th International Conference on Artificial Intelligence and Law

Hybrid clustering for validation and improvement of subject-classification schemes
Information Processing and Management: an International Journal

Managing Knowledge in Light of Its Evolution Process: An Empirical Study on Citation Network-Based Patent Classification
Journal of Management Information Systems

Fuzzy Sets and Rough Sets for Scenario Modelling and Analysis
RSFDGrC '09 Proceedings of the 12th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing

Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection
FUZZ-IEEE'09 Proceedings of the 18th international conference on Fuzzy Systems

Revisit of nearest neighbor test for direct evaluation of inter-document similarities
ECIR'08 Proceedings of the IR research, 30th European conference on Advances in information retrieval

Improving annotation categorization performance through integrated social annotation computation
Expert Systems with Applications: An International Journal

Classifying documents with link-based bibliometric measures
Information Retrieval

Disclosing false identity through hybrid link analysis
Artificial Intelligence and Law

Use of Medical Subject Headings (MeSH) in Portuguese for categorizing web-based healthcare content
Journal of Biomedical Informatics

Using internal link and social network analysis to support searches in Wikipedia: A model and its evaluation
Journal of Information Science

Detecting fake websites: the contribution of statistical learning theory
MIS Quarterly

Optimal and hierarchical clustering of large-scale hybrid networks for scientific mapping
Scientometrics

Combination of document structure and links for multimedia object retrieval
Journal of Information Science

QUBiC: An adaptive approach to query-based recommendation
Journal of Intelligent Information Systems

Pairwise similarity for cluster ensemble problem: link-based and approximate approaches
Transactions on Large-Scale Data- and Knowledge-centered systems IX

Abstract

Traditional text-based document classifiers tend to perform poorly on the Web. Text in Web documents is usually noisy and often does not contain enough information to determine their topic. However, the Web provides a different source that can be useful to document classification: its hyperlink structure. In this work, the authors evaluate how the link structure of the Web can be used to determine a measure of similarity appropriate for document classification. They experiment with five different similarity measures and determine their adequacy for predicting the topic of a Web page. Tests performed on a Web directory show that link information alone allows classifying documents with an average precision of 86%. Further, when combined with a traditional text-based classifier, precision increases to values of up to 90%, representing gains that range from 63 to 132% over the use of text-based classification alone. Because the measures proposed in this article are straightforward to compute, they provide a practical and effective solution for Web classification and related information retrieval tasks. Further, the authors provide an important set of guidelines on how link structure can be used effectively to classify Web documents. © 2006 Wiley Periodicals, Inc.
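
To make the idea of a link-based similarity measure concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not name the five measures it evaluates, so the sketch assumes one standard bibliometric measure, co-citation (two pages are similar when the same pages link to both), and assumes a similarity-weighted k-nearest-neighbour vote as the classifier. The function names, the `in_links` and `labels` structures, and the toy data are all hypothetical.

```python
from collections import Counter

def cocitation_similarity(in_links, a, b):
    """Assumed measure: shared in-linking pages divided by all in-linking pages."""
    parents_a = in_links.get(a, set())
    parents_b = in_links.get(b, set())
    union = parents_a | parents_b
    return len(parents_a & parents_b) / len(union) if union else 0.0

def knn_classify(in_links, labels, page, k=3):
    """Assumed classifier: similarity-weighted vote among the k most similar labelled pages."""
    scored = sorted(
        ((cocitation_similarity(in_links, page, other), other)
         for other in labels if other != page),
        reverse=True,
    )
    votes = Counter()
    for sim, other in scored[:k]:
        if sim > 0:  # pages sharing no in-links carry no evidence
            votes[labels[other]] += sim
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical toy data: page -> set of pages that link to it.
in_links = {
    "p1": {"h1", "h2"},
    "p2": {"h1", "h2", "h3"},
    "p3": {"h4"},
    "p4": {"h4", "h5"},
    "new": {"h1", "h2", "h3"},
}
# Hypothetical directory categories for the already-classified pages.
labels = {"p1": "sports", "p2": "sports", "p3": "music", "p4": "music"}

predicted = knn_classify(in_links, labels, "new")
```

In this toy data, the page `new` shares in-links only with `p1` and `p2`, so the vote is decided entirely by those two pages. A page with no in-links at all receives no prediction (`None`), which is one situation where falling back to a text-based classifier, as the abstract's combined approach suggests, would be a natural choice.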