Fast categorization of web documents represented by graphs

Authors:
A. Markov; M. Last; A. Kandel
Affiliations:
Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel; Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel; Department of Computer Science and Engineering, University of South Florida, Tampa, FL
Venue:
WebKDD'06 Proceedings of the 8th Knowledge discovery on the web international conference on Advances in web mining and web usage analysis
Year:
2006

Abstract

Most text categorization methods are based on the vector-space model of information retrieval. An important advantage of this representation is that it can be used by both instance-based and model-based classifiers. However, this popular method of document representation does not capture important structural information, such as the order and proximity of word occurrences or the location of a word within the document. It also makes no use of the mark-up information available from the HTML tags of web documents. A recently developed graph-based representation of web documents can preserve this structural information. The new document model was shown to outperform the traditional vector representation when used with the k-Nearest Neighbor (k-NN) classification algorithm. The problem, however, is that eager (model-based) classifiers cannot work with this representation directly. In this chapter, three new hybrid approaches to web document categorization are presented. They are built upon both graph and vector-space representations, thus preserving the benefits and overcoming the limitations of each. The hybrid methods are compared to vector-based models using two model-based classifiers (the C4.5 decision-tree algorithm and probabilistic Naïve Bayes) and several benchmark web document collections. The results demonstrate that, in most cases, the hybrid methods outperform existing approaches in terms of classification accuracy and, in addition, achieve a significant increase in categorization speed.
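
The abstract does not spell out how the three hybrid methods work, so the following is only a minimal sketch of the general idea of combining the two representations, under stated assumptions. It assumes a document graph whose nodes are words and whose directed edges link adjacent words, and it uses single edges that recur across documents as binary features. The function names (`document_graph`, `select_edge_features`, `to_binary_vector`) and the `min_fraction` threshold are hypothetical and are not taken from the chapter.

```python
from collections import Counter


def document_graph(text):
    """Represent a document as a set of directed edges between adjacent words.

    Assumption: nodes are lowercased whitespace-separated words; an edge
    (w1, w2) means w2 immediately follows w1 somewhere in the text.
    """
    words = text.lower().split()
    return set(zip(words, words[1:]))


def select_edge_features(graphs, min_fraction=0.5):
    """Keep edges that occur in at least min_fraction of the graphs.

    Assumption: single edges stand in for the structural features;
    the threshold value is illustrative only.
    """
    counts = Counter(edge for graph in graphs for edge in graph)
    threshold = min_fraction * len(graphs)
    return sorted(edge for edge, n in counts.items() if n >= threshold)


def to_binary_vector(graph, features):
    """Map a document graph to a 0/1 vector over the selected edge features."""
    return [1 if edge in graph else 0 for edge in features]


docs = [
    "graph based web document model",
    "graph based text classifier",
    "vector space document model",
]
graphs = [document_graph(d) for d in docs]
features = select_edge_features(graphs, min_fraction=0.5)
vectors = [to_binary_vector(g, features) for g in graphs]

print(features)
print(vectors)
```

The resulting fixed-length vectors could be passed to any model-based classifier, such as a decision tree or Naïve Bayes, while each feature still encodes word order: the edge `("graph", "based")` is distinct from `("based", "graph")`, a difference that a plain bag-of-words vector would not record.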