Graph-theoretic techniques for web content mining

Authors: Adam Schenker; Abraham Kandel
Affiliations: University of South Florida; University of South Florida
Venue: Graph-theoretic techniques for web content mining
Year: 2003

Citing 0
Cited 7

Classification of Web Documents Using a Graph-Based Model and Structural Patterns
PKDD 2007 Proceedings of the 11th European conference on Principles and Practice of Knowledge Discovery in Databases

Graph Classification Based on Dissimilarity Space Embedding
SSPR & SPR '08 Proceedings of the 2008 Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition

Text classification using graph mining-based feature extraction
Knowledge-Based Systems

A scalable eigensolver for large scale-free graphs using 2D graph partitioning
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

A simple, structure-sensitive approach for web document classification
AWIC'05 Proceedings of the Third international conference on Advances in Web Intelligence

Discovering and analyzing multi-granular web search results
FQAS'11 Proceedings of the 9th international conference on Flexible Query Answering Systems

CoBAn: A context based model for data leakage prevention
Information Sciences: an International Journal

Quantified Score

Hi-index	0.00

Abstract

In this dissertation we introduce several novel techniques for performing data mining on web documents that use graph representations of document content. Graphs are more robust than typical vector representations because they can model structural information that is usually lost when the original web document content is converted to a vector representation. For example, we can capture information such as the location, order, and proximity of term occurrence, which is discarded under the standard document vector representation models. Many machine learning methods rely on distance computations, centroid calculations, and other numerical techniques. Thus, many of these methods have not been applied to data represented by graphs, since no suitable graph-theoretical concepts were previously available. We introduce the novel Graph Hierarchy Construction Algorithm (GHCA), which performs topic-oriented hierarchical clustering of web search results modeled using graphs. The system we created around this new algorithm and its prior version is compared with similar web search clustering systems to gauge its usefulness. An important advantage of this approach over conventional web search systems is that the results are better organized and more easily browsed by users. Next, we present extensions to classical machine learning algorithms, such as the k-means clustering algorithm and the k-Nearest Neighbors classification algorithm, which allow the use of graphs as fundamental data items instead of vectors. We perform experiments comparing the performance of the new graph-based methods to the traditional vector-based methods for three web document collections. Our experimental results show an improvement for the graph approaches over the vector approaches for both clustering and classification of web documents.
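To illustrate the claim that a graph can keep term order while a vector discards it, here is a minimal Python sketch. It assumes one simple graph model that the abstract does not spell out: one node per distinct term, and a directed edge from a term to the term that immediately follows it. The function names `tokenize`, `bag_of_words`, and `term_graph` are hypothetical and chosen only for this example.

```python
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would also strip markup,
    # punctuation, and stop words (assumption made for brevity).
    return text.lower().split()


def bag_of_words(text):
    # Standard vector-style representation: term -> frequency.
    return Counter(tokenize(text))


def term_graph(text):
    # Assumed graph model: one node per distinct term, and a directed
    # edge (a, b) whenever term b immediately follows term a.
    terms = tokenize(text)
    nodes = frozenset(terms)
    edges = frozenset(zip(terms, terms[1:]))
    return nodes, edges


doc_a = "dog bites man"
doc_b = "man bites dog"

# The two documents are indistinguishable as term-frequency vectors ...
same_vector = bag_of_words(doc_a) == bag_of_words(doc_b)
# ... but their graphs differ, because the edges record term order.
same_graph = term_graph(doc_a) == term_graph(doc_b)
print(same_vector, same_graph)
```

The two example documents contain the same terms with the same frequencies, so only the edge set separates them.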
An important advantage of the graph representations we propose is that they allow graph similarity to be computed in polynomial time; in general, determining graph similarity with the techniques we use is an NP-complete problem. In fact, in some cases the graph-oriented approach ran faster than the vector approaches.
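The abstract does not state why similarity becomes polynomial-time under the proposed representations. One assumption that would give this property is that every term labels at most one node, so node labels are unique within a graph. Under that assumption the maximum common subgraph of two graphs reduces to intersecting their node sets and edge sets. The Python sketch below uses this idea with a distance of the form 1 - |mcs| / max(|G1|, |G2|), where graph size is taken as nodes plus edges, and plugs it into a k-Nearest Neighbors vote. The distance formula, the size measure, and all function names are illustrative assumptions, not details taken from the dissertation.

```python
from collections import Counter


def graph_from_terms(terms):
    # Assumed model: unique node per term, directed edge between adjacent terms.
    return frozenset(terms), frozenset(zip(terms, terms[1:]))


def size(graph):
    # Assumed size measure: number of nodes plus number of edges.
    nodes, edges = graph
    return len(nodes) + len(edges)


def common_subgraph(g1, g2):
    # With unique node labels, the maximum common subgraph is just the
    # intersection of node sets and edge sets; no subgraph search is needed.
    return g1[0] & g2[0], g1[1] & g2[1]


def distance(g1, g2):
    # Assumed distance: 1 - |mcs| / max(|g1|, |g2|), in the range [0, 1].
    largest = max(size(g1), size(g2))
    if largest == 0:
        return 0.0
    return 1.0 - size(common_subgraph(g1, g2)) / largest


def knn_classify(query, labelled_graphs, k=3):
    # Rank training graphs by distance to the query and take a majority vote.
    ranked = sorted(labelled_graphs, key=lambda item: distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


training = [
    (graph_from_terms(["web", "graph", "mining"]), "tech"),
    (graph_from_terms(["graph", "mining", "methods"]), "tech"),
    (graph_from_terms(["stock", "market", "news"]), "finance"),
    (graph_from_terms(["market", "news", "today"]), "finance"),
]
query = graph_from_terms(["web", "graph", "clustering"])
predicted = knn_classify(query, training, k=3)
print(predicted)
```

With hash-based sets, each distance computation takes expected time linear in the sizes of the two graphs, which is what makes the graph-based k-NN practical in this sketch. Without unique node labels, finding a maximum common subgraph would require a combinatorial search.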