A novel efficient classification algorithm for search engines

Authors:
Hanan Ahmed Hosni;Mahmoud Abd Alla
Affiliations:
Information Technology Department, College of Computer and Information Sciences, King Saud University;Information Technology Department, College of Computer and Information Sciences, King Saud University
Venue:
AIC'08 Proceedings of the 8th conference on Applied informatics and communications
Year:
2008

Abstract

This paper proposes a new algorithm for classifying Web documents into a set of categories. The technique analyzes the relationships between documents and the terms they contain, producing a set of rules that relate a document's category to its terms and their frequencies. Each document is represented by a graph that correlates its most frequent combined words with its category, and the relationships among these graphs and the documents' categories are captured. The technique has three phases. The first is a training phase, in which human experts determine the categories of different web pages and articles, and the supervised classification algorithm combines these categories with appropriately weighted index terms according to the most highly supported rules among the most frequent words. The second is a blind categorization phase, in which a web crawler traverses the World Wide Web to build a database of URLs and their categories, assigned according to the results of the first phase. The third phase applies the proposed graph representation technique to the whole set of documents in each category to determine that category's final graph representation. The third phase produces better classification rules because the sample size is larger, at no additional cost of supervised categorization. Experiments are conducted using data sets collected from different Web portals.
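
The three phases can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how "combined words" are formed, how rules are weighted, or how a document is matched against a category graph. The sketch assumes that combined words are adjacent word pairs, that a category graph is a weighted set of such pairs, and that a document is assigned to the category whose graph shares the most edge weight with it. All function names (`top_word_pairs`, `train`, `classify`, `refine`) and the parameter `k` are hypothetical.

```python
import re
from collections import Counter, defaultdict


def top_word_pairs(text, k=5):
    """Return the k most frequent adjacent word pairs in text.

    Assumption: "combined words" are modeled as adjacent word pairs,
    each pair acting as one edge of the document graph.
    """
    words = re.findall(r"[a-z]+", text.lower())
    pairs = Counter(zip(words, words[1:]))
    return [pair for pair, _ in pairs.most_common(k)]


def train(labeled_docs, k=5):
    """Phase 1 (assumed form): build one weighted edge set per category
    from (text, category) pairs labeled by human experts."""
    category_graphs = defaultdict(Counter)
    for text, category in labeled_docs:
        for edge in top_word_pairs(text, k):
            category_graphs[category][edge] += 1
    return category_graphs


def classify(text, category_graphs, k=5):
    """Phase 2 (assumed form): pick the category whose graph shares
    the most edge weight with the document's own top word pairs."""
    edges = top_word_pairs(text, k)
    scores = {
        category: sum(graph[edge] for edge in edges)
        for category, graph in category_graphs.items()
    }
    return max(scores, key=scores.get)


def refine(unlabeled_docs, category_graphs, k=5):
    """Phase 3 (assumed form): rebuild the category graphs from the
    automatically categorized documents, with no extra manual labeling."""
    auto_labeled = [
        (text, classify(text, category_graphs, k)) for text in unlabeled_docs
    ]
    return train(auto_labeled, k)


# Toy data, invented purely for illustration.
labeled = [
    ("stock market prices rise as stock market investors buy", "finance"),
    ("football team wins match as football team scores", "sports"),
]
graphs = train(labeled)
predicted = classify("the stock market fell today", graphs)
refined = refine(["the stock market fell today", "football team lost"], graphs)
```

In this sketch, `refine` stands in for crawled pages by taking a plain list of texts; a real crawler and a support threshold for the rules would replace these simplifications.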