A Web page classification system based on a genetic algorithm using tagged-terms as features

Authors:
Selma Ayşe Özel
Affiliations:
Department of Computer Engineering, Cukurova University, 01330 Balcali, Sarıçam, Adana, Turkey
Venue:
Expert Systems with Applications: An International Journal
Year:
2011

Abstract

The rapid growth in the amount of information on the World Wide Web has given rise to topic-specific crawling of the Web. During a focused crawling process, an automatic Web page classification mechanism is needed to determine whether the page being considered is on topic or not. In this study, a genetic algorithm (GA) based automatic Web page classification system is developed. The system uses both HTML tags and the terms belonging to each tag as classification features, and it learns an optimal classifier from the positive and negative Web pages in the training dataset. Our system classifies Web pages by simply computing the similarity between the learned classifier and the new Web pages. Existing GA-based classifiers use either HTML tags or terms alone as features; in this study both are taken together, and optimal weights for the features are learned by our GA. It was found that using both HTML tags and the terms in each tag as separate features improves classification accuracy. The composition of the training dataset also affects accuracy: when the number of negative documents is larger than the number of positive documents, the classification accuracy of our system increases up to 95% and becomes higher than that of the well-known Naive Bayes and k-nearest neighbor classifiers.
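The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify the chromosome encoding, fitness function, or genetic operators, so everything here is an assumption made for illustration. That includes the `tag:term` feature naming, the use of cosine similarity, the fitness (mean similarity to positive pages minus mean similarity to negative pages), the elitist selection, the uniform crossover, the random-reset mutation, and the toy training pages.

```python
import math
import random
from collections import Counter

def tagged_terms(page):
    """Turn a page, given as (tag, text) pairs, into counts of 'tag:term' features."""
    feats = Counter()
    for tag, text in page:
        for term in text.lower().split():
            feats[f"{tag}:{term}"] += 1
    return feats

def similarity(weights, feats):
    """Cosine similarity between a classifier weight vector and a page's feature counts."""
    dot = sum(weights.get(f, 0.0) * c for f, c in feats.items())
    nw = math.sqrt(sum(w * w for w in weights.values()))
    nf = math.sqrt(sum(c * c for c in feats.values()))
    return dot / (nw * nf) if nw and nf else 0.0

def fitness(weights, positives, negatives):
    """Assumed fitness: mean similarity to positives minus mean similarity to negatives."""
    pos = sum(similarity(weights, p) for p in positives) / len(positives)
    neg = sum(similarity(weights, n) for n in negatives) / len(negatives)
    return pos - neg

def evolve(positives, negatives, generations=40, pop_size=20, seed=0):
    """Evolve one weight per tagged term; returns the fittest weight vector found."""
    rng = random.Random(seed)
    vocab = sorted(set().union(*positives, *negatives))
    population = [{f: rng.random() for f in vocab} for _ in range(pop_size)]
    score = lambda w: fitness(w, positives, negatives)
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]          # elitist selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover: each gene comes from either parent.
            child = {f: (a[f] if rng.random() < 0.5 else b[f]) for f in vocab}
            for f in vocab:                            # random-reset mutation
                if rng.random() < 0.1:
                    child[f] = rng.random()
            children.append(child)
        population = parents + children
    return max(population, key=score)

# Hypothetical toy training set: "course" pages are on topic, "football" pages are not.
positives = [
    tagged_terms([("title", "course syllabus"), ("h1", "lecture notes")]),
    tagged_terms([("title", "course homework"), ("a", "lecture slides")]),
]
negatives = [
    tagged_terms([("title", "football scores"), ("h1", "match report")]),
    tagged_terms([("title", "football league"), ("a", "match tickets")]),
]
best = evolve(positives, negatives)

new_pos = tagged_terms([("title", "course syllabus homework"),
                        ("h1", "lecture notes"), ("a", "lecture slides")])
new_neg = tagged_terms([("title", "football scores league"),
                        ("h1", "match report"), ("a", "match tickets")])
```

In this sketch a new page would be labeled on topic when its similarity to `best` exceeds some threshold. How the threshold is chosen is left open here, since the abstract does not describe it. Note that `lecture` inside an `h1` tag and `lecture` inside an `a` tag are separate features, which mirrors the abstract's point that tags and terms are used together.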