The rapid growth in the amount of information on the World Wide Web has given rise to topic-specific crawling of the Web. During a focused crawl, an automatic Web page classification mechanism is needed to determine whether the page under consideration is on topic. In this study, we develop a genetic algorithm (GA) based automatic Web page classification system that uses both HTML tags and the terms belonging to each tag as classification features, and that learns an optimal classifier from the positive and negative Web pages in the training dataset. Our system classifies a new Web page simply by computing the similarity between the learned classifier and that page. Existing GA-based classifiers use either HTML tags or terms as features; in this study both are used together, and our GA learns optimal weights for the features. We found that using both HTML tags and the terms in each tag as separate features improves classification accuracy. We also found that the composition of the training dataset affects accuracy: when the negative documents outnumber the positive documents, the classification accuracy of our system rises to as much as 95% and exceeds that of the well-known Naive Bayes and k-nearest-neighbor classifiers.
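The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measure, the decision threshold, or the GA operators, so the cosine similarity, the fixed `threshold`, the one-point crossover, the single-gene mutation, and every function name below are assumptions chosen for illustration. Each feature is a (tag, term) pair, a chromosome holds one weight per feature, and fitness is accuracy on the labeled training pages.

```python
import math
import random
from collections import Counter


def features(page):
    """Count (tag, term) pairs; `page` is a list of (tag, text) tuples."""
    counts = Counter()
    for tag, text in page:
        for term in text.lower().split():
            counts[(tag, term)] += 1
    return counts


def cosine(weights, vec):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * vec.get(k, 0) for k, w in weights.items())
    norm_w = math.sqrt(sum(w * w for w in weights.values()))
    norm_v = math.sqrt(sum(v * v for v in vec.values()))
    return dot / (norm_w * norm_v) if norm_w and norm_v else 0.0


def accuracy(weights, pages, labels, threshold):
    """Fraction of pages whose thresholded similarity matches the label."""
    hits = sum(
        (cosine(weights, features(p)) >= threshold) == y
        for p, y in zip(pages, labels)
    )
    return hits / len(pages)


def evolve(pages, labels, threshold=0.5, pop_size=20, generations=30, seed=0):
    """Evolve one weight per (tag, term) feature (assumed GA design)."""
    rng = random.Random(seed)
    vocab = sorted({k for p in pages for k in features(p)})

    def fitness(chrom):
        return accuracy(dict(zip(vocab, chrom)), pages, labels, threshold)

    pop = [[rng.random() for _ in vocab] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        nxt = pop[:2]  # elitism: carry the two fittest over unchanged
        while len(nxt) < pop_size:
            # parents come from the fitter half of the population
            a, b = rng.sample(pop[: pop_size // 2], 2)
            cut = rng.randrange(1, len(vocab)) if len(vocab) > 1 else 0
            child = a[:cut] + b[cut:]  # one-point crossover
            if rng.random() < 0.2:  # mutation: resample one weight
                child[rng.randrange(len(child))] = rng.random()
            nxt.append(child)
        pop = nxt
    return dict(zip(vocab, max(pop, key=fitness)))


# Toy training set: True = on topic (positive), False = off topic (negative).
train_pages = [
    [("title", "genetic algorithm"), ("p", "crossover mutation fitness")],
    [("title", "genetic search"), ("p", "fitness selection")],
    [("title", "cooking recipes"), ("p", "pasta sauce")],
    [("title", "football news"), ("p", "match score")],
    [("title", "travel guide"), ("p", "hotel flight")],
]
train_labels = [True, True, False, False, False]

classifier = evolve(train_pages, train_labels)
print(accuracy(classifier, train_pages, train_labels, 0.5))
```

A new page is then classified as on topic when `cosine(classifier, features(page))` reaches the threshold. Keeping the tag in the feature key is what lets the same term carry a different weight in a title than in body text.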