Genetic Mining of HTML Structures for Effective Web-Document Retrieval

Authors:
Sun Kim; Byoung-Tak Zhang
Affiliations:
Biointelligence Laboratory, School of Computer Science and Engineering, Seoul National University, Seoul 151-742, Korea. skim@bi.snu.ac.kr; Biointelligence Laboratory, School of Computer Science and Engineering, Seoul National University, Seoul 151-742, Korea. btzhang@bi.snu.ac.kr
Venue:
Applied Intelligence
Year:
2003

Abstract

Web documents contain a number of tags that indicate the structure of their text. Text segments marked by HTML tags carry specific meaning, which can be exploited to improve the performance of document retrieval systems. In this paper, we present a machine learning approach that mines the structure of HTML documents for effective Web-document retrieval. We describe a genetic algorithm that learns the importance factors of HTML tags; these factors are then used to re-rank the documents retrieved by standard weighting schemes. The proposed method has been evaluated on artificial text sets and a large-scale TREC document collection. The experimental results show that the proposed algorithm trains the tag weights in accordance with the importance factors for retrieval, and that the approach significantly improves retrieval accuracy. In particular, document-structure mining tends to move relevant documents to higher ranks, which is especially important in interactive Web-information retrieval environments.
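
The idea of learning tag importance factors with a genetic algorithm and using them to re-rank retrieved documents can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the chromosome encoding, fitness function, or genetic operators, so everything below is an assumption made for illustration. The tag set `TAGS`, the per-tag match scores in `DOCS`, the use of average precision as fitness, and the tournament selection, uniform crossover, and Gaussian mutation operators are all hypothetical choices.

```python
import random

# Hypothetical tag set; the paper's actual tag inventory is not given in the abstract.
TAGS = ["title", "h1", "b", "a", "body"]

# Toy data (assumed): per-tag query match scores plus a relevance label.
DOCS = [
    {"id": "A", "tag_scores": {"title": 1.0, "body": 0.1}, "relevant": True},
    {"id": "B", "tag_scores": {"title": 0.0, "body": 0.9}, "relevant": False},
    {"id": "C", "tag_scores": {"title": 0.8, "body": 0.2}, "relevant": True},
    {"id": "D", "tag_scores": {"title": 0.1, "body": 1.0}, "relevant": False},
]

def score(weights, doc):
    """Weighted sum of per-tag match scores for one document."""
    return sum(w * doc["tag_scores"].get(t, 0.0) for t, w in zip(TAGS, weights))

def rerank(weights, docs):
    """Re-order documents by tag-weighted score, highest first."""
    return sorted(docs, key=lambda d: score(weights, d), reverse=True)

def average_precision(ranked):
    """Average precision of a ranked list with binary relevance labels."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc["relevant"]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def evolve(docs, pop_size=30, generations=40, mutation_rate=0.2, seed=0):
    """Evolve a tag-weight vector in [0, 1] that maximizes average precision."""
    rng = random.Random(seed)

    def fitness(w):
        return average_precision(rerank(w, docs))

    # Include the uniform-weight baseline so elitism can never do worse than it.
    pop = [[1.0] * len(TAGS)]
    pop += [[rng.random() for _ in TAGS] for _ in range(pop_size - 1)]

    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        next_pop = pop[:2]  # elitism: carry over the two best individuals
        while len(next_pop) < pop_size:
            # Tournament selection of two parents (tournament size 3).
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            # Uniform crossover.
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            # Gaussian mutation, clamped to [0, 1].
            child = [
                min(1.0, max(0.0, g + rng.gauss(0.0, 0.1)))
                if rng.random() < mutation_rate else g
                for g in child
            ]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)

if __name__ == "__main__":
    learned = evolve(DOCS)
    print("learned weights:", dict(zip(TAGS, learned)))
    print("ranking:", [d["id"] for d in rerank(learned, DOCS)])
```

In this sketch the initial ranking stands in for the output of a standard weighting scheme, and the learned weights only reorder it. Because the uniform-weight vector is part of the initial population and the best individuals are always carried forward, the learned weights can never score below the unweighted baseline on the training data.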