An integrated system of mining HTML texts and filtering structured documents

Authors:
Bo-Hyun Yun;Myung-Eun Lim;Soo-Hyun Park
Affiliations:
Dept. of Human Information Processing, Electronics and Telecommunications Research Institute, Daejon, Korea;Dept. of Human Information Processing, Electronics and Telecommunications Research Institute, Daejon, Korea;School of Business IT, Kookmin University, Seoul, Korea
Venue:
PAKDD'03 Proceedings of the 7th Pacific-Asia conference on Advances in knowledge discovery and data mining
Year:
2003

Abstract

This paper presents a method for mining HTML documents into structured documents and for filtering those structured documents using both slot weighting and token weighting. The goal of the mining algorithm is to find slot-token patterns in HTML documents. To express user interests in structured document filtering, both slots and tokens are considered. Our preference computation algorithm applies vector similarity and Bayesian probability to filter structured documents. The experimental results show that it is important to consider hyperlinking and unlabelling when mining HTML texts, and that slot and token weighting can enhance the performance of structured document filtering.
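
As a rough illustration of how slot weighting and token weighting might combine in a vector-similarity preference score, the following minimal sketch may help. It is not the authors' algorithm: the function names (`cosine`, `preference`), the profile layout, and the normalised weighted-sum combination are all assumptions made for illustration, and the Bayesian component mentioned in the abstract is not modelled here.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def preference(doc, slot_weights, token_weights):
    """Slot-weighted average of per-slot cosine similarities (assumed form).

    doc:           slot name -> list of tokens found in that slot
    slot_weights:  slot name -> importance of the slot to the user
    token_weights: slot name -> {token: weight} user profile for the slot
    """
    total = sum(slot_weights.values())
    if total == 0:
        return 0.0
    score = 0.0
    for slot, slot_w in slot_weights.items():
        doc_vec = Counter(doc.get(slot, []))  # raw token counts in this slot
        score += slot_w * cosine(token_weights.get(slot, {}), doc_vec)
    return score / total


# Hypothetical user profile and structured documents.
slot_weights = {"title": 2.0, "author": 1.0}
token_weights = {
    "title": {"mining": 1.0, "web": 0.5},
    "author": {"yun": 1.0},
}
doc_a = {"title": ["web", "mining"], "author": ["yun"]}
doc_b = {"title": ["cooking", "recipes"], "author": ["smith"]}

score_a = preference(doc_a, slot_weights, token_weights)
score_b = preference(doc_b, slot_weights, token_weights)
print(score_a > score_b)
```

In this sketch a document is kept when its score exceeds some threshold chosen by the application; the threshold and the weight values above are placeholders, not figures from the paper.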