Mining key information of web pages: A method and its application

Authors:
Chao Wang;Jie Lu;Guangquan Zhang
Affiliations:
Faculty of Information Technology, University of Technology, Sydney (UTS), P.O. Box 123, Broadway, NSW 2007, Australia;Faculty of Information Technology, University of Technology, Sydney (UTS), P.O. Box 123, Broadway, NSW 2007, Australia;Faculty of Information Technology, University of Technology, Sydney (UTS), P.O. Box 123, Broadway, NSW 2007, Australia
Venue:
Expert Systems with Applications: An International Journal
Year:
2007

Abstract

Web content mining aims to discover useful information and generate desired knowledge from a large number of web pages. Key information embedded in web pages, such as distinctive menu items and navigation indicators, can help classify the main contents of those pages and reflects certain taxonomy knowledge. Mining key information is therefore significant in helping to acquire domain knowledge and build catalogue classifiers. Current web content mining methods cannot mine such key information effectively, and "noise information" (such as advertisements) degrades the performance of web mining tasks. This paper proposes a method to extract key information from web pages that contain noisy information. The method has two steps: first extract a list of candidate key information, then apply an entropy measure to filter out noisy information and discover the key information. Experimental results show that the method is effective in discovering key information. Using the discovered key information, which reflects taxonomy knowledge, an application is developed to help ontology generation.
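
The second step can be illustrated with a minimal sketch. The abstract does not give the exact entropy formulation, so everything below is an assumption made for illustration: candidates are plain strings already extracted per page, entropy is the Shannon entropy of a candidate's occurrence counts across pages (normalised to [0, 1]), and candidates spread almost uniformly over all pages are treated as noise. The function names, the input layout, and the 0.9 threshold are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def normalized_entropy(counts):
    """Shannon entropy of a list of counts, scaled to the range [0, 1]."""
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    # Divide by the maximum possible entropy (uniform over all pages).
    return entropy / math.log2(len(counts))


def filter_candidates(pages, threshold=0.9):
    """Keep candidates whose cross-page entropy is below the threshold.

    pages: dict mapping a page id to the list of candidate strings
           extracted from that page (step one, assumed already done).
    threshold: hypothetical cut-off; not a value from the paper.
    """
    page_ids = list(pages)
    occurrences = {}
    for pid in page_ids:
        for candidate, n in Counter(pages[pid]).items():
            occurrences.setdefault(candidate, Counter())[pid] = n

    kept = []
    for candidate, per_page in occurrences.items():
        counts = [per_page.get(pid, 0) for pid in page_ids]
        if normalized_entropy(counts) < threshold:
            kept.append(candidate)
    return sorted(kept)


# Toy input: the advertisement string appears on every page.
pages = {
    "p1": ["Laptops", "Ad: buy now"],
    "p2": ["Cameras", "Ad: buy now"],
    "p3": ["Laptops", "Ad: buy now"],
}
print(filter_candidates(pages))  # ['Cameras', 'Laptops']
```

In this toy input the advertisement string occurs equally on all three pages, so its normalised entropy reaches the maximum and it is discarded, while the menu-like items that occur on only some pages are kept. The direction of the filter (high entropy means noise) is itself an assumption of this sketch.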