Web site mining: a new way to spot competitors, customers and suppliers in the world wide web

Authors:
Martin Ester;Hans-Peter Kriegel;Matthias Schubert
Affiliations:
Simon Fraser University, Burnaby, BC, Canada;University of Munich (LMU), Munich, Germany;University of Munich (LMU), Munich, Germany
Venue:
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2002

Abstract

When automatically extracting information from the World Wide Web, most established methods focus on spotting single HTML documents. However, the problem of spotting complete web sites is not yet handled adequately, in spite of its importance for various applications. Therefore, this paper discusses the classification of complete web sites. First, we point out the main differences from page classification by discussing an intuitive approach and its weaknesses: treating a web site as one large HTML document and applying the well-known methods for page classification. Next, we show how accuracy can be improved by a preprocessing step that assigns each web page to its most likely topic. The determined topics then represent the information the web site contains and can be used to classify it more accurately. We pursue this in two directions. The first applies well-established classification algorithms to a feature space of occurring topics. The second treats a site as a tree of occurring topics and uses a Markov tree model for classification. To improve the efficiency of this approach, we additionally introduce a pruning method that reduces the number of web pages considered. Our experiments show that the Markov tree approach achieves the highest classification accuracy. In particular, we demonstrate that our pruning method not only reduces processing time but also improves classification accuracy.
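
The idea of classifying a site as a tree of page topics can be illustrated with a small sketch. The abstract does not specify the model order, the smoothing, the class prior, or the pruning rule, so everything below is an assumption: a first-order model (each page's topic depends only on its parent's topic), Laplace smoothing, a uniform prior over site classes, and no pruning. The names `transitions` and `MarkovTreeClassifier`, the topic labels, and the toy sites are hypothetical, and page topics are assumed to have been assigned already by a separate page classifier.

```python
import math

# A site is a tree of page topics: (topic, [child subtrees]).
# Page topics are assumed to come from a separate page classifier.

def transitions(tree, parent="ROOT"):
    """Yield one (parent_topic, child_topic) pair per page in the tree."""
    topic, children = tree
    yield parent, topic
    for child in children:
        yield from transitions(child, topic)

class MarkovTreeClassifier:
    """First-order Markov tree sketch: P(page topic | parent topic, site class)."""

    def __init__(self, topics, alpha=1.0):
        self.topics = list(topics)
        self.alpha = alpha      # Laplace smoothing constant (an assumption)
        self.counts = {}        # site class -> parent topic -> child topic -> count

    def fit(self, sites, labels):
        for tree, label in zip(sites, labels):
            table = self.counts.setdefault(label, {})
            for parent, child in transitions(tree):
                row = table.setdefault(parent, {})
                row[child] = row.get(child, 0) + 1
        return self

    def log_likelihood(self, tree, label):
        table = self.counts[label]
        total = 0.0
        for parent, child in transitions(tree):
            row = table.get(parent, {})
            numerator = row.get(child, 0) + self.alpha
            denominator = sum(row.values()) + self.alpha * len(self.topics)
            total += math.log(numerator / denominator)
        return total

    def predict(self, tree):
        # Uniform class prior assumed, so the likelihood alone decides.
        return max(self.counts, key=lambda label: self.log_likelihood(tree, label))

# Toy training data (hypothetical topics and site classes).
topics = ["home", "product", "contact", "course", "staff", "publication"]
shop = ("home", [("product", [("product", [])]), ("contact", [])])
university = ("home", [("course", []), ("staff", [("publication", [])])])

model = MarkovTreeClassifier(topics).fit([shop, university], ["shop", "university"])
unseen = ("home", [("product", []), ("product", [])])
prediction = model.predict(unseen)
```

The simpler direction mentioned in the abstract, a feature space of occurring topics, would correspond to discarding the tree structure and counting how often each topic occurs in a site before handing that vector to a standard classifier.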