When automatically extracting information from the World Wide Web, most established methods focus on spotting single HTML documents. The problem of spotting complete web sites, however, has not yet been handled adequately, despite its importance for various applications. This paper therefore discusses the classification of complete web sites. First, we point out the main differences from page classification by discussing a very intuitive approach and its weaknesses: treating a web site as one large HTML document and applying the well-known methods for page classification. Next, we show how accuracy can be improved by a preprocessing step that assigns each web page of a site to its most likely topic. The resulting topics represent the information the web site contains and can be used to classify it more accurately. We pursue this idea in two directions. The first applies well-established classification algorithms to a feature space of occurring topics. The second treats a site as a tree of occurring topics and uses a Markov tree model for classification. To improve the efficiency of the second approach, we additionally introduce a pruning method that reduces the number of web pages considered. Our experiments show that the Markov tree approach achieves the highest classification accuracy of the approaches compared. In particular, we demonstrate that our pruning method not only reduces processing time but also improves classification accuracy.
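The idea of treating a site as a tree of page topics can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes that every page has already been assigned a topic by the preprocessing step, that a site is given as nested `(topic, children)` tuples, that each class is modeled by first-order transition probabilities P(child topic | parent topic) with additive smoothing, and that class priors are uniform. The names `ROOT`, `edges`, `train`, `classify` and the parameter `alpha` are hypothetical, and the pruning method is not covered.

```python
import math
from collections import defaultdict

ROOT = "<root>"  # hypothetical pseudo-topic used as the parent of a site's start page


def edges(tree, parent=ROOT):
    """Yield (parent_topic, child_topic) pairs from a (topic, children) tree."""
    topic, children = tree
    yield parent, topic
    for child in children:
        yield from edges(child, topic)


def train(sites_by_class, topics, alpha=1.0):
    """Estimate smoothed P(child topic | parent topic) for each site class."""
    models = {}
    for label, sites in sites_by_class.items():
        counts = defaultdict(lambda: defaultdict(float))
        for site in sites:
            for parent, child in edges(site):
                counts[parent][child] += 1
        model = {}
        for parent in [ROOT, *topics]:
            # Additive (Laplace) smoothing so unseen transitions keep nonzero mass.
            total = sum(counts[parent].values()) + alpha * len(topics)
            model[parent] = {t: (counts[parent][t] + alpha) / total for t in topics}
        models[label] = model
    return models


def classify(site, models):
    """Return the class whose model gives the site tree the highest log-likelihood."""
    def loglik(model):
        return sum(math.log(model[p][c]) for p, c in edges(site))
    return max(models, key=lambda label: loglik(models[label]))


# Toy training data (invented for illustration only).
topics = ["home", "product", "contact", "course", "staff"]
shop = ("home", [("product", [("product", [])]), ("contact", [])])
uni = ("home", [("course", [("staff", [])]), ("staff", [])])
models = train({"business": [shop], "university": [uni]}, topics)

print(classify(("home", [("product", []), ("product", [])]), models))
```

With this toy data the unseen site is assigned to the `business` class, because its home-to-product links are more probable under that class's transition table. A site that links from its start page to a course page would be assigned to `university` for the same reason.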