On discovering concept entities from web sites

Authors:
Ming Yin;Dion Hoe-Lian Goh;Ee-Peng Lim
Affiliations:
Division of Information Studies, School of Communication and Information, Nanyang Technological University, Singapore;Division of Information Studies, School of Communication and Information, Nanyang Technological University, Singapore;Centre for Advanced Information Systems, School of Computer Engineering, Nanyang Technological University, Singapore
Venue:
ICCSA'05 Proceedings of the 2005 international conference on Computational Science and Its Applications - Volume Part II
Year:
2005

Abstract

A web site usually contains a large number of concept entities, each consisting of one or more web pages connected by hyperlinks. The web unit mining problem was proposed to discover these concept entities, enabling more expressive web site queries and other applications. Web unit mining aims to determine which web pages constitute a concept entity and to classify concept entities into categories. However, the performance of an existing web unit mining algorithm, iWUM, suffers because it may create more than one web unit (that is, incomplete web units) from a single concept entity. This paper presents a new web unit mining algorithm, kWUM, which incorporates site-specific knowledge to discover incomplete web units, merge them together, and assign them correct labels. Experiments show that kWUM significantly improves overall accuracy.
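
The idea of merging incomplete web units can be illustrated with a small sketch. This is not the kWUM algorithm itself, whose details the abstract does not give. It assumes a hypothetical site-specific rule (pages under the same first URL path segment belong to the same concept entity) and a hypothetical relabeling step (a page-weighted majority vote over the fragments' labels). The names `WebUnit`, `entity_key`, and `merge_incomplete_units` are invented for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class WebUnit:
    pages: set[str]  # URLs of the pages grouped into this unit
    label: str       # category assigned by some upstream classifier


def entity_key(url: str) -> str:
    """Hypothetical site-specific rule (an assumption, not from the paper):
    pages sharing the first path segment describe the same concept entity."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[0] if segments else ""


def merge_incomplete_units(units: list[WebUnit]) -> list[WebUnit]:
    """Merge units that map to the same entity key and relabel the result."""
    groups = defaultdict(list)
    for unit in units:
        # Use the lexicographically smallest URL so the key is deterministic.
        groups[entity_key(min(unit.pages))].append(unit)

    merged = []
    for group in groups.values():
        pages = set().union(*(u.pages for u in group))
        # Assumed relabeling: each fragment votes with its number of pages.
        votes = Counter()
        for u in group:
            votes[u.label] += len(u.pages)
        merged.append(WebUnit(pages, votes.most_common(1)[0][0]))
    return merged
```

For example, a two-page unit labeled `course` and a one-page unit labeled `faculty`, both under `/cs101/`, would be merged into a single three-page unit labeled `course`. A real system would need a site-specific rule that is learned or supplied per site rather than hard-coded.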