In this paper, we consider the problem of identifying and segmenting topically cohesive regions in the URL tree of a large website. Each page of the website is assumed to have a topic label or a distribution on topic labels generated using a standard classifier. We develop a set of cost measures characterizing the benefit accrued by introducing a segmentation of the site based on the topic labels. We propose a general framework to use these measures for describing the quality of a segmentation; we also provide an efficient algorithm to find the best segmentation in this framework. Extensive experiments on human-labeled data confirm the soundness of our framework and suggest that a judicious choice of cost measures allows the algorithm to perform surprisingly accurate topical segmentations.
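
To make the idea of segmenting a URL tree by topic concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a hypothetical cost: each page whose label differs from the label of its enclosing segment costs 1, and each segment costs a fixed `split_penalty`. The names `build_tree`, `segment`, and `split_penalty` are illustrative assumptions; the abstract does not specify the actual cost measures. Under these assumptions, a bottom-up dynamic program over the tree finds a minimum-cost segmentation.

```python
from collections import defaultdict


def _norm(url):
    """Normalize a URL path to the form '/a/b/c' (root is '/')."""
    return "/" + "/".join(p for p in url.split("/") if p)


def build_tree(paths):
    """Return a parent -> set(children) map over all path prefixes."""
    children = defaultdict(set)
    for path in paths:
        parent = "/"
        for part in (p for p in path.split("/") if p):
            node = parent.rstrip("/") + "/" + part
            children[parent].add(node)
            parent = node
    return children


def segment(page_labels, split_penalty=1.0):
    """Minimum-cost segmentation under the ASSUMED cost described above.

    Returns (total_cost, {segment_root: segment_label}).
    """
    page_labels = {_norm(u): lab for u, lab in page_labels.items()}
    children = build_tree(page_labels)
    labels = sorted(set(page_labels.values()))

    def mismatch(node, label):
        own = page_labels.get(node)  # directories without a page have no label
        return 0.0 if own is None or own == label else 1.0

    # inherit[node][label]: cost of node's subtree if node lies in a
    # segment labeled `label` that was opened at an ancestor (or at node).
    inherit = {}

    def solve(node):
        for child in children.get(node, ()):
            solve(child)
        costs = {}
        for label in labels:
            total = mismatch(node, label)
            for child in children.get(node, ()):
                stay = inherit[child][label]
                cut = split_penalty + min(inherit[child].values())
                total += min(stay, cut)
            costs[label] = total
        inherit[node] = costs

    solve("/")

    # Walk back down to recover where new segments were opened.
    segments = {}

    def assign(node, label):
        for child in sorted(children.get(node, ())):
            best = min(labels, key=lambda l: inherit[child][l])
            if split_penalty + inherit[child][best] < inherit[child][label]:
                segments[child] = best
                assign(child, best)
            else:
                assign(child, label)

    root_label = min(labels, key=lambda l: inherit["/"][l])
    segments["/"] = root_label
    assign("/", root_label)
    return split_penalty + inherit["/"][root_label], segments


if __name__ == "__main__":
    pages = {
        "/sports/a": "sports", "/sports/b": "sports", "/sports/c": "sports",
        "/news/x": "news", "/news/y": "news", "/news/z": "news",
        "/about": "news",
    }
    print(segment(pages, split_penalty=1.0))
```

In this toy site, the `/sports` directory is split off as its own segment because relabeling its three pages would cost more than one extra segment. Raising `split_penalty` yields coarser segmentations, which is one simple way to picture how the choice of cost measure shapes the result.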