Hierarchical topic segmentation of websites

Authors:
Ravi Kumar; Kunal Punera; Andrew Tomkins
Affiliations:
Yahoo! Research, Sunnyvale, CA; University of Texas at Austin, Austin, TX; Yahoo! Research, Sunnyvale, CA
Venue:
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2006

Abstract

In this paper, we consider the problem of identifying and segmenting topically cohesive regions in the URL tree of a large website. Each page of the website is assumed to have a topic label or a distribution on topic labels generated using a standard classifier. We develop a set of cost measures characterizing the benefit accrued by introducing a segmentation of the site based on the topic labels. We propose a general framework to use these measures for describing the quality of a segmentation; we also provide an efficient algorithm to find the best segmentation in this framework. Extensive experiments on human-labeled data confirm the soundness of our framework and suggest that a judicious choice of cost measures allows the algorithm to perform surprisingly accurate topical segmentations.
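
To make the idea concrete, here is a minimal sketch of a cost-based segmentation of a URL tree solved by dynamic programming. It is an illustration under simplifying assumptions, not the algorithm or the cost measures from the paper: each page carries a single hard topic label, a page costs 1 when its label differs from the label of the segment covering it, and every segment opened adds a fixed penalty. The names `Node`, `segment`, and `seg_penalty` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A node of the URL tree; directories may have no label of their own."""
    path: str
    label: Optional[str] = None
    children: list = field(default_factory=list)


def segment(root, labels, seg_penalty=1.5):
    """Return (total_cost, segments) for the cheapest segmentation.

    A segment is a (path, topic) pair: the subtree at `path` is assigned
    `topic` until a deeper segment overrides it. Assumed cost model:
    1 per mislabeled page plus `seg_penalty` per segment opened.
    """
    memo = {}

    def covered_cost(node, seg_label):
        # Cost of node's subtree when node itself is covered by seg_label.
        mismatch = 0.0 if node.label in (None, seg_label) else 1.0
        cost, segs = mismatch, []
        for child in node.children:
            child_cost, child_segs = best(child, seg_label)
            cost += child_cost
            segs = segs + child_segs
        return cost, segs

    def best(node, inherited):
        # Cheapest option for node's subtree given the label inherited
        # from the nearest enclosing segment (None at the root).
        key = (id(node), inherited)
        if key in memo:
            return memo[key]
        options = []
        if inherited is not None:
            # Option 1: stay inside the enclosing segment.
            options.append(covered_cost(node, inherited))
        for lab in labels:
            if lab == inherited:
                continue
            # Option 2: open a new segment at this node with topic `lab`.
            cost, segs = covered_cost(node, lab)
            options.append((cost + seg_penalty, [(node.path, lab)] + segs))
        memo[key] = min(options, key=lambda option: option[0])
        return memo[key]

    return best(root, None)


# Toy site: a sports section, and a news section with one stray sports page.
site = Node("/", children=[
    Node("/index.html", "sports"),
    Node("/sports/", children=[
        Node("/sports/a.html", "sports"),
        Node("/sports/b.html", "sports"),
        Node("/sports/c.html", "sports"),
    ]),
    Node("/news/", children=[
        Node("/news/d.html", "news"),
        Node("/news/e.html", "news"),
        Node("/news/g.html", "news"),
        Node("/news/f.html", "sports"),
    ]),
])

cost, segments = segment(site, labels=["news", "sports"])
print(cost, segments)  # 4.0 [('/', 'sports'), ('/news/', 'news')]
```

In this toy cost model the penalty controls granularity: at 1.5 it is cheaper to leave the stray page `/news/f.html` inside the news segment than to open a segment just for it. Supporting a distribution over topic labels per page, as the abstract allows, would mean replacing the 0/1 mismatch term with a cost computed from that distribution.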