Entropy-based link analysis for mining web informative structures

Authors:
Hung-Yu Kao;Ming-Syan Chen;Shian-Hua Lin;Jan-Ming Ho
Affiliations:
National Taiwan University, Taipei, Taiwan, ROC;National Taiwan University, Taipei, Taiwan, ROC;Academia Sinica, Taipei, Taiwan, ROC;Academia Sinica, Taipei, Taiwan, ROC
Venue:
Proceedings of the eleventh international conference on Information and knowledge management
Year:
2002

Abstract

In this paper, we study the problem of mining the informative structure of a news Web site that consists of thousands of hyperlinked documents. We define the informative structure of a news Web site as a set of index pages (also called TOC, i.e., table-of-contents, pages) and a set of article pages linked from the TOC pages through informative links. The Hyperlink-Induced Topic Search (HITS) algorithm has been employed to analyze the authorities and hubs among pages. However, most content sites contain extra hyperlinks, such as navigation panels, advertisements, and banners, that increase the add-on value of their Web pages. Because of the link structure induced by these extra hyperlinks, HITS does not achieve good precision on this problem. To remedy this, we develop an entropy-based algorithm for Link Analysis on Mining Web Informative Structures, referred to as LAMIS. The key idea of LAMIS is to use information entropy to represent the amount of information carried by a link or a page in the link analysis. Experiments on several real news Web sites show that the precision and recall of LAMIS are much superior to those obtained by heuristic methods and conventional link analysis methods.
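
The abstract does not give the LAMIS formulas, so the following is only a minimal sketch of the general idea it describes: down-weighting low-information links before running a HITS-style hub/authority iteration. The weighting scheme here (one minus the normalized entropy of an anchor text's distribution over source pages) and the names `anchor_entropy_weights` and `weighted_hits` are illustrative assumptions, not the authors' published method.

```python
import math
from collections import defaultdict


def anchor_entropy_weights(links):
    """Assign each link a weight in [0, 1] from the entropy of its anchor text.

    `links` is a list of (source, target, anchor_text) tuples. An anchor that
    is spread evenly over all source pages (e.g. a navigation link) has
    maximal entropy and gets weight 0; an anchor found on a single page gets
    weight 1. This scheme is an assumption made for illustration only.
    If a (source, target) pair occurs more than once, the last one wins.
    """
    pages = {src for src, _, _ in links}
    max_entropy = math.log(len(pages)) if len(pages) > 1 else 0.0

    # per_anchor[anchor][source] = number of times the anchor occurs on source
    per_anchor = defaultdict(lambda: defaultdict(int))
    for src, _, anchor in links:
        per_anchor[anchor][src] += 1

    weights = {}
    for src, dst, anchor in links:
        counts = per_anchor[anchor].values()
        total = sum(counts)
        entropy = -sum((c / total) * math.log(c / total) for c in counts)
        normalized = entropy / max_entropy if max_entropy > 0 else 0.0
        # Clamp to guard against tiny floating-point overshoot.
        weights[(src, dst)] = min(1.0, max(0.0, 1.0 - normalized))
    return weights


def _normalize(scores):
    """Scale a score dict to unit Euclidean length (all zeros stay zeros)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: (v / norm if norm > 0 else 0.0) for k, v in scores.items()}


def weighted_hits(weights, iterations=50):
    """HITS power iteration in which each edge contributes its weight."""
    nodes = {page for edge in weights for page in edge}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority update: weighted sum of hub scores of in-linking pages.
        auth = dict.fromkeys(nodes, 0.0)
        for (src, dst), w in weights.items():
            auth[dst] += w * hub[src]
        # Hub update: weighted sum of authority scores of linked pages.
        new_hub = dict.fromkeys(nodes, 0.0)
        for (src, dst), w in weights.items():
            new_hub[src] += w * auth[dst]
        auth, hub = _normalize(auth), _normalize(new_hub)
    return hub, auth


# Toy site (hypothetical data): one TOC page, three articles, and a "Home"
# navigation link that appears on every page.
links = [
    ("toc", "a1", "Election results"),
    ("toc", "a2", "Storm warning"),
    ("toc", "a3", "Market report"),
    ("toc", "home", "Home"),
    ("a1", "home", "Home"),
    ("a2", "home", "Home"),
    ("a3", "home", "Home"),
]
hub, auth = weighted_hits(anchor_entropy_weights(links))
print("top hub:", max(hub, key=hub.get))
print("authority scores:", {page: round(score, 3) for page, score in auth.items()})
```

On this toy input the shared "Home" anchor receives a weight near zero, so the navigation target no longer attracts authority and the TOC page stands out as the hub. With uniform weights the same iteration would reward the "home" page, which is the kind of distortion from extra hyperlinks that the abstract describes.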