Automatic text processing: the transformation, analysis, and retrieval of information by computer
Automatic text processing: the transformation, analysis, and retrieval of information by computer
Silk from a sow's ear: extracting usable structures from the Web
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web
Improved algorithms for topic distillation in a hyperlinked environment
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Automatic resource compilation by analyzing hyperlink structure and associated text
WWW7 Proceedings of the seventh international conference on World Wide Web 7
The anatomy of a large-scale hypertextual Web search engine
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Learning to remove Internet advertisements
Proceedings of the third annual conference on Autonomous Agents
Authoritative sources in a hyperlinked environment
Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms
Does “authority” mean quality? predicting expert quality ratings of Web documents
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
The stochastic approach for link-structure analysis (SALSA) and the TKC effect
Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
Proceedings of the 10th international conference on World Wide Web
Constructing multi-granular and topic-focused web site maps
Proceedings of the 10th international conference on World Wide Web
IEPAD: information extraction based on pattern discovery
Proceedings of the 10th international conference on World Wide Web
Enhanced topic distillation using text, markup tags, and hyperlinks
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Modern Information Retrieval
Mining the Web's Link Structure
Computer
Efficient Data Mining for Path Traversal Patterns
IEEE Transactions on Knowledge and Data Engineering
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
Wrapper induction for information extraction
Wrapper induction for information extraction
Mining Web Informative Structures and Contents Based on Entropy Analysis
IEEE Transactions on Knowledge and Data Engineering
Eliminating noisy information in Web pages for data mining
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Editorial: special issue on web content mining
ACM SIGKDD Explorations Newsletter
Dempster-Shafer Theory for a Query-Biased Combination of Evidence on the Web
Information Retrieval
WISDOM: Web Intrapage Informative Structure Mining Based on Document Object Model
IEEE Transactions on Knowledge and Data Engineering
The volume and evolution of web page templates
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Page-level template detection via isotonic smoothing
Proceedings of the 16th international conference on World Wide Web
Web page cleaning for web mining through feature weighting
IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence
Document clustering of scientific texts using citation contexts
Information Retrieval
Expert Systems with Applications: An International Journal
The static absorbing model for the web
Journal of Web Engineering
Hi-index | 0.00 |
In this paper, we study the problem of mining the informative structure of a news Web site which consists of thousands of hyperlinked documents. We define the informative structure of a news Web site as a set of index pages (or referred to as TOC, i.e., table of contents, pages) and a set of article pages linked by TOC pages through informative links. It is noted that the Hyperlink Induced Topics Search (HITS) algorithm has been employed to provide a solution to analyzing authorities and hubs of pages. However, most of the content sites tend to contain some extra hyperlinks, such as navigation panels, advertisements and banners, so as to increase the add-on values of their Web pages. Therefore, due to the structure induced by these extra hyperlinks, HITS is found to be insufficient to provide a good precision in solving the problem. To remedy this, we develop an algorithm to utilize entropy-based Link Analysis on Mining Web Informative Structures. This algorithm is referred to as LAMIS. The key idea of LAMIS is to utilize information entropy for representing the knowledge that corresponds to the amount of information in a link or a page in the link analysis. Experiments on several real news Web sites show that the precision and the recall of LAMIS are much superior to those obtained by heuristic methods and conventional ink analysis methods.