Automatic text processing: the transformation, analysis, and retrieval of information by computer
Automatic text processing: the transformation, analysis, and retrieval of information by computer
Silk from a sow's ear: extracting usable structures from the Web
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
PAT-tree-based keyword extraction for Chinese information retrieval
Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval
Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web
Improved algorithms for topic distillation in a hyperlinked environment
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Automatic resource compilation by analyzing hyperlink structure and associated text
WWW7 Proceedings of the seventh international conference on World Wide Web 7
The anatomy of a large-scale hypertextual Web search engine
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Learning to remove Internet advertisements
Proceedings of the third annual conference on Autonomous Agents
A hierarchical approach to wrapper induction
Proceedings of the third annual conference on Autonomous Agents
Generating finite-state transducers for semi-structured data extraction from the Web
Information Systems - Special issue on semistructured data
Mirror, mirror on the Web: a study of host pairs with replicated content
WWW '99 Proceedings of the eighth international conference on World Wide Web
Authoritative sources in a hyperlinked environment
Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms
Does “authority” mean quality? predicting expert quality ratings of Web documents
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
The stochastic approach for link-structure analysis (SALSA) and the TKC effect
Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
A comparison of techniques to find mirrored hosts on the WWW
Journal of the American Society for Information Science
Proceedings of the 10th international conference on World Wide Web
Constructing multi-granular and topic-focused web site maps
Proceedings of the 10th international conference on World Wide Web
Finding authorities and hubs from link structures on the World Wide Web
Proceedings of the 10th international conference on World Wide Web
IEPAD: information extraction based on pattern discovery
Proceedings of the 10th international conference on World Wide Web
Enhanced topic distillation using text, markup tags, and hyperlinks
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Modifications of Kleinberg's HITS algorithm using matrix exponentiation and web log records
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Modern Information Retrieval
Entropy-based link analysis for mining web informative structures
Proceedings of the eleventh international conference on Information and knowledge management
Mining the Web's Link Structure
Computer
Efficient Data Mining for Path Traversal Patterns
IEEE Transactions on Knowledge and Data Engineering
Discovering Structural Association of Semistructured Data
IEEE Transactions on Knowledge and Data Engineering
Wrapper Generation via Grammar Induction
ECML '00 Proceedings of the 11th European Conference on Machine Learning
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
Discovering informative content blocks from Web documents
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
WISDOM: Web Intrapage Informative Structure Mining Based on Document Object Model
IEEE Transactions on Knowledge and Data Engineering
Clustering web pages based on their structure
Data & Knowledge Engineering - Special issue: WIDM 2003
Latent linkage semantic kernels for collective classification of link data
Journal of Intelligent Information Systems
An automatic data grabber for large web sites
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Combining content extraction heuristics: the CombinE system
Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services
Information extraction for search engines using fast heuristic techniques
Data & Knowledge Engineering
An intrusion detection based on support vector machines with a voting weight schema
IEA/AIE'07 Proceedings of the 20th international conference on Industrial, engineering, and other applications of applied intelligent systems
CETR: content extraction via tag ratios
Proceedings of the 19th international conference on World wide web
Expert Systems with Applications: An International Journal
Online social network profile data extraction for vulnerability analysis
International Journal of Internet Technology and Secured Transactions
DOM based content extraction via text density
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
A new method for focused crawler cross tunnel
RSKT'06 Proceedings of the First international conference on Rough Sets and Knowledge Technology
An intelligent extracting web content agent on the internet
KES'05 Proceedings of the 9th international conference on Knowledge-Based Intelligent Information and Engineering Systems - Volume Part II
Literal-matching-biased link analysis
AIRS'04 Proceedings of the 2004 international conference on Asian Information Retrieval Technology
Hybrid model of content extraction
Journal of Computer and System Sciences
Hi-index | 0.00 |
Abstract--In this paper, we study the problem of mining the informative structure of a news Web site that consists of thousands of hyperlinked documents. We define the informative structure of a news Web site as a set of index pages (or referred to as TOC, i.e., table of contents, pages) and a set of article pages linked by these TOC pages. Based on the Hyperlink Induced Topics Search (HITS) algorithm, we propose an entropy-based analysis (LAMIS) mechanism for analyzing the entropy of anchor texts and links to eliminate the redundancy of the hyperlinked structure so that the complex structure of a Web site can be distilled. However, to increase the value and the accessibility of pages, most of the content sites tend to publish their pages with intrasite redundant information, such as navigation panels, advertisements, copy announcements, etc. To further eliminate such redundancy, we propose another mechanism, called InfoDiscoverer, which applies the distilled structure to identify sets of article pages. InfoDiscoverer also employs the entropy information to analyze the information measures of article sets and to extract informative content blocks from these sets. Our result is useful for search engines, information agents, and crawlers to index, extract, and navigate significant information from a Web site. Experiments on several real news Web sites show that the precision and the recall of our approaches are much superior to those obtained by conventional methods in mining the informative structures of news Web sites. On the average, the augmented LAMIS leads to prominent performance improvement and increases the precision by a factor ranging from 122 to 257 percent when the desired recall falls between 0.5 and 1. In comparison with manual heuristics, the precision and the recall of InfoDiscoverer are greater than 0.956.