To increase the commercial value and accessibility of their pages, most content sites publish them with intrasite redundant information, such as navigation panels, advertisements, and copyright announcements. Such redundant information increases the index size of general search engines and causes page topics to drift. In this paper, we study the problem of mining the intrapage informative structure of pages in news Web sites in order to find and eliminate redundant information. The intrapage informative structure is a subset of the original Web page and is composed of a set of fine-grained, informative blocks. For pages in a news Web site, it contains only anchors linking to news pages or the bodies of news articles. We propose an intrapage informative structure mining system called WISDOM (Web Intrapage Informative Structure Mining based on the Document Object Model), which applies information theory to DOM tree knowledge in order to build the structure. WISDOM splits a DOM tree into many small subtrees and applies a top-down informative block searching algorithm to select a set of candidate informative blocks. The structure is then built by expanding this set using the proposed merging methods. Experiments on several real news Web sites show high precision and recall, which validates WISDOM's practical applicability.
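The general idea of an entropy-guided, top-down block search can be illustrated with a minimal sketch. This is not WISDOM's actual algorithm, which the abstract does not specify in detail. The `Node` class, the normalized term-entropy measure, the mean-entropy block score, the `threshold` value, and the toy pages are all assumptions made for illustration, and the merging step is omitted.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical toy DOM node: a tag, its own text tokens, and children."""
    tag: str
    tokens: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def all_tokens(self):
        # Tokens of this node plus those of all its descendants.
        out = list(self.tokens)
        for child in self.children:
            out.extend(child.all_tokens())
        return out


def term_entropy(pages):
    """Normalized entropy (0..1) of each term's distribution over pages.

    Terms spread evenly over every page (menus, footers) score near 1;
    terms concentrated in a single page score near 0.
    """
    counts = [Counter(page.all_tokens()) for page in pages]
    n = len(pages)
    entropy = {}
    for term in set().union(*counts):
        freqs = [c[term] for c in counts]
        total = sum(freqs)
        h = -sum((f / total) * math.log(f / total) for f in freqs if f)
        entropy[term] = h / math.log(n) if n > 1 else 0.0
    return entropy


def block_entropy(node, entropy):
    # Assumed block score: mean term entropy; empty blocks count as redundant.
    toks = node.all_tokens()
    return sum(entropy[t] for t in toks) / len(toks) if toks else 1.0


def find_candidates(node, entropy, threshold=0.5):
    """Top-down search: keep a low-entropy subtree, otherwise descend."""
    if block_entropy(node, entropy) <= threshold:
        return [node]
    found = []
    for child in node.children:
        found.extend(find_candidates(child, entropy, threshold))
    return found


def make_page(body_tokens):
    # Every toy page shares the same navigation and footer blocks.
    nav = Node("div", ["home", "sports", "world"])
    body = Node("div", body_tokens)
    footer = Node("div", ["copyright", "reserved"])
    return Node("body", children=[nav, body, footer])


pages = [
    make_page(["election", "results"]),
    make_page(["storm", "warning"]),
    make_page(["match", "report"]),
]
entropy = term_entropy(pages)
picked = find_candidates(pages[0], entropy)
print([n.tokens for n in picked])
```

On these toy pages the shared navigation and footer terms receive high entropy, so the search descends past the page root and selects only the article-body block of the first page.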