Discovering informative content blocks from Web documents

Authors:
Shian-Hua Lin;Jan-Ming Ho
Affiliations:
Academia Sinica, Nankang, Taipei 115, Taiwan;Academia Sinica, Nankang, Taipei 115, Taiwan
Venue:
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2002

Abstract

In this paper, we propose a new approach to discovering informative content from a set of tabular documents (or Web pages) of a Web site. Our system, InfoDiscoverer, first partitions each page into several content blocks according to the HTML <TABLE> tags in the page. Based on the occurrence of features (terms) across the set of pages, it calculates an entropy value for each feature. The entropy value of a content block is then defined from the entropy values of the features it contains. By analyzing this information measure, we propose a method that dynamically selects the entropy threshold used to classify blocks as either informative or redundant. Informative content blocks are the distinguishing parts of a page, whereas redundant content blocks are the parts common to the pages of a site. Based on the answer set generated from 13 manually tagged news Web sites with a total of 26,518 Web pages, experiments show that both recall and precision are greater than 0.956. That is, with this approach, the informative blocks (news articles) of these sites can be automatically separated from semantically redundant content such as advertisements, banners, navigation panels, and news categories. Adopting InfoDiscoverer as a preprocessor for information retrieval and extraction applications will increase retrieval and extraction precision and reduce indexing size and extraction complexity.
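
The entropy-based idea described above can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: a feature's entropy is taken to be the Shannon entropy of its occurrence distribution over the pages, normalized to [0, 1] by using the number of pages as the logarithm base; a block's entropy is taken to be the mean entropy of its terms; and the threshold is a fixed value of 0.5 rather than the dynamically selected one the paper proposes. The function names (feature_entropies, block_entropy, classify) are hypothetical and are not InfoDiscoverer's API.

```python
import math
from collections import Counter


def feature_entropies(pages):
    """Return {term: normalized entropy} over a set of pages.

    `pages` is a list of pages; each page is a list of blocks; each block
    is a list of terms. Assumed formula (not taken from the paper):
    H(t) = -sum_i p_i * log_n(p_i), where p_i is the share of t's
    occurrences that fall on page i and n is the number of pages.
    """
    n = len(pages)
    if n < 2:
        raise ValueError("need at least two pages to compare")
    # Occurrence counts of every term, page by page.
    per_page = [Counter(t for block in page for t in block) for page in pages]
    totals = Counter()
    for counts in per_page:
        totals.update(counts)
    entropies = {}
    for term, total in totals.items():
        h = 0.0
        for counts in per_page:
            if counts[term]:
                p = counts[term] / total
                h -= p * math.log(p, n)
        entropies[term] = h
    return entropies


def block_entropy(block, entropies):
    """Mean entropy of the block's terms (an assumed aggregation)."""
    if not block:
        return 1.0  # assumption: an empty block carries no information
    return sum(entropies[t] for t in block) / len(block)


def classify(pages, threshold=0.5):
    """Label each block 'informative' or 'redundant' with a fixed threshold."""
    ent = feature_entropies(pages)
    return [
        ["informative" if block_entropy(b, ent) <= threshold else "redundant"
         for b in page]
        for page in pages
    ]


# Toy site: the first block of every page is a shared navigation bar,
# the second block is page-specific article text.
pages = [
    [["home", "news", "sports"], ["election", "results", "announced"]],
    [["home", "news", "sports"], ["storm", "hits", "coast"]],
    [["home", "news", "sports"], ["team", "wins", "final"]],
]
labels = classify(pages)
```

In this toy input, terms that appear evenly on every page have entropy close to 1, so the shared navigation block is labeled redundant, while terms confined to a single page have entropy 0, so the article blocks are labeled informative. A real system would also need the HTML partitioning step and a data-driven threshold, both of which this sketch leaves out.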