In this paper, we propose a new approach to discovering informative content in a set of tabular documents (or Web pages) from a Web site. Our system, InfoDiscoverer, first partitions each page into several content blocks according to its HTML tags. Based on how often each feature (term) occurs across the set of pages, it calculates an entropy value for each feature. The entropy value of a content block is then defined from the entropy values of the features it contains. By analyzing this information measure, we propose a method that dynamically selects the entropy threshold used to classify each block as either informative or redundant. Informative content blocks are the distinctive parts of a page, whereas redundant content blocks are the parts common to many pages. Against an answer set built from 13 manually tagged news Web sites with a total of 26,518 Web pages, experiments show that both recall and precision are greater than 0.956. That is, with this approach the informative blocks (news articles) of these sites can be automatically separated from semantically redundant content such as advertisements, banners, navigation panels, and news categories. Adopting InfoDiscoverer as a preprocessor for information retrieval and extraction applications increases retrieval and extraction precision and reduces both index size and extraction complexity.
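The pipeline described above (feature entropy, then block entropy, then a threshold) can be illustrated with a minimal Python sketch. It is not the InfoDiscoverer implementation, and several details are assumptions that the abstract does not state: a feature's entropy is taken over the distribution of its occurrence counts across pages and normalized to [0, 1] by using the number of pages as the logarithm base; a block's entropy is the mean entropy of its terms; and the threshold is a fixed value passed in by the caller rather than selected dynamically. The function names and the toy data are hypothetical.

```python
import math


def feature_entropy(pages):
    """Return {term: normalized entropy in [0, 1]} over a set of pages.

    pages: list of pages; each page is a list of blocks; each block is a
    list of terms. Assumption: entropy is computed from the distribution of
    a term's occurrence counts across pages, with log base = number of pages.
    """
    n = len(pages)
    if n < 2:
        raise ValueError("need at least two pages to normalize the entropy")
    counts = {}  # term -> per-page occurrence counts
    for i, page in enumerate(pages):
        for block in page:
            for term in block:
                counts.setdefault(term, [0] * n)[i] += 1
    entropy = {}
    for term, per_page in counts.items():
        total = sum(per_page)
        h = 0.0
        for c in per_page:
            if c:
                p = c / total
                h -= p * math.log(p, n)
        entropy[term] = h
    return entropy


def block_entropy(block, entropy):
    """Mean entropy of the terms in a block (assumed definition).

    An empty block is treated as maximally redundant (1.0), which is also
    an assumption made for this sketch.
    """
    if not block:
        return 1.0
    return sum(entropy[t] for t in block) / len(block)


def classify_blocks(pages, threshold):
    """Label each block 'informative' if its entropy is <= threshold."""
    entropy = feature_entropy(pages)
    return [
        [
            "informative" if block_entropy(b, entropy) <= threshold else "redundant"
            for b in page
        ]
        for page in pages
    ]


# Toy site: every page shares a navigation block and has its own article block.
nav = ["home", "news", "sports"]
pages = [
    [nav, ["election", "results", "announced"]],
    [nav, ["storm", "hits", "coast"]],
    [nav, ["team", "wins", "final"]],
]
labels = classify_blocks(pages, threshold=0.5)
```

In this toy data the navigation terms occur evenly on every page, so their entropy is close to 1 and the shared block is labeled redundant, while each article's terms occur on a single page, so their entropy is 0 and the block is labeled informative. The fixed threshold of 0.5 is a placeholder for the dynamic selection that the abstract mentions but does not specify.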