Hybrid model of content extraction

Authors:
Pir Abdul Rasool Qureshi; Nasrullah Memon
Affiliations:
The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Denmark; The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Denmark
Venue:
Journal of Computer and System Sciences
Year:
2012

Abstract

We present a hybrid model for content extraction from HTML documents. The model operates on the Document Object Model (DOM) tree of the HTML document. It evaluates each tree node, together with statistical features such as link density and the distribution of text across the node, to predict how significant the node is to the overall content of the document. Once the significance of the nodes is determined, formatting characteristics such as fonts, styles, and node position are evaluated to identify nodes whose formatting is similar to that of the significant nodes. The proposed hybrid model is derived from two different models, one based on statistical features and the other on formatting characteristics, and it achieved the best accuracy. We validate the model through experiments conducted on standard data sets. The results show that the proposed model outperformed other existing content extraction models. We present a browser-based implementation of the proposed model as a proof of concept and compare its implementation strategy with various state-of-the-art implementations. We also discuss various applications of the proposed model, with special emphasis on open source intelligence.
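
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node`, `TreeBuilder`, and `extract` names are hypothetical, the thresholds `max_link_density=0.3` and `min_text_share=0.2` are assumed values chosen for illustration, and "formatting" is approximated by a node's tag, `class`, and inline `style` because the fonts and positions a browser would compute are not available here.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """Minimal DOM-like element (hypothetical helper, not from the paper)."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.own_text = ""  # text placed directly inside this element

    def walk(self):
        yield self
        for child in self.children:
            yield from child.walk()

    def text_len(self):
        return sum(len(n.own_text) for n in self.walk())

    def link_text_len(self):
        if self.tag == "a":
            return self.text_len()
        return sum(c.link_text_len() for c in self.children)

    def format_key(self):
        # Assumed proxy for formatting characteristics: tag, class, inline style.
        return (self.tag, self.attrs.get("class"), self.attrs.get("style"))


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.own_text += data.strip()


def extract(html, max_link_density=0.3, min_text_share=0.2):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    root = builder.root
    total = root.text_len() or 1

    # Stage 1 (statistical): keep text-bearing nodes with few links and a
    # large enough share of the page text.
    significant = []
    for node in root.walk():
        if node is root or not node.own_text:
            continue
        length = node.text_len()
        density = node.link_text_len() / length
        if density <= max_link_density and length / total >= min_text_share:
            significant.append(node)

    # Stage 2 (formatting): also keep nodes formatted like a significant node.
    keys = {n.format_key() for n in significant}
    return [n.own_text for n in root.walk()
            if n is not root and n.own_text and n.format_key() in keys]


SAMPLE = """
<html><body>
<div class="nav"><a href="/a">Home</a><a href="/b">News</a><a href="/c">Contact</a></div>
<p class="body">The hybrid model scores DOM nodes by link density and text share, then adds nodes that share their formatting.</p>
<p class="body">Short closing remark.</p>
<div class="footer">Copyright notice</div>
</body></html>
"""

if __name__ == "__main__":
    for block in extract(SAMPLE):
        print(block)
```

On the sample page, the long paragraph passes the statistical stage, and the short paragraph is recovered only in the formatting stage because it shares the long paragraph's tag and class. The navigation links are rejected for their link density and the footer for its small text share. The sketch returns only the text placed directly inside each selected element, so inline children such as `<b>` would need extra handling.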