Statistical Model for Content Extraction

  • Authors:
  • Pir Abdul Rasool Qureshi;Nasrullah Memon

  • Venue:
  • EISIC '11 Proceedings of the 2011 European Intelligence and Security Informatics Conference
  • Year:
  • 2011

Abstract

We present a statistical model for content extraction from HTML documents. The model operates on the Document Object Model (DOM) tree of the corresponding HTML document. It evaluates each tree node and its associated statistical features to predict the significance of the node to the overall content of the document. The model exploits a feature set that includes link densities and the text distribution across the nodes of the DOM tree. We demonstrate the validity of the model with experiments conducted on standard data sets. The results showed that the proposed model outperformed other state-of-the-art models. We also describe the significance of the model in the domain of counterterrorism and open source intelligence.
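
The abstract names two kinds of features, link density and text distribution across DOM nodes, but gives no formulas. The following is a minimal illustrative sketch of how such features might be computed per node, not the authors' model. Everything in it is an assumption made for illustration: the names `Node`, `TreeBuilder`, and `score_nodes`, the definition of link density as the fraction of a node's text that sits inside `<a>` elements, the definition of text share as the node's fraction of all document text, and the combined score `text_share * (1 - link_density)`. It also assumes well-formed markup with properly closed tags.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they never become tree nodes here.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """Hypothetical DOM node holding the two raw counts used as features."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text_chars = 0  # characters of text anywhere in this subtree
        self.link_chars = 0  # the part of that text that lies inside <a>


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree (assumes well-formed, properly nested HTML)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root
        self.link_depth = 0  # > 0 while the parser is inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        # Only close the node that is actually open; stray end tags are ignored.
        if self.current is not self.root and self.current.tag == tag:
            if tag == "a":
                self.link_depth -= 1
            self.current = self.current.parent

    def handle_data(self, data):
        n = len(data.strip())
        node = self.current
        while node is not None:  # credit the text to the node and every ancestor
            node.text_chars += n
            if self.link_depth > 0:
                node.link_chars += n
            node = node.parent


def score_nodes(html):
    """Return one feature row per DOM node that contains any text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    total = builder.root.text_chars or 1  # avoid division by zero on empty input
    rows = []
    stack = list(builder.root.children)
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        if node.text_chars == 0:
            continue
        link_density = node.link_chars / node.text_chars
        text_share = node.text_chars / total
        rows.append({
            "tag": node.tag,
            "id": node.attrs.get("id"),
            "link_density": link_density,
            "text_share": text_share,
            # Assumed toy score: much text, few links suggests main content.
            "score": text_share * (1.0 - link_density),
        })
    return rows
```

Under these assumptions, a navigation block made up entirely of links gets a link density of 1 and a score of 0, while a block of running text with no links gets a link density of 0 and a score equal to its share of the document text. A statistical model of the kind the abstract describes would presumably learn how to weight such features rather than use a fixed formula.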