Web pages are generally organized in terms of visually distinct segments, such as navigation bars, advertisement banners, headers, portlets, and widgets. Despite this apparently structured layout, web pages are considered a source of unstructured data from an information extraction point of view. Hence, as a step towards interpreting the organization of web data, web page segmentation attempts to identify cohesive regions of a page. In this paper, we present a novel DOM tree mining approach for page segmentation. We first characterize the nodes of the DOM tree based on their Content Size and Entropy. While the Content Size of a node indicates the amount of textual content contributed by its subtree, Entropy measures the strength of the local "patterns" exhibited therein. In other words, a node manifesting highly repetitive patterns yields a high Entropy under our formulation. Based on this characterization of DOM nodes, we then develop an unsupervised algorithm to automatically identify the segments of a given web page.
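
To make the two node measures concrete, here is a minimal Python sketch under stated assumptions. The abstract does not give the actual formulas, so everything here is illustrative: Content Size is taken to be the number of non-whitespace-trimmed text characters in a node's subtree, and `pattern_strength` is a hypothetical stand-in for the paper's Entropy (one minus the normalized Shannon entropy of the child tag distribution), chosen only so that repetitive children score high, as the abstract describes. The names `Node`, `build_tree`, `content_size`, and `pattern_strength` are invented for this example, and the segmentation algorithm itself is not reproduced.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag; we do not descend into them.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A bare-bones DOM node: tag name, parent, children, and direct text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class _TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree under a synthetic '#root' node."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open ancestor with this tag; ignore stray closes.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def build_tree(html):
    """Parse an HTML string and return the synthetic root node."""
    parser = _TreeBuilder()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.root


def content_size(node):
    """Assumed definition: characters of text contributed by the subtree."""
    return len(node.text) + sum(content_size(c) for c in node.children)


def pattern_strength(node):
    """Hypothetical stand-in for the paper's Entropy measure (NOT its formula).

    Returns 1.0 when all children share one tag (highly repetitive) and
    0.0 when every child tag is different or there are fewer than two children.
    """
    n = len(node.children)
    if n < 2:
        return 0.0
    counts = Counter(child.tag for child in node.children)
    shannon = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 1.0 - shannon / math.log2(n)


if __name__ == "__main__":
    page = ("<div><ul><li>Home</li><li>News</li><li>Blog</li></ul>"
            "<p>Some <b>long</b> article text</p></div>")
    div = build_tree(page).children[0]
    for child in div.children:
        print(child.tag, content_size(child), pattern_strength(child))
```

In this toy page, the navigation-like `ul` has little text but perfectly repetitive children, while the `p` carries more text and no repetition; a segmentation heuristic could use that contrast, though how the paper actually combines the two measures is not stated in the abstract.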