We present a model for indexing and querying web pages based on the hierarchical decomposition of pages into blocks. Splitting a page into blocks has several advantages for page design, indexing, and querying: (i) the blocks of a page most similar to a query can be returned instead of the page as a whole; (ii) the importance of each block can be taken into account; and (iii) so can the permeability of blocks to neighboring blocks: a block b is said to be permeable to a block b′ in the same page if the content of b′ (text, image, etc.) can be (partially) inherited by b upon indexing. We describe an engine implementing this model, including the transformation of web pages into block hierarchies, the definition of a dedicated language for expressing indexing rules, and the storage of indexed blocks in an XML repository. The model is assessed on a dataset of electronic news and on a dataset drawn from web pages of the ImagEval campaign, where it improves the mean average precision by 16% over the baseline.
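The abstract does not give the model's formulas, so the following is only a minimal sketch of how importance and permeability could enter block indexing, not the paper's actual method. It assumes term-frequency vectors, a scalar importance per block, a permeability coefficient in [0, 1] per neighbor, and cosine similarity for ranking. All names (`Block`, `index_block`, `rank_blocks`, the coefficient 0.5) are hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Block:
    name: str
    terms: Counter                 # the block's own term frequencies
    importance: float = 1.0        # assumed scalar weight on the block's own content
    # assumed mapping: neighbor block name -> permeability coefficient in [0, 1]
    permeable_to: dict = field(default_factory=dict)


def index_block(block, page):
    """Build the block's index vector: its own terms scaled by importance,
    plus a fraction of each neighbor's terms it is permeable to."""
    vec = Counter()
    for term, freq in block.terms.items():
        vec[term] += block.importance * freq
    for neighbor, coeff in block.permeable_to.items():
        for term, freq in page[neighbor].terms.items():
            vec[term] += coeff * freq
    return dict(vec)


def cosine(a, b):
    """Cosine similarity between two sparse term vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_blocks(page, query_terms):
    """Return (score, block name) pairs, best match first."""
    query = dict(Counter(query_terms))
    scored = [(cosine(index_block(b, page), query), name)
              for name, b in page.items()]
    return sorted(scored, reverse=True)


# A toy page: an image block with no text of its own inherits half of
# its caption's terms, so it becomes retrievable by a text query.
page = {
    "caption": Block("caption", Counter(["eiffel", "tower"])),
    "image": Block("image", Counter(), permeable_to={"caption": 0.5}),
    "nav": Block("nav", Counter(["home", "contact"])),
}
ranked = rank_blocks(page, ["eiffel"])
```

In this toy setup the image block, which has no terms of its own, receives a nonzero score for the query only because it is permeable to its caption, while the navigation block scores zero. This is the kind of effect the abstract attributes to permeability.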