We describe an HTML web page segmentation algorithm and apply it to online medical journal articles, both regular HTML files and PDF-converted HTML files. The content of a web page is modeled as a zone tree based primarily on the page's geometric layout. For a given journal article, the zone tree is generated by combining DOM tree analysis with the recursive X-Y cut algorithm. Together with other visual cues, such as background color, font size, and font color, this zone tree is used to segment the page into homogeneous regions. The evaluation covers 104 articles from 11 journals: of 9726 ground-truth zones, 9376 are correctly segmented, an accuracy of 96.40%. Segmenting the entire web page into zones can significantly speed up subsequent information retrieval steps and improve their accuracy.
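The abstract names the recursive X-Y cut but gives no details, so the following is only a minimal sketch of the classic technique applied to bounding boxes, not the authors' implementation. It assumes each rendered element has already been reduced to a `(left, top, right, bottom)` box. The DOM tree analysis and the visual cues (background color, font size, font color) are left out. The names `Zone`, `widest_gap`, `xy_cut`, `leaves`, and the `min_gap` threshold are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


@dataclass
class Zone:
    """A node of the zone tree: the boxes it covers and its sub-zones."""
    boxes: List[Box]
    children: List["Zone"] = field(default_factory=list)

    @property
    def bounds(self) -> Box:
        return (min(b[0] for b in self.boxes), min(b[1] for b in self.boxes),
                max(b[2] for b in self.boxes), max(b[3] for b in self.boxes))


def widest_gap(boxes: List[Box], axis: int,
               min_gap: float) -> Optional[Tuple[float, float]]:
    """Project boxes onto one axis (0 = x, 1 = y) and return
    (gap width, cut position) for the widest empty gap of at least
    min_gap, or None if no such gap exists."""
    intervals = sorted((b[axis], b[axis + 2]) for b in boxes)
    best = None
    reach = intervals[0][1]  # furthest extent covered so far
    for start, end in intervals[1:]:
        width = start - reach
        if width >= min_gap and (best is None or width > best[0]):
            best = (width, (start + reach) / 2)
        reach = max(reach, end)
    return best


def xy_cut(boxes: List[Box], min_gap: float = 10.0) -> Zone:
    """Recursively split boxes at the widest whitespace gap on either axis."""
    zone = Zone(boxes=list(boxes))
    if len(boxes) < 2:
        return zone
    candidates = []
    for axis in (0, 1):
        gap = widest_gap(boxes, axis, min_gap)
        if gap is not None:
            candidates.append((gap[0], axis, gap[1]))
    if not candidates:
        return zone  # no gap wide enough: this zone is a leaf
    _, axis, cut = max(candidates)
    # The cut lies inside an empty gap, so every box falls on exactly one side.
    before = [b for b in boxes if b[axis + 2] <= cut]
    after = [b for b in boxes if b[axis] >= cut]
    zone.children = [xy_cut(before, min_gap), xy_cut(after, min_gap)]
    return zone


def leaves(zone: Zone) -> List[Zone]:
    """Collect the leaf zones, i.e. the final segmentation."""
    if not zone.children:
        return [zone]
    return [leaf for child in zone.children for leaf in leaves(child)]


# Toy page: a full-width title above two text columns of two lines each.
page = [
    (0, 0, 200, 20),                       # title
    (0, 40, 90, 50), (0, 52, 90, 62),      # left column
    (110, 40, 200, 50), (110, 52, 200, 62) # right column
]
tree = xy_cut(page, min_gap=10.0)
zone_bounds = [leaf.bounds for leaf in leaves(tree)]
```

On this toy page the first cut is horizontal, separating the title from the body, and the second is vertical, separating the two columns. The 2-unit line spacing inside each column is below `min_gap`, so each column stays a single zone. A real pipeline in the spirit of the abstract would presumably seed the boxes from DOM nodes and use the visual cues to decide whether a cut is acceptable.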