Automatic genre classification has historically focused on extracting textual features from documents. In this research, we investigate whether visual features of HTML documents can improve the classification of fine-grained genres. We compared three feature sets on a genre classification task in the e-commerce domain: one with textual features only, one with HTML features added, and a third with visual features added as well. Our experiments show that adding HTML and visual features provides much better classification than textual features alone.