Multimodal subjectivity analysis of multiparty conversation
EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Classifying Web Pages by Genre: An n-Gram Approach
WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
Enhance web pages genre identification using neighboring pages
WISE'11 Proceedings of the 12th international conference on Web information system engineering
Testing a genre-enabled application: a preliminary assessment
FDIA'08 Proceedings of the 2nd BCS IRSG conference on Future Directions in Information Access
Cross-lingual genre classification
EACL '12 Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics
An important factor for discriminating between webpages is their genre (e.g., blogs, personal homepages, e-shops, online newspapers). Webpage genre identification has great potential in information retrieval, since users of search engines can combine genre-based and traditional topic-based queries to improve the quality of the results. So far, various features have been proposed to quantify the style of webpages, including word and HTML tag frequencies. In this paper, we propose a low-level representation for this problem based on character n-grams. Using an existing approach, we produce feature sets of variable-length character n-grams and combine this representation with information about the most frequent HTML tags. Based on two benchmark corpora, we present webpage genre identification experiments and improve on the best reported results in both cases.
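
As a rough illustration of the representation described above, the following Python sketch counts character n-grams of several lengths in the visible text of a page and combines them with relative frequencies of the most frequent HTML tags. It is a minimal sketch under stated assumptions, not the method used in the paper: the n-gram lengths (3 to 5), the cut-offs `top_ngrams` and `top_tags`, the regex-based tag handling, and all function names are hypothetical choices made for this example, and plain frequency ranking stands in for the existing variable-length n-gram selection approach that the abstract mentions but does not specify.

```python
import re
from collections import Counter

# Matches the name of an opening tag such as <p> or <div class="x">.
# Closing tags (</p>) are not matched, so each element is counted once.
TAG_RE = re.compile(r"<\s*([a-zA-Z][a-zA-Z0-9]*)")


def char_ngrams(text, n_min=3, n_max=5):
    """Count all character n-grams with n_min <= n <= n_max (assumed range)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def tag_counts(html):
    """Count opening HTML tags by lower-cased tag name."""
    return Counter(name.lower() for name in TAG_RE.findall(html))


def strip_tags(html):
    """Crude tag removal; a real system would use a proper HTML parser."""
    return re.sub(r"<[^>]*>", " ", html)


def page_features(html, top_ngrams=1000, top_tags=20):
    """Build one feature dict from n-gram and tag relative frequencies.

    Assumption: n-grams are taken from the tag-stripped text, and the most
    frequent items are kept. This is a simplification for illustration only.
    """
    text = re.sub(r"\s+", " ", strip_tags(html)).strip()
    features = {}

    ngrams = char_ngrams(text)
    total_ngrams = sum(ngrams.values()) or 1
    for gram, count in ngrams.most_common(top_ngrams):
        features["ng:" + gram] = count / total_ngrams

    tags = tag_counts(html)
    total_tags = sum(tags.values()) or 1
    for name, count in tags.most_common(top_tags):
        features["tag:" + name] = count / total_tags

    return features
```

Feature dictionaries produced this way could be passed to any standard classifier after vectorization. In a real experiment the n-gram vocabulary would be fixed on the training corpus rather than chosen per page, as is done here for brevity.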