Multiple sets of features for automatic genre classification of web documents

Authors:
Chul Su Lim; Kong Joo Lee; Gil Chang Kim
Affiliations:
Division of Computer Science, Department of EECS, KAIST, 373-1 Kusong-dong, Yusong-gu, Taejon 305-701, South Korea; School of Computer and Information Technology, KyungIn Women's College, 101 Kyesan-dong, Gyeyang-gu, Incheon 407-740, South Korea; Division of Computer Science, Department of EECS, KAIST, 373-1 Kusong-dong, Yusong-gu, Taejon 305-701, South Korea
Venue:
Information Processing and Management: an International Journal
Year:
2005


Abstract

As the amount of information on the Web grows, it becomes difficult to quickly find the desired information among the documents retrieved by a search engine. One way to address this problem is to classify web documents according to various criteria. Most work on document classification has focused on the subject or topic of a document. Genre, or style, offers a view of a document that is distinct from its subject or topic, and it can also serve as a classification criterion. In this paper, we propose multiple sets of features for classifying the genres of web documents. The basic set of features, proposed in previous studies, is derived from the textual properties of documents, such as the number of sentences and the number of occurrences of a certain word. However, web documents differ from plain text documents in that they carry URLs and HTML tags. We introduce new sets of features specific to web documents, extracted from URLs and HTML tags. We evaluate the performance of the proposed feature sets and discuss their characteristics. Finally, we identify which set of features is appropriate for automatic genre classification of web documents.
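
To make the three kinds of features named in the abstract concrete, the following is a minimal sketch of how textual, URL-based, and HTML-tag-based features might be computed for a single page. It is an illustration under stated assumptions, not the feature sets used in the paper: the function name `extract_features`, the feature names, the choice of the word "you", and the specific tags counted are all hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class _TagCounter(HTMLParser):
    """Counts start tags and collects visible text (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.text_parts = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Ignore script and style bodies; they are not visible text.
        if not self._skip:
            self.text_parts.append(data)


def extract_features(url, html, word="you"):
    """Return a dict of illustrative genre features for one web page.

    The features chosen here are assumptions for illustration only.
    """
    parser = _TagCounter()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)

    # Textual features: a crude sentence split and a word count.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())

    # URL features: path depth and presence of a query string.
    parsed = urlparse(url)
    path_segments = [p for p in parsed.path.split("/") if p]

    return {
        "text.num_sentences": len(sentences),
        "text.count_word": words.count(word),
        "url.depth": len(path_segments),
        "url.has_query": int(bool(parsed.query)),
        "html.num_links": parser.tags["a"],
        "html.num_images": parser.tags["img"],
        "html.num_tables": parser.tags["table"],
    }
```

Each group of keys (`text.`, `url.`, `html.`) could be passed to a classifier separately or in combination, which is one way to compare feature sets against each other as the abstract describes.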