On the Impact of Lexical and Linguistic Features in Genre- and Domain-Based Categorization

  • Authors:
  • Guillaume Cleuziou; Céline Poudat

  • Affiliations:
  • LIFO, Université d'Orléans, France; ERTIM, INALCO, Paris, France

  • Venue:
  • CICLing '07 Proceedings of the 8th International Conference on Computational Linguistics and Intelligent Text Processing
  • Year:
  • 2009

Abstract

Classification into genres and domains is a major field of research for Information Retrieval (scientific and technical watch, data mining, etc.), and selecting appropriate descriptors to characterize and classify texts is particularly crucial to that end. Most practical experiments assume that domains are correlated with the content level (words, tokens, lemmas, etc.) and genres with the morphosyntactic or linguistic level (function words, POS, etc.). However, the variables currently in use are generally not accurate enough to be applied to the categorization task. The present study assesses the impact of the lexical and linguistic levels in genre and domain categorization. The empirical results we obtained demonstrate how important it is to select an appropriate tagset that meets the requirements of the task. The results also assess the efficiency of the linguistic level for both genre- and domain-based categorization.