Automatic specialized vs. non-specialized sentence differentiation

  • Authors:
  • Iria Da Cunha;M. Teresa Cabré;Eric SanJuan;Gerardo Sierra;Juan Manuel Torres-Moreno;Jorge Vivaldi

  • Affiliations:
  • Grupo de Ingeniería Lingüística, Instituto de Ingeniería, UNAM, Mexico, D.F., Mexico and Laboratoire Inf. d'Avignon, UAPV, Avignon Cedex 9, France and Institut Univ. de Linguis ...;Institut Universitari de Lingüística Aplicada, UPF, Barcelona, Spain;Laboratoire Informatique d'Avignon, UAPV, Avignon Cedex 9, France;Grupo de Ingeniería Lingüística, Instituto de Ingeniería, UNAM, Mexico, D.F., Mexico;Grupo de Ingeniería Lingüística, Inst. de Ingeniería, UNAM, Mexico, D.F., Mexico and Lab. Inf. d'Avignon, UAPV, Avignon Cedex 9, France and École Polytechnique de Montr ...;Institut Universitari de Lingüística Aplicada, UPF, Barcelona, Spain

  • Venue:
  • CICLing'11 Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part II
  • Year:
  • 2011

Abstract

Compiling Languages for Specific Purposes (LSP) corpora is costly, mainly in time and human effort, because it is not easy to distinguish specialized from non-specialized text. The aim of this work is to study automatic specialized vs. non-specialized sentence differentiation. The experiments are carried out on two corpora of sentences extracted from specialized and non-specialized texts: one on economics (academic publications and newspaper news) and another on sexuality (academic publications and texts from forums and blogs). First, we show the feasibility of the task using a statistical n-gram classifier. Then we show that grammatical features can also be used to classify sentences from the first corpus, using association rule mining for this purpose.
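
To make the idea of a statistical n-gram classifier concrete, the following is a minimal sketch, not the authors' system. The abstract does not say which model or features were used, so this example assumes a multinomial Naive Bayes over word unigrams and bigrams with add-one smoothing. The class name NgramNaiveBayes, the helper word_ngrams, and the four training sentences are all hypothetical and invented purely for illustration; they are not taken from the paper's corpora.

```python
import math
from collections import Counter


def word_ngrams(sentence, n=2):
    """Return all word n-grams of size 1..n from a lowercased sentence."""
    tokens = sentence.lower().split()
    grams = []
    for size in range(1, n + 1):
        for i in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[i:i + size]))
    return grams


class NgramNaiveBayes:
    """Hypothetical multinomial Naive Bayes over word n-grams (add-one smoothing)."""

    def __init__(self, n=2):
        self.n = n
        self.counts = {}             # label -> Counter of n-gram frequencies
        self.doc_counts = Counter()  # label -> number of training sentences
        self.vocab = set()           # all n-grams seen in training

    def fit(self, sentences, labels):
        for sentence, label in zip(sentences, labels):
            grams = word_ngrams(sentence, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, sentence):
        grams = word_ngrams(sentence, self.n)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, gram_counts in self.counts.items():
            total = sum(gram_counts.values())
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in grams:
                score += math.log(
                    (gram_counts[gram] + 1) / (total + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this sketch only.
train_sentences = [
    "the central bank raised the nominal interest rate",
    "inflation expectations affect the yield curve",
    "i think prices are way too high these days",
    "my friend said the shops are so expensive",
]
train_labels = [
    "specialized",
    "specialized",
    "non-specialized",
    "non-specialized",
]

model = NgramNaiveBayes(n=2).fit(train_sentences, train_labels)
print(model.predict("the interest rate affects inflation expectations"))
print(model.predict("i think the shops are expensive"))
```

A real experiment would need far more data, a held-out test set, and a deliberate choice between word and character n-grams; the sketch only shows how n-gram counts per class can be turned into a sentence-level decision.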