Analysing part-of-speech for Portuguese text classification

Authors:
Teresa Gonçalves;Cassiana Silva;Paulo Quaresma;Renata Vieira
Affiliations:
Dep. Informática, Universidade de Évora, Évora, Portugal;Unisinos, São Leopoldo, RS, Brasil;Dep. Informática, Universidade de Évora, Évora, Portugal;Unisinos, São Leopoldo, RS, Brasil
Venue:
CICLing'06 Proceedings of the 7th international conference on Computational Linguistics and Intelligent Text Processing
Year:
2006

Abstract

This paper proposes and evaluates the use of linguistic information in the pre-processing phase of text classification. We present several experiments that evaluate term selection based on different measures and on linguistic knowledge. The classifiers were built with Support Vector Machines (SVM), which are known to produce good results on text classification tasks. Our proposals were applied to two datasets written in Portuguese: articles from a Brazilian newspaper (Folha de São Paulo) and juridical documents from the Portuguese Attorney General’s Office. The results show that part-of-speech information is relevant to the pre-processing phase, allowing a strong reduction in the number of features needed for classification.
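
The idea of selecting terms by part-of-speech before classification can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy sentences, the tag names (NOUN, VERB, ADJ, DET), and the helper functions `select_terms`, `build_vocabulary`, and `to_count_vector` are hypothetical, and the abstract does not state which tagger, tag set, or selection measures the authors used. The sketch only shows how keeping a subset of tags shrinks the feature space; the resulting count vectors would then be passed to an SVM, which is omitted here.

```python
from collections import Counter

# Hypothetical pre-tagged documents: each token is a (word, POS tag) pair.
# The sentences and the tag names are illustrative, not taken from the paper.
TAGGED_DOCS = [
    [("o", "DET"), ("tribunal", "NOUN"), ("julgou", "VERB"),
     ("o", "DET"), ("recurso", "NOUN")],
    [("a", "DET"), ("equipa", "NOUN"), ("venceu", "VERB"),
     ("o", "DET"), ("jogo", "NOUN"), ("difícil", "ADJ")],
]


def select_terms(tagged_doc, keep_tags):
    """Keep only the lower-cased words whose POS tag is in keep_tags."""
    return [word.lower() for word, tag in tagged_doc if tag in keep_tags]


def build_vocabulary(docs_terms):
    """Map each distinct term to a column index, in sorted order."""
    terms = sorted({term for doc in docs_terms for term in doc})
    return {term: index for index, term in enumerate(terms)}


def to_count_vector(doc_terms, vocabulary):
    """Build a bag-of-words term-frequency vector for one document."""
    vector = [0] * len(vocabulary)
    for term, count in Counter(doc_terms).items():
        if term in vocabulary:
            vector[vocabulary[term]] = count
    return vector


# Baseline: every tag is kept, so every distinct word becomes a feature.
ALL_TAGS = {"DET", "NOUN", "VERB", "ADJ"}
full_vocab = build_vocabulary([select_terms(d, ALL_TAGS) for d in TAGGED_DOCS])

# POS-based selection: keep nouns only (an arbitrary choice for this sketch).
noun_docs = [select_terms(d, {"NOUN"}) for d in TAGGED_DOCS]
noun_vocab = build_vocabulary(noun_docs)
noun_vectors = [to_count_vector(d, noun_vocab) for d in noun_docs]

print(len(full_vocab), "features without selection")
print(len(noun_vocab), "features with nouns only")
```

Keeping nouns only is an arbitrary choice here; changing `keep_tags` lets other tag combinations be compared in the same way.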