Discourse type clustering using POS n-gram profiles and high-dimensional embeddings

Authors:
Christelle Cocco
Affiliations:
University of Lausanne, Switzerland
Venue:
EACL '12 Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics
Year:
2012

Abstract

To cluster textual sequence types (discourse types or modes) in French texts, the K-means algorithm with high-dimensional embeddings and a fuzzy clustering algorithm were applied to clauses whose POS (part-of-speech) n-gram profiles had previously been extracted. Uni-, bi- and trigrams were used on four 19th-century French short stories by Maupassant. For the high-dimensional embeddings, power transformations of the chi-squared distances between clauses were explored. Preliminary results show that high-dimensional embeddings improve the quality of clustering, whereas the performance of bi- and trigrams is disappointing, possibly because of feature space sparsity.
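
The first steps described in the abstract (POS n-gram profiles per clause, chi-squared distances between clauses, and a power transformation of those distances) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the toy tag sequences, the tag labels, and the exponent q = 0.5. The distance is the usual squared chi-squared distance between row profiles of a contingency table; the abstract does not state the exact weighting or exponent range the author used, and the clustering step itself (K-means or fuzzy clustering) is not shown.

```python
from collections import Counter


def ngram_profile(tags, n):
    """Count the POS n-grams (as tuples) found in one clause."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


def chi2_distances(profiles):
    """Squared chi-squared distances between clause profiles.

    Each profile is converted to relative frequencies; squared differences
    are divided by the overall relative frequency of the n-gram.
    """
    features = sorted(set().union(*profiles))
    sizes = [sum(p.values()) for p in profiles]
    if min(sizes) == 0:
        raise ValueError("every clause needs at least one n-gram")
    total = sum(sizes)
    # Overall relative frequency of each n-gram (column weight).
    weight = {f: sum(p[f] for p in profiles) / total for f in features}
    rows = [[p[f] / s for f in features] for p, s in zip(profiles, sizes)]
    return [
        [sum((a - b) ** 2 / weight[f] for a, b, f in zip(ri, rj, features))
         for rj in rows]
        for ri in rows
    ]


def power_transform(dist, q):
    """Raise every distance to the power q (illustrative choice: 0 < q <= 1)."""
    return [[d ** q for d in row] for row in dist]


# Hypothetical toy clauses, already POS-tagged; the tag labels are illustrative.
clauses = [
    ["DET", "NOUN", "VERB"],
    ["DET", "NOUN", "VERB"],
    ["PRON", "VERB", "ADV"],
]
profiles = [ngram_profile(c, 1) for c in clauses]  # unigram profiles
dist = chi2_distances(profiles)
transformed = power_transform(dist, 0.5)
```

Passing n = 2 or n = 3 to ngram_profile gives bigram or trigram profiles. With clauses this short, most n-gram counts are then zero, which is the kind of feature space sparsity the abstract mentions as a possible cause of the weaker bi- and trigram results.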