Linguistic preprocessing for distributional classification of words

Authors:
Viktor Pekar
Affiliations:
University of Wolverhampton, MB, Wolverhampton, UK
Venue:
ElectricDict '04 Proceedings of the Workshop on Enhancing and Using Electronic Dictionaries
Year:
2004

Abstract

This paper is concerned with the automatic classification of new lexical items into synonymic sets on the basis of cooccurrence data obtained from a corpus. Our goal is to examine the impact that different types of linguistic preprocessing of the cooccurrence material have on classification accuracy. The paper compares several preprocessing techniques frequently used for this and similar tasks and draws conclusions about their relative merits. We find that, compared with window-based context delineation, a carefully chosen preprocessing procedure achieves a relative improvement in effectiveness of up to 88%, depending on the classification method, while using a much smaller feature space.
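
For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of window-based context delineation: every token within a fixed distance of a target word is counted as a cooccurrence feature. The function name `window_cooccurrences`, the window size of 2, and the toy sentence are illustrative assumptions; the abstract does not state the window size, weighting scheme, or corpus used in the paper.

```python
from collections import Counter, defaultdict


def window_cooccurrences(tokens, targets, window=2):
    """Count context words within +/- `window` tokens of each target word.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    counts = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        if tok not in targets:
            continue
        # Clip the window to the boundaries of the token sequence.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target occurrence itself
                counts[tok][tokens[j]] += 1
    return counts


# Toy input (assumed, not from the paper's corpus).
tokens = "the cat drinks milk and the dog drinks water".split()
vectors = window_cooccurrences(tokens, {"cat", "dog"}, window=2)
```

Each entry of `vectors` is a sparse feature vector for one target word. Because every neighbouring token becomes a feature regardless of its syntactic relation to the target, this kind of representation tends to produce a large feature space, which is the point of comparison the abstract draws against linguistically preprocessed contexts.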