Distributional term representations: an experimental comparison

Authors:
Alberto Lavelli;Fabrizio Sebastiani;Roberto Zanoli
Affiliations:
ITC-irst, Povo di Trento, Italy;ISTI-CNR, Pisa, Italy;ITC-irst, Povo di Trento, Italy
Venue:
Proceedings of the thirteenth ACM international conference on Information and knowledge management
Year:
2004


Abstract
Abstract

A number of content management tasks, including term categorization, term clustering, and automated thesaurus generation, view natural language terms (e.g., words, noun phrases) as first-class objects, i.e., as objects endowed with an internal representation that makes them suitable for explicit manipulation by the corresponding algorithms. The information retrieval (IR) literature has traditionally used an extensional (aka distributional) representation for terms, according to which a term is represented by the "bag of documents" in which the term occurs. The computational linguistics (CL) literature has independently developed an alternative distributional representation for terms, according to which a term is represented by the "bag of terms" that co-occur with it in some document. This paper aims to discover which of the two representations is more effective, i.e., yields higher effectiveness when used in tasks that require terms to be explicitly represented and manipulated. We carry out experiments on (i) a term categorization task and (ii) a term clustering task; this allows us to compare the two representations under closely controlled experimental conditions. We report the results of experiments in which we categorize and cluster, under 42 different classes, the terms extracted from a corpus of more than 65,000 documents. Our results show a substantial difference in effectiveness between the two representation styles; we give both an intuitive explanation and an information-theoretic justification for these different behaviours.
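
The two representations contrasted in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy corpus, the function names `bag_of_documents`, `bag_of_terms`, and `cosine`, the use of raw counts as weights, and the choice to count co-occurrence at the level of whole documents. The paper itself may weight and normalize these vectors differently.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: each document is a list of already-tokenized terms.
docs = [
    ["cat", "dog", "pet"],
    ["dog", "pet", "food"],
    ["stock", "market"],
]

def bag_of_documents(docs):
    """Map each term to a Counter {doc_id: occurrences of the term in that document}."""
    rep = defaultdict(Counter)
    for doc_id, doc in enumerate(docs):
        for term in doc:
            rep[term][doc_id] += 1
    return dict(rep)

def bag_of_terms(docs):
    """Map each term to a Counter {other_term: number of documents containing both}."""
    rep = defaultdict(Counter)
    for doc in docs:
        vocab = set(doc)
        for term in vocab:
            counts = rep[term]  # ensures every term gets an entry, even if empty
            for other in vocab:
                if other != term:
                    counts[other] += 1
    return dict(rep)

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

doc_rep = bag_of_documents(docs)
term_rep = bag_of_terms(docs)

# The same pair of terms can be compared under either representation.
print(cosine(doc_rep["cat"], doc_rep["dog"]))
print(cosine(term_rep["cat"], term_rep["dog"]))
```

In both cases a term becomes a sparse vector, so the same categorization or clustering algorithm can be run on either representation; only the feature space changes (documents in one case, terms in the other).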