A number of content management tasks, including term categorization, term clustering, and automated thesaurus generation, view natural language terms (e.g. words, noun phrases) as first-class objects, i.e. as objects endowed with an internal representation that makes them suitable for explicit manipulation by the corresponding algorithms. The information retrieval (IR) literature has traditionally used an extensional (also known as distributional) representation for terms, according to which a term is represented by the "bag of documents" in which it occurs. The computational linguistics (CL) literature has independently developed an alternative distributional representation, according to which a term is represented by the "bag of terms" that co-occur with it in some document. This paper aims to discover which of the two representations is more effective, i.e. which yields better results in tasks that require terms to be explicitly represented and manipulated. We carry out experiments on (i) a term categorization task and (ii) a term clustering task; this allows us to compare the two representations under closely controlled experimental conditions. We report the results of experiments in which we categorize or cluster, under 42 different classes, the terms extracted from a corpus of more than 65,000 documents. Our results show a substantial difference in effectiveness between the two representation styles; we give both an intuitive explanation and an information-theoretic justification for this difference.
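To make the two representations concrete, the following is a minimal sketch, not the authors' implementation. It assumes raw occurrence counts for the "bag of documents" and document-level co-occurrence counts for the "bag of terms"; the abstract does not state the weighting actually used, and the function name `term_representations` and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def term_representations(docs):
    """Build both distributional representations for every term.

    docs: mapping from document id to a list of tokens.
    Returns (bag_of_docs, bag_of_terms), each mapping term -> Counter.
    Assumption: raw counts, with co-occurrence measured per document.
    """
    bag_of_docs = defaultdict(Counter)   # term -> {doc id: occurrences in that doc}
    bag_of_terms = defaultdict(Counter)  # term -> {other term: number of shared docs}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        for term, n in counts.items():
            bag_of_docs[term][doc_id] = n
            for other in counts:
                if other != term:
                    bag_of_terms[term][other] += 1
    return dict(bag_of_docs), dict(bag_of_terms)


# Hypothetical toy corpus, for illustration only.
docs = {
    "d1": ["bank", "loan", "interest"],
    "d2": ["bank", "river"],
    "d3": ["loan", "interest", "rate"],
}
bag_of_docs, bag_of_terms = term_representations(docs)
print(dict(bag_of_docs["bank"]))   # IR-style: the documents containing "bank"
print(dict(bag_of_terms["loan"]))  # CL-style: the terms co-occurring with "loan"
```

In this sketch the "bag of documents" vector for a term has one dimension per document, while the "bag of terms" vector has one dimension per vocabulary term. Either vector can then be passed to a categorization or clustering algorithm, which is the setting the paper compares.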