Towards Unrestricted, Large-Scale Acquisition of Feature-Based Conceptual Representations from Corpus Data

Authors:
Barry Devereux;Nicholas Pilkington;Thierry Poibeau;Anna Korhonen
Affiliations:
Centre for Speech, Language and the Brain, Department of Experimental Psychology, University of Cambridge, Cambridge, UK CB2 3EB;Computer Laboratory & RCEAL, University of Cambridge, Cambridge, UK CB3 0FD;Laboratoire LaTTiCe, CNRS UMR 8094 and École Normale Supérieure, Montrouge, France 92120;Computer Laboratory & RCEAL, University of Cambridge, Cambridge, UK CB3 0FD
Venue:
Research on Language and Computation
Year:
2009

Abstract

In recent years a number of methods have been proposed for the automatic acquisition of feature-based conceptual representations from text corpora. Such methods could offer valuable support for theoretical research on conceptual representation. However, existing methods do not target the full range of concept-relation-feature triples occurring in human-generated norms (e.g. flute produce sound) but rather focus on concept-feature pairs (e.g. flute - sound) or on triples involving specific relations only (e.g. is-a or part-of relations). In this article we investigate the challenges that need to be met in both methodology and evaluation when moving towards the acquisition of more comprehensive conceptual representations from corpora. In particular, we investigate the usefulness of three types of knowledge in guiding the extraction process: encyclopedic, syntactic and semantic. We first present a semantic analysis of existing, human-generated feature production norms, which reveals information about co-occurring concept and feature classes. We then introduce a novel method for large-scale feature extraction which uses this class-based information to guide the acquisition process. The method involves extracting candidate triples consisting of concepts, relations and features (e.g. deer have antlers, flute produce sound) from corpus data parsed for grammatical dependencies, and re-weighting the triples on the basis of conditional probabilities calculated from our semantic analysis. We apply this method to an automatically parsed Wikipedia corpus, which includes encyclopedic information, and evaluate its accuracy using a number of different methods: direct evaluation against the McRae norms in terms of feature types and frequencies, human evaluation, and a novel evaluation in terms of conceptual structure variables. Our investigation highlights a number of issues that need to be addressed in both methodology and evaluation in order to further improve the accuracy of unconstrained feature extraction.
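The re-weighting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the scoring formula, so the sketch assumes a simple product of corpus frequency and P(feature class | concept class). The triple counts, class labels, probabilities, the `default` back-off value and the function names `reweight` and `top_features` are all hypothetical.

```python
from collections import Counter

# Hypothetical candidate triples (concept, relation, feature), standing in for
# the output of a dependency-parsed corpus. Counts are invented for illustration.
candidate_counts = Counter({
    ("deer", "have", "antlers"): 12,
    ("deer", "have", "website"): 15,
    ("flute", "produce", "sound"): 8,
})

# Hypothetical semantic classes for concepts and features.
concept_class = {"deer": "animal", "flute": "instrument"}
feature_class = {"antlers": "body_part", "website": "artifact", "sound": "sound"}

# Hypothetical conditional probabilities P(feature class | concept class),
# of the kind a semantic analysis of feature norms might yield.
cond_prob = {
    ("animal", "body_part"): 0.40,
    ("animal", "artifact"): 0.02,
    ("instrument", "sound"): 0.30,
}


def reweight(counts, concept_class, feature_class, cond_prob, default=0.01):
    """Score each triple as frequency * P(feature class | concept class).

    The product is an assumed scoring rule; unseen class pairs fall back
    to the (also assumed) `default` probability.
    """
    scores = {}
    for (concept, relation, feature), freq in counts.items():
        key = (concept_class.get(concept), feature_class.get(feature))
        scores[(concept, relation, feature)] = freq * cond_prob.get(key, default)
    return scores


def top_features(scores, concept, k=5):
    """Return the k highest-scoring (triple, score) pairs for one concept."""
    ranked = sorted(
        (item for item in scores.items() if item[0][0] == concept),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:k]


scores = reweight(candidate_counts, concept_class, feature_class, cond_prob)
for triple, score in top_features(scores, "deer"):
    print(triple, round(score, 2))
```

In this toy data the raw counts favour "deer have website", but the class-based weight moves "deer have antlers" to the top, which is the kind of correction that class-level information is meant to provide.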