We describe work in progress aimed at developing methods for automatically constructing a lexicon using only statistical data derived from analysis of corpora, a problem we call lexical optimization. Specifically, we use statistical methods alone to obtain information equivalent to syntactic categories, and to discover the semantically meaningful units of text, which may be multi-word units or polysemous terms-in-context. Our guiding principle is to employ a notion of "meaningfulness" that can be quantified information-theoretically, so that plausible variants of a lexicon can be judged relative to each other. We describe a technique of this nature called information-theoretic co-clustering and give results of a series of experiments built around it that demonstrate the main ingredients of lexical optimization. We conclude by describing our plans for further improvements, and for applying the same mathematical principles to other problems in natural language processing.
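The abstract does not spell out the co-clustering procedure itself. As a rough illustration of the information-theoretic co-clustering idea, the following toy sketch jointly clusters the rows (words) and columns (contexts) of a co-occurrence distribution by greedy alternating reassignment, maximizing the mutual information retained between row clusters and column clusters, which is equivalent to minimizing the loss I(X;Y) − I(X̂;Ŷ). The function names (`itcc`, `collapsed`) and the greedy search are illustrative assumptions, not the authors' exact algorithm.

```python
import numpy as np

def mutual_info(p):
    """I(X;Y) in nats for a joint distribution p over (row, column)."""
    px = p.sum(axis=1, keepdims=True)   # row marginal, shape (n, 1)
    py = p.sum(axis=0, keepdims=True)   # column marginal, shape (1, m)
    mask = p > 0                        # skip zero cells (0 * log 0 := 0)
    return float((p[mask] * np.log(p[mask] / (px @ py)[mask])).sum())

def collapsed(p, rows, cols, k, l):
    """Joint over (row cluster, column cluster) induced by the assignments."""
    q = np.zeros((k, l))
    np.add.at(q, (rows[:, None], cols[None, :]), p)
    return q

def itcc(p, k, l, iters=10, seed=0):
    """Toy greedy co-clustering: alternately move each row, then each
    column, to the cluster that maximizes I(row cluster; column cluster).
    Each move includes the current assignment, so the objective never
    decreases (a local search, not a guaranteed global optimum)."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(k, size=p.shape[0])
    cols = rng.integers(l, size=p.shape[1])
    for _ in range(iters):
        for i in range(p.shape[0]):
            scores = []
            for r in range(k):
                trial = rows.copy()
                trial[i] = r
                scores.append(mutual_info(collapsed(p, trial, cols, k, l)))
            rows[i] = int(np.argmax(scores))
        for j in range(p.shape[1]):
            scores = []
            for c in range(l):
                trial = cols.copy()
                trial[j] = c
                scores.append(mutual_info(collapsed(p, rows, trial, k, l)))
            cols[j] = int(np.argmax(scores))
    return rows, cols

# Hypothetical word-by-context co-occurrence counts with two blocks.
counts = np.array([[4, 4, 0, 0],
                   [4, 4, 0, 0],
                   [0, 0, 4, 4],
                   [0, 0, 4, 4]], dtype=float)
p = counts / counts.sum()
rows, cols = itcc(p, k=2, l=2)
```

Under this view, "meaningfulness" is quantified as the mutual information a candidate clustering preserves, so two plausible variants of a lexicon can be compared by a single number, as the abstract suggests.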