The Unsupervised Acquisition of a Lexicon from Continuous Speech

Authors:
Carl de Marcken
Venue:
The Unsupervised Acquisition of a Lexicon from Continuous Speech
Year:
1995


Abstract

We present an unsupervised learning algorithm that acquires a natural-language lexicon from raw speech. The algorithm is based on the optimal encoding of symbol sequences in an MDL framework, and uses a hierarchical representation of language that overcomes many of the problems that have stymied previous grammar-induction procedures. The forward mapping from symbol sequences to the speech stream is modeled using features based on articulatory gestures. We present results on the acquisition of lexicons and language models from raw speech, text, and phonetic transcripts, and demonstrate that our algorithm compares very favorably to other reported results with respect to segmentation performance and statistical efficiency.