Automatic discovery of topics and acoustic morphemes from speech

Authors:
Christophe Cerisara
Affiliations:
LORIA UMR 7503, Campus Scientifique, 54506 Vandoeuvre-les-Nancy, France
Venue:
Computer Speech and Language
Year:
2009

Abstract

This work deals with automatic lexical acquisition and topic discovery from a speech stream. The proposed algorithm builds a lexicon enriched with topic information in three steps: transcription of an audio stream into phone sequences with a speaker- and task-independent phone recogniser, automatic lexical acquisition based on approximate string matching, and hierarchical topic clustering of the lexical entries based on a knowledge-poor co-occurrence approach. The resulting semantic lexicon is then used to automatically cluster the incoming speech stream into topics. The main advantages of this algorithm are its very low computational requirements and its independence from predefined linguistic resources, which makes it easy to port to new languages and to adapt to new tasks. It is evaluated both qualitatively and quantitatively on two corpora and on two tasks related to topic clustering. The results of these evaluations are encouraging and outline future directions of research for the proposed algorithm, such as automatically building orthographic labels for the lexical items.
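To make the second step more concrete, the sketch below illustrates approximate string matching over a phone sequence. It is not taken from the paper: it uses a standard dynamic-programming edit-distance search, and the function name `approx_find`, the phone symbols, and the error threshold are all assumptions chosen for illustration. The abstract does not specify which matching algorithm or parameters the author uses.

```python
from typing import List, Sequence


def approx_find(pattern: Sequence[str], stream: Sequence[str],
                max_errors: int) -> List[int]:
    """Return the end positions in `stream` where `pattern` occurs with at
    most `max_errors` edits (substitutions, insertions, deletions).

    Hypothetical helper for illustration only; not the paper's algorithm.
    """
    m = len(pattern)
    # prev[i] is the smallest edit distance between pattern[:i] and any
    # substring of the stream that ends just before the current phone.
    prev = list(range(m + 1))
    ends = []
    for j, phone in enumerate(stream):
        curr = [0]  # an occurrence may start at any position in the stream
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == phone else 1
            curr.append(min(prev[i - 1] + cost,  # match or substitution
                            prev[i] + 1,         # extra phone in the stream
                            curr[i - 1] + 1))    # pattern phone missing
        if curr[m] <= max_errors:
            ends.append(j)
        prev = curr
    return ends


# Made-up phone transcription; a recogniser would produce a noisier stream.
stream = "dh ax k ae t s ae t aa n dh ax m ae t".split()
pattern = "k ae t".split()

exact = approx_find(pattern, stream, max_errors=0)
fuzzy = approx_find(pattern, stream, max_errors=1)
```

With a threshold of zero only the exact occurrence is reported. Allowing one error also reports near matches such as "s ae t" and "m ae t", which is the kind of tolerance needed when the phone recogniser output contains errors. Choosing the threshold is a trade-off between recall and spurious matches.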