A neural probabilistic language model

  • Authors:
  • Yoshua Bengio; Réjean Ducharme; Pascal Vincent; Christian Janvin

  • Affiliations:
  • Département d'Informatique et Recherche Opérationnelle, Centre de Recherche Mathématiques, Université de Montréal, Montréal, Québec, Canada (all authors)

  • Venue:
  • The Journal of Machine Learning Research
  • Year:
  • 2003

Abstract

A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that it makes it possible to take advantage of longer contexts.
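
For readers who want a concrete picture of the architecture the abstract describes, below is a minimal sketch, not the authors' original code: it assumes a PyTorch implementation, and the layer sizes, variable names, and batch setup are illustrative. It follows the paper's formulation y = b + Wx + U tanh(d + Hx), where x is the concatenation of the learned feature vectors of the n-1 context words.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NPLM(nn.Module):
    """Sketch of the neural probabilistic language model: each context
    word is mapped to a learned feature vector (matrix C); the
    concatenated vectors feed a tanh hidden layer and a softmax over
    the vocabulary, with optional direct input-to-output connections."""

    def __init__(self, vocab_size, context_size, embed_dim=30, hidden_dim=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)                    # C
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)       # H, d
        self.out_hidden = nn.Linear(hidden_dim, vocab_size)                 # U
        self.out_direct = nn.Linear(context_size * embed_dim, vocab_size)   # W, b

    def forward(self, context):              # context: (batch, n-1) word indices
        x = self.embed(context).flatten(1)   # concatenate feature vectors -> x
        h = torch.tanh(self.hidden(x))
        logits = self.out_hidden(h) + self.out_direct(x)
        return F.log_softmax(logits, dim=-1)  # log P(w_t | previous n-1 words)

# Illustrative usage: maximize the log-likelihood of the next word.
model = NPLM(vocab_size=20000, context_size=5)     # hypothetical sizes
context = torch.randint(0, 20000, (32, 5))         # batch of 5-word contexts
target = torch.randint(0, 20000, (32,))            # next word for each context
loss = F.nll_loss(model(context), target)
loss.backward()
```

Training with stochastic gradient ascent on the log-likelihood updates the word feature vectors and the probability function jointly, which is what lets an unseen word sequence receive high probability when its words have representations close to those of words in sequences seen during training.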