An empirical study of smoothing techniques for language modeling

  • Authors:
  • Stanley F. Chen; Joshua Goodman

  • Affiliations:
  • Harvard University, Cambridge, MA; Harvard University, Cambridge, MA

  • Venue:
  • ACL '96: Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics
  • Year:
  • 1996

Abstract

We present an extensive empirical comparison of several smoothing techniques in the domain of language modeling, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). We investigate for the first time how factors such as training data size, corpus (e.g., Brown versus Wall Street Journal), and n-gram order (bigram versus trigram) affect the relative performance of these methods, which we measure through the cross-entropy of test data. In addition, we introduce two novel smoothing techniques, one a variation of Jelinek-Mercer smoothing and one a very simple linear interpolation technique, both of which outperform existing methods.
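The two ideas named in the abstract, interpolation-based smoothing and cross-entropy evaluation on test data, can be sketched compactly. The Python sketch below is our illustration, not the authors' implementation: it interpolates bigram, unigram, and uniform maximum-likelihood estimates with fixed, assumed weights (lam_bi, lam_uni), whereas Jelinek-Mercer smoothing proper fits the interpolation weights on held-out data (e.g., via EM).

```python
import math
from collections import Counter

def counts(tokens):
    """Unigram and bigram counts from a training token sequence."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def interp_prob(prev, word, uni, bi, total, vocab_size,
                lam_bi=0.6, lam_uni=0.3):
    """Linearly interpolate bigram ML, unigram ML, and uniform estimates.

    Jelinek-Mercer smoothing would estimate the lambdas on held-out
    data; the fixed weights here are illustrative assumptions only.
    """
    p_bi = bi[(prev, word)] / uni[prev] if uni[prev] else 0.0
    p_uni = uni[word] / total
    p_unif = 1.0 / vocab_size  # uniform floor: nonzero mass for unseen words
    return lam_bi * p_bi + lam_uni * p_uni + (1 - lam_bi - lam_uni) * p_unif

def cross_entropy(test_tokens, uni, bi, total, vocab_size):
    """Per-word cross-entropy (bits) of test data, the paper's yardstick."""
    log_prob = 0.0
    for prev, word in zip(test_tokens, test_tokens[1:]):
        log_prob += math.log2(
            interp_prob(prev, word, uni, bi, total, vocab_size))
    return -log_prob / (len(test_tokens) - 1)

train = "the cat sat on the mat the dog sat on the rug".split()
test = "the cat sat on the rug".split()
uni, bi = counts(train)
vocab = set(train) | set(test)
print(cross_entropy(test, uni, bi, len(train), len(vocab)))
```

The uniform term guarantees every test event nonzero probability, so the cross-entropy stays finite even for unseen bigrams; how the compared smoothing methods assign that reserved mass, and how they choose the interpolation weights, is exactly where they differ.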