Optimization, maxent models, and conditional estimation without magic

Authors:
Christopher Manning;Dan Klein
Affiliations:
Stanford University;Stanford University
Venue:
NAACL-Tutorials '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Tutorials - Volume 5
Year:
2003

Citing 0
Cited 20

Choosing the right translation: a syntactically informed classification approach

COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Question classification using head words and their hypernyms

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Efficient online learning and prediction of users' desktop actions

IJCAI'09 Proceedings of the 21st international jont conference on Artifical intelligence
Investigation of question classifier in question answering

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2
Accurate semantic class classifier for coreference resolution

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3 - Volume 3
Domain adaptive bootstrapping for named entity recognition

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3 - Volume 3
An extractive supervised two-stage method for sentence compression

HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Tuning as ranking

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Active learning with Amazon Mechanical Turk

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
An evaluation of classification models for question topic categorization

Journal of the American Society for Information Science and Technology
Did it happen? the pragmatic complexity of veridicality assessment

Computational Linguistics
Recognising sentence similarity using similitude and dissimilarity features

International Journal of Advanced Intelligence Paradigms
Automatic generation of short informative sentiment summaries

EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
SRIUBC: simple similarity features for semantic textual similarity

SemEval '12 Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation
Using sense-labeled discourse connectives for statistical machine translation

EACL 2012 Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)
Joshua 4.0: packing, PRO, and paraphrases

WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation
Concept adjustment for description logics

Proceedings of the seventh international conference on Knowledge capture
Analyzing users' narratives to understand experience with interactive products

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Natural language watermarking for german texts

Proceedings of the first ACM workshop on Information hiding and multimedia security
Using micro-reviews to select an efficient set of reviews

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management

Quantified Score

Hi-index	0.00

Visualization

Abstract

This tutorial aims to cover the basic ideas and algorithms behind techniques such as maximum entropy modeling, conditional estimation of generative probabilistic models, and issues regarding the use of models more complex than simple Naive Bayes and Hidden Markov Models. In recent years, these sophisticated probabilistic methods have been used with considerable success on most of the core tasks of natural language processing, for speech language models, and for IR tasks such as text filtering and categorization, but the methods and their relationships are often not well understood by practitioners. Our focus is on insight and understanding, using graphical illustrations rather than detailed derivations whenever possible. The goal of the tutorial is that the inner workings of these modeling and estimation techniques be transparent and intuitive, rather than black boxes labeled "magic here".The tutorial decomposes these methods into optimization problems on the one side, and optimization methods on the other. The first hour of the tutorial presents the basics of non-linear optimization, assuming only knowledge of basic calculus. We begin with a discussion of convexity and unconstrained optimization, focusing on gradient methods. We discuss in detail both simple gradient descent and the much more practical conjugate gradient descent. The key ideas are presented, including a comparison/contrast with alternative methods. Next, the case of constrained optimization is presented, highlighting the method of Lagrange multipliers and presenting several ways of translating the abstract ideas into a concrete optimization method. The principal goal, again, is to make Lagrange methods appear as intuitively natural, rather than as mathematical sleight-of-hand.The second part of the tutorial begins with a presentation of maximum entropy models from first principles, showing their equivalence to exponential models (also known as loglinear models, and particular versions of which give logistic regression, and conditional random fields). We present many simple examples to build intuition for what maxent models can and cannot do. Finally, we discuss how to find parameters for maximum entropy models using the previously presented optimization methods. By this point in the tutorial, audience members should have a clear understanding of how to build a system for estimating maxent models. We conclude with a discussion of issues specific to the language technology domain, including conditional estimation of generative models, and the issues involved in choosing model structure (such as independence, label and observation biases, and so on). We also discuss methods of smoothing, focusing on how smoothing works differently for maxent models than for standard relative-frequency-based distributions.The tutorial will run 3 hours, with a break in the middle. Participants will be assumed to know basic calculus and probability theory, and to have some exposure to models such as Naive Bayes and HMMs, but need only have a basic awareness of language technology problems.