More accurate tests for the statistical significance of result differences

Authors: Alexander Yeh
Affiliations: Mitre Corp., Bedford, MA
Venue: COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 2
Year: 2000

Abstract

Statistical significance testing of differences in values of metrics like recall, precision, and balanced F-score is a necessary part of empirical natural language processing. Unfortunately, we find in a set of experiments that many commonly used tests often underestimate the significance and so are less likely to detect differences that exist between different techniques. This underestimation comes from an independence assumption that is often violated. We point out some useful tests that do not make this assumption, including computationally intensive randomization tests.
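
As an illustration of the kind of randomization test the abstract mentions, the following is a minimal sketch of a paired approximate randomization test on balanced F-score. It is not taken from the paper: the function names, the per-item `(tp, fp, fn)` data layout, and the default trial count are all assumptions made for this example. The idea is that, if the two systems were really equivalent, swapping their outputs on any test item should not matter, so the observed difference is compared against differences obtained under many random swaps.

```python
import random


def f_score(counts):
    """Balanced F-score from per-item (tp, fp, fn) tuples (hypothetical layout)."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def approx_randomization_test(counts_a, counts_b, metric=f_score,
                              trials=10000, seed=0):
    """Estimate a two-sided p-value for the metric difference between two
    systems scored on the same test items.

    counts_a[i] and counts_b[i] must describe the same test item.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both systems must be scored on the same items")

    rng = random.Random(seed)
    observed = abs(metric(counts_a) - metric(counts_b))

    at_least_as_large = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(counts_a, counts_b):
            # Under the null hypothesis the system labels are exchangeable,
            # so swap the two outputs for this item with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if abs(metric(shuffled_a) - metric(shuffled_b)) >= observed:
            at_least_as_large += 1

    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_large + 1) / (trials + 1)


if __name__ == "__main__":
    # Toy data: system A is right on every item, system B on none.
    system_a = [(1, 0, 0)] * 30
    system_b = [(0, 1, 1)] * 30
    print(approx_randomization_test(system_a, system_b, trials=1000))
```

Because the metric is recomputed over the whole shuffled test set on every trial, the sketch does not need the per-item contributions to F-score to be independent. The cost is `trials` full metric evaluations, which is where the computational intensity comes from.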