The field of Music Information Retrieval has always acknowledged the need for rigorous scientific evaluation, and several efforts have set out to develop and provide the infrastructure, technology, and methodologies required to carry out such evaluations. The community has gained enormously from these evaluation forums, but we have reached a point where current evaluation frameworks no longer allow us to improve as much, or as well, as we would like. The community has recently acknowledged this problem and shown interest in addressing it, though it remains unclear how best to improve the situation. We argue that a good place to start is, once again, the field of Text IR. Building on a formalization of the evaluation process, this paper surveys past evaluation work in Text IR from the standpoint of the validity, reliability, and efficiency of experiments. We identify the evaluation problems our community currently faces, point to several lines of research for improving the situation, and make various proposals along those lines.