Corpus creation for new genres: A crowdsourced approach to PP attachment

Authors:
Mukund Jha; Jacob Andreas; Kapil Thadani; Sara Rosenthal; Kathleen McKeown
Affiliations:
Columbia University, New York, NY (all authors)
Venue:
CSLDAMT '10 Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk
Year:
2010

Abstract

This paper explores how to build an accurate prepositional phrase (PP) attachment corpus for new genres without a large investment of time and money, by crowdsourcing the attachment judgments. We present a system that extracts prepositional phrases and their potential attachments from ungrammatical and informal sentences and poses the resulting disambiguation tasks as multiple-choice questions to workers on Amazon's Mechanical Turk service. Our analysis shows that this two-step approach produces reliable annotations on informal and potentially noisy blog text, and that the semi-automated strategy holds promise for similar annotation projects in new genres.
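
To make the two-step idea concrete, here is a minimal Python sketch of how one extracted prepositional phrase and its candidate attachment points might be turned into a multiple-choice item, and how several workers' answers might be combined. It is an illustration only, not the authors' system: the names `AttachmentQuestion`, `build_question`, and `aggregate` are hypothetical, the candidate attachment indices are assumed to come from an earlier extraction step, and plurality voting is an assumed aggregation rule that the abstract does not specify.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AttachmentQuestion:
    """One multiple-choice item (hypothetical structure, not from the paper)."""
    sentence: str
    phrase: str    # the prepositional phrase to be attached
    options: list  # candidate attachment words shown to the worker


def build_question(tokens, pp_start, pp_end, candidate_indices):
    """Step 1: package a PP and its candidate attachment points as a question.

    The span and candidate indices are assumed to be produced by an
    upstream extraction step, which is not modeled here.
    """
    phrase = " ".join(tokens[pp_start:pp_end])
    options = [tokens[i] for i in candidate_indices]
    return AttachmentQuestion(" ".join(tokens), phrase, options)


def aggregate(votes):
    """Step 2: combine worker answers by plurality (an assumed rule).

    Returns the most frequent answer and the fraction of workers who chose it.
    """
    counts = Counter(votes)
    winner, n = counts.most_common(1)[0]
    return winner, n / len(votes)


tokens = "i saw the man with the telescope".split()
question = build_question(tokens, 4, 7, [1, 3])
answer, agreement = aggregate(["saw", "man", "saw"])
```

Here `question.phrase` is "with the telescope" and the options are "saw" and "man". The agreement ratio returned by `aggregate` could serve as a rough signal for which items need further review.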