The shared corpora working group report

  • Authors:
  • Adam Meyers; Nancy Ide; Ludovic Denoyer; Yusuke Shinyama

  • Affiliations:
  • New York University, New York, NY; Vassar College, Poughkeepsie, NY; University of Paris, Paris, France; New York University, New York, NY

  • Venue:
  • LAW '07 Proceedings of the Linguistic Annotation Workshop
  • Year:
  • 2007

Abstract

We seek to identify a limited number of representative corpora suitable for annotation by the computational linguistics annotation community. Our hope is that a wide variety of annotation will be undertaken on the same corpora, which would facilitate: (1) the comparison of annotation schemes; (2) the merging of information represented by various annotation schemes; (3) the emergence of NLP systems that use information from multiple annotation schemes; and (4) the adoption of various types of best practice in corpus annotation. Such best practices would include: (a) clearer demarcation of the phenomena being annotated; (b) the use of particular test corpora to determine whether a given annotation task can feasibly achieve good agreement scores; (c) the use of underlying models for representing annotation content that facilitate merging, comparison, and analysis; and (d) to the extent possible, the use of common annotation categories, or a mapping among the categories used by different annotation groups for the same phenomenon. This study focuses on the problem of identifying such corpora, as well as on the suitability of two candidate corpora: the Open portion of the American National Corpus (Ide and Macleod, 2001; Ide and Suderman, 2004) and the "Controversial" portions of the WikipediaXML corpus (Denoyer and Gallinari, 2006).