Annotated web as corpus

Authors:
Paul Rayson; James Walkerdine; William H. Fletcher; Adam Kilgarriff
Affiliations:
Lancaster University, UK; Lancaster University, UK; United States Naval Academy; Lexical Computing Ltd., UK
Venue:
WAC '06 Proceedings of the 2nd International Workshop on Web as Corpus
Year:
2006

References (7):
Mining the Web: Discovering Knowledge from HyperText Data
Introduction to the special issue on the web as corpus. Computational Linguistics - Special issue on web as corpus
Introduction to the special issue on evaluating word sense disambiguation systems. Natural Language Engineering
P2P-4-DL: Digital Library over Peer-to-Peer. P2P '04 Proceedings of the Fourth International Conference on Peer-to-Peer Computing
Using the web to overcome data sparseness. EMNLP '02 Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10
Blueprint for a high performance NLP infrastructure. SEALTS '03 Proceedings of the HLT-NAACL 2003 workshop on Software engineering and architecture of language technology systems - Volume 8
Parsing the WSJ using CCG and log-linear models. ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics

Cited by (2):
Constructing a large scale text corpus based on the grid and trustworthiness. TSD'07 Proceedings of the 10th international conference on Text, speech and dialogue
PackPlay: mining semantic data in collaborative games. LAW IV '10 Proceedings of the Fourth Linguistic Annotation Workshop


Abstract

This paper presents a proposal to facilitate the use of the annotated web as corpus by alleviating the annotation bottleneck for corpus data drawn from the web. We describe a framework for large-scale distributed corpus annotation using peer-to-peer (P2P) technology to meet this need. We also propose to annotate a large reference corpus in order to evaluate this framework. This will allow us to investigate the affordances offered by distributed techniques to ensure replicability of linguistic research based on web-derived corpora.