Large linguistically-processed web corpora for multiple languages

Authors:
Marco Baroni; Adam Kilgarriff
Affiliations:
University of Bologna, Italy; Lexical Computing Ltd. and University of Sussex, Brighton, UK
Venue:
EACL '06 Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations
Year:
2006

Abstract

The Web contains vast amounts of linguistic data. One key issue for linguists and language technologists is how to access it. Commercial search engines give highly compromised access. An alternative is to crawl the Web ourselves, which also allows us to remove duplicates and near-duplicates, navigational material, and a range of other kinds of non-linguistic matter. We can also tokenise, lemmatise and part-of-speech tag the corpus, and load the data into a corpus query tool which supports sophisticated linguistic queries. We have now done this for German and Italian, with corpus sizes of over 1 billion words in each case. We provide Web access to the corpora in our query tool, the Sketch Engine.
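
The near-duplicate removal step mentioned in the abstract can be illustrated with a minimal sketch. It assumes word n-gram shingling with a Jaccard-overlap threshold, which is one common way to detect near-duplicate pages; the abstract does not say which method the authors used. The function names, the shingle size, and the threshold are all hypothetical choices made for illustration.

```python
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, n: int = 5) -> Set[Shingle]:
    """Return the set of word n-grams (shingles) of a text.

    Whitespace tokenisation is an assumption made to keep the sketch short.
    """
    tokens = text.lower().split()
    if len(tokens) < n:
        # Short texts are represented by a single shingle (or none if empty).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard overlap of two shingle sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(
    docs: List[str], n: int = 5, threshold: float = 0.5
) -> List[str]:
    """Keep each document unless it overlaps an already kept one too much.

    Pairwise comparison is quadratic in the number of documents, so this
    only illustrates the idea; it would not scale to a billion-word crawl.
    """
    kept: List[str] = []
    kept_shingles: List[Set[Shingle]] = []
    for doc in docs:
        current = shingles(doc, n)
        if any(jaccard(current, seen) >= threshold for seen in kept_shingles):
            continue  # near-duplicate of a document we already kept
        kept.append(doc)
        kept_shingles.append(current)
    return kept
```

Applied to a list of page texts, `drop_near_duplicates` returns the first representative of each group of overlapping pages, in their original order. A crawl of the size described in the abstract would need a sub-quadratic scheme, such as comparing fingerprints of sampled shingles, rather than the all-pairs comparison used here.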