The Web contains vast amounts of linguistic data. One key issue for linguists and language technologists is how to access it. Commercial search engines give highly compromised access. An alternative is to crawl the Web ourselves, which also allows us to remove duplicates and near-duplicates, navigational material, and a range of other kinds of non-linguistic matter. We can also tokenize, lemmatize, and part-of-speech tag the corpus, and load the data into a corpus query tool that supports sophisticated linguistic queries. We have now done this for German and Italian, with corpus sizes of over 1 billion words in each case. We provide Web access to the corpora in our query tool, the Sketch Engine.
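The abstract does not say how duplicates and near-duplicates are detected. As an illustration only, one common approach is to compare documents by their overlapping word n-grams ("shingles") and drop any page whose Jaccard similarity to an already kept page exceeds a threshold. The sketch below assumes this technique; the function names, the shingle size `n=3`, and the threshold `0.5` are hypothetical choices, not values from the paper.

```python
import re


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        # Very short texts: treat the whole token sequence as one shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, n=3, threshold=0.5):
    """Keep each document unless it is too similar to one already kept.

    Assumption: the threshold and shingle size are illustrative only.
    """
    kept = []  # list of (document, shingle set) pairs
    for doc in docs:
        sh = shingles(doc, n)
        if all(jaccard(sh, other) < threshold for _, other in kept):
            kept.append((doc, sh))
    return [doc for doc, _ in kept]


pages = [
    "Der Hund läuft schnell durch den Park",
    "Der Hund läuft schnell durch den Park heute",  # near-duplicate of the first
    "La pasta è pronta per la cena",
]
unique_pages = drop_near_duplicates(pages)
```

This pairwise comparison is quadratic in the number of documents, so it only suits small samples; a corpus of the size described here would need a scalable variant such as fingerprint sampling or hashing of the shingle sets.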