Web-based models for natural language processing

Authors:
Mirella Lapata; Frank Keller
Affiliations:
University of Edinburgh, Edinburgh, UK;University of Edinburgh, Edinburgh, UK
Venue:
ACM Transactions on Speech and Language Processing (TSLP)
Year:
2005

Citing 38
Cited 60

Abstract

Previous work demonstrated that Web counts can be used to approximate bigram counts, suggesting that Web-based frequencies should be useful for a wide variety of Natural Language Processing (NLP) tasks. However, only a limited number of tasks have so far been tested using Web-scale data sets. The present article overcomes this limitation by systematically investigating the performance of Web-based models for several NLP tasks, covering both syntax and semantics, both generation and analysis, and a wider range of n-grams and parts of speech than have been previously explored. For the majority of our tasks, we find that simple, unsupervised models perform better when n-gram counts are obtained from the Web rather than from a large corpus. In some cases, performance can be improved further by using backoff or interpolation techniques that combine Web counts and corpus counts. However, unsupervised Web-based models generally fail to outperform supervised state-of-the-art models trained on smaller corpora. We argue that Web-based models should therefore be used as a baseline for, rather than an alternative to, standard supervised models.
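
To make the idea of combining Web counts and corpus counts concrete, the following is a minimal sketch of a backoff scheme and a linear interpolation scheme applied to a candidate-selection task. It is an illustration under stated assumptions, not the authors' implementation: the function names, the threshold, the weight `lam`, the use of relative frequencies, and the direction of the backoff (corpus first, Web as fallback) are all hypothetical choices that the abstract does not specify.

```python
from typing import Callable, Dict, Sequence

Counts = Dict[str, float]


def backoff_score(ngram: str, web: Counts, corpus: Counts,
                  web_total: float, corpus_total: float,
                  threshold: float = 0.0) -> float:
    """Backoff (assumed form): trust the corpus relative frequency when the
    corpus count exceeds `threshold`; otherwise fall back to the Web."""
    corpus_count = corpus.get(ngram, 0.0)
    if corpus_count > threshold:
        return corpus_count / corpus_total
    return web.get(ngram, 0.0) / web_total


def interpolated_score(ngram: str, web: Counts, corpus: Counts,
                       web_total: float, corpus_total: float,
                       lam: float = 0.5) -> float:
    """Linear interpolation (assumed form): weighted sum of the Web and
    corpus relative frequencies, with weight `lam` on the Web estimate."""
    web_rel = web.get(ngram, 0.0) / web_total
    corpus_rel = corpus.get(ngram, 0.0) / corpus_total
    return lam * web_rel + (1.0 - lam) * corpus_rel


def choose(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Simple unsupervised decision rule: pick the highest-scoring candidate."""
    return max(candidates, key=score)
```

In a setting like this, `choose` would be called with the competing n-grams for a task (for example, alternative spellings or alternative adjective orders) and one of the two scoring functions. The weight and threshold would normally be tuned on held-out data rather than fixed by hand.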