Using register-diversified corpora for general language studies

Author: Douglas Biber
Affiliation: Northern Arizona University
Venue: Computational Linguistics - Special issue on using large corpora: II
Year: 1993

Abstract

The present study summarizes corpus-based research on linguistic characteristics from several different structural levels, in English as well as other languages, showing that register variation is inherent in natural language. It further argues that, because the linguistic differences among registers are important and systematic, general language studies require diversified corpora representing a broad range of register variation. First, the extent of cross-register differences is illustrated through individual grammatical and lexical features; these register differences also matter for probabilistic part-of-speech taggers and syntactic parsers, because the probabilities associated with grammatically ambiguous forms are often markedly different across registers. Then, corpus-based multidimensional analyses of English are summarized, showing that linguistic features from several structural levels function together as underlying dimensions of variation, with each dimension defining a different set of linguistic relations among registers. Finally, the paper discusses how such analyses, based on register-diversified corpora, can be used to address two current issues in computational linguistics: the automatic classification of texts into register categories and cross-linguistic comparisons of register variation.
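
The point about grammatically ambiguous forms can be made concrete with a small sketch. The Python below estimates, separately for each register, the relative frequency of each tag assigned to one ambiguous word form. Everything in it is an assumption made for illustration: the register names, the tag labels, the function name `tag_probabilities`, and above all the counts, which are invented and are not taken from the paper or from any real corpus.

```python
from collections import Counter, defaultdict

# Invented toy data (NOT from the paper or any real corpus):
# (register, word form, tag) triples for the ambiguous form "that".
TOY_TAGGED = [
    ("register_a", "that", "demonstrative"),
    ("register_a", "that", "demonstrative"),
    ("register_a", "that", "demonstrative"),
    ("register_a", "that", "complementizer"),
    ("register_b", "that", "demonstrative"),
    ("register_b", "that", "complementizer"),
    ("register_b", "that", "complementizer"),
    ("register_b", "that", "complementizer"),
]


def tag_probabilities(tagged, word):
    """Return {register: {tag: relative frequency}} for one word form."""
    counts = defaultdict(Counter)
    for register, form, tag in tagged:
        if form == word:
            counts[register][tag] += 1
    return {
        register: {tag: n / sum(c.values()) for tag, n in c.items()}
        for register, c in counts.items()
    }


if __name__ == "__main__":
    for register, dist in tag_probabilities(TOY_TAGGED, "that").items():
        print(register, dist)
```

With estimates like these, a tagger trained on only one register would carry that register's preferences into every other kind of text, which is the practical concern the abstract raises.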