The present study summarizes corpus-based research on linguistic characteristics from several different structural levels, in English as well as other languages, showing that register variation is inherent in natural language. It further argues that, due to the importance and systematicity of the linguistic differences among registers, diversified corpora representing a broad range of register variation are required as the basis for general language studies.

First, the extent of cross-register differences is illustrated through individual grammatical and lexical features; these register differences are also important for probabilistic part-of-speech taggers and syntactic parsers, because the probabilities associated with grammatically ambiguous forms are often markedly different across registers. Then, corpus-based multidimensional analyses of English are summarized, showing that linguistic features from several structural levels function together as underlying dimensions of variation, with each dimension defining a different set of linguistic relations among registers. Finally, the paper discusses how such analyses, based on register-diversified corpora, can be used to address two current issues in computational linguistics: the automatic classification of texts into register categories and cross-linguistic comparisons of register variation.