Comparing corpora using frequency profiling

Authors: Paul Rayson; Roger Garside
Affiliations: Lancaster University, Lancaster, UK (both authors)
Venue: CompareCorpora '00, Proceedings of the Workshop on Comparing Corpora
Year: 2000

Abstract

This paper describes a method for comparing corpora using frequency profiling. The method can be used to discover key words that differentiate one corpus from another. Applied to annotated corpora, it can also discover key grammatical or word-sense categories. It offers a quick way in to finding the differences between corpora, and is shown to have applications in the study of social differentiation in the use of English vocabulary, the profiling of learner English, and document analysis in the software engineering process.
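
The abstract does not say which statistic ranks the key words, so the following is only a minimal sketch of frequency profiling under an assumption: each item's frequency in two corpora is compared with a log-likelihood (G2) score, a common choice for this kind of comparison. The function names `log_likelihood` and `frequency_profile` are hypothetical, and both corpora are assumed to be non-empty, pre-tokenized lists.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood (G2) for one item.

    a, b: observed frequency of the item in corpus 1 and corpus 2.
    c, d: total number of tokens in corpus 1 and corpus 2.
    Assumption: this statistic is our choice; the abstract does not name one.
    """
    # Expected frequencies if the item were spread evenly across both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    # A zero observed count contributes nothing (limit of x*log(x) as x -> 0).
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def frequency_profile(tokens1, tokens2):
    """Return (item, freq1, freq2, score) rows, highest score first."""
    f1, f2 = Counter(tokens1), Counter(tokens2)
    n1, n2 = len(tokens1), len(tokens2)
    rows = [
        (w, f1[w], f2[w], log_likelihood(f1[w], f2[w], n1, n2))
        for w in set(f1) | set(f2)
    ]
    # Sort by descending score; break ties alphabetically for stable output.
    rows.sort(key=lambda r: (-r[3], r[0]))
    return rows


if __name__ == "__main__":
    corpus1 = "the cat sat on the mat".split()
    corpus2 = "the dog sat on the log".split()
    for item, freq1, freq2, score in frequency_profile(corpus1, corpus2):
        print(f"{item:5s} {freq1:3d} {freq2:3d} {score:8.3f}")
```

Items at the top of the list are the ones whose relative frequencies differ most between the two corpora. Because the functions only count items, the same sketch applies to annotated corpora: passing lists of part-of-speech or word-sense tags instead of words would profile key categories rather than key words.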