Comparing corpora using frequency profiling

Authors:
Paul Rayson;Roger Garside
Affiliations:
Lancaster University, Lancaster, UK;Lancaster University, Lancaster, UK
Venue:
WCC '00 Proceedings of the workshop on Comparing corpora - Volume 9
Year:
2000

Abstract

This paper describes a method of comparing corpora using frequency profiling. The method can be used to discover key words that differentiate one corpus from another. Applied to annotated corpora, it can also discover key grammatical or word-sense categories. It offers a quick way in to the differences between corpora, and is shown to have applications in the study of social differentiation in the use of English vocabulary, the profiling of learner English, and document analysis in the software engineering process.
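
The abstract does not name the statistic behind the profiling, so the following is only a minimal sketch of how such a comparison might look, assuming a log-likelihood ratio computed over a two-corpus contingency table. The function names (`frequency_profile`, `log_likelihood`, `key_items`) and the lowercasing step are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter


def frequency_profile(tokens):
    """Count how often each item occurs (lowercasing is an assumption)."""
    return Counter(t.lower() for t in tokens)


def log_likelihood(a, b, c, d):
    """Log-likelihood score for one item.

    a, b: frequency of the item in corpus 1 and corpus 2
    c, d: total number of tokens in corpus 1 and corpus 2 (both > 0)
    """
    # Expected frequencies if the item were spread evenly across both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:  # a zero count contributes nothing to the sum
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def key_items(tokens1, tokens2, top_n=10):
    """Rank items by how strongly their frequency differs between corpora."""
    p1, p2 = frequency_profile(tokens1), frequency_profile(tokens2)
    c, d = sum(p1.values()), sum(p2.values())
    scored = []
    for item in set(p1) | set(p2):
        a, b = p1[item], p2[item]
        # True if the item is relatively more frequent in corpus 1.
        overused_in_1 = a / c > b / d
        scored.append((item, a, b, log_likelihood(a, b, c, d), overused_in_1))
    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda row: (-row[3], row[0]))
    return scored[:top_n]


if __name__ == "__main__":
    corpus1 = "the plane the runway the tower".split()
    corpus2 = "the book the page the shelf".split()
    for row in key_items(corpus1, corpus2):
        print(row)
```

Because the functions only count and compare items, the same sketch could be run over sequences of part-of-speech or semantic tags instead of words, which is one way to read the abstract's point about key grammatical or word-sense categories in annotated corpora.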