Analyzing word frequencies in large text corpora using inter-arrival times and bootstrapping

Authors:
Jefrey Lijffijt;Panagiotis Papapetrou;Kai Puolamäki;Heikki Mannila
Affiliations:
Department of Information and Computer Science, Aalto University, Helsinki Institute for Information Technology, Finland (all authors)
Venue:
ECML PKDD'11 Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part II
Year:
2011

Abstract

Comparing frequency counts over texts or corpora is an important task in many applications and scientific disciplines. Given a text corpus, we want to test a hypothesis, such as "word X is frequent", "word X has become more frequent over time", or "word X is more frequent in male than in female speech". For this purpose we need a null model of word frequencies. The commonly used bag-of-words model, which corresponds to a Bernoulli process with a fixed parameter, does not account for any structure present in natural languages. Using this model for word frequencies results in large numbers of words being reported as unexpectedly frequent. We address how to take into account the inherent occurrence patterns of words in significance testing of word frequencies. Based on studies of words in two large corpora, we propose two methods for modeling word frequencies that both take into account the occurrence patterns of words and go beyond the bag-of-words assumption. The first method models word frequencies based on the spatial distribution of individual words in the language. The second method is based on bootstrapping and takes into account only word frequency at the text level. The proposed methods are compared to the current gold standard in a series of experiments on both corpora. We find that words obey different spatial patterns in the language, ranging from bursty to non-bursty/uniform, independent of their frequency, showing that the traditional approach leads to many false positives.
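
The contrast between the bag-of-words null model and a text-level bootstrap can be illustrated with a small sketch. This is not the authors' exact procedure; it only assumes what the abstract states, namely that the bag-of-words model treats each token as an independent Bernoulli trial and that the bootstrap works with word frequencies per text. The function names (`binomial_tail`, `pooled_frequency`, `bootstrap_p_value`, `make_text`), the tie handling, the smoothing of the p-value, and the toy corpora are all assumptions made for illustration.

```python
import random
from math import exp, lgamma, log


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space.

    Under the bag-of-words assumption, the count of a word among n tokens
    is binomial, so this is the one-sided p-value for observing k hits.
    """
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_term = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                    + i * log(p) + (n - i) * log(1.0 - p))
        total += exp(log_term)
    return min(total, 1.0)


def pooled_frequency(texts, word):
    """Relative frequency of `word` over all tokens of the given texts."""
    total = sum(len(t) for t in texts)
    hits = sum(t.count(word) for t in texts)
    return hits / total if total else 0.0


def bootstrap_p_value(word, corpus_s, corpus_t, n_resamples=2000, seed=0):
    """One-sided p-value for 'word is more frequent in S than in T'.

    Whole texts are resampled with replacement (an assumed scheme), so a
    word concentrated in a few texts yields highly variable frequencies.
    Ties count as half; the +1 terms keep the p-value above zero.
    """
    rng = random.Random(seed)
    score = 0.0
    for _ in range(n_resamples):
        sample_s = rng.choices(corpus_s, k=len(corpus_s))
        sample_t = rng.choices(corpus_t, k=len(corpus_t))
        freq_s = pooled_frequency(sample_s, word)
        freq_t = pooled_frequency(sample_t, word)
        if freq_s < freq_t:
            score += 1.0
        elif freq_s == freq_t:
            score += 0.5
    return (score + 1.0) / (n_resamples + 1.0)


def make_text(n_hits, length=100):
    """Hypothetical toy text: `n_hits` copies of 'x' padded with filler."""
    return ["x"] * n_hits + ["filler"] * (length - n_hits)


if __name__ == "__main__":
    # Bursty corpus S: all 50 occurrences of 'x' sit in one of ten texts.
    corpus_s = [make_text(50)] + [make_text(0) for _ in range(9)]
    # Corpus T: 'x' occurs once in each of ten texts.
    corpus_t = [make_text(1) for _ in range(10)]

    null_rate = pooled_frequency(corpus_t, "x")
    print("bag-of-words p:", binomial_tail(50, 1000, null_rate))
    print("bootstrap p:   ", bootstrap_p_value("x", corpus_s, corpus_t))
```

In this toy setting the binomial test treats the 50 occurrences as independent evidence and reports a very small p-value, while the bootstrap p-value stays large because a resample of S often misses the single text that contains the word. That is the kind of false positive the abstract attributes to the bag-of-words assumption for bursty words.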