A Generative Model for Statistical Determination of Information Content from Conversation Threads

Authors:
Yingjie Zhou; Malik Magdon-Ismail; William A. Wallace; Mark Goldberg
Affiliations:
Department of Decision Sciences and Engineering Systems, Rensselaer Polytechnic Institute, Troy, NY 12180; Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180; Department of Decision Sciences and Engineering Systems, Rensselaer Polytechnic Institute, Troy, NY 12180; Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180
Venue:
PAISI, PACCF and SOCO '08 Proceedings of the IEEE ISI 2008 PAISI, PACCF, and SOCO international workshops on Intelligence and Security Informatics
Year:
2008

Abstract

We present a generative model for determining the information content of a message without analyzing the message's content. Such a tool is useful for the automated analysis of the vast volume of online communication, which is extensively contaminated by uninformative content, spam, and broadcast messages; content analysis is not feasible in such a setting. We propose a purely statistical methodology for determining the information value of a message, which we call the Information Content Factor (ICF). Underlying our methodology is the definition of the information in a message as the message's ability to generate conversation. Because the model is generative, we can estimate the ICF of a message without prior information about the participants. We test our approach by using it to separate spam/broadcast messages from non-spam/non-broadcast messages. Our algorithms achieve 94% accuracy when tested against a human classifier who analyzed the content.
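
The idea of scoring a message by the conversation it generates, using reply structure alone, can be illustrated with a minimal sketch. This is not the paper's generative model or its ICF estimator, which the abstract does not specify. It is a crude stand-in that counts how many messages descend from each message in a reply tree, thresholds that count, and measures agreement with human labels. The function names, the `(msg_id, parent_id)` input format, the threshold, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict


def conversation_sizes(messages):
    """Return {msg_id: number of descendants in the reply tree}.

    `messages` is a list of (msg_id, parent_id) pairs; parent_id is None
    for a message that starts a thread. Assumes the replies form a tree
    (no cycles). No message text is inspected.
    """
    children = defaultdict(list)
    for msg_id, parent_id in messages:
        if parent_id is not None:
            children[parent_id].append(msg_id)

    sizes = {}

    def count(msg_id):
        # Each child contributes itself plus everything below it.
        if msg_id not in sizes:
            sizes[msg_id] = sum(1 + count(c) for c in children[msg_id])
        return sizes[msg_id]

    for msg_id, _ in messages:
        count(msg_id)
    return sizes


def classify(sizes, threshold=1):
    """Label a message informative if it generated >= threshold messages.

    The threshold is an arbitrary illustrative choice, not from the paper.
    """
    return {msg_id: size >= threshold for msg_id, size in sizes.items()}


def accuracy(predicted, human_labels):
    """Fraction of messages on which the prediction matches the human label."""
    agree = sum(predicted[m] == label for m, label in human_labels.items())
    return agree / len(human_labels)


# Hypothetical data: one thread (a -> b -> c, a -> d) and one broadcast "s".
messages = [("a", None), ("b", "a"), ("c", "b"), ("d", "a"), ("s", None)]
sizes = conversation_sizes(messages)
predicted = classify(sizes)

# Hypothetical human labels obtained by reading the content.
human = {"a": True, "b": True, "c": True, "d": False, "s": False}
score = accuracy(predicted, human)
```

In this toy data the broadcast message `s` draws no replies and is labeled uninformative, while the thread root `a` generates three messages. The leaf reply `c` shows the limit of so simple a proxy: it generates no further conversation, so it is scored as uninformative even though the hypothetical human label disagrees.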