Recursive data mining for role identification

Authors:
Vineet Chaoji;Apirak Hoonlor;Boleslaw K. Szymanski
Affiliations:
Rensselaer Polytechnic Institute, Troy, New York;Rensselaer Polytechnic Institute, Troy, New York;Rensselaer Polytechnic Institute, Troy, New York
Venue:
CSTST '08 Proceedings of the 5th international conference on Soft computing as transdisciplinary science and technology
Year:
2008

Citing 6
Cited 1

Text Categorization with Suport Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
An introduction to variable and feature selection

The Journal of Machine Learning Research
Improving accuracy in word class tagging through the combination of machine learning systems

Computational Linguistics
A stochastic parts program and noun phrase parser for unrestricted text

ANLC '88 Proceedings of the second conference on Applied natural language processing
Identifying hierarchical structure in sequences: a linear-time algorithm

Journal of Artificial Intelligence Research
VOGUE: a novel variable order-gap state machine for modeling sequences

PKDD'06 Proceedings of the 10th European conference on Principle and Practice of Knowledge Discovery in Databases

VOGUE: A variable order hidden Markov model with duration based on frequent sequence mining

ACM Transactions on Knowledge Discovery from Data (TKDD)

Quantified Score

Hi-index	0.00

Visualization

Abstract

We present a text mining approach that enables an extension of a standard authorship assessment problem (the problem in which an author of a text needs to be established) to role identification in communications within some Internet community. More precisely, we want to recognize a group of authors communicating in a specific role within such a community rather than a single author. The challenge here is that the same author may participate in different roles in communications within the group, in each role having different authors as peers. An additional challenge of our problem is the length of communications. Each individual exchange in our intended domain, communications within an Internet community, is relatively short, in the order of several dozens of words, so standard text mining approaches may fail. An example of such a problem is recognizing roles in a collection of emails from an organization in which middle level managers communicate both with superiors and subordinates. To validate our approach we use the Enron email dataset which is such a collection. Our approach is based on discovering patterns at varying degrees of abstraction in a hierarchical fashion. Such discovery process allows for certain degree of approximation in matching patterns, which is necessary for capturing non-trivial structures in realistic datasets. The discovered patterns are used as features to build efficient classifiers. Due to the nature of the pattern discovery process, we call our approach Recursive Data Mining. The results show that a classifier that uses the dominant patterns discovered by Recursive Data Mining performs well in role detection.