Context-Based Text Mining for Insights in Long Documents

Authors:
Hironori Takeuchi;Shiho Ogino;Hideo Watanabe;Yoshiko Shirata
Affiliations:
Tokyo Research Laboratory, IBM Japan, Ltd., IBM Research, Kanagawa, Japan;Tokyo Research Laboratory, IBM Japan, Ltd., IBM Research, Kanagawa, Japan;Tokyo Research Laboratory, IBM Japan, Ltd., IBM Research, Kanagawa, Japan;Graduate School of Business Science, University of Tsukuba, Tokyo, Japan
Venue:
PAKM '08 Proceedings of the 7th International Conference on Practical Aspects of Knowledge Management
Year:
2008

Abstract

We consider long documents and aim to find differences between document collections. In collections such as project status reports or annual reports, each document and each sentence tends to be relatively long. It can therefore be difficult to derive insights by looking only for concepts that a divergence metric marks as representative of the selected document collection. We propose an analysis approach based on contextual information. By extracting pairs of a topic word and a keyword and assessing how representative each pair is of the selected document collection, we are developing a method to extract insights from these long documents. Applying the proposed method to a comparison of the annual reports of bankrupt companies with those of sound companies, we were able to derive insights that could not be extracted with conventional methods.
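
The pair-based idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define how context is delimited or how representativeness is scored, so the sketch assumes that a topic word and a keyword form a pair when they occur in the same sentence, and it scores each pair with a smoothed ratio of document frequencies between the selected and the reference collection. The function names, the word lists, and the toy documents are all hypothetical.

```python
from collections import Counter
from itertools import product


def extract_pairs(document, topic_words, keywords):
    """Return the set of (topic word, keyword) pairs that share a sentence."""
    pairs = set()
    # Assumption: a sentence is the context unit; split naively on periods.
    for sentence in document.split("."):
        tokens = set(sentence.lower().split())
        pairs.update(product(tokens & topic_words, tokens & keywords))
    return pairs


def pair_document_frequency(collection, topic_words, keywords):
    """Count, for each pair, the number of documents that contain it."""
    counts = Counter()
    for document in collection:
        counts.update(extract_pairs(document, topic_words, keywords))
    return counts


def representativeness(selected, reference, topic_words, keywords, smoothing=1.0):
    """Rank pairs found in `selected` by a smoothed document-frequency ratio.

    Assumption: this ratio is a simple stand-in, not the paper's metric.
    """
    sel = pair_document_frequency(selected, topic_words, keywords)
    ref = pair_document_frequency(reference, topic_words, keywords)
    scores = {}
    for pair in sel:
        sel_rate = (sel[pair] + smoothing) / (len(selected) + smoothing)
        ref_rate = (ref[pair] + smoothing) / (len(reference) + smoothing)
        scores[pair] = sel_rate / ref_rate
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy documents invented for illustration only; they are not data from the paper.
bankrupt = [
    "Sales decreased sharply. Debt increased again.",
    "Debt increased. Dividend unchanged.",
]
sound = [
    "Sales increased. Dividend increased.",
    "Debt decreased. Sales increased.",
]
topics = {"sales", "debt", "dividend"}
keys = {"increased", "decreased", "unchanged"}

ranking = representativeness(bankrupt, sound, topics, keys)
for pair, score in ranking:
    print(pair, round(score, 2))
```

On this toy input the pair (debt, increased) ranks first, because it appears in both selected documents and in neither reference document. Neither "debt" nor "increased" is distinctive on its own here, since both words occur in both collections. That is the kind of difference a pair-level view may surface where a single-word view would not.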