TopCat: Data Mining for Topic Identification in a Text Corpus

Authors:
Chris Clifton;Robert Cooley;Jason Rennie
Affiliations:
IEEE;IEEE;-
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2004

Abstract

TopCat (Topic Categories) is a technique for identifying topics that recur across the articles in a text corpus. Natural language processing techniques identify the key entities in each article, so that an article can be represented as a set of items. This lets us treat the problem in a database/data mining context: identifying related groups of items. This paper presents a novel method for identifying related items that builds on traditional data mining techniques. Frequent itemsets are first generated from the groups of items, and clusters are then formed from those itemsets with a hypergraph partitioning scheme. We evaluate the technique against a manually categorized ground-truth news corpus; the evaluation shows that it is effective at identifying topics in collections of news articles.
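
The two-stage idea in the abstract (articles as sets of items, frequent itemsets, then clustering) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the entities have already been extracted, counts itemsets by brute-force enumeration rather than an efficient association-rule miner, and merges overlapping itemsets as a crude stand-in for the hypergraph partitioning scheme. The function names, the `min_support` and `max_size` parameters, and the toy articles are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support, max_size=3):
    """Count every itemset of up to max_size entities and keep those
    that occur in at least min_support articles.

    Brute-force enumeration: fine for a toy corpus, too slow for a real one.
    """
    counts = Counter()
    for entities in articles:
        items = sorted(set(entities))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


def merge_overlapping(itemsets):
    """Group itemsets that share at least one entity into candidate topics.

    Assumption: this connected-components merge only stands in for the
    hypergraph partitioning step; it is much cruder than real partitioning.
    """
    topics = []
    for itemset in itemsets:
        merged = set(itemset)
        remaining = []
        for topic in topics:
            if topic & merged:
                merged |= topic  # absorb every existing topic that overlaps
            else:
                remaining.append(topic)
        remaining.append(merged)
        topics = remaining
    return topics


# Toy corpus (invented): each article is the set of entities found in it.
articles = [
    {"Acme", "Globex", "merger"},
    {"Acme", "merger", "regulator"},
    {"Acme", "Globex", "merger"},
    {"Riverton", "flood", "levee"},
    {"Riverton", "levee", "governor"},
    {"Riverton", "flood", "levee"},
]

itemsets = frequent_itemsets(articles, min_support=2)
# Single entities say little about co-occurrence, so cluster only larger itemsets.
topics = merge_overlapping(s for s in itemsets if len(s) >= 2)
for topic in topics:
    print(sorted(topic))
```

On this toy corpus the sketch yields two candidate topics, one per story line, and entities that appear in only one article (such as `regulator`) fall below the support threshold. On a real corpus a simple overlap merge would tend to chain unrelated topics together through shared entities, so a more selective clustering step is needed there.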