Building a topic hierarchy using the bag-of-related-words representation

Authors:
Rafael Geraldeli Rossi;Solange Oliveira Rezende
Affiliations:
Institute of Mathematics and Computer Science - University of São Paulo, São Carlos, Brazil;Institute of Mathematics and Computer Science - University of São Paulo, São Carlos, Brazil
Venue:
Proceedings of the 11th ACM symposium on Document engineering
Year:
2011

Abstract

A simple and intuitive way to organize a huge document collection is with a topic hierarchy. A topic hierarchy is generally built automatically in two steps: 1) hierarchical document clustering and 2) cluster labeling. Both steps depend on a good textual document representation. The bag-of-words is the usual way to represent text collections. In this representation, each document is a vector in which each word of the document collection corresponds to a dimension (feature). This approach has well-known problems, such as the high dimensionality and sparsity of the data. Moreover, most concepts are composed of more than one word, such as "document engineering" or "text mining". This paper proposes an approach called bag-of-related-words, which generates features composed of sets of related words and has a lower dimensionality than the bag-of-words. The features are extracted from each textual document of a collection using association rules. We analyze different ways to map a document into transactions so that association rules can be extracted, as well as interest measures to prune the number of features. To evaluate how much the proposed approach can aid topic hierarchy building, we carried out an objective evaluation of the clustering structure and a subjective evaluation of the topic hierarchies. All results were compared with those of the bag-of-words. The results showed that the proposed representation is better than the bag-of-words for building topic hierarchies.
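
The idea of extracting features from each document with association rules can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not fix the details, so everything specific here is an assumption: each sentence is treated as one transaction, only two-word itemsets are considered, support and confidence are the only interest measures, the thresholds are arbitrary, no stop-word removal is done, and all function names are hypothetical. A brute-force pair count stands in for a real rule miner such as Apriori.

```python
from itertools import combinations
import re


def sentence_transactions(text):
    """Map a document to transactions: one set of lowercase words per sentence.

    Assumption: sentences are the transaction unit; other mappings are possible.
    """
    transactions = []
    for sentence in re.split(r"[.!?]+", text):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if words:
            transactions.append(words)
    return transactions


def related_word_features(text, min_support=0.5, min_confidence=0.75):
    """Return two-word features backed by at least one association rule.

    The thresholds are illustrative defaults, not values from the paper.
    """
    transactions = sentence_transactions(text)
    n = len(transactions)
    if n == 0:
        return set()

    def support(itemset):
        # Fraction of transactions that contain every word in the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    vocabulary = set().union(*transactions)
    frequent_words = {w for w in vocabulary if support({w}) >= min_support}

    features = set()
    for a, b in combinations(sorted(frequent_words), 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue
        # Confidence of the rules a -> b and b -> a.
        conf_ab = pair_support / support({a})
        conf_ba = pair_support / support({b})
        if max(conf_ab, conf_ba) >= min_confidence:
            features.add(f"{a}_{b}")
    return features


def bag_of_related_words(documents, **thresholds):
    """Build a binary document-by-feature matrix from per-document features."""
    doc_features = [related_word_features(d, **thresholds) for d in documents]
    feature_index = sorted(set().union(*doc_features))
    matrix = [[1 if f in feats else 0 for f in feature_index]
              for feats in doc_features]
    return feature_index, matrix
```

Because rules are mined inside each document rather than across the collection, a pair of words becomes a feature only where the words actually co-occur often, which is one plausible reason the resulting feature space can be smaller than a bag-of-words vocabulary.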