Is the contextual information relevant in text clustering by compression?

Authors:
Ana Granados; David Camacho; Francisco Borja Rodríguez
Affiliations:
Escuela Politécnica Superior, Universidad Autónoma de Madrid, Spain;Escuela Politécnica Superior, Universidad Autónoma de Madrid, Spain;Escuela Politécnica Superior, Universidad Autónoma de Madrid, Spain
Venue:
Expert Systems with Applications: An International Journal
Year:
2012

Abstract

When analyzing data that have not yet been processed or filtered, it is usually apparent that not all the data are equally important: relevant data are commonly found surrounded by non-relevant data. This is the case for textual information because of its intrinsic nature: texts contain words that convey a great deal of information about the subject matter, alongside other words that carry little meaning or relevance. We believe that, although the non-relevant words are in principle less important than the relevant ones, the former constitute the substrate that supports the latter. Since this substrate is the context that surrounds the relevant information, we call it the contextual information. In this paper, we analyze how relevant the contextual information is in textual data in a clustering-by-compression scenario. We generate the contextual information by applying a distortion technique previously developed by the authors, one of whose main characteristics is that it maintains the contextual information. We compare this technique with three new distortion techniques that destroy the contextual information in different ways. The experimental results support our hypothesis that the contextual information is relevant, at least in the area of text clustering by compression.
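
As a rough illustration of the setting, the sketch below computes the normalized compression distance (NCD), the measure commonly used in clustering by compression, and applies a simple word-masking distortion. Several things here are assumptions rather than details taken from the abstract: the choice of `zlib` as the compressor, the function names `compressed_size`, `ncd` and `mask_words`, and above all the masking scheme itself. The abstract does not describe how the authors' distortion technique works, so `mask_words` is only a hypothetical example of a distortion that hides chosen words while keeping their positions and lengths.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compressor choice is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def mask_words(text: str, words_to_mask) -> str:
    """Hypothetical distortion: replace each listed word with asterisks of equal length.

    Word positions and lengths are preserved, so the overall shape of the
    text survives even though the masked words themselves are gone.
    """
    masked = set(w.lower() for w in words_to_mask)
    return " ".join(
        "*" * len(word) if word.lower() in masked else word
        for word in text.split(" ")
    )


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog " * 20
    doc_b = "compression based clustering groups similar documents together " * 20
    print("NCD(a, a):", ncd(doc_a, doc_a))
    print("NCD(a, b):", ncd(doc_a, doc_b))
    print(mask_words("the cat sat on the mat", {"the", "on"}))
```

A clustering experiment in this style would compute `ncd` for every pair of documents, before and after distortion, and feed the resulting distance matrix to a clustering algorithm. Which words to mask, and which clustering algorithm to use, are left open here because the abstract does not state them.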