We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust to the choice of compressor. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics, we present new evidence for major questions in mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis.
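The first step described above, a distance computed from compressed lengths taken singly and in pairwise concatenation, can be illustrated with a minimal Python sketch. It assumes the commonly cited form NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length; the abstract does not spell out this formula. The function names (`compressed_len`, `ncd`, `distance_matrix`) and the choice of `zlib` and `bz2` as compressors are illustrative assumptions, not the authors' public software, and the quartet-tree step is not reproduced here.

```python
import bz2
import zlib


def compressed_len(data: bytes, compressor=zlib.compress) -> int:
    """Length in bytes of `data` after compression; stands in for C(.)."""
    return len(compressor(data))


def ncd(x: bytes, y: bytes, compressor=zlib.compress) -> float:
    """Normalized compression distance (assumed standard form, see lead-in)."""
    cx = compressed_len(x, compressor)
    cy = compressed_len(y, compressor)
    cxy = compressed_len(x + y, compressor)  # pairwise concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(items, compressor=zlib.compress):
    """Symmetric matrix of pairwise NCD values; the diagonal is fixed at 0.0."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = ncd(items[i], items[j], compressor)
    return matrix


if __name__ == "__main__":
    # Hypothetical toy inputs: two related texts and one unrelated text.
    docs = [
        b"the quick brown fox jumps over the lazy dog " * 50,
        b"the quick brown fox leaps over the lazy cat " * 50,
        b"0123456789abcdefghij ZYXWVUTSRQPONMLKJIHGFEDCBA!? " * 50,
    ]
    for name, comp in (("zlib", zlib.compress), ("bz2", bz2.compress)):
        print(name)
        for row in distance_matrix(docs, comp):
            print("  " + " ".join(f"{value:.3f}" for value in row))
```

The resulting matrix could then be passed to any hierarchical clustering routine. Swapping the `compressor` argument is a simple way to probe how sensitive the distances are to the compressor on a given data set.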