Clustering of Distributions: A Case of Patent Citations

  • Authors:
  • Nataša Kejžar; Simona Korenjak-Černe; Vladimir Batagelj

  • Affiliations:
  • University of Ljubljana, Faculty of Medicine, Institute of Biostatistics and Medical Informatics, IBMI, Vrazov trg 2, 1000, Ljubljana, Slovenia; University of Ljubljana, Faculty of Economics, Department of Statistics, Ljubljana, Slovenia; University of Ljubljana, Faculty of Mathematics and Physics, Department of Mathematics, Ljubljana, Slovenia

  • Venue:
  • Journal of Classification
  • Year:
  • 2011

Abstract

Data units are often described by discrete distributions (for example, a work described by the distribution of its citations over time, or a population pyramid described by an age-sex distribution). When the set of such units is very large, appropriate clustering methods can reveal the typical patterns hidden in the data. In this paper we present an adapted leaders method combined with a compatible adapted agglomerative hierarchical method; both are based on a relative error measure between a unit and the representative (leader) of its cluster. The proposed approach is illustrated on citation distributions derived from a data set of US patents from 1980 to 1999. These new methods were developed because clustering units described by distributions with the classical k-means method yields patterns with single high peaks, each corresponding to a single year, and these patterns prevail over the other distribution shapes also present in the data. Compared with the centers in the k-means method, the cluster representatives obtained with the proposed methods better detect the typical distribution shapes of the units. The main cluster types obtained for different sets of units show three main patterns: patents with an early peak of importance to the community, patents with a late peak, and patents whose importance increases slowly throughout the time period.
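
The abstract does not state the formula of the relative error measure, so the following is only a minimal sketch of a leaders-type iteration under an assumed form: the error between a unit x and a leader t is taken to be the sum over components of ((x_j - t_j) / t_j)^2. Under that assumption, the leader minimizing the total error of a cluster has components t_j = sum(x_j^2) / sum(x_j). The function names, the toy data, and the choice of initial leaders are all hypothetical and are not taken from the paper.

```python
def relative_error(x, t):
    """Assumed squared relative error between a unit x and a leader t."""
    total = 0.0
    for xi, ti in zip(x, t):
        if ti == 0:
            # A zero leader component fits only a zero unit component.
            if xi != 0:
                return float("inf")
            continue
        total += ((xi - ti) / ti) ** 2
    return total


def update_leader(units):
    """Leader minimizing the summed assumed error: t_j = sum(x_j^2) / sum(x_j).

    Note: the result is generally not normalized to sum to 1.
    """
    leader = []
    for j in range(len(units[0])):
        s1 = sum(u[j] for u in units)
        s2 = sum(u[j] ** 2 for u in units)
        leader.append(s2 / s1 if s1 > 0 else 0.0)
    return leader


def leaders_method(units, initial_leaders, max_iter=100):
    """Alternate assignment and leader update until the partition is stable."""
    leaders = [list(t) for t in initial_leaders]
    assignment = None
    for _ in range(max_iter):
        # Assign every unit to the leader with the smallest error.
        new_assignment = [
            min(range(len(leaders)), key=lambda c: relative_error(u, leaders[c]))
            for u in units
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recompute the leader of every non-empty cluster.
        for c in range(len(leaders)):
            members = [u for u, a in zip(units, assignment) if a == c]
            if members:
                leaders[c] = update_leader(members)
    return assignment, leaders


# Toy distributions over three periods: two early-peaked, two late-peaked.
units = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7], [0.1, 0.3, 0.6]]
assignment, leaders = leaders_method(units, [units[0], units[2]])
```

Like k-means, an iteration of this kind converges only to a local optimum, so the result depends on the initial leaders. In the toy call above, one early-peaked and one late-peaked unit are chosen as starting leaders on purpose.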