This vignette is designed to be used with the ‘ppclust’ package. You can install the most recent version of the package from CRAN with the following command:
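install.packages("ppclust")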
If you have already installed ‘ppclust’, you can load it into the R working environment with the following command:
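library(ppclust)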
For visualization of the clustering results, some examples in this vignette use functions from cluster analysis packages such as ‘cluster’, ‘fclust’ and ‘factoextra’. Therefore, these packages should be loaded into the R working environment with the following commands:
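library(cluster)
library(fclust)
library(factoextra)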
We demonstrate UPFC on the Iris data set (Anderson, 1935). It is a real data set of four features (Sepal.Length, Sepal.Width, Petal.Length and Petal.Width) measured on three iris species (given in the last column as the class variable). This four-dimensional data set contains a total of 150 observations, 50 samples from each of the three species. One of the three natural clusters (Class 1) is linearly well-separated from the other two, while Classes 2 and 3 overlap to some extent, as seen in the plot below.
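A minimal setup loads the data and extracts the feature columns (the object name x is an assumption carried through the examples below):
data(iris)
x <- iris[,-5]   # drop the class variable (Species); keep the four numeric features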
Plot the data by the classes of the iris species:
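pairs(x, col=iris$Species)   # scatterplot matrix of the feature pairs, colored by species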
Unsupervised Possibilistic Fuzzy C-Means (UPFC) is an extension of the Possibilistic Clustering Algorithm (PCA) by Yang & Wu (2006). Wu et al (2010) reported that PCA is very sensitive to initialization and sometimes generates coincident clusters. Inspired by the PFCM algorithm of Pal et al (2005), they proposed UPFC to overcome this problem. UPFC produces both possibilistic and probabilistic memberships simultaneously, and overcomes the noise sensitivity problem of FCM and the coincident clusters problem of PCA.
In order to start UPFC, an initialization step is required to build the initial cluster prototypes matrix and the fuzzy membership degrees matrix. Although this task is usually performed as part of the clustering algorithm, the initial prototypes and memberships can also be input directly by the user.
UPFC is usually started by using an integer specifying the number of clusters. In this case, the prototypes matrix is internally generated by using any of the prototype initialization algorithms included in the package ‘inaparc’. The default initialization technique in the current version of the upfc function is K-means++. In the following code block, UPFC runs for three clusters with the default values of the remaining arguments:
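res.upfc <- upfc(x, centers=3)   # prototypes and memberships are initialized internally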
Although it is usual to run UPFC with a pre-determined number of clusters, users may sometimes wish to start the algorithm from a specific cluster prototypes matrix. In this case, UPFC uses the user-specified prototypes matrix instead of an internally generated one, as follows:
# Initial prototypes: one row per cluster, one column per feature
v0 <- matrix(nrow=3, ncol=4,
     c(5.0, 3.4, 1.4, 0.3,
       6.7, 3.0, 5.6, 2.1,
       5.8, 2.7, 4.3, 1.4),
     byrow=TRUE)
print(v0)
## [,1] [,2] [,3] [,4]
## [1,] 5.0 3.4 1.4 0.3
## [2,] 6.7 3.0 5.6 2.1
## [3,] 5.8 2.7 4.3 1.4
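The matrix v0 is then passed to upfc through the centers argument (a minimal sketch, assuming the data matrix x prepared above):
res.upfc <- upfc(x, centers=v0)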
In the following code block, there is another example of initializing the cluster prototypes for three clusters. Here v, the cluster prototypes matrix, is initialized by using the K-means++ initialization algorithm (kmpp) in the R package ‘inaparc’:
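v <- inaparc::kmpp(x, k=3)$v   # K-means++ seeds; the prototypes are returned in the v component
print(v)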
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Cl.1 7.7 2.6 6.9 2.3
## Cl.2 5.4 3.4 1.5 0.4
## Cl.3 6.4 3.2 4.5 1.5
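The resulting prototypes matrix v can then be used to start the algorithm:
res.upfc <- upfc(x, centers=v)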
There are many other techniques for the initialization of cluster prototype matrices in the R package ‘inaparc’. See the package documentation for other alternatives.
UPFC can alternatively be started with a user-specified membership degrees matrix. In this case, the function upfc is called with the argument memberships as follows:
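A sketch using the random membership initializer imembrand from ‘inaparc’ (assuming it returns the membership matrix in its u component):
u <- inaparc::imembrand(nrow(x), k=3)$u
res.upfc <- upfc(x, centers=3, memberships=u)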
In order to generate the matrices of cluster prototypes and membership degrees, users can choose any of the existing initialization algorithms in the R package ‘inaparc’. For this purpose, the arguments alginitv and alginitu can be specified for cluster prototypes and membership degrees, respectively.
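For example (the algorithm names below are taken from ‘inaparc’; see its documentation for the full list):
res.upfc <- upfc(x, centers=3, alginitv="kmpp", alginitu="imembrand")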
In general, the squared Euclidean distance metric is used with UPFC in order to compute the distances between the cluster centers and the data objects. Users can specify other distance metrics with the argument dmetric as follows:
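res.upfc <- upfc(x, centers=3, dmetric="euclidean")   # Euclidean instead of the default squared Euclidean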
Although the fuzzy exponent, the argument m, is usually chosen as 2 in most applications using PFCM, users can increase m to get more fuzzified results as follows:
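res.upfc <- upfc(x, centers=3, m=4)   # a larger fuzzifier gives more fuzzified memberships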
However, large values of m reduce the effect of the memberships on the prototypes, and the model behaves more like the PCM model (Pal et al, 2005). The typicality exponent, the argument eta, should usually be chosen as 2 for UPFC, but users can change eta as follows in order to obtain different clustering results:
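res.upfc <- upfc(x, centers=3, m=2, eta=3)   # eta=3 is an illustrative value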
In UPFC, the relative importance of the probabilistic and possibilistic parts of the objective function can be changed by using the a and b parameters, respectively. If \(b=0\) and \(\Omega_i=0\), UPFC behaves like FCM. Conversely, if \(a=0\), UPFC becomes equivalent to PCM. Although both arguments are set to 2 as the default value, users can change them as follows:
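For instance, to give more weight to the possibilistic part (the values are illustrative):
res.upfc <- upfc(x, centers=3, a=2, b=4)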
There are other input arguments available with the function upfc to control the run of the UPFC algorithm. For the details, see the upfc section in the package manual of ‘ppclust’.
The fuzzy membership degrees matrix is one of the main outputs of the function upfc:
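res.upfc$u[1:5,]   # fuzzy membership degrees of the first five data objects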
The typicality degrees matrix is another main output of the function upfc:
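res.upfc$t[1:5,]   # typicality degrees of the first five data objects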
Initial and final cluster prototypes matrices can be accessed as follows:
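print(res.upfc$v0)   # initial prototypes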
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Cluster 1 5.6 2.5 3.9 1.1
## Cluster 2 6.1 2.6 5.6 1.4
## Cluster 3 4.8 3.1 1.6 0.2
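print(res.upfc$v)    # final prototypes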
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Cluster 1 5.944440 2.791643 4.429674 1.4231306
## Cluster 2 6.629620 3.018968 5.478627 2.0051523
## Cluster 3 5.000576 3.411018 1.480558 0.2503315
The function upfc returns the clustering results as an instance of the ‘ppclust’ class. This object can be summarized with the summary method of the package ‘ppclust’ as follows:
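summary(res.upfc)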
## Summary for 'res.upfc'
##
## Number of data objects: 150
##
## Number of clusters: 3
##
## Crisp clustering vector:
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 2 2 2 2
## [112] 2 2 1 2 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2
## [149] 2 1
##
## Initial cluster prototypes:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Cluster 1 5.6 2.5 3.9 1.1
## Cluster 2 6.1 2.6 5.6 1.4
## Cluster 3 4.8 3.1 1.6 0.2
##
## Final cluster prototypes:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Cluster 1 5.944440 2.791643 4.429674 1.4231306
## Cluster 2 6.629620 3.018968 5.478627 2.0051523
## Cluster 3 5.000576 3.411018 1.480558 0.2503315
##
## Distance between the final cluster prototypes
## Cluster 1 Cluster 2
## Cluster 2 1.960201
## Cluster 3 11.347243 21.871439
##
## Difference between the initial and final cluster prototypes
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Cluster 1 0.3444400 0.2916432 0.5296736 0.32313065
## Cluster 2 0.5296201 0.4189683 -0.1213729 0.60515234
## Cluster 3 0.2005756 0.3110180 -0.1194416 0.05033154
##
## Root Mean Squared Deviations (RMSD): 0.7255796
## Mean Absolute Deviation (MAD): 5.127157
##
## Membership degrees matrix (top and bottom 5 rows):
## Cluster 1 Cluster 2 Cluster 3
## 1 0.000115388 3.6e-08 0.9797503
## 2 0.000122701 2.6e-08 0.8663850
## 3 0.000049248 8.0e-09 0.8784034
## 4 0.000107585 2.0e-08 0.8200776
## 5 0.000089677 2.6e-08 0.9664585
## ...
## Cluster 1 Cluster 2 Cluster 3
## 146 0.2214961 0.8784889 1.030e-07
## 147 0.5583624 0.6241410 1.453e-06
## 148 0.3772716 0.9302419 4.020e-07
## 149 0.1946746 0.7243457 1.110e-07
## 150 0.6153330 0.5782990 3.483e-06
##
## Descriptive statistics for the membership degrees by clusters
## Size Min Q1 Mean Median Q3 Max
## Cluster 1 58 0.10571370 0.5836813 0.6754658 0.7131837 0.8045532 0.9806048
## Cluster 2 42 0.07319913 0.4725497 0.6552011 0.7293231 0.8715668 0.9708134
## Cluster 3 50 0.31374943 0.7305122 0.8161604 0.8715939 0.9185827 0.9976897
##
## Dunn's Fuzziness Coefficients:
## dunn_coeff normalized
## 0.6169534 0.4254301
##
## Within cluster sum of squares by cluster:
## 1 2 3
## 35.21241 29.10857 15.15100
## (between_SS / total_SS = 87.52%)
##
## Available components:
## [1] "u" "t" "v" "v0" "d"
## [6] "x" "cluster" "csize" "sumsqrs" "k"
## [11] "m" "eta" "a" "b" "beta"
## [16] "iter" "best.start" "func.val" "comp.time" "inpargs"
## [21] "algorithm" "call"
All available components of the ‘ppclust’ object are listed at the end of the summary. These elements can be accessed as attributes of the object. For example, the execution time of the run of UPFC is accessed as follows:
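res.upfc$comp.time   # elapsed computing time in seconds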
## [1] 0.586
In order to find an optimal solution, the function upfc can be started multiple times. As seen in the following code chunk, the argument nstart is used for this purpose:
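res.upfc <- upfc(x, centers=3, nstart=4)   # run the algorithm four times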
When multiple starts are performed, the user may want to keep either the initial cluster prototypes or the initial membership degrees unchanged between the starts of the algorithm. In this case, the arguments fixcent and fixmemb are used for fixing the initial cluster prototypes matrix and the initial membership degrees matrix, respectively. Note that the prototypes and the memberships cannot both be fixed at the same time.
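For example, to keep the initial cluster prototypes fixed between the starts:
res.upfc <- upfc(x, centers=3, nstart=4, fixcent=TRUE)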
The clustering result contains outputs providing information about components such as the objective function values, the number of iterations, and the computing time obtained with each start of the algorithm. The following code chunk demonstrates these outputs:
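res.upfc$func.val    # objective function value of each start
res.upfc$iter        # number of iterations of each start
res.upfc$comp.time   # computing time of each start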
## [1] -186.3218 -149.4653 -186.3218 -186.3218
## [1] 57 132 59 60
## [1] 0.566 1.418 0.641 0.719
Among the outputs from the successive starts of the algorithm, the best solution is the one obtained from the start giving the minimum value of the objective function; it is stored as the final clustering result of the multiple starts of UPFC:
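res.upfc$best.start   # index of the start giving the best solution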
## [1] 4
As described above for the single start of UPFC, the result of multiple starts of UPFC is displayed by using the summary method of the ‘ppclust’ package:
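summary(res.upfc)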
## Summary for 'res.upfc'
##
## Number of data objects: 150
##
## Number of clusters: 3
##
## Crisp clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 3 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 3 3 3 3 3 2 3 3 3 2 3 3 3 3 3
## [149] 3 2
##
## Initial cluster prototypes:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Cluster 1 4.9 3.1 1.5 0.2
## Cluster 2 5.6 3.0 4.5 1.5
## Cluster 3 6.3 3.3 6.0 2.5
##
## Final cluster prototypes:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Cluster 1 5.000576 3.411018 1.480558 0.2503315
## Cluster 2 5.944440 2.791643 4.429674 1.4231306
## Cluster 3 6.629620 3.018968 5.478627 2.0051523
##
## Distance between the final cluster prototypes
## Cluster 1 Cluster 2
## Cluster 2 11.347243
## Cluster 3 21.871439 1.960201
##
## Difference between the initial and final cluster prototypes
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Cluster 1 0.1005756 0.3110180 -0.01944160 0.05033154
## Cluster 2 0.3444400 -0.2083568 -0.07032643 -0.07686935
## Cluster 3 0.3296201 -0.2810317 -0.52137286 -0.49484766
##
## Root Mean Squared Deviations (RMSD): 0.5735844
## Mean Absolute Deviation (MAD): 3.744309
##
## Membership degrees matrix (top and bottom 5 rows):
## Cluster 1 Cluster 2 Cluster 3
## 1 0.9797503 0.000115388 3.6e-08
## 2 0.8663850 0.000122701 2.6e-08
## 3 0.8784034 0.000049248 8.0e-09
## 4 0.8200776 0.000107585 2.0e-08
## 5 0.9664585 0.000089677 2.6e-08
## ...
## Cluster 1 Cluster 2 Cluster 3
## 146 1.030e-07 0.2214961 0.8784889
## 147 1.453e-06 0.5583624 0.6241410
## 148 4.020e-07 0.3772716 0.9302419
## 149 1.110e-07 0.1946746 0.7243457
## 150 3.483e-06 0.6153330 0.5782990
##
## Descriptive statistics for the membership degrees by clusters
## Size Min Q1 Mean Median Q3 Max
## Cluster 1 50 0.31374943 0.7305122 0.8161604 0.8715939 0.9185827 0.9976897
## Cluster 2 58 0.10571370 0.5836813 0.6754658 0.7131837 0.8045532 0.9806048
## Cluster 3 42 0.07319913 0.4725497 0.6552011 0.7293231 0.8715668 0.9708134
##
## Dunn's Fuzziness Coefficients:
## dunn_coeff normalized
## 0.6169534 0.4254301
##
## Within cluster sum of squares by cluster:
## 1 2 3
## 15.15100 35.21241 29.10857
## (between_SS / total_SS = 87.52%)
##
## Available components:
## [1] "u" "t" "v" "v0" "d"
## [6] "x" "cluster" "csize" "sumsqrs" "k"
## [11] "m" "eta" "a" "b" "beta"
## [16] "iter" "best.start" "func.val" "comp.time" "inpargs"
## [21] "algorithm" "call"
There are many ways of visually representing clustering results. One common technique is to display the clustering results using pairs of the features. The function plotcluster can be used to plot the clustering results as follows:
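A sketch of a basic call (cp selects the color palette and trans enables transparent colors, following the argument names in the ppclust manual):
plotcluster(res.upfc, cp=1, trans=TRUE)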
In the above figure, the plot shows the data objects whose typicality degrees are greater than 0.5, using color palette 1 and transparent colors. In order to filter the objects by different typicality degrees, the value of the tv argument can be changed as follows:
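plotcluster(res.upfc, cp=1, tv=0.7, trans=TRUE)   # show only objects with typicality above 0.7 (illustrative threshold)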
By default, the function plotcluster plots the typicality degrees for algorithms producing both fuzzy membership and typicality degrees, such as FPCM and PFCM as well as UPFC. In order to plot using the fuzzy memberships, the membership type argument mt should be used with the “u” option as follows:
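plotcluster(res.upfc, mt="u", cp=1, trans=TRUE)   # plot using the fuzzy membership degrees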
There are some nice versions of the cluster plots in other R packages. One of them is the function fviz_cluster of the package ‘factoextra’ (Kassambara & Mundt, 2017). In order to use this function with a fuzzy clustering result, the fuzzy clustering object of ppclust should first be converted to a kmeans object by using the function ppclust2 of the package ‘ppclust’, as shown in the first line of the code chunk below:
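A sketch of the conversion and plotting (the fviz_cluster options are illustrative):
res.upfc2 <- ppclust2(res.upfc, "kmeans")
fviz_cluster(res.upfc2, data=x, ellipse.type="convex", palette="jco", repel=TRUE)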
Users can also use the function clusplot in the package ‘cluster’ (Maechler et al, 2017) for plotting the clustering results. For this purpose, the fuzzy clustering object of ppclust should be converted to a fanny object by using the ppclust2 function of the package ‘ppclust’, as seen in the following code chunk:
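A sketch, with illustrative plotting options:
res.upfc3 <- ppclust2(res.upfc, "fanny")
clusplot(scale(x), res.upfc3$cluster,
  main="Cluster plot of Iris data set",
  color=TRUE, labels=2, lines=2, cex=1)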
Cluster validation is the process of evaluating the goodness of a clustering result. For this purpose, various validity indexes have been proposed in the related literature. Since clustering is an unsupervised learning analysis which does not use any external information, internal indexes are used to validate the clustering results. Although many internal indexes have originally been proposed for working with the hard membership degrees produced by K-means and its variants, most of these indexes cannot be used for fuzzy clustering results. In the R environment, the Partition Entropy (PE), Partition Coefficient (PC), Modified Partition Coefficient (MPC) and Fuzzy Silhouette Index are available in the ‘fclust’ package (Ferraro & Giordani, 2015), and can be used as follows:
res.upfc4 <- ppclust2(res.upfc, "fclust")             # convert to an 'fclust' object
idxsf <- SIL.F(res.upfc4$Xca, res.upfc4$U, alpha=1)   # Fuzzy Silhouette Index
idxpe <- PE(res.upfc4$U)                              # Partition Entropy
idxpc <- PC(res.upfc4$U)                              # Partition Coefficient
idxmpc <- MPC(res.upfc4$U)                            # Modified Partition Coefficient
cat("Partition Entropy:", idxpe, "\n")
cat("Partition Coefficient:", idxpc, "\n")
cat("Modified Partition Coefficient:", idxmpc, "\n")
cat("Fuzzy Silhouette Index:", idxsf, "\n")
## Partition Entropy: 0.3614377
## Partition Coefficient: 0.6169534
## Modified Partition Coefficient: 0.4254301
## Fuzzy Silhouette Index: 0.8262374
Anderson, E. (1935). The irises of the Gaspé Peninsula. Bulletin of the American Iris Society, 59: 2-5.
Ferraro, M.B. & Giordani, P. (2015). A toolbox for fuzzy clustering using the R programming language. Fuzzy Sets and Systems, 279: 1-16. http://dx.doi.org/10.1016/j.fss.2015.05.001
Kassambara, A. & Mundt, F. (2017). factoextra: Extract and Visualize the Results of Multivariate Data Analyses. R package version 1.0.5. https://CRAN.R-project.org/package=factoextra
Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M. & Hornik, K. (2017). cluster: Cluster Analysis Basics and Extensions. R package version 2.0.6. https://CRAN.R-project.org/package=cluster
Pal, N.R., Pal, K. & Bezdek, J.C. (2005). A possibilistic fuzzy c-means clustering algorithm. IEEE Transactions on Fuzzy Systems, 13(4): 517-530.
Wu, X., Wu, B., Sun, J. & Fu, H. (2010). Unsupervised possibilistic fuzzy clustering. Journal of Information & Computational Science, 7(5): 1075-1080.
Yang, M.S. & Wu, K.L. (2006). Unsupervised possibilistic clustering. Pattern Recognition, 39(1): 5-21.