33 Clustering Metrics and Cluster Validity

Cluster analysis finds similarities among data objects based on their characteristics and groups similar objects into clusters.

Typical applications

As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms

Dissimilarity/Similarity metric

Similarity is expressed in terms of a distance function d(i, j), which is typically a metric: the smaller the distance between objects i and j, the more similar they are.

There is a separate “quality” function that measures the “goodness” of a cluster.

The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
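As a rough illustration of how the distance function changes with the variable type, the sketch below shows a few common choices. The function names are hypothetical and chosen only for this example; other definitions are possible for each variable type.

```python
import math

def euclidean(x, y):
    """Distance for interval-scaled (numeric) variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Alternative numeric distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def simple_matching(x, y):
    """Dissimilarity for categorical variables: fraction of attributes that differ."""
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)

def jaccard_dissimilarity(x, y):
    """Dissimilarity for asymmetric boolean variables: 0-0 matches are ignored."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return 0.0 if either == 0 else 1 - both / either

print(euclidean((0, 0), (3, 4)))
print(simple_matching(("red", "small"), ("red", "large")))
```

Ordinal variables are often mapped to ranks first and then treated like numeric values, which is one reason the choice of distance has to follow the data semantics.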

Weights should be associated with different variables based on applications and data semantics.
It is hard to define “similar enough” or “good enough” and the answer is typically highly subjective.
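The idea of weighting variables can be sketched as a weighted Euclidean distance. This is only one possible form; the function name and the example weights are assumptions made for illustration, and in practice the weights come from the application.

```python
import math

def weighted_euclidean(x, y, weights):
    """Euclidean distance in which each variable contributes according to its weight."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))

# Equal weights reproduce the ordinary Euclidean distance.
print(weighted_euclidean((0, 0), (3, 4), (1, 1)))
# A zero weight removes the second variable from the comparison.
print(weighted_euclidean((0, 0), (3, 4), (1, 0)))
```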

What is Good Clustering?

A good clustering method will produce high-quality clusters with

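high intra-class similarity (objects in the same cluster are close to each other)
low inter-class similarity (objects in different clusters are far apart)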
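One simple way to check these two properties is to compare the average distance between pairs in the same cluster with the average distance between pairs in different clusters. The sketch below assumes numeric points and Euclidean distance; the helper name `mean_intra_inter` and the toy data are made up for this example.

```python
import math
from itertools import combinations

def mean_intra_inter(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        # Pairs with the same label count as intra-cluster, all others as inter-cluster.
        (intra if lp == lq else inter).append(math.dist(p, q))
    mean = lambda values: sum(values) / len(values) if values else float("nan")
    return mean(intra), mean(inter)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
intra, inter = mean_intra_inter(points, labels)
print(intra, inter)
```

For a good clustering the first value should be clearly smaller than the second, since small distances correspond to high similarity.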

The quality of a clustering result depends on

the similarity measure used
the implementation of that similarity measure

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Requirements of Clustering

Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noise and outliers
Insensitivity to the order of input records
Ability to handle high-dimensional data
Incorporation of user-specified constraints
Interpretability and usability

Measuring Clustering Quality

Two methods: extrinsic vs. intrinsic

Extrinsic: supervised, i.e., the ground truth is available

• Compare a clustering against the ground truth using a clustering quality measure

• Examples: purity, precision and recall metrics, normalized mutual information
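Purity, the first extrinsic measure listed above, can be sketched in a few lines: each cluster is credited with its most frequent ground-truth class, and the credited counts are divided by the total number of objects. The function name and the toy labels are assumptions for illustration.

```python
from collections import Counter, defaultdict

def purity(true_labels, cluster_labels):
    """Fraction of objects that belong to the majority class of their cluster."""
    members = defaultdict(list)
    for truth, cluster in zip(true_labels, cluster_labels):
        members[cluster].append(truth)
    # For each cluster, count the objects in its most common ground-truth class.
    majority_total = sum(
        Counter(truths).most_common(1)[0][1] for truths in members.values()
    )
    return majority_total / len(true_labels)

truth = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 1, 1, 1, 1]
print(purity(truth, clusters))
```

Purity is 1.0 when every cluster contains a single class, but it also reaches 1.0 trivially when every object is its own cluster, which is one reason it is usually reported alongside other measures.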

Intrinsic: unsupervised, i.e., the ground truth is unavailable

• Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact they are

• Example: silhouette coefficient
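The silhouette coefficient combines compactness and separation in one number. For each object, a is the mean distance to the other objects in its own cluster, b is the smallest mean distance to the objects of any other cluster, and the score is (b - a) / max(a, b). The sketch below is a minimal version that assumes numeric points and Euclidean distance; the function name is hypothetical, and objects in singleton clusters are given a score of 0, which is a common convention.

```python
import math
from collections import defaultdict

def silhouette_coefficient(points, labels):
    """Mean silhouette score over all objects (ranges from -1 to 1)."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster
            continue
        # a: compactness, mean distance to the rest of the object's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: separation, mean distance to the nearest other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in idxs) / len(idxs)
            for other, idxs in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_coefficient(points, [0, 0, 1, 1])
bad = silhouette_coefficient(points, [0, 1, 0, 1])
print(good, bad)
```

Values close to 1 indicate compact, well-separated clusters, while negative values suggest that many objects sit closer to another cluster than to their own.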

License

Building Skills for Data Science Copyright © by Dr. Nouhad Rizk. All Rights Reserved.
