Introduction

In Part I of this series, I introduced the idea of membership entropy, which, in the context of probabilistic clustering models, allows us to construct optimal clusters. I argued that an optimal set of (fuzzy) clusters is the one that minimizes the total membership entropy across our data space. After all, the purpose of clustering is to organize relations, or proximities, such that intra-cluster variability is minimized while inter-cluster variability is maximized. This means that less information (fewer bits, for instance) is needed to refer to our data in terms of clusters than in its unclustered form. Doing this optimally means minimizing the quantity of information needed to describe our data, which leads directly to the idea of minimizing the total membership entropy.

In Part I, I illustrated the power of this idea using very simple synthetic datasets. In this second part, I will describe some improvements to the framework that are needed to properly handle more complex and realistic datasets.

I will finish with some systematic experiments to assess the overall performance of the framework.

I have previously defined the membership entropy $S_x$ and the representative cluster, both derived from the set of membership probabilities $p_x(i)$. Here, $p_x(i)$ is defined as the probability of observation $x$ belonging to cluster $i$. Check the previous post for further details:

import numpy as np
from fcmeans import FCM

def run_cluster(n_clusters, data, seed=42):
    # fit the fc-means model and extract the membership probabilities
    model = FCM(n_clusters=n_clusters, random_state=seed)
    model.fit(data)
    p = model.u
    centers = model.centers
    # representative cluster (the cluster with the highest membership probability)
    representative_cluster = np.argmax(p, 1)
    # membership entropy (normalized by the log of the number of clusters)
    Sx = -np.sum(p*np.log(p), 1) / np.log(n_clusters)
    # total membership entropy (across the entire feature space)
    S = np.sum(Sx)

    return centers, p, representative_cluster, Sx, S
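
To build some intuition for the normalization by $\mathrm{log}(k)$, here is a minimal check with hand-picked membership vectors (not output of any model): a uniform membership vector yields the maximum entropy of 1, while a strongly peaked vector yields an entropy close to 0.

# Hand-picked membership vectors for illustration (k = 3 clusters)
p_uniform = np.array([[1/3, 1/3, 1/3]])     # maximally uncertain membership
p_peaked  = np.array([[0.98, 0.01, 0.01]])  # nearly certain membership

for p in (p_uniform, p_peaked):
    Sx = -np.sum(p*np.log(p), 1) / np.log(3)
    print(np.round(Sx, 3))   # -> 1.0 for the uniform vector, ~0.1 for the peaked one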

Membership entropy regularization

Regularization is a central concept in machine learning. It allows us to mitigate overfitting by penalizing models with many degrees of freedom. We can think of clustering as the problem of describing a dataset not in terms of individual observations but in terms of collective clusters; the number of clusters can then be interpreted as the number of degrees of freedom. We may be interested in promoting a low number of clusters to further reduce the complexity of our description, in which case models with a higher number of clusters should be penalized.

I have previously defined the membership entropy as

$$ S_{x}(\{ p_x(i) \}) = - \frac{1}{\mathrm{log}(k)} \displaystyle\sum_{i=1}^k p_x(i) \mathrm{log}(p_x(i)), $$

with $k$ the number of clusters. While we typically find a local entropy minimum at a given value of $k$, defined as the optimal number of clusters, note that a model where the number of clusters equals the total number of observations has zero entropy (each observation is its own cluster, with probability one). To promote lower complexity, we can then define the penalized membership entropy

$$ S_x^{\lambda} = S_x + \lambda k, $$

with $\lambda \geq 0$ the regularization parameter. The minimization of $S_x^{\lambda}$ to find the optimal number of clusters $k$ will, to a degree quantified by $\lambda$, promote models with lower complexity. This may be particularly important for datasets with a small number of observations and/or a high number of clusters.
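
As a minimal illustration of how the penalty works (using hypothetical total entropies, not output of any model in this post), the term $\lambda k$ is simply added to each candidate before comparing them:

import numpy as np

# Hypothetical total membership entropies for a few candidate numbers of clusters
k_candidates = np.array([4, 5, 6, 7])
S_totals     = np.array([20.0, 18.5, 17.8, 17.5])   # made-up values for illustration

for lamb in (0, 1):
    penalized = S_totals + lamb*k_candidates
    print(lamb, k_candidates[np.argmin(penalized)])
# with lamb = 0 the lowest entropy wins (k = 7);
# with lamb = 1 the penalty favors a simpler model (k = 5)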

Experimental results

Random datasets

In order to conduct systematic experiments, we need to be able to generate random datasets with bespoke properties. This can be done in a two-step process:

1) Random partitions:

In particular, we want to uniformly sample $(n,k,m)$ partitions, where $n$ is the total number of observations, $k$ is the number of clusters, and $m$ is the minimum number of observations per cluster. The following code, borrowed from (de Prado, 2020), uniformly samples such partitions:

import numpy as np
def construct_random_partition(n, k, m, seed=None):
    rand = np.random.RandomState(seed=seed)
    # k-1 distinct cut points, sampled uniformly
    parts = rand.choice(range(1, n-k*(m-1)), k-1, replace=False)
    parts.sort()
    parts = np.append(parts, n-k*(m-1))
    # cluster sizes: gaps between consecutive cut points, shifted so that
    # each cluster has at least m observations and the sizes sum to n
    parts = np.append(parts[0], np.diff(parts))-1+m
    return parts

As an example, let's sample a partition with $(n=500, k=7, m=2)$. Note that every function involving random sampling takes a user-defined seed parameter, so that the results shown here can be reproduced.

partition = construct_random_partition(n=500, k=7, m=2, seed=40)
print(partition)
[153  75  13  33   7  36 183]
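
As a quick sanity check (not part of the original procedure), the sampled sizes should add up to $n$ and each should be at least $m$:

# By construction the sizes sum to n and each is at least m
print(partition.sum())   # 500
print(partition.min())   # 7 (>= m = 2)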

2) Random observations:

After the partition has been generated, we now need to sample random observations. Let's define a function that takes the previously generated partition as input, together with n_features (the dimensionality of our data space), and std (the width of our clusters). The coordinates of the cluster centers are uniformly sampled from the $[-1, 1]$ interval.

def generate_random_dataset(partition, n_features, std, seed=41):
    random_state = np.random.RandomState(seed=seed)
    centers = list()
    dataset = list()
    for n in partition:
        # cluster center coordinates
        cluster_center = random_state.uniform(-1, 1, n_features)
        centers.append(cluster_center)
        # observation coordinates
        for observation in range(0, n):
            dataset.append(cluster_center+std*random_state.standard_normal(n_features))
    dataset = np.array(dataset)
    # shuffles the observations
    dataset = dataset[random_state.permutation(dataset.shape[0]), :]
    
    return np.array(dataset), np.array(centers)

Let's generate and plot a random dataset:

dataset, _ = generate_random_dataset(partition=partition, n_features=2, std=0.04, seed=41)
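
As a quick optional check, the dataset should contain one row per observation in the partition:

print(dataset.shape)   # (500, 2): 500 observations (the partition total) in 2 features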

import matplotlib
from matplotlib import cm
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 1, figsize=(5, 4))

axes.scatter(dataset[:,0], dataset[:,1], marker='.', s=60, color=(0.5,0.5,0.8,0.5))
axes.set_xlabel('feature 1')
axes.set_ylabel('feature 2')
axes.set_xlim(-1.2,1.2)
axes.set_ylim(-1.2,1.2)

plt.tight_layout()
plt.show()

Dealing with complex datasets

Non-convex optimization

As I described in Part I, the fc-means algorithm involves the minimization of a particular loss function (Bezdek, 2013). This minimization, however, is not convex, and there is a risk of getting stuck at a local minimum. While this may not be problematic in simple cases, it becomes a rather serious problem as the complexity of our dataset increases. Although a global minimum is always hard to guarantee, we can increase our confidence in the result by running the fc-means algorithm with several different seeds and keeping the best outcome.

Here, we will minimize the total membership entropy as described in Part I, but running the algorithm with n_seeds different seeds. Moreover, max_clusters is the maximum number of clusters to consider, and lamb is the regularization parameter introduced above.

def minimize_membership_entropy(dataset, max_clusters=10, lamb=0, seed=42, n_seeds=1):
    
    # The candidate numbers of clusters (from 2 up to max_clusters-1)
    n_clusters_trials = np.arange(2, max_clusters, 1)
    
    # Creates the seeds
    if n_seeds == 1:
        seeds = np.array([seed])
    else:
        rand = np.random.RandomState(seed=seed)
        seeds = rand.randint(low=1, high=1e9, size=n_seeds)
        
    # Runs the clustering for different seeds and number of clusters
    total_entropies = list()
    
    for seed in seeds:
        total_entropies_ = list()
        for trial in n_clusters_trials:
            _, _, _, _, total_entropy = run_cluster(n_clusters=trial, 
                                                    data=dataset, 
                                                    seed=seed)
            total_entropies_.append(total_entropy+lamb*trial)
        total_entropies.append(total_entropies_)
    total_entropies = np.array(total_entropies)
        
    # Finds the optimal number of clusters and the respective seed
    location = np.argwhere(total_entropies == np.min(total_entropies))[0]
    optimal_seed = seeds[location[0]]
    optimal_nclusters = n_clusters_trials[location[1]]
    
    return optimal_seed, optimal_nclusters, total_entropies

Let's run the minimization process for the dataset generated above:

max_clusters = 12
optimal_seed, optimal_nclusters, total_entropies = minimize_membership_entropy(dataset,
                                                                               max_clusters=max_clusters,
                                                                               lamb=1, 
                                                                               n_seeds=100)
print("Optimal number of clusters =", optimal_nclusters)
Optimal number of clusters = 7

We have correctly inferred the total number of clusters. Let's check the results of the (penalized) membership entropy minimization:

fig, axes = plt.subplots(1, 1, figsize=(5, 4))

axes.plot([optimal_nclusters, optimal_nclusters], [0, np.max(total_entropies)], color=(0.8,0.6,0.6), linewidth=2)
for i in range(0, total_entropies.shape[0]):
    axes.plot(np.arange(2, max_clusters, 1), total_entropies[i,:], color=(0.46,0.46,0.46,0.2), linewidth=2)
axes.set_xlabel('Number of clusters')
axes.set_ylabel('Total membership entropy')

plt.tight_layout()
plt.show()
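
Before interpreting the figure, we can quantify the seed-to-seed variability with a quick optional summary (using the total_entropies array returned above): the range of the total entropy across seeds for each candidate number of clusters.

# Range (max minus min) of the total membership entropy across seeds,
# for each candidate number of clusters; a large range means the result
# is highly sensitive to the initialization
spread = np.ptp(total_entropies, axis=0)
for k, s in zip(np.arange(2, max_clusters, 1), spread):
    print("k =", k, "-> spread across seeds =", np.round(s, 2))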

In the figure, each line corresponds to a different seed. We can see huge variability in the results, especially for larger candidate numbers of clusters; skipping this ensemble over seeds could easily leave us far from the optimum. For the sake of completeness, let's run the optimal clustering and plot the results:

centers, p, representative_cluster, Sx, S = run_cluster(optimal_nclusters, dataset, seed=optimal_seed)
print("Total membership entropy =", np.round(S, 2))
Total membership entropy = 28.85
def make_rgb_transparent(rgb, alpha):
    bg_rgb = [1, 1, 1]
    return [alpha * c1 + (1 - alpha) * c2 for (c1, c2) in zip(rgb, bg_rgb)]
colormap = cm.get_cmap('Accent')
edgecolors = list()
facecolors = list()
for i in range(0, optimal_nclusters):
    edgecolors.append(make_rgb_transparent(rgb=colormap(1.0*i/(optimal_nclusters-1)), alpha=1))
    facecolors.append(make_rgb_transparent(rgb=colormap(1.0*i/(optimal_nclusters-1)), alpha=0.65))

### Plot
fig, axes = plt.subplots(1, 1, figsize=(5, 4))
color_seq = list()
for j in range(0, dataset.shape[0]):
    color_seq.append(make_rgb_transparent(edgecolors[representative_cluster[j]], 1-Sx[j]))
for i in range(0, optimal_nclusters):
    axes.scatter([], [], label=str(i), color=edgecolors[i])
axes.scatter(dataset[:,0], dataset[:,1], marker='.', s=60, edgecolors=(0.6,0.6,0.6,0.5), c=color_seq)
axes.scatter(centers[:,0], centers[:,1], color=(0.8,0.2,0.2, 0.8), marker="v", label="Cluster center")
axes.set_xlabel('X')
axes.set_ylabel('Y')
axes.set_xlim(-1.2,1.2)
axes.set_ylim(-1.2,1.2)
axes.legend(loc="best")

plt.tight_layout()
plt.show()

Revisiting the concept of membership entropy regularization

As described above, promoting a lower number of clusters can be important when dealing with small datasets and/or a large number of clusters. Let's illustrate this. We consider the following dataset:

n=50
k=10
partition_seed=40
std=0.05
dataset_seed=41

partition = construct_random_partition(n=n, k=k, m=2, seed=partition_seed)
print("Partition:", partition)
dataset, _ = generate_random_dataset(partition=partition, n_features=2, std=std, seed=dataset_seed)
Partition: [ 4  3  2 12  7  3  2  4  7  6]

Let's begin by constructing unpenalized ($\lambda=0$) optimal clusters:

max_clusters=12
lamb=0
n_seeds=200

### Membership entropy minimization
optimal_seed, optimal_nclusters, total_entropies = minimize_membership_entropy(dataset,
                                                                               max_clusters=max_clusters,
                                                                               lamb=lamb, 
                                                                               n_seeds=n_seeds)
print("Optimal number of clusters =", optimal_nclusters)

### Clustering
centers, p, representative_cluster, Sx, S = run_cluster(optimal_nclusters, dataset, seed=optimal_seed)
print("Total membership entropy =", np.round(S, 2))

### Plot
edgecolors = list()
facecolors = list()
for i in range(0, optimal_nclusters):
    edgecolors.append(make_rgb_transparent(rgb=colormap(1.0*i/(optimal_nclusters-1)), alpha=1))
    facecolors.append(make_rgb_transparent(rgb=colormap(1.0*i/(optimal_nclusters-1)), alpha=0.65))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].plot([optimal_nclusters, optimal_nclusters], [0, np.max(total_entropies)], color=(0.8,0.6,0.6), linewidth=2)
for i in range(0, total_entropies.shape[0]):
    axes[0].plot(np.arange(2, max_clusters, 1), total_entropies[i,:], color=(0.46,0.46,0.46,0.2), linewidth=2)
axes[0].set_xlabel('Number of clusters')
axes[0].set_ylabel('Total membership entropy')

color_seq = list()
for j in range(0, dataset.shape[0]):
    color_seq.append(make_rgb_transparent(edgecolors[representative_cluster[j]], 1-Sx[j]))
for i in range(0, optimal_nclusters):
    axes[1].scatter([], [], label=str(i), color=edgecolors[i])
axes[1].scatter(dataset[:,0], dataset[:,1], marker='.', s=60, edgecolors=(0.6,0.6,0.6,0.5), c=color_seq)
axes[1].scatter(centers[:,0], centers[:,1], color=(0.8,0.2,0.2, 0.8), marker="v", label="Cluster center")
axes[1].set_xlabel('X')
axes[1].set_ylabel('Y')
axes[1].set_xlim(-1.2,1.2+0.5)
axes[1].set_ylim(-1.2,1.2)
axes[1].legend(loc="best")

plt.tight_layout()
plt.show()
Optimal number of clusters = 9
Total membership entropy = 14.79

Let's now promote a lower number of clusters by setting $\lambda=1$:

max_clusters=12
lamb=1
n_seeds=200

### Membership entropy minimization
optimal_seed, optimal_nclusters, total_entropies = minimize_membership_entropy(dataset,
                                                                               max_clusters=max_clusters,
                                                                               lamb=lamb, 
                                                                               n_seeds=n_seeds)
print("Optimal number of clusters =", optimal_nclusters)

### Clustering
centers, p, representative_cluster, Sx, S = run_cluster(optimal_nclusters, dataset, seed=optimal_seed)
print("Total membership entropy =", np.round(S, 2))

### Plot
edgecolors = list()
facecolors = list()
for i in range(0, optimal_nclusters):
    edgecolors.append(make_rgb_transparent(rgb=colormap(1.0*i/(optimal_nclusters-1)), alpha=1))
    facecolors.append(make_rgb_transparent(rgb=colormap(1.0*i/(optimal_nclusters-1)), alpha=0.65))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].plot([optimal_nclusters, optimal_nclusters], [0, np.max(total_entropies)], color=(0.8,0.6,0.6), linewidth=2)
for i in range(0, total_entropies.shape[0]):
    axes[0].plot(np.arange(2, max_clusters, 1), total_entropies[i,:], color=(0.46,0.46,0.46,0.2), linewidth=2)
axes[0].set_xlabel('Number of clusters')
axes[0].set_ylabel('Total membership entropy')

color_seq = list()
for j in range(0, dataset.shape[0]):
    color_seq.append(make_rgb_transparent(edgecolors[representative_cluster[j]], 1-Sx[j]))
for i in range(0, optimal_nclusters):
    axes[1].scatter([], [], label=str(i), color=edgecolors[i])
axes[1].scatter(dataset[:,0], dataset[:,1], marker='.', s=60, edgecolors=(0.6,0.6,0.6,0.5), c=color_seq)
axes[1].scatter(centers[:,0], centers[:,1], color=(0.8,0.2,0.2, 0.8), marker="v", label="Cluster center")
axes[1].set_xlabel('X')
axes[1].set_ylabel('Y')
axes[1].set_xlim(-1.2,1.2+0.5)
axes[1].set_ylim(-1.2,1.2)
axes[1].legend(loc="best")

plt.tight_layout()
plt.show()
Optimal number of clusters = 6
Total membership entropy = 17.37

The unpenalized clustering recovers a number of clusters (9) very close to the one used to sample the partition (10). However, given the combination of a small number of observations, a large number of clusters, and the high proximity between clusters, the penalized clustering can arguably be considered superior to its unpenalized counterpart, despite its larger entropy: it describes the data with fewer degrees of freedom.
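
To see concretely how the penalty tips the balance at $\lambda=1$ (a back-of-the-envelope comparison using the two totals printed above, and assuming the k=9 solution is also found in the penalized run), compare the penalized objectives $S + \lambda k$ of the two solutions:

# Penalized objective S + lamb*k for the two solutions found above (lamb = 1)
print(14.79 + 1*9)   # 23.79 for the unpenalized solution (k = 9)
print(17.37 + 1*6)   # 23.37 for the penalized solution (k = 6) -> smaller, hence preferred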

Performance assessment

Let us now assess the performance of the optimal probabilistic clustering framework in a more systematic and comprehensive manner. We consider random datasets with different numbers of clusters (k_vals) and different cluster widths (std_vals), while keeping the total number of observations n constant. For each combination we run the experiment n_realizations times. Since we are only interested in recovering the correct number of clusters, we set $\lambda=0$:

k_vals = np.arange(3, 10, 1)
std_vals = np.arange(0.0025, 0.03, 0.0025)
n_realizations = 25
n = 500
lamb = 0

n_seeds = 100
max_clusters = 12
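
To get a rough sense of the computational cost implied by this grid (a back-of-the-envelope count, not a benchmark), note that each grid point requires n_seeds fc-means fits for every candidate number of clusters:

# Total number of individual fc-means fits for the full grid:
# realizations x k values x cluster widths x seeds x candidate numbers of clusters
n_candidates = len(np.arange(2, max_clusters, 1))
n_fits = n_realizations * len(k_vals) * len(std_vals) * n_seeds * n_candidates
print(n_fits)   # on the order of two million fits, which explains the long runtime below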

The following function wraps all the procedures described above:

def run_experiment(n,                  # Number of observations
                   k,                  # Number of clusters
                   m,                  # Minimum number of observations per cluster
                   seed_partition,     # Seed to sample partition
                   n_features,         # Number of features (dimensionality of data space)
                   std,                # Width of clusters
                   seed_dataset,       # Seed to sample observations
                   max_clusters,       # Maximum number of clusters to try
                   lamb,               # Regularization parameter
                   n_seeds):           # Number of seeds to try
    
    # Samples partition
    partition = construct_random_partition(n=n, k=k, m=m, seed=seed_partition)
    
    # Samples dataset
    dataset, _ = generate_random_dataset(partition=partition, 
                                      n_features=n_features, 
                                      std=std, 
                                      seed=seed_dataset)
    
    # Minimizes total membership entropy
    optimal_seed, optimal_nclusters, total_entropies = minimize_membership_entropy(dataset,
                                                                                   max_clusters=max_clusters,
                                                                                   lamb=lamb, 
                                                                                   n_seeds=n_seeds)
                
    return optimal_seed, optimal_nclusters

Let's run the experiment (the following code may take a long time to run):

experimental_results = np.zeros((n_realizations, len(k_vals), len(std_vals)))

rand = np.random.RandomState(seed=41)
for i in range(0, n_realizations):
    
    print("Running realization ->", i)
    seed_partition = rand.randint(low=1, high=1e9, size=1)
    seed_dataset = rand.randint(low=1, high=1e9, size=1)
    for j, k in enumerate(k_vals):
        for h, std in enumerate(std_vals):
            _, optimal_nclusters = run_experiment(n=n,
                                                  k=k,
                                                  m=2,
                                                  seed_partition=seed_partition,
                                                  n_features=2,
                                                  std=std,
                                                  seed_dataset=seed_dataset,
                                                  max_clusters=max_clusters,
                                                  lamb=lamb,
                                                  n_seeds=n_seeds)
            
            experimental_results[i,j,h] = optimal_nclusters

We can now quantify the performance by calculating the relative error in estimating the number of clusters, followed by its mean and standard deviation over the different realizations:

real_solution  = np.repeat([k_vals], repeats=len(std_vals), axis=0).T
deviation_mean = np.mean((experimental_results-real_solution)/real_solution, 0)
deviation_std  = np.std((experimental_results-real_solution)/real_solution, 0)

Finally, let's plot the results:

k_vals_plot = list(k_vals)
k_vals_plot.append(k_vals[-1]+(k_vals[-1]-k_vals[-2]))
std_vals_plot = list(std_vals)
std_vals_plot.append(std_vals_plot[-1]+(std_vals_plot[-1]-std_vals_plot[-2]))

fig = plt.figure(figsize=(10, 4))

axes1 = plt.axes([0.05, 0.05, 0.40, 0.80])
aux1=axes1.pcolormesh(std_vals_plot, k_vals_plot, deviation_mean, cmap='RdGy', vmin=-0.5, vmax=0.5)
axes1.set_xlabel("Cluster width")
axes1.set_ylabel("Number of clusters")
cbar1 = fig.colorbar(aux1)
cbar1.set_label('Average of relative errors', rotation=90)

axes2 = plt.axes([0.55, 0.05, 0.40, 0.80])
aux2=axes2.pcolormesh(std_vals_plot, k_vals_plot, deviation_std, cmap='gist_earth')
axes2.set_xlabel("Cluster width")
axes2.set_ylabel("Number of clusters")
cbar2 = fig.colorbar(aux2)
cbar2.set_label('Standard deviation of relative errors', rotation=90)

plt.show()

As expected, the error in inferring the correct number of clusters increases for wider clusters. Increasing the number of clusters, however, leaves the performance essentially constant, at least in the range investigated here. We also note a tendency to underestimate the number of clusters. This is not surprising: the apparent decrease in performance is not so much a consequence of a particular limitation of the framework as of the increasing probability of cluster overlap when sampling the datasets; overlapping clusters tend to be merged into one, which biases the estimate downwards.

In a future post, I will use the current framework to perform feature clustering, where the observations in the data space are the features themselves. I will make use of the idea of feature distance, which I describe in a different post.

References:

  1. de Prado, M. M. L. (2020). Machine learning for asset managers. Cambridge University Press.
  2. Bezdek, J. C. (2013). Pattern recognition with fuzzy objective function algorithms. Springer Science & Business Media.