PKBC

class QuadratiK.spherical_clustering.PKBC(num_clust: int, max_iter: int = 300, stopping_rule: str = 'loglik', init_method: str = 'sampledata', num_init: int = 10, tol: float = 1e-07, random_state: int | None = None, n_jobs: int = 4)

Poisson kernel-based clustering on the sphere. The class implements the Poisson kernel-based clustering algorithm, which estimates the parameters of a mixture of Poisson kernel-based densities on the sphere. The resulting estimates are used to assign each data point its final cluster membership.
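For orientation, the Poisson kernel-based density of Golzy and Markatou (2020) for a unit vector x in R^d, with centroid mu and concentration rho in [0, 1), is f(x | mu, rho) = (1 - rho^2) / (omega_d * ||x - rho * mu||^d), where omega_d is the surface area of the unit sphere. The following is a minimal standard-library sketch of that formula, not the package's implementation; the helper name pkb_density is hypothetical.

```python
import math

def pkb_density(x, mu, rho):
    """Poisson kernel-based density at unit vector x (hypothetical helper).

    x, mu: sequences of floats with unit Euclidean norm; 0 <= rho < 1.
    """
    d = len(x)
    # Surface area of the unit sphere embedded in R^d.
    omega_d = 2.0 * math.pi ** (d / 2) / math.gamma(d / 2)
    dist = math.sqrt(sum((xi - rho * mi) ** 2 for xi, mi in zip(x, mu)))
    return (1.0 - rho ** 2) / (omega_d * dist ** d)

# Sanity check on the circle (d = 2): integrate the density numerically.
n = 2000
mu = (1.0, 0.0)
step = 2.0 * math.pi / n
total = sum(
    pkb_density((math.cos(k * step), math.sin(k * step)), mu, 0.5) * step
    for k in range(n)
)
```

With rho = 0 the density reduces to the uniform density 1 / omega_d; as rho approaches 1 the mass concentrates around mu.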

Parameters

num_clust : int, list, np.ndarray, range

Number of clusters.

max_iter : int, optional

Maximum number of iterations before a run is terminated. Defaults to 300.

stopping_rule : str, optional

String describing the stopping rule to be used within each run. Must be one of ‘max’, ‘membership’, or ‘loglik’. Defaults to ‘loglik’.

init_method : str, optional

String describing the initialization method to be used. Currently must be ‘sampledata’.

num_init : int, optional

Number of initializations. Defaults to 10.

tol : float, optional

Threshold by which the log-likelihood must change for iterations to continue, if applicable. Defaults to 1e-7.
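A tolerance-based rule of this kind can be pictured as follows. This is a hedged sketch of a generic check, assuming iterations stop once the absolute change in log-likelihood falls below tol or max_iter is reached; the package's exact criterion may differ, and run_until_converged and the toy update are hypothetical.

```python
def run_until_converged(step, loglik0, tol=1e-7, max_iter=300):
    """Iterate `step` until the log-likelihood change drops below tol.

    step: callable taking the iteration index and returning the new
    log-likelihood (a stand-in for one E-M update).
    Returns (final log-likelihood, number of iterations performed).
    """
    prev = loglik0
    for it in range(1, max_iter + 1):
        cur = step(it)
        if abs(cur - prev) < tol:
            return cur, it
        prev = cur
    return prev, max_iter

# Toy sequence that approaches -100 geometrically.
final, n_iter = run_until_converged(lambda it: -100.0 - 0.5 ** it, loglik0=-110.0)
```

A smaller tol lets each run iterate longer before stopping; max_iter caps the cost of runs that converge slowly.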

random_state : int, None, optional

Determines random number generation for centroid initialization. Defaults to None.

n_jobs : int, optional

Used only for computing the WCSS efficiently. Specifies the maximum number of concurrently running workers. If 1 is given, no joblib parallelism is used at all, which is useful for debugging. For details on n_jobs, see the joblib documentation: https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html. Defaults to 4.

Attributes

alpha_dict

Estimated mixing proportions. A dictionary containing key-value pairs, where each key is an element from the num_clust vector, and each value is a numpy.ndarray of shape (n_clusters,).

labels_dict

Final cluster membership assigned by the algorithm to each observation. A dictionary containing key-value pairs, where each key is an element from the num_clust vector, and each value is a numpy.ndarray of shape (n_samples,).

log_lik_vecs_dict

Array of log-likelihood values for each initialization. A dictionary containing key-value pairs, where each key is an element from the num_clust vector, and each value is a numpy.ndarray of shape (num_init, ).

loglik_dict

Maximum value of the log-likelihood function. A dictionary containing key-value pairs, where each key is an element from the num_clust vector, and each value is a float.

mu_dict

Estimated centroids. A dictionary containing key-value pairs, where each key is an element from the num_clust vector, and each value is a numpy.ndarray of shape (n_clusters, n_features).

num_iter_per_runs_dict

Number of E-M iterations per run. A dictionary containing key-value pairs, where each key is an element from the num_clust vector, and each value is a numpy.ndarray of shape (num_init, ).

post_probs_dict

Posterior probabilities of each observation for the indicated clusters. A dictionary containing key-value pairs, where each key is an element from the num_clust vector, and each value is a numpy.ndarray of shape (n_samples, num_clust).

rho_dict

Estimated concentration parameters rho. A dictionary containing key-value pairs, where each key is an element from the num_clust vector, and each value is a numpy.ndarray of shape (n_clusters,).

euclidean_wcss_dict

Values of within-cluster sum of squares computed with Euclidean distance. A dictionary containing key-value pairs, where each key is an element from the num_clust vector, and each value is a float.

cosine_wcss_dict

Values of within-cluster sum of squares computed with cosine similarity. A dictionary containing key-value pairs, where each key is an element from the num_clust vector, and each value is a float.
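As a rough illustration of the two WCSS quantities, the sketch below sums squared Euclidean distances and cosine dissimilarities (1 minus cosine similarity) between each point and its assigned centroid. The exact definitions used by the package are not stated here, so both formulas and the helper name wcss are assumptions.

```python
import math

def wcss(points, centroids, labels):
    """Return (euclidean_wcss, cosine_wcss) for a hard clustering.

    points: list of vectors; centroids: list of vectors;
    labels: centroid index assigned to each point.
    """
    euclid, cosine = 0.0, 0.0
    for x, k in zip(points, labels):
        c = centroids[k]
        # Squared Euclidean distance to the assigned centroid.
        euclid += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        # Cosine dissimilarity to the assigned centroid.
        dot = sum(xi * ci for xi, ci in zip(x, c))
        norm_x = math.sqrt(sum(xi * xi for xi in x))
        norm_c = math.sqrt(sum(ci * ci for ci in c))
        cosine += 1.0 - dot / (norm_x * norm_c)
    return euclid, cosine

points = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
centroids = [(1.0, 0.0), (-1.0, 0.0)]
euclid_wcss, cosine_wcss = wcss(points, centroids, labels=[0, 0, 1])
```

For unit vectors the two measures are linked by ||x - c||^2 = 2 (1 - cos(x, c)), so under these definitions they rank candidate clusterings in the same order.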

References

Golzy M. & Markatou M. (2020) Poisson Kernel-Based Clustering on the Sphere: Convergence Properties, Identifiability, and a Method of Sampling, Journal of Computational and Graphical Statistics, 29:4, 758-770, DOI: 10.1080/10618600.2020.1740713.

Examples

>>> from QuadratiK.datasets import load_wireless_data
>>> from QuadratiK.spherical_clustering import PKBC
>>> from sklearn.preprocessing import LabelEncoder
>>> X, y = load_wireless_data(return_X_y=True)
>>> cluster_fit = PKBC(num_clust=4, random_state=42).fit(X)
>>> print(cluster_fit)
<QuadratiK.spherical_clustering._pkbc.PKBC object at 0x72b216628790>

Methods

PKBC.fit(dat)

Performs Poisson Kernel-based Clustering.

PKBC.plot(num_clust[, y_true])

The method plot creates a 2D or 3D scatter plot with a circle or sphere as the surface and data points plotted on it.

PKBC.predict(X, num_clust)

Predict the cluster membership for each sample in X.

PKBC.stats_clusters(num_clust)

Function to generate descriptive statistics per variable (and per group if available).

PKBC.summary([print_fmt])

Summary function generates a table for the PKBC clustering.

PKBC.validation([y_true])

Computes validation metrics such as ARI, Macro Precision and Macro Recall when true labels are provided.


PKBC.fit(dat: ndarray | DataFrame) -> PKBC

Performs Poisson Kernel-based Clustering.

Parameters

dat : numpy.ndarray, pandas.DataFrame

A numeric array of data values.

Returns

self : object

Fitted estimator
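Because the model is defined on the unit sphere, observations are treated as directions. The sketch below projects each row onto the unit sphere by dividing it by its Euclidean norm. Whether fit performs this step internally is an assumption here, and to_unit_sphere is a hypothetical helper rather than part of the package.

```python
import math

def to_unit_sphere(rows):
    """Scale each row to unit Euclidean norm (hypothetical helper)."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        if norm == 0.0:
            # A zero vector has no direction and cannot be projected.
            raise ValueError("zero vector has no direction on the sphere")
        out.append([v / norm for v in row])
    return out

unit_rows = to_unit_sphere([[3.0, 4.0], [0.0, -2.0]])
```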

PKBC.plot(num_clust: int, y_true: ndarray | list | Series | None = None) -> Figure | Figure

The method plot creates a 2D or 3D scatter plot with a circle or sphere as the surface and data points plotted on it.

Parameters

num_clust : int

Specifies the number of clusters to visualize.

y_true : numpy.ndarray, list, pandas.Series, optional
  • If y_true is None, clusters are colored according to the predicted labels only.

  • If y_true is provided, clusters are colored according to the predicted and true labels in separate subplots.

Returns

Returns a 2D matplotlib figure object or 3D plotly figure object with data points plotted on it.

PKBC.predict(X: ndarray | DataFrame, num_clust: int) -> tuple[ndarray, ndarray]

Predict the cluster membership for each sample in X.

Parameters

X : numpy.ndarray, pandas.DataFrame

New data to predict membership.

num_clust : int

Number of clusters to be used for prediction.

Returns

(Cluster Probabilities, Membership) : tuple

The first element of the tuple is the cluster probabilities of the input samples. The second element of the tuple is the predicted cluster membership of the new data.
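Conceptually, cluster probabilities for a new point follow from Bayes' rule applied to a fitted mixture: each posterior is proportional to the mixing proportion times the component density, and the membership is the most probable cluster. The standard-library sketch below assumes that form. The parameter values are made up, the helper names are hypothetical, and the returned index is 0-based in this sketch only.

```python
import math

def pkb_density(x, mu, rho):
    """Poisson kernel-based density at unit vector x (hypothetical helper)."""
    d = len(x)
    omega_d = 2.0 * math.pi ** (d / 2) / math.gamma(d / 2)
    dist = math.sqrt(sum((xi - rho * mi) ** 2 for xi, mi in zip(x, mu)))
    return (1.0 - rho ** 2) / (omega_d * dist ** d)

def posterior(x, alpha, mu, rho):
    """Return (posterior probabilities, index of the most probable cluster)."""
    weighted = [a * pkb_density(x, m, r) for a, m, r in zip(alpha, mu, rho)]
    total = sum(weighted)
    probs = [w / total for w in weighted]
    return probs, probs.index(max(probs))

# Illustrative two-cluster mixture on the circle.
alpha = [0.5, 0.5]
mu = [(1.0, 0.0), (-1.0, 0.0)]
rho = [0.8, 0.8]
probs, member = posterior((0.6, 0.8), alpha, mu, rho)
```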

PKBC.stats_clusters(num_clust: int) -> DataFrame

Function to generate descriptive statistics per variable (and per group if available).

Parameters

num_clust : int

Number of clusters for which the summary statistics should be shown.

Returns

summary_stats_df : pandas.DataFrame

Dataframe of descriptive statistics.

PKBC.summary(print_fmt: str = 'simple') -> str

Summary function generates a table for the PKBC clustering.

Parameters

print_fmt : str, optional

Output format used when printing the table. Supports all formats available in tabulate; see https://pypi.org/project/tabulate/. Defaults to “simple”.

Returns

summary : str

A string formatted in the desired output format with the Loglikelihood, Euclidean WCSS, Cosine WCSS, Number of data points in each cluster, and mixing proportion for the different number of clusters.

PKBC.validation(y_true: ndarray | None = None) -> tuple[DataFrame, Figure]

Computes validation metrics such as ARI, Macro Precision and Macro Recall when true labels are provided.

Parameters

y_true : numpy.ndarray, optional

Array of true cluster memberships. Defaults to None.

Returns

validation metrics : tuple

A tuple of a dataframe and elbow plots. The dataframe contains the following for each number of clusters:

  • Adjusted Rand Index : float (returned only when y_true is provided)

    Adjusted Rand Index computed between the true and predicted cluster memberships.

  • Macro Precision : float (returned only when y_true is provided)

    Macro Precision computed between the true and predicted cluster memberships.

  • Macro Recall : float (returned only when y_true is provided)

    Macro Recall computed between the true and predicted cluster memberships.

  • Average Silhouette Score : float

    Mean Silhouette Coefficient of all samples.
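Macro precision and macro recall average the per-class precision and recall, giving every class equal weight. The following is a minimal standard-library sketch, assuming the predicted labels have already been matched to the true classes; macro_precision_recall is a hypothetical helper, not part of the package.

```python
def macro_precision_recall(y_true, y_pred):
    """Return (macro precision, macro recall) over all observed classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        pred_c = sum(1 for p in y_pred if p == c)
        true_c = sum(1 for t in y_true if t == c)
        # A class that is never predicted (or never present) contributes 0.
        precisions.append(tp / pred_c if pred_c else 0.0)
        recalls.append(tp / true_c if true_c else 0.0)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2]
macro_p, macro_r = macro_precision_recall(y_true, y_pred)
```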

References

Rousseeuw, P.J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53-65.

Notes

A naive approach is used to map the predicted cluster labels to the true class labels (if provided). This mapping may fail when num_clust is large. In such cases, match the labels yourself and compute the metrics with sklearn.metrics.
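One simple way to match cluster labels to classes is a majority vote: each predicted cluster is assigned the true class that occurs most often among its members. The sketch below only illustrates this kind of naive mapping; the mapping actually used by the package is not described here, and majority_vote_mapping is a hypothetical helper.

```python
from collections import Counter

def majority_vote_mapping(y_true, y_pred):
    """Map each predicted cluster to its most frequent true class."""
    votes = {}
    for t, p in zip(y_true, y_pred):
        votes.setdefault(p, Counter())[t] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}

y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = [2, 2, 0, 0, 0, 1]
mapping = majority_vote_mapping(y_true, y_pred)
relabeled = [mapping[p] for p in y_pred]
```

With many clusters, several of them can be mapped to the same class under a majority vote, which is one reason an explicit matching is safer before computing metrics.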

See also

sklearn.metrics : Scikit-learn's metrics module supports a wide range of metrics.