PKBC#

class QuadratiK.spherical_clustering.PKBC(num_clust: int, max_iter: int = 300, stopping_rule: str = 'loglik', init_method: str = 'sampledata', num_init: int = 10, tol: float = 1e-07, random_state: int | None = None, n_jobs: int = 4)#

Poisson kernel-based clustering on the sphere. The class fits a mixture of Poisson kernel-based densities to data on the sphere and estimates the mixture parameters. The resulting estimates are then used to assign each data point its final cluster membership.
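To make the building block of the mixture concrete, the sketch below evaluates a single Poisson kernel-based density in the form given by Golzy & Markatou (2020), f(x | mu, rho) = (1 - rho^2) / (omega_d * ||x - rho * mu||^d), where omega_d is the surface area of the unit sphere in R^d. The function names are hypothetical and are not part of QuadratiK; this is a plain-Python illustration, not the library's implementation.

```python
import math


def sphere_surface_area(d: int) -> float:
    """Surface area of the unit sphere embedded in R^d."""
    return 2.0 * math.pi ** (d / 2.0) / math.gamma(d / 2.0)


def pkb_density(x, mu, rho):
    """Poisson kernel-based density at unit vector x.

    mu is the centroid (a unit vector) and rho in [0, 1) is the concentration.
    Hypothetical helper for illustration only.
    """
    d = len(x)
    sq_dist = sum((xi - rho * mi) ** 2 for xi, mi in zip(x, mu))
    return (1.0 - rho ** 2) / (sphere_surface_area(d) * sq_dist ** (d / 2.0))


# Sanity check on the circle (d = 2): a Riemann sum of the density over
# the full circle should be close to 1.
n = 10_000
step = 2.0 * math.pi / n
mu = (1.0, 0.0)
total = sum(
    pkb_density((math.cos(k * step), math.sin(k * step)), mu, 0.5) * step
    for k in range(n)
)
```

With rho = 0 the density is uniform on the sphere, and as rho approaches 1 the mass concentrates around mu.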

Parameters#

num_clust (int, list, np.ndarray, range): Number of clusters.
max_iter (int, optional): Maximum number of iterations before a run is terminated. Defaults to 300.
stopping_rule (str, optional): Stopping rule used within each run. Must be one of 'max', 'membership', or 'loglik'. Defaults to 'loglik'.
init_method (str, optional): Initialization method. Currently must be 'sampledata'. Defaults to 'sampledata'.
num_init (int, optional): Number of initializations. Defaults to 10.
tol (float, optional): Threshold by which the log-likelihood must change for iterations to continue, if applicable. Defaults to 1e-7.
random_state (int, None, optional): Determines random number generation for centroid initialization. Defaults to None.
n_jobs (int, optional): Used only for computing the WCSS efficiently. Specifies the maximum number of concurrently running workers. If 1 is given, no joblib parallelism is used at all, which is useful for debugging. For more information on joblib n_jobs, see https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html. Defaults to 4.
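One way to picture how max_iter, stopping_rule, and tol interact is the sketch below. The reading of the three rule names is an assumption drawn from the parameter descriptions above ('max' runs to max_iter, 'membership' stops once memberships no longer change, 'loglik' stops once the log-likelihood changes by less than tol); should_stop is a hypothetical helper, not a QuadratiK function.

```python
def should_stop(rule, iteration, max_iter, prev_loglik, curr_loglik,
                prev_labels, curr_labels, tol=1e-7):
    """Decide whether one EM run should stop (illustrative sketch only)."""
    if iteration >= max_iter:
        # Assumed: every rule is capped by max_iter.
        return True
    if rule == "max":
        # Assumed: keep iterating until max_iter is reached.
        return False
    if rule == "membership":
        # Assumed: stop when no observation changed cluster.
        return list(prev_labels) == list(curr_labels)
    if rule == "loglik":
        # Stop when the log-likelihood change falls below tol.
        return abs(curr_loglik - prev_loglik) < tol
    raise ValueError(f"unknown stopping rule: {rule!r}")
```

Under this reading, a looser tol trades accuracy of the final estimates for fewer iterations per run.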

Attributes#

Each attribute is a dictionary keyed by the elements of num_clust; the value stored under each key is described below.

alpha_ (dict): Estimated mixing proportions. Each value is a numpy.ndarray of shape (n_clusters,).
labels_ (dict): Final cluster membership assigned by the algorithm to each observation. Each value is a numpy.ndarray of shape (n_samples,).
log_lik_vecs_ (dict): Log-likelihood values for each initialization. Each value is a numpy.ndarray of shape (num_init,).
loglik_ (dict): Maximum value of the log-likelihood function. Each value is a float.
mu_ (dict): Estimated centroids. Each value is a numpy.ndarray of shape (n_clusters, n_features).
num_iter_per_runs_ (dict): Number of E-M iterations per run. Each value is a numpy.ndarray of shape (num_init,).
post_probs_ (dict): Posterior probabilities of each observation for the indicated clusters. Each value is a numpy.ndarray of shape (n_samples, n_clusters).
rho_ (dict): Estimated concentration parameters rho. Each value is a numpy.ndarray of shape (n_clusters,).
euclidean_wcss_ (dict): Within-cluster sum of squares computed with Euclidean distance. Each value is a float.
cosine_wcss_ (dict): Within-cluster sum of squares computed with cosine similarity. Each value is a float.
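The two WCSS quantities can be illustrated with a small plain-Python sketch. The Euclidean version sums squared distances from each point to its assigned centroid. For the cosine version, the sketch assumes the dissimilarity is 1 minus the cosine similarity; the source does not spell out the exact formula, so treat that choice, and both function names, as assumptions.

```python
import math


def euclidean_wcss(points, labels, centroids):
    """Sum of squared Euclidean distances to the assigned centroid."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, centroids[label]))
        for point, label in zip(points, labels)
    )


def cosine_wcss(points, labels, centroids):
    """Sum of (1 - cosine similarity) to the assigned centroid (assumed form)."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        dot = sum(p * q for p, q in zip(point, centroid))
        norms = math.hypot(*point) * math.hypot(*centroid)
        total += 1.0 - dot / norms
    return total


# Three unit vectors on the circle, split into two clusters.
points = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
labels = [0, 0, 1]
centroids = [(math.sqrt(0.5), math.sqrt(0.5)), (-1.0, 0.0)]

e_wcss = euclidean_wcss(points, labels, centroids)
c_wcss = cosine_wcss(points, labels, centroids)
```

For unit-norm points and centroids the two measures differ only by a factor of 2 under this definition, since the squared distance between unit vectors equals 2 * (1 - cosine similarity).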

References#

Golzy M. & Markatou M. (2020) Poisson Kernel-Based Clustering on the Sphere: Convergence Properties, Identifiability, and a Method of Sampling, Journal of Computational and Graphical Statistics, 29:4, 758-770, DOI: 10.1080/10618600.2020.1740713.

Examples#

from QuadratiK.datasets import load_wireless_data
from QuadratiK.spherical_clustering import PKBC
from sklearn.preprocessing import LabelEncoder
X, y = load_wireless_data(return_X_y=True)
cluster_fit = PKBC(num_clust=4, random_state=42).fit(X)
print(cluster_fit)

<QuadratiK.spherical_clustering._pkbc.PKBC object at 0x72b216628790>

Methods

`PKBC.fit`(dat)	Performs Poisson Kernel-based Clustering.
`PKBC.plot`(num_clust[, y_true])	The method plot creates a 2D or 3D scatter plot with a circle or sphere as the surface and data points plotted on it.
`PKBC.predict`(X, num_clust)	Predict the cluster membership for each sample in X.
`PKBC.stats_clusters`(num_clust)	Function to generate descriptive statistics per variable (and per group if available).
`PKBC.summary`([print_fmt])	Summary function generates a table for the PKBC clustering.
`PKBC.validation`([y_true])	Computes validation metrics such as ARI, Macro Precision and Macro Recall when true labels are provided.

PKBC.fit(dat: ndarray | DataFrame) → PKBC#

Performs Poisson Kernel-based Clustering.

Parameters#

dat (numpy.ndarray, pandas.DataFrame): A numeric array of data values.

Returns#

self (object): Fitted estimator.

PKBC.plot(num_clust: int, y_true: ndarray | list | Series | None = None) → Figure | Figure#

The method plot creates a 2D or 3D scatter plot with a circle or sphere as the surface and data points plotted on it.

Parameters#

num_clust (int): Specifies the number of clusters to visualize.
y_true (numpy.ndarray, list, pandas.Series, optional): If y_true is None, the clusters are colored according to the predicted labels only. If y_true is provided, the clusters are colored according to the predicted and the true labels in separate subplots.

Returns#

Returns a 2D matplotlib figure object or 3D plotly figure object with data points plotted on it.

PKBC.predict(X: ndarray | DataFrame, num_clust: int) → tuple[ndarray, ndarray]#

Predict the cluster membership for each sample in X.

Parameters#

X (numpy.ndarray, pandas.DataFrame): New data for which to predict membership.
num_clust (int): Number of clusters to be used for prediction.

Returns#

(cluster probabilities, membership) (tuple): The first element is the cluster probabilities of the input samples. The second element is the predicted cluster membership of the new data.
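The shape of the returned tuple can be pictured with a standard mixture-model calculation: each sample's probability for cluster k is proportional to the mixing proportion times the density at that sample, and the membership is the most probable cluster. The sketch below is a plain-Python illustration under that assumption; predict_sketch and pkb_kernel are hypothetical names, and the 0-based labels are a choice made here, not necessarily the library's convention.

```python
def pkb_kernel(x, mu, rho):
    """Unnormalized Poisson kernel: (1 - rho^2) / ||x - rho * mu||^d.

    The sphere-area constant is omitted because it is the same for every
    cluster and cancels when the probabilities are normalized.
    """
    d = len(x)
    sq_dist = sum((xi - rho * mi) ** 2 for xi, mi in zip(x, mu))
    return (1.0 - rho ** 2) / sq_dist ** (d / 2.0)


def predict_sketch(X, alpha, mu, rho):
    """Return (cluster probabilities, membership) for unit vectors in X."""
    probs, labels = [], []
    for x in X:
        weighted = [a * pkb_kernel(x, m, r) for a, m, r in zip(alpha, mu, rho)]
        total = sum(weighted)
        row = [w / total for w in weighted]
        probs.append(row)
        # Membership is the index of the largest posterior probability.
        labels.append(max(range(len(row)), key=row.__getitem__))
    return probs, labels


# Two clusters on the circle, centered on the two coordinate axes.
X_new = [(1.0, 0.0), (0.0, 1.0)]
probs, labels = predict_sketch(
    X_new, alpha=[0.5, 0.5], mu=[(1.0, 0.0), (0.0, 1.0)], rho=[0.8, 0.8]
)
```

Each row of probs sums to 1, and labels picks the cluster whose centroid is closest to the sample in this symmetric setup.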

PKBC.stats_clusters(num_clust: int) → DataFrame#

Function to generate descriptive statistics per variable (and per group if available).

Parameters#

num_clust (int): Number of clusters for which the summary statistics should be shown.

Returns#

summary_stats_df (pandas.DataFrame): Dataframe of descriptive statistics.

PKBC.summary(print_fmt: str = 'simple') → str#

Summary function generates a table for the PKBC clustering.

Parameters#

print_fmt (str, optional): Format used for printing the output. Supports all options available in tabulate; see https://pypi.org/project/tabulate/. Defaults to 'simple'.

Returns#

summary (str): A string in the requested output format containing the log-likelihood, Euclidean WCSS, cosine WCSS, number of data points in each cluster, and mixing proportions for each number of clusters.

PKBC.validation(y_true: ndarray | None = None) → tuple[DataFrame, Figure]#

Computes validation metrics such as ARI, Macro Precision and Macro Recall when true labels are provided.

Parameters#

y_true (numpy.ndarray, optional): Array of true cluster memberships. Defaults to None.

Returns#

validation metrics (tuple): A tuple of a dataframe and elbow plots. The dataframe contains the following for each number of clusters:

Adjusted Rand Index (float, returned only when y_true is provided): Adjusted Rand Index computed between the true and predicted cluster memberships.
Macro Precision (float, returned only when y_true is provided): Macro Precision computed between the true and predicted cluster memberships.
Macro Recall (float, returned only when y_true is provided): Macro Recall computed between the true and predicted cluster memberships.
Average Silhouette Score (float): Mean Silhouette Coefficient of all samples.
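As a reminder of what the macro-averaged metrics measure, the sketch below computes per-class precision and recall and then takes their unweighted means. It assumes the predicted labels have already been matched to the true labels; macro_precision_recall is a hypothetical helper written for illustration, not the routine the library uses.

```python
def macro_precision_recall(y_true, y_pred):
    """Unweighted mean of per-class precision and recall."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in classes:
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        n_predicted = sum(1 for p in y_pred if p == c)
        n_actual = sum(1 for t in y_true if t == c)
        # A class that is never predicted (or never present) scores 0.
        precisions.append(true_pos / n_predicted if n_predicted else 0.0)
        recalls.append(true_pos / n_actual if n_actual else 0.0)
    return sum(precisions) / len(classes), sum(recalls) / len(classes)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
macro_p, macro_r = macro_precision_recall(y_true, y_pred)
```

Here one sample of class 1 is assigned to cluster 2, which lowers the recall of class 1 and the precision of class 2 while leaving class 0 untouched.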

References#

Rousseeuw, P.J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53-65.

Notes#

The predicted cluster labels are mapped to the true class labels (if provided) with a naive approach, which might not work well when num_clust is large. In such cases, use sklearn.metrics to compute the metrics and provide correctly matched labels.
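One possible way to produce matched labels yourself is to search for the assignment of predicted cluster ids to true class labels that maximizes agreement. The sketch below does this by brute force over permutations, which is only practical for a small number of clusters; best_label_mapping is a hypothetical helper, and it is not the mapping the library applies internally.

```python
from itertools import permutations


def best_label_mapping(y_true, y_pred):
    """Map each predicted cluster id to a distinct true label.

    Tries every one-to-one assignment and keeps the one with the most
    agreements. Cost grows factorially with the number of clusters.
    """
    true_labels = sorted(set(y_true))
    pred_labels = sorted(set(y_pred))
    if len(pred_labels) > len(true_labels):
        raise ValueError("more predicted clusters than true classes")
    best_map, best_hits = None, -1
    for perm in permutations(true_labels, len(pred_labels)):
        mapping = dict(zip(pred_labels, perm))
        hits = sum(1 for t, p in zip(y_true, y_pred) if mapping[p] == t)
        if hits > best_hits:
            best_map, best_hits = mapping, hits
    return best_map


y_true = ["a", "a", "b", "b", "c"]
y_pred = [2, 2, 0, 0, 1]
mapping = best_label_mapping(y_true, y_pred)
matched = [mapping[p] for p in y_pred]
```

The matched labels can then be passed to the metric functions in sklearn.metrics. For larger numbers of clusters, an assignment solver such as scipy.optimize.linear_sum_assignment applied to the contingency table avoids the factorial search.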
