PKBC#
- class QuadratiK.spherical_clustering.PKBC(num_clust: int, max_iter: int = 300, stopping_rule: str = 'loglik', init_method: str = 'sampledata', num_init: int = 10, tol: float = 1e-07, random_state: int | None = None, n_jobs: int = 4)#
Poisson kernel-based clustering on the sphere. The class fits a mixture of Poisson kernel-based densities to data on the sphere and estimates the parameters of that mixture. The resulting estimates are then used to assign each data point its final cluster membership.
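To make the mixture components concrete, the sketch below evaluates a single Poisson kernel-based density in the form given by Golzy & Markatou (2020): (1 - rho^2) / (omega_d * ||x - rho * mu||^d), where omega_d is the surface area of the unit sphere in d dimensions. The function names pkbd_density and integrate_on_circle are illustrative and are not part of QuadratiK; this is not the library's implementation.

```python
import math


def pkbd_density(x, mu, rho):
    """Poisson kernel-based density at unit vector x (illustrative helper).

    mu is the unit-length centre and rho in [0, 1) is the concentration.
    """
    d = len(x)
    # Surface area of the unit sphere in R^d: 2 * pi^(d/2) / Gamma(d/2)
    omega_d = 2.0 * math.pi ** (d / 2) / math.gamma(d / 2)
    # Euclidean distance between x and the shrunken centre rho * mu
    dist = math.sqrt(sum((xi - rho * mi) ** 2 for xi, mi in zip(x, mu)))
    return (1.0 - rho ** 2) / (omega_d * dist ** d)


def integrate_on_circle(mu, rho, n=20000):
    """Midpoint-rule integral of the density over the unit circle (d = 2)."""
    step = 2.0 * math.pi / n
    total = 0.0
    for i in range(n):
        theta = (i + 0.5) * step
        total += pkbd_density((math.cos(theta), math.sin(theta)), mu, rho) * step
    return total


mass = integrate_on_circle(mu=(1.0, 0.0), rho=0.7)          # should be close to 1
peak = pkbd_density((1.0, 0.0), (1.0, 0.0), 0.7)            # density at the centre
antipode = pkbd_density((-1.0, 0.0), (1.0, 0.0), 0.7)       # density opposite the centre
uniform = pkbd_density((0.0, 1.0), (1.0, 0.0), 0.0)         # rho = 0 gives the uniform density
```

Larger values of rho concentrate the mass around mu, while rho = 0 reduces to the uniform density on the sphere.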
Parameters#
- num_clust : int, list, np.ndarray, range
Number of clusters. May be a single integer or a collection of integers; the fitted attributes are dictionaries keyed by each value.
- max_iter : int, optional
Maximum number of iterations before a run is terminated. Defaults to 300.
- stopping_rule : str, optional
String describing the stopping rule to be used within each run. Currently must be one of 'max', 'membership', or 'loglik'. Defaults to 'loglik'.
- init_method : str, optional
String describing the initialization method to be used. Currently must be 'sampledata'. Defaults to 'sampledata'.
- num_init : int, optional
Number of initializations. Defaults to 10.
- tol : float, optional
Threshold by which the log-likelihood must change for iterations to continue, if applicable. Defaults to 1e-7.
- random_state : int, None, optional
Determines random number generation for centroid initialization. Defaults to None.
- n_jobs : int, optional
Used only for computing the WCSS efficiently. n_jobs specifies the maximum number of concurrently running workers. If 1 is given, no joblib parallelism is used at all, which is useful for debugging. For more information on joblib n_jobs, see https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html. Defaults to 4.
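The interplay of max_iter, stopping_rule and tol can be sketched as a single check performed after each iteration of a run. The semantics below are assumptions inferred from the parameter descriptions ('max' runs until max_iter, 'membership' stops when assignments no longer change, 'loglik' stops when the log-likelihood changes by less than tol), and the helper should_stop is hypothetical rather than a QuadratiK function.

```python
def should_stop(rule, iteration, max_iter, prev_loglik, loglik,
                prev_labels, labels, tol=1e-7):
    """Return True when a run should stop (assumed semantics, not library code)."""
    if iteration >= max_iter:
        return True  # every rule is capped by max_iter
    if rule == "max":
        return False  # assumed: always use the full iteration budget
    if rule == "membership":
        # assumed: stop once cluster assignments are unchanged
        return prev_labels is not None and list(prev_labels) == list(labels)
    if rule == "loglik":
        # stop once the log-likelihood changes by less than tol
        return prev_loglik is not None and abs(loglik - prev_loglik) < tol
    raise ValueError(f"unknown stopping rule: {rule!r}")


# Log-likelihood barely moved, so 'loglik' stops while 'max' keeps going.
stop_loglik = should_stop("loglik", 5, 300, -120.00000001, -120.0, [0, 1], [0, 1])
stop_max = should_stop("max", 5, 300, -120.00000001, -120.0, [0, 1], [0, 1])
# Memberships changed between iterations, so 'membership' continues.
stop_memb = should_stop("membership", 5, 300, -130.0, -120.0, [0, 1], [1, 1])
```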
Attributes#
- alpha_dict
Estimated mixing proportions. A dictionary containing key-value pairs, where each key is an element from the num_clust vector, and each value is a numpy.ndarray of shape (n_clusters,).
- labels_dict
Final cluster membership assigned by the algorithm to each observation. A dictionary containing key-value pairs, where each key is an element from the num_clust vector, and each value is a numpy.ndarray of shape (n_samples,).
- log_lik_vecs_dict
Array of log-likelihood values for each initialization. A dictionary containing key-value pairs, where each key is an element from the num_clust vector, and each value is a numpy.ndarray of shape (num_init, ).
- loglik_dict
Maximum value of the log-likelihood function. A dictionary containing key-value pairs, where each key is an element from the num_clust vector, and each value is a float.
- mu_dict
Estimated centroids. A dictionary containing key-value pairs, where each key is an element from the num_clust vector, and each value is a numpy.ndarray of shape (n_clusters, n_features).
- num_iter_per_runs_dict
Number of E-M iterations per run. A dictionary containing key-value pairs, where each key is an element from the num_clust vector, and each value is a numpy.ndarray of shape (num_init, ).
- post_probs_dict
Posterior probabilities of each observation for the indicated clusters. A dictionary containing key-value pairs, where each key is an element from the num_clust vector, and each value is a numpy.ndarray of shape (n_samples, num_clust).
- rho_dict
Estimated concentration parameters rho. A dictionary containing key-value pairs, where each key is an element from the num_clust vector, and each value is a numpy.ndarray of shape (n_clusters,).
- euclidean_wcss_dict
Values of within-cluster sum of squares computed with Euclidean distance. A dictionary containing key-value pairs, where each key is an element from the num_clust vector, and each value is a float.
- cosine_wcss_dict
Values of within-cluster sum of squares computed with cosine similarity. A dictionary containing key-value pairs, where each key is an element from the num_clust vector, and each value is a float.
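As a rough illustration of the Euclidean within-cluster sum of squares, the sketch below sums the squared Euclidean distance from each observation to the centroid of its assigned cluster. The function name euclidean_wcss and the exact formula are assumptions made for illustration; the library's own computation, and its cosine-based variant, may differ in detail.

```python
def euclidean_wcss(X, labels, centroids):
    """Sum of squared Euclidean distances to the assigned centroid (illustrative)."""
    total = 0.0
    for x, k in zip(X, labels):
        # Squared distance between the observation and its cluster centroid
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[k]))
    return total


# Three points on the unit circle, two clusters with 0-based labels.
X = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
labels = [0, 0, 1]
centroids = [(1.0, 0.0), (-1.0, 0.0)]
wcss = euclidean_wcss(X, labels, centroids)
```

Plotting such a value against the number of clusters is the usual basis for an elbow plot.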
References#
Golzy M. & Markatou M. (2020) Poisson Kernel-Based Clustering on the Sphere: Convergence Properties, Identifiability, and a Method of Sampling, Journal of Computational and Graphical Statistics, 29:4, 758-770, DOI: 10.1080/10618600.2020.1740713.
Examples#
from QuadratiK.datasets import load_wireless_data
from QuadratiK.spherical_clustering import PKBC
from sklearn.preprocessing import LabelEncoder
X, y = load_wireless_data(return_X_y=True)
cluster_fit = PKBC(num_clust=4, random_state=42).fit(X)
print(cluster_fit)
<QuadratiK.spherical_clustering._pkbc.PKBC object at 0x72b216628790>
Methods
| fit(dat) | Performs Poisson Kernel-based Clustering. |
| plot(num_clust[, y_true]) | The method plot creates a 2D or 3D scatter plot with a circle or sphere as the surface and data points plotted on it. |
| predict(X, num_clust) | Predict the cluster membership for each sample in X. |
| stats_clusters(num_clust) | Function to generate descriptive statistics per variable (and per group if available). |
| summary([print_fmt]) | Summary function generates a table for the PKBC clustering. |
| validation([y_true]) | Computes validation metrics such as ARI, Macro Precision and Macro Recall when true labels are provided. |
- PKBC.fit(dat: ndarray | DataFrame) PKBC#
Performs Poisson Kernel-based Clustering.
Parameters#
- dat : numpy.ndarray, pandas.DataFrame
A numeric array of data values.
Returns#
- self : object
Fitted estimator.
- PKBC.plot(num_clust: int, y_true: ndarray | list | Series | None = None) Figure | Figure#
The method plot creates a 2D or 3D scatter plot with a circle or sphere as the surface and data points plotted on it.
Parameters#
- num_clust : int
Specifies the number of clusters to visualize.
- y_true : numpy.ndarray, list, pandas.Series, optional
If y_true is None, clusters are colored according to the predicted labels only.
If y_true is provided, clusters are colored according to the predicted and true labels in separate subplots.
Returns#
Returns a 2D matplotlib figure object or 3D plotly figure object with data points plotted on it.
- PKBC.predict(X: ndarray | DataFrame, num_clust: int) tuple[ndarray, ndarray]#
Predict the cluster membership for each sample in X.
Parameters#
- X : numpy.ndarray, pandas.DataFrame
New data for which to predict membership.
- num_clust : int
Number of clusters to be used for prediction.
Returns#
- (Cluster Probabilities, Membership) : tuple
The first element of the tuple holds the cluster probabilities of the input samples. The second element holds the predicted cluster membership of the new data.
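The relationship between the two returned elements can be sketched as follows: posterior probabilities come from weighting each component density by its mixing proportion and normalizing, and the membership is the component with the largest posterior. This is a minimal sketch under the assumption that prediction follows the standard mixture-model rule; the helpers pkbd_density and predict_memberships are illustrative names, labels here are 0-based, and the library's internals and label numbering may differ.

```python
import math


def pkbd_density(x, mu, rho):
    """Poisson kernel-based density on the unit sphere (illustrative helper)."""
    d = len(x)
    omega_d = 2.0 * math.pi ** (d / 2) / math.gamma(d / 2)  # sphere surface area
    dist = math.sqrt(sum((xi - rho * mi) ** 2 for xi, mi in zip(x, mu)))
    return (1.0 - rho ** 2) / (omega_d * dist ** d)


def predict_memberships(X, alpha, mu, rho):
    """Return (posterior probabilities, hard labels) for unit vectors in X."""
    probs, labels = [], []
    for x in X:
        # Mixing proportion times component density, for each cluster
        weighted = [a * pkbd_density(x, m, r) for a, m, r in zip(alpha, mu, rho)]
        total = sum(weighted)
        post = [w / total for w in weighted]  # normalize so each row sums to 1
        probs.append(post)
        labels.append(max(range(len(post)), key=post.__getitem__))  # argmax
    return probs, labels


# Two equally weighted clusters centred at opposite points of the circle.
alpha = [0.5, 0.5]
mu = [(1.0, 0.0), (-1.0, 0.0)]
rho = [0.8, 0.8]
X = [(math.cos(0.1), math.sin(0.1)), (math.cos(3.0), math.sin(3.0))]
probs, labels = predict_memberships(X, alpha, mu, rho)
```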
- PKBC.stats_clusters(num_clust: int) DataFrame#
Function to generate descriptive statistics per variable (and per group if available).
Parameters#
- num_clust : int
Number of clusters for which the summary statistics should be shown.
Returns#
- summary_stats_df : pandas.DataFrame
DataFrame of descriptive statistics.
- PKBC.summary(print_fmt: str = 'simple') str#
Summary function generates a table for the PKBC clustering.
Parameters#
- print_fmt : str, optional
Used for printing the output in the desired format. Supports all options available in tabulate; see https://pypi.org/project/tabulate/. Defaults to 'simple'.
Returns#
- summary : str
A string in the desired output format containing the log-likelihood, Euclidean WCSS, cosine WCSS, number of data points in each cluster, and mixing proportions for each number of clusters.
- PKBC.validation(y_true: ndarray | None = None) tuple[DataFrame, Figure]#
Computes validation metrics such as ARI, Macro Precision and Macro Recall when true labels are provided.
Parameters#
- y_true : numpy.ndarray, optional
Array of true cluster memberships. Defaults to None.
Returns#
- validation metrics : tuple
A tuple of a DataFrame and a figure of elbow plots. The DataFrame contains the following for each number of clusters:
- Adjusted Rand Index : float (returned only when y_true is provided)
Adjusted Rand Index computed between the true and predicted cluster memberships.
- Macro Precision : float (returned only when y_true is provided)
Macro Precision computed between the true and predicted cluster memberships.
- Macro Recall : float (returned only when y_true is provided)
Macro Recall computed between the true and predicted cluster memberships.
- Average Silhouette Score : float
Mean Silhouette Coefficient of all samples.
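For readers unfamiliar with the macro-averaged metrics, the sketch below computes macro precision and macro recall from scratch: precision and recall are computed per class and then averaged with equal weight. It assumes the predicted labels have already been mapped to the true classes; macro_precision_recall is an illustrative name, not a QuadratiK function.

```python
def macro_precision_recall(y_true, y_pred):
    """Unweighted mean of per-class precision and recall (illustrative)."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted_c = sum(1 for p in y_pred if p == c)
        actual_c = sum(1 for t in y_true if t == c)
        # A class with no predictions (or no true members) contributes 0
        precisions.append(tp / predicted_c if predicted_c else 0.0)
        recalls.append(tp / actual_c if actual_c else 0.0)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
macro_p, macro_r = macro_precision_recall(y_true, y_pred)
# Per class: precision (1, 1, 2/3), recall (1, 1/2, 1)
```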
References#
Rousseeuw, P.J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53-65.
Notes#
A naive approach is used to map the predicted cluster labels to the true class labels (if provided). This may not work when num_clust is large. In such cases, compute the metrics with sklearn.metrics and provide correctly matched labels.
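One way to obtain correctly matched labels for a small number of clusters is an exhaustive search over one-to-one mappings from predicted clusters to true classes, keeping the mapping with the most agreements. The helper match_labels below is a hypothetical sketch and is not the mapping used inside validation. Its cost grows factorially with the number of clusters, so for larger problems an assignment solver such as scipy.optimize.linear_sum_assignment applied to the contingency table is the more practical choice.

```python
from itertools import permutations


def match_labels(y_true, y_pred):
    """Relabel y_pred with the one-to-one mapping that best agrees with y_true."""
    true_classes = sorted(set(y_true))
    pred_clusters = sorted(set(y_pred))
    if len(pred_clusters) > len(true_classes):
        raise ValueError("more predicted clusters than true classes")
    best_map, best_hits = None, -1
    # Try every injective assignment of predicted clusters to true classes
    for perm in permutations(true_classes, len(pred_clusters)):
        mapping = dict(zip(pred_clusters, perm))
        hits = sum(1 for t, p in zip(y_true, y_pred) if mapping[p] == t)
        if hits > best_hits:
            best_map, best_hits = mapping, hits
    return [best_map[p] for p in y_pred], best_map


y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = [2, 2, 0, 0, 1, 0]  # arbitrary cluster ids from a clustering run
matched, mapping = match_labels(y_true, y_pred)
```

The relabelled predictions can then be passed to any metric that expects predicted and true labels to share the same label set.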
See also#
sklearn.metrics : Scikit-learn metrics functionality supports a wide range of metrics.