QuadratiK#
Introduction#
The QuadratiK package, implemented in both R and Python, provides a comprehensive set of goodness-of-fit tests and a clustering technique based on kernel-based quadratic distances, along with algorithms for generating random samples from Poisson kernel-based distributions (PKBD). It includes:
Goodness-of-Fit Tests: The software implements one-, two-, and k-sample tests for goodness of fit, offering an efficient and mathematically sound way to assess the fit of probability distributions. It also supports tests for uniformity on the \(d\)-dimensional sphere based on Poisson kernel densities. The tests are particularly useful for large, high-dimensional datasets where assessing the fit of probability models is of interest. Specifically, we offer tests for normality, as well as two- and k-sample tests for the equality of two or more distributions, i.e. \(H_0: F_1 = F_2\) and \(H_0: F_1 = \ldots = F_k\) respectively. The proposed tests perform well in terms of level and power for contiguous alternatives, heavy-tailed distributions, and higher dimensions.
Poisson Kernel-Based Distribution (PKBD): The package also includes functionality for generating random samples from PKBD and computing the density value. A short guide on PKBD is included in the User Guide. For more details, see Golzy and Markatou (2020) and Sablica et al. (2023).
Clustering Algorithm for Spherical Data: The package incorporates a clustering algorithm specifically tailored for spherical data. The algorithm fits a mixture of Poisson kernel-based densities on the sphere, enabling effective clustering of spherical data or data that has been spherically transformed. It is especially useful when the data are noisy or when clusters overlap to a non-negligible degree.
Additional Features: The software also includes graphical functions that help users validate, visualize, and represent clustering results, which improves the interpretability and usability of the analysis.
User Interface: We also provide a dashboard application built using streamlit, allowing users to access the methods implemented in the package without the need for programming.
The R implementation is available on CRAN, with a corresponding GitHub repository.
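To make the two-sample hypothesis \(H_0: F_1 = F_2\) concrete, the following is a minimal standard-library sketch of a kernel-based distance between two samples with a permutation p-value. It is an illustration only and not the QuadratiK implementation: the Gaussian kernel, the bandwidth `h`, the permutation scheme, and every function name here (`gaussian_kernel`, `mean_kernel`, `two_sample_statistic`, `permutation_p_value`, `gaussian_sample`) are assumptions made for this example.

```python
import math
import random


def gaussian_kernel(x, y, h):
    """Gaussian kernel exp(-||x - y||^2 / (2 h^2)) for two equal-length points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * h * h))


def mean_kernel(xs, ys, h, skip_diagonal=False):
    """Average kernel value over all pairs, optionally skipping pairs with i == j."""
    total, count = 0.0, 0
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            if skip_diagonal and i == j:
                continue
            total += gaussian_kernel(x, y, h)
            count += 1
    return total / count


def two_sample_statistic(xs, ys, h=1.0):
    """Within-sample similarity minus twice the between-sample similarity.

    Values near zero are consistent with F_1 = F_2; large values suggest a difference.
    """
    return (mean_kernel(xs, xs, h, skip_diagonal=True)
            + mean_kernel(ys, ys, h, skip_diagonal=True)
            - 2.0 * mean_kernel(xs, ys, h))


def permutation_p_value(xs, ys, h=1.0, n_perm=100, seed=0):
    """Estimate a p-value by reshuffling the pooled sample (hypothetical helper)."""
    rng = random.Random(seed)
    observed = two_sample_statistic(xs, ys, h)
    pooled = list(xs) + list(ys)
    n = len(xs)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if two_sample_statistic(pooled[:n], pooled[n:], h) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


def gaussian_sample(rng, n, shift):
    """Draw n bivariate standard normal points, shifted by `shift` in each coordinate."""
    return [(rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)) for _ in range(n)]


rng = random.Random(42)
x = gaussian_sample(rng, 20, 0.0)
y = gaussian_sample(rng, 20, 3.0)
print("statistic:", two_sample_statistic(x, y))
print("permutation p-value:", permutation_p_value(x, y))
```

The double loops keep the sketch dependency-free but scale quadratically with the sample size, so this form is only suitable for small illustrative samples.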
Documentation#
The documentation is hosted on Read the Docs at https://quadratik.readthedocs.io/en/latest/
Installation using pip#
The package can be installed from PyPI using pip:
pip install QuadratiK
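After installing, one way to confirm that the package is visible to the current interpreter is to query its installed version through the standard library. This is a small sketch; `installed_version` is a hypothetical helper name, and the distribution name `QuadratiK` is taken from the pip command above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("QuadratiK")
print("QuadratiK version:", version if version else "not installed")
```

Running the check from inside the environment where pip was invoked avoids confusion between multiple Python installations.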
Usage Examples#
QuadratiK Examples: A collection of basic examples that demonstrate how to use the core functionalities of the QuadratiK package. Ideal for new users to get started quickly.
An Introduction to Poisson Kernel-Based distributions: A short introduction to the Poisson Kernel-Based distributions.
Random sampling from the Poisson kernel-based density: Learn how to generate random samples from the Poisson kernel-based density and visualize the results.
Usage Instructions for Dashboard Application: Step-by-step instructions on how to set up and use the QuadratiK dashboard application. This guide helps you interactively explore and analyze data using the dashboard’s features.
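As a companion to the PKBD guides listed above, the density described in Golzy and Markatou (2020) can be written as \(f(x \mid \mu, \rho) = (1 - \rho^2) / (\omega_d \, \lVert x - \rho\mu \rVert^d)\), where \(\omega_d\) is the surface area of the unit sphere in \(\mathbb{R}^d\). The sketch below evaluates this formula using only the standard library. It is not the package's API: `sphere_surface_area` and `pkbd_density` are hypothetical names chosen for this illustration, and the inputs are assumed to be unit vectors.

```python
import math


def sphere_surface_area(d):
    """Surface area of the unit sphere embedded in R^d: 2 * pi^(d/2) / Gamma(d/2)."""
    return 2.0 * math.pi ** (d / 2.0) / math.gamma(d / 2.0)


def pkbd_density(x, mu, rho):
    """Evaluate the PKBD density at unit vector x for mean direction mu and 0 <= rho < 1."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1)")
    d = len(x)
    # Euclidean distance between x and the scaled mean direction rho * mu.
    dist = math.sqrt(sum((xi - rho * mi) ** 2 for xi, mi in zip(x, mu)))
    return (1.0 - rho ** 2) / (sphere_surface_area(d) * dist ** d)


mu = (0.0, 0.0, 1.0)
print("density at the mean direction:", pkbd_density(mu, mu, 0.8))
print("density at the antipode:", pkbd_density((0.0, 0.0, -1.0), mu, 0.8))
```

With `rho = 0` the expression reduces to the uniform density on the sphere, and values of `rho` closer to 1 concentrate the mass around `mu`.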
Community#
Development Version Installation#
To install the development version of QuadratiK, you will need to download the code files from the master branch of the GitHub repository. Keep in mind that the development version may contain bugs or unstable features. For the latest stable release, we recommend installing via pip or downloading a release from GitHub.
Cloning the Repository#
To clone the master branch from GitHub, use the following command:
git clone https://github.com/rmj3197/QuadratiK.git
Poetry Setup#
QuadratiK uses the Poetry package manager for dependency management and installation. If you don’t have Poetry installed, you can install it by following the instructions in the Poetry Documentation.
Setting Up a Virtual Environment#
We strongly recommend creating a new virtual environment to isolate the QuadratiK installation and its dependencies from your system-wide Python environment. You can create a virtual environment using venv, virtualenv, or any other virtual environment manager of your choice. For example, using venv:
python3 -m venv quadratik-env
source quadratik-env/bin/activate # On Windows: quadratik-env\Scripts\activate
Activating the Poetry Environment#
Alternatively, once Poetry is installed, you can let it manage the virtual environment and activate that environment by running:
poetry shell
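This ensures that any commands you run are executed within the isolated environment. In Poetry 2.0 and later, the shell command is no longer bundled with Poetry itself and requires the poetry-plugin-shell plugin; poetry env activate, which prints the command needed to activate the environment, is the built-in alternative.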
Note that if you manage your own virtual environment externally, you do not need poetry shell: that environment is already activated and provides the correct Python interpreter.
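If it is unclear which interpreter is active, a quick standard-library check can tell whether Python is running inside a virtual environment. This is a generic sketch that applies to environments created by venv or recent versions of virtualenv; `in_virtual_environment` is a hypothetical helper name and not part of QuadratiK or Poetry.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtual_environment())
```

The printed interpreter path should point into the environment directory (for example quadratik-env) when activation succeeded.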
Installing Dependencies with Poetry#
After setting up your virtual environment and cloning the repository, navigate to the QuadratiK directory:
cd QuadratiK
You can install the project dependencies and set up the development environment by running:
poetry install
This command installs the dependencies specified in pyproject.toml along with the package itself, setting up the project for development.
Running Tests#
To verify that the development environment is configured correctly, run the project’s test suite:
poetry run pytest
This command uses Poetry to run pytest within the virtual environment, executing all the tests defined in the project.
Additional Notes#
If you encounter any issues during installation or while using the development version, please report them on the GitHub Issues page.
To keep your development environment up to date, periodically pull the latest changes from the master branch and run poetry update to update the dependencies.
Contributing Guide#
To contribute to QuadratiK, please follow the contribution guidelines provided in the repository.
Code of Conduct#
The code of conduct can be found at Code of Conduct.
License#
This project uses the GPL-3.0 license, with a full version of the license included in the repository.
Citation#
If you use this package, please consider citing it using the following entry:
@misc{saraceno2024goodnessoffitclusteringsphericaldata,
title={Goodness-of-Fit and Clustering of Spherical Data: the QuadratiK package in R and Python},
author={Giovanni Saraceno and Marianthi Markatou and Raktim Mukhopadhyay and Mojgan Golzy},
year={2024},
eprint={2402.02290},
archivePrefix={arXiv},
primaryClass={stat.CO},
url={https://arxiv.org/abs/2402.02290},
}
Funding Information#
This work has been supported by the Kaleida Health Foundation and the National Science Foundation.
References#
Saraceno, G., Markatou, M., Mukhopadhyay, R., & Golzy, M. (2024). Goodness-of-Fit and Clustering of Spherical Data: the QuadratiK package in R and Python. arXiv preprint arXiv:2402.02290.
Ding, Y., Markatou, M., & Saraceno, G. (2023). Poisson Kernel-Based Tests for Uniformity on the d-Dimensional Sphere. Statistica Sinica. DOI: 10.5705/ss.202022.0347.
Golzy, M., & Markatou, M. (2020). Poisson Kernel-Based Clustering on the Sphere: Convergence Properties, Identifiability, and a Method of Sampling. Journal of Computational and Graphical Statistics, 29(4), 758-770. DOI: 10.1080/10618600.2020.1740713.
Sablica, L., Hornik, K., & Leydold, J. (2023). Efficient sampling from the PKBD distribution. Electronic Journal of Statistics, 17(2), 2180-2209.
Markatou, M., & Saraceno, G. (2024). A unified framework for multivariate two-sample and k-sample kernel-based quadratic distance goodness-of-fit tests. DOI: 10.48550/arXiv.2407.16374v1