select_h#

QuadratiK.kernel_test.select_h(x: ndarray | DataFrame, y: ndarray | DataFrame | None = None, alternative: str = 'location', method: str = 'subsampling', b: float = 0.8, num_iter: int = 150, delta_dim: ndarray | int = 1, delta: ndarray | None = None, h_values: ndarray | None = None, n_rep: int = 50, n_jobs: int = 8, quantile: float = 0.95, k_threshold: int = 10, power_plot: bool = False, random_state: int | None = None, mu: ndarray | None = None, sigma: ndarray | None = None) → tuple[float, DataFrame] | tuple[float, DataFrame, Figure]#

This function selects the bandwidth h of the Gaussian (normal) kernel for the one-sample, two-sample and k-sample kernel-based quadratic distance (KBQD) tests.

The optimal value of the tuning parameter h is selected through a small simulation study: samples are generated from the specified family of alternatives, and the power of the test is estimated for each combination of the values in h_values and delta.
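The idea of such a simulation study can be pictured with the minimal sketch below. It is not QuadratiK's implementation: the toy Gaussian-kernel statistic, the permutation critical value, the helper names (`kernel_matrix`, `rejects`, `power_table`, `pick_h`) and the selection rule (take the first h, scanning from the smallest delta, whose estimated power reaches 0.5) are all assumptions made for illustration.

```python
import numpy as np


def kernel_matrix(z, h):
    """Gaussian kernel matrix with bandwidth h for the rows of z."""
    sq = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * h ** 2))


def stat_from_kernel(k, idx_x, idx_y):
    """Toy two-sample kernel distance (V-statistic form), not the KBQD statistic."""
    return (k[np.ix_(idx_x, idx_x)].mean()
            + k[np.ix_(idx_y, idx_y)].mean()
            - 2.0 * k[np.ix_(idx_x, idx_y)].mean())


def rejects(x, y, h, num_iter, quantile, rng):
    """Permutation test: reject if the observed statistic exceeds the null quantile."""
    z = np.vstack([x, y])
    k = kernel_matrix(z, h)
    n, m = len(x), len(z)
    observed = stat_from_kernel(k, np.arange(n), np.arange(n, m))
    null_stats = np.empty(num_iter)
    for i in range(num_iter):
        perm = rng.permutation(m)
        null_stats[i] = stat_from_kernel(k, perm[:n], perm[n:])
    return bool(observed > np.quantile(null_stats, quantile))


def power_table(h_values, deltas, n=40, d=2, n_rep=20, num_iter=50,
                quantile=0.95, seed=0):
    """Estimate power for every (h, delta) pair under a location shift."""
    rng = np.random.default_rng(seed)
    rows = []
    for delta in deltas:
        for h in h_values:
            hits = 0
            for _ in range(n_rep):
                x = rng.standard_normal((n, d))
                y = rng.standard_normal((n, d)) + delta  # shifted sample
                hits += int(rejects(x, y, h, num_iter, quantile, rng))
            rows.append((h, delta, hits / n_rep))
    return rows


def pick_h(rows, threshold=0.5):
    """Assumed rule: first h (rows ordered by delta, then h) with power >= threshold."""
    for h, _delta, power in rows:
        if power >= threshold:
            return h
    return None


rows = power_table(h_values=[0.5, 1.0, 2.0], deltas=[0.5, 1.0])
print("h, delta, power:", rows)
print("picked h:", pick_h(rows))
```

The kernel matrix is computed once per replication and reused for every permutation, which keeps the toy study cheap enough to run on a grid.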

We consider target alternatives \(F_\delta(\hat{\mathbf{\mu}}, \hat{\mathbf{\Sigma}}, \hat{\mathbf{\lambda}})\), where \(\hat{\mathbf{\mu}}, \hat{\mathbf{\Sigma}}\) and \(\hat{\mathbf{\lambda}}\) indicate the location, covariance, and skewness parameter estimates from the pooled sample.

The available alternative options, where \(SN_d\) denotes the d-dimensional skew-normal distribution, are:

location alternatives, \(F_\delta = SN_d(\hat{\mu} + \delta, \hat{\Sigma}, \hat{\lambda})\), with \(\delta = 0.2, 0.3, 0.4\);

scale alternatives, \(F_\delta = SN_d(\hat{\mu}, \hat{\Sigma} \cdot \delta, \hat{\lambda})\), with \(\delta = 1.1, 1.3, 1.5\);

skewness alternatives, \(F_\delta = SN_d(\hat{\mu}, \hat{\Sigma}, \hat{\lambda} + \delta)\), with \(\delta = 0.2, 0.3, 0.6\).

Note: The skewness alternative is not available for the normality test.

See the User Guide for more details.
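As a small illustration of the three families, the sketch below applies each perturbation to a set of parameter estimates. The function name `perturb_parameters` and the variable names are hypothetical and are not part of QuadratiK; only the formulas and the delta grids are taken from the description above.

```python
import numpy as np

# Delta grids listed above for each family of alternatives.
DEFAULT_DELTAS = {
    "location": [0.2, 0.3, 0.4],
    "scale": [1.1, 1.3, 1.5],
    "skewness": [0.2, 0.3, 0.6],
}


def perturb_parameters(mu_hat, sigma_hat, lambda_hat, delta, alternative="location"):
    """Return the (location, covariance, skewness) parameters of F_delta."""
    mu_hat = np.asarray(mu_hat, dtype=float)
    sigma_hat = np.asarray(sigma_hat, dtype=float)
    lambda_hat = np.asarray(lambda_hat, dtype=float)
    if alternative == "location":
        return mu_hat + delta, sigma_hat, lambda_hat
    if alternative == "scale":
        return mu_hat, sigma_hat * delta, lambda_hat
    if alternative == "skewness":
        return mu_hat, sigma_hat, lambda_hat + delta
    raise ValueError("alternative must be 'location', 'scale' or 'skewness'")


mu, sigma, lam = np.zeros(2), np.eye(2), np.zeros(2)
for d in DEFAULT_DELTAS["scale"]:
    _, sigma_d, _ = perturb_parameters(mu, sigma, lam, d, alternative="scale")
    print(d, sigma_d.diagonal())
```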

Parameters#

x (numpy.ndarray or pandas.DataFrame): Data set of observations from X.
y (numpy.ndarray or pandas.DataFrame, optional): Data set of observations from Y for the two-sample test, or the set of group labels for the k-sample test. Defaults to None.
alternative (str, optional): Family of alternatives used for selecting h; must be one of “location”, “scale” or “skewness”. Defaults to “location”.
method (str, optional): Method used for critical value estimation; must be one of “subsampling”, “bootstrap” or “permutation”. Defaults to “subsampling”.
b (float, optional): Size of the subsamples used in the subsampling algorithm, as a fraction of the sample. Defaults to 0.8, i.e. 0.8N observations are used, where N is the total sample size.
num_iter (int, optional): Number of iterations used for critical value estimation. Defaults to 150.
delta_dim (int or numpy.ndarray, optional): Coefficients of the alternative with respect to each dimension. Defaults to 1.
delta (numpy.ndarray, optional): Array of parameter values indicating the chosen alternatives. Defaults to None.
h_values (numpy.ndarray, optional): Values of the tuning parameter h considered for the selection. Defaults to None.
n_rep (int, optional): Number of bootstrap replications. Defaults to 50.
n_jobs (int, optional): Maximum number of concurrently running workers. If 1 is given, no joblib parallelism is used at all, which is useful for debugging. For more information on joblib's n_jobs, see https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html. Defaults to 8.
quantile (float, optional): Quantile used for critical value estimation. Defaults to 0.95.
k_threshold (int, optional): Maximum number of groups allowed. Defaults to 10.
power_plot (bool, optional): If True, the plot of power for the values in h_values and delta is displayed. Defaults to False.
random_state (int or None, optional): Seed for random number generation. Defaults to None.
mu (numpy.ndarray, optional): Mean vector of the reference distribution. Mandatory for the normality test. Defaults to None.
sigma (numpy.ndarray, optional): Covariance matrix of the reference distribution. Mandatory for the normality test. Defaults to None.
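To make the roles of b, num_iter and quantile concrete, here is a generic sketch of a subsampling critical value. It assumes subsamples of size floor(b * N) drawn without replacement and uses a toy statistic; the function name `subsampling_critical_value` is hypothetical, and QuadratiK's internal procedure may differ.

```python
import numpy as np


def subsampling_critical_value(data, statistic, b=0.8, num_iter=150,
                               quantile=0.95, seed=None):
    """Quantile of a statistic recomputed on num_iter random subsamples."""
    rng = np.random.default_rng(seed)
    n = len(data)
    m = int(b * n)  # assumption: subsample size is floor(b * N)
    stats = np.empty(num_iter)
    for i in range(num_iter):
        idx = rng.choice(n, size=m, replace=False)  # draw without replacement
        stats[i] = statistic(data[idx])
    return float(np.quantile(stats, quantile))


# Toy statistic: Euclidean norm of the subsample mean (illustrative only).
data = np.random.default_rng(0).standard_normal((100, 2))
cv = subsampling_critical_value(
    data, lambda s: float(np.linalg.norm(s.mean(axis=0))), seed=1)
print("critical value:", cv)
```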

Returns#

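h (float): The selected value of the tuning parameter h.
h vs Power table (pandas.DataFrame): A table containing h, delta and the corresponding powers.
power plot (Figure): The plot of power for the values in h_values and delta. Returned only when power_plot is True.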
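A table of this kind is often easier to read as a grid of h against delta. The sketch below uses made-up numbers and assumes the columns are named `h`, `delta` and `power`; the actual column names of the returned DataFrame are not stated here, so adjust them as needed.

```python
import pandas as pd

# Illustrative values only; these are not results produced by select_h.
table = pd.DataFrame({
    "h": [0.4, 0.8, 1.2, 0.4, 0.8, 1.2],
    "delta": [0.2, 0.2, 0.2, 0.3, 0.3, 0.3],
    "power": [0.10, 0.35, 0.40, 0.30, 0.65, 0.70],
})

# One row per h, one column per delta, cells hold the estimated power.
grid = table.pivot(index="h", columns="delta", values="power")
print(grid)

# For each delta, the h with the highest estimated power.
best_per_delta = grid.idxmax()
print(best_per_delta)
```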

References#

Markatou, M., & Saraceno, G. (2024). A unified framework for multivariate two-sample and k-sample kernel-based quadratic distance goodness-of-fit tests. arXiv preprint arXiv:2407.16374.

Examples#

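import numpy as np
from QuadratiK.kernel_test import select_h

# Simulate 200 bivariate observations and a binary group label for each one.
np.random.seed(42)
X = np.random.randn(200, 2)
np.random.seed(42)
y = np.random.randint(0, 2, 200)

# With power_plot=True, three values are returned: h, the power table and the figure.
h_selected, all_values, power_plot = select_h(
    X, y, alternative='location', power_plot=True, random_state=42)
print("Selected h is: ", h_selected)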

Selected h is:  0.8

[Figure: power plot produced by the example above]
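For the two-sample case, a call might look like the sketch below, where y is a second data set rather than a vector of labels. The helper names `make_two_sample_data` and `select_h_two_sample` are hypothetical, the grids passed to delta and h_values are arbitrary choices, and the call has not been checked against a specific QuadratiK release; only the keyword names come from the signature above.

```python
import numpy as np


def make_two_sample_data(n=100, d=2, shift=0.5, seed=0):
    """Two simulated samples: standard normal, and standard normal plus a shift."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    y = rng.standard_normal((n, d)) + shift
    return x, y


def select_h_two_sample(x, y):
    """Call select_h with a user-supplied grid of alternatives and bandwidths."""
    # Imported here so the data helper can be used without QuadratiK installed.
    from QuadratiK.kernel_test import select_h

    # power_plot defaults to False, so two values are returned.
    h, table = select_h(
        x, y,
        alternative="scale",
        delta=np.array([1.1, 1.3, 1.5]),          # arbitrary illustrative grid
        h_values=np.array([0.4, 0.8, 1.2, 1.6]),  # arbitrary illustrative grid
        n_jobs=1,                                  # no joblib parallelism
        random_state=0,
    )
    return h, table
```

With QuadratiK installed, `select_h_two_sample(*make_two_sample_data())` would return the selected h and the power table. Setting n_jobs=1 keeps the run single-process, which the parameter description recommends for debugging.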
