select_h

Contents

select_h#

QuadratiK.kernel_test.select_h(x: ndarray | DataFrame, y: ndarray | DataFrame | None = None, alternative: str = 'location', method: str = 'subsampling', b: float = 0.8, num_iter: int = 150, delta_dim: ndarray | int = 1, delta: ndarray | None = None, h_values: ndarray | None = None, n_rep: int = 50, n_jobs: int = 8, quantile: float = 0.95, k_threshold: int = 10, power_plot: bool = False, random_state: int | None = None, mu: ndarray | None = None, sigma: ndarray | None = None) tuple[float, DataFrame] | tuple[float, DataFrame, Figure]#

This function computes the kernel bandwidth of the Gaussian kernel for the one sample, two-sample and k-sample kernel-based quadratic distance (KBQD) tests.

The function performs the selection of the optimal value for the tuning parameter h of the normal kernel function, for the two-sample and k-sample KBQD tests. It performs a small simulation study, generating samples according to the family of a specified alternative, for the chosen values of h_values and delta.

We consider target alternatives \(F_\delta(\hat{\mathbf{\mu}}, \hat{\mathbf{\Sigma}}, \hat{\mathbf{\lambda}})\), where \(\hat{\mathbf{\mu}}, \hat{\mathbf{\Sigma}}\) and \(\hat{\mathbf{\lambda}}\) indicate the location, covariance, and skewness parameter estimates from the pooled sample.

The available alternative options are:

  • location alternatives, \(F_\delta = SN_d(\hat{\mu} + \delta, \hat{\Sigma}, \hat{\lambda})\),

with \(\delta = 0.2, 0.3, 0.4\);

  • scale alternatives, \(F_\delta = SN_d(\hat{\mu}, \hat{\Sigma} \cdot \delta, \hat{\lambda})\),

with \(\delta = 1.1, 1.3, 1.5\);

  • skewness alternatives, \(F_\delta = SN_d(\hat{\mu}, \hat{\Sigma}, \hat{\lambda} + \delta)\),

with \(\delta = 0.2, 0.3, 0.6\). Note: Skewness is not available for the normality test.

Please see User Guide for more details.

Parameters#

xnumpy.ndarray or pandas.DataFrame

Data set of observations from X.

ynumpy.ndarray or pandas.DataFrame, optional

Data set of observations from Y for two sample test or set of labels in case of k-sample test.

alternativestr, optional

Family of alternative chosen for selecting h, must be one of “location”, “scale” and “skewness”. Defaults to “location”.

methodstr, optional.

The method used for critical value estimation, must be one of “subsampling”, “bootstrap”, or “permutation”. Defaults to “subsampling”.

bfloat, optional.

The size of the subsamples used in the subsampling algorithm. Defaults to 0.8 i.e. 0.8N samples are used, where N represents the total sample size.

num_iterint, optional.

The number of iterations to use for critical value estimation. Defaults to 150.

delta_dimint, numpy.ndarray, optional.

Array of coefficient of alternative with respect to each dimension. Defaults to 1.

deltanumpy.ndarray, optional.

Array of parameter values indicating chosen alternatives. Defaults to None.

h_valuesnumpy.ndarray, optional.

Values of the tuning parameter used for the selection. Defaults to None.

n_repint, optional. Defaults to 50.

Number of bootstrap replications.

n_jobsint, optional.

n_jobs specifies the maximum number of concurrently running workers. If 1 is given, no joblib parallelism is used at all, which is useful for debugging. For more information on joblib n_jobs refer to - https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html. Defaults to 8.

quantilefloat, optional.

Quantile to use for critical value estimation. Defaults to 0.95.

k_thresholdint.

Maximum number of groups allowed. Defaults to 10.

power_plotboolean, optional.

If True, plot is displayed the plot of power for values in h\_values and delta. Defaults to False.

random_stateint, None, optional.

Seed for random number generation. Defaults to None.

munumpy.ndarray, optional

Mean vector for the reference distribution. Mandatory for the normality test. Defaults to None.

sigmanumpy.ndarray, optional

Covariance matrix of the reference distribution. Mandatory for the normality test. Defaults to None.

Returns#

hfloat

The selected value of tuning parameter h.

h vs Power tablepandas.DataFrame

A table containing the h, delta and corresponding powers.

References#

Markatou, M., & Saraceno, G. (2024). A unified framework for multivariate two-sample and k-sample kernel-based quadratic distance goodness-of-fit tests. arXiv preprint arXiv:2407.16374.

Examples#

import numpy as np
from QuadratiK.kernel_test import select_h
np.random.seed(42)
X = np.random.randn(200, 2)
np.random.seed(42)
y = np.random.randint(0, 2, 200)
h_selected, all_values, power_plot = select_h(
    X, y, alternative='location', power_plot=True, random_state=42)
print("Selected h is: ", h_selected)
Selected h is:  0.8
../../_images/QuadratiK.kernel_test.select_h_0_1.png