Kernel partial correlation (KPC) coefficient measures the strength of conditional association between Y and Z given X, with X, Y, Z being random variables taking values in general topological spaces. As the name suggests, KPC is defined in terms of kernels on reproducing kernel Hilbert spaces (RKHSs). The population KPC is a deterministic number between 0 and 1; it is 0 if and only if Y is conditionally independent of Z given X, and it is 1 if and only if Y is a measurable function of Z and X. This R package provides implementations of two empirical versions of KPC when X, Z are Euclidean, and Y, possibly taking values in a general topological space, can be stored as a vector.
One empirical KPC estimator is based on geometric graphs, such as K-nearest neighbor graphs (K-NN) and minimum spanning trees (MST), and is consistent under very weak conditions. The other empirical estimator, defined using conditional mean embeddings (CMEs) as used in the RKHS literature, is also consistent under suitable conditions. Users are allowed to define arbitrary kernels (see examples below).
Using KPC we also provide a stepwise forward variable selection algorithm KFOCI (using the graph based estimator of KPC), as well as a similar stepwise forward selection algorithm based on the RKHS based estimator. For more details on KPC, its empirical estimators and its application on variable selection, see https://www.jmlr.org/papers/v23/21-493.html.
When X is empty, KPC measures the unconditional dependence between Y
and Z, which is also implemented in the package. The unconditional
graph-based estimator has been described in Deb, N., P. Ghosal, and
B. Sen (2020). Measuring association on topological spaces using kernels
and geometric graphs. It is implemented in the functions
KMAc
and Klin
in this package. The latter can
be computed in near linear time.
This package depends on R (>= 4.0.0). You can install the package
KPC by (the package devtools
needs to be installed
first):
::install_github("zh2395/KPC") devtools
You can uninstall the package by:
remove.packages("KPC")
Here we briefly introduce the functions in this package. See the documentation (help page) of the R package for more details.
KPCgraph
implements the KPC estimator based on geometric
graphs. The inputs are: Y
: a matrix of n rows;
X
: a matrix of n rows, or NULL if X is empty, in which case
it will return KMAc(Y,Z,k,Knn)
, which measures the
unconditional dependence between Y and Z. Z
: a matrix of n
rows; k
: a function of class kernel. It can be the kernel
implemented in kernlab
e.g., Gaussian kernel
rbfdot(sigma = 1)
, linear kernel vanilladot()
;
In practice, Gaussian kernel with empirical bandwidth
kernlab::rbfdot(1/(2*stats::median(stats::dist(Y))^2))
may
be a good choice. Knn
: a positive integer indicating the
number of nearest neighbor to use; or “MST”. A small Knn (e.g., Knn=1)
is recommended for an accurate estimate of the population KPC.
trans_inv
: whether k(y, y) is free of y.
library(kernlab)
library(KPC)
= 1000
n set.seed(1)
= runif(n)
x = runif(n)
z = (x + z) %% 1
y KPCgraph(Y = y, X = x, Z = z, k = rbfdot(5), Knn = 1, trans_inv = T)
# 0.9725613
# Theoretical KPC is 1 since y is a measurable function of x and z
KPCRKHS
implements the KPC estimator based on RKHS
method using CME formula. The inputs are Y
: a matrix of n
rows; X
: a matrix of n rows, or NULL if X is empty, in
which case the coefficient measures the unconditional association
between Y and Z; Z
: a matrix of n rows; ky
,
kx
, kxz
: the kernels used for the space of Y,
X, (X,Z) respectively; eps
: a small positive regularization
parameter for inverting the empirical cross-covariance operator;
appro
: whether to use incomplete Cholesky decomposition for
approximation; tol
: tolerance used for incomplete Cholesky
decomposition (implemented by the function inchol
in the
package kernlab
).
= 1000
n set.seed(1)
= rnorm(n)
x = rnorm(n)
z = x + z + rnorm(n,1,1)
y library(kernlab)
= vanilladot() # linear kernel
k KPCRKHS(y, x, z, k, k, k, 1e-3/n^(0.4), appro = F)
# 0.5158324 (Population quantity = 0.5)
KPCRKHS(y, x, z, k, k, k, 1e-3/n^(0.4), appro = T, tol = 1e-5)
# 0.5158324 (Population quantity = 0.5)
KFOCI
implements variable selection with KPC using
directed K-NN graph or MST. The inputs are Y
: a matrix of
responses (n by dy); X
: a matrix of predictors (n by dx);
k
: the kernel function used for Y; Knn
: a
positive integer indicating the number of nearest neighbor; or “MST”.
The suggested choice of Knn is 0.05n for samples up to a few hundred
observations. For large n, the suggested Knn is sublinear in n. That is,
it may grow slower than any linear function of n. The computing time is
approximately linear in Knn. A smaller Knn takes less time.
num_features
: the number of variables to be selected (which
cannot be larger than dx). The default value of
num_features
is NULL
and in that case it will
be set equal to dx; stop
: If stop == TRUE
,
then the automatic stopping criterion (stops at the first instance of
negative Tn, as mentioned in the paper) will be implemented and
continued till num_features
many variables are selected. If
stop == FALSE
then exactly num_features
many
variables are selected; numCores
: number of cores that are
going to be used for parallelizing the process; verbose
:
whether to print each selected variables during the forward stepwise
algorithm (default FALSE
);
It is suggested to normalize the predictors before applying
KFOCI
. KFOCI
returns a vector of the indices,
from 1,…,dx, of the selected variables in the same order that they were
selected. The variables at the front are expected to be more informative
in predicting Y.
= 200
n = 100
p set.seed(1)
= matrix(rnorm(n * p), ncol = p)
X = X[, 1] * X[, 2] + sin(X[, 1] * X[, 3])
Y KFOCI(Y, X, kernlab::rbfdot(1), Knn=1, numCores = 1)
# 1 2 3
# Install package 'olsrr'
= olsrr::surgical
surgical for (i in 1:9) surgical[,i] = (surgical[,i] - mean(surgical[,i]))/sd(surgical[,i])
colnames(surgical)[KFOCI(surgical[,9],surgical[,1:8],kernlab::rbfdot(1/(2*median(dist(surgical$y))^2)),Knn=1)]
# "enzyme_test" "pindex" "liver_test" "alc_heavy"
KPCRKHS_VS
performs a forward stepwise variable
selection using the RKHS based estimator of KPC. One needs to
pre-specify the number of variables to be selected. The inputs are:
Y
: a matrix of responses (n by dy); X
: a
matrix of predictors (n by dx); num_features
: the number of
variables to be selected, cannot be larger than dx; ky
: the
kernel function for Y; kS
: a function that takes X and a
subset of indices S as inputs, and then outputs the kernel for X_S. The
first argument of kS is X, and the second argument is a vector of
positive integer. If kS == NULL
, Gaussian kernel with
empitical bandwidth will be used, i.e.,
kernlab::rbfdot(1/(2*stats::median(stats::dist(X[,S]))^2))
;
eps
: a positive number, the regularization parameter for
RKHS based KPC estimator; appro
whether to use incomplete
Cholesky decomposition for approximation; tol
: tolerance
used for incomplete Cholesky decomposition (implemented by
inchol
in package kernlab
);
numCores
: number of cores that are going to be used for
parallelizing the process; verbose
: whether to print each
selected variables during the forward stepwise algorithm
The algorithm returns a vector of the indices from 1,…,dx of the selected variables in the same order that they were selected. The variables at the front are expected to be more informative in predicting Y.
= 200
n = 100
p set.seed(1)
= matrix(rnorm(n * p), ncol = p)
X = X[, 1] * X[, 2] + sin(X[, 1] * X[, 3])
Y library(kernlab)
= function(X,S) return(rbfdot(1/length(S)))
kS KPCRKHS_VS(Y, X, num_features = 3, rbfdot(1), kS, eps = 1e-3, appro = FALSE, numCores = 1)
# 1 2 3
= function(X,S) return(rbfdot(1/(2*stats::median(stats::dist(X[,S]))^2)))
kS KPCRKHS_VS(Y, X, num_features = 3, rbfdot(1), kS, eps = 1e-3, appro = FALSE, numCores = 1)
# 1 2 3
The medical data included in the package are collected from Edwards, D. (2012). Introduction to graphical modelling, Section 3.1.4, Springer Science & Business Media.
# load medical data
data("med")
for (i in 1:3) med[,i] = (med[,i] - mean(med[,i]))/sd(med[,i]) # normalization
library(kernlab)
KPCgraph(med$D,med$C,med$U,rbfdot(1/(2*median(dist(med$D))^2)),Knn=1,trans_inv=T)
# 0.04069334
# Theoretical KPC is 0 since D is independent of U given C
set.seed(1) # There is randomness in breaking the ties
KPCgraph(med$D,med$U,med$C,rbfdot(1/(2*median(dist(med$D))^2)),Knn=1,trans_inv=T)
# 0.3255831
# D is associated with C controlling U
# RKHS estimator
KPCRKHS(med$D,med$C,med$U,rbfdot(1/(2*median(dist(med$D))^2)),rbfdot(1/(2*median(dist(med$C))^2)), rbfdot(1/(2*median(dist(cbind(med$C,med$U)))^2)), 1e-2,F)
# 0.1502744
KPCRKHS(med$D,med$U,med$C,rbfdot(1/(2*median(dist(med$D))^2)),rbfdot(1/(2*median(dist(med$U))^2)), rbfdot(1/(2*median(dist(cbind(med$C,med$U)))^2)), 1e-2,F)
# 0.3852009
KPC can be defined for variables X, Y, Z taking values on general spaces, and if the variables can be stored as vectors, then the above functions can also be used. Here we consider the special orthogonal group SO(3), which consists of 3 by 3 orthogonal matrices with determinant 1. We store the matrices by concatenating the columns, and define a kernel on the vectorized SO(3) by
= function(vecA,vecB){
SO3ker = matrix(vecA,3,3)
A = matrix(vecB,3,3)
B = base::acos(min(1,max((sum(diag(t(B)%*%A))-1)/2,0)))
theta if (theta == 0 | theta == pi){
return(pi)
}return(theta*(pi-theta)/sin(theta))
}class(SO3ker) <- "kernel"
Let R1
, R3
be the rotation matrices around
x-axis and z-axis.
= function(x) {
R1 = cos(x)
cos_x = sin(x)
sin_x return(matrix(c(1,0,0,0,cos_x,sin_x,0,-sin_x,cos_x),3,3))
}= function(z) {
R3 = cos(z)
cos_z = sin(z)
sin_z return(matrix(c(cos_z,sin_z,0,-sin_z,cos_z,0,0,0,1),3,3))
}
= 1000
n set.seed(1)
= rnorm(n)
x = rnorm(n)
z = y2 = matrix(0,n,9)
y1 for (i in 1:n) {
= as.numeric(R1(x[i])%*%R3(z[i]))
y1[i,] = as.numeric(R1(x[i])%*%R3(rnorm(1)))
y2[i,]
}KPCgraph(y1,x,z,SO3ker,Knn = 1,trans_inv=T)
# 0.8547098
# y1 is a function of x and z
KPCgraph(y2,x,z,SO3ker,Knn = 1,trans_inv=T)
# 0.00914022
# y2 is conditionally independent of z given x
KPCRKHS(y1, x, z, SO3ker, rbfdot(1), rbfdot(0.5), 1e-5, appro = F)
# 0.6198004
KPCRKHS(y2, x, z, SO3ker, rbfdot(1), rbfdot(0.5), 1e-5, appro = F)
# 0.05227157
# Variable selection
= 100
n = 500
p set.seed(1)
= matrix(rnorm(n * p), ncol = p)
X = matrix(0,n,9)
y for (i in 1:n) {
= as.numeric(R1(X[i,1])%*%R3(X[i,2]))
y[i,]
}KFOCI(y, X, SO3ker, Knn=1, numCores = 1)
# 2 1
The 2017 Korea presidential election data, collected from https://github.com/OhmyNews/2017-Election, consists of the voting results earned by the top five candidates from 250 electoral districts in Korea. The top three candidates from three major parties representing progressivism, conservatism and centrism earned most of the votes, so we will focus on the proportion of votes earned by each of these three candidates among them (Y), which can be viewed as a histogram-valued response. The demographic information average age (X1), average years of education (X2), average housing price per square meter (X3) and average paid national health insurance premium (X4) are available for each electoral district.
data(ElecData)
= dim(ElecData)[1]/5
n = matrix(0, n, 3)
Y # Each row of Y is the proportion of votes earned by each of the top three candidates among them
for (i in 1:n) {
= ElecData$NumVote[(1+(i-1)*5):(3+(i-1)*5)]
num_vote = num_vote/sum(num_vote)
Y[i,]
}= ElecData[5*(1:n),4:7]
X for (i in 1:4) X[,i] = (X[,i] - mean(X[,i]))/sd(X[,i]) # normalize the data
set.seed(1)
5:8] = rnorm(n*4)
X[,
library(kernlab)
KFOCI(Y, X, rbfdot(1/(2*median(dist(Y))^2)), Knn=1, numCores = 1)
# 1 2 3 4
# define two kernels on histograms
= function(a,b) return(1/prod(a+b+1))
k1 = function(a,b) return(exp(-sum(sqrt(a+b))))
k2 class(k1) = class(k2) = "kernel"
KFOCI(Y, X, k1, Knn=4, numCores = 1)
# 1 2 4 3
KFOCI(Y, X, k2, Knn=4, numCores = 1)
# 1 2 4 3
KPCgraph(Y,X[,c(2,3,4)],X[,1],rbfdot(1/(2*median(dist(Y))^2)),Knn = 2,trans_inv=TRUE)
# 0.1543532
KPCRKHS(Y,X[,c(2,3,4)],X[,1],rbfdot(1/(2*median(dist(Y))^2)),rbfdot(1/(2*median(dist(X[,c(2,3,4)]))^2)), rbfdot(1/(2*median(dist(X[,1:4]))^2)), eps=1e-4, appro=F)
# 0.1473899
KPCgraph(Y,X[,c(1,2,4)],X[,3],rbfdot(1/(2*median(dist(Y))^2)),Knn = 2,trans_inv=TRUE)
# 0.05542749
KPCRKHS(Y,X[,c(1,2,4)],X[,3],rbfdot(1/(2*median(dist(Y))^2)),rbfdot(1/(2*median(dist(X[,c(1,2,4)]))^2)), rbfdot(1/(2*median(dist(X[,1:4]))^2)), eps=1e-4, appro=F)
# 0.06338199
# X3 and X4 are both measures of richness. The conditional association between X3 and Y given all other variables is weaker than that of X1 and Y.