The univariate entropy for a discrete variable \(X\) with \(r\) outcomes is defined by \[H(X) = \sum_x p(x) \log_2\frac{1}{p(x)} \] and can be used to check for redundancy and uniformity. A discrete random variable with the minimal entropy of zero has no uncertainty: it always takes the same single outcome. It is therefore a constant that contributes nothing to further analysis and can be omitted. The maximum entropy is \(\log_2 r\), attained by a uniform probability distribution over the \(r\) outcomes.
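As a minimal sketch of the definition above, the following Python snippet estimates the univariate entropy from observed relative frequencies. The function name `entropy` and the three toy variables are hypothetical and chosen only for illustration; this is not the package's own implementation.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Empirical entropy (in bits) of a sequence of discrete outcomes."""
    n = len(values)
    counts = Counter(values)
    # Sum p * log2(1/p) over the observed outcomes, with p = count / n.
    return sum((c / n) * log2(n / c) for c in counts.values())

# Hypothetical toy variables (assumptions, not data from the text).
constant = [3, 3, 3, 3]   # a single outcome: no uncertainty
uniform = [0, 1, 2, 3]    # r = 4 equally likely outcomes
skewed = [0, 0, 0, 1]     # neither constant nor uniform

h_constant = entropy(constant)   # minimal entropy, zero
h_uniform = entropy(uniform)     # maximal entropy, log2(4)
h_skewed = entropy(skewed)       # strictly between the two extremes
```

The constant variable attains the lower limit of zero and the uniform variable attains the upper limit \(\log_2 r\) with \(r = 4\); any other distribution over four outcomes falls strictly in between.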
The bivariate entropy for discrete variables \(X\) and \(Y\) is defined by \[H(X,Y) = \sum_x \sum_y p(x,y) \log_2\frac{1}{p(x,y)}\] with which we can check for redundancy, functional relationships and stochastic independence between pairs of variables. It is bounded according to \[H(X) \leq H(X,Y) \leq H(X)+H(Y)\] where we have
equality to the left iff there is a functional relationship \(Y = f(X)\), so that each outcome of \(X\) determines exactly one outcome of \(Y\);
equality to the right iff \(X\) and \(Y\) are stochastically independent, \(X\bot Y\), so that the probability of any bivariate outcome is the product of the probabilities of the corresponding univariate outcomes.
Note that when the bivariate entropy of two variables equals the univariate entropy of either one alone, the pair carries no more information than that single variable, so one of the two variables is redundant and should be omitted.
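The two equality cases in the bounds can be illustrated with a small sketch. The helper `entropy` and the toy columns `x`, `y_func` and `y_indep` are hypothetical names introduced here; the data are constructed so that one column is a function of `x` and the other is independent of it.

```python
from collections import Counter
from math import log2

def entropy(*columns):
    """Joint empirical entropy (in bits) of one or more equally long columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return sum((c / n) * log2(n / c) for c in Counter(rows).values())

# Hypothetical toy data (assumptions, not data from the text).
x = [0, 0, 1, 1, 2, 2, 3, 3]
y_func = [v % 2 for v in x]          # Y = f(X): fully determined by X
y_indep = [0, 1, 0, 1, 0, 1, 0, 1]   # every (x, y) cell has probability p(x) * p(y)

h_x = entropy(x)
h_func = entropy(x, y_func)                   # attains the lower bound H(X)
h_indep = entropy(x, y_indep)                 # attains the upper bound H(X) + H(Y)
upper_indep = entropy(x) + entropy(y_indep)
h_y_func = entropy(y_func)                    # smaller than H(X, Y): X is not a function of Y
```

In this sketch `h_func` equals `h_x`, so `y_func` adds nothing once `x` is known, whereas `h_func` exceeds `h_y_func`, so the relationship is not reversible. For the independent pair, `h_indep` equals `upper_indep`.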
These results on bivariate entropies are directly linked to joint entropies and association graphs.
Similarly, trivariate entropies (and higher order entropies) allow us to check for functional relationships and stochastic independence between three (or more) variables. The trivariate entropy of three variables \(X\), \(Y\) and \(Z\) is defined by \[H(X,Y,Z) = \sum_x \sum_y \sum_z p(x,y,z) \log_2\frac{1}{p(x,y,z)}\]
and bounded by \[ H(X,Y) \leq H(X,Y,Z) \leq H(X,Z) + H(Y,Z) - H(Z). \]
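The trivariate bounds can be checked numerically in the same way. The sketch below uses hypothetical toy columns and a hypothetical helper `bounds`; one third variable is built as a function of the other two and the other is built to vary freely, so that each bound is attained once.

```python
from collections import Counter
from math import log2

def entropy(*columns):
    """Joint empirical entropy (in bits) of one or more equally long columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return sum((c / n) * log2(n / c) for c in Counter(rows).values())

def bounds(x, y, z):
    """Return (lower bound, H(X,Y,Z), upper bound) for the trivariate entropy."""
    lower = entropy(x, y)
    upper = entropy(x, z) + entropy(y, z) - entropy(z)
    return lower, entropy(x, y, z), upper

# Hypothetical toy data (assumptions, not data from the text).
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 1, 1, 0, 0, 1, 1]
z_func = [a ^ b for a, b in zip(x, y)]   # Z = f(X, Y), here exclusive or
z_free = [0, 1, 0, 1, 0, 1, 0, 1]        # all eight (x, y, z) triples occur once

low_f, h_f, up_f = bounds(x, y, z_func)  # H(X,Y,Z) equals the lower bound
low_i, h_i, up_i = bounds(x, y, z_free)  # H(X,Y,Z) equals the upper bound
```

With `z_func` the trivariate entropy coincides with \(H(X,Y)\), since \(Z\) is determined by \(X\) and \(Y\). With `z_free` it coincides with \(H(X,Z) + H(Y,Z) - H(Z)\), which here reflects that \(X\) and \(Y\) are independent given \(Z\).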
The results on bivariate and trivariate entropies are directly linked to prediction power and expected conditional entropies.
Examples of computing univariate, bivariate and trivariate entropies are given in the following.
We create a dataframe dyad.var consisting of dyad variables as described and created in variable domains and data editing.
Similar analyses can be performed on observed and/or transformed
dataframes with vertex or triad variables.
## status gender office years age practice lawschool cowork advice friend
## 1 3 3 0 8 8 1 0 0 3 2
## 2 3 3 3 5 8 3 0 0 0 0
## 3 3 3 3 5 8 2 0 0 1 0
## 4 3 3 0 8 8 1 6 0 1 2
## 5 3 3 0 8 8 0 6 0 1 1
## 6 3 3 1 7 8 1 6 0 1 1
The function entropy_bivar() computes the bivariate
entropies of all pairs of variables in the dataframe. The output is
given as an upper triangular matrix with cells giving the bivariate
entropies of row and column variables. The diagonal thus gives the
univariate entropies for each variable in the dataframe:
## status gender office years age practice lawschool cowork advice
## status 1.493 2.868 3.640 3.370 3.912 3.453 4.363 2.092 2.687
## gender NA 1.547 3.758 3.939 4.274 3.506 4.439 2.158 2.785
## office NA NA 2.239 4.828 4.901 4.154 5.058 2.792 3.388
## years NA NA NA 2.671 4.857 4.582 5.422 3.268 3.868
## age NA NA NA NA 2.801 4.743 5.347 3.411 4.028
## practice NA NA NA NA NA 1.962 4.880 2.530 3.127
## lawschool NA NA NA NA NA NA 2.953 3.567 4.186
## cowork NA NA NA NA NA NA NA 0.615 1.687
## advice NA NA NA NA NA NA NA NA 1.248
## friend NA NA NA NA NA NA NA NA NA
## friend
## status 2.324
## gender 2.415
## office 3.044
## years 3.483
## age 3.637
## practice 2.831
## lawschool 3.812
## cowork 1.456
## advice 1.953
## friend 0.881
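To make the layout of such a matrix concrete, here is a rough Python sketch that builds an upper triangular matrix of bivariate entropies for a small table. It is not the implementation of entropy_bivar(); the function `bivariate_entropy_matrix`, the dictionary `toy` and its column names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(*columns):
    """Joint empirical entropy (in bits) of one or more equally long columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return sum((c / n) * log2(n / c) for c in Counter(rows).values())

def bivariate_entropy_matrix(data):
    """Upper triangular matrix: H(X_i) on the diagonal, H(X_i, X_j) above it."""
    names = list(data)
    matrix = [[None] * len(names) for _ in names]   # None plays the role of NA
    for i, a in enumerate(names):
        for j in range(i, len(names)):
            # For i == j the joint entropy H(X_i, X_i) reduces to H(X_i).
            matrix[i][j] = round(entropy(data[a], data[names[j]]), 3)
    return names, matrix

# Hypothetical toy table with three variables (an assumption for illustration).
toy = {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1], "c": [0, 0, 0, 1]}
names, matrix = bivariate_entropy_matrix(toy)
```

The diagonal cells hold the univariate entropies because pairing a variable with itself adds no new outcomes, which is the same reading as for the matrix shown above.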
Bivariate entropies can be used to detect redundant variables that
should be omitted from the dataframe for further analysis. When
calculating bivariate entropies, one can check whether the diagonal
values are equal to any of the other values in the corresponding rows
and columns. As seen above, the dataframe dyad.var has no redundant
variables. This can also be checked using the function redundancy(),
which yields a binary matrix indicating which row and column variables
hold the same information, or a message when there are none:
## no redundant variables
## NULL
To illustrate an example with redundancy, we use the dataframe
att.var with node attributes as described and created in variable
domains and data editing. Note, however, that we now keep the variable
senior in this dataframe:
## senior status gender office years age practice lawschool
## 1 1 0 1 0 2 2 1 0
## 2 2 0 1 0 2 2 0 0
## 3 3 0 1 1 1 2 1 0
## 4 4 0 1 0 2 2 0 2
## 5 5 0 1 1 2 2 1 1
## 6 6 0 1 1 2 2 1 0
Checking redundancy on this dataframe yields the following output:
## senior status gender office years age practice lawschool
## senior 0 1 1 1 1 1 1 1
## status 0 0 0 0 0 0 0 0
## gender 0 0 0 0 0 0 0 0
## office 0 0 0 0 0 0 0 0
## years 0 0 0 0 0 0 0 0
## age 0 0 0 0 0 0 0 0
## practice 0 0 0 0 0 0 0 0
## lawschool 0 0 0 0 0 0 0 0
As seen, senior has been flagged as a redundant variable,
which is not surprising since it consists only of unique values. This
redundancy can also be noted by computing the bivariate entropies and
observing that the univariate entropy of this variable equals the
bivariate entropy of every pair that includes it:
## senior status gender office years age practice lawschool
## senior 6.15 6.15 6.150 6.150 6.150 6.150 6.150 6.150
## status NA 1.00 1.695 2.084 2.007 2.276 1.981 2.459
## gender NA NA 0.817 1.927 2.226 2.383 1.799 2.323
## office NA NA NA 1.125 2.693 2.668 2.088 2.607
## years NA NA NA NA 1.585 2.750 2.555 3.012
## age NA NA NA NA NA 1.585 2.558 2.876
## practice NA NA NA NA NA NA 0.983 2.513
## lawschool NA NA NA NA NA NA NA 1.533
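The pattern in this output, where an identifier-like variable has the same entropy alone as in every pair, can be reproduced on a toy table. The sketch below flags cell (row, column) when the bivariate entropy equals the univariate entropy of the row variable. This is one plausible way to code such a check and is an assumption rather than the actual logic of redundancy(); the names `redundancy_matrix`, `toy` and `tol` are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(*columns):
    """Joint empirical entropy (in bits) of one or more equally long columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return sum((c / n) * log2(n / c) for c in Counter(rows).values())

def redundancy_matrix(data, tol=1e-9):
    """Binary matrix with 1 in cell (i, j) when H(X_i, X_j) equals H(X_i), i != j."""
    names = list(data)
    flags = [[0] * len(names) for _ in names]
    for i, a in enumerate(names):
        h_a = entropy(data[a])
        for j, b in enumerate(names):
            # Compare with a tolerance because entropies are floating point sums.
            if i != j and abs(entropy(data[a], data[b]) - h_a) < tol:
                flags[i][j] = 1
    return names, flags

# Hypothetical toy table: "id" takes a unique value for every observation.
toy = {
    "id": [1, 2, 3, 4, 5, 6],
    "group": [0, 0, 1, 1, 2, 2],
    "flag": [0, 1, 0, 1, 0, 1],
}
names, flags = redundancy_matrix(toy)
h_id = entropy(toy["id"])   # log2(6), the maximum for six observations
```

Only the row of `id` is flagged, because a variable with all unique values already attains the largest entropy any pair of columns in the table can have.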
Trivariate entropies can be computed using the function
entropy_trivar(), which returns a dataframe whose first three
columns, V1, V2 and V3, give each possible triple of variables from the
dataframe in question, and whose fourth column, H(V1,V2,V3), gives
their entropy. We illustrate this on the dataframe dyad.var:
## V1 V2 V3 H(V1,V2,V3)
## 1 status gender office 4.938
## 2 status gender years 4.609
## 3 status gender age 5.129
## 4 status gender practice 4.810
## 5 status gender lawschool 5.664
## 6 status gender cowork 3.464
## 7 status gender advice 4.048
## 8 status gender friend 3.685
## 9 status office years 5.321
## 10 status office age 5.721
## 11 status office practice 5.528
## 12 status office lawschool 6.303
## 13 status office cowork 4.165
## 14 status office advice 4.713
## 15 status office friend 4.378
## 16 status years age 5.430
## 17 status years practice 5.264
## 18 status years lawschool 5.976
## 19 status years cowork 3.959
## 20 status years advice 4.535
## 21 status years friend 4.167
## 22 status age practice 5.832
## 23 status age lawschool 6.305
## 24 status age cowork 4.498
## 25 status age advice 5.080
## 26 status age friend 4.695
## 27 status practice lawschool 6.268
## 28 status practice cowork 3.989
## 29 status practice advice 4.537
## 30 status practice friend 4.258
## 31 status lawschool cowork 4.957
## 32 status lawschool advice 5.523
## 33 status lawschool friend 5.162
## 34 status cowork advice 3.087
## 35 status cowork friend 2.867
## 36 status advice friend 3.360
## 37 gender office years 5.984
## 38 gender office age 6.277
## 39 gender office practice 5.641
## 40 gender office lawschool 6.418
## 41 gender office cowork 4.301
## 42 gender office advice 4.873
## 43 gender office friend 4.539
## 44 gender years age 5.973
## 45 gender years practice 5.837
## 46 gender years lawschool 6.558
## 47 gender years cowork 4.532
## 48 gender years advice 5.120
## 49 gender years friend 4.731
## 50 gender age practice 6.130
## 51 gender age lawschool 6.654
## 52 gender age cowork 4.872
## 53 gender age advice 5.459
## 54 gender age friend 5.072
## 55 gender practice lawschool 6.301
## 56 gender practice cowork 4.062
## 57 gender practice advice 4.638
## 58 gender practice friend 4.349
## 59 gender lawschool cowork 5.044
## 60 gender lawschool advice 5.632
## 61 gender lawschool friend 5.266
## 62 gender cowork advice 3.217
## 63 gender cowork friend 2.983
## 64 gender advice friend 3.469
## 65 office years age 6.786
## 66 office years practice 6.552
## 67 office years lawschool 7.259
## 68 office years cowork 5.344
## 69 office years advice 5.861
## 70 office years friend 5.528
## 71 office age practice 6.737
## 72 office age lawschool 7.272
## 73 office age cowork 5.428
## 74 office age advice 5.988
## 75 office age friend 5.622
## 76 office practice lawschool 6.876
## 77 office practice cowork 4.645
## 78 office practice advice 5.185
## 79 office practice friend 4.934
## 80 office lawschool cowork 5.595
## 81 office lawschool advice 6.149
## 82 office lawschool friend 5.811
## 83 office cowork advice 3.798
## 84 office cowork friend 3.569
## 85 office advice friend 4.045
## 86 years age practice 6.624
## 87 years age lawschool 7.187
## 88 years age cowork 5.442
## 89 years age advice 6.005
## 90 years age friend 5.618
## 91 years practice lawschool 7.181
## 92 years practice cowork 5.117
## 93 years practice advice 5.665
## 94 years practice friend 5.360
## 95 years lawschool cowork 5.999
## 96 years lawschool advice 6.557
## 97 years lawschool friend 6.174
## 98 years cowork advice 4.274
## 99 years cowork friend 4.020
## 100 years advice friend 4.505
## 101 age practice lawschool 7.140
## 102 age practice cowork 5.290
## 103 age practice advice 5.849
## 104 age practice friend 5.538
## 105 age lawschool cowork 5.940
## 106 age lawschool advice 6.501
## 107 age lawschool friend 6.108
## 108 age cowork advice 4.453
## 109 age cowork friend 4.191
## 110 age advice friend 4.672
## 111 practice lawschool cowork 5.436
## 112 practice lawschool advice 6.000
## 113 practice lawschool friend 5.706
## 114 practice cowork advice 3.544
## 115 practice cowork friend 3.358
## 116 practice advice friend 3.810
## 117 lawschool cowork advice 4.613
## 118 lawschool cowork friend 4.381
## 119 lawschool advice friend 4.836
## 120 cowork advice friend 2.389
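The number of rows above follows from the number of ways to choose three of the ten variables. As an illustrative sketch (not the implementation of entropy_trivar()), the triples can be enumerated in the same column order with `itertools.combinations`; the function `trivariate_entropies` and the table `toy` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb, log2

def entropy(*columns):
    """Joint empirical entropy (in bits) of one or more equally long columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return sum((c / n) * log2(n / c) for c in Counter(rows).values())

def trivariate_entropies(data):
    """One row (V1, V2, V3, H(V1,V2,V3)) per unordered triple of variables."""
    table = []
    for v1, v2, v3 in combinations(data, 3):   # follows the column order of data
        h = round(entropy(data[v1], data[v2], data[v3]), 3)
        table.append((v1, v2, v3, h))
    return table

# Hypothetical toy table with four variables (an assumption for illustration).
toy = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "b": [0, 1, 0, 1, 0, 1, 0, 1],
    "c": [0, 0, 0, 0, 1, 1, 1, 1],
    "d": [0, 0, 0, 0, 0, 0, 0, 1],
}
table = trivariate_entropies(toy)

# Ten variables give comb(10, 3) triples, matching the row count of the output above.
n_triples_for_ten = comb(10, 3)
```

Four toy variables give four triples, and ten variables give 120, which agrees with the row numbering of the output above.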
Frank, O., & Shafie, T. (2016). Multivariate entropy analysis of network data. Bulletin of Sociological Methodology/Bulletin de Méthodologie Sociologique, 129(1), 45-63.