Variable Domains and Data Editing

Data editing should guarantee that all variables are carefully defined, with specified domains on which they are observed or measured and specified range spaces giving their possible values. A common set-up is that all variables have the same domain, for instance a set of individuals for which several attributes are registered. If a binary relationship such as friendship (present or absent) is registered for all pairs of individuals, then the domain for this relationship is the set of dyads (pairs without regard to order) of individuals.

Variables with different domains can sometimes be combined. For instance, a node (or vertex) variable \(X\) with value \(X_u\) for node \(u\) can be extended to the domain of dyads by defining it as the pair \(X_{uv} = (X_u,X_v)\) for dyad \((u,v)\). Examples of such variable transformations are given in Frank & Shafie (2016).
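
The extension of a node variable to dyads can be sketched in base R. This is only an illustration on made-up data: the vector `x` and the data frame `dyads` are hypothetical names, and the ordering of dyads produced by `combn()` is an assumption that need not match the ordering used elsewhere in this text.

```r
# Hypothetical node variable for four nodes (illustrative values only)
x <- c(0, 1, 1, 0)

# All unordered pairs (u, v) with u < v; one column per dyad
pairs <- combn(length(x), 2)

# Dyad variable X_uv = (X_u, X_v), stored as two columns
dyads <- data.frame(
  u   = pairs[1, ],
  v   = pairs[2, ],
  x_u = x[pairs[1, ]],
  x_v = x[pairs[2, ]]
)
dyads
```

With four nodes this gives `choose(4, 2)`, that is 6, rows, one per dyad.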

Note that it is also possible to create a new variable on nodes from a tie variable \(Y\) with values \(Y_{uv}\) on dyads by aggregating in some way the values on the dyads incident to a node \(u\) to get a value \(Y_u\) at that node. For instance, the edge indicators in a graph can be aggregated to degrees or other centrality measures at the vertices.
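
As a minimal sketch of this aggregation, the degree of each node in an undirected graph is the row sum of its adjacency matrix. The matrix `adj` below is a hypothetical toy example, not one of the networks analysed in this text.

```r
# Hypothetical undirected graph on four nodes (symmetric 0/1 adjacency matrix)
adj <- matrix(c(0, 1, 1, 0,
                1, 0, 1, 0,
                1, 1, 0, 1,
                0, 0, 1, 0), nrow = 4, byrow = TRUE)

# Y_u: number of edges incident to node u
degree <- rowSums(adj)
degree
# 2 2 3 1
```

Each edge is counted once at each of its two end points, so the degrees sum to twice the number of edges.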

Thus, variables can be observed and transformed on three domains: vertices, dyads and triads. This is exemplified in the following.


Load internal data set

library('netropy')

Load the internal data set, then extract each object and assign it a descriptive name:

data(lawdata)
# lawdata is a list: three adjacency matrices followed by the attribute data frame
adj.advice <- lawdata[[1]]
adj.friend <- lawdata[[2]]
adj.cowork <- lawdata[[3]]
df.att <- lawdata[[4]]

Example: observed and transformed vertex variables

Observed node variables are the attributes in the data frame df.att with 71 observations. All variables except years and age are categorical with finite range spaces and are therefore kept in their original form. The variables years and age need to be categorized by using their cumulative distribution functions (cdfs) to create approximately equally sized categories. These cdfs can be obtained by

# for years:
x <- table(df.att$years)
values <- as.numeric(names(x))
prop.x <- round(prop.table(x), 2)
cum.prop <- cumsum(prop.x)
frq.years <- data.frame(value = values, freq = as.vector(x),
                        rel.freq = as.vector(prop.x),
                        cum.rel.freq = as.vector(cum.prop))

# for age:
x <- table(df.att$age)
values <- as.numeric(names(x))
prop.x <- round(prop.table(x), 2)
cum.prop <- cumsum(prop.x)
frq.age <- data.frame(value = values, freq = as.vector(x),
                      rel.freq = as.vector(prop.x),
                      cum.rel.freq = as.vector(cum.prop))

Inspecting these cdfs, we can find cut-off values on which to base the categories:

frq.years
##    value freq rel.freq cum.rel.freq
## 1      1    8     0.11         0.11
## 2      2    7     0.10         0.21
## 3      3    9     0.13         0.34
## 4      4    4     0.06         0.40
## 5      5    5     0.07         0.47
## 6      6    2     0.03         0.50
## 7      7    2     0.03         0.53
## 8      8    5     0.07         0.60
## 9      9    1     0.01         0.61
## 10    10    2     0.03         0.64
## 11    11    1     0.01         0.65
## 12    13    2     0.03         0.68
## 13    15    3     0.04         0.72
## 14    16    1     0.01         0.73
## 15    17    1     0.01         0.74
## 16    18    1     0.01         0.75
## 17    19    2     0.03         0.78
## 18    20    1     0.01         0.79
## 19    21    1     0.01         0.80
## 20    22    1     0.01         0.81
## 21    23    2     0.03         0.84
## 22    24    1     0.01         0.85
## 23    25    2     0.03         0.88
## 24    28    1     0.01         0.89
## 25    29    2     0.03         0.92
## 26    31    3     0.04         0.96
## 27    32    1     0.01         0.97
frq.age
##    value freq rel.freq cum.rel.freq
## 1     26    2     0.03         0.03
## 2     28    1     0.01         0.04
## 3     29    4     0.06         0.10
## 4     30    1     0.01         0.11
## 5     31    5     0.07         0.18
## 6     32    1     0.01         0.19
## 7     33    4     0.06         0.25
## 8     34    4     0.06         0.31
## 9     35    2     0.03         0.34
## 10    36    2     0.03         0.37
## 11    37    2     0.03         0.40
## 12    38    7     0.10         0.50
## 13    39    1     0.01         0.51
## 14    41    1     0.01         0.52
## 15    42    1     0.01         0.53
## 16    43    4     0.06         0.59
## 17    44    2     0.03         0.62
## 18    45    3     0.04         0.66
## 19    46    2     0.03         0.69
## 20    47    2     0.03         0.72
## 21    48    1     0.01         0.73
## 22    49    2     0.03         0.76
## 23    50    2     0.03         0.79
## 24    52    1     0.01         0.80
## 25    53    5     0.07         0.87
## 26    55    1     0.01         0.88
## 27    56    1     0.01         0.89
## 28    57    1     0.01         0.90
## 29    59    2     0.03         0.93
## 30    62    1     0.01         0.94
## 31    63    1     0.01         0.95
## 32    64    1     0.01         0.96
## 33    67    1     0.01         0.97
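
The cumulative columns above end at 0.97 rather than 1 because the proportions are rounded before they are accumulated. A possible variant, sketched here with a hypothetical helper `freq_table()` (not a netropy function) and made-up data, accumulates the unrounded proportions and rounds only for display. It also picks the observed values whose cumulative share is closest to 1/3 and 2/3 as candidate cut-offs; this selection rule is an assumption, one of several reasonable ones.

```r
# Hypothetical helper (not part of netropy): frequency table with cdf
freq_table <- function(v) {
  counts <- table(v)
  rel    <- as.vector(prop.table(counts))
  data.frame(
    value        = as.numeric(names(counts)),
    freq         = as.vector(counts),
    rel.freq     = round(rel, 2),
    cum.rel.freq = round(cumsum(rel), 2)  # accumulate first, round last
  )
}

# Illustrative values only, not taken from lawdata
v   <- c(1, 1, 2, 3, 3, 3, 5, 8, 8, 10)
tab <- freq_table(v)
tab

# Observed values whose cumulative share is closest to 1/3 and 2/3
cuts <- sapply(c(1, 2) / 3, function(p) {
  tab$value[which.min(abs(tab$cum.rel.freq - p))]
})
cuts
# 2 5
```

With discrete data the resulting categories are rarely exactly equal in size, so the candidate cut-offs are best checked against the table before use.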

We base the categorization of these two variables on cut-off values yielding three approximately equally sized categories (i.e. each covering roughly a third of the observations) and merge all variables into a new data frame att.var:

att.var <-
  data.frame(
    status   = df.att$status-1,
    gender   = df.att$gender,
    office   = df.att$office-1,
    years    = ifelse(df.att$years <= 3,0,
                      ifelse(df.att$years <= 13,1,2)),
    age      = ifelse(df.att$age <= 35,0,
                      ifelse(df.att$age <= 45,1,2)),
    practice = df.att$practice,
    lawschool= df.att$lawschool-1
    )
head(att.var)
##   status gender office years age practice lawschool
## 1      0      1      0     2   2        1         0
## 2      0      1      0     2   2        0         0
## 3      0      1      1     1   2        1         0
## 4      0      1      0     2   2        0         2
## 5      0      1      1     2   2        1         1
## 6      0      1      1     2   2        1         0

Note that, for the sake of consistency, we also edit all variables such that their outcomes start from the value 0. The variable senior only has unique values; it is therefore redundant and omitted (later we illustrate how such redundant variables can be detected using bivariate entropies).

To transform observed dyad variables into node variables, node degrees of each network (in- and out-degree for directed advice and friendship) can be computed and categorized as shown above for years and age.
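
A minimal sketch of this step for a directed network is given below. The matrix `adj`, the helper `categorise()` and the cut-offs 1 and 2 are all hypothetical and chosen only for illustration; in practice the cut-offs would be read off the cdf of the degrees.

```r
# Hypothetical directed network on five nodes; adj[u, v] = 1 means u sends a tie to v
adj <- matrix(c(0, 1, 1, 1, 0,
                0, 0, 1, 0, 0,
                1, 0, 0, 0, 0,
                0, 0, 1, 0, 1,
                0, 0, 1, 0, 0), nrow = 5, byrow = TRUE)

out.degree <- rowSums(adj)  # ties sent by each node
in.degree  <- colSums(adj)  # ties received by each node

# Three categories coded 0, 1, 2 (illustrative cut-offs, not derived from lawdata)
categorise <- function(d, low, high) {
  ifelse(d <= low, 0, ifelse(d <= high, 1, 2))
}

deg <- data.frame(out.degree, in.degree,
                  out.cat = categorise(out.degree, 1, 2),
                  in.cat  = categorise(in.degree, 1, 2))
deg
```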

Example: observed and transformed dyad variables

Dyad variables are given as pairs of incident vertex variables with \(\binom{71}{2}=2485\) observations (the number of rows in the data frames created in the following). Observed node attributes in the data frame att.var are thus given by pairs of individual attributes. For example, status with binary outcomes is transformed into a dyad variable with 4 possible outcomes \((0,0), (0,1), (1,0), (1,1)\), and office with three categorical outcomes gives a dyad variable with 9 possible outcomes \((0,0), (0,1), (0,2), (1,0), (1,1), (1,2),(2,0),(2,1),(2,2)\). These transformations can be done with the function get_dyad_var(), applied to each vertex variable with the argument type = 'att', which specifies that a vertex attribute is used as the input variable:

dyad.status    <- get_dyad_var(att.var$status, type = 'att')
dyad.gender    <- get_dyad_var(att.var$gender, type = 'att')
dyad.office    <- get_dyad_var(att.var$office, type = 'att')
dyad.years     <- get_dyad_var(att.var$years, type = 'att')
dyad.age       <- get_dyad_var(att.var$age, type = 'att')
dyad.practice  <- get_dyad_var(att.var$practice, type = 'att')
dyad.lawschool <- get_dyad_var(att.var$lawschool, type = 'att')

Note that the outcomes are recoded to numerical values to avoid character objects when performing the entropy analysis (in practice though, the actual values of variables are irrelevant for the entropy analysis as we only care about frequencies of occurrence). Thus, status has outcomes 0-3 and office has 0-8.
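
One way such a recoding could work is a positional (base-k) code, sketched below. This is an assumption for illustration: `pair_code()` is a hypothetical helper, and get_dyad_var() may assign the integers to the pairs in a different order.

```r
# Hypothetical coding of an ordered pair (x_u, x_v), each in 0..(k - 1), as one integer
pair_code <- function(x_u, x_v, k) k * x_u + x_v

# Binary attribute (k = 2): (0,0), (0,1), (1,0), (1,1)
bin <- expand.grid(x_v = 0:1, x_u = 0:1)
bin.codes <- pair_code(bin$x_u, bin$x_v, k = 2)
bin.codes
# 0 1 2 3

# Three-category attribute (k = 3): nine pairs
tri <- expand.grid(x_v = 0:2, x_u = 0:2)
tri.codes <- pair_code(tri$x_u, tri$x_v, k = 3)
tri.codes
# 0 1 2 3 4 5 6 7 8
```

Any one-to-one coding would serve equally well here, since only the frequencies of the outcomes matter.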

Similarly, dyad variables can be created based on observed ties. For the undirected edges, we use indicator variables read directly from the adjacency matrix for the dyad in question, while for the directed ones (advice and friendship) we have pairs of indicators representing sending and receiving ties with 4 possible outcomes:

dyad.cwk    <- get_dyad_var(adj.cowork, type = 'tie')
dyad.adv    <- get_dyad_var(adj.advice, type = 'tie')
dyad.frn    <- get_dyad_var(adj.friend, type = 'tie')
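
To make the four outcomes for a directed relation concrete, the following base R sketch builds the pair of indicators for each dyad of a toy network. The matrix `adj` is hypothetical, and the integer coding in the last step is an assumption that may differ from the coding returned by get_dyad_var().

```r
# Hypothetical directed network on four nodes; adj[u, v] = 1 means u -> v
adj <- matrix(c(0, 1, 0, 0,
                1, 0, 1, 0,
                0, 0, 0, 0,
                1, 0, 0, 0), nrow = 4, byrow = TRUE)

pairs <- combn(nrow(adj), 2)                 # dyads (u, v) with u < v
sent  <- adj[cbind(pairs[1, ], pairs[2, ])]  # indicator of a tie u -> v
recv  <- adj[cbind(pairs[2, ], pairs[1, ])]  # indicator of a tie v -> u

# Assumed coding: 0 = no tie, 1 = only v -> u, 2 = only u -> v, 3 = mutual
dyad.tie <- 2 * sent + recv
dyad.tie
# 3 0 1 2 0 0
```

For an undirected relation the two indicators coincide, so a single 0/1 indicator per dyad suffices.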

All 10 dyad variables are merged into one data frame for subsequent entropy analysis:

dyad.var <-
  data.frame(cbind(status   = dyad.status$var,
                  gender    = dyad.gender$var,
                  office    = dyad.office$var,
                  years     = dyad.years$var,
                  age       = dyad.age$var,
                  practice  = dyad.practice$var,
                  lawschool = dyad.lawschool$var,
                  cowork    = dyad.cwk$var,
                  advice    = dyad.adv$var,
                  friend    = dyad.frn$var)
                  )
head(dyad.var)
##   status gender office years age practice lawschool cowork advice friend
## 1      3      3      0     8   8        1         0      0      3      2
## 2      3      3      3     5   8        3         0      0      0      0
## 3      3      3      3     5   8        2         0      0      1      0
## 4      3      3      0     8   8        1         6      0      1      2
## 5      3      3      0     8   8        0         6      0      1      1
## 6      3      3      1     7   8        1         6      0      1      1

Example: transformed triad variables

A similar function get_triad_var() is implemented for transforming vertex variables and different relation types into triad variables. These triad variables have \(\binom{71}{3}=57155\) observations in the law data set and are given as triples of individual attributes or by the relations among the three nodes. As for dyad variables, we call the function and specify the type of input variable through the argument type (a column vector as input when considering vertex attributes and an adjacency matrix when considering ties). For the vertex variables we thus obtain the triad variables by:

triad.status    <- get_triad_var(att.var$status, type = 'att')
triad.gender    <- get_triad_var(att.var$gender, type = 'att')
triad.office    <- get_triad_var(att.var$office, type = 'att')
triad.years     <- get_triad_var(att.var$years, type = 'att')
triad.age       <- get_triad_var(att.var$age, type = 'att')
triad.practice  <- get_triad_var(att.var$practice, type = 'att')
triad.lawschool <- get_triad_var(att.var$lawschool,type = 'att')

Note that binary attributes have 8 possible triadic outcomes \[ (0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (0, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 1)\] coded 0-7 and attributes with three possible outcomes will yield triads with 27 possible outcomes coded 0-26.
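
The enumeration of triadic outcomes can be reproduced with `expand.grid()`, which varies its first column fastest and therefore yields the triples in the order listed above. The integer code computed here, with the first element as least significant digit, is an assumption consistent with that order; the names `triples` and `n.outcomes` are illustrative.

```r
# All triples for a binary attribute, in the order (0,0,0), (1,0,0), (0,1,0), ...
triples <- expand.grid(x1 = 0:1, x2 = 0:1, x3 = 0:1)

# Assumed coding: first element is the least significant binary digit
triples$code <- with(triples, x1 + 2 * x2 + 4 * x3)
triples

# An attribute with k outcomes gives k^3 possible triples
k <- 3
n.outcomes <- nrow(expand.grid(0:(k - 1), 0:(k - 1), 0:(k - 1)))
n.outcomes
# 27
```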

The undirected ties are transformed from having binary outcomes into triad variables with 8 possible outcomes, and directed ties are transformed from having 4 possible outcomes into triad variables with 64 possible outcomes representing the possible triadic combinations of sending and receiving ties.

triad.cwk    <- get_triad_var(adj.cowork, type = 'tie')
triad.adv    <- get_triad_var(adj.advice, type = 'tie')
triad.frn    <- get_triad_var(adj.friend, type = 'tie')
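
The outcome counts follow from combining the dyads within a triad: three dyads with 2 outcomes each give \(2^3 = 8\) outcomes, and three dyads with 4 outcomes each give \(4^3 = 64\). A base R sketch for the undirected case is shown below on a hypothetical toy network; the integer coding is an assumption and may differ from the one used by get_triad_var().

```r
# Hypothetical undirected network on four nodes
adj <- matrix(c(0, 1, 1, 0,
                1, 0, 1, 0,
                1, 1, 0, 1,
                0, 0, 1, 0), nrow = 4, byrow = TRUE)

triads <- combn(nrow(adj), 3)   # one column per triad (u, v, w) with u < v < w
e.uv <- adj[cbind(triads[1, ], triads[2, ])]  # edge indicator for {u, v}
e.uw <- adj[cbind(triads[1, ], triads[3, ])]  # edge indicator for {u, w}
e.vw <- adj[cbind(triads[2, ], triads[3, ])]  # edge indicator for {v, w}

# Assumed coding of the three edge indicators as one integer in 0..7
triad.tie <- e.uv + 2 * e.uw + 4 * e.vw
triad.tie
# 7 1 5 5

# Number of triads in a network with 71 nodes
choose(71, 3)
# 57155
```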

All triad variables are then merged into one data frame for subsequent entropy analysis.

triad.var <- data.frame(cbind(
             status    = triad.status$var,
             gender    = triad.gender$var,
             office    = triad.office$var,
             years     = triad.years$var,
             age       = triad.age$var,
             practice  = triad.practice$var,
             lawschool = triad.lawschool$var,
             cowork    = triad.cwk$var,
             advice    = triad.adv$var,
             friend    = triad.frn$var)
             )
head(triad.var)
##   status gender office years age practice lawschool cowork advice friend
## 1      7      7      9    17  26        5         0      0     35      1
## 2      7      7      0    26  26        1        18      0     43     37
## 3      7      7      9    26  26        5         9      0     11      1
## 4      7      7      9    26  26        5         0      0     19      1
## 5      7      7      9    26  26        1        18      4     35      1
## 6      7      7      0    26  26        5        18      0     11      5

References

Frank, O., & Shafie, T. (2016). Multivariate entropy analysis of network data. Bulletin of Sociological Methodology/Bulletin de Méthodologie Sociologique, 129(1), 45-63.