Data editing should guarantee that all variables are carefully defined, with specified domains where they are observed or measured and specified range spaces where their possible values are given. A common set-up is that all variables have the same domain, for instance a set of individuals for which several attributes are registered. If a binary relationship such as friendship or no friendship is registered for all pairs of individuals, then the domain for this relationship is the set of dyads (pairs without regard to order) of individuals.
Variables with different domains can sometimes be combined. For instance, a node (vertex) variable \(X\) with value \(X_u\) for node \(u\) can be extended to the domain of dyads by defining it as the pair \(X_{uv} = (X_u,X_v)\) for dyad \((u,v)\). Examples of such variable transformations are given in Frank & Shafie (2016).
Note that it is also possible to create a new variable on nodes from a tie variable \(Y\) with values \(Y_{uv}\) on dyads by aggregating in some way the values on dyads incident to a node \(u\) to get a value \(Y_u\) at that node. For instance, the edge indicators in a graph can be aggregated to degrees or other centrality measures at the vertices.
Thus, variables can be observed and transformed on three variable domains: vertices, dyads and triads. This is exemplified in the following.
To load the internal data set, extract each object and assign the correct names to them:
Observed node variables are the attributes in the data frame df.att with 71 observations. All variables, except years and age, are categorical with finite range spaces and are therefore kept in their original form. The variables years and age need to be categorized using their cumulative distribution functions (cdf) to create approximately equally sized categories. These cdfs can be obtained by:
# for years:
x <- table(df.att$years)
values <- as.numeric(names(x))
prop.x <- round(prop.table(x), 2)
cum.prop <- cumsum(prop.x)
frq.years <- data.frame(value = values, freq = as.vector(x),
                        rel.freq = as.vector(prop.x),
                        cum.rel.freq = as.vector(cum.prop))
# for age:
x <- table(df.att$age)
values <- as.numeric(names(x))
prop.x <- round(prop.table(x), 2)
cum.prop <- cumsum(prop.x)
frq.age <- data.frame(value = values, freq = as.vector(x),
                      rel.freq = as.vector(prop.x),
                      cum.rel.freq = as.vector(cum.prop))
Inspecting these cdfs suggests the values on which to base the categories:
## value freq rel.freq cum.rel.freq
## 1 1 8 0.11 0.11
## 2 2 7 0.10 0.21
## 3 3 9 0.13 0.34
## 4 4 4 0.06 0.40
## 5 5 5 0.07 0.47
## 6 6 2 0.03 0.50
## 7 7 2 0.03 0.53
## 8 8 5 0.07 0.60
## 9 9 1 0.01 0.61
## 10 10 2 0.03 0.64
## 11 11 1 0.01 0.65
## 12 13 2 0.03 0.68
## 13 15 3 0.04 0.72
## 14 16 1 0.01 0.73
## 15 17 1 0.01 0.74
## 16 18 1 0.01 0.75
## 17 19 2 0.03 0.78
## 18 20 1 0.01 0.79
## 19 21 1 0.01 0.80
## 20 22 1 0.01 0.81
## 21 23 2 0.03 0.84
## 22 24 1 0.01 0.85
## 23 25 2 0.03 0.88
## 24 28 1 0.01 0.89
## 25 29 2 0.03 0.92
## 26 31 3 0.04 0.96
## 27 32 1 0.01 0.97
## value freq rel.freq cum.rel.freq
## 1 26 2 0.03 0.03
## 2 28 1 0.01 0.04
## 3 29 4 0.06 0.10
## 4 30 1 0.01 0.11
## 5 31 5 0.07 0.18
## 6 32 1 0.01 0.19
## 7 33 4 0.06 0.25
## 8 34 4 0.06 0.31
## 9 35 2 0.03 0.34
## 10 36 2 0.03 0.37
## 11 37 2 0.03 0.40
## 12 38 7 0.10 0.50
## 13 39 1 0.01 0.51
## 14 41 1 0.01 0.52
## 15 42 1 0.01 0.53
## 16 43 4 0.06 0.59
## 17 44 2 0.03 0.62
## 18 45 3 0.04 0.66
## 19 46 2 0.03 0.69
## 20 47 2 0.03 0.72
## 21 48 1 0.01 0.73
## 22 49 2 0.03 0.76
## 23 50 2 0.03 0.79
## 24 52 1 0.01 0.80
## 25 53 5 0.07 0.87
## 26 55 1 0.01 0.88
## 27 56 1 0.01 0.89
## 28 57 1 0.01 0.90
## 29 59 2 0.03 0.93
## 30 62 1 0.01 0.94
## 31 63 1 0.01 0.95
## 32 64 1 0.01 0.96
## 33 67 1 0.01 0.97
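Since the two tables above are built by the same steps, those steps could be wrapped in a small helper. The following is only a sketch: the function name freq_table() and the toy input are hypothetical and not part of the original code. Unlike the code above, it accumulates the unrounded proportions and rounds only for display, so the last cumulative value is 1 rather than 0.97.

```r
# Hypothetical helper (not part of the original code): builds a frequency
# table with relative and cumulative relative frequencies for a numeric vector.
freq_table <- function(v) {
  x <- table(v)
  prop.x <- as.vector(prop.table(x))
  data.frame(
    value        = as.numeric(names(x)),
    freq         = as.vector(x),
    rel.freq     = round(prop.x, 2),
    # accumulate the unrounded proportions, then round for display
    cum.rel.freq = round(cumsum(prop.x), 2)
  )
}

# Toy input standing in for df.att$years (made-up values)
toy.years <- c(1, 1, 2, 3, 3, 3, 5, 8, 8, 13)
frq.toy <- freq_table(toy.years)
frq.toy
```

With the real data, the same call would be applied to df.att$years and df.att$age.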
We base the categorization of these two variables on values yielding three approximately equally sized categories (i.e. each covering roughly a third of the cdf) and merge all variables into a new data frame att.var:
att.var <-
  data.frame(
    status    = df.att$status - 1,
    gender    = df.att$gender,
    office    = df.att$office - 1,
    years     = ifelse(df.att$years <= 3, 0,
                       ifelse(df.att$years <= 13, 1, 2)),
    age       = ifelse(df.att$age <= 35, 0,
                       ifelse(df.att$age <= 45, 1, 2)),
    practice  = df.att$practice,
    lawschool = df.att$lawschool - 1
  )
head(att.var)
## status gender office years age practice lawschool
## 1 0 1 0 2 2 1 0
## 2 0 1 0 2 2 0 0
## 3 0 1 1 1 2 1 0
## 4 0 1 0 2 2 0 2
## 5 0 1 1 2 2 1 1
## 6 0 1 1 2 2 1 0
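As an alternative to nested ifelse() calls, the same three-way categorization can be sketched with base R's findInterval(). The helper name categorize() and the toy vector are hypothetical; the cut points 3 and 13 are the ones used for years above.

```r
# Cut points taken from the categorization of years above
years.cuts <- c(3, 13)

# Hypothetical helper: left.open = TRUE gives the intervals
# (-Inf, 3], (3, 13], (13, Inf), coded 0, 1, 2
categorize <- function(v, cuts) findInterval(v, cuts, left.open = TRUE)

# Made-up values covering both sides of each cut point
toy.years <- c(1, 3, 4, 13, 14, 32)
cat.fi <- categorize(toy.years, years.cuts)

# The nested ifelse() form used above, for comparison
cat.ie <- ifelse(toy.years <= 3, 0, ifelse(toy.years <= 13, 1, 2))

cat.fi
```

This form scales more easily if more than three categories are wanted, since only the vector of cut points changes.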
Note that, for the sake of consistency, we also subtract 1 from status, office and lawschool so that their outcomes start from the value 0. The variable senior only has unique values, so it is redundant and omitted (later we illustrate how such redundant variables can be detected using bivariate entropies).
To transform observed dyad variables into node variables, the node degrees of each network (in- and out-degree for the directed advice and friendship networks) can be computed and categorized as shown above for years and age.
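A minimal sketch of this aggregation in base R is given below. The adjacency matrix is a made-up toy example, it is assumed that rows are senders and columns are receivers, and the cut points are arbitrary; with real data they would be read off the cdf of the degrees as was done for years and age.

```r
# Toy directed adjacency matrix (made-up); assumption: adj[u, v] = 1 means
# that u sends a tie to v
adj <- matrix(c(0, 1, 1, 0,
                0, 0, 1, 0,
                1, 0, 0, 0,
                1, 1, 1, 0), nrow = 4, byrow = TRUE)

out.degree <- rowSums(adj)  # number of ties sent by each node
in.degree  <- colSums(adj)  # number of ties received by each node

# Categorize the out-degrees into three classes (arbitrary cut points 1 and 2)
out.cat <- ifelse(out.degree <= 1, 0,
                  ifelse(out.degree <= 2, 1, 2))

data.frame(out.degree, in.degree, out.cat)
```

For an undirected network the adjacency matrix is symmetric, so row sums and column sums coincide and a single degree variable is obtained.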
Dyad variables are given as pairs of incident vertex variables with \(\binom{71}{2}=2485\) observations (the number of rows in the data frames created in the following). Observed node attributes in the data frame att.var are thus given by pairs of individual attributes. For example, status with binary outcomes is transformed into dyads having 4 possible outcomes \((0,0), (0,1), (1,0), (1,1)\), and office with three categorical outcomes gives dyads with 9 possible outcomes \((0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2)\). These transformations can be done using the function get_dyad_var() for each vertex variable with the argument type = 'att', which specifies that we are using vertex attributes as the input variable:
dyad.status <- get_dyad_var(att.var$status, type = 'att')
dyad.gender <- get_dyad_var(att.var$gender, type = 'att')
dyad.office <- get_dyad_var(att.var$office, type = 'att')
dyad.years <- get_dyad_var(att.var$years, type = 'att')
dyad.age <- get_dyad_var(att.var$age, type = 'att')
dyad.practice <- get_dyad_var(att.var$practice, type = 'att')
dyad.lawschool <- get_dyad_var(att.var$lawschool, type = 'att')
Note that the outcomes are recoded to numerical values to avoid character objects when performing the entropy analysis (in practice, though, the actual values of the variables are irrelevant for the entropy analysis, as we only care about frequencies of occurrence). Thus, status has outcomes 0-3 and office has outcomes 0-8.
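To illustrate the idea behind such a recoding, the following base R sketch pairs a toy vertex attribute over all dyads and maps each pair to a single integer. It does not call get_dyad_var(); the particular coding (first value times the number of outcomes, plus second value) and the dyad ordering from combn() are assumptions made for illustration and need not match what the package function returns.

```r
# Toy vertex attribute with outcomes 0, 1, 2 (made-up values)
x <- c(0, 2, 1, 1)
k <- 3  # number of possible outcomes of x

# All dyads {u, v} with u < v, one per column: choose(4, 2) = 6 dyads
dyads <- combn(length(x), 2)
x.u <- x[dyads[1, ]]
x.v <- x[dyads[2, ]]

# One possible numeric coding of the pair (x.u, x.v), with values 0, ..., k^2 - 1
dyad.x <- x.u * k + x.v

data.frame(u = dyads[1, ], v = dyads[2, ], x.u, x.v, code = dyad.x)
```

Because the entropy analysis only depends on how often each outcome occurs, any one-to-one coding of the pairs would serve equally well.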
Similarly, dyad variables can be created based on observed ties. For the undirected edges, we use indicator variables read directly from the adjacency matrix for the dyad in question, while for the directed ones (advice and friendship) we have pairs of indicators representing sending and receiving ties with 4 possible outcomes:
dyad.cwk <- get_dyad_var(adj.cowork, type = 'tie')
dyad.adv <- get_dyad_var(adj.advice, type = 'tie')
dyad.frn <- get_dyad_var(adj.friend, type = 'tie')
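The pair of sending and receiving indicators for a directed relation can be sketched in base R as follows. The adjacency matrix is a made-up toy example, and the coding of the four outcomes is an assumption chosen for illustration; it is not necessarily the coding used by get_dyad_var().

```r
# Toy directed adjacency matrix (made-up); assumption: adj[u, v] = 1 means
# that u sends a tie to v
adj <- matrix(c(0, 1, 0,
                1, 0, 0,
                1, 1, 0), nrow = 3, byrow = TRUE)

# All dyads {u, v} with u < v, one per column
dyads <- combn(nrow(adj), 2)
sent     <- adj[cbind(dyads[1, ], dyads[2, ])]  # indicator of tie u -> v
received <- adj[cbind(dyads[2, ], dyads[1, ])]  # indicator of tie v -> u

# One possible coding of the pair (sent, received):
# 0 = no tie, 1 = only v -> u, 2 = only u -> v, 3 = mutual tie
dyad.tie <- 2 * sent + received

data.frame(u = dyads[1, ], v = dyads[2, ], sent, received, code = dyad.tie)
```

For an undirected relation the two indicators are always equal, so the single indicator adj[u, v] is enough and the dyad variable stays binary.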
All 10 dyad variables are merged into one data frame for subsequent entropy analysis:
dyad.var <-
data.frame(cbind(status = dyad.status$var,
gender = dyad.gender$var,
office = dyad.office$var,
years = dyad.years$var,
age = dyad.age$var,
practice = dyad.practice$var,
lawschool = dyad.lawschool$var,
cowork = dyad.cwk$var,
advice = dyad.adv$var,
friend = dyad.frn$var)
)
head(dyad.var)
## status gender office years age practice lawschool cowork advice friend
## 1 3 3 0 8 8 1 0 0 3 2
## 2 3 3 3 5 8 3 0 0 0 0
## 3 3 3 3 5 8 2 0 0 1 0
## 4 3 3 0 8 8 1 6 0 1 2
## 5 3 3 0 8 8 0 6 0 1 1
## 6 3 3 1 7 8 1 6 0 1 1
A similar function get_triad_var() is implemented for transforming vertex variables and different relation types into triad variables. These triad variables have \(\binom{71}{3}=57155\) observations in the law data set and are given as triples of individual attributes or by the relations among the three nodes. As for dyad variables, we call the function and specify the type of variable with the argument type (a column vector as input when considering vertex attributes and an adjacency matrix when considering ties). For the vertex variables we thus obtain the triad variables by:
triad.status <- get_triad_var(att.var$status, type = 'att')
triad.gender <- get_triad_var(att.var$gender, type = 'att')
triad.office <- get_triad_var(att.var$office, type = 'att')
triad.years <- get_triad_var(att.var$years, type = 'att')
triad.age <- get_triad_var(att.var$age, type = 'att')
triad.practice <- get_triad_var(att.var$practice, type = 'att')
triad.lawschool <- get_triad_var(att.var$lawschool,type = 'att')
Note that binary attributes have 8 possible triadic outcomes \[ (0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (0, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 1)\] coded 0-7 and attributes with three possible outcomes will yield triads with 27 possible outcomes coded 0-26.
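The outcome counts above can be checked with a short base R sketch. The integer coding shown (first value plus 2 times the second plus 4 times the third) is an assumption for illustration and may differ from the coding used internally by get_triad_var(); it does, however, reproduce the listed order of the eight binary outcomes.

```r
# All 2^3 = 8 outcomes of a binary attribute on a triad; expand.grid() varies
# the first column fastest, giving (0,0,0), (1,0,0), (0,1,0), (1,1,0), ...
outcomes2 <- expand.grid(x1 = 0:1, x2 = 0:1, x3 = 0:1)
outcomes2$code <- with(outcomes2, x1 + 2 * x2 + 4 * x3)
outcomes2

# The same construction for an attribute with three outcomes: 3^3 = 27 triples
outcomes3 <- expand.grid(x1 = 0:2, x2 = 0:2, x3 = 0:2)
outcomes3$code <- with(outcomes3, x1 + 3 * x2 + 9 * x3)
range(outcomes3$code)

# Number of triads among 71 nodes
n.triads <- choose(71, 3)
n.triads
```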
The undirected ties are transformed from having binary outcomes into triad variables with 8 possible outcomes, and the directed ties are transformed from having 4 possible outcomes into triad variables with 64 possible outcomes representing the possible triadic combinations of sending and receiving ties.
triad.cwk <- get_triad_var(adj.cowork, type = 'tie')
triad.adv <- get_triad_var(adj.advice, type = 'tie')
triad.frn <- get_triad_var(adj.friend, type = 'tie')
All triad variables are then merged into one data frame for subsequent entropy analysis.
triad.var <- data.frame(cbind(
status = triad.status$var,
gender = triad.gender$var,
office = triad.office$var,
years = triad.years$var,
age = triad.age$var,
practice = triad.practice$var,
lawschool = triad.lawschool$var,
cowork = triad.cwk$var,
advice = triad.adv$var,
friend = triad.frn$var)
)
head(triad.var)
## status gender office years age practice lawschool cowork advice friend
## 1 7 7 9 17 26 5 0 0 35 1
## 2 7 7 0 26 26 1 18 0 43 37
## 3 7 7 9 26 26 5 9 0 11 1
## 4 7 7 9 26 26 5 0 0 19 1
## 5 7 7 9 26 26 1 18 4 35 1
## 6 7 7 0 26 26 5 18 0 11 5
Frank, O., & Shafie, T. (2016). Multivariate entropy analysis of network data. Bulletin of Sociological Methodology/Bulletin de Méthodologie Sociologique, 129(1), 45-63.