Descriptive Analysis

Published

March 31, 2026

Getting Started

# load custom functions (freadRDS2, .clear_global_environment, etc.)
source("src/utils/custom_functions.r")

# clear the global environment and set dependencies
# NOTE(review): clearing the global environment from a helper is a session
# side effect; assumed intentional for a fresh Quarto render.
.clear_global_environment()
.load_quarto_dependencies()
# load and activate packages
# NOTE(review): dplyr, tidyr, purrr and forcats are already attached by
# tidyverse; the explicit calls below are redundant but harmless.
library(tidyverse)
library(RSiena)
library(igraph)
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(forcats)
library(scales)
library(patchwork)
library(gt)

Functions

limit_observation_period()

Restrict the edge table to co-authorships published on or after a given start date.

Arguments

data

Named list containing an edge_table element with a publication_date column.

start_date

Character string (parseable by ymd()) or date giving the earliest publication date to keep. Defaults to "2020-01-01".

Method

The function filters data[['edge_table']] to rows where publication_date >= ymd(start_date) and returns the updated list.

limit_observation_period <- function(data, start_date = "2020-01-01") {
  # Keep only co-authorship edges published on or after `start_date`.
  #
  # data:       named list with an `edge_table` element holding a
  #             `publication_date` column.
  # start_date: character (ymd-parseable) or Date; defaults to "2020-01-01".
  # Returns the list with the filtered edge table.
  cutoff <- ymd(start_date)
  data[["edge_table"]] <- data[["edge_table"]] |>
    filter(publication_date >= cutoff)
  data
}

determine_complete_cases()

Flag scholars by data completeness and create a nodes table of analysis-ready scholars.

Arguments

data

Named list containing scholars_oa (list of tibbles with has_author_id and uid), edge_table (with uid and a__uid), and demographics (with uid, position columns, and other scholar metadata).

Method

The function identifies two sets of UIDs: scholars with a valid OpenAlex author ID (has_oa) and scholars who appear as a focal author in at least one co-authored work during the observation period (has_work). It adds a flag column to demographics coded "is_complete", "no_works", or "no_oa", and an is_staff indicator that is TRUE when all three position columns (position_22, position_24, position_25) are either NA or "staff". A nodes table is created by retaining only scholars flagged "is_complete" and not is_staff.

determine_complete_cases <- function(data) {
  # Flag scholars by data completeness and build the analysis-ready nodes
  # table.
  #
  # data: named list with scholars_oa (list of tibbles with has_author_id,
  #       uid), edge_table (uid, a__uid) and demographics (uid, position_*).
  # Returns `data` with demographics gaining `flag` and `is_staff` columns
  # and a new `nodes` element (complete, non-staff scholars).

  # authors with a valid OpenAlex id
  has_oa <- data[["scholars_oa"]] |>
    bind_rows() |>
    filter(has_author_id) |>
    distinct(uid) |>
    pull()

  # authors appearing as focal author on at least one co-authored work
  # (rows where ego and alter differ)
  has_work <- data[["edge_table"]] |>
    filter(uid != a__uid) |>
    distinct(uid) |>
    pull()

  data[["demographics"]] <- data[["demographics"]] |>
    mutate(
      # case_when replaces the original nested ifelse(); order matters:
      # having works takes precedence over merely having an OA id.
      flag = case_when(
        uid %in% has_work ~ "is_complete",
        uid %in% has_oa   ~ "no_works",
        .default = "no_oa"
      ),
      # staff = every observed position is NA or "staff"
      is_staff = dplyr::if_all(
        c(position_22, position_24, position_25),
        ~ is.na(.x) | .x == "staff"
      )
    )

  data[["nodes"]] <- data[["demographics"]] |>
    filter(flag == "is_complete", !is_staff)

  return(data)
}

select_sample_matrix()

Subset an existing matrix to the rows and columns corresponding to a given sample of UIDs, padding with NA for UIDs not present in the original.

Arguments

d0

A named numeric matrix with rows and columns indexed by UID strings.

sample

Character vector of UIDs defining the desired output dimensions.

Method

The function initializes an NA matrix of dimension length(sample) × length(sample). It finds the intersection of sample with the existing row and column names of d0 and copies the overlapping block into the corresponding positions of the new matrix.

select_sample_matrix <- function(d0, sample){
  # Subset a named square matrix to `sample` x `sample`, padding rows and
  # columns absent from d0 with NA.
  #
  # d0:     named numeric matrix indexed by UID strings.
  # sample: character vector of UIDs defining the output dimensions.
  # Returns a length(sample) x length(sample) matrix.

  # target: full sample x sample, default NA
  d <- matrix(NA_real_, nrow = length(sample), ncol = length(sample),
              dimnames = list(sample, sample))

  # overlap between sample and existing d0 row/col names
  # (renamed from r/c: `c` shadowed base::c inside this scope)
  rows_in <- intersect(sample, rownames(d0))
  cols_in <- intersect(sample, colnames(d0))
  keep <- intersect(rows_in, cols_in)

  # copy the existing block into the right positions
  d[keep, keep] <- d0[keep, keep, drop = FALSE]

  return(d)
}

select_sample_intdis()

Subset the interdisciplinarity matrix to a given sample of scholars, padding with NA for missing rows.

Arguments

topics

Named list containing an intdis sub-list whose subfield element is a matrix with rows indexed by UID and columns indexed by year.

sample

Character vector of UIDs defining the desired row set.

Method

The function extracts topics[["intdis"]][["subfield"]], initializes a new matrix with sample as row names and the same column names as the original, and copies rows that are present in both sample and the original. Rows not found in the original are left as NA.

select_sample_intdis <- function(topics, sample){
  # Subset the interdisciplinarity (scholar x year) matrix to the rows in
  # `sample`, padding scholars missing from the original with NA.
  #
  # topics: named list with topics[["intdis"]][["subfield"]], a matrix with
  #         UID row names and year column names.
  # sample: character vector of UIDs defining the output row set.
  t0 <- topics[["intdis"]][["subfield"]]
  # ncol(t0) instead of length(colnames(t0)): equivalent when column names
  # exist, and still correct if they are ever absent.
  t <- matrix(NA_real_, nrow = length(sample), ncol = ncol(t0),
              dimnames = list(sample, colnames(t0)))

  # fill rows that exist in original t0
  keep <- intersect(sample, rownames(t0))
  t[keep, ] <- t0[keep, , drop = FALSE]

  return(t)
}

make_adjacency_matrix()

Build a co-authorship adjacency matrix for a specified set of scholars and time window.

Arguments

data

Named list with a works element (named list of per-scholar works tibbles containing uid, work_id, authorships, publication_date, and publication_year).

uids

Character vector of focal scholar UIDs defining the rows and columns of the output matrix.

type

Character string controlling which author position is treated as the focal node. "first" and "last" use the author at that position; "all" creates a full crossing of all co-authors on each work. Defaults to "first".

min_year

Integer year or Date giving the lower bound of the publication window (inclusive).

max_year

Integer year or Date giving the upper bound of the publication window (inclusive).

weighted

Logical. If TRUE, matrix values reflect the count of shared publications; if FALSE, values are binarized to 1L. Defaults to FALSE.

Method

The function binds the works for the requested uids, filters to the specified publication window (using publication_date for Date inputs and publication_year for integers), selects uid, work_id, and authorships, deduplicates by work_id, and unnests authorships. Depending on type, it either builds ego–alter pairs using the author at the named position or crosses all author pairs per work. Co-author pairs are counted and optionally binarized. The result is pivoted to a wide adjacency matrix, padded to the full uids dimension with zeros for absent pairs, and the diagonal is set to 0.

make_adjacency_matrix <- function(
    data, uids, type = 'first', min_year = 2020, max_year = 2026, weighted = FALSE
  ) {
  # Build a co-authorship adjacency matrix over the scholars in `uids` for
  # works published within [min_year, max_year].
  #
  # data:     named list with a `works` element (named list of per-scholar
  #           tibbles: uid, work_id, authorships, publication_date,
  #           publication_year).
  # uids:     focal scholar UIDs; they define the rows/cols of the output.
  # type:     'first'/'last' treat the author at that position as ego;
  #           'all' crosses every author pair per work.
  # min_year/max_year: Date inputs filter on publication_date, otherwise
  #           publication_year is used (both bounds inclusive).
  # weighted: TRUE keeps shared-publication counts; FALSE binarizes to 1L.
  # Returns a length(uids) x length(uids) matrix with a zero diagonal and
  # zeros for absent pairs.

  # make edgelist from works; drop scholars without any works
  edges <- data[['works']][uids] |>
    purrr::keep(~ nrow(.x) > 0) |>
    bind_rows()

  # window filter: Dates use publication_date, numerics publication_year
  if (inherits(min_year, "Date")) {
    edges <- edges |> filter(publication_date >= min_year, publication_date <= max_year)
  } else {
    edges <- edges |> filter(publication_year >= min_year, publication_year <= max_year)
  }
  # deduplicate to one row per work, then expand to one row per author
  edges <- edges |>
    select(uid, work_id, authorships) |>
    rename(e_uid = uid) |>
    distinct(work_id, .keep_all = TRUE) |>
    unnest(authorships)

  if (type != 'all') {
    # ego = the author holding the requested position ('first'/'last');
    # the [1] guards against several authors sharing that position label
    edges <- edges |>
      group_by(work_id) |>
      mutate(
        e_uid = uid[author_position == type][1]
      ) |> ungroup() |>
      filter(!is.na(e_uid), e_uid != uid) |>
      select(work_id, e_uid, uid)
  } else {
    # full crossing of author pairs per work (both directions; self-pairs
    # removed by the filter below)
    edges <- edges |>
      group_by(work_id) |>
      reframe(
        tidyr::crossing(
          e_uid = uid,
          uid   = uid
        )
      ) |>
      ungroup() |>
      filter(!is.na(e_uid), e_uid != uid) |>
      select(work_id, e_uid, uid)
  }

  # calculate the number of edges between ego and alters.
  edge_counts <- edges |> count(e_uid, uid, name="w")

  # convert counts to incidences
  if (!weighted) {
    edge_counts <- edge_counts %>% mutate(w = 1L)
  }

  # all nodes appearing as ego or alter
  nodes <- sort(unique(c(edge_counts$e_uid, edge_counts$uid)))

  # complete matrix grid and pivot to wide
  adj <- edge_counts %>%
    rename(from = e_uid, to = uid) %>%
    complete(from = nodes, to = nodes, fill = list(w = 0)) %>%
    pivot_wider(names_from = to, values_from = w) %>%
    arrange(from)

  # convert to matrix and set rownames
  mat <- adj %>% as.data.frame()

  row_ids <- mat$from
  mat$from <- NULL

  mat <- as.matrix(mat)
  rownames(mat) <- row_ids
  diag(mat) <- 0

  # make sure the matrices are consistent across waves
  all_uids <- as.character(uids)

  # indices of existing rows/cols in desired order
  ri <- match(all_uids, rownames(mat))
  ci <- match(all_uids, colnames(mat))

  # start with a zero-filled matrix over the full uids set
  # (absent pairs stay 0, not NA)
  mat_full <- matrix(0L, nrow = length(all_uids), ncol = length(all_uids),
                    dimnames = list(all_uids, all_uids))

  # copy overlap block
  mat_full[!is.na(ri), !is.na(ci)] <- mat[ri[!is.na(ri)], ci[!is.na(ci)], drop = FALSE]
  mat <- mat_full

  mat
}

fcolnet2()

Build a multi-wave co-authorship network filtered by university, position, and discipline.

Arguments

data

Named list with a demographics element (containing uid, university_*, discipline_*, and position_* columns) and a works element used by make_adjacency_matrix().

university

Character vector of university abbreviations to include. Defaults to all eight valid Dutch universities.

position

Character vector of academic positions to include. Defaults to all five valid positions.

discipline

Character vector of disciplines to include. Defaults to "Sociology" and "Political Sciences".

type

Character string or NULL. If NULL, adjacency matrices are built for "first", "last", and "all" authorship types; otherwise the single specified type is used.

waves

List of two-element vectors specifying the start and end of each observation wave. Defaults to three waves roughly corresponding to 2019–2022, 2023–2024, and 2025–2026.

Method

The function validates each filter argument against its allowed values using an internal helper. It filters demographics to scholars who appear with at least one of the specified universities, disciplines, and positions across the three measurement years. For each wave (and optionally each authorship type), it calls make_adjacency_matrix() with the corresponding year bounds. The function returns a list with data (the filtered demographics tibble) and nets (the nested list of adjacency matrices).

fcolnet2 <- function(
    data,
    university = NULL,
    position   = NULL,
    discipline = NULL,
    type = NULL,
    waves = list(c(2019, 2022), c(2023, 2024), c(2025, 2026))
  ){
  # Build a multi-wave co-authorship network filtered by university,
  # position and discipline.
  #
  # data:       list with `demographics` and `works` (see
  #             make_adjacency_matrix()).
  # university/position/discipline: filter values; NULL keeps every valid
  #             value (validated against the allowed sets below).
  # type:       NULL builds 'first', 'last' and 'all' matrices per wave;
  #             otherwise only the single given type.
  # waves:      list of c(start, end) pairs (years or Dates); NULL falls
  #             back to the predefined date windows.
  # Returns list(data = filtered demographics, nets = adjacency matrices).

  valid_disciplines  <- c("Sociology", "Political Sciences")
  valid_universities <- c("UVA","EUR","UL","VU","UVG","UU","RU","UVT")
  valid_positions    <- c("Associate Professor","PhD Candidate","Full Professor",
                          "Researcher or Lecturer","Assistant Professor")
  valid_dates        <- list(c(ymd('20201219'), ymd('20221219')),
                             c(ymd('20221220'), ymd('20240419')),
                             c(ymd('20240420'), ymd('20260201')))

  # helper: set default + validate against the allowed values
  set_and_check <- function(x, valid, name = deparse(substitute(x))) {
    x <- if (is.null(x)) valid else x
    bad <- setdiff(unique(na.omit(x)), valid)
    if (length(bad)) stop("Invalid ", name, ": ", paste(bad, collapse = ", "), call. = FALSE)
    x
  }

  # set and check values
  discipline <- set_and_check(discipline, valid_disciplines,  "discipline")
  university <- set_and_check(university, valid_universities, "university")
  position   <- set_and_check(position,   valid_positions,    "position")
  waves      <- if (is.null(waves)) valid_dates else waves

  # step 1: select scholars matching the filters in at least one of the
  # three measurement years
  authors <- data[["demographics"]] |>
    filter(
      if_any(c(university_22, university_24, university_25),
            ~ .x |>
              toupper() |>
              str_detect(paste(university, collapse = "|"))),
      if_any(c(discipline_22, discipline_24, discipline_25),
            ~ .x %in% discipline),
      if_any(c(position_22, position_24, position_25),
            ~ .x %in% position)
    ) |>
    arrange(uid)
  uids <- authors |> pull(uid) |> sort()

  # step 2: one adjacency matrix per wave (and per type when type is NULL)
  nets <- list()
  if (is.null(type)) {
    for (t in c('first', 'last', 'all')) {
      for (w in seq_along(waves)) {
        nets[[t]][[w]] <- make_adjacency_matrix(
          data, uids, type = t, min_year = waves[[w]][1], max_year = waves[[w]][2])
      }
    }
  } else {
    for (w in seq_along(waves)) {
      nets[[w]] <- make_adjacency_matrix(
        data, uids, type, min_year = waves[[w]][1], max_year = waves[[w]][2])
    }
  }

  # step 3: bundle the node selection with the networks
  output <- list(
    data = authors,
    nets = nets
  )

  return(output)
}

compute_ego_statistics()

Compute per-scholar network centrality statistics from an igraph network object.

Arguments

nets

A list with at least a graph element (an igraph object) and optionally a data element (a demographics tibble with a uid column).

Method

The function checks whether the graph is directed and computes total, in-, and out-degree (filling NA for undirected graphs), normalized betweenness centrality, and normalized closeness centrality for every vertex. These are assembled into a tibble indexed by vertex name as uid. If nets$data contains a uid column, the demographics are joined onto the centrality tibble via a right join (keeping every vertex) and a subset of personal identifier columns is dropped before returning the result.

compute_ego_statistics <- function(nets) {
  # Per-vertex centrality statistics for nets$graph, optionally joined with
  # the demographics stored in nets$data.
  #
  # nets: list with `graph` (an igraph object) and optionally `data`
  #       (demographics tibble with a uid column).
  # Returns a tibble keyed by uid with degree, indegree, outdegree,
  # normalized betweenness and closeness; in/out degree are NA for
  # undirected graphs.
  g <- nets$graph
  stopifnot(inherits(g, "igraph"))

  is_dir <- igraph::is_directed(g)

  # 1) Basic degrees ("all"/"total" are synonyms in igraph::degree)
  degree_total <- igraph::degree(
    g,
    mode = if (is_dir) "all" else "total"
  )

  degree_in <- if (is_dir) {
    igraph::degree(g, mode = "in")
  } else {
    rep(NA_real_, igraph::vcount(g))
  }

  degree_out <- if (is_dir) {
    igraph::degree(g, mode = "out")
  } else {
    rep(NA_real_, igraph::vcount(g))
  }

  # 2) Betweenness (normalized)
  betweenness <- igraph::betweenness(
    g,
    directed   = is_dir,
    normalized = TRUE
  )

  # 3) Closeness (normalized); suppressWarnings because igraph warns on
  # disconnected graphs. The original `if (is_dir) "all" else "all"` had
  # two identical branches, so it collapses to a plain "all".
  closeness <- suppressWarnings(
    igraph::closeness(
      g,
      mode       = "all",
      normalized = TRUE
    )
  )
  # 4) Assemble into a tibble (one row per vertex)
  stats <- tibble::tibble(
    uid          = igraph::V(g)$name,
    degree       = as.numeric(degree_total),
    indegree     = as.numeric(degree_in),
    outdegree    = as.numeric(degree_out),
    betweenness  = as.numeric(betweenness),
    closeness    = as.numeric(closeness)
  )

  # 5) Optionally add demographics stored in nets$data; the right join keeps
  # every vertex, then personal-identifier column ranges are dropped
  if (!is.null(nets$data) && "uid" %in% colnames(nets$data)) {
    stats <- nets$data |>
      right_join(stats, by = "uid") |>
      arrange(uid) |>
      select(!initials:maiden_name) |>
      select(!gender_prob:origin)
  }

  return(stats)
}

group_means()

Summarize mean network centrality statistics by a grouping variable.

Arguments

df

A tibble containing degree, betweenness, and closeness columns (as produced by compute_ego_statistics()) along with the grouping column.

group_var

Character string naming the column to group by (e.g. "gender", "last_position").

Method

The function filters out rows where group_var is NA, groups by that variable, and computes the count of scholars and the mean of degree, betweenness, and closeness. The grouping column is renamed to group and a group_var identifier column is prepended to facilitate downstream bind_rows() calls.

group_means <- function(df, group_var) {
  # Mean centrality statistics by one grouping variable.
  #
  # df:        tibble with degree, betweenness and closeness columns plus
  #            the grouping column (see compute_ego_statistics()).
  # group_var: string naming the column to group by.
  # Returns one row per group level; the grouping column is renamed to
  # `group` and a `group_var` id column is prepended so results for several
  # grouping variables can be stacked with bind_rows().
  summarised <- df |>
    filter(!is.na(.data[[group_var]])) |>
    group_by(.data[[group_var]]) |>
    summarise(
      n                = n(),
      degree_mean      = mean(degree, na.rm = TRUE),
      betweenness_mean = mean(betweenness, na.rm = TRUE),
      closeness_mean   = mean(closeness, na.rm = TRUE),
      .groups = "drop"
    )

  summarised |>
    rename(group = 1) |>
    mutate(group_var = group_var, .before = group)
}

pairwise_ttests_with_stats()

Run pairwise Welch t-tests across all group level combinations for multiple network metrics.

Arguments

df

A tibble with scholar-level metrics and at least the columns named in group_var and metrics.

group_var

Character string naming the grouping column (e.g. "last_position", "migrant").

metrics

Character vector of metric column names to test. Defaults to c("degree", "betweenness", "closeness").

p_adjust

Character string naming the p-value adjustment method passed to p.adjust(). Defaults to "holm".

Method

The function converts group_var to a factor (dropping unused levels after filtering NAs), enumerates all unordered pairs of group levels using combn(), and for each pair runs a Welch t-test per metric using broom::tidy() to extract results. The output is a tibble with columns for the group variable, metric, the two group labels, t-statistic, degrees of freedom, mean difference, confidence interval bounds, raw p-value, and a Holm-adjusted p-value computed within each (group_var, metric) combination.

pairwise_ttests_with_stats <- function(
    df, group_var, metrics = c("degree","betweenness","closeness"),
    p_adjust = "holm"
  ) {
  # Welch t-tests for every unordered pair of group levels, per metric.
  #
  # df:        scholar-level tibble containing group_var and the metric
  #            columns.
  # group_var: string naming the grouping column.
  # metrics:   metric column names to test.
  # p_adjust:  method passed to p.adjust(); applied within each
  #            (group_var, metric) combination.
  # Returns a tibble with t statistic, df, mean difference, CI bounds, raw
  # and adjusted p-values for each pair x metric.
  df2 <- df |>
    filter(!is.na(.data[[group_var]])) |>
    select(all_of(c(group_var, metrics))) |>
    mutate("{group_var}" := droplevels(as.factor(.data[[group_var]])))

  # all unordered pairs of group levels
  lvls <- levels(df2[[group_var]])
  pairs <- combn(lvls, 2, simplify = FALSE)

  # long metrics: one row per (scholar, metric)
  df_long <- df2 |>
    pivot_longer(all_of(metrics), names_to = "metric", values_to = "value")

  out <- map_dfr(pairs, function(pr) {
    g1 <- pr[[1]]; g2 <- pr[[2]]

    # restrict to this pair; droplevels so t.test sees exactly two levels
    df_pair <- df_long |>
      filter(.data[[group_var]] %in% c(g1, g2)) |>
      mutate(g = droplevels(.data[[group_var]]))

    df_pair |>
      group_by(metric) |>
      summarise(
        test = list(t.test(value ~ g)),  # Welch by default
        .groups = "drop"
      ) |>
      mutate(tidy = map(test, broom::tidy)) |>
      unnest(tidy) |>
      transmute(
        group_var = group_var,
        metric,
        group1 = g1,
        group2 = g2,
        statistic = as.numeric(statistic), # this is t
        parameter = as.numeric(parameter), # Welch degrees of freedom
        estimate = estimate,               # mean diff; broom::tidy reports estimate1 - estimate2 (group1 - group2)
        conf.low, conf.high,
        p_value = p.value,
        method
      )
  })

  # adjust p-values within each (group_var, metric)
  out |>
    group_by(group_var, metric) |>
    mutate(p_adj = p.adjust(p_value, method = p_adjust)) |>
    ungroup()
}

Application

# load the prepared analysis objects (freadRDS2 is a custom loader from
# src/utils/custom_functions.r)
data <- freadRDS2('data', location = "./data/analysis/")
topics <-freadRDS2('topics', location = './data/analysis/')
# restrict edges to the observation window and flag complete cases
data <- data |>
  limit_observation_period() |>
  determine_complete_cases()
# completeness counts among non-staff scholars
data[['demographics']] |>
  filter(!is_staff) |>
  count(flag)
# single-wave network crossing all co-author pairs
colnet <- fcolnet2(data,
  type = 'all',
  waves = list(c(ymd('20201219'), ymd('20260201')))
)
n = colnet[['nets']][[1]]

# symmetry check: for type = 'all' the upper and lower triangle sums match
n[upper.tri(n)] |> sum(na.rm = TRUE)
[1] 936
n[lower.tri(n)] |> sum(na.rm = TRUE)
[1] 936
# first-author network over the full window
net <- fcolnet2(data,
  type = 'first',
  waves = list(c(2020, 2026))
)

# drop isolates: scholars with no first-author ties in the window
noisolate <- rowSums(net$nets[[1]], na.rm = TRUE) > 0
net$nets[[1]] <- net$nets[[1]][noisolate, noisolate]

net$graph  <- igraph::graph_from_adjacency_matrix(
  net$nets[[1]],
  mode = "directed",
  weighted = NULL,
  diag = FALSE,
  add.colnames = NULL)

# node attributes aligned with the graph's vertex order (both sorted by uid)
df_ego <- net$data |>
  filter(uid %in% colnames(net$nets[[1]]))|>
  mutate(
    uni = coalesce(university_22_first, university_24_first, university_25_first) |>
      toupper()
  ) |>
  arrange(uid)

l <- layout_with_fr(net$graph, niter = 1500)

plot(net$graph,
  vertex.color = ifelse(df_ego$uni == "UVA", "red",
                 ifelse(df_ego$uni == "UVG", "lemonchiffon",
                 ifelse(df_ego$uni == "VU", "darkgrey",
                 ifelse(df_ego$uni == "UVT", "blue",
                 ifelse(df_ego$uni == "RU", "purple",
                 ifelse(df_ego$uni == "EUR", "yellow",
                 ifelse(df_ego$uni == "UU", "green",
                 ifelse(df_ego$uni == "UL", "coral", "white")))))))),
  vertex.shape = ifelse(df_ego$gender == 'female', "circle", 'square'),
  # size by in-degree; was `net$g` (fragile $-partial-match of net$graph)
  vertex.size = 2 + igraph::degree(net$graph, mode = 'in') ^ 0.5,
  vertex.label = NA,
  # the original call passed vertex.size twice (degree-based and a constant
  # 4); plot.igraph used the first, so the duplicate is removed here
  edge.width = 0.2,
  edge.arrow.size = 0.1,
  layout = l)

# first-author network over the full window (isolates kept: the filtering
# line is intentionally commented out)
net_first <- fcolnet2(data,
  type = 'first',
  waves = list(c(2019, 2026))
)
noisolate <- rowSums(net_first$nets[[1]]) > 0
# net_first$nets[[1]] <- net_first$nets[[1]][noisolate, noisolate]

net_first$graph  <- igraph::graph_from_adjacency_matrix(
  net_first$nets[[1]],
  mode = "directed",
  weighted = NULL,
  diag = FALSE,
  add.colnames = NULL)

# last-author network over the same window
net_last <- fcolnet2(data,
  type = 'last',
  waves = list(c(2019, 2026))
)
noisolate <- rowSums(net_last$nets[[1]]) > 0
# net_last$nets[[1]] <- net_last$nets[[1]][noisolate, noisolate]

net_last$graph <- igraph::graph_from_adjacency_matrix(
  net_last$nets[[1]],
  mode = "directed",
  weighted = NULL,
  diag = FALSE,
  add.colnames = NULL)
# demographics with derived grouping variables for the descriptive table
desc_data <- data[["demographics"]] |>
  mutate(
    first_affiliation = coalesce(university_22_first, university_24_first, university_25_first) |> toupper(),
    # most recent known position, ordered by seniority
    last_position = factor(
      coalesce(position_25, position_24, position_22),
      levels = c(
        "Full Professor", "Associate Professor", "Assistant Professor",
        "Researcher or Lecturer", "PhD Candidate"
      )
    ),
    gender  = factor(gender,  levels = c("female", "male")),
    migrant = factor(migrant, levels = c("NL", "M4", "other"))
  )

# long-format counts and shares per grouping variable
desc <- desc_data |>
  select(gender, migrant, first_affiliation, last_position) |>
  pivot_longer(everything(), names_to = "variable", values_to = "value") |>
  group_by(variable) |>
  mutate(
    # preserve existing factor levels when present; otherwise make a factor
    value = if (is.factor(value)) value else factor(value),
    # fct_explicit_na() is deprecated; fct_na_value_to_level() is the
    # forcats >= 1.0 replacement (already used elsewhere in this document)
    value = fct_na_value_to_level(value, level = "Missing"),
    # make sure "Missing" goes last
    value = fct_relevel(value, "Missing", after = Inf)
  ) |>
  count(variable, value, name = "n") |>
  mutate(
    prop = n / sum(n),
    pct  = percent(prop, accuracy = 0.1)
  ) |>
  ungroup() |>
  arrange(variable, value)

desc
# per-scholar centrality statistics for each network variant
net$stats$ego <- compute_ego_statistics(net)
net_first$stats$ego <- compute_ego_statistics(net_first)
net_last$stats$ego <- compute_ego_statistics(net_last)
# counts of scholars per (most recent) position for the bar chart below
plot_df <- data[["demographics"]] |>
  mutate(
    position = factor(
      coalesce(position_25, position_24, position_22),
      levels = c(
        "Full Professor", "Associate Professor", "Assistant Professor",
        "Researcher or Lecturer", "PhD Candidate"
      )
    ),
    # NOTE(review): after the !is.na(position) filter below, `flag` simply
    # mirrors `position`; presumably the completeness flag from
    # determine_complete_cases() was intended here — verify.
    flag = forcats::fct_na_value_to_level(as.factor(position))
  ) |>
  filter(!is.na(position)) |>
  count(position, flag)

# stacked bar chart of counts per position
ggplot(plot_df, aes(x = position, y = n, fill = flag)) +
  geom_col() +
  labs(x = NULL, y = "Count", fill = "Flag") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# counts of scholars per migration background for the bar chart below
plot_df <- data[["demographics"]] |>
  mutate(
    # position = factor(
    #   coalesce(position_25, position_24, position_22),
    #   levels = c(
    #     "Full Professor", "Associate Professor", "Assistant Professor",
    #     "Researcher or Lecturer", "PhD Candidate"
    #   )
    # ),
    # NOTE(review): after the !is.na(migrant) filter below, `flag` mirrors
    # `migrant`; confirm a completeness flag was not intended instead.
    flag = forcats::fct_na_value_to_level(as.factor(migrant))
  ) |>
  filter(!is.na(migrant)) |>
  count(migrant, flag)

# stacked bar chart of counts per migration background
ggplot(plot_df, aes(x = migrant, y = n, fill = flag)) +
  geom_col() +
  labs(x = NULL, y = "Count", fill = "Flag") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# scholar-level centrality stats with derived grouping variables, used for
# all group comparisons below
ego <- net$stats$ego |>
  mutate(
    first_affiliation = coalesce(university_22_first, university_24_first, university_25_first) |> toupper(),
    last_position = factor(
      coalesce(position_25, position_24, position_22),
      levels = c(
        "Full Professor", "Associate Professor", "Assistant Professor",
        "Researcher or Lecturer", "PhD Candidate"
      )
    ),
    gender  = factor(gender,  levels = c("female", "male")),
    migrant = factor(migrant, levels = c("NL", "M4", "other"))
  )

# group means per grouping variable, stacked into one table
means_by_gender   <- group_means(ego, "gender")
means_by_position <- group_means(ego, "last_position")
means_by_migrant  <- group_means(ego, "migrant")
means_by_affiliation <- group_means(ego, "first_affiliation")

means_all <- bind_rows(means_by_gender, means_by_position, means_by_migrant, means_by_affiliation)
means_all
# long format for the faceted bar charts, with readable metric labels
plot_data <- means_all |>
  pivot_longer(
    cols = c(degree_mean, betweenness_mean, closeness_mean),
    names_to = "metric",
    values_to = "mean"
  ) |>
  mutate(
    metric = factor(
      metric,
      levels = c("degree_mean", "betweenness_mean", "closeness_mean"),
      labels = c("Degree", "Betweenness", "Closeness")
    )
  )

# shared ggplot theme for the three metric charts
base_theme <- theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid.major.x = element_blank(),
    legend.position = "none"
  )

# one fill colour per group level, shared across all grouping variables
pal <- c(
  "female" = "#fca5a5",
  "male" = "#b91c1c",
  "NL" = "#93c5fd",
  "M4" = "#3b82f6",
  "other" = "#1d4ed8",
  "Full Professor" = "#14532d",
  "Associate Professor" = "#166534",
  "Assistant Professor" = "#16a34a",
  "Researcher or Lecturer" = "#4ade80",
  "PhD Candidate" = "#bbf7d0"
)

# mean degree per group, faceted by grouping variable
p_degree <- plot_data |>
  filter(metric == "Degree") |>
  ggplot(aes(x = group, y = mean, fill = group)) +
  geom_col() +
  facet_wrap(~ group_var, scales = "free_x") +
  coord_cartesian(ylim = c(0, 6)) +
  scale_fill_manual(values = pal) +
  labs(
    x = NULL,
    y = "Mean degree",
    title = "Average degree by group"
  ) +
  base_theme

# mean betweenness per group, faceted by grouping variable
p_betweenness <- plot_data |>
  filter(metric == "Betweenness") |>
  ggplot(aes(x = group, y = mean, fill = group)) +
  geom_col() +
  facet_wrap(~ group_var, scales = "free_x") +
  coord_cartesian(ylim = c(0, 0.01)) +
  scale_fill_manual(values = pal) +
  labs(
    x = NULL,
    y = "Mean betweenness",
    title = "Average betweenness by group"
  ) +
  base_theme

# mean closeness per group, faceted by grouping variable
p_closeness <- plot_data |>
  filter(metric == "Closeness") |>
  ggplot(aes(x = group, y = mean, fill = group)) +
  geom_col() +
  facet_wrap(~ group_var, scales = "free_x") +
  coord_cartesian(ylim = c(0, 1)) +
  scale_fill_manual(values = pal) +
  labs(
    x = NULL,
    y = "Mean closeness",
    title = "Average closeness by group"
  ) +
  base_theme

p_degree

p_betweenness

p_closeness

# gender has exactly two levels, so one Welch t-test per metric suffices
tt_gender <- ego |>
  filter(!is.na(gender)) |>
  pivot_longer(c(degree, betweenness, closeness), names_to = "metric", values_to = "value") |>
  group_by(metric) |>
  summarise(
    test = list(t.test(value ~ gender)),
    .groups = "drop"
  ) |>
  mutate(tidy = map(test, broom::tidy)) |>
  select(metric, tidy) |>
  unnest(tidy) |>
  mutate(
    # labels follow the factor level order ("female", "male") set above
    group_var = 'gender',
    group1 = 'female',
    group2 = 'male'
  ) |>
  relocate(group_var, .before = metric) |>
  relocate(group1, group2, .after = metric) |>
  select(-estimate1, -estimate2, -alternative) |>
  rename(p_value = p.value)

tt_gender
# pairwise Welch tests for the multi-level grouping variables
tt_position <- pairwise_ttests_with_stats(ego, "last_position")
tt_migrant  <- pairwise_ttests_with_stats(ego, "migrant")
tt_affiliation  <- pairwise_ttests_with_stats(ego, "first_affiliation")

tt_position
tt_migrant
tt_affiliation
# in/out-degree comparisons on the first-author (directed) network
ego2 <- net_first$stats$ego |>
  mutate(
    first_affiliation = coalesce(university_22_first, university_24_first, university_25_first) |> toupper(),
    last_position = factor(
      coalesce(position_25, position_24, position_22),
      levels = c(
        "Full Professor", "Associate Professor", "Assistant Professor",
        "Researcher or Lecturer", "PhD Candidate"
      )
    ),
    gender  = factor(gender,  levels = c("female", "male")),
    migrant = factor(migrant, levels = c("NL", "M4", "other"))
  )

# gender: single Welch t-test per metric (two levels)
tt_gender2 <- ego2 |>
  filter(!is.na(gender)) |>
  pivot_longer(c(indegree, outdegree), names_to = "metric", values_to = "value") |>
  group_by(metric) |>
  summarise(
    test = list(t.test(value ~ gender)),
    .groups = "drop"
  ) |>
  mutate(tidy = map(test, broom::tidy)) |>
  select(metric, tidy) |>
  unnest(tidy)|>
  mutate(
    # labels follow the factor level order ("female", "male") set above
    group_var = 'gender',
    group1 = 'female',
    group2 = 'male'
  ) |>
  relocate(group_var, .before = metric) |>
  relocate(group1, group2, .after = metric)  |>
  select(-estimate1, -estimate2, -alternative) |>
  rename(p_value = p.value)

# multi-level groupings: pairwise Welch tests on in/out-degree
tt_position2 <- ego2|>
  pairwise_ttests_with_stats('last_position', metrics = c('indegree', 'outdegree'))
tt_migrant2 <- ego2|>
  pairwise_ttests_with_stats('migrant', metrics = c('indegree', 'outdegree'))
tt_affiliation2 <- ego2|>
  pairwise_ttests_with_stats('first_affiliation', metrics = c('indegree', 'outdegree'))

tt_gender2
tt_position2
tt_migrant2
tt_affiliation2
# merge descriptive counts with the group means of the centrality statistics
desc_table = desc |>
  select(-pct) |>
  left_join(
    means_all |> select(-n),
    by = join_by(
      variable == group_var,
      # 'group' is a quoted column name here (join_by accepts strings)
      value == 'group'
    )
  ) |>
  mutate(
    # human-readable labels for the table
    variable = case_when(
      variable == 'first_affiliation' ~ 'Affiliation',
      variable == 'gender' ~ 'Gender',
      variable == 'last_position' ~ 'Position',
      variable == 'migrant' ~ 'Migration Background'
    ),
    value = case_when(
      value == 'M4' ~ 'Minority',
      value == 'NL' ~ 'Majority',
      value == 'other' ~ 'Other',
      value == 'female' ~ 'Female',
      value == 'male' ~ 'Male',
      .default = value
    ),
    # fixed display order for sections and rows
    variable = factor(
      variable,
      levels = c('Gender', 'Migration Background', 'Affiliation', 'Position')
    ),
    value = factor(
      value,
      levels = c(
        'Female', 'Male', 'Majority', 'Minority', 'Other', 'EUR', 'RU', 'UL',
        'UU', 'UVA', 'UVG', 'UVT', 'VU', 'Full Professor', 'Associate Professor',
        'Assistant Professor', 'Researcher or Lecturer', 'PhD Candidate', 'Missing'
      )
    )
  ) |>
  arrange(variable, value)

saveRDS(desc_table, file = file.path('results', 'tables', 'desc.Rds'))

# render the descriptive table with gt, grouped by variable
gt_desc_table <- desc_table |>
  group_by(variable) |>
  gt() |>
  cols_label(
    .list = c(
      "variable" = "",
      "value" = "",
      "n" = 'N',
      "prop" = '%',
      'degree_mean' = 'Degree',
      'betweenness_mean' = 'Betweenness',
      'closeness_mean' = 'Closeness'
    )
  ) |>
  tab_spanner(
    label = 'Centrality',
    columns = c(degree_mean, betweenness_mean, closeness_mean)) |>
  fmt_number(decimals = 3, drop_trailing_zeros = FALSE) |>
  fmt_number(columns = 3, decimals = 0) |>
  fmt_percent(columns = 4) |>
  # fmt_missing() is deprecated in gt (>= 0.6); sub_missing() replaces it
  sub_missing(missing_text = '') |>
  cols_align(columns = 2, align = 'left') |>
  cols_align(
    columns = c(degree_mean, betweenness_mean, closeness_mean),
    align = 'center'
  )
saveRDS(gt_desc_table, file = file.path('results', 'tables', 'gtdesc.Rds'))

gt_desc_table
N %
Centrality
Degree Betweenness Closeness
Gender
Female 669 51.34% 2.585 0.001 0.285
Male 619 47.51% 3.349 0.002 0.304
Missing 15 1.15%


Migration Background
Majority 681 52.26% 3.298 0.002 0.283
Minority 65 4.99% 2.154 0.001 0.301
Other 551 42.29% 2.486 0.001 0.312
Missing 6 0.46%


Affiliation
EUR 131 10.05% 2.517 0.000 0.232
RU 145 11.13% 3.344 0.002 0.261
UL 155 11.90% 2.241 0.001 0.557
UU 89 6.83% 3.927 0.003 0.175
UVA 292 22.41% 2.819 0.002 0.326
UVG 176 13.51% 3.231 0.001 0.162
UVT 68 5.22% 3.278 0.003 0.161
VU 244 18.73% 2.513 0.000 0.441
Missing 3 0.23%


Position
Full Professor 166 12.74% 5.545 0.004 0.280
Associate Professor 119 9.13% 3.550 0.002 0.308
Assistant Professor 214 16.42% 3.150 0.002 0.292
Researcher or Lecturer 301 23.10% 2.179 0.000 0.316
PhD Candidate 376 28.86% 1.713 0.000 0.285
Missing 127 9.75%


# combine all t-test tables and reshape them for the gt significance table
tt <- bind_rows(
    tt_gender, tt_gender2,
    tt_position, tt_position2,
    tt_migrant, tt_migrant2,
    tt_affiliation, tt_affiliation2
  ) |>
  mutate(
    metric = factor(
      metric, levels = c(
        'betweenness', 'closeness', 'degree', 'indegree', 'outdegree')),
    # human-readable grouping-variable labels, in fixed display order
    group_var = case_when(
        group_var == 'first_affiliation' ~ 'Affiliation',
        group_var == 'gender' ~ 'Gender',
        group_var == 'last_position' ~ 'Position',
        group_var == 'migrant' ~ 'Migration Background'
      ) |>
      factor(
        levels = c('Gender', 'Migration Background', 'Affiliation', 'Position')
      ),
    group1 = case_when(
      group1 == 'female' ~ 'Female',
      group1 == 'male' ~ 'Male',
      group1 == 'NL' ~ 'Majority',
      group1 == 'M4' ~ 'Minority',
      group1 == 'other' ~ 'Other',
      .default = group1
    ),
    group2 = case_when(
      group2 == 'female' ~ 'Female',
      group2 == 'male' ~ 'Male',
      group2 == 'NL' ~ 'Majority',
      group2 == 'M4' ~ 'Minority',
      group2 == 'other' ~ 'Other',
      .default = group2
    ),
    # significance stars from the raw (unadjusted) p-values
    stars = case_when(
        p_value < 0.001 ~ "***",
        p_value < 0.01  ~ "**",
        p_value < 0.05  ~ "*",
        TRUE            ~ ""
      ),
    # compact "t(df)stars" cell label
    f = sprintf("%.3f(%.0f)%s", statistic, parameter, stars)
  ) |>
  arrange(group_var, metric) |>
  select(!statistic:stars) |>
  # one row per group pair; metric columns placed side by side
  pivot_wider(
    id_cols = c(group_var, group1, group2),
    names_from = metric,
    values_from = c(estimate, f),
    names_glue = "{metric}_{.value}"
  ) |>
  # sort the metric columns alphabetically after the id columns
  (\(x) dplyr::select(
    x,
    group_var,
    group1,
    group2,
    all_of(sort(setdiff(names(x), c("group_var", "group1", "group2"))))
  ))()

saveRDS(tt, file = file.path('results', 'tables', 'ttest.Rds'))

# Build the presentation gt table of pairwise t-tests: rows grouped by the
# grouping variable, one spanner per centrality metric, each spanner holding
# the mean difference (ΔM) and the formatted t(df) statistic.
# Fix: use `<-` for assignment (tidyverse style; matches `tt <-` above),
# not `=`.
gt_tt <- tt |>
  group_by(group_var) |>
  gt() |>
  # Strip the metric prefix from column names and relabel the value suffixes.
  # Note the "^degree" pattern is anchored, so it does not touch the
  # "indegree"/"outdegree" columns.
  cols_label_with(
    fn = \(x) x |>
      stringr::str_replace("^degree", "") |>
      stringr::str_replace("^betweenness", "") |>
      stringr::str_replace("^closeness", "") |>
      stringr::str_replace("^indegree", "") |>
      stringr::str_replace("^outdegree", "") |>
      stringr::str_replace("_estimate$", "ΔM") |>
      stringr::str_replace("_f$", "t(df)")
  ) |>
  # The id columns need no headers; the spanners carry the labels.
  cols_label(
    group_var = "",
    group1 = "",
    group2 = ""
  ) |>
  # Show each contrast as "(group1 - group2)" in a single stub column.
  cols_merge(
    columns = c(group1, group2),
    pattern = "({1} - {2})"
  ) |>
  tab_spanner(
    label = 'Degree',
    columns = c(degree_estimate, degree_f)) |>
  tab_spanner(
    label = 'In-Degree',
    columns = c(indegree_estimate, indegree_f)) |>
  tab_spanner(
    label = 'Out-Degree',
    columns = c(outdegree_estimate, outdegree_f)) |>
  tab_spanner(
    label = 'Closeness',
    columns = c(closeness_estimate, closeness_f)) |>
  tab_spanner(
    label = 'Betweenness',
    columns = c(betweenness_estimate, betweenness_f)) |>
  # Three-decimal formatting for the numeric (estimate) columns.
  fmt_number(decimals = 3)

# Persist the formatted t-test gt table alongside the underlying data.
saveRDS(gt_tt, file = file.path('results', 'tables', 'gtttest.Rds'))

# Print the table into the rendered document.
gt_tt
Betweenness
Closeness
Degree
In-Degree
Out-Degree
ΔM t(df) ΔM t(df) ΔM t(df) ΔM t(df) ΔM t(df)
Gender
(Female - Male) −0.001 -1.192(322) −0.019 -0.662(363) −0.764 -2.639(357)** −0.237 -2.241(1051)* 0.030 0.419(1155)
Migration Background
(Majority - Minority) 0.001 1.159(41) −0.018 -0.370(29) 1.144 2.844(46)** 0.655 4.547(151)*** 0.072 0.433(76)
(Majority - Other) 0.001 1.511(386) −0.029 -0.882(240) 0.812 2.849(370)** 0.618 5.970(978)*** 0.311 4.309(1106)***
(Minority - Other) 0.000 -0.262(35) −0.010 -0.189(38) −0.332 -0.837(43) −0.037 -0.294(91) 0.239 1.456(73)
Affiliation
(EUR - RU) −0.002 -2.385(67)* −0.029 -0.649(107) −0.827 -1.600(99) −0.169 -0.650(223) −0.407 -2.388(228)*
(EUR - UL) −0.001 -0.676(29) −0.325 -4.005(36)*** 0.275 0.638(58) 0.657 4.252(175)*** 0.545 4.669(235)***
(EUR - UU) −0.002 -2.657(42)* 0.057 1.647(55) −1.410 -2.083(53)* −0.567 -1.430(97) −0.390 -2.077(128)*
(EUR - UVA) −0.001 -2.358(79)* −0.093 -1.741(115) −0.303 -0.749(128) 0.376 2.323(207)* 0.235 1.913(285)
(EUR - UVG) −0.001 -0.767(54) 0.070 2.024(55)* −0.714 -1.401(82) 0.073 0.313(255) −0.019 -0.133(267)
(EUR - UVT) −0.002 -1.866(17) 0.071 2.038(56)* −0.761 -1.199(24) 0.268 1.168(130) 0.250 1.400(106)
(EUR - VU) 0.000 0.986(110) −0.208 -3.876(122)*** 0.004 0.011(138) 0.297 1.770(231) 0.229 1.948(254)
(RU - UL) 0.001 0.810(55) −0.295 -3.751(33)*** 1.102 1.933(89) 0.826 3.660(151)*** 0.952 6.108(187)***
(RU - UU) −0.001 -0.500(88) 0.086 2.982(59)** −0.583 -0.754(79) −0.399 -0.928(127) 0.017 0.080(174)
(RU - UVA) 0.000 0.499(119) −0.064 -1.282(110) 0.524 0.953(115) 0.545 2.362(164)* 0.642 4.000(209)***
(RU - UVG) 0.001 1.029(109) 0.099 3.434(59)** 0.113 0.179(113) 0.242 0.845(261) 0.388 2.198(246)*
(RU - UVT) −0.001 -0.455(29) 0.100 3.436(61)** 0.066 0.090(40) 0.437 1.548(180) 0.657 3.182(152)**
(RU - VU) 0.002 2.618(65)* −0.179 -3.567(118)*** 0.831 1.569(108) 0.466 1.983(175)* 0.635 4.065(192)***
(UL - UU) −0.002 -1.174(59) 0.382 5.198(25)*** −1.685 -2.345(60)* −1.225 -3.260(79)** −0.935 -5.355(101)***
(UL - UVA) −0.001 -0.495(43) 0.231 2.751(41)** −0.578 -1.227(73) −0.281 -2.841(398)** −0.310 -3.046(392)**
(UL - UVG) 0.000 0.064(60) 0.395 5.376(25)*** −0.989 -1.754(79) −0.584 -2.968(170)** −0.564 -4.495(237)***
(UL - UVT) −0.002 -1.025(37) 0.396 5.381(25)*** −1.036 -1.527(29) −0.389 -2.048(72)* −0.295 -1.785(82)
(UL - VU) 0.001 0.826(28) 0.116 1.381(42) −0.271 -0.606(66) −0.360 -3.313(379)** −0.316 -3.325(368)***
(UU - UVA) 0.001 0.997(71) −0.150 -3.652(63)*** 1.107 1.576(61) 0.943 2.491(82)* 0.625 3.496(111)***
(UU - UVG) 0.002 1.420(88) 0.013 2.901(89)** 0.696 0.906(75) 0.641 1.544(114) 0.371 1.919(140)
(UU - UVT) 0.000 -0.072(33) 0.014 2.372(28)* 0.649 0.758(51) 0.835 2.027(107)* 0.640 2.896(132)**
(UU - VU) 0.003 2.852(41)** −0.265 -6.417(69)*** 1.414 2.058(56)* 0.865 2.267(84)* 0.618 3.534(103)***
(UVA - UVG) 0.001 0.688(93) 0.163 3.970(63)*** −0.411 -0.757(98) −0.303 -1.495(189) −0.254 -1.936(282)
(UVA - UVT) −0.001 -0.802(23) 0.164 3.975(64)*** −0.458 -0.692(28) −0.108 -0.552(81) 0.015 0.090(91)
(UVA - VU) 0.001 2.678(75)** −0.115 -1.977(130) 0.307 0.729(146) −0.079 -0.664(475) −0.006 -0.061(483)
(UVG - UVT) −0.002 -1.174(31) 0.001 0.152(28) −0.047 -0.064(38) 0.194 0.749(169) 0.269 1.458(118)
(UVG - VU) 0.001 0.963(52) −0.278 -6.733(69)*** 0.718 1.373(90) 0.224 1.079(206) 0.248 1.964(253)
(UVT - VU) 0.003 1.991(17) −0.279 -6.727(70)*** 0.765 1.186(25) 0.029 0.146(89) −0.022 -0.131(83)
Position
(Full Professor - Associate Professor) 0.002 1.882(96) −0.028 -0.532(110) 1.995 2.868(103)** 0.713 2.199(270)* −0.174 -1.005(259)
(Full Professor - Assistant Professor) 0.002 1.700(101) −0.012 -0.259(134) 2.395 3.635(90)*** 1.418 4.961(207)*** 0.111 0.751(340)
(Full Professor - Researcher or Lecturer) 0.003 3.204(70)** −0.036 -0.701(123) 3.366 5.391(75)*** 2.078 7.662(170)*** 0.502 3.945(246)***
(Full Professor - PhD Candidate) 0.004 3.437(66)** −0.005 -0.123(122) 3.832 6.272(69)*** 2.149 7.950(168)*** 0.203 1.596(248)
(Associate Professor - Assistant Professor) 0.000 -0.248(137) 0.016 0.312(110) 0.400 0.909(118) 0.705 3.445(186)*** 0.285 1.778(235)
(Associate Professor - Researcher or Lecturer) 0.001 2.097(75)* −0.008 -0.141(114) 1.371 3.548(85)*** 1.365 7.430(126)*** 0.676 4.762(163)***
(Associate Professor - PhD Candidate) 0.001 2.565(64)* 0.023 0.494(91) 1.837 5.042(70)*** 1.436 7.872(122)*** 0.377 2.657(164)**
(Assistant Professor - Researcher or Lecturer) 0.001 2.332(98)* −0.024 -0.479(126) 0.971 3.079(129)** 0.660 6.506(263)*** 0.391 3.603(361)***
(Assistant Professor - PhD Candidate) 0.002 2.783(85)** 0.007 0.178(147) 1.437 4.991(104)*** 0.731 7.380(241)*** 0.092 0.851(368)
(Researcher or Lecturer - PhD Candidate) 0.000 0.894(107) 0.031 0.687(108) 0.466 2.372(120)* 0.071 1.740(579) −0.299 -3.773(664)***
Back to top