Case study 2: Leveraging PGS models for downstream biological inference

In the following case study we shows the next step after model selection: taking a PGS model and using it for downstream biological inference in an independent cohort. Starting from the AD model selected in the previous scenario, we couple the model to ROSMAP genotype and phenotype data from the Religious Orders Study (Bennett, Schneider, Arvanitakis, et al. 2012) and the Rush Memory and Aging Project (Bennett, Schneider, Buchman, et al. 2012), compute participant-level scores together with relevant covariates, and then test how the AD-PGS relates to clinical diagnosis, neuropathology, and survival-related outcomes.
Show setup code
library(dplyr)
library(data.table)
library(ggplot2)
library(patchwork)

library(PolyGenius)

Select the prioritized AD model

We take the best-performing AD-PGS model from the previous scenario and couple it to the ROSMAP cohort data. The resulting PolyGeniusData object combines the selected model, the ROSMAP genotype data, and the accompanying phenotype table in a single analysis container.

selected.model <- models[[4]]

data <- PolyGeniusData(
  name = "ROSMAP",
  models = selected.model,
  genotypes = GenotypeInfo(
    path = "resources/cohorts/ROSMAP/genotypes",
    format = "pfile",
    build = "GRCh38",
    name = "ROSMAP"
  ),
  obsm = list(phenotypes = fread("resources/cohorts/ROSMAP/phenotypes/phenotypes.csv") %>% 
    mutate(demented = factor(demented, levels = 0:1)))
)
data <- data[!is.na(demented), ]
data

Perform genotype-based computations

Next, we compute individual-level variables such as AD-PGS values, PCA based population structure, and sample kinship, all stroed in the PolyGeniusData object.

# Compute quantities
data$scores$X <- compute$scores(data)
data$obsm$PCA <- compute$populationStructure(data, npcs = 5)
data$obsp$kinship <- compute$kinship(data)

# Visualize results
p1 <- visualize$data$scores$distribution(data, models = 1, group.by = demented, type = "density") + 
  scale_fill_manual(
    values = c(`0` = "darkgreen", `1` = "darkorchid4"),
    labels = c(`0` = "Non demented", `1` = "AD demented"))
p2 <- visualize$data$embedding$obs(data, "PCA")
  
p1 + p2 +
  plot_layout(guides = "collect") &
  theme(legend.position = "bottom") &
  coord_cartesian(expand = FALSE)

Downstream association analyses

With these quantities in place, we test whether the selected AD-PGS associates with AD diagnosis and with AD neuropathology, while adjusting for age at death, sex, and population structure.

data$fetch(demented, braaksc, ceradsc, X, sex, age.death, PCA, .context = "obs")

Since demented is of type factor it will be regressed using a logistic model, while braaksc and ceradsc are numeric and will be regressed using a linear model.

data$modu$associations <- associate$regression(
  data,
  outcomes = c(demented, braaksc, ceradsc),
  covariates = c(sex, age.death, PCA))

p1 <- visualize$associations$forest(data$modu$associations %>% filter(family == "glm"), rows = outcome)
p2 <- visualize$associations$forest(data$modu$associations %>% filter(family == "lm"))

wrap_plots(p1[[1]], p2[[1]], p2[[2]], heights = c(1, 2, .25), ncol = 1)

Survival analysis

We next move from cross-sectional disease status to the timing of clinical dementia onset. This analysis asks whether individuals with a higher AD-PGS tend to develop dementia at younger ages, while preserving the distinction between dementia onset, death without dementia, and censoring.

For PolyGenius survival models, the outcome is still the clinical endpoint being modeled, namely demented. We use age.event is the analysis time: age at dementia onset for cases and age at death for participants who died without dementia. In the competing-risk model, death.without.dementia is handled explicitly as an event that precludes later dementia onset.

Some participants are recorded as demented but do not have an available onset age. Those observations remain useful for the diagnosis and neuropathology analyses above, but they cannot be placed correctly on an age-at-onset time axis. We therefore exclude them from the survival analysis rather than assigning their age at death as an onset time.

data$obsm$phenotypes <- data$obsm$phenotypes %>% mutate(
  age.observed = ifelse(demented == 1, demented.age.onset, age.death)
) 

data$obs$tertiles <- cut(
  data$scores$X[, 1],
  breaks = unique(quantile(data$scores$X[, 1], c(0, .2, .8, 1), na.rm = TRUE)),
  labels = c("Low (<=20%)", "Mid (>20-80%)", "High (>80%)"),
  include.lowest = TRUE
)


data$fetch(demented, age.observed, tertiles, sex, PCA, .context = "obs")

We first divide the AD-PGS into tertiles to provide a simple visual summary of the onset pattern. The Kaplan-Meier analysis compares dementia-free survival across low, middle, and high AD-PGS groups without covariate adjustment.

data$modu$survival.km <- associate$regression(
  data,
  outcomes = demented,
  predictors = tertiles,
  time = age.observed,
  model = "km"
)

p.km <- visualize$associations$survival(data$modu$survival.km, show.risk.table = TRUE, show.summary.table = TRUE, show.statistics = TRUE)
p.km

We then fit an adjusted Cox model using the same tertile predictor. This gives a coefficient-style association result, now interpreted as an effect on the dementia-onset hazard rather than on disease status at death. Sex and ancestry principal components are included as covariates.

data$modu$survival.cox <- associate$regression(
  data,
  outcomes = demented,
  predictors = tertiles,
  time = age.observed,
  covariates = c(sex, PCA),
  model = "cox",
  reference.level = "Mid (>20-80%)"
)

p1 <- visualize$associations$survival(data$modu$survival.cox, curves = "predicted", show.risk.table = FALSE, show.summary.table = FALSE, show.statistics = FALSE)
p2 <- visualize$associations$forest(data$modu$survival.cox, rows.by = c(x=paste(contrast.level, "vs.", reference.level)), table.left = NULL)

wrap_plots(p1, p2[[1]], heights = c(1, .25), ncol = 1)

Finally, we refit the adjusted tertile analysis as a Fine-Gray competing-risk model. This is the preferred formulation when the scientific target is cumulative dementia incidence by age, because death without dementia removes individuals from the set of people who can subsequently experience dementia onset.

data$obsm$phenotypes$died.non.demented <- data$fetch(died & demented == 0)

data$modu$survival.crr <- associate$regression(
  data,
  outcomes = demented,
  predictors = tertiles,
  time = age.observed,
  competing = died.non.demented,
  covariates = c(PCA),
  split.by = sex,
  reference.level = "Mid (>20-80%)",
  model = "crr"
)

p1 <- visualize$associations$survival(data$modu$survival.crr, curves = "predicted", show.risk.table = FALSE, show.summary.table = FALSE, show.statistics = FALSE)
p2 <- visualize$associations$forest(data$modu$survival.crr, rows.by = term, table.left = NULL, group.by=stratum )

wrap_plots(p1, p2[[1]], heights = c(1, .25), ncol = 1)

References

Bennett, David A., Julie A. Schneider, Zoe Arvanitakis, and Robert S. Wilson. 2012. “OVERVIEW AND FINDINGS FROM THE RELIGIOUS ORDERS STUDY.” Current Alzheimer Research 9 (6): 628–45. https://doi.org/10.2174/156720512801322573.
Bennett, David A., Julie A. Schneider, Aron S. Buchman, Lisa L. Barnes, Patricia A. Boyle, and Robert S. Wilson. 2012. “Overview and Findings from the Rush Memory and Aging Project.” Current Alzheimer Research 9 (6): 646–63. https://doi.org/10.2174/156720512801322663.