Stratification and Heterogeneity

Subgroup analyses are useful, but two different questions are often confused:

  1. What is the association within each subgroup?
  2. Is the association significantly different between subgroups?

PolyGenius treats these as different workflows.

Stratified Reporting

Use split.by when you want separate association estimates within each observed stratum:

stratified <- associate$regression(
  data,
  outcome = demented,
  predictors = PRS,
  covariates = c(age, PCA),
  split.by = sex
)

This fits one model among males and another model among females. The returned summary table stores the subgroup in the stratum column.

visualize$associations$forest(stratified)

Use this when the goal is descriptive subgroup reporting:

  • show the association in each sex;
  • report associations within APOE4 carrier groups;
  • inspect whether a signal is consistent across cohorts or ancestry groups;
  • make split-specific survival curves from the stored artifacts.
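As a rough mental model, and not a description of PolyGenius internals, a stratified fit is comparable to fitting the same base R model separately within each level of the split variable. The sketch below uses simulated data. The data frame `sim`, the column `PC1` (standing in for the PCA covariates), and the helper `fit_one_stratum` are hypothetical names introduced only for illustration.

```r
# Hypothetical simulated data; column names mirror the examples above.
set.seed(1)
n <- 400
sim <- data.frame(
  PRS = rnorm(n),
  age = rnorm(n, mean = 70, sd = 8),
  PC1 = rnorm(n),
  sex = factor(sample(c("female", "male"), n, replace = TRUE))
)
sim$demented <- rbinom(n, 1, plogis(-1 + 0.5 * sim$PRS + 0.02 * (sim$age - 70)))

# Fit one logistic model within a single stratum and keep the PRS row.
fit_one_stratum <- function(d) {
  fit <- glm(demented ~ PRS + age + PC1, family = binomial, data = d)
  est <- coef(summary(fit))["PRS", ]
  data.frame(
    stratum  = as.character(d$sex[1]),
    estimate = unname(est["Estimate"]),
    se       = unname(est["Std. Error"]),
    p.value  = unname(est["Pr(>|z|)"]),
    n        = nobs(fit)
  )
}

# One row per stratum, analogous to the stratum column described above.
stratified.sketch <- do.call(rbind, lapply(split(sim, sim$sex), fit_one_stratum))
print(stratified.sketch)
```

Each row is estimated from its own subset of the data, so the rows describe the subgroups but carry no test of whether they differ from each other.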

What split.by Does Not Do

split.by does not formally compare the subgroup estimates. A small p-value in one subgroup and a large p-value in another is not evidence that the effects differ, because each p-value also depends on the sample size and precision within that subgroup.

This is not a valid heterogeneity test:

# Descriptive only:
associate$regression(
  data,
  outcome = demented,
  predictors = PRS,
  split.by = sex
)

The question a formal test has to answer is:

Is the PRS effect different by sex?

That is a heterogeneity or interaction question.
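A small worked example may make the distinction concrete. The numbers below are made up for illustration and are not results from any dataset. The sketch assumes two independent subgroup estimates on the log-odds scale and compares them with a simple Wald z-test of their difference, which is one common approach and not necessarily what PolyGenius uses.

```r
# Hypothetical subgroup estimates on the log-odds scale (illustrative values only).
beta.female <- 0.30; se.female <- 0.12
beta.male   <- 0.20; se.male   <- 0.15

# Two-sided Wald p-value within each subgroup.
p.female <- 2 * pnorm(-abs(beta.female / se.female))
p.male   <- 2 * pnorm(-abs(beta.male / se.male))

# Wald test of the difference, assuming the two estimates are independent.
diff.est <- beta.female - beta.male
diff.se  <- sqrt(se.female^2 + se.male^2)
p.diff   <- 2 * pnorm(-abs(diff.est / diff.se))

print(round(c(p.female = p.female, p.male = p.male, p.diff = p.diff), 3))
```

With these made-up numbers, one subgroup falls below the conventional 0.05 threshold and the other does not, yet the direct test of the difference is far from significant.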

Heterogeneity In Classical Models

For linear and logistic regression, interaction terms can be fit directly:

interaction.assoc <- associate$regression(
  data,
  outcome = demented,
  predictors = PRS,
  covariates = c(age, PCA),
  interactions = sex
)

Conceptually:

demented ~ PRS * sex + age + PCA

Rows with term.type = "interaction" describe interaction coefficients. For a two-level modifier there is a single interaction row, and it is the formal test that the predictor effect differs between the non-reference and reference groups.

For multi-level modifiers, a full heterogeneity answer is often an omnibus test across several interaction coefficients. That belongs naturally in the comparison workflow.
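For readers who want to see the underlying idea in base R, the sketch below fits an interaction model with a three-level modifier and runs an omnibus likelihood-ratio test across both interaction coefficients. It is a minimal illustration on simulated data and does not claim to reproduce what PolyGenius does internally. The data frame `sim` and the `ancestry` column with levels A, B, and C are hypothetical.

```r
# Hypothetical simulated data with a three-level modifier.
set.seed(2)
n <- 600
sim <- data.frame(
  PRS      = rnorm(n),
  age      = rnorm(n, mean = 70, sd = 8),
  ancestry = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
sim$demented <- rbinom(n, 1, plogis(-1 + 0.4 * sim$PRS))

# Main-effects model and the model that adds PRS-by-ancestry interactions.
main.fit <- glm(demented ~ PRS + ancestry + age, family = binomial, data = sim)
int.fit  <- glm(demented ~ PRS * ancestry + age, family = binomial, data = sim)

# Per-coefficient interaction rows: one per non-reference ancestry level.
coefs <- coef(summary(int.fit))
print(coefs[grep(":", rownames(coefs)), , drop = FALSE])

# Omnibus likelihood-ratio test across all interaction coefficients at once.
omnibus <- anova(main.fit, int.fit, test = "LRT")
print(omnibus)
```

The individual interaction rows each compare one level with the reference level, while the omnibus test asks whether the PRS effect varies across the levels at all.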

Heterogeneity As A Comparison Question

The question-first comparison interface is:

heterogeneity <- associate$compare(
  data,
  outcome = demented,
  predictors = PRS,
  covariates = c(age, PCA),
  by = sex,
  type = "heterogeneity"
)

This means:

Test whether the association between PRS and demented differs by sex.

The comparison fit should use the statistically appropriate pooled model rather than comparing subgroup p-values.

Use this pattern to test whether:

  • a PRS effect differs by sex;
  • a PRS effect differs by APOE4 status;
  • a score association differs across ancestry groups;
  • a biomarker association differs across diagnostic groups.

Splitting A Comparison

split.by can still be useful inside a comparison workflow when the goal is to repeat a comparison within another grouping variable:

heterogeneity.by.apoe <- associate$compare(
  data,
  outcome = demented,
  predictors = PRS,
  by = sex,
  split.by = APOE4.status,
  type = "heterogeneity"
)

This means:

Within each APOE4 stratum, test whether the PRS association differs by sex.
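In base R terms, this is roughly equivalent to fitting the interaction model separately inside each level of the split variable and reading off the interaction row each time. The sketch below is an illustration on simulated data, not the PolyGenius implementation. The data frame `sim` and the helper `test_within` are hypothetical, and covariates are omitted for brevity.

```r
# Hypothetical simulated data; column names mirror the example above.
set.seed(4)
n <- 800
sim <- data.frame(
  PRS          = rnorm(n),
  sex          = factor(sample(c("female", "male"), n, replace = TRUE)),
  APOE4.status = factor(sample(c("carrier", "noncarrier"), n, replace = TRUE))
)
sim$demented <- rbinom(n, 1, plogis(-1 + 0.4 * sim$PRS))

# Fit PRS-by-sex interaction within one APOE4 stratum and keep the interaction row.
test_within <- function(d) {
  fit <- glm(demented ~ PRS * sex, family = binomial, data = d)
  row <- coef(summary(fit))["PRS:sexmale", ]
  data.frame(
    stratum  = as.character(d$APOE4.status[1]),
    estimate = unname(row["Estimate"]),
    p.value  = unname(row["Pr(>|z|)"]),
    n        = nobs(fit)
  )
}

# One heterogeneity test per APOE4 stratum.
het.sketch <- do.call(rbind, lapply(split(sim, sim$APOE4.status), test_within))
print(het.sketch)
```

Each row answers the sex-heterogeneity question inside one APOE4 stratum. It does not test whether the heterogeneity itself differs between strata, which would require a three-way interaction.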

Tertiles: Predictor, Group, Or Split?

PRS tertiles can be used in different ways depending on the question.

| Question | Better specification |
| --- | --- |
| Do PRS tertile groups differ in outcome? | use prs.tertile as the predictor or by group |
| Do PRS tertile survival curves differ? | model = "km" or type = "group" with by = prs.tertile |
| Is another effect different within PRS tertiles? | split.by = prs.tertile for descriptive estimates, or heterogeneity comparison |
| Is continuous PRS associated with outcome? | use continuous PRS in predictors |

Avoid splitting by a variable derived from the predictor and then interpreting the continuous predictor effect inside each split without a clear scientific reason. Conditioning on the predictor restricts its range within each split, so the within-split slope is estimated from a narrow slice of the data and is hard to interpret.
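If a prs.tertile column does not already exist, one common way to derive it in base R is with quantile() and cut(). The sketch below assumes a hypothetical data frame `sim` with a continuous PRS column. The labels "low", "middle", and "high" are illustrative choices, not names required by PolyGenius.

```r
# Hypothetical data with a continuous PRS.
set.seed(3)
sim <- data.frame(PRS = rnorm(300))

# Tertile cut points from the observed PRS distribution.
breaks <- quantile(sim$PRS, probs = c(0, 1/3, 2/3, 1))

# include.lowest = TRUE keeps the minimum PRS value in the first tertile.
sim$prs.tertile <- cut(
  sim$PRS,
  breaks = breaks,
  labels = c("low", "middle", "high"),
  include.lowest = TRUE
)

print(table(sim$prs.tertile))
```

The resulting factor can then be used as a predictor or grouping variable, as outlined in the table above.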

Returned Object

Stratified regression results use the same family-specific regression schemas as ordinary association results, with the subgroup stored in stratum. Heterogeneity comparisons use the comparison schema.

lm: Stratified Linear Model

| Column | Meaning |
| --- | --- |
| family | "lm". |
| outcome, predictor | Resolved outcome and tested predictor. |
| term, term.type | Coefficient row and row type, usually main or interaction. |
| estimate | Linear-regression beta on the identity scale. |
| se, lower, upper | Standard error and confidence interval for beta. |
| statistic, p.value, adj.p.value | Wald/t-test statistic and p-values. |
| n | Complete-case sample size used by the fit. |
| effect.scale | "identity". |
| formula, fit.id | Resolved model formula and artifact/diagnostic key. |

Added Artifacts

| Artifact | Meaning |
| --- | --- |
| prediction.grid | Fitted outcome values across a predictor grid with covariates held at reference values. |
| profile.table | Analysis-ready observations with predictor, outcome, fitted value, and residual. |

glm: Stratified Logistic Model

| Column | Meaning |
| --- | --- |
| family | "glm". |
| outcome, predictor | Resolved binary outcome and tested predictor. |
| term, term.type | Coefficient row and row type, usually main or interaction. |
| estimate | Logistic-regression coefficient on the log-odds scale. |
| se, lower, upper | Standard error and confidence interval on the log-odds scale. |
| statistic, p.value, adj.p.value | Wald/z statistic and p-values. |
| n, n.cases, n.controls | Complete-case sample size and binary outcome counts. |
| effect.scale | "log.odds"; forest plots display odds ratios. |
| formula, fit.id | Resolved model formula and artifact/diagnostic key. |
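Because the stored estimates are on the log-odds scale while forest plots show odds ratios, it can help to see the conversion spelled out. The sketch below uses a hypothetical two-row summary with made-up values and computes Wald 95% intervals by hand. The columns odds.ratio, or.lower, and or.upper are illustrative names, not part of the documented schema.

```r
# Hypothetical summary rows on the log-odds scale (illustrative values only).
res <- data.frame(
  stratum  = c("female", "male"),
  estimate = c(0.30, 0.20),
  se       = c(0.12, 0.15)
)

# Wald 95% confidence limits on the log-odds scale.
z <- qnorm(0.975)
res$lower <- res$estimate - z * res$se
res$upper <- res$estimate + z * res$se

# Exponentiate the estimate and both limits to obtain odds ratios.
res$odds.ratio <- exp(res$estimate)
res$or.lower   <- exp(res$lower)
res$or.upper   <- exp(res$upper)

print(res)
```

The interval is exponentiated end point by end point, so it is symmetric on the log-odds scale but not around the odds ratio itself.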

Added Artifacts

| Artifact | Meaning |
| --- | --- |
| prediction.grid | Predicted probabilities across a predictor grid with covariates held at reference values. |
| profile.table | Analysis-ready observations with predictor, binary outcome, and fitted probability. |

comparison: Heterogeneity Test

| Column | Meaning |
| --- | --- |
| comparison.type | Planned comparison type: heterogeneity, group, contrast, or nested. |
| outcome, predictor, by | Outcome, focal predictor where applicable, and comparison-defining variable. |
| contrast | Pairwise or named contrast label where applicable. |
| family / model | Model family used for the comparison. |
| estimate, se, lower, upper | Effect estimate and uncertainty where the comparison has a coefficient-like scale. |
| statistic, p.value, adj.p.value | Primary comparison test result. |
| n, n.events, n.competing | Sample-size and event counts where relevant. |
| fit.id | Artifact/diagnostic key. |

Added Artifacts

Comparison artifacts depend on the comparison type. Survival group comparisons reuse the survival-family artifacts (curves, risk.table, group.summary). Contrast and nested-model comparisons may attach model diagnostics or contrast tables when those details do not belong in the one-row summary table.

Plotting

Use forest plots for subgroup estimates:

visualize$associations$forest(stratified)

For survival results, split-specific curves are read from the survival artifacts:

visualize$associations$survival(stratified.survival)

For formal heterogeneity comparisons, use comparison-oriented forest or heatmap views once the comparison rows are returned.