Extending PolyGenius

Most PolyGenius extensions do not require changing the scheduler itself. The normal extension path is to add new resource declarations, new rules, or both, and let the existing execution engine discover and schedule them the same way it handles the built-in sources and algorithms.

Extension design principles

The existing engine is easiest to extend when you follow the same patterns as the built-in code:

  • Keep cache identity in params, and keep runtime-only hints in .meta.
  • Keep inputs(), bind(), and requirements() light; do heavy I/O in run().
  • Reuse existing resource types when they already match the artifact you need.
  • Normalize outputs into the same data shapes that downstream rules already expect.
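
The first principle can be illustrated with a small sketch. It is not PolyGenius's actual cache implementation: make_spec() and cache_key() are hypothetical stand-ins that use plain lists, and they only show why a value such as an API token belongs in .meta rather than in params.

```r
# Hypothetical stand-in for a resource spec: a plain list with type, params, meta.
make_spec <- function(type, ..., .meta = NULL) {
  list(type = type, params = list(...), meta = .meta)
}

# Hypothetical cache key: built from the type and the sorted params only.
# Anything stored in meta is deliberately ignored.
cache_key <- function(spec) {
  params <- spec$params[order(names(spec$params))]
  fields <- paste0(names(params), "=", vapply(params, as.character, character(1)))
  paste(c(spec$type, fields), collapse = "|")
}

a <- make_spec("gwas.request", source = "mySource", id = "GWAS001",
               .meta = list(api.token = "token-A"))
b <- make_spec("gwas.request", source = "mySource", id = "GWAS001",
               .meta = list(api.token = "token-B"))
c2 <- make_spec("gwas.request", source = "mySource", id = "GWAS002")

# Only meta differs, so both specs name the same cached artifact.
same_artifact <- identical(cache_key(a), cache_key(b))
# The id is part of params, so this is a different artifact.
different_artifact <- !identical(cache_key(a), cache_key(c2))
```

Under this scheme, rotating a token does not invalidate cached results, while changing the id does.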

The core extension building blocks are ResourceSpec, resources.factory, Rule, and ExecutionEngine.

Add a new GWAS source

In most cases, a new GWAS source does not need a new resource type. You usually keep the existing gwas.request -> gwas.sumstats pattern and add a new source label plus a rule that knows how to resolve it.

Step 1: add a user-facing source declaration

generate.sources.mySource <- function(ids, api.token = NULL) {
  ResourceSpecSet(lapply(ids, function(id) {
    resources.factory$gwas.request(
      source = "mySource",
      id = id,
      .meta = list(api.token = api.token)
    )
  }))
}

Step 2: add a rule that resolves that request into gwas.sumstats

FetchMySourceRule <- R6::R6Class(
  "FetchMySourceRule",
  inherit = Rule,
  public = list(
    initialize = function() {
      super$initialize("fetch.my.source")
    },
    matches = function(output.spec) {
      identical(output.spec$type, resources.type$gwas.sumstats) &&
        identical(output.spec$params$source, "mySource")
    },
    inputs = function(output.spec) {
      list(
        request = resources.factory$gwas.request(
          source = "mySource",
          id = output.spec$params$id,
          .meta = output.spec$meta
        )
      )
    },
    requirements = function(output.spec, inputs) {
      list(cores = 1, memory = 2)
    },
    run = function(output.spec, inputs, logger) {
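      request <- inputs$request$spec
      # fetch_somehow() and normalize_to_polygenius_columns() are placeholders
      # for your own download and column-mapping code.
      raw <- fetch_somehow(request$params$id, token = request$meta$api.token)
      normalized <- normalize_to_polygenius_columns(raw)

      list(
        data = PolyGeniusModel(
          variants = normalized,
          name = request$params$id,
          # Set the genome build your source actually reports.
          build = "GRCh38",
          gwas = list(id = request$params$id)
        ),
        meta = list(build = "GRCh38"),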
        logs = list(log.entry(self$name, "completed"))
      )
    }
  )
)
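
Before wiring the rule into the engine, it can help to check the matches() predicate on its own. The sketch below copies the predicate into a free function and runs it against plain-list specs. The resources.type list here is a two-entry stand-in, and real specs are ResourceSpec objects rather than lists.

```r
# Stand-in for PolyGenius's resources.type lookup (only the names used here).
resources.type <- list(gwas.request = "gwas.request", gwas.sumstats = "gwas.sumstats")

# Same predicate as FetchMySourceRule$matches(), written as a free function.
matches_my_source <- function(output.spec) {
  identical(output.spec$type, resources.type$gwas.sumstats) &&
    identical(output.spec$params$source, "mySource")
}

wanted <- list(type = "gwas.sumstats", params = list(source = "mySource", id = "GWAS001"))
other.source <- list(type = "gwas.sumstats", params = list(source = "otherSource", id = "X"))
other.type <- list(type = "gwas.request", params = list(source = "mySource", id = "GWAS001"))
no.source <- list(type = "gwas.sumstats", params = list(id = "GWAS001"))

results <- c(
  wanted = matches_my_source(wanted),
  other.source = matches_my_source(other.source),
  other.type = matches_my_source(other.type),
  # identical(NULL, "mySource") is FALSE rather than an error or NA.
  no.source = matches_my_source(no.source)
)
```

Using identical() instead of == keeps the predicate a single TRUE or FALSE even when a parameter is missing.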

Step 3: normalize the returned GWAS columns

At minimum, built-in generation expects these columns:

  • chr
  • position
  • ea
  • nea
  • beta
  • pval

If the new source should support LDpred2 or lassosum2 cleanly, you should also provide:

  • se
  • n, or source metadata carrying n_eff / sample_size

If you want eaf.threshold support in ClumpingThresholding, also provide eaf.
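
A minimal sketch of the normalization step follows. The raw column names on the left of column.map (CHR, BP, A1, and so on) are hypothetical and stand for whatever your source returns. The target names and the required set come from the lists above.

```r
# Hypothetical raw column names mapped to the column names listed above.
column.map <- c(CHR = "chr", BP = "position", A1 = "ea", A2 = "nea",
                BETA = "beta", P = "pval", SE = "se", N = "n", FRQ = "eaf")
required <- c("chr", "position", "ea", "nea", "beta", "pval")

normalize_to_polygenius_columns <- function(raw) {
  # Keep only the columns we know how to map, then rename them.
  known <- intersect(names(raw), names(column.map))
  out <- raw[, known, drop = FALSE]
  names(out) <- unname(column.map[known])

  # Fail early if any required column is absent.
  missing <- setdiff(required, names(out))
  if (length(missing) > 0) {
    stop("source is missing required columns: ", paste(missing, collapse = ", "))
  }
  out
}

raw <- data.frame(CHR = c(1, 2), BP = c(1000, 2000), A1 = c("A", "G"),
                  A2 = c("C", "T"), BETA = c(0.1, -0.2), P = c(1e-8, 0.03),
                  SE = c(0.01, 0.05), stringsAsFactors = FALSE)
normalized <- normalize_to_polygenius_columns(raw)
```

Failing inside the source rule, rather than later in an algorithm rule, keeps the error close to the source that caused it.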

Step 4: register the rule

Add the new rule to rules.sources so the core runtime can discover it.
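
The sketch below shows the idea behind registration and discovery with toy rules. It assumes that rules.sources behaves like a list of rule objects and that the engine picks the first rule whose matches() returns TRUE. Both are assumptions for illustration, and make_rule(), find_rule(), and the "builtinSource" label are hypothetical.

```r
# Toy rules as plain lists; real PolyGenius rules are R6 objects inheriting from Rule.
make_rule <- function(name, type, source) {
  list(
    name = name,
    matches = function(output.spec) {
      identical(output.spec$type, type) && identical(output.spec$params$source, source)
    }
  )
}

# Stand-in registry; the real rules.sources container may be structured differently.
rules.sources <- list(make_rule("fetch.builtin", "gwas.sumstats", "builtinSource"))

# Registering the new rule amounts to appending it to the registry.
rules.sources <- c(rules.sources,
                   list(make_rule("fetch.my.source", "gwas.sumstats", "mySource")))

# First-match discovery: return the first rule that claims the requested spec.
find_rule <- function(rules, output.spec) {
  for (rule in rules) {
    if (rule$matches(output.spec)) return(rule)
  }
  NULL
}

spec <- list(type = "gwas.sumstats", params = list(source = "mySource", id = "GWAS001"))
found <- find_rule(rules.sources, spec)
```

If discovery does work on a first-match basis, a narrow matches() predicate matters: a rule that matched every gwas.sumstats spec would shadow the rules registered after it.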

Add a new PRS algorithm

New algorithms usually keep the existing final output type, polygenius.model. That means you normally add:

  1. a user-facing algorithm declaration;
  2. one or more rules that resolve that algorithm’s output;
  3. optional new intermediate resources if the method needs them.

Step 1: add a user-facing algorithm declaration

generate.algorithm.MyMethod <- function(reference.panel, pval = 1, ncores = 1) {
  ResourceSpecSet(
    resources.factory$generate.algorithm(
      name = "MyMethod",
      reference.panel = reference.panel,
      pval = pval,
      ncores = ncores
    )
  )
}

Once this exists, generate$models() will automatically cross it with the requested GWAS sources and create the corresponding polygenius.model requests.
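
The crossing can be pictured as a Cartesian product of sources and algorithms. The sketch below is not how generate$models() is implemented; it only reproduces the shape of the result with base R, and "OtherMethod" is a hypothetical second algorithm.

```r
# Two requested GWAS sources and two declared algorithms.
sources <- data.frame(gwas.source = c("mySource", "mySource"),
                      gwas.id = c("GWAS001", "GWAS002"),
                      stringsAsFactors = FALSE)
algorithms <- data.frame(algorithm = c("MyMethod", "OtherMethod"),
                         stringsAsFactors = FALSE)

# merge() with by = NULL returns the Cartesian product of the two tables:
# one row per (source, algorithm) pair, i.e. one polygenius.model request each.
requests <- merge(sources, algorithms, by = NULL)
```

Each row carries the gwas.source, gwas.id, and algorithm values that an algorithm rule later reads from output.spec$params.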

Step 2: add a rule for the algorithm output

RunMyMethodRule <- R6::R6Class(
  "RunMyMethodRule",
  inherit = Rule,
  public = list(
    initialize = function() {
      super$initialize("run.my.method")
    },
    matches = function(output.spec) {
      identical(output.spec$type, resources.type$polygenius.model) &&
        identical(output.spec$params$algorithm, "MyMethod")
    },
    inputs = function(output.spec) {
      list(
        gwas = resources.factory$gwas.sumstats(
          source = output.spec$params$gwas.source,
          id = output.spec$params$gwas.id,
          .meta = list(pval.max = output.spec$params$pval)
        )
      )
    },
    requirements = function(output.spec, inputs) {
      # %||% is part of base R from 4.4.0; on older versions use rlang's or define it locally.
      list(cores = output.spec$params$ncores %||% 1, memory = 8)
    },
    run = function(output.spec, inputs, logger) {
      gwas <- inputs$gwas$data
      model.variants <- fit_my_method(gwas)

      list(
        data = PolyGeniusModel(
          variants = model.variants,
          name = gwas$name,
          build = gwas$build,
          gwas = gwas$gwas,
          generation = list(algorithm = "MyMethod")
        ),
        meta = list(build = gwas$build, algorithm = "MyMethod"),
        logs = list(log.entry(self$name, "completed"))
      )
    }
  )
)

Reusing other rules from an algorithm

An algorithm does not call other rules directly. Instead, it depends on the resources those rules know how to produce.

For example, an LD-based method can ask for:

  • gwas.sumstats
  • ld.matrix
  • ld.matched
  • reference.panel.bigsnp
  • clumped.variants

That is exactly how the built-in LDpred2 and lassosum2 paths work: they request ld.matrix and ld.matched, and the engine resolves those resources through the existing support rules before the final algorithm rule runs.
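
The resolution order can be sketched with a toy dependency table. The edges below are illustrative and are not PolyGenius's actual dependency graph. resolve_order() is a hypothetical helper showing the general idea: inputs are produced before the resource that needs them, and each resource is produced once.

```r
# Toy dependency table: resource type -> resource types it needs as inputs.
deps <- list(
  polygenius.model = c("gwas.sumstats", "ld.matrix", "ld.matched"),
  ld.matched = c("gwas.sumstats", "ld.matrix"),
  ld.matrix = c("reference.panel.bigsnp"),
  gwas.sumstats = character(0),
  reference.panel.bigsnp = character(0)
)

# Depth-first resolution: visit every input first, then append the target.
# Resources already in `done` are skipped, so shared inputs run only once.
resolve_order <- function(target, deps, done = character(0)) {
  if (target %in% done) return(done)
  for (input in deps[[target]]) {
    done <- resolve_order(input, deps, done)
  }
  c(done, target)
}

order.run <- resolve_order("polygenius.model", deps)
```

In this toy graph the final algorithm rule runs last, and gwas.sumstats is resolved once even though two resources depend on it.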

If your algorithm needs build-aware dependencies, use bind() so the exact input resources can be derived after upstream metadata becomes available.
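
The two-phase idea can be sketched as follows. This chapter does not show the real bind() signature, so declare_inputs() and bind_inputs() are hypothetical stand-ins, and the ld.matrix parameters (reference.panel, build) and the "panelA" label are assumptions made for illustration.

```r
# Phase 1: inputs that can be declared from the output spec alone.
declare_inputs <- function(output.spec) {
  list(gwas = list(type = "gwas.sumstats",
                   params = list(source = output.spec$params$gwas.source,
                                 id = output.spec$params$gwas.id)))
}

# Phase 2: inputs that depend on metadata of an already resolved upstream resource.
bind_inputs <- function(output.spec, resolved) {
  build <- resolved$gwas$meta$build
  list(ld = list(type = "ld.matrix",
                 params = list(reference.panel = output.spec$params$reference.panel,
                               build = build)))
}

output.spec <- list(type = "polygenius.model",
                    params = list(algorithm = "MyMethod", gwas.source = "mySource",
                                  gwas.id = "GWAS001", reference.panel = "panelA"))

first <- declare_inputs(output.spec)
# Pretend the engine has resolved the GWAS and learned its build.
resolved <- list(gwas = list(spec = first$gwas, meta = list(build = "GRCh38")))
second <- bind_inputs(output.spec, resolved)
```

The LD request is created only once the GWAS build is known, so the rule never has to guess which build to ask for.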

Step 3: register the rule

Add the new rule to rules.algorithms.

Add a new resource type or intermediate rule

Some extensions do need a new cached intermediate artifact. In that case, add a new resource type and then add rules that produce and consume it.

Step 1: add the resource type and factory constructor

resources.type$my.intermediate <- "my.intermediate"

resources.factory$my.intermediate <- function(gwas.source, gwas.id, setting, .meta = NULL,
                                              .serializer = resources.serializer$rds) {
  ResourceSpec$new(
    resources.type$my.intermediate,
    gwas.source = gwas.source,
    gwas.id = gwas.id,
    setting = setting,
    .meta = .meta,
    .serializer = .serializer
  )
}

Step 2: choose persistence

Use:

  • resources.serializer$rds for general R objects;
  • resources.serializer$data.fst for large tabular or model-like payloads that benefit from the faster serializer.
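
The trade-off can be tried outside PolyGenius with the underlying file formats. The sketch assumes that resources.serializer$data.fst is backed by the fst package. That is inferred from the name and is not confirmed here. The fst branch runs only if the package is installed.

```r
payload <- data.frame(chr = c("1", "2"), position = c(1000L, 2000L),
                      beta = c(0.1, -0.2), stringsAsFactors = FALSE)

# RDS stores any R object, including nested lists.
rds.path <- tempfile(fileext = ".rds")
saveRDS(list(table = payload, note = "any R object"), rds.path)
rds.back <- readRDS(rds.path)

# fst stores data frames only, in a column-oriented file.
if (requireNamespace("fst", quietly = TRUE)) {
  fst.path <- tempfile(fileext = ".fst")
  fst::write_fst(payload, fst.path)
  fst.back <- fst::read_fst(fst.path)
}
```

A payload that is not a data frame, such as the list saved above, needs the RDS route.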

Step 3: add a rule that produces the resource

Your new rule can now match on resources.type$my.intermediate in matches(), declare its inputs, and return the new artifact from run(). Downstream algorithm rules can then depend on that resource exactly as the built-in rules depend on ld.matrix or clumped.variants.

Good built-in examples to copy

The easiest way to extend PolyGenius is to copy the closest built-in pattern. For an LD-based method, for example, start from the built-in LDpred2 or lassosum2 path.

For the scheduler and cache behavior that these extensions plug into, see Chapter 17.