PolyGenius Execution Engine

The PolyGenius generation environment is designed to let users declare the GWAS sources and algorithms they want, while the system handles the operational details behind the scenes. That convenience comes with a complication: generating even a single model can require a chain of intermediate resources, and some of those requirements are only known after execution has already started.

For example, consider the seemingly simple task of generating one clumping-and-thresholding PGS model from a local GWAS summary statistics file. The user must also provide a reference panel for clumping. From the user’s perspective, this is a straightforward request. Internally, however, the request must be translated into the concrete commands and intermediate artifacts needed to produce the final model:

  1. Load the GWAS summary statistics file and its metadata, including genome build
  2. Load the requested reference panel
  3. Perform variant clumping with respect to that panel and apply p-value thresholding

In practice, additional steps may be required. If the reference panel is not available locally, the engine must first download it. If the genome build of the panel does not match the build of the GWAS summary statistics, the panel must first be lifted over. Only after these dynamically determined intermediate steps have been resolved can the generation process continue.

More generally, a user’s declaration in PolyGenius is not itself an execution plan. The engine must translate “I want these models from these GWAS sources with these algorithms” into the set of resources that are needed and the rules that can provide them. In workflow terms, that dependency structure is a computation graph. PolyGenius builds and executes that graph dynamically.

Core concept: resources and rules

PolyGenius treats model generation as a resource-resolution problem: you ask for outputs such as a polygenius.model resource, and the engine figures out which intermediate artifacts must exist and which rules can produce them. Resources and rules are described by two objects:

  • ResourceSpec: a declarative description of a resource, which can represent any type of data or information. Its identity is defined by its type and its params. Runtime hints that influence execution but do not change resource identity live in spec$meta. These include details such as JWTs, retry settings, or precomputed-path hints.
  • Rule: an object that says whether it can produce a requested output, which input resources it needs, what compute requirements (cores and memory) it declares, and how to run.
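
As a rough illustration of these two objects, the sketch below models a spec as a plain list and a rule as a list of functions. The names new_spec and clump_rule, the function names inside the rule, and the panel name "1kg_eur" are hypothetical and are not the PolyGenius API; only the fields type, params, and meta follow the description above.

```r
# Hypothetical stand-in for a ResourceSpec:
# identity is type + params, while meta only carries runtime hints.
new_spec <- function(type, params = list(), meta = list()) {
  list(type = type, params = params, meta = meta)
}

# Hypothetical stand-in for a Rule: four functions mirroring the four
# questions a rule answers (can I produce this, what do I need as input,
# what compute do I need, how do I run).
clump_rule <- list(
  can_produce = function(spec) identical(spec$type, "clumped.variants"),
  inputs = function(spec) list(
    new_spec("gwas.sumstats", list(source = spec$params$source)),
    new_spec("reference.panel", list(name = spec$params$panel))
  ),
  requirements = function(spec) list(cores = 1L, memory.gb = 4),
  run = function(spec, inputs) paste("clumped", spec$params$source)
)

# A requested output with an illustrative runtime hint in meta.
target <- new_spec(
  "clumped.variants",
  params = list(source = "my_gwas", panel = "1kg_eur"),
  meta = list(retries = 3L)
)

input_types <- vapply(clump_rule$inputs(target), function(s) s$type, character(1))
```

The point of the split is that the engine can ask a rule what it needs before it runs anything, which is what makes step-by-step resolution possible.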

The execution engine does not restrict the possible resource types, but it ships with a built-in set that covers the current generation workflows:

  • gwas.request: a user-declared GWAS request
  • gwas.sumstats: normalized GWAS summary statistics
  • reference.panel: a usable reference panel in a requested build and format
  • reference.panel.bigsnp: a bigsnpr-converted panel for LD-based methods
  • clumped.variants: GWAS variants after panel clumping
  • ld.matrix: a sparse LD matrix and its panel map
  • ld.matched: GWAS summary statistics matched to the LD map
  • polygenius.model: the final PGS model output

The constructors live in resources.factory, and the type registry lives in resources.type.

Translating a user request into a computation graph

Discovering and executing the graph on the fly

PolyGenius does not precompute one large static graph before starting. Instead, it grows the graph while resolving the requested outputs:

  1. generate$models() creates the requested polygenius.model outputs from the declared source × algorithm combinations.
  2. The engine queues those outputs and marks any user-supplied inputs as already available.
  3. For each queued output, it first checks whether a compatible cached resource already exists.
  4. If the resource is not cached, it finds the matching rule and asks that rule for its inputs.
  5. The rule’s bind() method can refine those inputs using metadata discovered during resolution. This is how build-aware dependencies are discovered late, for example after the GWAS build becomes known.
  6. Missing inputs are queued as new nodes, and the original output becomes waiting.
  7. Once all inputs are available, the rule declares its cores and memory needs and becomes ready for scheduling.
  8. Completed tasks are written to the resource store, logs are materialized, and downstream tasks are reconsidered immediately.
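
The resolution loop above can be sketched in a few lines of base R. This is a deliberately simplified, sequential toy and not the engine's implementation: resources are identified by plain strings, the rule table and the resolve function are hypothetical, and scheduling, states, and logs are left out. It does show the key behavior, namely that the model rule can only name its panel dependency after the GWAS build is known.

```r
# Hypothetical rule table. Each rule reports its inputs given what is already
# in the store, so dependencies can be refined late (compare bind()).
rules <- list(
  "gwas.sumstats" = list(
    inputs = function(store) character(0),
    run = function(store) list(build = "GRCh38")
  ),
  "panel.GRCh37" = list(
    inputs = function(store) character(0),
    run = function(store) "panel in GRCh37"
  ),
  "panel.GRCh38" = list(
    inputs = function(store) "panel.GRCh37",
    run = function(store) "panel lifted over to GRCh38"
  ),
  "polygenius.model" = list(
    inputs = function(store) {
      # Until the GWAS is loaded, the required panel build is unknown.
      if (is.null(store[["gwas.sumstats"]])) return("gwas.sumstats")
      c("gwas.sumstats", paste0("panel.", store[["gwas.sumstats"]]$build))
    },
    run = function(store) "model"
  )
)

resolve <- function(target, rules, store = list()) {
  queue <- target
  order <- character(0)
  while (length(queue) > 0) {
    node <- queue[1]
    if (!is.null(store[[node]])) {          # already cached or user-supplied
      queue <- queue[-1]
      next
    }
    needed <- rules[[node]]$inputs(store)
    missing <- needed[!needed %in% names(store)]
    if (length(missing) > 0) {
      queue <- c(missing, queue)            # node waits behind its inputs
      next
    }
    store[[node]] <- rules[[node]]$run(store)
    order <- c(order, node)
    queue <- queue[-1]
  }
  list(store = store, order = order)
}

result <- resolve("polygenius.model", rules)
```

Calling resolve() with a store that already contains "panel.GRCh38" skips both panel rules, which is the same shortcut the cache check in step 3 provides.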

This means the graph is discovered at the same time that it is being executed. That is important because the exact upstream path may differ between two requests that look similar at the surface: one GWAS may already be cached, another may require download, and a third may require build-specific preprocessing before an algorithm can run.

Caching intermediate resources

Caching is part of the graph-resolution model rather than an afterthought. Every resolved resource is stored in the ResourceStore, and each new request is checked against that store before new work is scheduled.

A resource’s cache identity is based on its type + params, not on runtime-only metadata. This is why the same ld.matrix resource can be reused across runs that share panel.name, build, ld.size, and ld.thr, even if the current run uses a different ncores value or a different precomputed-path hint. Completed tasks also materialize their own rule log plus upstream input logs, which keeps cached resources auditable.
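
A minimal sketch of such an identity, assuming a string key built from the type and the sorted params. The cache_key function and all parameter values are hypothetical; the actual key format used by the ResourceStore is not described here.

```r
# Hypothetical cache key: derived from type and params only; meta is ignored.
cache_key <- function(spec) {
  params <- spec$params[order(names(spec$params))]   # order-independent
  fields <- paste0(names(params), "=", vapply(params, as.character, character(1)))
  paste(c(spec$type, fields), collapse = "|")
}

ld_a <- list(
  type = "ld.matrix",
  params = list(panel.name = "1kg_eur", build = "GRCh38", ld.size = 3, ld.thr = 0.2),
  meta = list(ncores = 6L)
)

# Same params, different runtime hints: should map to the same cache entry.
ld_b <- ld_a
ld_b$meta <- list(ncores = 2L, precomputed.path = "/tmp/ld")

# Different build: a different resource, so a different cache entry.
ld_c <- ld_a
ld_c$params$build <- "GRCh37"

same_entry <- identical(cache_key(ld_a), cache_key(ld_b))
other_entry <- !identical(cache_key(ld_a), cache_key(ld_c))
```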

Because the cache operates at the resource level, PolyGenius can reuse not only final models but also shared intermediate artifacts such as normalized GWAS summary statistics, converted reference panels, or LD matrices.

Why PolyGenius uses its own orchestrator

Tools such as Snakemake and targets are excellent when the workflow graph is known up front or can be described as a mostly static pipeline. PolyGenius has a different requirement: users declare desired outputs, while the engine must discover the exact dependency path at runtime.

That is necessary because:

  • the correct upstream dependencies can depend on metadata discovered only during execution, such as genome build;
  • the same requested output may already be satisfied by a cached resource or a user-provided input;
  • different algorithms may reuse intermediate resources created for other algorithms in the same run or in earlier runs;
  • PolyGenius needs scheduling decisions to operate on resource-level requirements rather than only on a prewritten job DAG.

In other words, PolyGenius is not just launching predefined steps. It is resolving which steps are needed, which of them are already satisfied, and which remaining rules should be scheduled next.

Note: Why this matters in practice

This design has two additional benefits:

  1. Because intermediate components are explicit resources, they can be cached as well. That improves reuse of shared intermediate artifacts and reduces overall computation.
  2. Because the orchestrator is custom, each rule can declare task-specific resource requirements. That improves parallelism compared with treating every job as if it needed the same resources.

For example, suppose a run has access to 10 cores and an LDpred2 task declares that it needs 6. The remaining 4 cores do not need to sit idle: the scheduler can use them to fetch additional GWAS summary statistics from IEU OpenGWAS while the expensive LD matrix work continues in parallel.

Scheduler mechanics

The scheduler is the part of the engine that turns a ready rule into actual work on workers.

Job allocation

  • The backend uses a fixed-size pool of future multisession workers.
  • With max.cores = NULL, PolyGenius auto-detects available cores and uses up to available - 1, with a minimum of one core.
  • Ready tasks are only submitted when their declared requirements fit the currently available core and memory budget.
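
The core-budget rule can be sketched as follows. The function resolve_core_budget is hypothetical, it uses parallel::detectCores() from base R as a stand-in for whatever detection the engine performs, and it assumes that an explicit max.cores is taken as given.

```r
# Hypothetical helper: pick the core budget for a run.
# With max.cores = NULL, leave one core free but never go below one.
resolve_core_budget <- function(max.cores = NULL, detected = parallel::detectCores()) {
  if (!is.null(max.cores)) return(max(1L, as.integer(max.cores)))
  if (is.na(detected)) return(1L)   # detectCores() can return NA
  max(1L, as.integer(detected) - 1L)
}

budget_8_core_machine <- resolve_core_budget(detected = 8)
budget_1_core_machine <- resolve_core_budget(detected = 1)
budget_explicit <- resolve_core_budget(max.cores = 4)
```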

Job resource needs

Rules declare their own needs through requirements(). The scheduler uses those declarations directly rather than guessing from task type.

Typical built-in examples are:

  • source fetch rules: 1 core and 1-2 GB memory
  • clumping and matching rules: 1 core and 4 GB memory
  • LD matrix construction and LD-based model fitting: up to the rule’s requested ncores and 8 GB memory

If a ready task does not fit the current budget, it is deferred and re-queued instead of being dropped.
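
A small sketch of this admission step, assuming a ready queue that is already in priority order. The admit function, the task names, and the 32 GB memory budget are hypothetical; the core and memory figures follow the typical values listed above.

```r
# Hypothetical ready queue, already sorted by priority.
ready <- data.frame(
  task = c("ldpred2.fit", "fetch.gwas.1", "fetch.gwas.2", "ld.matrix"),
  cores = c(6L, 1L, 1L, 6L),
  memory.gb = c(8, 2, 2, 8),
  stringsAsFactors = FALSE
)

# Submit every task that fits the remaining budget; defer the rest.
admit <- function(ready, free.cores, free.memory.gb) {
  submitted <- character(0)
  deferred <- character(0)
  for (i in seq_len(nrow(ready))) {
    fits <- ready$cores[i] <= free.cores && ready$memory.gb[i] <= free.memory.gb
    if (fits) {
      free.cores <- free.cores - ready$cores[i]
      free.memory.gb <- free.memory.gb - ready$memory.gb[i]
      submitted <- c(submitted, ready$task[i])
    } else {
      deferred <- c(deferred, ready$task[i])   # kept for a later round
    }
  }
  list(submitted = submitted, deferred = deferred, free.cores = free.cores)
}

round1 <- admit(ready, free.cores = 10L, free.memory.gb = 32)
```

With a 10-core budget, the 6-core fit and both 1-core fetches are submitted together, while the second 6-core task is deferred until cores are released.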

Priority

When several tasks are ready at once, PolyGenius uses a greedy priority based on critical dependency depth. In practice, that means the scheduler prefers work that is blocking more of the graph.
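
One way to read this priority is as the length of the longest chain of dependents hanging off a node. The sketch below uses that interpretation, which is an assumption, along with a hypothetical dependency map and a hypothetical downstream_depth function; the engine's exact metric may differ.

```r
# Hypothetical graph: each entry lists the nodes that directly depend on it.
dependents <- list(
  "gwas.sumstats" = c("ld.matched"),
  "reference.panel.bigsnp" = c("ld.matrix"),
  "ld.matrix" = c("ld.matched"),
  "ld.matched" = c("polygenius.model"),
  "polygenius.model" = character(0)
)

# Length of the longest chain of dependents below a node.
downstream_depth <- function(node, dependents) {
  children <- dependents[[node]]
  if (length(children) == 0) return(0L)
  1L + max(vapply(children, downstream_depth, integer(1), dependents = dependents))
}

# Two tasks are ready at once; the one blocking the longer chain goes first.
ready <- c("gwas.sumstats", "reference.panel.bigsnp")
depth <- vapply(ready, downstream_depth, integer(1), dependents = dependents)
prioritized <- ready[order(-depth)]
```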

State tracing

Every discovered node is tracked as one of:

  • queued
  • waiting
  • running
  • blocked
  • failed
  • completed

The live console status groups those states by resource type, so you can see whether a run is currently spending time on GWAS retrieval, LD preparation, or final model generation.
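
The grouping itself is a simple cross-tabulation. The sketch below is illustrative only: the nodes data frame is made-up example data, and the real console output format is not reproduced here.

```r
# Made-up snapshot of discovered nodes and their current states.
nodes <- data.frame(
  type = c("gwas.sumstats", "gwas.sumstats", "ld.matrix",
           "polygenius.model", "polygenius.model"),
  state = c("completed", "running", "running", "waiting", "waiting"),
  stringsAsFactors = FALSE
)

# Fix the state order so that empty states still appear as columns.
states <- c("queued", "waiting", "running", "blocked", "failed", "completed")
status <- table(type = nodes$type, state = factor(nodes$state, levels = states))
print(status)
```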

Inspecting executions

The engine exposes both a low-level R6 interface and exported helper functions for inspecting a run.

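# Start resolving the requested outputs asynchronously.
execution <- core$execution.engine$resolve.async(
  outputs = outputs,
  inputs = inputs,
  action = "generate$models"
)

# Wait for the run to finish.
core$execution.engine$await(execution)

# Inspect the run with the exported helpers.
execute.summary(execution)
execute.graph(execution)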

Useful reference pages are ExecutionEngine and execute() / execute.summary() / execute.graph().

Built-in rule map

Source rules

  • LoadLocalGWASRule powers generate$sources$local() by turning a local gwas.request into normalized gwas.sumstats.
  • FetchOpenGWASRule powers generate$sources$opengwas() and chooses tophits or VCF/files mode from the requested p-value threshold.
  • FetchGWASCatalogRule powers generate$sources$gwascatalog() and resolves harmonized GWAS Catalog studies into normalized gwas.sumstats.

Reference-panel and LD support rules

Algorithm output rules

Those rule groupings also mirror the source and algorithm declarations described in Chapter 4 and Chapter 5.

Extending the engine

The scheduler itself usually does not need to change when you add a new source or algorithm. Most extensions happen by adding new rules, and only some extensions require new resource types. The concrete extension patterns are covered in Chapter 18.