execution <- core$execution.engine$resolve.async(
  outputs = outputs,
  inputs = inputs,
  action = "generate$models"
)
core$execution.engine$await(execution)
execute.summary(execution)
execute.graph(execution)

PolyGenius Execution Engine
The PolyGenius generation environment is designed to let users declare the GWAS sources and algorithms they want, while the system handles the operational details behind the scenes. That convenience comes with a complication: generating even a single model can require a chain of intermediate resources, and some of those requirements are only known after execution has already started.
For example, consider the seemingly simple task of generating one clumping-and-thresholding PGS model from a local GWAS summary statistics file. The user must also provide a reference panel for clumping. From the user’s perspective, this is a straightforward request. Internally, however, the request must be translated into the concrete commands and intermediate artifacts needed to produce the final model:
- Load the GWAS summary statistics file and its metadata, including genome build
- Load the requested reference panel
- Perform variant clumping with respect to that panel and apply p-value thresholding
In practice, additional steps may be required. If the reference panel is not available locally, the engine must first download it. If the genome build of the panel does not match the build of the GWAS summary statistics, the panel must first be lifted over. Only after these dynamically determined intermediate steps have been resolved can the generation process continue.
More generally, a user’s declaration in PolyGenius is not itself an execution plan. The engine must translate “I want these models from these GWAS sources with these algorithms” into the set of resources that are needed and the rules that can provide them. In workflow terms, that dependency structure is a computation graph. PolyGenius builds and executes that graph dynamically.
Core concept: resources and rules
PolyGenius treats model generation as a resource-resolution problem: you ask for outputs such as a polygenius.model resource, and the engine figures out which intermediate artifacts must exist and which rules can produce them. The two building blocks are:
- ResourceSpec: a declarative description of a resource, which can represent any type of data or information. Its identity is defined by its type and its params. Runtime hints that influence execution but do not change resource identity live in spec$meta. These include details such as JWT tokens, retry settings, or precomputed-path hints.
- Rule: an object that says whether it can produce a requested output, which inputs it needs, what resources it requires, and how to run.
The execution engine does not restrict the possible resource types, but it ships with a built-in set that covers the current generation workflows:
- gwas.request: a user-declared GWAS request
- gwas.sumstats: normalized GWAS summary statistics
- reference.panel: a usable reference panel in a requested build and format
- reference.panel.bigsnp: a bigsnpr-converted panel for LD-based methods
- clumped.variants: GWAS variants after panel clumping
- ld.matrix: a sparse LD matrix and its panel map
- ld.matched: GWAS summary statistics matched to the LD map
- polygenius.model: the final PRS model output
The constructors live in resources.factory, and the type registry lives in resources.type.
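As a rough illustration of the two concepts, the sketch below models a resource spec as a plain list and a rule as a list of functions. The names make.spec and clump.rule, the parameter names, and the field names are hypothetical and are not the constructors in resources.factory; only the resource type names and the 1 core / 4 GB figure for clumping come from this page.

```r
# Hypothetical stand-in for a ResourceSpec: identity is type + params,
# while meta carries runtime-only hints.
make.spec <- function(type, params = list(), meta = list()) {
  list(type = type, params = params, meta = meta)
}

# Hypothetical rule that produces clumped.variants from normalized
# summary statistics and a reference panel.
clump.rule <- list(
  # Can this rule produce the requested output?
  can.produce = function(output) identical(output$type, "clumped.variants"),
  # Which input resources does it need?
  inputs = function(output) {
    list(
      make.spec("gwas.sumstats", list(source = output$params$source)),
      make.spec("reference.panel",
                list(panel.name = output$params$panel.name,
                     build = output$params$build))
    )
  },
  # What does it cost to run?
  requirements = function(output) list(cores = 1L, memory.gb = 4),
  # How is it run? (placeholder body)
  run = function(output, inputs) {
    paste("clumped", output$params$source, "against", output$params$panel.name)
  }
)

target <- make.spec(
  "clumped.variants",
  params = list(source = "local.gwas", panel.name = "example.panel",
                build = "GRCh37"),
  meta = list(retries = 2L)  # runtime hint, not part of identity
)

clump.rule$can.produce(target)
needed <- clump.rule$inputs(target)
vapply(needed, function(s) s$type, character(1))
```

The last call lists the input types the rule asks for, which is the information the engine needs to grow the graph by one level.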
Translating a user request into a computation graph
Discovering and executing the graph on the fly
PolyGenius does not precompute one large static graph before starting. Instead, it grows the graph while resolving the requested outputs:
- generate$models() creates the requested polygenius.model outputs from the declared source x algorithm combinations.
- The engine queues those outputs and marks any user-supplied inputs as already available.
- For each queued output, it first checks whether a compatible cached resource already exists.
- If the resource is not cached, it finds the matching rule and asks that rule for its inputs.
- The rule's bind() method can refine those inputs using metadata discovered during resolution. This is how build-aware dependencies are discovered late, for example after the GWAS build becomes known.
- Missing inputs are queued as new nodes, and the original output becomes waiting.
- Once all inputs are available, the rule declares its cores and memory needs and becomes ready for scheduling.
- Completed tasks are written to the resource store, logs are materialized, and downstream tasks are reconsidered immediately.
This means the graph is discovered at the same time that it is being executed. That is important because the exact upstream path may differ between two requests that look similar at the surface: one GWAS may already be cached, another may require download, and a third may require build-specific preprocessing before an algorithm can run.
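The resolution steps described in this section can be sketched as a small sequential loop. This is a toy model under stated assumptions, not the engine's code: the rule table, the node names, and the resolve() function are hypothetical, scheduling and parallelism are left out, and the panel's native build is assumed to be GRCh37. It shows how an input (a liftover chain) only appears once the GWAS build is known.

```r
# Toy rule table. Each rule computes its inputs from what is known so far,
# which is how a late, build-dependent input can appear.
rules <- list(
  model = list(
    inputs = function(known) c("sumstats", "panel"),
    run    = function(known) "model"
  ),
  sumstats = list(
    inputs = function(known) character(0),
    run    = function(known) "GRCh38"  # pretend the build is discovered here
  ),
  panel = list(
    inputs = function(known) {
      if (is.null(known$sumstats)) return("sumstats")  # need the build first
      if (!identical(known$sumstats, "GRCh37")) "chain" else character(0)
    },
    run    = function(known) "panel"
  ),
  chain = list(
    inputs = function(known) character(0),
    run    = function(known) "chain"
  )
)

# Resolve one target, growing the graph as inputs are discovered.
# Note: this toy loop does not detect dependency cycles.
resolve <- function(target, rules, store = list()) {
  queue <- target
  order <- character(0)
  while (length(queue) > 0) {
    node <- queue[[1]]
    if (!is.null(store[[node]])) {   # cached or user-supplied
      queue <- queue[-1]
      next
    }
    missing <- setdiff(rules[[node]]$inputs(store), names(store))
    if (length(missing) > 0) {       # node waits behind its missing inputs
      queue <- c(missing, queue)
      next
    }
    store[[node]] <- rules[[node]]$run(store)
    order <- c(order, node)
    queue <- queue[-1]
  }
  list(store = store, order = order)
}

res  <- resolve("model", rules)
res$order   # "sumstats" "chain" "panel" "model"

# With GRCh37 summary statistics supplied up front, no chain is needed.
res2 <- resolve("model", rules, store = list(sumstats = "GRCh37"))
res2$order  # "panel" "model"
```

The two calls request the same output but walk different upstream paths, which is the behavior a static, precomputed graph would have trouble expressing.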
Caching intermediate resources
Caching is part of the graph-resolution model rather than an afterthought. Every resolved resource is stored in the ResourceStore, and each new request is checked against that store before new work is scheduled.
A resource’s cache identity is based on its type + params, not on runtime-only metadata. This is why the same ld.matrix resource can be reused across runs that share panel.name, build, ld.size, and ld.thr, even if the current run uses a different ncores value or a different precomputed-path hint. Completed tasks also materialize their own rule log plus upstream input logs, which keeps cached resources auditable.
Because the cache operates at the resource level, PolyGenius can reuse not only final models but also shared intermediate artifacts such as normalized GWAS summary statistics, converted reference panels, or LD matrices.
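A minimal sketch of identity-based caching, assuming a key built from type and params only. The functions cache.key and get.or.build, the in-memory environment, and the parameter values are hypothetical; the actual ResourceStore and its key format are not shown on this page.

```r
# Build a key from type + params. Params are sorted by name so that
# argument order does not change the identity.
cache.key <- function(type, params) {
  params <- params[order(names(params))]
  paste0(type, "|",
         paste(names(params), unlist(params), sep = "=", collapse = ";"))
}

store <- new.env()  # stand-in for a resource store

# Return the cached value if present, otherwise build and store it.
# meta is passed to the builder but never enters the key.
get.or.build <- function(type, params, meta, build.fn) {
  key <- cache.key(type, params)
  if (!exists(key, envir = store, inherits = FALSE)) {
    assign(key, build.fn(params, meta), envir = store)
  }
  get(key, envir = store, inherits = FALSE)
}

builds <- 0
build.ld <- function(params, meta) {
  builds <<- builds + 1  # count how often real work happens
  paste("LD for", params$panel.name)
}

# Made-up parameter values, for illustration only.
p <- list(panel.name = "example.panel", build = "GRCh37",
          ld.size = 3, ld.thr = 0.01)

a <- get.or.build("ld.matrix", p,      meta = list(ncores = 2), build.ld)
b <- get.or.build("ld.matrix", rev(p), meta = list(ncores = 8), build.ld)
builds  # 1: the second request is served from the cache
```

The second request differs only in ncores and in the order of its params, so it maps to the same key and no new work is done.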
Why PolyGenius uses its own orchestrator
Tools such as Snakemake and targets are excellent when the workflow graph is known up front or can be described as a mostly static pipeline. PolyGenius has a different requirement: users declare desired outputs, while the engine must discover the exact dependency path at runtime.
That is necessary because:
- the correct upstream dependencies can depend on metadata discovered only during execution, such as genome build;
- the same requested output may already be satisfied by a cached resource or a user-provided input;
- different algorithms may reuse intermediate resources created for other algorithms in the same run or in earlier runs;
- PolyGenius needs scheduling decisions to operate on resource-level requirements rather than only on a prewritten job DAG.
In other words, PolyGenius is not just launching predefined steps. It is resolving which steps are needed, which of them are already satisfied, and which remaining rules should be scheduled next.
This design has two additional benefits:
- Because intermediate components are explicit resources, they can be cached as well. That improves reuse of shared intermediate artifacts and reduces overall computation.
- Because the orchestrator is custom, each rule can declare task-specific resource requirements. That improves parallelism compared with treating every job as if it needed the same resources.
For example, suppose a run has access to 10 cores and an LDpred2 task declares that it needs 6. The remaining 4 cores do not need to sit idle: the scheduler can use them to fetch additional GWAS summary statistics from IEU OpenGWAS while the expensive LD matrix work continues in parallel.
Scheduler mechanics
The scheduler is the part of the engine that turns a ready rule into actual work on workers.
Job allocation
- The backend uses a fixed future multisession worker pool.
- With max.cores = NULL, PolyGenius auto-detects available cores and uses up to available - 1, with a minimum of one core.
- Ready tasks are only submitted when their declared requirements fit the currently available core and memory budget.
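The core budget rule can be written down in a few lines. The function resolve.max.cores is a hypothetical helper, not part of PolyGenius, and it assumes that an explicit max.cores value is used as given.

```r
# Hypothetical helper: derive the core budget from max.cores.
resolve.max.cores <- function(max.cores = NULL,
                              available = parallel::detectCores()) {
  # detectCores() can return NA on some platforms; fall back to one core.
  if (is.na(available)) available <- 1L
  if (is.null(max.cores)) {
    # Leave one core free, but never go below one.
    return(max(1L, as.integer(available) - 1L))
  }
  as.integer(max.cores)
}

resolve.max.cores(available = 8L)                 # 7
resolve.max.cores(available = 1L)                 # 1
resolve.max.cores(max.cores = 4, available = 8L)  # 4
```

With the future package, a multisession pool of that size is typically created with future::plan(future::multisession, workers = n); how PolyGenius configures its pool internally is not shown here.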
Job resource needs
Rules declare their own needs through requirements(). The scheduler uses those declarations directly rather than guessing from task type.
Typical built-in examples are:
- source fetch rules: 1 core and 1-2 GB memory
- clumping and matching rules: 1 core and 4 GB memory
- LD matrix construction and LD-based model fitting: up to the rule's requested ncores and 8 GB memory
If a ready task does not fit the current budget, it is deferred and re-queued instead of being dropped.
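A sketch of budget-based admission, assuming a single pass over the ready tasks in priority order. The admit() function, the task ids, and the 32 GB memory budget are hypothetical; the core and memory figures per task follow the built-in examples listed in this section.

```r
# Admit ready tasks while they fit the remaining core and memory budget.
# Tasks that do not fit are deferred, not dropped.
admit <- function(ready, free.cores, free.mem.gb) {
  started  <- character(0)
  deferred <- character(0)
  for (i in seq_len(nrow(ready))) {
    task <- ready[i, ]
    if (task$cores <= free.cores && task$mem.gb <= free.mem.gb) {
      free.cores  <- free.cores - task$cores
      free.mem.gb <- free.mem.gb - task$mem.gb
      started <- c(started, task$id)
    } else {
      deferred <- c(deferred, task$id)
    }
  }
  list(started = started, deferred = deferred, free.cores = free.cores)
}

ready <- data.frame(
  id     = c("ldpred2", "fetch.a", "fetch.b", "ld.matrix"),
  cores  = c(6, 1, 1, 6),
  mem.gb = c(8, 2, 2, 8),
  stringsAsFactors = FALSE
)

out <- admit(ready, free.cores = 10, free.mem.gb = 32)
out$started   # "ldpred2" "fetch.a" "fetch.b"
out$deferred  # "ld.matrix"
```

With 10 cores, the 6-core LDpred2 task and two 1-core fetches run together, while the second 6-core task waits for the next scheduling pass.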
Priority
When several tasks are ready at once, PolyGenius uses a greedy priority based on critical dependency depth. In practice, that means the scheduler prefers work that is blocking more of the graph.
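The exact definition of critical dependency depth is not given here. The sketch below assumes it means the length of the longest chain of downstream dependents; the graph, the node names, and the depth() function are hypothetical.

```r
# dependents[[x]] lists the nodes that directly consume x.
dependents <- list(
  sumstats      = c("clumped", "matched"),
  clumped       = "ct.model",
  ld.matrix     = "matched",
  matched       = "ldpred2.model",
  ct.model      = character(0),
  ldpred2.model = character(0)
)

# Longest chain of downstream work that waits on this node.
depth <- function(node, dependents) {
  kids <- dependents[[node]]
  if (length(kids) == 0) return(0)
  1 + max(vapply(kids, depth, numeric(1), dependents = dependents))
}

# Order the ready tasks so that the deepest blocker goes first.
ready <- c("clumped", "ld.matrix")
d <- vapply(ready, depth, numeric(1), dependents = dependents)
ready[order(-d)]  # "ld.matrix" "clumped"
```

Here ld.matrix is preferred because two further steps wait on it, while clumped only blocks one.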
State tracing
Every discovered node is tracked as one of:
- queued
- waiting
- running
- blocked
- failed
- completed
The live console status groups those states by resource type, so you can see whether a run is currently spending time on GWAS retrieval, LD preparation, or final model generation.
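A grouping of this kind can be approximated with a cross-tabulation. The node table below is made up for illustration, and the real console output format is not reproduced here.

```r
# Hypothetical snapshot of discovered nodes.
nodes <- data.frame(
  type  = c("gwas.sumstats", "gwas.sumstats", "ld.matrix",
            "polygenius.model", "polygenius.model"),
  state = c("completed", "running", "running", "waiting", "waiting"),
  stringsAsFactors = FALSE
)

# Fix the state order so that empty states still get a column.
states <- c("queued", "waiting", "running", "blocked", "failed", "completed")

# Rows are resource types, columns are states, cells are node counts.
status <- table(nodes$type, factor(nodes$state, levels = states))
status
```

Reading the table by row shows where the run is spending time, for example one gwas.sumstats node still running while both models wait.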
Inspecting executions
The engine exposes both a low-level R6 interface and exported helper functions for inspecting a run.
Useful reference pages are ExecutionEngine and execute() / execute.summary() / execute.graph().
Built-in rule map
Source rules
- LoadLocalGWASRule powers generate$sources$local() by turning a local gwas.request into normalized gwas.sumstats.
- FetchOpenGWASRule powers generate$sources$opengwas() and chooses tophits or VCF/files mode from the requested p-value threshold.
- FetchGWASCatalogRule powers generate$sources$gwascatalog() and resolves harmonized GWAS Catalog studies into normalized gwas.sumstats.
Reference-panel and LD support rules
- DownloadLiftoverChainRule materializes the chain file needed when a reference panel must change genome build.
- DownloadReferencePanelRule retrieves a reference panel when no liftover or format conversion is needed.
- LiftoverReferencePanelRule changes the panel build when the requested build differs from the source build.
- ConvertReferencePanelRule changes panel format when a downstream step needs bfile or pfile.
- BuildReferencePanelBigSnpRule converts a cached panel into the bigsnpr representation used by LD-based methods.
- BuildLDMatrixRule builds or loads the sparse LD matrix consumed by LDpred2 and lassosum2.
- MatchLDVariantsRule matches GWAS summary statistics to the LD map and is reused by both LDpred2 and lassosum2.
Algorithm output rules
- ClumpVariantsRule plus ThresholdClumpedRule implement ClumpingThresholding.
- RunLdPred2Rule implements LDpred2 once the matched GWAS and LD matrix are ready.
- RunLassosum2Rule implements lassosum2 on top of the same LD pipeline.
- RunCojoRule is the current COJO stub.
Those rule groupings also mirror the source and algorithm declarations described in Chapter 4 and Chapter 5.
Extending the engine
The scheduler itself usually does not need to change when you add a new source or algorithm. Most extensions happen by adding new rules, and only some extensions require new resource types. The concrete extension patterns are covered in Chapter 18.