DAX Processors, version 3
==========================

-----
About
-----

DAX pipelines are defined by creating YAML text files. If you are not familiar with YAML, start here: https://learnxinyminutes.com/docs/yaml/. A processor YAML file defines the Environment, Inputs, Commands, and Outputs of your pipeline. Version 3 processors have a number of new options and conveniences.

----------------
Processor Repos
----------------

There are several existing processors that can be used without modification. The processors in these repositories can also provide valuable examples.

https://github.com/VUIIS/dax_yaml_processor_examples

https://github.com/VUIIS/yaml_processors (private, internal use only)

----------------
Overview
----------------

The processor file defines how a script to run a pipeline should be created. DAX will use the processor to generate scripts to be submitted to your cluster as jobs. The script will contain the commands to download the inputs from XNAT, run the pipeline, and prepare the results to be uploaded back to XNAT (the actual uploading is performed by DAX via *dax upload*).

----------------
A Basic Example
----------------

.. code-block:: yaml

    ---
    procyamlversion: 3.0.0-dev.0  # Indicates to run as a v3 processor

    containers:  # Containers we will ref in the command section
      - name: EXAMP  # Reference by this name in command section
        path: example_v2.0.0.sif  # Name/path that is replaced in command section
        source: docker://vuiiscci/example:v2.0.0  # Not used, but good practice to set it

    requirements:  # Requirements for the cluster node, substituted into SBATCH section of job template
      walltime: 0-2  # Time to request - SLURM supports the format DAYS-HOURS
      memory: 16G

    inputs:
      vars:  # Key-values to substitute in the command, for passing static settings
        - param1: param1value
      xnat:
        attrs:  # Values to extract from xnat at the specified level of the current instance
          - varname: scanID  # Name to be used to dereference later
            object: scan  # Source, one of: project, subject, session, scan, assessor
            attr: ID  # Name of the field in xnat
            ref: scan_fmri  # From which object in inputs, referred to by name
        scans:
          - name: scan_fmri  # The name of this scan to dereference later
            types: fMRI_run*  # The scan types to match on the session in XNAT
            nifti: fmri.nii.gz  # Shortcut to download file in NIFTI resource as fmri.nii.gz
            resources:  # To get files in other resources
              - resource: EDAT  # Name of the resource
                fdest: edat.txt  # Download the file as edat.txt
                varname: edat_txt  # Reference for command string substitution
        assessors:
          - name: assr_preproc
            proctypes: preproc-fmri_v2
            resources:
              - {resource: FILTERED_DATA, fdest: filtered_data.nii.gz}

    outputs:
      - pdf: report*.pdf  # Matching file uploaded to PDF resource
      - stats: stats.txt  # Matching file uploaded to STATS resource
      - dir: PREPROC  # Matching directory (PREPROC) uploaded to PREPROC resource
      - path: inputpathname  # General purpose for other outputs
        type: DIR  # Type is FILE or DIR
        resource: RESOURCENAME  # Store it in resource RESOURCENAME

    # Available commands are 'singularity_run' and 'singularity_exec'. These include default
    # flags --contain --cleanenv, and mount points for temp space plus INPUTS and OUTPUTS
    command:
      type: singularity_run
      extraopts: []  # Appends to default options for the run command
      container: EXAMP  # Name of the container in the list above
      args: >-
        --fmri_file /INPUTS/fmri.nii.gz
        --filtered_file /INPUTS/filtered_data.nii.gz
        --param1 {param1value}
        --scan_id {scanID}
        --edat_txt /INPUTS/{edat_txt}

    description: |
      Example description that gets printed to every PDF created by this processor
      1. step 1 does something cool
      2. step 2 does this other thing

    # Specify the job template to use (examples: https://github.com/VUIIS/dax_templates/)
    job_template: job_template_v3.txt
---------------------------
Parts of the Processor YAML
---------------------------

--------------------
inputs (required)
--------------------

The **inputs** section defines the files and parameters to be prepared for the pipeline. Currently, the only subsections of inputs supported are **vars** and **xnat**.

The **vars** subsection can store parameters to be passed as pipeline options, such as a smoothing kernel size, that may be more conveniently coded here and substituted into the command arguments.

The **xnat** subsection defines the files, directories, or values that are extracted from XNAT and passed to the command. Currently, the subsections of **xnat** that are supported are **scans**, **assessors**, **attrs**, and **filters**. Each of these subsections contains an array with a specific set of fields for each item in the array.

xnat scans
---------------

Each **xnat scans** item requires a **types** field. The **types** field is used to match against the scan type attribute on XNAT. The value can be a single string or a comma-separated list. Wildcards are also supported.

The **resources** subsection of each xnat scan should contain a list of resources to download from the matched scan. **ftype** specifies what type to download from the resource, either *FILE*, *DIR*, or *DIRJ*. *FILE* will download individual files from the resource. *DIR* will download the whole directory from the resource with the hierarchy maintained. *DIRJ* will also download the directory but strips extraneous intermediate directories from the produced path, as implemented by the *-j* flag of unzip.

The **varname** field defines tags to be replaced in the **command** string template (see below). The optional **fmatch** field defines a regular expression to apply to filter the list of filenames in the resource. **fmulti** affects how inputs are handled when there are multiple matching files in a resource. By default, this situation causes an exception, but if **fmulti** is set to *any1*, a single (arbitrary) file is selected from the matching files instead.

By default, any scan that matches will be included as an available input. Several optional settings affect this, as shown in the sketch after this list:

- If **needs_qc** is *True* and **require_usable** is *False* or not specified, assessors that would have a scan as an input will be created, but will not run if the scan is marked *unusable*.
- If **needs_qc** is *True* and **require_usable** is also *True*, assessors that would have a scan as an input will be created, but will not run unless the scan is marked *usable*.
- If **skip_unusable** is *True*, assessors that would have an *unusable* scan as an input will not even be created.
- **keep_multis** may be *all* (the default), *first*, *last*, or an index 1,2,3,... This applies when there are multiple scans in the session that match as possible inputs. Normally all matching scans are used as inputs, multiplying assessors as needed. When *first* is specified, only the first matching scan will be used as an input, reducing the number of assessors built by a factor of the number of matching scans. "First" is defined as alphabetical order by scan ID, cast to lowercase. The exact scan type is not considered; only whether there is a match with the **types** specified.
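For illustration, here is a hedged sketch of a scan input that combines these optional settings; the scan type, resource name, file pattern, and variable names are hypothetical, not values required by DAX.

.. code-block:: yaml

    xnat:
      scans:
        - name: scan_t1                # hypothetical name used to dereference this scan
          types: MPRAGE*,T1W*          # single string or comma-separated list, wildcards allowed
          needs_qc: True               # create the assessor, but do not run if the scan is unusable
          skip_unusable: True          # do not even create the assessor for an unusable scan
          keep_multis: first           # if several scans match, use only the first by scan ID
          resources:
            - resource: NIFTI          # hypothetical resource name
              ftype: FILE              # FILE, DIR, or DIRJ
              fmatch: '.*\.nii\.gz'    # regular expression filter on filenames in the resource
              fmulti: any1             # if several files match, select a single one
              fdest: t1.nii.gz         # copy the downloaded file to /INPUTS/t1.nii.gz
              varname: t1_nii          # tag substituted into the command template as {t1_nii}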
xnat assessors
---------------

Each xnat assessor item requires a **proctype** field. The **proctype** field is used to match against the assessor proctype attribute on XNAT. The value can be a single string or a comma-separated list. Wildcards are also supported.

Any assessor that matches **proctype** will be included as a possible input. However, if **needs_qc** is set to *True*, input assessors with a qcstatus of "Needs QA", "Bad", "Failed", "Poor", or "Do Not Run" will cause the new assessor not to run.

The **resources** subsection of each xnat assessor should contain a list of resources to download from the matched assessor. The **ftype** specifies what type to download from the resource, either *FILE*, *DIR*, or *DIRJ*. *FILE* will download individual files from the resource. *DIR* will download the whole directory from the resource with the hierarchy maintained. *DIRJ* will also download the directory but strips extraneous intermediate directories from the produced path, as implemented by the *-j* flag of unzip.

The **varname** field defines the tag to be replaced in the **command** string template (see below). Optional fields for a resource are **fmatch** and **fdest**. **fmatch** defines a regular expression to apply to filter the list of filenames in the resource. The inputs for some containers are expected to be in specific locations with specific filenames. This is accomplished using the **fdest** field: the file or directory gets copied to /INPUTS and renamed to the name specified in **fdest**.
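As another hedged sketch, the assessor input below matches a FreeSurfer-style proctype and pulls a single file from one of its resources into /INPUTS; the proctype, resource name, and file names are hypothetical.

.. code-block:: yaml

    xnat:
      assessors:
        - name: assr_fs               # hypothetical name used to dereference this assessor
          proctypes: freesurfer_v7*   # proctype(s) to match, wildcards allowed
          needs_qc: True              # do not run if the matched assessor failed QC
          resources:
            - resource: DATA          # hypothetical resource on the matched assessor
              ftype: FILE
              fmatch: aseg.stats      # filter the filenames in the resource
              fdest: aseg.stats       # copied to /INPUTS/aseg.stats
              varname: aseg_stats     # substituted into the command as {aseg_stats}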
xnat attrs
---------------

You can evaluate attributes at the subject, session, or scan level. Any fields that are accessible via the XNAT API can be queried. Each **attrs** item should contain a **varname**, **object**, and **attr**. **varname** specifies the tag to be replaced in the **command** string template. **object** is the XNAT object type to query and can be either *subject*, *session*, or *scan*. **attr** is the XNAT field to query. If the object type is *scan*, then a scan name from the xnat scans section must be included with the **ref** field. For example:

.. code-block:: yaml

    attrs:
      - varname: project
        object: session
        attr: project

    # Or equivalently
    attrs:
      - {varname: project, object: assessor, attr: project}

This will extract the value of the project attribute from the assessor object and replace {project} in the command template.

xnat filters
------------------

**filters** allows you to filter a subset of the cartesian product of the matched scans and assessors. Currently, the only filter implemented is a match filter. It will only create the assessors where the specified list of inputs match. This is used when you want to link a set of assessors that all use the same initial scan as input. For example:

.. code-block:: yaml

    filters:
      - type: match
        inputs: scan_t1,assr_freesurfer/scan_t1

This will tell DAX to only run this pipeline where the value for scan_t1 and assr_freesurfer/scan_t1 are the same scan.

--------------------
outputs
--------------------

The **outputs** section defines a list of files or directories to be uploaded to XNAT upon completion of the pipeline. Each output item must contain the fields **path**, **type**, and **resource**. The **path** value contains the local relative path of the file or directory to be uploaded. The **type** of the path should be either *FILE* or *DIR*. The **resource** is the name of the resource on the assessor created on XNAT where the output is to be uploaded.

For every processor, a *PDF* output with **resource** named PDF is required and must be of type *FILE*. *PDF* and *STATS* outputs, as well as *DIR* type outputs, have shortcuts as shown in the example.

--------------------
command
--------------------

The **command** field defines a string template that is formatted using the values from **inputs**. Each tag specified inside curly braces ("{}") corresponds to a field in the **vars** input section, to a **varname** field from a resource on an input, or to a **varname** in the xnat attrs section.

See the example for explanations of the other fields.

--------------------
jobtemplate
--------------------

The **jobtemplate** is a text file that contains a template to create a batch job script.

-------------------
Versioning
-------------------

Processor name and version are parsed from the processor file name, which follows the format <name>_v<version>.yaml. The name plus the major version, <name>_v<major version>, will be used as the proctype; for example, a processor file named preproc-fmri_v2.0.0.yaml produces assessors with proctype preproc-fmri_v2.

----------------------------
Notes on singularity options
----------------------------

The default options are *SINGULARITY_BASEOPTS* in dax/dax/processors_v3.py::

    --contain --cleanenv
    --home $JOBDIR
    --bind $INDIR:/INPUTS
    --bind $OUTDIR:/OUTPUTS
    --bind $JOBDIR:/tmp
    --bind $JOBDIR:/dev/shm

$JOBDIR, $INDIR, and $OUTDIR are available at run time and refer to locations on the filesystem of the node where the job is running.

Singularity has default binds that differ between installations. --contain disables these to prevent cross-talk with the host filesystem, and --cleanenv prevents cross-talk with the host environment. However, with --contain, some spiders will need to have specific temp space on the host attached, e.g. for some versions of Freesurfer, --bind ${INDIR}:/dev/shm. For compiled Matlab spiders, we need to provide --home $INDIR to avoid .mcrCache collisions in temp space when multiple spiders are running. And some cases may require ${INDIR}:/tmp or /tmp:/tmp. Thus the defaults above.

The entire singularity command is built from the run/exec type, the default options above plus any **extraopts** from the command section, the container path, and the **args** string.
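If a job needs options or host paths beyond these defaults, they can be appended through **extraopts** in the command section. Below is a minimal, hedged sketch; the extra bind path, argument, and container name are hypothetical.

.. code-block:: yaml

    command:
      type: singularity_run
      container: EXAMP                         # hypothetical name from the containers list
      extraopts:
        - --bind /data/atlases:/opt/atlases    # hypothetical host directory needed by the pipeline
      args: >-
        --atlas_dir /opt/atlases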
---------------------------
Subject-Level Processors
---------------------------

As of version 2.7, dax supports subject-level processors, in addition to session-level. Subject-level processors can include inputs across multiple sessions within the same subject.

In the processor yaml, a subject-level processor is implied by including the **sessions** level between inputs.xnat and scans/assessors. Each session item requires a **types** attribute. The types are matched against the XNAT field xnat:imageSessionData/session_type. Currently the match must be exact. To set the session type of a session, you can use dax/pyxnat:

.. code-block:: python

    xnat.select_session(PROJ, SUBJ, SESS).attrs.set('session_type', SESSTYPE)

Below is an example of a subject-level processor that will include an assessor from two different sessions of session types Baseline and Week12.

.. code-block:: yaml

    ---
    procyamlversion: 3.0.0-dev.0

    containers:
      - name: EMOSTROOP
        path: fmri_emostroop_v2.0.0.sif
        source: docker://bud42/fmri_emostroop:v2

    requirements:
      walltime: 0-2
      memory: 16G

    inputs:
      xnat:
        sessions:
          - types: Baseline
            assessors:
              - name: assr_emostroop_a
                types: fmri_emostroop_v1
                resources:
                  - resource: PREPROC
                    fmatch: swauFMRI.nii.gz
                    fdest: swauFMRIa.nii.gz
          - types: Week12
            assessors:
              - name: assr_emostroop_c
                types: fmri_emostroop_v1
                resources:
                  - resource: PREPROC
                    fmatch: swauFMRI.nii.gz
                    fdest: swauFMRIc.nii.gz

    outputs:
      - dir: PREPROC
      - dir: 1stLEVEL

    command:
      type: singularity_run
      container: EMOSTROOP
      args: BLvsWK12

The assessor will be created under the subject on XNAT, at the same level as a session. The proctype of the assessor will be derived from the filename, just like for session-level processors. The XNAT data type of the assessor, or xsiType, will be proc:subjGenProcData (for session-level assessors the type is proc:genProcData).