DAX Processors, version 3

About

DAX pipelines are defined by creating YAML text files. If you are not familiar with YAML, start here: https://learnxinyminutes.com/docs/yaml/.

A processor YAML file defines the Environment, Inputs, Commands, and Outputs of your pipeline.

Version 3 processors have a number of new options and conveniences.

Processor Repos

There are several existing processors that can be used without modification. The processors in these repositories can also provide valuable examples.

https://github.com/VUIIS/dax_yaml_processor_examples

https://github.com/VUIIS/yaml_processors (private, internal use only)

Overview

The processor file defines how a script to run a pipeline should be created. DAX will use the processor to generate scripts to be submitted to your cluster as jobs. The script will contain the commands to download the inputs from XNAT, run the pipeline, and prepare the results to be uploaded back to XNAT (the actual uploading is performed by DAX via dax upload).

A Basic Example

---
procyamlversion: 3.0.0-dev.0                 # Indicates to run as a v3 processor

containers:                                  # Containers we will ref in the command section
  - name: EXAMP                                  # Reference by this name in command section
    path: example_v2.0.0.sif                     # Name/path that is replaced in command section
    source: docker://vuiiscci/example:v2.0.0     # Not used, but good practice to set it

requirements:  # Requirements for the cluster node, substituted into SBATCH section of job template
  walltime: 0-2  # Time to request - SLURM supports the format DAYS-HOURS
  memory: 16G

inputs:
  vars:   # Keyvalues to substitute in the command, for passing static settings
      - param1: param1value
  xnat:
    attrs:  # Values to extract from xnat at the specified level of the current instance
      - varname:  scanID     # Name to be used to dereference later
        object:   scan       # Source, of: project, subject, session, scan, assessor
        attr:     ID         # Name of the field in xnat
        ref:      scan_fmri  # From which object in inputs, referred to by name
    scans:
      - name: scan_fmri       # the name of this scan to dereference later
        types: fMRI_run*      # the scan types to match on the session in XNAT
        nifti: fmri.nii.gz    # Shortcut to download file in NIFTI resource as fmri.nii.gz
        resources:            # To get files in other resources
          - resource: EDAT        # Name of the resource
            fdest: edat.txt       # Download the file as edat.txt
            varname: edat_txt     # Reference for command string substitution
    assessors:
      - name: assr_preproc
        proctypes: preproc-fmri_v2
        resources:
          - {resource: FILTERED_DATA, fdest: filtered_data.nii.gz}

outputs:
  - pdf: report*.pdf        # Matching file uploaded to PDF resource
  - stats: stats.txt        # Matching file uploaded to STATS resource
  - dir: PREPROC            # Matching directory (PREPROC) uploaded to PREPROC resource
  - path: inputpathname     # General purpose for other outputs
    type: DIR                   # Type is FILE or DIR
    resource: RESOURCENAME      # Store it in resource RESOURCENAME

# Available commands are 'singularity_run' and 'singularity_exec'. These include default
# flags --contain --cleanenv, and mount points for temp space plus INPUTS and OUTPUTS
command:
  type: singularity_run
  extraopts: []       # Appends to default options for the run command
  container: EXAMP    # Name of the container in the list above
  args: >-
    --fmri_file /INPUTS/fmri.nii.gz
    --filtered_file /INPUTS/filtered_data.nii.gz
    --param1 {param1value}
    --scan_id {scanID}
    --edat_txt /INPUTS/{edat_txt}

description: |
  Example description that gets printed to every PDF created by this processor
  1. step 1 does something cool
  2. step 2 does this other thing

# Specify the job template to use (examples: https://github.com/VUIIS/dax_templates/)
job_template: job_template_v3.txt

Parts of the Processor YAML

inputs (required)

The inputs section defines the files and parameters to be prepared for the pipeline. Currently, the only subsections of inputs supported are vars and xnat.

The vars subsection can store parameters to be passed as pipeline options, such as smoothing kernel size, etc that may be more conveniently coded here to substitute into the command arguments.

The xnat section defines the files, directories or values that are extracted from XNAT and passed to the command. Currently, the subsections of xnat that are supported are scans, assessors, attrs, and filters. Each of these subsections contains an array with a specific set of fields for each item in the array.

xnat scans

Each xnat scans item requires a types field. The types field is used to match against the scan type attribute on XNAT. The value can be a single string or a comma-separated list. Wildcards are also supported.

The resources subsection of each xnat scan should contain a list of resources to download from the matched scan.

ftype specifies what type to downloaded from the resource, either FILE, DIR, or DIRJ. FILE will download individual files from the resource. DIR will download the whole directory from the resource with the hierarchy maintained. DIRJ will also download the directory but strips extraneous intermediate directories from the produced path as implemented by the -j flag of unzip.

The varname field defines tags to be replaced in the command string template (see below).

The optional fmatch field defines a regular expression to apply to filter the list of filenames in the resource. fmulti affects how inputs are handled when there are multiple matching files in a resource. By default, this situation causes an exception, but if fmulti is set to any1, a single (arbitrary) file is selected from the matching files instead.

By default, any scan that matches will be included as an available input. Several optional settings affect this:

  • If needs_qc is True and require_usable is False or not specified, assessors that would have a scan as an input will be created, but will not run if the scan is marked unusable.

  • If needs_qc is True and require_usable is also True, assessors that would have a scan as an input will be created, but will not run unless the scan is marked usable.

  • If skip_unusable is True, assessors that would have an unusable scan as an input will not even be created.

  • keep_multis may be all (the default); first; last; or an index 1,2,3,… This applies when there are multiple scans in the session that match as possible inputs. Normally all matching scans are used as inputs, multiplying assessors as needed. When first is specified, only the first matching scan will be used as an input, reducing the number of assessors built by a factor of the number of matching scans. “First” is defined as alphabetical order by scan ID, cast to lowercase. The exact scan type is not considered; only whether there is a match with the types specified.

xnat assessors

Each xnat assessor item requires a proctype field. The proctype field is used to match against the assessor proctype attribute on XNAT. The value can be a single string or a comma-separated list. Wildcards are also supported.

Any assessor that matches proctype will be included as a possible input. However if needs_qc is set to True, input assessors with a qcstatus of “Needs QA”, “Bad”, “Failed”, “Poor”, or “Do Not Run” will cause the new assessor not to run.

The resources subsection of each xnat assessor should contain a list of resources to download from the matched scan.

The ftype specifies what type to downloaded from the resource, either FILE, DIR, or DIRJ. FILE will download individual files from the resource. DIR will download the whole directory from the resource with the hierarchy maintained. DIRJ will also download the directory but strips extraneous intermediate directories from the produced path as impelemented by the “-j” flag of unzip.

The varname field defines the tag to be replaced in the command string template (see below).

Optional fields for a resource are fmatch and fdest. fmatch defines a regular expression to apply to filter the list of filenames in the resource. The inputs for some containers are expected to be in specific locations with specific filenames. This is accomplished using the fdest field. The file or directory gets copied to /INPUTS and renamed to the name specified in fdest.

xnat attrs

You can evaluate attributes at the subject, session, or scan level. Any fields that are accessible via the XNAT API can be queried. Each attrs item should contain a varname, object, and attr. varname specifies the tag to be replaced in the command string template. object is the XNAT object type to query and can be either subject, session, or scan. attr is the XNAT field to query. If the object type is scan, then a scan name from the xnat scans section must be included with the ref field.

For example:

attrs:
    - varname: project
      object: session
      attr: project

# Or equivalently
attrs:
    - {varname: project, object: assessor, attr: project}

This will extract the value of the project attribute from the assessor object and replace {project} in the command template.

xnat filters

filters allows you to filter a subset of the cartesian product of the matched scans and assessors. Currently, the only filter implemented is a match filter. It will only create the assessors where the specified list of inputs match. This is used when you want to link a set of assessors that all use the same initial scan as input.

For example:

filters:
    - type: match
      inputs: scan_t1,assr_freesurfer/scan_t1

This will tell DAX to only run this pipeline where the value for scan_t1 and assr_freesurfer/scan_t1 are the same scan.

outputs

The outputs section defines a list files or directories to be uploaded to XNAT upon completion of the pipeline. Each output item must contain fields path, type, and resource. The path value contains the local relative path of the file or directory to be uploaded. The type of the path should either be FILE or DIR. The resource is the name of resource of the assessor created on XNAT where the output is to be uploaded.

For every processor, a PDF output with resource named PDF is required and must be of type FILE.

PDF and STATS outputs, as well as DIR type outputs, have shortcuts as shown in the example.

command

The command field defines a string template that is formatted using the values from inputs.

Each tag specified inside curly braces (“{}””) corresponds to a field in the defaults input section, or to a var field from a resource on an input or to a varname in the xnat attrs section.

See the example for explanations of the other fields.

jobtemplate

The jobtemplate is a text file that contains a template to create a batch job script.

Versioning

Processor name and version are parsed from the processor file name, based on the format <NAME>_v<major.minor.revision>.yaml. <NAME>_v<major> will be used as the proctype.

Notes on singularity options

The default options are SINGULARITY_BASEOPTS in dax/dax/processors_v3.py:

--contain --cleanenv
--home $JOBDIR
--bind $INDIR:/INPUTS
--bind $OUTDIR:/OUTPUTS
--bind $JOBDIR:/tmp
--bind $JOBDIR:/dev/shm

$JOBDIR, $INDIR, $OUTDIR are available at run time, and refer to locations on the filesystem of the node where the job is running.

Singularity has default binds that differ between installations. –contain disables these to prevent cross-talk with the host filesystem. And –cleanenv prevents cross-talk with the host environment. However, with –contain, some spiders will need to have specific temp space on the host attached. E.g. for some versions of Freesurfer, –bind ${INDIR}:/dev/shm. For compiled Matlab spiders, we need to provide –home $INDIR to avoid .mcrCache collisions in temp space when multiple spiders are running. And, some cases may require ${INDIR}:/tmp or /tmp:/tmp. Thus the defaults above.

The entire singularity command is built as:

singularity <run|exec> <SINGULARITY_BASEOPTS> <extraopts> <container> <args>

Subject-Level Processors

As of version 2.7, dax supports subject-level processors, in addition to session-level. The subject-level processors can include inputs across multiple sessions within the same subject. In the processor yaml, a subject-level processor is implied by including the “sessions” level between inputs.xnat and scans/assessors. Each session requires the attribute types. The types are matched against the XNAT field xnat:imageSessionData/session_type. Currently the match must be exact.

To set the session type of a session, you can use dax/pyxnat:

xnat.select_session(PROJ, SUBJ, SESS).attrs.set('session_type', SESSTYPE)

Below is an example of a subject-level processor that will include an assessor from two different sessions of session types Baseline and Week12.

---
procyamlversion: 3.0.0-dev.0
containers:
  - name: EMOSTROOP
    path: fmri_emostroop_v2.0.0.sif
    source: docker://bud42/fmri_emostroop:v2
requirements:
  walltime: 0-2
  memory: 16G
inputs:
  xnat:
    sessions:
      - types: Baseline
        assessors:
          - name: assr_emostroop_a
            types: fmri_emostroop_v1
            resources:
              - resource: PREPROC
                fmatch: swauFMRI.nii.gz
                fdest:  swauFMRIa.nii.gz
      - types: Week12
        assessors:
          - name: assr_emostroop_c
            types: fmri_emostroop_v1
            resources:
              - resource: PREPROC
                fmatch: swauFMRI.nii.gz
                fdest:  swauFMRIc.nii.gz
outputs:
  - dir: PREPROC
  - dir: 1stLEVEL
command:
  type: singularity_run
  container: EMOSTROOP
  args: BLvsWK12

The assessor will be created under the subject on XNAT, at the same level as a session. The proctype of the assessor will be derived from the filename just like session-level processors. The XNAT data type of the assessor, or xsiType, will be proc:subjGenProcData (for session-level assessors the type is proc:genprocData).