DAX Processors

About

DAX pipelines are defined by creating YAML text files. If you are not familiar with YAML, start here: https://learnxinyminutes.com/docs/yaml/.

A processor YAML file defines the Environment, Inputs, Commands, and Outputs of your pipeline.

Processor Repos

There are several existing processors that can be used without modification. The processors in these repositories can also provide valuable examples.

https://github.com/VUIIS/dax_yaml_processor_examples

https://github.com/VUIIS/yaml_processors (private, internal use only)

Overview

The processor file defines how a script to run a pipeline should be created. DAX will use the processor to generate scripts to be submitted to your cluster as jobs. The script will contain the commands to download the inputs from XNAT, run the pipeline, and prepare the results to be uploaded back to XNAT (the actual uploading is performed by DAX via dax upload).

A “Simple” Example

---
moreauto: true
inputs:
  default:
    container_path: MRIQA_v1.0.0.simg
  xnat:
    scans:
      - name: scan_t1
        types: MPRAGE
        resources:
          - resource: NIFTI
            ftype: FILE
            varname: t1_nifti
outputs:
  - path: stats.txt
    type: FILE
    resource: STATS
  - path: report.pdf
    type: FILE
    resource: PDF
  - path: DATA
    type: DIR
    resource: DATA
command: >-
  singularity
  run
  --contain
  --cleanenv
  --home $INDIR
  --bind $INDIR:/dev/shm
  --bind $INDIR:/tmp
  --bind $INDIR:/INPUTS
  --bind $OUTDIR:/OUTPUTS
  {container_path}
  --t1_nifti /INPUTS/{t1_nifti}
attrs:
  walltime: '36:00:00'
  memory: 8192

Parts of the Processor YAML

All processor YAML files should start with these two lines:

---
moreauto: true

The primary components of a processor YAML file are:

  • inputs

  • outputs

  • command

  • attrs

Each of these components is required.

inputs

The inputs section defines the files and parameters to be prepared for the pipeline. Currently, the only subsections of inputs supported are defaults and xnat.

The defaults subsection can contain paths to local resources such as singularity containers, local codebases, local data to be used by the pipeline. It can essentially contain any value that needs to be passed directly to the command template (see below).

The xnat section defines the files, directories or values that are extracted from XNAT and passed to the command. Currently, the subsections of xnat that are supported are scans, assessors, attrs, and filters. Each of these subsections contains an array with a specific set of fields for each item in the array.

xnat scans

Each xnat scans item requires a types field. The types field is used to match against the scan type attribute on XNAT. The value can be a single string or a comma-separated list. Wildcards are also supported.

By default, any scan that matches will be included. You can exclude scans with a quality of unusable on XNAT by including the field needs_qc with value of True. The default is to run anything, i.e. a needs_qc value of False. Note that questionable is treated the same as usable, so they’ll always run.

The resources subsection of each xnat scan should contain a list of resources to download from the matched scan. Each resource requires fields for ftype and var.

ftype specifies what type to downloaded from the resource, either FILE, DIR, or DIRJ. FILE will download individual files from the resource. DIR will download the whole directory from the resource with the hierarchy maintained. DIRJ will also download the directory but strips extraneous intermediate directories from the produced path as implemented by the -j flag of unzip.

The var field defines the tag to be replaced in the command string template (see below).

The optional fmatch field defines a regular expression to apply to filter the list of filenames in the resource.

xnat assessors

Each xnat assessor item requires a proctype field. The proctype field is used to match against the assessor proctype attribute on XNAT. The value can be a single string or a comma-separated list. Wildcards are also supported.

By default, any assessor that matches proctype will be included. However if needs_qc is set to True, assessors with a qcstatus of “Needs QA”, “Bad”, “Failed”, “Poor”, or “Do Not Run” will be excluded.

The resources subsection of each xnat assessor should contain a list of resources to download from the matched scan. Each resource requires fields for ftype and var.

The ftype specifies what type to downloaded from the resource, either FILE, DIR, or DIRJ. FILE will download individual files from the resource. DIR will download the whole directory from the resource with the hierarchy maintained. DIRJ will also download the directory but strips extraneous intermediate directories from the produced path as impelemented by the “-j” flag of unzip.

The var field defines the tag to be replaced in the command string template (see below).

Optional fields for a resource are fmatch, fdest and fcount. fmatch defines a regular expression to apply to filter the list of filenames in the resource. fcount can be used to limit the number of files matched. By default, only 1 file is downloaded. The inputs for some containers are expected to be in specific locations with specific filenames. This is accomplished using the fdest field. The file or directory gets copied to /INPUTS and renamed to the name specified in fdest.

xnat attrs

You can evaluate attributes at the subject, session, or scan level. Any fields that are accessible via the XNAT API can be queried. Each attrs item should contain a varname, object, and attr. varname specifies the tag to be replaced in the command string template. object is the XNAT object type to query and can be either subject, session, or scan. attr is the XNAT field to query. If the object type is scan, then a scan name from the xnat scans section must be included with the ref field.

For example:

attrs:
    - varname: project
      object: session
      attr: project

This will extract the value of the project attribute from the session object and replace {project} in the command template.

xnat filters

filters allows you to filter a subset of the cartesian product of the matched scans and assessors. Currently, the only filter implemented is a match filter. It will only create the assessors where the specified list of inputs match. This is used when you want to link a set of assessors that all use the same initial scan as input.

For example:

filters:
    - type: match
      inputs: scan_t1,assr_freesurfer/scan_t1

This will tell DAX to only run this pipeline where the value for scan_t1 and assr_freesurfer/scan_t1 are the same scan.

outputs

The outputs section defines a list files or directories to be uploaded to XNAT upon completion of the pipeline. Each output item must contain fields path, type, and resource. The path value contains the local relative path of the file or directory to be uploaded. The type of the path should either be FILE or DIR. The resource is the name of resource of the assessor created on XNAT where the output is to be uploaded.

For every processor, a PDF output with resource named PDF is required and must be of type FILE.

command

The command field defines a string template that is formatted using the values from inputs.

Each tag specified inside curly braces (“{}””) corresponds to a field in the defaults input section, or to a var field from a resource on an input or to a varname in the xnat attrs section.

Not all var must be used.

attrs

The attrs section defines miscellaneous other attributes including cluster parameters. These values replace tags in the jobtemplate.

jobtemplate

The jobtemplate is a text file that contains a template to create a batch job script.

Versioning

By default, name and version are parsed from the container file name, based on the format: <NAME>_v<major.minor.revision>.simg where<NAME>_v<major> is the proctype.

The YAML file can override these by using any of the top level fields procversion, procname, and/or proctype. procversion specifies the major.minor.revision, e.g. 1.0.2. procname specifies the name only without version, e.g. mprage. proctype is the name and major version, e.g. mprage_v1. If only procname is specified, the version is parsed from the container name. If only procversion is specified, the name is parsed from the container name. If proctype is specified, it will override everything else to determine proctype.

Notes on Singularity run options

–cleanenv avoids env confusion, and –contain prevents accidentally using code from the host filesystem. However, with –contain, some spiders will need to have specific temp space on the host attached. E.g. for some versions of Freesurfer, –bind ${INDIR}:/dev/shm. For compiled Matlab spiders, we need to provide –home $INDIR to avoid .mcrCache collisions in temp space when multiple spiders are running. And, some cases may require ${INDIR}:/tmp or /tmp:/tmp.