Extending Projects: Jobs Management & Provenance

C_Wetterer-Nelson · April 13, 2021, 11:24pm

Currently, the ACE3P extension has basic support for connecting to the Cori supercomputer at NERSC to run a simulation (Job) and report a list of jobs that have been run. However, the existing functionality is limited in a number of ways. With the advent of the new Projects system, it is time to re-engineer Jobs management and enable some key features. Most important is the concept of jobs provenance: the capacity to immediately save all the information needed to reproduce the results of a given simulation (Job). The current implementation strategy is as follows:

Define a Job class which will be implemented as a thin wrapper around a JSON object. The JSON object will store all metadata associated with a Job.
Add a list (std::vector for now) of Job objects to the Project class, which will be serialized something like this:

{
  "Jobs": [
    {
      "SLURM_ID": "023411",
      "CUMULUS_ID": "5e239ab2341082398021c",
      "MACHINE": "Cori",
      "JOB_NAME": "Run 10",
      "ANALYSIS_STEP": "Omega3P",
      "NODES": 6,
      "PROCESSES": 144,
      "RUNTIME": 28800,
      "NOTES": "Here is where we type notes",
      "ANALYSIS_UUID": "b886560a-e609-4dbb-a7c5-1b582f9e2773",
      "ANALYSIS_URL": "path/to/the/saved/analysis.smtk"
    },
    {
      "SLURM_ID": "3741238",
      "CUMULUS_ID": "5e239ab2318947109802d",
      "MACHINE": "Cori",
      "JOB_NAME": "Rent",
      "ANALYSIS_STEP": "Tmp3P",
      "NODES": 12,
      "PROCESSES": 288,
      "RUNTIME": 525600,
      "NOTES": "Added more cups of coffee",
      "ANALYSIS_UUID": "75ed5514-daae-4775-96db-3f0916174d15",
      "ANALYSIS_URL": "path/to/the/saved/other_analysis.smtk"
    }
  ]
}

to a standalone file saved to the top level of the project folder. The list of metadata is minimal and will be expanded as needed (fortunately, the JSON object under the hood of the Job class will make extensibility easy).
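To make the "thin wrapper" idea concrete, here is a minimal sketch of a Job class backed by a JSON object, plus the round trip to a standalone file. It is written in Python purely for illustration (the real class would presumably be C++ in the ACE3P extension), and the names `Job`, `write_jobs`, `read_jobs`, and the file name `jobs.json` are assumptions, not part of any existing API.

```python
import json
import tempfile
from pathlib import Path


class Job:
    """Thin wrapper around a JSON-style dict holding all metadata for one job."""

    def __init__(self, metadata):
        # The dict is the single source of truth, so new fields need no class changes.
        self._data = dict(metadata)

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value

    def to_json(self):
        return dict(self._data)


def write_jobs(jobs, project_dir, filename="jobs.json"):
    """Serialize the job list to a standalone file at the top level of the project folder."""
    path = Path(project_dir) / filename  # hypothetical file name
    path.write_text(json.dumps({"Jobs": [job.to_json() for job in jobs]}, indent=2))
    return path


def read_jobs(path):
    """Rebuild Job objects from the standalone file."""
    return [Job(record) for record in json.loads(Path(path).read_text())["Jobs"]]


# Round trip: write one job, add a field that was never "declared", read it back.
with tempfile.TemporaryDirectory() as project_dir:
    job = Job({"SLURM_ID": "023411", "MACHINE": "Cori", "NODES": 6})
    job.set("QUEUE", "regular")  # extra metadata, no schema change required
    restored = read_jobs(write_jobs([job], project_dir))
```

Because the wrapper never enumerates its fields, adding metadata later only changes what gets written, not the class itself.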

When a Job is created (by exporting and submitting a job to an HPC resource), this list will be appended to and a snapshot of the progenitor Analysis (represented internally as an Attribute Resource) will be saved to a Jobs Data subdirectory of the project folder with a unique name. This will provide provenance so that if an Analysis is edited after a job has been submitted from that Analysis, we will still have a copy of all the data required to recreate that job. This Analysis artifact can then be stored for later use. Saving this artifact will be optional.
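The submit-time behavior described above could look roughly like the following sketch, again in Python for illustration only. It copies the Analysis file into a subdirectory under a unique name and appends a record to the job list. The function name `record_job`, the directory name `jobs_data`, the file name `jobs.json`, and the use of a random UUID for the snapshot name are all assumptions.

```python
import json
import shutil
import tempfile
import uuid
from pathlib import Path


def record_job(project_dir, analysis_path, metadata, save_snapshot=True):
    """Append a job record and, optionally, snapshot the Analysis it came from."""
    project_dir = Path(project_dir)
    record = dict(metadata)

    if save_snapshot:
        snapshot_dir = project_dir / "jobs_data"  # hypothetical directory name
        snapshot_dir.mkdir(parents=True, exist_ok=True)
        # A random UUID keeps snapshots from different jobs from colliding.
        snapshot = snapshot_dir / f"{uuid.uuid4()}{Path(analysis_path).suffix}"
        shutil.copy2(analysis_path, snapshot)
        record["ANALYSIS_URL"] = snapshot.relative_to(project_dir).as_posix()

    jobs_file = project_dir / "jobs.json"  # hypothetical file name
    jobs = json.loads(jobs_file.read_text())["Jobs"] if jobs_file.exists() else []
    jobs.append(record)
    jobs_file.write_text(json.dumps({"Jobs": jobs}, indent=2))
    return record


# The snapshot is unaffected when the Analysis is edited after submission.
with tempfile.TemporaryDirectory() as project:
    analysis = Path(project) / "analysis.smtk"
    analysis.write_text("version 1")
    rec = record_job(project, analysis, {"JOB_NAME": "Run 10"})
    analysis.write_text("version 2")  # later edit to the live Analysis
    snapshot_text = (Path(project) / rec["ANALYSIS_URL"]).read_text()
    job_count = len(json.loads((Path(project) / "jobs.json").read_text())["Jobs"])
```

Passing `save_snapshot=False` mirrors the point that saving the artifact is optional: the job is still listed, but no copy of the Analysis is kept.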

This topic has a lot of hooks to the ongoing conversation on Resource versioning, and this proposed implementation certainly feels brute force, potentially saving entire Attribute Resources to file over and over. On the other hand, having direct access to the exact data used to generate that job would massively streamline simulation provenance.

johnt · April 14, 2021, 2:59pm

This all works for me. I have some relatively minor comments:

The term “jobs provenance” can be interpreted fairly broadly, so just to be more specific, the first priority in my mind is to integrate job metadata into the project so that users can see the status of jobs organized by project/analysis. Second priority would be to then integrate job results into projects – requirements are still to be defined. Next priority is remote visualization of jobs, so that users can launch ParaView server instances on NERSC to view job output data (including in situ once SLAC updates ACE3P). Next priority is probably support for analysis sequences including automated job submission. Last priority might be full reproducibility, in which individual jobs in a project’s history can be easily rerun.
In the jobs list, I would use all lower-case for the field names (keys).
Saving the attribute resource to file for each job that is submitted could be superseded if/when a resource versioning system is in place. (FYI resource versioning is not part of the planned work for SLAC.)
In addition to saving the attribute resource containing the simulation data for each job, we should also save the export attribute resource.