Extending Projects Part 2: SMTK Updates

This is part 2 of 3 in the series on updating SMTK projects for multiple simulation sequences. Part 1 discussed UI updates that are intended to be implemented in the project plugin module. Part 2 now describes proposed changes to SMTK core for supporting the multiple-simulation use case.

The requirements are based primarily on my experience working with the ACE3P developers at SLAC. The intent, of course, is to be able to support other applications and other workgroups. The main requirements are:

  • Track the location where simulation results are stored, so that results/data from one simulation can be transparently used as input to other simulation runs. (Right now users simply have to know where their datasets are on Cori, which is not easy, in part because we insert a 24-character job id in the path.)
  • Handle multiple instances of a given simulation, to support typical iterative development methodologies.

The basic strategy is to add an smtk::project::Job class to store metadata associated with each simulation job that is run with project resources. An instance of the Job class will be added to the project each time a long-duration computing process is run. The Job instance will store information such as the job id assigned by the computing system, the date and time the job was started, the filesystem location of input and output data, and the state of the job (queued, running, complete, error, etc.). For each simulation, the project can store any number of jobs, one of which can be designated as the “current” job, indicating which job outputs should be used when configured as input to other simulations. It is expected that the Job class will be used by simulation-specific classes, either through inheritance or containment, to store data relevant to their individual use cases.

In practice, the thinking is that simulation-specific plugins (ace3pextensions, truchas-extension) will create and update Job instances, and the project-manager plugin will display that data.

New Class smtk::project::Job

The public API for the class is still being finalized, but the current strawman includes:

attributeResource()  // the attribute resource that specifies the job/analysis

description()        // a user-supplied string

id()                 // a system-assigned job id (string)

remoteLocation()     // URL for locating data generated on remote machine (if applicable)

startDateTime()      // timestamp indicating when the job was created/started

state()/setState()   // enum with values such as created, queued, running, etc.
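
To make the strawman concrete, here is a minimal, self-contained C++ sketch of what such an interface could look like. This is not the actual SMTK implementation; the stand-in types (AttributeResourcePtr, a plain ISO-8601 timestamp string) and the enum values are assumptions drawn from the discussion here.

```cpp
#include <memory>
#include <string>

namespace sketch
{
// Stand-in for smtk::attribute::Resource; the real Job would hold an
// smtk::attribute::ResourcePtr instead.
struct AttributeResource;
using AttributeResourcePtr = std::shared_ptr<AttributeResource>;

// Runtime state stored by the Job; the application sets and updates it.
enum class JobState
{
  Created,
  Queued,
  Running,
  Complete,
  Error
};

class Job
{
public:
  // Attribute resource that specifies the job/analysis.
  AttributeResourcePtr attributeResource() const { return m_attResource; }

  // User-supplied description string.
  const std::string& description() const { return m_description; }

  // System-assigned job id (string).
  const std::string& id() const { return m_id; }

  // URL for locating data generated on a remote machine, if applicable.
  const std::string& remoteLocation() const { return m_remoteLocation; }

  // Timestamp for when the job was created/started; a plain ISO-8601 string
  // here, though the real class might use an SMTK date-time type.
  const std::string& startDateTime() const { return m_startDateTime; }

  // Runtime state accessors.
  JobState state() const { return m_state; }
  void setState(JobState state) { m_state = state; }

protected:
  AttributeResourcePtr m_attResource;
  std::string m_description;
  std::string m_id;
  std::string m_remoteLocation;
  std::string m_startDateTime;
  JobState m_state = JobState::Created;
};
} // namespace sketch
```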

The ACE3P subclass will likely include methods such as:

nerscId()  // we will use the 24-char Cumulus id as the SMTK job id; however, the NERSC scheduler also assigns its own job id (a much friendlier 6-digit number)

resultsComponent() // returns the relative path to specific elements within an ACE3P results folder, e.g., mode files and deformed-mesh files.

slurmScript()      // the SLURM command script that was used to submit the job
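
A hypothetical sketch of that subclass, building on the Job sketch above; the member names and the choice to store resultsComponent() as a plain string are assumptions, not settled design.

```cpp
#include <string>

namespace sketch
{
// Builds on the Job sketch above; member and method details are illustrative.
class ACE3PJob : public Job
{
public:
  // NERSC scheduler id (the friendlier 6-digit number); the 24-char Cumulus
  // id is stored as the base-class id().
  const std::string& nerscId() const { return m_nerscId; }

  // Relative path to a specific element within the ACE3P results folder,
  // e.g. a mode file or deformed-mesh file. Stored as a plain string here;
  // the real method might instead take a key naming which element to return.
  const std::string& resultsComponent() const { return m_resultsComponent; }

  // SLURM command script that was used to submit the job.
  const std::string& slurmScript() const { return m_slurmScript; }

protected:
  std::string m_nerscId;
  std::string m_resultsComponent;
  std::string m_slurmScript;
};
} // namespace sketch
```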

Job updates to smtk::project::Project

Basic accessors to be added to the Project class include:

addJob(attResource, Job, setCurrent=true)  // store job instance for given attribute resource

deleteJob(smtk::project::Job)

findJob(id) // return job instance for given id (string)

getCurrentJob(smtk::attribute::Resource)  // return current job, if any, for given attribute resource

getJobs()  // return list of all jobs stored in the project

getJobs(attributeResource)  // return list of jobs stored for a given attribute resource

setCurrentJob(resource, job)  // set current job for given attribute resource (which can be nullptr)
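
Again as a non-authoritative sketch, here is how those accessors might hang together, reusing the stand-in types from the Job sketch above. The container choices and the interpretation of setCurrentJob's nullptr case are assumptions.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

namespace sketch
{
// Uses the Job and AttributeResourcePtr stand-ins from the sketch above.
class Project
{
public:
  using JobPtr = std::shared_ptr<Job>;

  // Store a job instance for the given attribute resource, optionally
  // designating it as the current job for that resource.
  void addJob(AttributeResourcePtr attResource, JobPtr job, bool setCurrent = true)
  {
    m_jobs[job->id()] = job;
    m_jobsByResource[attResource].push_back(job);
    if (setCurrent)
    {
      m_currentJob[attResource] = job;
    }
  }

  // Remove a job from the project (declaration only in this sketch).
  void deleteJob(JobPtr job);

  // Return the job instance for the given id, or nullptr if none is found.
  JobPtr findJob(const std::string& id) const
  {
    auto it = m_jobs.find(id);
    return it == m_jobs.end() ? nullptr : it->second;
  }

  // Return the current job, if any, for the given attribute resource.
  JobPtr getCurrentJob(AttributeResourcePtr attResource) const
  {
    auto it = m_currentJob.find(attResource);
    return it == m_currentJob.end() ? nullptr : it->second;
  }

  // Return all jobs stored in the project (declaration only in this sketch).
  std::vector<JobPtr> getJobs() const;

  // Return the jobs stored for a given attribute resource (declaration only).
  std::vector<JobPtr> getJobs(AttributeResourcePtr attResource) const;

  // Set the current job for the given attribute resource; interpreted here
  // as allowing a nullptr job to clear the designation.
  void setCurrentJob(AttributeResourcePtr attResource, JobPtr job)
  {
    m_currentJob[attResource] = job;
  }

protected:
  std::map<std::string, JobPtr> m_jobs;                                 // by job id
  std::map<AttributeResourcePtr, std::vector<JobPtr>> m_jobsByResource; // by resource
  std::map<AttributeResourcePtr, JobPtr> m_currentJob;                  // current per resource
};
} // namespace sketch
```

The expected usage would be to call addJob() when a solver run is launched and setCurrentJob() once the user decides which results downstream analyses should consume.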

Issues and Related Topics

Job instances will need to migrate between the modelbuilder server and client processes.

I had earlier posed a question of whether or not to save a copy of the simulation attributes in the Job instance. My revised position is that the job instance should, at a minimum, store a checksum for each resource used to generate the simulation input data. This could be replaced by resource versions in the future.
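
For illustration only, here is a minimal sketch of the kind of per-resource checksum a Job could record. The helper name resourceChecksum and the choice of a simple FNV-1a hash are assumptions; a real implementation might well use a stronger digest or, as noted, resource versions once those exist.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper (not SMTK API): FNV-1a hash over a resource file's bytes.
std::uint64_t resourceChecksum(const std::string& path)
{
  std::uint64_t hash = 14695981039346656037ull; // FNV-1a offset basis
  std::ifstream in(path, std::ios::binary);
  char byte;
  while (in.get(byte))
  {
    hash ^= static_cast<unsigned char>(byte);
    hash *= 1099511628211ull; // FNV-1a prime
  }
  return hash;
}
```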

@johnt As far as I understand what you are describing, I have a strong aversion to localFolder() and remoteFolder(). There should be a single location per simulation output resource and it should be a URL. If some application wants local copies of job output, that should be tracked elsewhere.

Have you thought about how Catalyst instrumentation might interplay with jobs?

Agree.

  • In fact, I will delete “local folder” because that is the project directory (at least for the foreseeable future).
  • As for representing the remote location, a URL makes sense, though we’ll have to come up with our own/custom pattern for HPC machines. Maybe something like cori.nersc.gov/~johnt/$scratch/path/to/project/whatever. (Maybe the scheme should be cmb:// or smtk://?)

Have not, but we should definitely coordinate smtk job instances with Catalyst-generated data.

Philosophically, a job feels like a persistent object that may not be a file. Things get complicated because a job’s lifecycle may involve movement across machines — or even duplication if the same job is submitted to multiple computers in the hopes of running sooner or re-running (for the sake of reproducibility or to recover lost results). On the other hand, jobs should make the simple use case (running the simulation locally) easy.

  • Do jobs really belong to projects? I can imagine jobs existing without them, the same as a resource.
  • Jobs will need to accumulate data, and that data may accumulate in parallel on different computers. For example, a user may submit a job to NERSC, transferring it to some process there for queuing and execution. The fact that the job has been submitted is recorded in the job, presumably before serialization and transmission. The user could then submit it to AWS (again, adding metadata). The original NERSC submission and the AWS submission add metadata as the job runs and then transmit their results back. How will this be reconciled? If you say that the above must be 2 jobs, then how will users realize the two are related?

No, nothing about the Job class ties it to projects. Jobs can be used by themselves.

I am not sure this addresses your question, but the notion of data accumulation is outside the context of “job” as described here. My mental model is that a project, in conjunction with a TBD workflow manager, does this data accumulation by running jobs on computing resources and handling data management/orchestration matters. Each job instance just knows about its own inputs and outputs, not where they came from or go to.

So to the question “how will users know they are related?”, the answer is in the project, where they are stored as 2 jobs for the same simulation.

Also, in case my writing hasn’t been very clear: the proposed smtk::project::Job class stores metadata, not the actual simulation inputs or results.

A follow-up note about job state. Current thinking is to split the job state into a runtime state and a workflow state:

1. The Job class will store a runtime state, probably as an enum of the most commonly used values: created, queued, running, complete, error. The Job class only provides storage and serialization; the application must set and update the state for each job instance.

2. In addition to the Job storing its runtime state, the project will separately track a workflow state for each job. The main purpose is to identify which job reflects the “current” results for a given analysis.

  • For each analysis, one job instance can be designated by the user as the accepted job. The results from the accepted job can be used by any coupled downstream analysis.

  • Each time a job is created/submitted, the currently-accepted job (if any) is automatically changed to some new designation to indicate that it was the previously-accepted job. To name this workflow state, all I can think of right now is benched. In other words, a job is benched if it was the accepted job for an analysis when a new job was created/submitted.

  • Each time a job is completed, the user should evaluate the results and decide whether to mark the job accepted or not. This workflow state might be called review?

  • If the latest job results are not accepted, the user should go back and change the benched job back to accepted. The intent is to require the user to explicitly set/update the accepted job each time new results are received.

  • If a sequence of loosely-coupled jobs is run, a single accept/reject decision should be atomically applied to all jobs in the sequence, although nothing in our UI should prevent the user from accepting individual jobs in the sequence.

Post-edit: replace “benched” with “stashed”.
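
To make the transitions concrete, here is a sketch (assuming the post-edited name “stashed”) of how a project might track this workflow state per analysis. The class and method names (AnalysisJobs, onNewJobSubmitted, acceptJob) are illustrative only, not proposed API.

```cpp
#include <map>
#include <string>

// Workflow states tracked by the project for each job, separate from the
// job's runtime state. "Stashed" is the post-edit name for "benched".
enum class WorkflowState
{
  Review,   // job completed; awaiting the user's accept/reject decision
  Accepted, // results designated for use by coupled downstream analyses
  Stashed   // was the accepted job when a newer job was created/submitted
};

// Hypothetical per-analysis bookkeeping, not proposed SMTK API.
class AnalysisJobs
{
public:
  // A new job was created/submitted: the currently accepted job (if any)
  // is automatically stashed.
  void onNewJobSubmitted()
  {
    if (!m_accepted.empty())
    {
      m_state[m_accepted] = WorkflowState::Stashed;
      m_accepted.clear();
    }
  }

  // A job completed: the user should now review its results.
  void onJobCompleted(const std::string& jobId)
  {
    m_state[jobId] = WorkflowState::Review;
  }

  // The user explicitly accepts a job, whether it is the newly reviewed job
  // or a stashed job being restored after the latest results were rejected.
  void acceptJob(const std::string& jobId)
  {
    if (!m_accepted.empty() && m_accepted != jobId)
    {
      m_state[m_accepted] = WorkflowState::Stashed;
    }
    m_state[jobId] = WorkflowState::Accepted;
    m_accepted = jobId;
  }

private:
  std::map<std::string, WorkflowState> m_state; // workflow state per job id
  std::string m_accepted;                       // id of the accepted job, if any
};
```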

@johnt I like the idea of users reviewing jobs, but a lot of the steps you’ve outlined seem to make assumptions about customer workflows. Maybe it would be better to think in terms of issue tracking systems like GitLab’s, which let users decide what the tags are and how an issue runs through the system; the state diagram for jobs ought to be programmable.


I can see that. Base class with a tags feature, whether that’s just string values or something more generic TBD.

Makes me think that jobs are like components – they have tags and are linked to attribute resources. If projects are resources (not that I personally like that design), maybe jobs should be components?