This is part 2 of 3 in the series on updating SMTK projects for multiple simulation sequences. Part 1 discussed UI updates that are intended to be implemented in the project plugin module. Part 2 describes proposed changes to SMTK core for supporting the multiple-simulation use case.
The requirements are based primarily on my perspectives working with the ACE3P developers at SLAC. The intent, of course, is to be able to support other applications and other workgroups. The main requirements are:
- Track the location where simulation results are stored, so that results/data from one simulation can be transparently used as input to other simulation runs. (Right now users simply have to know where their datasets are on Cori, which is not easy, in part because we insert a 24-character job id in the path.)
- Handle multiple instances of a given simulation, to support typical iterative development methodologies.
The basic strategy is to add an smtk::project::Job class to store metadata associated with each simulation job that is run with project resources. An instance of the Job class will be added to the project each time a long-duration computing process is run. The Job instance will store information such as the job id assigned by the computing system, the date and time the job was started, the filesystem location of input and output data, and the state of the job (queued, running, complete, error, etc.). For each simulation, the project can store any number of jobs, one of which can be designated as the “current” job, indicating which job outputs should be used when configured as input to other simulations. It is expected that the Job class will be used by simulation-specific classes, either through inheritance or containment, to store data relevant to their individual use cases.
In practice, the thinking is that simulation-specific plugins (ace3pextensions, truchas-extension) will create and update Job instances, and the project-manager plugin will display that data.
New Class smtk::project::Job
The public API for the class is still being finalized, but the current strawman includes:
attributeResource() // the attribute resource that specifies the job/analysis
description() // a user-supplied string
id() // a system-assigned job id (string)
remoteLocation() // URL for locating data generated on remote machine (if applicable)
startDateTime() // timestamp indicating when the job was created/started
state()/setState() // enum with values such as created, queued, running, etc.
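To make the strawman concrete, here is a minimal C++ sketch of what such a class could look like. It is not the SMTK implementation: the `sketch` namespace, the `AttributeResource` placeholder (standing in for smtk::attribute::Resource), the `JobState` enumerator names, and the setters for description and remote location are all assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <memory>
#include <string>
#include <utility>

namespace sketch
{
// Placeholder for smtk::attribute::Resource (hypothetical, not the SMTK class).
struct AttributeResource
{
  std::string name;
};

// Assumed state values; the post only lists "created, queued, running, etc."
enum class JobState
{
  Created,
  Queued,
  Running,
  Complete,
  Error
};

class Job
{
public:
  using Clock = std::chrono::system_clock;

  Job(std::string id, std::shared_ptr<AttributeResource> resource, std::string description = {})
    : m_id(std::move(id))
    , m_resource(std::move(resource))
    , m_description(std::move(description))
    , m_start(Clock::now()) // timestamp taken when the instance is created
  {
    assert(!m_id.empty() && "a job needs a system-assigned id");
  }

  // Virtual destructor so simulation-specific subclasses can be deleted safely.
  virtual ~Job() = default;

  const std::string& id() const { return m_id; }
  std::shared_ptr<AttributeResource> attributeResource() const { return m_resource; }

  const std::string& description() const { return m_description; }
  void setDescription(const std::string& text) { m_description = text; }

  // Empty when the job did not run on a remote machine.
  const std::string& remoteLocation() const { return m_remoteLocation; }
  void setRemoteLocation(const std::string& url) { m_remoteLocation = url; }

  Clock::time_point startDateTime() const { return m_start; }

  JobState state() const { return m_state; }
  void setState(JobState state) { m_state = state; }

private:
  std::string m_id;
  std::shared_ptr<AttributeResource> m_resource;
  std::string m_description;
  std::string m_remoteLocation;
  Clock::time_point m_start;
  JobState m_state = JobState::Created;
};
} // namespace sketch
```

In this sketch the id and attribute resource are fixed at construction, while the state and remote location can change as the job progresses.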
The ACE3P subclass will likely include methods such as:
nerscId() // we will use the 24-char Cumulus id as the SMTK job id; however, the NERSC scheduler also assigns its own job id (a much friendlier 6-digit number).
resultsComponent() // returns the relative path to specific elements within an ACE3P results folder, e.g., mode files and deformed-mesh files.
slurmScript() // the SLURM command script that was used to submit the job
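As a rough illustration of the inheritance option, the sketch below derives an ACE3P-specific job from a stripped-down base class. Everything here is hypothetical: the `Ace3pJob` name, the constructor arguments, and the `setResultsComponent()` helper are assumptions, and the relative paths are registered by the caller because the actual ACE3P results layout is not described in this post.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

namespace sketch
{
// Stripped-down stand-in for the proposed smtk::project::Job (hypothetical).
class Job
{
public:
  explicit Job(std::string id)
    : m_id(std::move(id))
  {
  }
  virtual ~Job() = default;
  const std::string& id() const { return m_id; }

private:
  std::string m_id;
};

// Hypothetical ACE3P subclass adding the scheduler id, script, and results lookup.
class Ace3pJob : public Job
{
public:
  Ace3pJob(std::string cumulusId, std::string nerscId, std::string slurmScript)
    : Job(std::move(cumulusId)) // the Cumulus id doubles as the SMTK job id
    , m_nerscId(std::move(nerscId))
    , m_slurmScript(std::move(slurmScript))
  {
  }

  const std::string& nerscId() const { return m_nerscId; }
  const std::string& slurmScript() const { return m_slurmScript; }

  // Register the relative path of a named element inside the results folder.
  void setResultsComponent(const std::string& name, const std::string& relativePath)
  {
    assert(!name.empty() && "component name must not be empty");
    m_components[name] = relativePath;
  }

  // Returns the registered relative path, or an empty string if the name is unknown.
  std::string resultsComponent(const std::string& name) const
  {
    auto it = m_components.find(name);
    return it == m_components.end() ? std::string() : it->second;
  }

private:
  std::string m_nerscId;
  std::string m_slurmScript;
  std::map<std::string, std::string> m_components;
};
} // namespace sketch
```

Containment would look similar, with the ACE3P class holding a Job member instead of deriving from it.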
Job updates to smtk::project::Project
Basic accessors to be added to the Project class include:
addJob(attResource, Job, setCurrent=true) // store job instance for given attribute resource
deleteJob(smtk::project::Job)
findJob(id) // return job instance for given id (string)
getCurrentJob(smtk::attribute::Resource) // return current job, if any, for given attribute resource
getJobs() // return list of all jobs stored in the project
getJobs(attributeResource) // return list of jobs stored for a given attribute resource
setCurrentJob(resource, job) // sets current job for given attribute resource (which can be nullptr)
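The accessors above could be backed by a pair of maps keyed on the attribute resource. The following is only a sketch under stated assumptions: `AttributeResource` and `Job` are minimal placeholders rather than SMTK classes, jobs are held by shared pointer, passing a null job to `setCurrentJob()` is assumed to clear the designation, and deleting the current job is assumed to leave the resource with no current job.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

namespace sketch
{
// Minimal placeholders (hypothetical, not the SMTK classes).
struct AttributeResource
{
  std::string name;
};
struct Job
{
  std::string id;
};

using ResourcePtr = std::shared_ptr<AttributeResource>;
using JobPtr = std::shared_ptr<Job>;

class Project
{
public:
  // Store a job for the given attribute resource, optionally marking it current.
  void addJob(const ResourcePtr& resource, const JobPtr& job, bool setCurrent = true)
  {
    assert(job && "job must not be null");
    m_jobs[resource].push_back(job);
    if (setCurrent)
      m_current[resource] = job;
  }

  // Remove a job; returns false if the job is not stored in the project.
  bool deleteJob(const JobPtr& job)
  {
    for (auto& entry : m_jobs)
    {
      auto& jobs = entry.second;
      auto it = std::find(jobs.begin(), jobs.end(), job);
      if (it == jobs.end())
        continue;
      jobs.erase(it);
      auto cur = m_current.find(entry.first);
      if (cur != m_current.end() && cur->second == job)
        m_current.erase(cur); // assumption: no automatic fallback to another job
      return true;
    }
    return false;
  }

  // Return the job with the given id, or nullptr if none matches.
  JobPtr findJob(const std::string& id) const
  {
    for (const auto& entry : m_jobs)
      for (const auto& job : entry.second)
        if (job->id == id)
          return job;
    return nullptr;
  }

  JobPtr getCurrentJob(const ResourcePtr& resource) const
  {
    auto it = m_current.find(resource);
    return it == m_current.end() ? nullptr : it->second;
  }

  // All jobs stored in the project.
  std::vector<JobPtr> getJobs() const
  {
    std::vector<JobPtr> all;
    for (const auto& entry : m_jobs)
      all.insert(all.end(), entry.second.begin(), entry.second.end());
    return all;
  }

  // Jobs stored for one attribute resource.
  std::vector<JobPtr> getJobs(const ResourcePtr& resource) const
  {
    auto it = m_jobs.find(resource);
    return it == m_jobs.end() ? std::vector<JobPtr>() : it->second;
  }

  // Assumption: a null job clears the current designation for the resource.
  void setCurrentJob(const ResourcePtr& resource, const JobPtr& job)
  {
    if (job)
      m_current[resource] = job;
    else
      m_current.erase(resource);
  }

private:
  std::map<ResourcePtr, std::vector<JobPtr>> m_jobs;
  std::map<ResourcePtr, JobPtr> m_current;
};
} // namespace sketch
```

Keeping the current-job designation in its own map means a resource can have stored jobs without any of them being current, which matches the "if any" wording on getCurrentJob().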
Issues and Related Topics
Job instances will need to migrate between the modelbuilder server and client processes.
I had earlier posed a question of whether or not to save a copy of the simulation attributes in the Job instance. My revised position is that the job instance should, at a minimum, store a checksum for each resource used to generate the simulation input data. This could be replaced by resource versions in the future.
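One way the checksum idea could work is sketched below. The `JobInputRecord` class and its method names are hypothetical, resources are identified by a plain string, and the 64-bit FNV-1a hash is used only as a simple stand-in; a real implementation would more likely hash the serialized resource with a stronger algorithm.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

namespace sketch
{
// 64-bit FNV-1a hash of a byte string (a non-cryptographic stand-in checksum).
inline std::uint64_t fnv1a64(const std::string& bytes)
{
  std::uint64_t hash = 14695981039346656037ULL; // FNV offset basis
  for (unsigned char c : bytes)
  {
    hash ^= c;
    hash *= 1099511628211ULL; // FNV prime
  }
  return hash;
}

// Hypothetical per-job record of the resources used to generate simulation input.
class JobInputRecord
{
public:
  // Store the checksum of a resource's serialized contents at submission time.
  void record(const std::string& resourceName, const std::string& contents)
  {
    assert(!resourceName.empty() && "resource name must not be empty");
    m_checksums[resourceName] = fnv1a64(contents);
  }

  // True only if the resource was recorded and its contents are unchanged.
  bool matches(const std::string& resourceName, const std::string& contents) const
  {
    auto it = m_checksums.find(resourceName);
    return it != m_checksums.end() && it->second == fnv1a64(contents);
  }

private:
  std::map<std::string, std::uint64_t> m_checksums;
};
} // namespace sketch
```

With a record like this, the project could warn when a job's results are about to be reused even though the attributes that produced them have since been edited.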