Managing the Resources of a Project

Bob_Obara · January 13, 2020, 7:46pm

See Project Design Discussion for background information.

Knowing which resources of a project to load into memory and when.

Assume that a project requires a set of resources {R1, R2, R3}. In the new design, this can be modeled using the internal attribute resource owned by the Project, which relieves the project of having a separate data store for holding onto the resources. When the project is loaded into memory, R1 and R2 should be loaded as well but R3 should not (perhaps R1 is the model, R2 is the analysis mesh that the user wishes to use, and R3 is a coarser mesh that the user might use in the future). How do we capture this and what mechanism is to be used to perform the action?

1. Project does all the work

This approach requires each derived project to provide an operation to explicitly load the resources into memory. In this example, the designer would create a LoadOperation that would load in R1 and R2 (if they’re set) when the project loads.

2. Base Project provides an initial structure.

In this case, the base project class could provide a simple structure to its internal attribute resource. For example it could have an Attribute Definition called “LoadImmediately” and one called “LoadLater”. A derived project could then add a set of ResourceItems to either of these Definitions and the base project load operation would load all resources listed in the former but not those in the latter. In the above example, R1 and R2 would be captured by ResourceItems under the “LoadImmediately” Attribute and R3 would be a ResourceItem under the “LoadLater” Attribute.
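A minimal sketch of this structure in Python, assuming two plain lists stand in for the two Attribute Definitions. The names `ProjectResources`, `load_project`, and `loader` are hypothetical and are not part of SMTK:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ProjectResources:
    """Hypothetical stand-in for the project's internal attribute resource."""
    load_immediately: List[str] = field(default_factory=list)  # "LoadImmediately"
    load_later: List[str] = field(default_factory=list)        # "LoadLater"


def load_project(resources: ProjectResources,
                 loader: Callable[[str], object]) -> Dict[str, object]:
    """Base-project load operation: load only the "LoadImmediately" entries."""
    return {name: loader(name) for name in resources.load_immediately}


# R1 and R2 are loaded with the project; R3 is only recorded for later use.
resources = ProjectResources(load_immediately=["R1", "R2"], load_later=["R3"])
loaded = load_project(resources, loader=lambda name: f"<{name} in memory>")
print(sorted(loaded))
```

A derived project only fills in the two lists; the load operation itself lives in the base class and is reused.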

3. Extend Resource Item and its Definition

This approach delegates this responsibility to the ResourceItem and ResourceItemDefinition classes. The ResourceItemDefinition class could have a new property called LoadRequirements with values {LoadImmediately, LoadOnDemand}. An operation could then be created that would walk the attribute resource and load those resources marked with LoadImmediately. In addition, the ResourceItem would also have a boolean property called LoadImmediately that returns true if its definition has LoadRequirements = LoadImmediately or if it is marked to be loaded explicitly. The idea is to capture the case where the user has loaded the resource into memory explicitly and wants the workflow to remember to load the resource the next time the project is loaded into memory.

In the case of the above example, the ReferenceItemDefinitions for R1 and R2 would have LoadRequirements = LoadImmediately and R3 would have it set to LoadOnDemand.
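The following Python sketch illustrates how the definition-level setting and the per-item explicit mark could combine. It is an illustration only: the classes below merely borrow the names used in this post, and `explicitly_marked` and `resources_to_load` are hypothetical names, not existing SMTK API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class LoadRequirements(Enum):
    LOAD_IMMEDIATELY = "LoadImmediately"
    LOAD_ON_DEMAND = "LoadOnDemand"


@dataclass
class ResourceItemDefinition:
    name: str
    load_requirements: LoadRequirements = LoadRequirements.LOAD_ON_DEMAND


@dataclass
class ResourceItem:
    definition: ResourceItemDefinition
    explicitly_marked: bool = False  # set when the user loads the resource by hand

    @property
    def load_immediately(self) -> bool:
        # True if the definition requires it or the item was marked explicitly.
        required = self.definition.load_requirements is LoadRequirements.LOAD_IMMEDIATELY
        return required or self.explicitly_marked


def resources_to_load(items: Iterable[ResourceItem]) -> List[str]:
    """Generic operation: walk the items and pick those to load right away."""
    return [item.definition.name for item in items if item.load_immediately]


r1 = ResourceItem(ResourceItemDefinition("R1", LoadRequirements.LOAD_IMMEDIATELY))
r2 = ResourceItem(ResourceItemDefinition("R2", LoadRequirements.LOAD_IMMEDIATELY))
r3 = ResourceItem(ResourceItemDefinition("R3"))  # defaults to LoadOnDemand

initial = resources_to_load([r1, r2, r3])

# The user loads R3 by hand and wants it remembered for the next project load.
r3.explicitly_marked = True
after_explicit_load = resources_to_load([r1, r2, r3])
print(initial, after_explicit_load)
```

Because the decision lives on the item rather than on the project, the same walk could be reused by non-project workflows.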

Benefits

Approach 1

Simplest to implement.

Approach 2

Very simple structure that makes it very easy to define new types of projects.
Base Project can provide a load operation that derived projects could reuse.

Approach 3 (My Current Choice)

Most flexible and can be used outside of the Project Concept
Generic Resource Load operation could be created and used both by project and non-project workflows
Base Project does not enforce any information structure in terms of capturing a project’s resource requirements

Drawbacks

Approach 1 - Every project will need to implement its own “load” operation
Approach 2 - Initial structure can appear limiting
Approach 3 - None that I see

Relating Project Resources with Task Resources

Let’s assume we have a project that has a set of resources. How can a Task in the Project’s workflow refer to these resources? Once again there are several options:

The Task uses a ResourceItem that is manually set to the correct resource (either by hand or through an operation). In this case it’s up to the project logic to get this right.
Use Properties or LinkRoles - in this approach, the Task ResourceItem would be constrained to hold only specific resources, which could be tested when determining if the user is allowed to enter that Task. Note that the operation to change the Property/Link Association would not be contained in the Task itself, meaning that the user would need to leave the Task and enter the one that provides the operation. Note also that this would allow the Project to represent a resource as LoadOnDemand while the Task could have it marked as LoadImmediately.
Use something like an ItemVariant to directly refer to the item in the Project

Projects and the Resource Manager

Lastly, I wanted to bring up the issue of how Projects and their Resources work with respect to the Resource Manager. Let’s assume we have the following:

1. Project 1 (P1) requires Resource R1.
2. Project 2 (P2) requires Resources R1 and R2.
3. The user loads in P2, which means R1 and R2 should be accessible in the Resource Manager.
4. The user then loads in P1, which doesn’t require anything to be loaded since R1 is already loaded via P2.
5. Now the user unloads P2. R2 should be unloaded as well, though R1 is still needed.
6. Finally the user unloads P1, which should also result in R1 being unloaded.

Currently the Resource Manager holds a shared pointer to the resource when it is loaded, which means that, without any modification to the code, resources loaded in via Projects would not get removed when the Project is removed.

Here are a couple of options:

Project Resources are held by the Project and not the Resource Manager

This would solve the above problem but at the cost of introducing a bigger problem. Namely Projects would need to provide the same search functionality the Resource Manager provides.

Resource Manager has the ability to hold a Resource using either shared or weak pointers

If there were a way to indicate that a resource should be added to the Resource Manager using either shared or weak pointers, then the above example would work as described, assuming that the P1 and P2 Resources instructed the Resource Manager to hold them using weak pointers.

This would also address the following modification of the above example:
4a. The user directly loads in R1

In Step 4a, R1 would be found to be in the ResourceManager as a weak pointer. The resource load operation would then add it to the ResourceManager as a shared pointer as well.

Since the ResourceManager uses a multi-index array, the simplest way of possibly supporting this would be to have the ResourceManager maintain two arrays - one using shared pointers and one using weak pointers. The same resource could exist in both depending on the circumstances.
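To make the two-table idea concrete, here is a rough Python analogue. It is not the ResourceManager's real interface: a plain `dict` plays the role of the shared-pointer array, a `weakref.WeakValueDictionary` plays the weak-pointer array, and the `Project` class and the `add`, `find`, and `names` methods are hypothetical. Prompt release of the resources relies on CPython reference counting, which is the closest Python equivalent of `shared_ptr` semantics.

```python
import gc
import weakref


class Resource:
    def __init__(self, name: str) -> None:
        self.name = name


class ResourceManager:
    """Hypothetical manager with one strong table and one weak table."""

    def __init__(self) -> None:
        self._strong = {}                           # plays the shared-pointer array
        self._weak = weakref.WeakValueDictionary()  # plays the weak-pointer array

    def add(self, resource: Resource, strong: bool) -> None:
        self._weak[resource.name] = resource
        if strong:
            self._strong[resource.name] = resource

    def find(self, name: str):
        return self._weak.get(name)

    def names(self):
        return sorted(self._weak.keys())


class Project:
    """Owns its resources; registers them with the manager weakly."""

    def __init__(self, manager: ResourceManager, names) -> None:
        self.resources = []
        for name in names:
            resource = manager.find(name) or Resource(name)
            manager.add(resource, strong=False)
            self.resources.append(resource)


def scenario(load_r1_directly: bool):
    manager = ResourceManager()
    p2 = Project(manager, ["R1", "R2"])   # user loads P2
    p1 = Project(manager, ["R1"])         # user loads P1; R1 is reused
    if load_r1_directly:                  # step 4a: user loads R1 directly
        manager.add(manager.find("R1"), strong=True)
    del p2                                # user unloads P2
    gc.collect()
    after_p2 = manager.names()
    del p1                                # user unloads P1
    gc.collect()
    after_p1 = manager.names()
    return after_p2, after_p1


print(scenario(load_r1_directly=False))
print(scenario(load_r1_directly=True))
```

In the first run R1 disappears once the last project holding it is unloaded; in the second run the strong entry added in step 4a keeps R1 available after both projects are gone.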

It’s the Project Manager’s Job

This is sort of the dual of the above approach. The Project Manager could keep track of all Projects in the system (which would be its main job). When a Project is being unloaded, the Project Manager would grab the Project’s list of loaded resources and remove those Resources that neither exist in another Project nor have been directly added to the Resource Manager.
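The bookkeeping this implies can be sketched in a few lines of Python. The `ProjectManager` class below, its `direct` set, and its method names are hypothetical and only illustrate the rule for deciding what to remove:

```python
from typing import Dict, Iterable, Set


class ProjectManager:
    """Hypothetical bookkeeping of which loaded projects use which resources."""

    def __init__(self) -> None:
        self.projects: Dict[str, Set[str]] = {}  # project name -> resource names
        self.direct: Set[str] = set()            # resources the user loaded directly

    def load_project(self, name: str, resources: Iterable[str]) -> None:
        self.projects[name] = set(resources)

    def unload_project(self, name: str) -> Set[str]:
        """Return the resources to remove from the Resource Manager."""
        released = self.projects.pop(name)
        # Anything used by a remaining project or loaded directly must stay.
        still_used = set().union(*self.projects.values()) | self.direct
        return released - still_used


pm = ProjectManager()
pm.load_project("P2", ["R1", "R2"])
pm.load_project("P1", ["R1"])
removed_with_p2 = pm.unload_project("P2")
removed_with_p1 = pm.unload_project("P1")
print(removed_with_p2, removed_with_p1)
```

If the user had also loaded R1 directly (`pm.direct.add("R1")`), unloading P1 would return an empty set and R1 would stay in the Resource Manager.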

jacob.vaverka · January 13, 2020, 9:59pm

I agree with your conclusion. It seems to me that extending Resource Item and its definition provides the most flexibility, and therefore offers the most potential for SMTK managing resources throughout the lifespan of a project.

johnt · January 14, 2020, 1:26pm

Regarding item #1, I agree that adding this capability to ResourceItem is the way to go. Before moving on, I think we should also clarify, and possibly extend, what the term “load on demand” means. For example, when editing a project’s attribute resource, I presume we will be able to set a ResourceItem value to some project resource whether that resource is in memory or not. Similarly, the ResourceItem.value() should be able to return some kind of info/placeholder/descriptor for a resource not currently in memory. In other words, the process of loading an on-demand resource should always be initiated by the application, not SMTK.

Regarding item #2, I don’t have an opinion on the implementation, at least not yet. The one recommendation I have is to reframe the issue. The current wording considers how workflows can refer to project resources. In contrast, I think the reverse perspective is more applicable. In other words, each workflow should define the resources it uses with whatever mechanism TBD, and it will be the project’s job to provide storage and management of them. So I see 2 main parts to this issue:

How do workflows specify and represent their resources (and data assets in general)
What API does the project provide for workflows to access those resources/assets

Also, because this is a new project/workflow discussion, I am contractually obligated to reiterate my usual comment that we need to extend projects to support simulation assets that are not SMTK resources. This is a requisite feature.

jacob.vaverka · January 14, 2020, 3:59pm

Well said, and an important distinction I think.

Sorry to make you repeat yourself, but could you provide a brief example for clarity?

johnt · January 14, 2020, 4:20pm

By SMTK resources I mean smtk::resource::Resource and its subclasses. This comprises the various model types that SMTK supports along with our mesh and attribute resource types. I presume that our first implementation will be limited to these data types, but there are other data generated and used during a simulation workflow/lifecycle that we will also need to support, most notably simulation input decks and output results. (There is also at least one more data type that I remember from a nuclear application.) So my comment is mostly to remind our guys that we need to generalize our notion of “resource” to include other data types that are not based on SMTK.

jacob.vaverka · January 14, 2020, 4:39pm

Thank you! I was hoping to hear this. As you pointed out, supporting these assets brings continuity to the simulation workflow/lifecycle (and Project).

amuhsin · January 20, 2020, 6:54pm

What project structure will these solutions enforce? How will it facilitate the reuse of resources between projects?

If we assume that each project gets its own directory then importing the same resource R1 into more than one project would result in an smtk representation that contains the same tessellation data (assuming R1 is a tessellated mesh) but with different UUIDs. At that point it becomes difficult to assume that R1 imported into Proj1 is the same R1 that was imported into project Proj2.

The smtk representation of R1 contains the file path of the original resource. We can possibly rely on that but it doesn’t cover the use-case where the user may have two identical copies of R1 that live in two different locations.

Furthermore, how do we handle the case where a user wants to use the same resource (R1) in more than one project but doesn’t want the changes made to R1 in Proj1 to be reflected in R1 in Proj2?

Bob_Obara · January 22, 2020, 2:56am

The image of the Project Manager shows a use case where Resource R1 is shared between Projects 1 and 2. In order to be the most flexible I would propose the following:

A Project may choose to “own” its resources - meaning that at least the Resource’s SMTK file is located within the Project’s directory and should not be shared. Now there is the matter of the “native” representation of the Resource, as in the case of parametric models and meshes. For example, a part’s geometry may exist in a CAD kernel like Parasolid. Therefore, we would also need to support both cases: where the native representation is contained within the project and where it is stored externally.
A Project can refer to a Resource but not store it under its directory. In this case the Resource would be considered to be “Shared” instead of owned.

This does introduce 2 subtopics:

Portability

A Project that owns all of its Resources (including those that are referred to by its attribute resources) is 100% portable and can be sent to other users. Those that contain either shared Resources or shared native representations are limited in terms of portability.
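As a rough illustration of the portability rule, the Python sketch below treats a project as portable when every file it references sits under the project directory. The function names and example paths are hypothetical, and the check is purely lexical: it assumes normalized absolute paths and does not resolve symbolic links or `..` segments.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def external_files(project_dir: str, files: Iterable[str]) -> List[str]:
    """Return the referenced files that live outside the project directory."""
    root = PurePosixPath(project_dir)
    return [f for f in files if root not in PurePosixPath(f).parents]


def is_portable(project_dir: str, files: Iterable[str]) -> bool:
    """A project is portable only if it owns every file it references."""
    return not external_files(project_dir, files)


referenced = [
    "/projects/p1/R1.smtk",        # owned SMTK file
    "/projects/p1/native/R1.stl",  # owned native representation
    "/shared/R3.smtk",             # shared Resource stored elsewhere
]
shared = external_files("/projects/p1", referenced)
print(shared, is_portable("/projects/p1", referenced))
```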

Versioning

In your example, where the user wants to share a Resource between 2 Projects and then change the Resource for one Project and not the other, I assume you mean that there are now two versions of the shared Resource. (If not, this would be a very difficult task when dealing with geometric and meshing resources, since a modification can have a large impact on the resource.) In that case the user would do a save as operation on the Resource within Project 2. The one outstanding issue is how to prevent the user from doing a save on a shared Resource instead of a save as.

Does that make sense?

amuhsin · January 22, 2020, 3:31pm

Yes it does. Thank you for the explanation.

Bob_Obara:

The image of the Project Manager shows a use case where Resource R1 is shared between Projects 1 and 2. In order to be the most flexible I would propose the following:

A Project may choose to “own” its resources - meaning that at least the Resource’s SMTK file is located within the Project’s directory and should not be shared. Now there is the matter of the “native” representation of the Resource, as in the case of parametric models and meshes. For example, a part’s geometry may exist in a CAD kernel like Parasolid. Therefore, we would also need to support both cases: where the native representation is contained within the project and where it is stored externally.

A Project can refer to a Resource but not store it under its directory. In this case the Resource would be considered to be “Shared” instead of owned.

It seems like there needs to be a separation between SMTK files and source data files that they are abstracting (models, meshes, etc.). The models and meshes are the Resources that we want to share between projects. Whereas an SMTK file represents a Resource’s state in a project.

For example, if I were to import R1.stl into project1 and project2 each project should by default have its own R1.smtk file that references R1.stl.

I see a few benefits in doing this:

We can use the .smtk files to store whether the tessellations should be loaded in for visualization instead of giving that responsibility to the project.
SMTK files contain the resource’s topological hierarchy, meaning we can implement a ‘lazy load’ feature that will enable the user to preview what a file contains before deciding to load it in.
Finally, after a user decides to load in an .stl file for visualization (after previewing it in the ‘lazy-loaded’ state) we can use the Project’s SMTK file to store the visibility state of each part. That would allow us to persist the state between saves. Also, since each project that uses that .stl file has its own .smtk file, each project can have its own visibility state.
- For example, R1.stl is used in project1 and project2. In project1 I keep it in the ‘lazy-loaded’ state, so I only see the structure of the file in the Resource browser, while in project2 I decide to also load in the tessellations for visualization in the render window.
- Another example would be that I go ahead and load the tessellations for visualization in both project1 and project2, but in project1 I mark PartA as hidden and in project2 I mark PartB as hidden.

johnt · January 22, 2020, 7:28pm

+1

I agree that sharing resources (in the Kitware/SMTK sense) is much less important than sharing the underlying/native data. Ideally, the focus on data versioning would be more useful if applied to that underlying data, whether through a repository (e.g., dvc.org) or something more distributed like we intend to build for DOE/SBIR.

dcthomp · January 22, 2020, 9:20pm

It is a bit tangential, but when it comes to versioning, it would be nice for SMTK to provide some support to applications:

When a resource/project is written, any files it references should be checksummed and that checksum included in the resource’s reference.
When a resource/project is loaded, the checksums for files should be verified as they are read and, at a minimum, a log message emitted when there is a mismatch.

SMTK should provide the utilities to do checksumming so this is easy for resources.
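A sketch of what such utilities might look like, written in Python with the standard `hashlib` and `logging` modules. The choice of SHA-256, the function names, and the dictionary used to record checksums are assumptions made for illustration; nothing here reflects an existing SMTK utility.

```python
import hashlib
import logging
import tempfile
from pathlib import Path

log = logging.getLogger("checksum-sketch")


def file_checksum(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large meshes need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as stream:
        for chunk in iter(lambda: stream.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksums(paths) -> dict:
    """On write: store one checksum per referenced file."""
    return {str(p): file_checksum(p) for p in paths}


def verify_checksums(recorded: dict) -> list:
    """On load: log a message for each mismatch and return the affected files."""
    mismatched = [p for p, expected in recorded.items()
                  if file_checksum(Path(p)) != expected]
    for p in mismatched:
        log.warning("checksum mismatch for %s", p)
    return mismatched


with tempfile.TemporaryDirectory() as tmp:
    native = Path(tmp) / "R1.stl"
    native.write_bytes(b"solid R1\nendsolid R1\n")
    recorded = record_checksums([native])   # taken when the resource is written
    unchanged = verify_checksums(recorded)  # file untouched: no mismatch
    native.write_bytes(b"solid R1 edited\nendsolid R1 edited\n")
    changed = verify_checksums(recorded)    # file modified behind SMTK's back
```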

dcthomp · January 22, 2020, 9:28pm

I have also seen “shared” vs “owned” semantics called “linked” vs “embedded” (Apple’s Keynote, Inkscape). We might also call files “public” or “private,” but we should standardize on some terms in the documentation.

dcthomp · January 23, 2020, 2:53pm

This brings up some concerns about shared/linked/external resources. Are these resources going to have references back to the projects that use them? Without this, I can see problems with resources getting out of sync when they are modified while only a subset of referencing projects are loaded in memory. If – whenever a project decides to reference an external resource – it saves the resource with an added reference back to the project, then the act of saving a resource gives applications an opportunity to do a copy-on-write (perhaps with input from the user).

Storing this list of referencing projects means one potential solution to “Save As…” would be for the application to disallow saving resources individually; trying to save a resource would result in saving all referencing projects (to maintain consistency).

As mentioned elsewhere, it is not the resources themselves that are currently very large in our use cases – it is non-SMTK files (model/mesh data) that tend to be large. For those files, we do not have the luxury of adding references back to resources that use them (we don’t control the file formats). However, those files will never be saved directly; they will be imported/exported (potentially inside the “save” operation of a resource). Arguably, modifying external non-SMTK files should result in changing them from external/shared/linked to internal/owned/embedded (since other not-in-memory SMTK resources might reference them). But we should at least warn when an export operation would overwrite an external/linked/shared file.

johnt · January 24, 2020, 2:37pm

In the interest of covering all use cases, here is an example directly applicable to the DOE project:

A user has a genesis model that will be used in multiple projects.
In the first project, the user submits a job to NERSC that runs ACDTool/meshconvert to generate the corresponding netcdf file. The user will download the mesh quality report, but there is no need to download the netcdf file itself.
In subsequent projects, the user wants to use the local genesis model to specify simulations. Then, when submitting simulation jobs to NERSC, we want to use the netcdf file as the analysis mesh. So basically we want to use the local genesis file/resource as a proxy for the remote netcdf file.

Although this probably goes beyond the scope of the initial project system design, it is something we should keep in mind. FYI we should have some TBD feasibility demonstration for this in the next few months.

Bob_Obara · January 24, 2020, 11:16pm

The only reason I can think of for sharing the SMTK form of a model resource would be sharing the conceptual information that was added to the underlying model. For example, if you were to assign properties to an STL model and wanted to share that conceptual resource between different projects.

Bob_Obara · January 26, 2020, 2:44am

I think checksumming is useful - the main issue I have is that storing the checksum in a file changes its checksum, so things can get complicated, especially if two resources refer to each other. In addition, if we were to store which projects are using the resource, as you mentioned in a later section, this would potentially also change the resource’s checksum even though it doesn’t change its content.

One possible solution would be to use libarchive and make a resource an archive of three parts:

The contents of the resource
The checksum of the first part
The list of potential users (i.e. projects) of the resource

There could be a potential 4th part which could be the version history but I’m assuming the version itself would be stored in the contents section.
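The three-part layout can be tried out with a small Python sketch. Note the substitution: Python's standard `zipfile` module stands in for libarchive here, and the member names `contents`, `checksum`, and `users.json` are hypothetical. The point it illustrates is that the checksum covers only the contents member, so changing the list of users leaves it untouched.

```python
import hashlib
import io
import json
import zipfile


def write_resource_archive(contents: bytes, users: list) -> bytes:
    """Pack a resource as three members: contents, checksum, and users."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("contents", contents)
        archive.writestr("checksum", hashlib.sha256(contents).hexdigest())
        archive.writestr("users.json", json.dumps(sorted(users)))
    return buffer.getvalue()


def read_checksum(blob: bytes) -> str:
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        return archive.read("checksum").decode()


def contents_match(blob: bytes) -> bool:
    """Recompute the checksum of the contents member and compare."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        actual = hashlib.sha256(archive.read("contents")).hexdigest()
        return actual == archive.read("checksum").decode()


one_user = write_resource_archive(b"resource data", ["Proj1"])
two_users = write_resource_archive(b"resource data", ["Proj1", "Proj2"])

# Adding Proj2 as a user changes the archive but not the stored checksum.
print(read_checksum(one_user) == read_checksum(two_users))
```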

dcthomp · January 27, 2020, 3:56pm

@Bob_Obara I don’t think what was proposed would run into these issues because:

Storing the checksums of native files in the resource would not change the checksum of the native file; likewise, storing a link to a resource from a project would not alter the resource (only the project, since links are unidirectional).
Saving a project should imply that its resources are saved if they are modified, so even storing bidirectional links between projects and resources should be fine as long as child resources are saved before parent projects.