Operational model for SMTK

dcthomp · September 19, 2022, 5:50pm

There is currently a bug in SMTK that allows operation observers to run after the operation’s resource locks have been released. This results in other operations modifying the same resources while observers attempt to update the UI in response to the first operation.

There is a way to force observers to run in the GUI thread while the operation’s thread holds the resource locks, but this can cause deadlocks if the GUI thread ever attempts to obtain a resource lock – because now an operation can be waiting for the GUI event loop to run observers while the GUI event loop is running code that is waiting for the operation’s resources to become free.

For this reason, we are proposing …

GUI Code Must Never Attempt To Acquire Resource Locks

If the GUI thread never blocks waiting for resource locks, then it is safe for Qt code to block until GUI code runs.

This rule forms the basis for an operational model for SMTK applications. As a corollary, the second tenet of this operational model is that …

GUI Applications Must Always Launch Operations …

… as opposed to calling operate() on them directly. Because smtk::operation::Operation’s operate() method acquires resource locks, it must always run on a non-GUI thread.

We have considered making operate() a protected method, but this greatly complicates Python scripting. If you always had to launch operations, even in Python scripts, scripting would become tedious and verbose.

Instead we propose emitting an error message when operate() is invoked in a GUI thread. This error message would cause test failures (to prevent bad code from getting merged) and, should a deadlock occur, indicates the cause of the deadlock.

One minor variation we might consider is adding a new method to operations named launch() that returns a std::future<Result> instead of a Result and queues the operation to run in a separate thread. Scripts could simply wait for the future to become available. However, waiting like this would leave GUI non-responsive.

Also, you might be asking “What about the Python shell inside modelbuilder?” Yes, even here calls to operate() must be avoided.

Another corner case to consider is Python operations (i.e., operations written in Python that inherit smtk.operation.Operation). Because of the GIL, this operation must be run in the GUI thread. Some work may need to be done to lock resources on a separate thread and then signal the operation to begin on the GUI thread.

Once we have audited SMTK to ensure the rules above are met, we will modify pqSMTKCallObserversOnMainThread to properly block operations from releasing resource locks until the main thread completes running the operation’s observers.

johnt · September 28, 2022, 12:55pm

Relevant to this, we still build several modelbuilder packages with SMTK operation threads disabled because of problems with c++ operations that import and run python operations. I revisited my branch that demonstrates this behavior (https://gitlab.kitware.com/cmb/smtk/-/merge_requests/2396) and rebased it to the current smtk:master. The test is called UnitTestCxxPythonOperation and uses a c++ operation that: (i) imports a python operation, (ii) executes the python operation, (iii) unregisters the python operation. The overall test essentially runs the c++ operation twice.

In the past, this test would hang in the second iteration when the c++ operation tried to import the python operation. We presumed this was because the first instantiation of the python op was still alive somewhere and hanging on to the GIL.

Unfortunately, the rebased version of the test seems to be worse: it now hangs the first time the c++ operation tries to import the python operation. It is not obvious what the underlying problem is, but gdb tells me the hang occurs in the pybind::exec() call in PythonInterpreter::loadPythonSourceFile().

I hope we can resolve this problem as part of any update to the operation launching code. For now, we still need the SMTK cmake option to run operations in the main thread.

dcthomp · September 28, 2022, 4:44pm

I’ve built and run the test. It hangs, even when I change the various calls from ->operate() to ->operate(Key{}) (to run without resource locking). That tells me that this problem is unrelated to resource locking. And as you note, it hangs trying to load python source and in particular trying to obtain the GIL. Some more digging tells me it is hanging trying to load the smtk.attribute module (at the top of PyInit__smtkPybindAttribute(), which tries to obtain the GIL inside the PYBIND11_MODULE(_smtkPybindAttribute, attribute) macro).

dcthomp · September 28, 2022, 9:58pm

After some more discussion, it appears python operations must run on the GUI thread in order for the GIL locking to work (at least as far as we understand it, because modelbuilder/paraview’s python shell evaluates python on the GUI thread). So, there are two locks we must deal with

Resource locking – should never happen on GUI thread to avoid deadlocks
Python GIL locking – should always happen on the GUI thread to avoid deadlocks

I think by modifying the base python smtk.operation.Operation class (which inherits smtk::operation::Operation), we can add a facility for operateInternal() to run on the GUI thread after the resource locking has been achieved on a non-GUI thread. This facility would be a static functor that Qt plugins can override; the functor is what invokes Operation::operateInternal(). The same facility could be used by any operation to ensure it is run on the GUI thread after locks have been obtained in a separate thread.

class Operation {
protected:
  std::function<Result(Operation*)> m_threadShiftingFunctor;
  std::function<Result(Operation*)> s_guiThreadShifter;
public:
  Operation() {
    // Provide a default that does not shift threads:
    m_threadShiftingFunctor =
      [](Operation* op) { return op->operateInternal(); };
  }

  Result operate() {
    Result result;
    // … acquire read/write resource locks.
    // … invoke pre-op observers (already thread-shifted, just not properly).
    result = m_threadShiftingFunctor(this);
    // … invoke observers (already thread-shifted, just not properly).
    // … release read/write resource locks.
  }

  static void setThreadShiftingMethod(std::function<Result(Operation*)> tsm) {
    if (tsm) { s_guiThreadShifter = tsm; }
  }
};

Then, Python operations (or any other operation that must run on the GUI thread) can re-assign this functor in its constructor:

class PyOperation : public smtk::operation::Operation
{
public:
  PyOperation()
  {
    m_threadShiftingFunctor = s_guiThreadShifter;
    // … and whatever else
  }
};

and ParaView plugins (or other GUI applications) can provide the means for running the method on the GUI thread:

class pqSMTKOperationThreadShifter : public QObject
{
public:
  void startup()
  {
    // When requestOperation is emitted, force it to wait until
    // runOperation finishes its invocation on this thread (the GUI thread):
    QObject::connect(
      this, &pqSMTKOperationThreadShifter::requestOperation,
      this, &pqSMTKOperationThreadShifter::runOperation,
      Qt::BlockingQueuedConnection);

    // Force any operations created after this call *and* which
    // override the default thread-shifter to run on the GUI thread:
    smtk::operation::Operation::setThreadShiftingMethod(
      [this](Operation* op) { Q_EMIT requestOperation(op); });
  }
  public Q_SIGNALS:
    Result requestOperation(Operation*);
  protected Q_SLOTS:
    Result runOperation(Operation* op)
    {
      return op->operateInternal();
    }
};

Note that long-running python operations will cause the GUI to hang. There doesn’t appear to be anything we can do about that.

johnt · October 4, 2022, 4:52pm

Shoot, I should have replied last week. If you built and ran the c++/python/operation test, would you do one more test iteration please:

Edit UnitTestCxxPythonOperation.cxx:45 to uncomment the line, adding a #define var:

45 #define RUN_PYTHON_OP

dcthomp · October 12, 2022, 2:33pm

Progress on implementation

As a first step, SMTK MR 2808 generates warnings whenever a smtk::resource::Lock is lock()-ed on the main thread. Using this, I’ve identified several categories of potential deadlocks we need to work around along with potential workarounds:


Problem	Several UI elements run operations directly. E.g., `smtk::view::ComponentPhraseContent::editStringValue()`, `smtk::extension::qt::editColorValue()`.
Solution	Launch the operations instead.

Problem	When initializing a resource-manager observer, each resource is locked before invoking the observer.
Solution	Run the iteration in a thread and ensure code adding observers does not expect immediate population. We may need to return a std::future that can be joined.

Problem	Phrase model’s `addSource()` visits all resources in the manager to produce top-level phrases.
Solution	Run this in a thread, being careful to return to GUI thread to invoke phrase-model observers.

Problem	When the operation::Manager creates an operation, it creates an attribute resource to hold parameters. This locks the resource. Since all operations of the same type share a resource (specification) that holds their attributes (parameters). A deadlock is possible but highly unlikely, but theoretically possible.
Solution	We could make each operation have its own specification. We will need to find some way to either prevent locking or ignore the warning when operations are created on the main thread.

Problem	Several potential hangs on exit. E.g.: `pqSMTKBehavior::removeManagerFromServer()` removes resources from the resource manager directly.
Solution	Most of these could be eliminated. However, if a long-running operation is still going when the user quits, Windows could keep the process alive a long time.

None of these are show-stoppers, but you can see it will be a fair amount of work. Until these are addressed, we should not protect smtk::operation::Operation::operate().