[wip] exception handling spec: mainly rework section B.4

7d15d35d · Martin Christoph Hierholzer · 24eb5ed2 · 7d15d35d
Commit 7d15d35d authored 4 years ago by Martin Christoph Hierholzer
--- a/doc/spec_exceptionHandling.dox
+++ b/doc/spec_exceptionHandling.dox
@@ -5,19 +5,6 @@ namespace ChimeraTK {
 <b>DRAFT VERSION, WRITE-UP IN PROGRESS!</b>
-\section spec_execptionHandling_changes Recent changes
-This section lists bigger recent changes, which might be hard to track due to restructuring the document at the same time. This section will go away once the changes have been reviewed.
- Write never blocks in case of exceptions. The following spec points (with discussion why) were hence <b>removed/replaced</b>. See also the mattermost channel.
-  - <strike>Write operations will block immediately until the device has been recovered and the write operation has been completed. [TBD: is this really a good idea? <b>COMMENT</b>: The order of write operations is still not guaranteed through the recovery accessors (which maybe should be changed), and blocking writes have some severe drawbacks. Not only in fan outs but also in normal ApplicationModules blocking writes will prevent propagation of DataValidity flags! Blocking writes might help if a sequence of values is written to the same register - this is not handled by the recovery accessor. But if a handshake register is read back in between the writes, the situation can already be handled properly (check DataValidity flag, restart sequence after recovery). Maybe blocking writes create more problems than they solve!? On the other hand, how does the application then know that a write() has no effect yet? E.g. a PI controller might wind up if actuator and sensor are on different devices and the actuator fails. Then again, how is this different from failing actuator hardware without breaking the communication? Some form of a status readback of the actuator again cures the situation. I think I am in favour of "fire-and-forget" writes.].</strike>
-    - <strike>Write should not block in case of an exception for the outputs of ThreadedFanOut / TriggerFanOut.</strike>
-    - <strike>According to \link spec_initialValuePropagation \endlink, writes in ApplicationModules do not block before the first successful read in the main loop.</strike>
- The order of writes during recovery (through recoveryAccessors) is now guaranteed to be the same as the original writes.
- Direct access to the DataFaultCounter is not necessary. Since the spec says the behavior should be transparent whether a connection is directly made to the device or another ApplicationModule/FanOut is in between, it is sufficient to override the flag returned by ExceptionHandlingDecorator::dataValidity() in case of an exception state. This greatly simplifies the implementation and does not change the behavior.
 \section spec_execptionHandling_intro Introduction
 Exceptions are handled by ApplicationCore in a way that the application developer does not need to care much about it.
@@ -72,6 +59,8 @@ When the device is functional, it be (re)initialised by using application-define
  - 4.1 Even if some devices are initially in a persisting error state, the part of the application which does not interact with the faulty devices starts and works normally.
  - 4.2 Initial values are correctly propagated after a device is opened. See \link spec_initialValuePropagation \endlink. Especially, all read operations (even readNonBlocking/readLatest) will be frozen until an initial value has been received. (*)
+- 5. Any ApplicationModule can explicitly report a problem with the device by calling DeviceModule::reportException(). This allows the reinitialisation of a device e.g. after a reboot of the device which didn't result in an exception (e.g. because it was too quick to be noticed, or rebooting the device takes place without interrupting the communication).
 \subsection spec_execptionHandling_behavior_comments (*) Comments
@@ -93,80 +82,46 @@ When the device is functional, it be (re)initialised by using application-define
 A so-called ExceptionHandlingDecorator is placed around all device register accessors (used in ApplicationModules and FanOuts). It is responsible for catching the exceptions and implementing most of the behavior described in A.2. It has to work closely with the DeviceModule and there is a complex synchronisation and locking scheme, which is described here, together with the corresponding interface functions of the DeviceModule. The sequence executed in the DeviceModule is described in \ref spec_execptionHandling_high_level_implmentation_deviceModule.
-\subsection spec_execptionHandling_high_level_implmentation_TransferElement B.0 Requirements to the DeviceAccess TransferElement
-Note: This section should be integrated into the TransferElement specification and then removed here. Requirements which are already met by the TransferElement specification are not mentioned here.
- 0.1 readAsync() may only be called if AccessMode::wait_for_new_data is set. It will throw a ChimeraTK::logic_error otherwise.
+\subsection spec_execptionHandling_high_level_implmentation_interface B.4 Internal interface between ExceptionHandlingDecorator and DeviceModule
- 0.2 If AccessMode::wait_for_new_data is set, the TransferFuture is initialised in the constructor. All read implementations except readLatest() are then using always the TransferFuture.
- 0.3 readLatest() never uses the TransferFuture. Its implementation is identical to the one read implementation when AccessMode::wait_for_new_data is not set. The return value of readLatest() is changed to void.
- 0.4 readTransferAsync() and doReadTransferAsync() are obsolete and hence removed from the interface.
- 0.5 readAsync() always returns the same TransferFuture in subsequent calls.
- 0.6 TransferFuture::wait() and TransferFuture::hasNewData(), as well as ReadAnyGroup::waitAny(), call TransferElement::preRead() at the beginning (keep in mind that extra calls to preRead() are ignored) before the transferFutureWaitCallback is called. This makes sure that preRead() and postRead() are always called in pairs.
- 0.7 Due to the nature of asynchronous transfers, backends must not expect preRead() to be called before new data arrives and is filled into the cppext::future_queue. Hence, doPreRead() of asynchronous accessor implementations will usually be empty. The call to preRead() is still necessary also for asynchronous transfers, since decorators might have important tasks to be done there.
- 0.8 There is no need to call readAsync() for each read transfer again. If the TransferFuture has been obtained once it can simply be used over and over again. Hence, readAsync() will just return the TransferFuture (which had been created in the constructor already) only. It does not call preRead() - this is done by the TransferFuture (see 0.6), and it doesn't have any side effects. (Maybe it should be renamed into getTransferFuture()).
+FIXME: NUMBERING
-\subsection spec_execptionHandling_high_level_implmentation_locking B.4 Synchronisation and locking between ExceptionHandlingDecorator and DeviceModule
+Note: This section defines the internal interface on a low level. Helper functions, like getters and setters, are intentionally not mentioned here, since those are (in this context) unimportant details which can be chosen at will to structure the code conveniently. The entire interface between the ExceptionHandlingDecorator and the DeviceModule should be protected and the two classes should be friends, to prevent interference with the interface from other entities.
-FIXME: NUMBERING
+- 4.1 The boolean flag DeviceModule::deviceHasError
+  - 4.1.1 is used by the RecoveryAccessor to detect prevailing error conditions, to know when transfers have to be skipped, frozen or delayed.
+  - 4.1.2 The access is protected by the DeviceModule::errorMutex:
+    - a shared lock allows reading
+    - a unique lock allows reading and writing
+- 4.2 The atomic DeviceModule::transferCounter (*)
+  - 4.2.1 tracks the number of on-going (synchronous) transfers, and
+  - 4.2.2 is used by the DeviceModule to wait until they are all terminated (2.3.15).
+- 4.3 The DeviceModule::recoveryHelpers
+  - 4.3.1 are used to delay write operations and to restore the last-written values during recovery.
+  - 4.3.2 The access to the list elements is protected by the DeviceModule::recoveryMutex:
+    - a shared lock allows updating the application buffer
+    - a unique lock allows calling write()
- 4.1 ChimeraTK::DeviceModule::deviceHasError is used by the RecoveryAccessor to detect prevailing error conditions, to know when transfers have to be skipped, frozen or delayed. The access is protected by the ChimeraTK::DeviceModule::errorMutex:
+- 4.4 The cppext::future_queue DeviceModule::errorQueue
-  - 4.1.1 shared lock allows to read
+  - 4.4.1 is used by the RecoveryAccessor to inform the DeviceModule about new exceptions.
-  - 4.1.3 unique lock allows to read and write
- 4.2 The atomic ChimeraTK::DeviceModule::transferCounter is used by the DeviceModule to wait until all on-going transfers are terminated (2.3.15). The access is protected by the ChimeraTK::DeviceModule::errorMutex:
+- 4.5 Reading initial values is controlled by the DeviceModule::initialValueMutex:
-  - 4.2.4 no lock required to read and decrement
+  - 4.4.1 a unique lock is held by the DeviceModule from the beginning until the recovery procedure is complete for the first time
-  - 4.2.4 shared lock allows to read and increment
- 4.3 The ChimeraTK::DeviceModule::recoveryHelpers are used delay write operations and to restore the last-written values during recovery. The access to the list elements are protected by the ChimeraTK::DeviceModule::recoveryMutex:
-  - 4.3.1 shared lock allows to update the application buffer
-  - 4.3.2 unique lock allows to call write()
- 4.4 Reading initial values is controlled by the ChimeraTK::DeviceModule::initialValueMutex:
-  - 4.4.1 unique lock is hold by the ChimeraTK::DeviceModule from the beginning until the recovery procedure is complete for the first time
  - 4.4.2 shared lock allows to continue with reading the initial values (no need to keep it, just acquire it once)
+- 4.6 The following mutexes govern critical sections (besides variable access listed above):
+  - 4.6.1 DeviceModule::errorMutex protects (*)
+    - the (positive) decision to start a transfer followed by incrementing the DeviceModule::transferCounter in 1.2.1 to 1.2.3, against
+    - setting DeviceModule::deviceHasError flag in 1.6.1.
+  - 4.6.2 DeviceModule::recoveryMutex protects (*)
+    - writing the DeviceModule::recoveryHelpers to the device and clearing the DeviceModule::deviceHasError flag in 2.3.5 to 2.3.8, against
+    - updating the DeviceModule::recoveryHelpers in 1.3.
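
The interplay of 4.1, 4.2 and 4.6.1 can be sketched roughly as follows. This is only an illustration under stated assumptions: the struct name `TransferGateSketch`, the helper `setError()` and the choice of `std::shared_timed_mutex` are hypothetical, and `startTransfer()` / `stopTransfer()` are merely convenient helper names, not part of the interface defined in this section.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <shared_mutex>

// Hypothetical stand-in for the DeviceModule members named in 4.1 and 4.2.
struct TransferGateSketch {
  std::shared_timed_mutex errorMutex;   // protects deviceHasError (4.1.2)
  bool deviceHasError{false};           // 4.1
  std::atomic<int> transferCounter{0};  // 4.2

  // Decide whether a transfer may start and, if so, count it. Both steps
  // happen under one shared lock (4.6.1), so the flag cannot be set between
  // the decision and the increment.
  bool startTransfer() {
    std::shared_lock<std::shared_timed_mutex> lock(errorMutex);
    if(deviceHasError) return false;
    ++transferCounter;
    return true;
  }

  // To be called exactly once after each successful startTransfer().
  // No lock is taken here; the counter is atomic for that reason.
  void stopTransfer() {
    int previous = transferCounter.fetch_sub(1);
    assert(previous > 0);
    (void)previous;
  }

  // Setting the flag needs the unique lock (4.1.2), which excludes any
  // concurrent startTransfer().
  void setError() {
    std::unique_lock<std::shared_timed_mutex> lock(errorMutex);
    deviceHasError = true;
  }
};
```

In this sketch concurrent calls to `startTransfer()` do not exclude each other, since they only take the shared lock. Once `setError()` has returned, no further transfer is counted, so a waiter on `transferCounter == 0` (2.3.15) only has to wait for transfers which were already running.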
-FIXME: MOVE THE REST
- 4.1 To ensure that the accessor knows when the device is working or has an error, there is boolean flag \c **deviceHasError** in the DeviceModule.
-  - 4.1.1 The flag is protected by the \c **errorMutex**.
-  - 4.1.2 A condition variable \c recoveryCondVar allows waiting for a change of this flag in another thread. It is used inside **DeviceModule::waitForRecovery()**, which is called from the ExceptionHandlingDecorator which is running in the ApplicationModule or FanOut thread.
-  - 4.1.3 \c deviceHasError is set to \true only from ApplicationModule threads (and in the constructor) (*)
-  - 4.1.4 \c deviceHasError is set to \false at exactly one place in the DeviceModule thread. Afterwards the \c recoveryCondVar condition variable is notified.
- 4.2 To inform the DeviceModule that there is a new exception, there is a queue of strings where the exception message can be pushed by the ExceptionHandlingDecorator.
-  - 4.2.1 The function \c **DeviceModule::reportException()** 
- 4.3 The atomic \c **transferCounter** in the DeviceModule is tracking how many transfers are running. Only when all transfers have finished, the recovery can take place.
-  - 4.3.1  A function \c **startTransfer()** is increasing the counter if the device is not in error state.
-    - 4.3.1.1 As the deviceHasError must only be accessed while holding the errorMutex, also increasing the counter must happen under the mutex to ensure without race condition that no transfer is started while the device is not OK.
-    - 4.3.1.2 If the device is OK and the counter was increased, \c startTransfer() returns \c true, otherwise it returns \c false.
-  - 4.3.2 The counter has to be decreased without acquiring the mutex (*). That's why it has to be atomic. This is done in a convenience function \c **stopTransfer()**.
-  - 4.3.3 If \c startTransfer() returned \c true, \c stopTransfer() must be called exactly once. \c stopTransfer() must not be called if startTransfer() returned false.
- 4.4 The so called recovery accessors and **RecoveryHelpers** (see 1.1) are used by the ExceptionHandlingDecorators and the DeviceModule.
-  - 4.4.1 As accessors and RecoveryHelpers are not thread-safe, they have to be protected by a mutex, the \c **recoveryMutex**.
-  - 4.4.2 The mutex can be a shared mutex for the ExceptionHandlingDecorators. Each ExceptionHandlingDecorator is only setting values of its recovery helper, so all ExceptionHandlingDecorators can do this in parallel.
-  - 4.4.3 During recovery, there is a *critical recovery section* where the DeviceModule must hold an *exclusive* lock of the recoveryMutex. There it accesses all recovery helpers and executes accessors' write functions.
- 4.6 MOVE THIS? The critical recovery section
-  - 4.6.1 It has 3 steps
-    - 4.6.1.1 Write all recovery accessors. If an exception occurs release the exclusive lock and exit the critical section.
-    - 4.6.1.2 Clear the list of reported exceptions.
-    - 4.6.1.3 Reset the \c deviceHasError flag (while holding the \c errorMutex in addition to the \c recoveryMutex)(*)
-  - 4.6.2 Step 2 and step 3 must only be executed if step 1 was successful
-  - 4.6.3 Do not release the \c recoveryMutex between step 2 and 3. Only this guarantees that no value in a recovery accessor is lost.
-  - 4.6.4 If the critical section is left without having cleared the exception list and \c deviceHasError, it must be re-tried later
-  - 4.6.5 The list of exceptions must be cleared before resetting deviceHasError. As soon as deviceHasError is cleared new exceptions can be reported, which would be lost if the list was cleared afterwards.
-  - 4.6.7 Resetting deviceHasError must be the very last action taken in the recovery. It happens directly before the recovery lock is released.
- 4.7 Interaction of 4.4 and 4.6 to make sure no value in a recovery accessor is skipped
-  - 4.7.1 The ExceptionHandlingDecorator is setting the RecoveryHelper while holding the shared recovery lock, so the DeviceModule is not in the critical recovery section.
-  - 4.7.2 If the subsequent call to startTransfer() returns false it is guaranteed that the recovery will be executed (because of 4.6.3 and 4.6.4), so the data will eventually be written (unless the recovery helper is later re-filled before this has happened, but then this data loss is reported, see 1.3.1).
-  - 4.7.3 If the subsequent call to startTransfer() returns true, the transfer will take place
-    - 4.7.3.1 if there is no exception the data is written successfully
-    - 4.7.3.2 if there is an exception the DeviceModule will be informed and the recovery section will be executed
-MOVE COMMENTS TO THE COMMENT SECTION
- 4.1.3 Setting the flag has to be done in the application thread. If you just send a message and let the device module do both setting and clearing of the flag you can have a race condition: A blocking read would inform the DeviceModule about an exception and continue. The next call to the blocking read is supposed to freeze, but pre-read might not detect this because the device module thread has not woken up yet to set the error flag.
 - 4.3.2 The state of \c deviceHasError does not matter here. The counter always MUST be decreased after a transfer, whether the transfer failed or not.
 - 4.6.1.3 This includes all related steps like notifying the CS, clearing the error message, clearing the actual errorMutex protected variable and notifying the condition variable
@@ -184,6 +139,7 @@ MOVE COMMENTS TO THE COMMENT SECTION
  - 1.1.3 The RecoveryHelper object may be accessed only under a lock to prevent concurrent access during recovery. The lock shall be shared to allow concurrent write operations of different registers - only the DeviceModule needs to obtain an exclusive lock during recovery. The lock is obtained by the ExceptionHandlingDecorators via DeviceModule::getRecoverySharedLock().
 - 1.3 In doPreWrite() the recoveryAccessor with the version number and ordering parameter is updated, and the written flag is cleared. This has to happen while holding the shared recovery lock.
+  - 1.3.0 This step needs to be done unconditionally at the very beginning of doPreWrite(), before 1.2 and before delegating preWrite(). (*)
  - 1.3.1 If the written flag was previously not set, the return value of doWriteTransfer() must be forced to true (data lost).
  - 1.3.2 The check whether to skip the transfer (cf. 1.2) has to be done without releasing the lock between the write to the recoveryAccessor and the check. (*)
@@ -229,7 +185,7 @@ MOVE COMMENTS TO THE COMMENT SECTION
 - 2.3 In the DeviceModule thread, the following procedure is executed (in a loop until termination):
  - 2.3.1 The DeviceModule tries to open the device until it succeeds and isFunctional() returns true.
    - 2.3.1.1 If the very first attempt to open the device after the application start fails, the error message of the exception is used to overwrite the content of Devices/<alias>/message. Otherwise error messages of exceptions thrown by Device::open() are not visible.
-  - <strike> 2.3.2 Obtain lock for accessing recoveryAccessors.</strike> too early. Only protects recovery accessors.
+  - New position for 2.3.6 The queue of reported exceptions is cleared. (*)
  - 2.3.3 Check that all registers on DeviceModule::listOfReadRegisters are isReadable() and all registers on DeviceModule::listOfWriteRegisters are isWriteable().
    - 2.3.3.1 This involves obtaining an accessor for the register first, which is discarded after the check.
    - 2.3.3.2 If there is an exception, update Devices/<alias>/message with the error message and go back to 2.3.1.
@@ -240,22 +196,28 @@ MOVE COMMENTS TO THE COMMENT SECTION
  - 2.3.5 All valid recoveryAccessors are written in the same order they were originally written.
    - 2.3.5.1 A recoveryAccessor is considered "valid", if it has already received a value, i.e. its current version number is not {nullptr} any more.
    - 2.3.5.2 If there is an exception, update Devices/<alias>/message with the error message, release the lock and go back to 2.3.1.
-  - FIXME numbering: write all remaining async reads. Clear them from the list if successful. If there is an exception, update Devices/<alias>/message with the error message, release the lock and go back to 2.3.1.
-  - 2.3.6 The queue of reported exceptions is cleared. <strike>(*)</strike>
  - 2.3.7 Devices/<alias>/status is set to 0 and Devices/<alias>/message is set to an empty string.
  - 2.3.8 DeviceModule allows ExceptionHandlingDecorators to execute reads and writes again (cf. 2.3.14)
  - 2.3.9 All frozen read operations (cf. 1.4.4) are notified via DeviceModule::errorIsResolvedCondVar. 
  - 2.3.10 Release lock for recoveryAccessors.
-  - 2.3.11 The DeviceModuleThread waits for the next reported exception. The call to reportException in the other thread has already set deviceHasError to true. From this point on, no new transfers will be started.
+  - 2.3.11 The DeviceModuleThread waits for the next reported exception. The call to reportException in the other thread has already set deviceHasError to true (*). From this point on, no new transfers will be started.
  - 2.3.12 An exception is received.
  - 2.3.13 Devices/<alias>/status is set to 1 and Devices/<alias>/message is set to the first received exception message.
-  - <strike> 2.3.14 Set DeviceModule::deviceHasError = true under exclusive recovery lock (cf. 1.2). From this point on, no new transfers will be started.</strike> (done in 2.3.11/1.6.1)
  - 2.3.15 The device module waits until all running read and write operations of ExceptionHandlingDecorators have ended (wait until DeviceModule::activeTransfers == 0). (*)
  - 2.3.16 The thread goes back to 2.3.1 and tries to re-open the device.
+\subsection spec_execptionHandling_high_level_implmentation_reportException B.2 DeviceModule::reportException()
+FIXME missing
 \subsection spec_execptionHandling_high_level_implmentation_comments (*) Comments
+- 4.2.3 Reason for not using an (exclusive) lock: Incrementing and decrementing the counter is done in the ExceptionHandlingDecorator for each operation, even if there is no exception or error state. Concurrent operations must not exclude each other, to allow lockfree operation (if the backend supports it) and to avoid priority inversion, if different application threads have different priority.
+- 4.6.1 This prevents a race condition in 2.3.15. If a (synchronous) transfer could be started after DeviceModule::deviceHasError has been set, the barrier for new transfers in 2.3.15 would not be effective and the transfer might even be executed only after the device has been re-opened (2.3.1) but before the recovery is complete.
+- 4.6.2 This prevents data loss due to a race condition. If the ExceptionHandlingDecorator updated the corresponding DeviceModule::RecoveryHelpers list entry only after it has been written to the device in 2.3.5, but decided not to execute the write operation (1.2) because the DeviceModule thread is still before 2.3.8, the data would not be written to the device at all.
 - 1.1 Possible future change: Output accessors can have the option not to have a recovery accessor. This is needed for instance for "trigger registers" which start an operation on the hardware. Also void registers don't have recovery accessors (once the void data type is supported).
 - 1.1.1 The written flag cannot be replaced by comparing the version number of the recoveryAccessor and the version number stored in the RecoveryHelper, because normal writes (without exceptions) would not update the version number of the recoveryAccessor.
@@ -264,6 +226,8 @@ MOVE COMMENTS TO THE COMMENT SECTION
 - 1.1.2 The ordering guarantee cannot work across DeviceModules anyway. Different devices may go offline and recover at different times. Even in case of two DeviceModules which actually refer to the same hardware device there is no synchronisation mechanism which ensures the recovering procedure is done in a defined order.
+- 1.3.0 Updating the recoveryHelper first ensures that no data is lost, even if the write operation attempt is concurrent with a recovery. See 4.6.2.
 - 1.3.2 Extending the duration of the lock until the decision whether to skip the transfer will prevent unnecessary duplicate writes, which otherwise could occur if the DeviceModule went through the whole critical section 2.3.2 to 2.3.10 in between.
 - 1.2.5 The cppext::future_queue in the TransferFuture is a notification queue and hence of the type void. So we don't have to "invent" any value. Also this injection of values is legal, since the queue is multi-producer but single-consumer. This means, potentially concurrent injection of values while the actual accessor might also write to the queue is allowed. Also, the application is the only receiver of values of this queue, so injecting values cannot disturb the backend in any way.
@@ -282,7 +246,9 @@ MOVE COMMENTS TO THE COMMENT SECTION
 - <strike> 1.4.3 The order of locks is important here. The recovery lock prevents the DeviceModule from entering the section 2.3.2 to 2.3.10, which includes the notification through the DeviceModule::errorIsResolvedCondVar at 2.3.9. The mutex DeviceModule::errorLock is the mutex used for the condition variable. Since the ExceptionHandlingDecorator obtains it before the DeviceModule can start the notification, it is guaranteed that the decorator does not miss the notification. Note that the DeviceModule::errorLock is not a shared lock, so concurrent ExceptionHandlingDecorator::preRead() will mutually exclude, but the mutex is held only for a short time until errorIsResolvedCondVar.wait() is called.</strike> See comment on striked out 1.4.3 directly. 
- 2.3.6 The exact place when this is done does not matter, as long as it is done after 2.3.15 (no ongoing transfers).
+- 2.3.6 The exact place when this is done does not matter, as long as it is done after 2.3.15 (no ongoing synchronous transfers) and before 2.3.8 (resetting deviceHasError). As soon as deviceHasError is cleared, new exceptions can be reported, which would be lost if the list was cleared afterwards. Moving it as early as possible after the device has been reopened has the (slight) advantage that exceptions which might be reported by asynchronous transfers during the recovery are not discarded, even if the recovery itself doesn't catch them for some reason. Since exceptions reported by asynchronous transfers are subject to race conditions with the recovery procedure, there cannot be strict guarantees about the behavior. The optimal place to reset the queue (to minimise unnecessary recoveries while minimising the probability of rejecting true errors which then need to be found later by other transfers) might need to be found in real-life experiments later.
+- 2.3.11 Setting the DeviceModule::deviceHasError flag has to be done in the application thread which has caught the exception. If you just send a message and let the device module do both setting and clearing of the flag you can have a race condition: A blocking read would inform the DeviceModule about an exception and continue. The next call to the blocking read is supposed to freeze, but pre-read might not detect this because the device module thread has not woken up yet to set the error flag.
 - 2.3.15 The backend has to take care that all operations, also the blocking/asynchronous reads with "waitForNewData", terminate when an exception is thrown, so recovery can take place (see DeviceAccess TransferElement specification).
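
The reasoning in the comments on 4.6.2 and 1.3.0 can be illustrated with a small sketch. It is heavily simplified and rests on assumptions: all type and member names except `recoveryMutex`, `errorMutex`, `deviceHasError` and `recoveryHelpers` are hypothetical, the "device" is just a vector of integers, the transfer is executed directly inside `write()`, and the transferCounter handshake (4.2, 2.3.15) is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <vector>

// Hypothetical, reduced recovery helper: last value plus bookkeeping flags.
struct RecoveryHelperSketch {
  int value{0};
  bool hasValue{false};    // "valid" in the sense of 2.3.5.1
  bool wasWritten{true};   // the "written flag" of 1.3
};

struct RecoverySketch {
  std::shared_timed_mutex errorMutex;
  std::shared_timed_mutex recoveryMutex;
  bool deviceHasError{true};  // the sketch starts in the error state
  std::vector<RecoveryHelperSketch> recoveryHelpers;
  std::vector<int> hardware;  // fake device registers

  explicit RecoverySketch(std::size_t n) : recoveryHelpers(n), hardware(n, 0) {}

  // Decorator side: update the helper unconditionally first (1.3.0), then
  // decide whether to skip, without releasing the recovery lock (1.3.2).
  // Returns true if a previous value was never written (1.3.1).
  bool write(std::size_t index, int value) {
    std::shared_lock<std::shared_timed_mutex> recoveryLock(recoveryMutex);
    auto& helper = recoveryHelpers[index];
    bool dataLost = helper.hasValue && !helper.wasWritten;
    helper.value = value;
    helper.hasValue = true;
    helper.wasWritten = false;
    bool skip;
    {
      std::shared_lock<std::shared_timed_mutex> errorLock(errorMutex);
      skip = deviceHasError;
    }
    if(!skip) {
      hardware[index] = value;
      helper.wasWritten = true;
    }
    return dataLost;
  }

  // DeviceModule side: write all valid helpers and clear the flag inside one
  // critical section under the unique recovery lock (2.3.5 to 2.3.8).
  void recover() {
    std::unique_lock<std::shared_timed_mutex> recoveryLock(recoveryMutex);
    for(std::size_t i = 0; i < recoveryHelpers.size(); ++i) {
      auto& helper = recoveryHelpers[i];
      if(!helper.hasValue) continue;
      hardware[i] = helper.value;
      helper.wasWritten = true;
    }
    std::unique_lock<std::shared_timed_mutex> errorLock(errorMutex);
    deviceHasError = false;
  }
};
```

Because `write()` stores the value before it checks the flag, and `recover()` clears the flag only while still holding the unique recovery lock, a skipped write in this sketch is always picked up by the recovery that follows.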