@@ -23,7 +23,7 @@ When the device is functional, it be (re)initialised by using application-define
- Whenever a write operation or a call to write() is mentioned, destructive writes via writeDestructively() are included. The destructive write optimisation makes no difference for the exception handling.
\section spec_execptionHandling_behavior A. Behavioural description
\section spec_execptionHandling_behaviour A. Behavioural description
- 1. All ChimeraTK::runtime_error exceptions thrown by device register accessors are handled by the framework and are never exposed to user code in ApplicationModules.
- 1.1 ChimeraTK::logic_error exceptions are left unhandled and will terminate the application. These errors may only occur in the initialisation phase (up to the point where all devices are opened and initialised) and point to a severe configuration error which is not recoverable. (*)
...
...
@@ -46,7 +46,7 @@ When the device is functional, it be (re)initialised by using application-define
- 2.3.1 The return value of write() indicates whether data was lost in the transfer. If the write has to be delayed due to an exception, the return value will be true if a previously delayed and not-yet-written value is discarded in the process, and false otherwise.
- 2.3.2 When the delayed value is finally written to the device during the recovery procedure, it is guaranteed that no data loss happens (writes with data loss will be retried).
- 2.3.3 It is guaranteed that the write takes place before the device is considered fully recovered again and other transfers are allowed (cf. 3.1).
- 2.4 In case of exceptions, there is no guaranteed realtime behavior, not even for "non-blocking" transfers. (*)
- 2.4 In case of exceptions, there is no guaranteed realtime behaviour, not even for "non-blocking" transfers. (*)
- 3. The framework tries to resolve an exception state by periodically re-opening the faulty device.
- 3.1 After successfully re-opening the device, a recovery procedure is executed before allowing any read/write operations from the ApplicationModules and FanOuts again. This recovery procedure involves:
...
...
@@ -55,20 +55,20 @@ When the device is functional, it be (re)initialised by using application-define
- 3.1.3 Finally, Devices/<alias>/deviceBecameFunctional is written to inform any module subscribing this variable about the finished recovery. (*)
- 3.2 Any number of initialisation handlers can be added to the DeviceModule in the user code. Initialisation handlers are callback functions which will be executed when a device is opened for the first time and after a device recovers from an exception, before any process variables are written. See DeviceModule::addInitialisationHandler().
- 4. The behavior at application start (when all devices are still closed at first) is similar to the case of a later received exception. The only differences are mentioned in 4.2.
- 4. The behaviour at application start (when all devices are still closed at first) is similar to the case of a later received exception. The only differences are mentioned in 4.2.
- 4.1 Even if some devices are initially in a persisting error state, the part of the application which does not interact with the faulty devices starts and works normally.
- 4.2 Initial values are correctly propagated after a device is opened. See \link spec_initialValuePropagation \endlink. Especially, all read operations (even readNonBlocking/readLatest) will be frozen until an initial value has been received. (*)
- 5. Any ApplicationModule can explicitly report a problem with the device by calling DeviceModule::reportException(). This allows the reinitialisation of a device e.g. after a reboot of the device which didn't result in an exception (e.g. because it was too quick to be noticed, or rebooting the device takes place without interrupting the communication).
- 1.1 In future, logic_errors may also be handled, so that configuration errors can be presented nicely to the control system. This may be especially important since logic_errors may also depend on the configuration of external components (devices). If e.g. a device is changed (e.g. the device is another control system application which has been modified), logic_errors may be thrown in the recovery phase, even though the device had been successfully initialised previously.
- 2.2.4 Preventing the device from sending data before the recovery is complete is not trivial in the general case for asynchronous transfers (i.e. wait_for_new_data). Race conditions might occur if the transport layer does not guarantee the order of packets (e.g. UDP), in which case unsubscribing a variable might not guarantee that no more data arrives which has been sent before unsubscribing. Hence it was decided not to specify a mechanism which would guarantee that no asynchronous data transfers take place before the recovery has completed.
- 2.2.5 Not defining the behavior here avoids a conflict with 1.2 without requiring a complicated implementation which does not block in this case. Implementing this would not present any gain for the application. If there are many exceptions on the same device in a short period of time, the number of faulty data updates seen by the application modules will always depend on the speed the module is attempting to read data (unless we require every exception to be visible to every module, but this will have complex effects, too). It might break consistency of the number of updates sent through different paths in an application, but applications should anyway not rely on that and use a DataConsistencyGroup to synchronise instead. Hence, the implementation will block always if a blocking read sees a known exception
- 2.2.5 Not defining the behaviour here avoids a conflict with 1.2 without requiring a complicated implementation which does not block in this case. Implementing this would not present any gain for the application. If there are many exceptions on the same device in a short period of time, the number of faulty data updates seen by the application modules will always depend on the speed at which the module is attempting to read data (unless we require every exception to be visible to every module, but this will have complex effects, too). It might break consistency of the number of updates sent through different paths in an application, but applications should not rely on that anyway and should use a DataConsistencyGroup to synchronise instead. Hence, the implementation will always block if a blocking read sees a known exception.
- 2.3 / 3.1.3 If timing is important for write operations (e.g. must not write a sequence of registers too fast), or if multiple values need to be written to the same register in sequence, the application cannot fully rely on the framework's recovery procedure. The framework hence provides the process variable Devices/<alias>/deviceBecameFunctional for each device, which will be written each time the recovery procedure is completed (cf. 3.1.3). ApplicationModules which implement such a timed sequence need to receive this variable and restart the entire sequence after the recovery.
...
...
@@ -76,11 +76,11 @@ When the device is functional, it be (re)initialised by using application-define
- 3.1.2 For some applications, the order of writes may be important, e.g. if firmware expects this. Please note that the VersionNumber is insufficient as a sorting criterion, since many writes may have been done with the same VersionNumber (in an ApplicationModule, the VersionNumber used for the writes is determined by the largest VersionNumber of the inputs).
- 4.2 DataValidity::faulty is initially set by default, so there is no need to propagate this flag initially. To prevent race conditions and undefined behavior, it even needs to be made sure that the flag is not propagated unnecessarily. The behavior of non-blocking reads presents a slight asymmetry between the initial device opening and a later recovery. This will in particular be visible when restarting a server while a device is offline. If a module only uses readLatest()/readNonBlocking() (= read() for poll-type inputs) for the offline device, the module was still running before the server restart using the last known values for the dysfunctional registers (and flagging all outputs as faulty). After the restart, the module has to wait for the initial value and hence will not run until the device becomes functional again. To make this behavior symmetric, one would need to persist the values of device inputs. Since this only affects a corner case in which anyway no usable output is produced, this slight inconsistency is considered acceptable.
- 4.2 DataValidity::faulty is initially set by default, so there is no need to propagate this flag initially. To prevent race conditions and undefined behaviour, it even needs to be made sure that the flag is not propagated unnecessarily. The behaviour of non-blocking reads presents a slight asymmetry between the initial device opening and a later recovery. This will in particular be visible when restarting a server while a device is offline. If a module only uses readLatest()/readNonBlocking() (= read() for poll-type inputs) for the offline device, the module was still running before the server restart using the last known values for the dysfunctional registers (and flagging all outputs as faulty). After the restart, the module has to wait for the initial value and hence will not run until the device becomes functional again. To make this behaviour symmetric, one would need to persist the values of device inputs. Since this only affects a corner case in which anyway no usable output is produced, this slight inconsistency is considered acceptable.
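The return value semantics of a delayed write (A.2.3.1 and A.2.3.2) can be illustrated with a minimal sketch. This is not the framework's implementation: DelayedWrite, delayWrite() and flushDuringRecovery() are hypothetical names, and the value type is simplified to int. The sketch only assumes that one pending value per register is kept and that a flag records whether that value has already reached the device.

```cpp
#include <cassert>

// Hypothetical stand-in for the per-register state kept while a device is faulty.
// Not part of the ChimeraTK API.
struct DelayedWrite {
  int value{0};
  bool written{true}; // true: no value is waiting to be written to the device
};

// Called instead of the real transfer while the device is in an exception state.
// Returns true if a previously delayed, not-yet-written value was discarded
// (data lost), false otherwise (cf. A.2.3.1).
inline bool delayWrite(DelayedWrite& slot, int newValue) {
  const bool dataLost = !slot.written;
  slot.value = newValue;
  slot.written = false;
  return dataLost;
}

// Called during the recovery procedure: hands out the last delayed value and
// marks it as written, so the next delayed write does not report data loss.
inline int flushDuringRecovery(DelayedWrite& slot) {
  slot.written = true;
  return slot.value;
}
```

In this sketch, the first write delayed after a fault returns false, every further write before the recovery returns true, and only the last value is handed to the device during recovery.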
\section spec_execptionHandling_high_level_implmentation B. Implementation
A so-called ExceptionHandlingDecorator is placed around all device register accessors (used in ApplicationModules and FanOuts). It is responsible for catching the exceptions and implementing most of the behavior described in A.2. It has to work closely with the DeviceModule and there is a complex syncronsiation and locking scheme, which is described here, together with the according interface functions of the DeviceModule. The sequence executed in the DeviceModule is described in \ref spec_execptionHandling_high_level_implmentation_deviceModule.
A so-called ExceptionHandlingDecorator is placed around all device register accessors (used in ApplicationModules and FanOuts). It is responsible for catching the exceptions and implementing most of the behaviour described in A.2. It has to work closely with the DeviceModule and there is a complex synchronisation and locking scheme, which is described here, together with the corresponding interface functions of the DeviceModule. The sequence executed in the DeviceModule is described in \ref spec_execptionHandling_high_level_implmentation_deviceModule.
\subsection spec_execptionHandling_high_level_implmentation_interface B.4 Internal interface between ExceptionHandlingDecorator and DeviceModule
...
...
@@ -100,7 +100,7 @@ Note: This section defines the internal interface on a low level. Helper functio
- 4.2.2 is used by the DeviceModule to wait until they are all terminated (2.3.15).
- 4.3 The DeviceModule::recoveryHelpers
- 4.3.1 are used delay write operations and to restore the last-written values during recovery.
- 4.3.1 are used to delay write operations and to restore the last-written values during recovery.
- 4.3.2 Access to the list elements is protected by the DeviceModule::recoveryMutex:
- shared lock allows to update the application buffer
- unique lock allows to call write()
...
...
@@ -143,20 +143,20 @@ Note: This section defines the internal interface on a low level. Helper functio
- 1.3.1 If the written flag was previously not set, the return value of doWriteTransfer() must be forced to true (data lost).
- 1.3.2 The check whether to skip the transfer (cf. 1.2) has to be done without releasing the lock between the write to the recoveryAccessor and the check. (*)
- 1.2 In doPreRead()/doPreWrite(), it must be decided whether to execute xxxTransferYyy(). This part requires a shared lock on the ChimeraTK::DeviceModule::errorMutex.
- 1.2 In doPreRead()/doPreWrite(), it must be decided whether to execute xxxTransferYyy(). This part requires a shared lock on the DeviceModule::errorMutex.
- 1.2.1 xxxTransferYyy() is <i>not</i> executed, if DeviceModule::deviceHasError == true and either:
- it is a write transfer (cf. A.2.3), or
- it is a read transfer and AccessMode::wait_for_new_data is not set (cf. A.2.2.3), or
- it is a read transfer and AccessMode::wait_for_new_data is set and ExceptionHandlingDecorator::previousReadFailed == false (cf. 1.5.1, 1.6.3.1 and A.2.2.4).
Otherwise xxxTransferYyy() is executed (potentially after it is frozen, see 1.4).
- 1.2.2 If xxxTransferYyy() is not executed, none of the pre/transfer/post functions must be delegated to the target accessor.
- 1.2.3 If xxxTransferYyy() is executed, and it is <i>not</i> a read transfer with AccessMode::wait_for_new_data set, the ChimeraTK::DeviceModule::transferCounter must be incremented.
- 1.2.3 If xxxTransferYyy() is executed, and it is <i>not</i> a read transfer with AccessMode::wait_for_new_data set, the DeviceModule::transferCounter must be incremented.
- 1.4 In doPreRead() certain read operations are frozen in case of a fault state, i.e. startTransfer() returned false (see A.2.2):
- 1.4.1 The shared lock on the DeviceModule::errorMutex acquired in 1.2 is still kept.
- 1.4.2 Decide, whether freezing is done (don't freeze yet). Freezing is done if no initial value has been read yet (getCurretVersion() == {nullptr}) and DeviceModule::deviceHasError == true (cf. A.4.2). (*)
- 1.4.3 Release the DeviceModule::errorMutex.
- 1.4.4 If the read should be frozen, acquire a shared lock on the ChimeraTK::DeviceModule::initialValueMutex. (*)
- 1.4.4 If the read should be frozen, acquire a shared lock on the DeviceModule::initialValueMutex. (*)
- 1.5 In doPostRead()/doPostWrite():
- 1.5.0 Delegate postRead() / postWrite() (see 1.6)
...
...
@@ -244,9 +244,7 @@ FIXME missing
- 1.6.3.1 The freezing is done in doPreRead(), see 1.4.
- <strike> 1.4.3 The order of locks is important here. The recovery lock prevents the DeviceModule from entering the section 2.3.2 to 2.3.10, which includes the notification through the DeviceModule::errorIsResolvedCondVar at 2.3.9. The mutex DeviceModule::errorLock is the mutex used for the condition variable. Since the ExceptionHandlingDecorator obtains it before the DeviceModule can start the notification, it is guaranteed that the decorator does not miss the notification. Note that the DeviceModule::errorLock is not a shared lock, so concurrent ExceptionHandlingDecorator::preRead() will mutually exclude, but the mutex is held only for a short time until errorIsResolvedCondVar.wait() is called.</strike> See comment on striked out 1.4.3 directly.
- 2.3.6 The exact place when this is done does not matter, as long as it is done after 2.3.15 (no ongoing synchronous transfers) and before 2.3.8 (resetting deiveHasError). As soon as deviceHasError is cleared new exceptions can be reported, which would be lost if the list was cleared afterwards. Moving it as early as possible after the device has been reopenend has the (slight) advantage, that exceptions which might be reported by asynchronous transfers during the recovery are not discarded, even if the recovery itself does't catch them for some reason. Since exceptions reported by asynchronous transfers are subject to race conditions with the recovery procedure, there cannot be strict guarantees about the behavior. The optimal place where to reset the queue (to minimise unnecessary recoveries while minimising the probability of rejecting true errors which then need to be found instead later by other transfers) might need to be found in real-life experiments later.
- 2.3.6 The exact place where this is done does not matter, as long as it is done after 2.3.15 (no ongoing synchronous transfers) and before 2.3.8 (resetting deviceHasError). As soon as deviceHasError is cleared, new exceptions can be reported, which would be lost if the list was cleared afterwards. Moving it as early as possible after the device has been reopened has the (slight) advantage that exceptions which might be reported by asynchronous transfers during the recovery are not discarded, even if the recovery itself doesn't catch them for some reason. Since exceptions reported by asynchronous transfers are subject to race conditions with the recovery procedure, there cannot be strict guarantees about the behaviour. The optimal place to reset the queue (to minimise unnecessary recoveries while minimising the probability of rejecting true errors which then need to be found later by other transfers) might need to be found in real-life experiments later.
- 2.3.11 Setting the DeviceModule::deviceHasError flag has to be done in the application thread which has caught the exception. If you just send a message and let the device module do both setting and clearing of the flag you can have a race condition: A blocking read would inform the DeviceModule about an exception and continue. The next call to the blocking read is supposed to freeze, but pre-read might not detect this because the device module thread has not woken up yet to set the error flag.
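The decision in B.1.2.1 whether xxxTransferYyy() is executed can be written as a small predicate. The following is a minimal sketch and not the actual ExceptionHandlingDecorator code: TransferKind and skipTransfer() are hypothetical names, and the flags are passed as plain booleans instead of being read from the DeviceModule under the shared lock on DeviceModule::errorMutex.

```cpp
#include <cassert>

// Hypothetical helper type, not part of the ChimeraTK API.
enum class TransferKind { read, write };

// Returns true if xxxTransferYyy() must NOT be executed (cf. B.1.2.1).
// deviceHasError     : stands in for DeviceModule::deviceHasError
// waitForNewData     : whether AccessMode::wait_for_new_data is set
// previousReadFailed : stands in for ExceptionHandlingDecorator::previousReadFailed
inline bool skipTransfer(bool deviceHasError, TransferKind kind, bool waitForNewData,
    bool previousReadFailed) {
  if(!deviceHasError) return false;                 // device is fine: always execute
  if(kind == TransferKind::write) return true;      // write transfer (cf. A.2.3)
  if(!waitForNewData) return true;                  // synchronous read (cf. A.2.2.3)
  return !previousReadFailed;                       // asynchronous read (cf. A.2.2.4)
}
```

Under these assumptions, the only transfer still executed while the device has an error is a read with AccessMode::wait_for_new_data whose previous read has already failed.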