diff --git a/doc/spec_exceptionHandling.dox b/doc/spec_exceptionHandling.dox index 4bcf8ec4b6a37a4c0daf8f46ea75b7167e7689f7..2500c91b312661bef3a7c04ad4d19035a62f8da5 100644 --- a/doc/spec_exceptionHandling.dox +++ b/doc/spec_exceptionHandling.dox @@ -49,8 +49,11 @@ When the device is functional, it be (re)initialised by using application-define - 2.2 Read operations will propagate the DataValidity::faulty flag to the owning module / fan out (without changing the actual value): - 2.2.1 The normal module algorithm code will be continued, to allow this flag to propagate to the outputs in the same way as if it had been received through the process variable itself (c.f. 1.2). - 2.2.2 The DataValidity::faulty flag resulting from the fault state is propagated once, even if the variable had the a DataValidity::faulty flag already set previously for another reason. - - 2.2.3 readLatest() (including any read operation without AccessMode::wait_for_new_data) will be skipped. The return value will be false (no new data), if the fault flag has been read once already by the same accessor and hence is already propagated (regardless of the type of the first read), true otherwise. - - 2.2.4 Read operations with AccessMode::wait_for_new_data (read(), readNonBlocking() and readAsync()) will be skipped, if the DataValidity::faulty flag has not yet been propagated by the same accessor (which counts as new data, i.e. readNonBlocking() will return true). Otherwise, it will behave like there is no new data: Blocking operations will be frozen, non-blocking operations will be skipped. When the frozen operation is finally executed, another exception might be thrown, in which case the previously frozen operation is finally skipped. + - 2.2.3 Read operations without AccessMode::wait_for_new_data are skipped. + - 2.2.4 Read operations with AccessMode::wait_for_new_data will be skipped once for each accessor to propagate the DataValidity::faulty flag (which counts as new data, i.e. 
readNonBlocking() will return true). In the following: + - non-blocking read operations (readNonBlocking() and readLatest()) are skipped and return false, until new data has arrived from the device, and + - blocking read operations (read()) will freeze until new data has arrived from the device. + - Note: The device may start sending data already before the recovery procedure (cf. 3.1) is complete. If this is not acceptable, a device specific handshake mechanism has to be implemented in the application to control when the device is allowed to send updates again. (*) - 2.2.5 If the fault state had been resolved in between two read operations (regardless of the type) and the device had become faulty again before the second read is executed, it is not defined whether the second operation will frozen/skipped (depending on the type) or not. The second operation might behave either like it is a new exception or like the same fault state would still prevail. (*) - 2.3 Write operations will be delayed. In case of a fault state (new or persisting), the actual write operation will take place asynchronously when the device is recovering. The same mechanism as used for 3.1.2 is used here, hence the order of write operations is guaranteed across accessors, but only the latest written value of each accessor prevails. (*) - 2.3.1 The return value of write() indicates whether data was lost in the transfer. If the write has to be delayed due to an exception, the return value will be true, if a previously delayed and not-yet written value is discarded in the process, false otherwise. @@ -74,6 +77,8 @@ When the device is functional, it be (re)initialised by using application-define - 1.1 In future, maybe logic_errors are also handled, so configuration errors can nicely be presented to the control system. This may be important especially since logic_errors may depend also on the configuration of external components (devices). If e.g. a device is changed (e.g. 
device is another control system application which has been modified), logic_errors may be thrown in the recovery phase, despite the device had been successfully initialsed previously. +- 2.2.4 Preventing the device from sending data before the recovery is complete is not trivial in the general case for asynchronous transfers (i.e. wait_for_new_data). Race conditions might occur if the transport layer does not guarantee the order of packets (e.g. UDP), in which case unsubscribing a variable might not guarantee that no more data arrives which has been sent before unsubscribing. Hence it was decided not to specify a mechanism which would guarantee that no asynchronous data transfers take place before the recovery has completed. + - 2.2.5 Not defining the behavior here avoids a conflict with 1.2 without requiring a complicated implementation which does not block in this case. Implementing this would not present any gain for the application. If there are many exceptions on the same device in a short period of time, the number of faulty data updates seen by the application modules will always depend on the speed the module is attempting to read data (unless we require every exception to be visible to every module, but this will have complex effects, too). It might break consistency of the number of updates sent through different paths in an application, but applications should anyway not rely on that and use a DataConsistencyGroup to synchronise instead. Hence, the implementation will block always if a blocking read sees a known exception - 2.3 / 3.1.3 If timing is important for write operations (e.g. must not write a sequence of registers too fast), or if multiple values need to be written to the same register in sequence, the application cannot fully rely on the framework's recovery procedure. The framework hence provides the process variable Devices/<alias>/deviceBecameFunctional for each device, which will be written each time the recovery procedure is completed (cf. 
3.1.3). ApplicationModules which implement such timed sequence need to receive this variable and restart the entire sequence after the recovery. @@ -86,6 +91,8 @@ When the device is functional, it be (re)initialised by using application-define \section spec_execptionHandling_high_level_implmentation B. Implementation +A so-called ExceptionHandlingDecorator is placed around all device register accessors (used in ApplicationModules and FanOuts). It is responsible for catching the exceptions and implementing most of the behavior described in A.2. It has to work closely with the DeviceModule and there is a complex synchronisation and locking scheme, which is described here, together with the corresponding interface functions of the DeviceModule. The sequence executed in the DeviceModule is described in \ref spec_execptionHandling_high_level_implmentation_deviceModule. + \subsection spec_execptionHandling_high_level_implmentation_TransferElement B.0 Requirements to the DeviceAccess TransferElement Note: This section should be integrated into the TransferElement specification and then removed here. Requirements which are already met by the TransferElement specifciation are not mentioned here. @@ -101,10 +108,26 @@ Note: This section should be integrated into the TransferElement specification a \subsection spec_execptionHandling_high_level_implmentation_locking B.4 Syncronsisation and locking between ExceptionHandlingDecorator and DeviceModule -// fixme: do cyclic re-namimg 2. -> 3., 1. -> 2., 4. -> 1. when done. +FIXME: NUMBERING + + + +- 4.1 ChimeraTK::DeviceModule::deviceHasError is used by the RecoveryAccessor to detect prevailing error conditions, to know when transfers have to be skipped, frozen or delayed. 
The access is protected by the ChimeraTK::DeviceModule::errorMutex: + - 4.1.1 a shared lock allows reading + - 4.1.3 a unique lock allows reading and writing +- 4.2 The atomic ChimeraTK::DeviceModule::transferCounter is used by the DeviceModule to wait until all on-going transfers are terminated (2.3.15). The access is protected by the ChimeraTK::DeviceModule::errorMutex: + - 4.2.4 no lock is required to read and decrement + - 4.2.4 a shared lock allows reading and incrementing +- 4.3 The ChimeraTK::DeviceModule::recoveryHelpers are used to delay write operations and to restore the last-written values during recovery. The access to the list elements is protected by the ChimeraTK::DeviceModule::recoveryMutex: + - 4.3.1 a shared lock allows updating the application buffer + - 4.3.2 a unique lock allows calling write() +- 4.4 Reading initial values is controlled by the ChimeraTK::DeviceModule::initialValueMutex: + - 4.4.1 a unique lock is held by the ChimeraTK::DeviceModule from the beginning until the recovery procedure is complete for the first time + - 4.4.2 a shared lock allows continuing with reading the initial values (no need to keep it, just acquire it once) + -A so-called ExceptionHandlingDecorator is placed around all device register accessors (used in ApplicationModules and FanOuts). It is responsible for catching the exceptions and implementing most of the behavior described in A.2. It has to work closely with the DeviceModule and there is a complex syncronsiation and locking scheme, which is described here, together with the according interface functions of the DeviceModule. The sequence executed in -the DeviceModule is described in \ref spec_execptionHandling_high_level_implmentation_deviceModule. + +FIXME: MOVE THE REST - 4.1 To ensure that the accessor knows when the device is working or has an error, there is boolean flag \c **deviceHasError** in the DeviceModule. - 4.1.1 The flag is protected by the \c **errorMutex**. 
@@ -124,10 +147,6 @@ the DeviceModule is described in \ref spec_execptionHandling_high_level_implment - 4.4.2 The mutex can be a shared mutex for the ExceptionHandlingDecorators. Each ExceptionHandlingDecorator is only setting values of it's recovery helper, so all ExceptionHandlingDecorators can to this in parallel. - 4.4.3 During recovery, there is a *critical recovery section* where the DeviceModule must hold an *exlusive* lock of the recoveryMutex. There it accesses all recovery helpers and executes accessors' write functions. - - 4.5 Summary - - 4.5.1 errorMutex protects deviceHasError and increasing the transferCounter (starting the transfer) - - 4.5.2 recoveryMutex protects RecoveryHelpers, and clearing the exceptions list and resetting deviceHasError - - 4.6 MOVE THIS? The critical recovery section - 4.6.1 It has 3 steps - 4.6.1.1 Write all recovery accessors. If an exception occurs release the exclusive lock and exit the critical section. @@ -166,34 +185,26 @@ MOVE COMMENTS TO THE COMMENT SECTION - 1.3 In doPreWrite() the recoveryAccessor with the version number and ordering parameter is updated, and the written flag is cleared. This has to happen while holding the shared recovery lock. - 1.3.1 If the written flag was previously not set, the return value of doWriteTransfer() must be forced to true (data lost). + - 1.3.2 The check whether to skip the transfer (cf. 1.2) has to be done without releasing the lock between the write to the recoveryAccessor and the check. (*) -- 1.2 In doPreRead()/doPreWrite(), DeviceModule::startTransfer() is called. The return values is stored in transferAllowed. - - 1.2.1 If it returns false the device is in error. The actual transfer will be skipped. (cf. 2.2 or 2.3.14) - - 1.2.2 If it returns true the transfer will be executed. 
startTransfer has already increased the transfer counter and stopTransfer must be called in doPostRead()/doPostWrite() - - <strile> 1.2.3 write: The check for a prevailing fault state has to be done without releasing the lock between the write to the recoveryAccessor and the check. (*) </strike> Not needed, see 4.7 - - 1.2.4 For skipped transfers, none of the pre/transfer/post functions must be delegated to the target accessor. - - 1.2.5 If an asynchronous read transfer is skipped, a pseudo value needs to be written to the cppext::future_queue of the TransferFuture. This will cause the TransferFuture to be ready immediatly, so postRead() is called (*). +- 1.2 In doPreRead()/doPreWrite(), it must be decided whether to execute xxxTransferYyy(). This part requires a shared lock on the ChimeraTK::DeviceModule::errorMutex. + - 1.2.1 xxxTransferYyy() is <i>not</i> executed, if DeviceModule::deviceHasError == true and either: + - it is a write transfer (cf. A.2.3), or + - it is a read transfer and AccessMode::wait_for_new_data is not set (cf. A.2.2.3), or + - it is a read transfer and AccessMode::wait_for_new_data is set and ExceptionHandlingDecorator::previousReadFailed == false (cf. 1.5.1, 1.6.3.1 and A.2.2.4). + Otherwise xxxTransferYyy() is executed (potentially after it is frozen, see 1.4). + - 1.2.2 If xxxTransferYyy() is not executed, none of the pre/transfer/post functions must be delegated to the target accessor. + - 1.2.3 If xxxTransferYyy() is executed, and it is <i>not</i> a read transfer with AccessMode::wait_for_new_data set, the ChimeraTK::DeviceModule::transferCounter must be incremented. - 1.4 In doPreRead() certain read operations are frozen in case of a fault state, i.e. startTransfer() returned false (see A.2.2): - - <strike> 1.4.1 Obtain the recovery lock through DeviceModule::getRecoverySharedLock(), to prevent interference with an ongoing recovery procedure.</strike> Not needed. 
this would only be writing (we are in preRead) and resetting the error state, which is an atomic operation inside startTransfer. It does not quarantee that recovery (re-open the device) has not started. - - 1.4.2 Decide, whether freezing is done (don't freeze yet). Freezing is done if one of the following conditions is met: - - read type is blocking and AccessMode::wait_for_new_data is set, previousReadFailed == true, and DeviceModule::deviceHasError == true (cf. A.2.2.4), or - - no initial value has been read yet (getCurretVersion() == {nullptr}) and DeviceModule::deviceHasError == true (cf. A.4.2). - - <strike> 1.4.3 Obtain the DeviceModule::errorLock. Only then release the recovery lock. (*)</strike> Not needed. We don't rely on getting the notification though the condition variable. The important information - is only in deviceHasError. If the first check on it already says that the device has recovered this is ok. There is no harm that we have not slept and waited for a notification through the condition variable. - The only race condition is that the device could be OK and broken again. But this can happen as well with the condition variable. If the notification is coming there is no guarantee that the conditions is still true when the predicate is checked. - - 1.4.4 If the read should be frozen - - 1.4.4.1 Call DeviceModule::waitForRecovery(). This will wait on the condition variable until the error condition is gone. - - 1.4.4.2 Call startTransfer() and store it in transferAllowed. - - 1.4.4.3 If it returns true delegate preRead(), and continue with the transfer - - 1.4.4.4 If it returns false, go back to 1.4.4.1 - - 1.4.6 If an asynchronous read transfer is frozen, instead of 1.4.4 the following actions are executed: - - 1.4.6.1 Register the asynchronous read transfer with the DeviceModule::asynchronousReadQueue by placing a shared_pointer to this on it. FIXME: This would need the recovery lock. 
- - 1.4.6.2 Do not delegate to preRead() and readTransferAsync() - both functions are called by the DeviceModule instead. FIXME Missing in DeviceModule - + - 1.4.1 The shared lock on the DeviceModule::errorMutex acquired in 1.2 is still kept. + - 1.4.2 Decide, whether freezing is done (don't freeze yet). Freezing is done if no initial value has been read yet (getCurretVersion() == {nullptr}) and DeviceModule::deviceHasError == true (cf. A.4.2). (*) + - 1.4.3 Release the DeviceModule::errorMutex. + - 1.4.4 If the read should be frozen, acquire a shared lock on the ChimeraTK::DeviceModule::initialValueMutex. (*) + - 1.5 In doPostRead()/doPostWrite(): - 1.5.0 Delegate postRead() / postWrite() (see 1.6) - - 1.5.1 If there was no exception, set previousReadFailed = false. + - 1.5.1 If there was no exception, set ExceptionHandlingDecorator::previousReadFailed = false (cf. 1.2.1 and 1.6.3.1). - 1.5.3 In doPostWrite() the recoveryAccessor's written flag is set if the write was successful (no exception thrown; data lost flag does not matter here). (*) - 1.5.4 In doPostRead(), if no exception was thrown, end overriding the DataValidity returned by the accessor (cf. 1.6.2). - 1.5.2 If the transfer wasperform allowed, call in 1.2.2 the DeviceModule::activeTransfers counter was incremented, atomically decrement it. Must happen after 1.5.3 FIXME: fix numbering @@ -203,12 +214,12 @@ MOVE COMMENTS TO THE COMMENT SECTION - 1.6.2 For readable accessors: the DataValidity returned by the accessor is overridden to faulty until next successful read operation (cf. 1.5.4). - 1.6.2.1 The code instantiating the decorator (Application::createDeviceVariable()) has to make sure that the ExceptionHandlingDecorator is "inside" the MetaDataPropagatingRegisterDecorator, so the overriden DataValidity flag in case of an exception is properly propagated to the owning module/fan out. 
- 1.6.3 Action depending on the calling operation: - - 1.6.3.1 All read operations: The ExceptionHandlingDecorator remembers that it is in an exception state by setting previousReadFailed = true + - 1.6.3.1 All read operations: The ExceptionHandlingDecorator remembers that it is in an exception state by setting ExceptionHandlingDecorator::previousReadFailed = true (cf. 1.2.1 and 1.5.1) - 1.6.3.1 read (push-type inputs): return immediately (*) - 1.6.3.2 readNonBlocking / readLatest / read (poll-type inputs): Just return (true in readLatest() by definition in poll type). The calling module thread will continue and propagate the DataValidity::faulty flag (cf. 1.6.2). - 1.6.3.3 write: Do not block. Write will be later executed by the DeviceModule (see 1.1) -- 1.6 In the constructor of the decorator, put the name of the register to DeviceModule::listOfReadRegisters resp. DeviceModule::listOfWriteRegisters depending on the direction the accessor is used. +- 1.7 In the constructor of the decorator, put the name of the register to DeviceModule::listOfReadRegisters resp. DeviceModule::listOfWriteRegisters depending on the direction the accessor is used. \subsection spec_execptionHandling_high_level_implmentation_deviceModule B.2 DeviceModule @@ -253,10 +264,14 @@ MOVE COMMENTS TO THE COMMENT SECTION - 1.1.2 The ordering guarantee cannot work across DeviceModules anyway. Different devices may go offline and recover at different times. Even in case of two DeviceModules which actually refer to the same hardware device there is no synchronisation mechanism which ensures the recovering procedure is done in a defined order. -- <strike> 1.2.3 The lock excludes that the DeviceModule is between 2.3.2 and 2.3.10. If it is right before, the device is still in fault state and the value written to the recoveryAccessor is guaranteed to be written in 2.3.5. 
If it is right after, the exception state has already been resolved and the real write transfer will be attempted by the ExceptionHandlingDecorator. </strike> Has been replacd by 4.7 +- 1.3.2 Extending the duration of the lock until the decision whether to skip the transfer will prevent unnecessary duplicate writes, which otherwise could occur if the DeviceModule went through the whole critical section 2.3.2 to 2.3.10 in between. - 1.2.5 The cppext::future_queue in the TransferFuture is a notification queue and hence of the type void. So we don't have to "invent" any value. Also this injection of values is legal, since the queue is multi-producer but single-consumer. This means, potentially concurrent injection of values while the actual accessor might also write to the queue is allowed. Also, the application is the only receiver of values of this queue, so injecting values cannot disturb the backend in any way. +- 1.4.2 In A.2.2.4 it was stated that blocking read transfers are frozen on the second operation also in case AccessMode::wait_for_new_data is set. Nothing is to be implemented for this case; the freezing simply relies on having an empty queue in the accessor. Once the device sends data again, the operation is intrinsically unfrozen. + +- 1.4.4 The transferCounter is already incremented at this point. It is acceptable to freeze anyway in this case by waiting on the initialValueMutex, because the DeviceModule releases the mutex after the first successful recovery and never obtains it again, and this happens before it waits for the transferCounter to become 0 in 2.3.15. + - 1.5.3 The written flag for the recoveryAccessor is used to report loss of data. If the loss of data is already reported directly, it should not later be reported again. Hence the written flag is set even if there was a loss of data in this context. - 1.6 Remember: exceptions from other phases are redirected to the post phase by the TransferElement base class.
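
Review note: the decision rules added in B.1.2.1 and B.1.2.3 may be easier to check against an implementation when written out as a pure function. The following is only a minimal sketch under stated assumptions, not the ChimeraTK implementation: TransferType, TransferRequest, skipTransfer() and mustIncrementTransferCounter() are hypothetical names introduced for illustration, and the caller is assumed to hold the shared lock on the errorMutex while reading deviceHasError, as 1.2 requires.

```cpp
// Hypothetical illustration types, not part of the ChimeraTK API.
enum class TransferType { read, write };

struct TransferRequest {
  TransferType type;
  bool waitForNewData;     // AccessMode::wait_for_new_data is set for this accessor
  bool previousReadFailed; // per-decorator flag, cf. 1.5.1 and 1.6.3.1
};

// B.1.2.1: returns true if xxxTransferYyy() must NOT be executed.
// deviceHasError is assumed to have been read under a shared lock on the errorMutex.
bool skipTransfer(bool deviceHasError, const TransferRequest& r) {
  if(!deviceHasError) return false;              // no fault state: always execute
  if(r.type == TransferType::write) return true; // writes are delayed (cf. A.2.3)
  if(!r.waitForNewData) return true;             // poll-type reads are skipped (cf. A.2.2.3)
  // Push-type read: skip once to propagate the faulty flag (cf. A.2.2.4).
  return !r.previousReadFailed;
}

// B.1.2.3: the transferCounter is incremented only for executed transfers
// which are not reads with wait_for_new_data.
bool mustIncrementTransferCounter(bool deviceHasError, const TransferRequest& r) {
  if(skipTransfer(deviceHasError, r)) return false;
  const bool isPushTypeRead = (r.type == TransferType::read) && r.waitForNewData;
  return !isPushTypeRead;
}
```

Keeping the decision free of side effects means it can be evaluated entirely inside the locked section, while the counter increment and the later freeze (1.4) remain separate steps.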