diff --git a/doc/spec_exceptionHandling.dox b/doc/spec_exceptionHandling.dox
index ab4555c374b9e1a09c926449022146bfa82e6af1..73a2375bd080ed2541a991908aaa28f9ac8b9b00 100644
--- a/doc/spec_exceptionHandling.dox
+++ b/doc/spec_exceptionHandling.dox
@@ -90,7 +90,7 @@ FIXME: NUMBERING
 Note: This section defines the internal interface on a low level. Helper functions, like getters and setters, are intenionally not mentioned here, since those are (in this context) unimportant details which can be chosen at will to structure the code conveniently. The entire interface between the ExceptionHandlingDecorator and the DeviceModule should be protected and the two classes should be friends, to prevent interference with the interface from other entities. Only DeviceModule::reportException() is public, see A.5.
 - 4.1 The boolean flag DeviceModule::deviceHasError
-  - 4.1.1 is used by the RecoveryAccessor to detect prevailing error conditions, to know when transfers have to be skipped, frozen or delayed (cf. 1.2 and 1.4).
+  - 4.1.1 is used by the ExceptionHandlingDecorator to detect prevailing error conditions, to know when transfers have to be skipped, frozen or delayed (cf. 1.2 and 1.4).
   - 4.1.2 The access is protected by the DeviceModule::errorMutex:
     - shared lock allows to read
     - unique lock allows to read and write
@@ -106,18 +106,18 @@ Note: This section defines the internal interface on a low level. Helper functio
     - unique lock allows to call RecoveryHelper::accessor.write() and to read the RecoveryHelper::versionNumber
 - 4.4 The cppext::future_queue DeviceModule::errorQueue
-  - 4.4.1 is used by the RecoveryAccessor to inform the DeviceModule about new exceptions.
+  - 4.4.1 is used by the ExceptionHandlingDecorator to inform the DeviceModule about new exceptions.
-- 4.6 The following mutexes govern critical sections (besides variable access listed above):
-  - 4.6.1 DeviceModule::errorMutex protects (*)
+- 4.5 The following mutexes govern critical sections (besides variable access listed above):
+  - 4.5.1 DeviceModule::errorMutex protects (*)
     - the (positive) decision to start a transfer followed by incrementing the DeviceModule::transferCounter in 1.2.1 to 1.2.3, against
     - setting DeviceModule::deviceHasError flag in 1.6.1.
-  - 4.6.2 DeviceModule::recoveryMutex protects (*)
+  - 4.5.2 DeviceModule::recoveryMutex protects (*)
     - writing the DeviceModule::recoveryHelpers to the device and clearing the DeviceModule::deviceHasError flag in 2.3.5 to 2.3.8, against
     - updating the DeviceModule::recoveryHelpers in 1.3.
-  - 4.6.3 DeviceModule::initialValueMutex protects (*)
+  - 4.5.3 DeviceModule::initialValueMutex protects (*)
     - the start of a read operation in 1.4.4, against
     - the setup phase of a device until it has been opened and recovered for the very first time in 2.1 to 2.9.
@@ -128,28 +128,29 @@ Note: This section defines the internal interface on a low level. Helper functio
 - 4.3.2 A shared lock (in contrast to an exclusive lock) is used for the same reasons as in 4.2.
-- 4.6.1 This prevents a race condition in 2.3.15. If a (synchronous) transfer might be started after DeviceModule::deviceHasError has been set, the barrier for new transfers in 2.3.15 would not be effective and the transfer might be even executed only after the device has been re-openend (2.3.1) but before the recovery is complete.
+- 4.5.1 This prevents a race condition in 2.3.15. If a (synchronous) transfer might be started after DeviceModule::deviceHasError has been set, the barrier for new transfers in 2.3.15 would not be effective and the transfer might be even executed only after the device has been re-opened (2.3.1) but before the recovery is complete.
-- 4.6.2 This prevents data loss due to a race condition. If the ExceptionHandlingDecorator would update the corresponding DeviceModule::RecoveryHelpers list entry only after it has been written to the device in 2.3.5, but the ExceptionHandlingDecorator would decide not to execute the write operation (1.2) because the DeviceModule thread is still before 2.3.8, the data would not be written to the device at all.
+- 4.5.2 This prevents data loss due to a race condition. If the ExceptionHandlingDecorator would update the corresponding DeviceModule::RecoveryHelpers list entry only after it has been written to the device in 2.3.5, but the ExceptionHandlingDecorator would decide not to execute the write operation (1.2) because the DeviceModule thread is still before 2.3.8, the data would not be written to the device at all.
-- 4.6.3 This implements freezing reads until the initial value can be read, cf. 4.2.
+- 4.5.3 This implements freezing reads until the initial value can be read, cf. 4.2.
 \subsection spec_execptionHandling_high_level_implmentation_decorator B.1 ExceptionHandlingDecorator
-- 1.1 A second, undecorated copy of each writeable device register accessor (*) is used as a so-called recoveryAccessor by the ExceptionHandlingDecorator and the DeviceModule. These recoveryAccessor are used to set the initial values of registers when the device is opened for the first time and to recover the last written values during the recovery procedure.
-  - 1.1.1 The recoveryAccessor is stored by the DeviceModule with additional meta data in a so-called RecoveryHelper data structure, which contains:
-    - the recoveryAccessor itself,
-    - the VersionNumber of the (potentially unwritten) data stored in the accessor,
-    - an ordering parameter which determines the order of write opereations during recovery.
-    - an atomic flag which indicates whether the value in the recoveryAccessor has already been written to data. (*)
-  - 1.1.2 Ordering can be done per device (*), hence each DeviceModule has one 64-bit atomic counter which is incremented for each write operation and the value is stored in the ordering parameter for the recoveryAccessor.
-  - 1.1.3 The RecoveryHelper object may be accessed only under a lock to prevent concurrent access during recovery. The lock shall be shared to allow concurrent write operations of different registers - only the DeviceModule needs to obtain an exclusive lock during recovery. The lock is obained by the ExceptionHandlingDecorators via DeviceModule::getRecoverySharedLock().
+- 1.1 A second, undecorated copy of each writeable device register accessor (*), the so-called recovery accessor, is stored in the DeviceModule::recoveryHelpers. These recoveryHelpers are used to set the initial values of registers when the device is opened for the first time and to recover the last written values during the recovery procedure.
+  - 1.1.1 The DeviceModule::recoveryHelpers is a list of RecoveryHelper objects, which each contain:
+    - RecoveryHelper::accessor, the recovery accessor itself,
+    - RecoveryHelper::versionNumber, the VersionNumber of the (potentially unwritten) data stored in the value buffer of the accessor,
+    - RecoveryHelper::writeOrder, an ordering parameter which determines the order of write operations during recovery.
+    - RecoveryHelper::wasWritten, an atomic flag which indicates whether the data in the value buffer of the RecoveryHelper::accessor has already been written to the device. (*)
+  - 1.1.2 Ordering can be done per device (*), hence each DeviceModule has one 64-bit atomic counter which is incremented for each write operation and the value is stored in RecoveryHelper::writeOrder.
+  - 1.1.3 The RecoveryHelper objects may be accessed only under a lock, see 4.3.
-- 1.3 In doPreWrite() the recoveryAccessor with the version number and ordering parameter is updated, and the written flag is cleared. This has to happen while holding the shared recovery lock.
+- 1.3 In doPreWrite() the RecoveryHelper is updated. This has to happen while holding the shared recovery lock.
   - 1.3.0 This step needs to be done unconditionally at the very beginning of doPreWrite(), before 1.2 and before delegating preWrite(). (*)
   - 1.3.1 If the written flag was previously not set, the return value of doWriteTransfer() must be forced to true (data lost).
-  - 1.3.2 The check wheterh to skip the transfer (cf. 1.2) has to be done without releasing the lock between the write to the recoveryAccessor and the check. (*)
+  - 1.3.x Update the value buffer of the RecoveryHelper::accessor
+  - 1.3.2 The check whether to skip the transfer (cf. 1.2) has to be done without releasing the lock between the update of the RecoveryHelper and the check. (*)
 - 1.2 In doPreRead()/doPreWrite(), it must be decided whether to execute xxxTransferYyy(). This part requires a shared lock on the DeviceModule::errorMutex.
   - 1.2.1 xxxTransferYyy() is <i>not</i> executed, if DeviceModule::deviceHasError == true and either:
@@ -169,9 +170,9 @@ Note: This section defines the internal interface on a low level. Helper functio
 - 1.5 In doPostRead()/doPostWrite():
   - 1.5.0 Delegate postRead() / postWrite() (see 1.6)
   - 1.5.1 If there was no exception, set ExceptionHandlingDecorator::previousReadFailed = false (cf. 1.2.1 and 1.6.3.1).
-  - 1.5.3 In doPostWrite() the recoveryAccessor's written flag is set if the write was successful (no exception thrown; data lost flag does not matter here). (*)
-  - 1.5.4 In doPostRead(), if no exception was thrown, end overriding the DataValidity returned by the accessor (cf. 1.6.2).
   - 1.5.2 If the DeviceModule::transferCounter was incremented in 1.2.3, decrement it. (*)
+  - 1.5.3 In doPostWrite() the RecoveryHelper::wasWritten flag is set if the write was successful (no exception thrown; data lost flag does not matter here). (*)
+  - 1.5.4 In doPostRead(), if no exception was thrown, end overriding the DataValidity returned by the accessor (cf. 1.6.2).
 - 1.6 In doPostRead()/doPostWrite(), any runtime_error exception thrown by the delegated postRead()/postWrite() is caught (*). The following actions are in case of an exception:
   - 1.6.1 The error is reported to the DeviceModule via DeviceModule::reportException(). This automatically sets DeviceModule::deviceHasError to true. From this point on, no new transfers will be started.(*)
@@ -187,15 +188,13 @@ Note: This section defines the internal interface on a low level. Helper functio
 \subsubsection spec_execptionHandling_high_level_implmentation_decorator_comments (*) Comments
-- 1.1 Possible future change: Output accessors can have the option not to have a recovery accessor. This is needed for instance for "trigger registers" which start an operation on the hardware. Also void registers don't have recovery accessors (once the void data type is supported).
+- 1.1 Possible future change: Output accessors can have the option not to have a RecoveryHelper. This is needed for instance for "trigger registers" which start an operation on the hardware. Also void registers don't have a RecoveryHelper (once the void data type is supported by ChimeraTK).
-- 1.1.1 The written flag cannot be replaced by comparing the version number of the recoveryAccessor and the version number stored in the RecoveryHelper, because normal writes (without exceptions) would not update the version number of the recoveryAccessor.
-- 1.1.1 The flag is atomic so it can be set without getting the recoveryLock again in doPostRead(). This has to happen before calling DeviceModule::stopTransfer() to ensure the DeviceModule() does not start the recovery yet.
-  When clearing it in doPreRead(), and setting it in the DeviceModule during recovery, the recoveryLock must be held.
+- 1.1.1 The written flag cannot be replaced by comparing RecoveryHelper::accessor.getCurrentVersion() and RecoveryHelper::versionNumber, because normal writes (without exceptions) would not update the version number of the RecoveryHelper::accessor. The written flag is atomic so it can be set without getting the recoveryLock again in doPostWrite(). This has to happen before calling DeviceModule::stopTransfer() to ensure the DeviceModule does not start the recovery yet. When clearing it in doPreWrite(), and setting it in the DeviceModule during recovery, the recoveryLock must be held (see 4.5.2).
 - 1.1.2 The ordering guarantee cannot work across DeviceModules anyway. Different devices may go offline and recover at different times. Even in case of two DeviceModules which actually refer to the same hardware device there is no synchronisation mechanism which ensures the recovering procedure is done in a defined order.
-- 1.3.0 Updating the recoveryHelper first ensures that no data is lost, even if the write operation attempt is concurrent with a recovery. See 4.6.2.
+- 1.3.0 Updating the recoveryHelper first ensures that no data is lost, even if the write operation attempt is concurrent with a recovery. See 4.5.2.
 - 1.3.2 Extending the duration of the lock until the decision whether to skip the transfer will prevent unncessary duplicate writes, which otherwise could occur if the DeviceModule went through the whole critical section 2.3.2 to 2.3.10 in between.
@@ -205,9 +204,9 @@ Note: This section defines the internal interface on a low level. Helper functio
 - 1.4.4 The transferCounter is already incremeted at this point. It is acceptable to freeze anyway in this case by waiting on the initialValueMutex, because the DeviceModule release the mutex after the first successful recovery and never obtains it again, and this happens before it waits for the transferCounter to become 0 in 2.3.15.
-- 1.5.2 The state of DeviceModule::deviceHasError does not matter here. The counter always MUST be decreased after a transfer (if it has been incremented in the corresponding preXxx()), whether the transfer failed or not. Also, this must happen after 1.5.3 ===> why? DeviceModule::transferCounter > 0 prevents the DeviceModule from starting the recovery, but during the recovery the written flag will also just be set and not read. The written flag is merely used to determine in the next write whether data has been lost (which is the case if the written flag is not set).
+- 1.5.2 The state of DeviceModule::deviceHasError does not matter here. The counter always MUST be decreased after a transfer (if it has been incremented in the corresponding preXxx()), whether the transfer failed or not. Note: the exact place of decrementing the counter within doPostXxx() does not matter, it just has to be done after delegating to postXxx(). The other actions in doPostXxx() have no influence on the behaviour of the DeviceModule.
-- 1.5.3 The written flag for the recoveryAccessor is used to report loss of data. If the loss of data is already reported directly, it should not later be reported again. Hence the written flag is set even if there was a loss of data in this context.
+- 1.5.3 The RecoveryHelper::wasWritten flag is used to report loss of data. If the loss of data is already reported directly, it should not later be reported again. Hence the written flag is set even if there was a loss of data in this context.
 - 1.6 Remember: exceptions from other phases are redirected to the post phase by the TransferElement base class.
@@ -219,25 +218,33 @@ Note: This section defines the internal interface on a low level. Helper functio
 \subsection spec_execptionHandling_high_level_implmentation_deviceModule B.2 DeviceModule
 - 2.1 The application always starts with all devices as closed. For each device, the initial value for Devices/<alias>/status is set to 1 and the initial value for Devices/<alias>/message is set to an error that the device has not been opened yet (the message will be overwritten with the real error message if the first attempt to open fails, see 2.3.1).
+
+
 - 2.2 The DeviceModule takes care that ExceptionHandlingDecorators initally do not perform any read or write operations, but freeze (cf. 1.4). This happens before running any prepare() of an ApplicationModule, where the first write calls to ExceptionHandlingDecorators might be done.
+
+
 - 2.3 In the DeviceModule thread, the following procedure is executed (in a loop until termination):
+
+
   - 2.3.1 The DeviceModule tries to open the device until it succeeds and isFunctional() returns true.
     - 2.3.1.1 If the very first attempt to open the device after the application start fails, the error message of the exception is used to overwrite the content of Devices/<alias>/message. Otherwise error messages of exceptions thrown by Device::open() are not visible.
   - New position for 2.3.6 The queue of reported exceptions is cleared. (*)
+
   - 2.3.3 Check that all registers on DeviceModule::listOfReadRegisters are isReadable() and all registers on DeviceModule::listOfWriteRegisters are isWriteable().
     - 2.3.3.1 This involves obtaining an accessor for the register first, which is discarded after the check.
     - 2.3.3.2 If there is an exception, update Devices/<alias>/message with the error message and go back to 2.3.1.
     - 2.3.3.3 If one of the accessors does not meet this condition, throw a ChimeraTK::logic_error.
+
   - 2.3.4 Device is initialised by iterating initialisationHandlers list.
     - 2.3.4.1 If there is an exception, update Devices/<alias>/message with the error message and go back to 2.3.1.
-  - New positon of 2.3.2 Obtain lock for accessing recoveryAccessors.
-  - 2.3.5 All valid recoveryAccessors are written in the same order they were originally written.
-    - 2.3.5.1 A recoveryAccessor is considered "valid", if it has already received a value, i.e. its current version number is not {nullptr} any more.
+
+  - New position of 2.3.2 Obtain unique lock on DeviceModule::recoveryMutex.
+
+  - 2.3.5 Call write() on all valid RecoveryHelper::accessor, in the ascending order of the DeviceModule::writeOrder.
+    - 2.3.5.1 A RecoveryHelper::accessor is considered "valid", if it has already received a value, i.e. RecoveryHelper::versionNumber != {nullptr}
     - 2.3.5.2 If there is an exception, update Devices/<alias>/message with the error message, release the lock and go back to 2.3.1.
+
   - 2.3.7 Devices/<alias>/status is set to 0 and Devices/<alias>/message is set to an empty string.
   - 2.3.8 DeviceModule allows ExceptionHandlingDecorators to execute reads and writes again (cf. 2.3.14)
   - 2.3.9 All frozen read operations (cf. 1.4.4) are notified via DeviceModule::errorIsResolvedCondVar.
-  - 2.3.10 Release lock for recoveryAccessors.
+  - 2.3.10 Release lock on DeviceModule::recoveryMutex (was obtained in 2.3.2).
   - 2.3.11 The DeviceModuleThread waits for the next reported exception. The call to reportException in the other thread has already set deviceHasError to true (*). From this point on, no new transfers will be started.
   - 2.3.12 An exception is received.
   - 2.3.13 Devices/<alias>/status is set to 1 and Devices/<alias>/message is set to the first received exception message.
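Reviewer sketch (not part of the patch): the RecoveryHelper bookkeeping from 1.1.1, 1.1.2, 1.3 and 2.3.5 could look roughly like the following. This is a minimal illustration under stated assumptions, not the ChimeraTK implementation. `DeviceModuleSketch`, `accessorWrite`, `writeOrderCounter`, `addHelper()`, `updateForWrite()` and `writeRecoveryValues()` are hypothetical names. The real accessor is replaced by a callback, and the `VersionNumber{nullptr}` state is modelled by an empty `std::optional`.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for the RecoveryHelper described in 1.1.1.
struct RecoveryHelper {
  std::function<void()> accessorWrite;        // assumption: stands in for accessor.write()
  std::optional<std::uint64_t> versionNumber; // empty models "{nullptr}": no value received yet
  std::uint64_t writeOrder{0};                // ordering parameter, see 1.1.2
  std::atomic<bool> wasWritten{false};        // atomic written flag, see 1.1.1
};

// Hypothetical, heavily reduced DeviceModule: only the recovery bookkeeping.
struct DeviceModuleSketch {
  std::shared_mutex recoveryMutex;                 // see 4.5.2
  std::atomic<std::uint64_t> writeOrderCounter{0}; // one counter per device, see 1.1.2
  std::vector<std::shared_ptr<RecoveryHelper>> recoveryHelpers;

  std::shared_ptr<RecoveryHelper> addHelper(std::function<void()> write) {
    auto helper = std::make_shared<RecoveryHelper>();
    helper->accessorWrite = std::move(write);
    recoveryHelpers.push_back(helper);
    return helper;
  }

  // 1.3: update the helper while holding the shared recovery lock.
  void updateForWrite(RecoveryHelper& helper, std::uint64_t version) {
    std::shared_lock<std::shared_mutex> lock(recoveryMutex);
    helper.versionNumber = version;
    helper.writeOrder = ++writeOrderCounter;
    helper.wasWritten = false;
  }

  // 2.3.2, 2.3.5 and 2.3.10: write all valid helpers in ascending writeOrder
  // while holding the unique recovery lock.
  void writeRecoveryValues() {
    std::unique_lock<std::shared_mutex> lock(recoveryMutex);
    std::vector<RecoveryHelper*> valid;
    for (auto& helper : recoveryHelpers) {
      if (helper->versionNumber.has_value()) valid.push_back(helper.get()); // 2.3.5.1
    }
    std::sort(valid.begin(), valid.end(), [](const RecoveryHelper* a, const RecoveryHelper* b) {
      return a->writeOrder < b->writeOrder;
    });
    for (auto* helper : valid) {
      helper->accessorWrite();
      helper->wasWritten = true; // set during recovery under the lock, cf. comment 1.1.1
    }
  }
};
```

The shared lock in `updateForWrite()` lets writes to different registers proceed concurrently, while `writeRecoveryValues()` excludes all of them for the duration of the recovery write-back. Error handling (2.3.5.2) and the data-lost reporting of 1.3.1 are deliberately left out of the sketch.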