// put the namespace around the doxygen block so we don't have to give it all the time in the code to get links
namespace ChimeraTK {
\page spec_execptionHandling Technical specification: Exception handling for device runtime errors
<b>DRAFT VERSION, WRITE-UP IN PROGRESS!</b>
\section spec_execptionHandling_changes Recent changes
This section lists bigger recent changes, which might be hard to track due to restructuring the document at the same time. This section will go away once the changes have been reviewed.
- Write never blocks in case of exceptions. The following spec points (with discussion why) were hence removed/replaced. See also the mattermost channel.
- Write operations will block immediately until the device has been recovered and the write operation has been completed. [TBD: is this really a good idea? <b>COMMENT</b>: The order of write operations is still not guaranteed through the recovery accessors (which maybe should be changed), and blocking writes have some severe drawbacks. Not only in fan outs but also in normal ApplicationModules blocking writes will prevent propagation of DataValidity flags! Blocking writes might help if a sequence of values is written to the same register - this is not handled by the recovery accessor. But if a handshake register is read back in between the writes, the situation can already be handled properly (check DataValidity flag, restart sequence after recovery). Maybe blocking writes create more problems than they solve!? On the other hand, how does the application then know that a write() has no effect yet? E.g. a PI controller might wind up if actuator and sensor are on different devices and the actuator fails. Then again, how is this different from failing actuator hardware without breaking the communication? Some form of a status readback of the actuator again cures the situation. I think I am in favour of "fire-and-forget" writes.].
- Write should not block in case of an exception for the outputs of ThreadedFanOut / TriggerFanOut.
- According to \link spec_initialValuePropagation \endlink, writes in ApplicationModules do not block before the first successful read in the main loop.
- The order of writes during recovery (through recoveryAccessors) is now guaranteed to be the same as the original writes.
- Direct access to the DataFaultCounter is not necessary. Since the spec says the behavior should be transparent whether a connection is directly made to the device or another ApplicationModule/FanOut is in between, it is sufficient to override the flag returned by ExceptionHandlingDecorator::dataValidity() in case of an exception state. This greatly simplifies the implementation and does not change the behavior.
\section spec_execptionHandling_intro Introduction
Exceptions are handled by ApplicationCore in a way that the application developer does not need to care much about it.
ChimeraTK::runtime_error exceptions are caught by the framework and are reported to the DeviceModule. The DeviceModule handles the exception and periodically tries to open the device. Communication with the faulty device is blocked or delayed until the device is functional again. In case of several devices, only the faulty device is blocked. Faulty devices do not prevent the application from starting; only the parts of the application that depend on the faulty device wait for the device to come up.
Input variables of ApplicationModules which cannot be read due to a faulty device will set and propagate the DataValidity::faulty flag (see also \link spec_dataValidityPropagation \endlink).
When the device is functional, it will be (re)initialised by using application-defined initialisation handlers, and the last known values of its process variables will be recovered.
\section spec_execptionHandling_behavior A. Behavioural description
- 1. All ChimeraTK::runtime_error exceptions thrown by device register accessors are handled by the framework and are never exposed to user code in ApplicationModules.
- 2 When an exception has been received (thrown by a device register accessor in an ApplicationModule, FanOut etc.):
- 2.1 The exception status is published as a process variable together with an error message.
- 2.1.1 The variable Devices/<alias>/status contains a boolean flag whether the device is in an error state
- 2.1.2 The variable Devices/<alias>/message contains an error message, if the device is in an error state, or an empty string otherwise.
- 2.1 Read operations will propagate the DataValidity::faulty to the owning module / fan out (without changing the actual value).
- 2.2 The normal module algorithm code will be continued, to allow this flag to propagate to the outputs in the same way as if it had been received through the process variable itself (c.f. 9.).
- 2.3 Blocking read operations will block after the flag has been read and propagated once (i.e. on the second blocking read of the same accessor).
- 2.4 Non-blocking read operations (incl. readLatest) never block.
- 2.5 Asynchronous read operations behave analogously to 2.3: The TransferFuture, which was valid while the exception was received, is fulfilled once, the DataValidity::faulty is propagated to the owning module and the value is left unchanged. The TransferFuture will only be fulfilled again after the device has been recovered.
- 2.6 Write operations never block. In case of an exception (new or persisting), the actual write operation will be delayed until the device is functional and recovered again. The same mechanism as used for 3.1.2 is used here, hence the order of write operations is guaranteed across accessors, but only the latest written value of each accessor prevails. (*)
- 2.6.1 The return value of write() indicates whether data was lost in the transfer. If the write has to be delayed due to an exception, the return value will be true if a previously delayed and not-yet-written value is discarded in the process, and false otherwise.
- 2.6.2 When the delayed value is finally written to the device after recovering the exception, it is guaranteed that no data loss happens (writes with data loss will be retried).
- 2.7 In case of exceptions, there is no guaranteed realtime behavior, not even for "non-blocking" transfers. (*)
- 3 The framework tries to resolve an exception state by periodically re-opening the faulty device.
- 3.1 After successfully re-opening the device, a recovery procedure is executed before allowing any read/write operations from the ApplicationModules and FanOuts again. This recovery procedure involves:
- 3.1.1 the execution of so-called initialisation handlers (cf. 3.2), and
- 3.1.2 restoring all registers that have been written since the start of the application with their latest values. The register values are restored in the same order they were written. [<b>NEW REQUIREMENT!</b>] (*)
- 3.1.3 Finally, Devices/<alias>/deviceBecameFunctional is written to inform any module subscribing this variable about the finished recovery. (*)
- 3.2 Any number of initialisation handlers can be added to the DeviceModule in the user code. Initialisation handlers are callback functions which will be executed when a device is opened for the first time and after a device recovers from an exception, before any process variables are written. See DeviceModule::addInitialisationHandler().
- 4 The behavior at application start (when all devices are still closed at first) is similar to the case of a later received exception. The only differences are mentioned in 4.2.
- 4.1 Even if some devices are initially in a persisting error state, the part of the application which does not interact with the faulty devices starts and works normally.
- 4.2 Initial values are correctly propagated after a device is opened. See \link spec_initialValuePropagation \endlink. Especially, no read function (even readNonBlocking/readLatest) will return before an initial value has been received.(*)
- 5 Exception handling and DataValidity flag propagation is implemented such that it is transparent to a module whether it is directly connected to a device, or whether a fanout or another application module is in between.
- 6 ChimeraTK::logic_error exceptions are left unhandled and will terminate the application. These errors may only occur in the initialisation phase (up to the point where all devices are opened and initialised) and point to a severe configuration error which is not recoverable. (*)
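
The read-side rules in 2.3, 2.4 and 4.2 can be summarised as a small decision function. The following is only a sketch under stated assumptions: all names (ReadKind, ReadOutcome, ReadState, onReadException, onSuccessfulRead) are hypothetical and do not exist in ApplicationCore, and the actual blocking is replaced by an outcome value so the logic can be inspected in isolation.

```cpp
#include <cassert>

// Hypothetical names throughout; this is not the ApplicationCore API.
enum class ReadKind { blocking, nonBlocking };
enum class ReadOutcome { returnFaultyOnce, returnNoNewData, blockUntilRecovered };

// Per-accessor bookkeeping assumed by this sketch.
struct ReadState {
  bool hasInitialValue{false};         // false until the first successful read
  bool faultyAlreadyPropagated{false}; // true after one blocking read returned with faulty
};

// Decide what a read operation does when the device is in an exception state.
ReadOutcome onReadException(ReadKind kind, ReadState& state) {
  // 4.2: without an initial value, no read function returns, not even non-blocking ones.
  if(!state.hasInitialValue) return ReadOutcome::blockUntilRecovered;
  // 2.4: non-blocking reads never block; they report "no new data".
  if(kind == ReadKind::nonBlocking) return ReadOutcome::returnNoNewData;
  // 2.3: the first blocking read returns once so DataValidity::faulty can propagate.
  if(!state.faultyAlreadyPropagated) {
    state.faultyAlreadyPropagated = true;
    return ReadOutcome::returnFaultyOnce;
  }
  // 2.3: any further blocking read waits for the recovery.
  return ReadOutcome::blockUntilRecovered;
}

// A successful read ends the exception state of the accessor.
void onSuccessfulRead(ReadState& state) {
  state.hasInitialValue = true;
  state.faultyAlreadyPropagated = false;
}
```

In this sketch, returnFaultyOnce and returnNoNewData both imply that the value is left unchanged and DataValidity::faulty is reported to the owning module.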
\subsection spec_execptionHandling_behavior_comments (*) Comments
- 2.6 / 3.1.3 If timing is important for write operations (e.g. must not write a sequence of registers too fast), or if multiple values need to be written to the same register in sequence, the application cannot fully rely on the framework's recovery procedure. The framework hence provides the process variable Devices/<alias>/deviceBecameFunctional for each device, which will be written each time the recovery procedure is completed (cf. 3.1.3). ApplicationModules which implement such timed sequence need to receive this variable and restart the entire sequence after the recovery.
- 2.7 Even non-blocking read and write operations are not truly non-blocking, since they are still synchronous. The "non-blocking" guarantee only means that the operation does not block for an extended period of time until the fault state has been cleared. For the duration of the recovery procedure and of course for timeout periods these operations may still block.
- 3.1.2 For some applications, the order of writes may be important, e.g. if firmware expects this. Please note that the VersionNumber is insufficient as a sorting criterion, since many writes may have been done with the same VersionNumber (in an ApplicationModule, the VersionNumber used for the writes is determined by the largest VersionNumber of the inputs).
- 4.2 DataValidity::faulty is set at first by default, so there is no need to propagate this flag initially. To prevent race conditions and undefined behavior, it even needs to be made sure that the flag is not propagated unnecessarily. The behavior of non-blocking reads presents a slight asymmetry between the initial device opening and a later recovery. This will in particular be visible when restarting a server while a device is offline. If a module only uses readLatest()/readNonBlocking() (= read() for poll-type inputs) for the offline device, the module was still running before the server restart using the last known values for the dysfunctional registers (and flagging all outputs as faulty). After the restart, the module has to wait for the initial value and hence will not run until the device becomes functional again. To make this behavior symmetric, one would need to persist the values of device inputs. Since this only affects a corner case in which anyway no usable output is produced, the behavior is considered acceptable.
- 6. In future, logic_errors may also be handled, so configuration errors can be presented nicely to the control system. This may be important especially since logic_errors may also depend on the configuration of external components (devices). If e.g. a device is changed (e.g. the device is another control system application which has been modified), logic_errors may be thrown in the recovery phase, even though the device had been successfully initialised previously.
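
The write semantics of 2.6 and 2.6.1 (never block, keep only the latest value, report a discarded pending value) can be illustrated with a minimal model of a single register. This is a sketch, not the framework implementation: DelayedWriteSlot and its members are hypothetical names, the device is simulated by a plain integer, and data loss inside a successful transfer is not modelled.

```cpp
#include <cassert>

// Hypothetical model of one writeable register together with its delayed-write slot.
struct DelayedWriteSlot {
  int pendingValue{0}; // latest value handed to write()
  bool written{true};  // true once pendingValue has reached the (simulated) device
  int deviceValue{0};  // what the simulated device currently holds

  // Never blocks. Returns true if a previously delayed, not-yet-written value was discarded.
  bool write(int value, bool deviceFaulty) {
    bool dataLost = !written; // an older value was still waiting, so it is lost now
    pendingValue = value;
    written = false;
    if(!deviceFaulty) {
      deviceValue = pendingValue; // normal case: the transfer happens right away
      written = true;
    }
    return dataLost;
  }

  // Called once the device has recovered: write the latest delayed value, if any.
  void recover() {
    if(!written) {
      deviceValue = pendingValue;
      written = true;
    }
  }
};
```

A module that must not lose intermediate values (e.g. a timed sequence written to one register) cannot rely on this mechanism alone and has to restart its sequence after recovery, as described in the comment on 2.6 / 3.1.3.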
\section spec_execptionHandling_high_level_implmentation B. High-level description of the implementation
- 1. A so-called ExceptionHandlingDecorator is placed around all device register accessors (used in ApplicationModules and FanOuts). It is responsible for catching the exceptions and implementing most of the behavior described in A.2.
- 1.1 The ExceptionHandlingDecorator will catch any runtime_error exception thrown in postRead/postWrite (exceptions from other stages are delayed there by the TransferElement base class).
- 1.1.1 The error is reported to the DeviceModule
- 1.1.2 For readable accessors: override DataValidity returned by the accessor to faulty until next successful read operation
- 1.1.2.1 The code decorating the accessors has to make sure that the ExceptionHandlingDecorator is "inside" the MetaDataPropagatingRegisterDecorator, so the overridden DataValidity flag in case of an exception is properly propagated to the owning module/fan out.
- 1.1.3 Action depending on the calling operation:
- 1.1.3.1 read (push-type inputs): The first "blocking" read call returns immediately. The ExceptionHandlingDecorator remembers that it is in an exception state. The calling module thread will continue and propagate the DataValidity::faulty flag (cf. 1.1.2). The second call to read() on the same accessor will block, if the exception state still prevails. The mechanism for this blocking is described in 1.3.
- 1.1.3.2 readNonBlocking / readLatest / read (poll-type inputs): Just return false (no new data). The calling module thread will continue and propagate the DataValidity::faulty flag (cf. 1.1.2).
- 1.1.3.3 any read operation: If no initial value is present yet (current version number is still VersionNumber{nullptr}), the two previous rules do not apply. Instead, the read operation will block until it can succeed (cf. A.4.2), using the mechanism described in 1.3.
- 1.1.3.4 write: Do not block. Write will be later executed by the DeviceModule (cf. 1.2)
- 1.2 A second, undecorated copy of each writeable device register accessor is used as a so-called recoveryAccessor by the ExceptionHandlingDecorator and the DeviceModule. These recoveryAccessors are used to set the initial values of registers when the device is opened for the first time and to recover the last written values during the recovery procedure. (*)
- 1.2.1 Along with each recoveryAccessor, an ordering parameter is stored. Ordering can be done per device (*), hence each DeviceModule has one 64-bit atomic counter which is incremented for each write operation, and the value is stored in the ordering parameter for the recoveryAccessor.
- 1.2.2 Also a flag is stored with each recoveryAccessor which indicates whether the value in the recoveryAccessor has already been written to the device.
- 1.2.3 The recoveryAccessor, its ordering parameter and the written flag may be accessed only under a lock to prevent concurrent access during recovery. The lock shall be shared to allow concurrent write operations of different registers - only the DeviceModule needs to obtain an exclusive lock during recovery.
- 1.2.4 In doPreWrite() the recoveryAccessor with the ordering parameter is updated, and the written flag is cleared (*)
- 1.2.4.1 If the written flag was previously not set, the return value of doWriteTransfer() must be forced to true (data lost).
- 1.2.5 In doPostWrite() the recoveryAccessor's written flag is set if the write was successful (no exception thrown; data lost flag does not matter here). (*)
- 1.3 As described in 1.1.3, the ExceptionHandlingDecorator blocks certain read operations in case of exceptions. This is done as follows:
- 1.3.1 Blocking will take place in postRead, after the exception has been reported to the DeviceModule.
- 1.3.2 Wait until the DeviceModule allows the operation to continue (cf. 2.3.7)
- 1.3.3 (Re-)tries to get the value. Exceptions occurring during the retry are handled in the same way as in normal read operations (see 1.1).
- 1.4 In doPreRead/doPreWrite, check if fault state already prevails. If yes, the actual transfer will be skipped. (cf. 2.2 or 2.3.13)
- 1.4.1 The check for a prevailing fault state has to be done while holding the recoveryAccessor lock, without releasing it between the write to the recoveryAccessor and the check. (*)
- 2. DeviceModule:
- 2.1 The application always starts with all devices as closed. For each device, the initial value for Devices/<alias>/status is set to 1 and the initial value for Devices/<alias>/message is set to an error that the device has not been opened yet (the message will be overwritten with the real error message if the first attempt to open fails, see 2.3.1).
- 2.2 The DeviceModule takes care that ExceptionHandlingDecorators initially do not perform any read or write operations, but block (cf. 1.4). This happens before running any prepare() of an ApplicationModule, where the first write calls to ExceptionHandlingDecorators might be done.
- 2.3 In the DeviceModule thread, the following procedure is executed (in a loop until termination):
- 2.3.1 The DeviceModule tries to open the device until it succeeds and isFunctional() returns true.
- 2.3.1.1 If the very first attempt to open the device after the application start fails, the error message of the exception is used to overwrite the content of Devices/<alias>/message. Otherwise error messages of exceptions thrown by Device::open() are not visible.
- 2.3.2 Obtain lock for accessing recoveryAccessors.
- 2.3.3 Device is initialised by iterating initialisationHandlers list.
- 2.3.3.1 If there is an exception, update Devices/<alias>/message with the error message, release the lock and go back to 2.3.1. (*)
- 2.3.4 All valid recoveryAccessors are written in the same order they were originally written [<b>NEW REQUIREMENT! see A.4.2</b>].
- 2.3.4.1 A recoveryAccessor is considered "valid", if it has already received a value, i.e. its current version number is not {nullptr} any more.
- 2.3.4.2 If there is an exception, update Devices/<alias>/message with the error message, release the lock and go back to 2.3.1. (*)
- 2.3.5 The queue of reported exceptions is cleared. (*)
- 2.3.6 Devices/<alias>/status is set to 0 and Devices/<alias>/message is set to an empty string.
- 2.3.7 DeviceModule allows ExceptionHandlingDecorators to execute reads and writes again (cf. 2.3.13)
- 2.3.8 All blocked read operations (cf. 1.1.3) are notified.
- 2.3.9 Release lock for recoveryAccessors.
- 2.3.10 The DeviceModuleThread waits for the next reported exception.
- 2.3.11 An exception is received.
- 2.3.12 Devices/<alias>/status is set to 1 and Devices/<alias>/message is set to the first received exception message.
- 2.3.13 From this point on, all ExceptionHandlingDecorators for this device must prevent new read and write operations from starting (see also 1.4).
- 2.3.14 The device module waits until all running read and write operations of ExceptionHandlingDecorators have ended. (*)
- 2.3.15 The thread goes back to 2.3.1 and tries to re-open the device.
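
The ordering mechanism of 1.2.1 and 2.3.4 can be sketched as follows: every write stamps its recovery entry with the next value of a per-device counter, and the recovery procedure writes all valid entries sorted by that stamp. The names RecoveryEntry and RecoveryList are hypothetical, and the locking described in 1.2.3 is deliberately left out to keep the example short.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a recoveryAccessor with its ordering parameter.
struct RecoveryEntry {
  std::string registerName;
  int value{0};
  std::uint64_t writeOrder{0}; // 0 means "never written", i.e. not valid (cf. 2.3.4.1)
};

// Hypothetical per-device list of recovery entries (no locking in this sketch).
class RecoveryList {
 public:
  std::size_t addRegister(std::string name) {
    entries_.push_back({std::move(name), 0, 0});
    return entries_.size() - 1;
  }

  // Called for each write: remember the value and stamp it with the next counter value.
  void recordWrite(std::size_t index, int value) {
    assert(index < entries_.size());
    entries_[index].value = value;
    entries_[index].writeOrder = ++counter_;
  }

  // Register names in the order in which they would be restored during recovery.
  std::vector<std::string> recoveryOrder() const {
    std::vector<RecoveryEntry> valid;
    for(const auto& e : entries_) {
      if(e.writeOrder != 0) valid.push_back(e);
    }
    std::sort(valid.begin(), valid.end(),
        [](const RecoveryEntry& a, const RecoveryEntry& b) { return a.writeOrder < b.writeOrder; });
    std::vector<std::string> names;
    for(const auto& e : valid) names.push_back(e.registerName);
    return names;
  }

 private:
  std::vector<RecoveryEntry> entries_;
  std::atomic<std::uint64_t> counter_{0}; // one counter per device (cf. 1.2.1)
};
```

Because each entry keeps only its latest stamp, a register written several times appears once in the recovery order, at the position of its last write.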
\subsection spec_execptionHandling_high_level_implmentation_comments (*) Comments
- 1.2 Possible future change: Output accessors can have the option not to have a recovery accessor. This is needed for instance for "trigger registers" which start an operation on the hardware. Also void registers don't have recovery accessors (once the void data type is supported).
- 1.2.1 The ordering guarantee cannot work across DeviceModules anyway. Different devices may go offline and recover at different times. Even in case of two DeviceModules which actually refer to the same hardware device there is no synchronisation mechanism which ensures the recovering procedure is done in a defined order.
- 1.2.5 The written flag for the recoveryAccessor is used to report loss of data. If the loss of data is already reported directly, it should not later be reported again. Hence the written flag is set even if there was a loss of data in this context.
- 1.4.1 The lock excludes that the DeviceModule is between 2.3.2 and 2.3.9. If it is right before, the device is still in fault state and the value written to the recoveryAccessor is guaranteed to be written in 2.3.4. If it is right after, the exception state has already been resolved and the real write transfer will be attempted by the ExceptionHandlingDecorator.
- 2.3.5 The exact place when this is done does not matter, as long as it is done under the lock for the recoveryAccessors.
- 2.3.14 The backend has to take care that all operations, also the blocking/asynchronous reads with "waitForNewData", terminate when an exception is thrown, so recovery can take place (see DeviceAccess TransferElement specification).
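
The core of the DeviceModule loop (2.3.1 to 2.3.6) can be sketched with a simulated device. Everything here is an assumption made for illustration: FakeDevice, DeviceStatus, InitHandler, tryRecoverOnce() and recoverUntilFunctional() are hypothetical names. The sketch also simplifies the specification: it publishes the message of every failed attempt, does not wait between attempts, and leaves out the locking and the notification of blocked readers.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical device which fails to open a configurable number of times.
struct FakeDevice {
  int failuresLeft{0};
  bool opened{false};
  void open() {
    if(failuresLeft > 0) {
      --failuresLeft;
      opened = false;
      throw std::runtime_error("cannot open device");
    }
    opened = true;
  }
};

// Stand-in for Devices/<alias>/status and Devices/<alias>/message (initial values cf. 2.1).
struct DeviceStatus {
  int status{1};
  std::string message{"Device has not been opened yet"};
};

using InitHandler = std::function<void(FakeDevice&)>;

// One pass: open, run initialisation handlers, write recovery values, clear the status.
bool tryRecoverOnce(FakeDevice& device, const std::vector<InitHandler>& initHandlers,
    const std::function<void()>& writeRecoveryAccessors, DeviceStatus& status) {
  try {
    device.open();
    for(const auto& handler : initHandlers) handler(device);
    writeRecoveryAccessors();
  }
  catch(const std::runtime_error& e) {
    // Any failure restarts the whole procedure from the open step.
    status.status = 1;
    status.message = e.what();
    return false;
  }
  status.status = 0;
  status.message.clear();
  return true;
}

// Repeat until the device is functional; returns the number of attempts needed.
int recoverUntilFunctional(FakeDevice& device, const std::vector<InitHandler>& initHandlers,
    const std::function<void()>& writeRecoveryAccessors, DeviceStatus& status) {
  int attempts = 1;
  while(!tryRecoverOnce(device, initHandlers, writeRecoveryAccessors, status)) ++attempts;
  return attempts;
}
```

The point of the sketch is the ordering guarantee: the status is only cleared after both the initialisation handlers and the recovery writes have completed without an exception.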
\section spec_execptionHandling_implmentation_details C. Implementation details
\subsection spec_execptionHandling_implmentation_details_DeviceAccess C.a. Requirements to the DeviceAccess interface:
- 1. Exceptions are reported in postRead()/postWrite()
- As the error itself always occurs in the read/write transfer, the TransferElement base class implements a mechanism to catch it and transfer the exception message into the post-read function, where it is re-thrown. This is required for three reasons:
- i. A transfer must always be complete, i.e. preXxx and postXxx must always be called. This is for instance important in case a user buffer has been swapped out, and has to be swapped back in so the user buffer stays intact in the application. Letting the exception in doXxxTransfer through would break this. (This is DeviceAccess spec.)
- ii. The transfer group calls xxxTransfer itself on a potentially exchanged hardware accessing element. All code using transfer groups would have to do exception handling itself, and the individual accessors would not behave according to this (ApplicationCore exception handling) specification when used with a transfer group. By throwing in postXxx the ExceptionHandlingDecorator can handle it, and it automatically works with transfer groups.
- iii. Asynchronous reads are executing the transfer in a different thread anyway, and have to delay the throwing to postRead.
- 2. Before throwing, each backend must make sure that the actions in doPostRead() are completed such that the user buffer of a calling accessor is intact
- 3. postRead() and postWrite() take care that the bookkeeping of ongoing transfers is done correctly, even if the called doPostXxx actions throw.
- 4. The TransferType (read, readNonBlocking, readLatest, readAsync, write, writeDestructively) is known in postRead and postWrite, so a decorator or backend can do different actions if required.
- 5. postRead() must always be called, also for failed transfers and for readNonBlocking and readLatest if there was no new data.
- 6. If a backend / doXXXTransfer implementation throws, the backend must make sure that all pending transactions will terminate. Especially transfers which implement reading with waitForNewData must return with an error, because no new data will arrive while the device is broken. These transfers must be interruptible.
- 7. In the level closest to the hardware the exception will be caught in the transfer. postXxx() (without 'do') will raise the exception. In case of the decorator-like pattern, where the call is delegated, the doPostXxx() (with 'do') will throw because it calls _impl->postXxx(). The postXxx() implementation in TransferElement must make sure that the bookkeeping in each layer is complete. If required it has to catch the exception from doPostXxx(), finish what needs to be done and re-throw.
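
Requirement 1 (exceptions from the transfer are delayed and re-thrown in postRead, so preXxx and postXxx always run as a pair) can be sketched with std::exception_ptr. SketchTransferElement and FailingAccessor are hypothetical classes written for this illustration; they are not the DeviceAccess TransferElement and ignore most of its interface.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Hypothetical, heavily reduced transfer element: read path only.
class SketchTransferElement {
 public:
  virtual ~SketchTransferElement() = default;

  void preRead() { doPreRead(); }

  // The transfer itself never lets a runtime_error escape; it stores it instead.
  void readTransfer() {
    try {
      doReadTransfer();
    }
    catch(const std::runtime_error&) {
      activeException_ = std::current_exception();
    }
  }

  // postRead always completes its own bookkeeping first, then re-throws the stored exception.
  void postRead() {
    doPostRead();
    if(activeException_) {
      auto e = activeException_;
      activeException_ = nullptr;
      std::rethrow_exception(e);
    }
  }

 protected:
  virtual void doPreRead() {}
  virtual void doReadTransfer() = 0;
  virtual void doPostRead() {}

 private:
  std::exception_ptr activeException_;
};

// Example accessor whose transfer always fails; tracks whether the user buffer is swapped out.
struct FailingAccessor : SketchTransferElement {
  bool bufferSwappedOut{false};
  void doPreRead() override { bufferSwappedOut = true; }
  void doReadTransfer() override { throw std::runtime_error("device unreachable"); }
  void doPostRead() override { bufferSwappedOut = false; }
};
```

With this structure a decorator only needs to catch in one place (around the delegated postRead call), and the user buffer is back in place by the time the exception becomes visible.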
\subsection spec_execptionHandling_implmentation_details_DeviceModule C.b. DeviceModule
Interfaces:
- 1. External interface
An error status and the last error message are automatically connected to the control system for each device
- /Devices/{AliasName}/message
- /Devices/{AliasName}/status
- 4.2 Internal interface to the ExceptionHandlingDecorator
- 4.2.1 A thread safe function DeviceModule::reportException() (implements 2.5.2). It does not block but only puts the exception into a lock-free queue.
- 4.2.2 A blocking way to wait for the device to become available after reporting the exception (implements 2.3.7 and 2.4.1) (as a response that report exception has been processed).
- 4.2.3 A shared mutex to prevent read and write operations before the device has been initialised (implements 1.b, 2.1, 2.3.6 and 2.6.1)
- 4.2.4 A function to add recoveryAccessors (implements 1.g)
- 4.2.5 A shared mutex to protect the recovery accessors
- 4.2.6 A counter of active transfers
- 4.2.1 A user/application can also report device errors by calling DeviceModule::reportException(). This allows, for instance, writing a watchdog module which monitors a reference register and puts
the whole device into an exception state (incl. automatic message to the CS, propagation of the DataValidity::faulty flag and recovery).
- 4.2.2 Currently implemented as a condition variable
- 4.2.2 FIXME We might also need a way to wait until the device module has seen the exception, but not recovered yet. But if it is already recovering this might take a while, so it would effectively be the same. Not clear at this moment.
- 4.2.3 Read/write operations must hold a shared lock before starting the actual read/write. This is implemented in the ExceptionHandlingDecorator. As the lock is shared, parallel write operations don't block each other inside application core. While recovering, the device module will hold an exclusive lock.
- 4.2.5 As the recovery accessors are filled in the ApplicationModule threads (or fanouts), but the writing is taking place in the device module thread, the recovery accessor's user buffer must be protected with a mutex. Again, a shared mutex is used so normal write operations can run in parallel and don't interfere with each other (each one only touches its own buffer), and the write, which touches all buffers holds an exclusive lock.
- 4.2.6 The counter is needed so the DeviceModule knows when no transfer will access the device, and the recovery accessors can be used. If the accessors would hold the shared lock, they could dead-lock each other in asynchronous transfers if accessor A holds the lock while waiting for accessor B to finish. But B is waiting for the device to recover which cannot happen because A is holding the lock.
The counter is increased while holding the lock 4.2.3, and then the lock is released again. This is sufficient to stop new accessors from starting a transfer. And the counter is there to make sure the running ones have finished.
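
The interplay of the shared lock (4.2.3) and the counter of active transfers (4.2.6) can be sketched as below. TransferGate is a hypothetical class; the method names startTransfer() and stopTransfer() follow the names used for the DeviceModule in 5.3.3.1, but their bodies here are assumptions. The key property is that the shared lock is held only while the counter is raised, never for the duration of the transfer.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <shared_mutex>

// Hypothetical gate combining the fault flag, the shared lock and the transfer counter.
class TransferGate {
 public:
  // Called in preXxx. Returns false if the device is in fault state, so the transfer is skipped.
  bool startTransfer() {
    std::shared_lock<std::shared_mutex> lock(mutex_);
    if(faulty_) return false;
    ++activeTransfers_;
    return true;
  } // the shared lock is released here; only the counter stays raised

  // Called in postXxx of every transfer that was started successfully.
  void stopTransfer() {
    assert(activeTransfers_ > 0);
    --activeTransfers_;
  }

  // DeviceModule side: from now on no new transfer may start.
  void setFaulty() {
    std::unique_lock<std::shared_mutex> lock(mutex_);
    faulty_ = true;
  }

  // DeviceModule side: recovery has finished, transfers are allowed again.
  void setRecovered() {
    std::unique_lock<std::shared_mutex> lock(mutex_);
    faulty_ = false;
  }

  // DeviceModule side: true once all running transfers have ended.
  bool idle() const { return activeTransfers_ == 0; }

 private:
  std::shared_mutex mutex_;
  bool faulty_{true}; // devices start closed, so transfers are initially not allowed
  std::atomic<int> activeTransfers_{0};
};
```

Since a running transfer does not hold the lock, an accessor waiting for another accessor cannot keep the DeviceModule from obtaining the exclusive lock, which is the dead-lock scenario described in 4.2.6.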
<b>5. ExceptionHandlingDecorator</b>
- 5.1 External interface
- 5.1.1 Provides a function that does not block writes, even if the device is not available (part of implementation of 1.i) [TBD: name of the function, maybe writeWithoutErrorBlocking() ]
- 5.1.2 There is a convenience function that allows calling this function on any transfer element. If it has an ExceptionHandlingDecorator, this function is called. Otherwise the normal
write() is executed, which does not block in case of connections inside of ApplicationCore.
- 5.2 Internal interface with other parts of ApplicationCore
- 5.2.1 Catches exception thrown in TransferElement::doPostRead()/doPostWrite() (implements 1.e)
- 5.2.2 In read operations, it informs its associated DataFaultCounter about device errors (implements 2.4.2.1 and 2.5.1)
- 5.2.3 Reports exceptions to the DeviceModule (implements 2.5.2)
- 5.3 Implementation
- 5.3.1 Writing
- 5.3.1.1 Writes to the recovery accessor before initiating the transfer (implements 2.4.2) in doPreWrite()
- 5.3.1.2 Decorates doPreWrite to acquire the shared lock described in 4.2.3, then increase the transfer counter and release the lock.
- 5.3.1.3 Decorates doPostWrite to decrease the transfer counter 4.2.6
- 5.3.1.4 Blocking writes wait in doPostWrite() until informed by the DeviceModule that the device has recovered (via 4.2.2, implements 2.5.3 for writing)
- 5.3.1.5 If doReadTransferNonBlocking()/doReadTransferLatest() must return true even in case of an exception, because eventually
- 5.3.2.1 Decorates doPreRead to acquire the shared lock described in 4.2.3, then increase the transfer counter and release the lock.
- 5.3.2.2 Decorates doPostRead to decrease the transfer counter, then perform the delegated call to postRead, which might throw, and catch here.
- 5.3.2.2 Blocking reads, or reads which have not seen a valid initial value yet, wait in doPostRead() until informed by the DeviceModule that the device has recovered (via 4.2.2, implements 2.5.3 for writing), then try a complete read cycle (incl. preRead) until they can successfully read a value (they might receive data with the faulty flag turned on by the sender, which is ok. It is a valid transfer).
- 5.3.3 Sequences of calls to the delegated preXxx(), xxxTransferYyy() and postXxx() must always follow the DeviceAccess TransferElement specification.
- 5.3.3.1 preXxx() and postXxx() must always be called in matching pairs. If a recovery is started in doPostXxx(), the failed transfer must be finished first by calling postXxx() and DeviceModule::stopTransfer(), then a completely new cycle (including DeviceModule::startTransfer() and DeviceModule::stopTransfer()) must be initiated.
- 5.3.3.2 If the transfer is not taking place at all (because in preXxx() the device is already known to be broken and no recovery shall be attempted), the delegated preXxx() and postXxx() functions must not be called as well.
<b>6. TriggerFanOut and ThreadedFanOut </b>
- 6.1 TriggerFanOut
Each TriggerFanOut reads several poll-type variables when a trigger (push type) is received. If one of the poll-type inputs is in error state, it shall not block the other variables.
To implement this, the TriggerFanOut uses the write function which does not block on device exceptions (5.1.2), (implements 1.i)
- 6.2 ThreadedFanOut
If outputs of a ThreadedFanOut also write to devices, the writes must not block the other variables in the fanout. To implement this, the ThreadedFanOut uses the non-blocking write through the convenience function described in 5.1.2 (implements 1.i)
<b>7. The server must always start even if a device is in error state.</b>
Implementation of 1.k. This section extracts some points from 1. and 2. to put the bits and pieces into context.
To make sure that the server always starts, even if some or all devices are in error state, the initial opening of the device takes place in the DeviceModule thread (inside the exception handling loop).
The device module reports its status and error messages to the control system (see 2.1, 2.3.5, 2.6.1).
Some initial values are already written in prepare(), before the threads are started. Writing these values must be delayed until the device is available. This is done by the same mechanism that is used to re-write the values after recovery. (see 10 and \link spec_initialValuePropagation \endlink)
<b>8. Propagating the DataValidity flag</b>
If a device is in error state, all its output data is marked as invalid. This invalid flag shall be propagated through the connected modules such that all data that is calculated from these invalid values is also marked invalid (see \link spec_dataValidityPropagation \endlink). The ExceptionHandlingDecorator informs the DataFaultCounter about the device state (faulty or ok, see 3.6.3 and 2.4.1.2.1)
To propagate the flag, the first blocking read after the device error returns the last value. As the DataFaultCounter knows about the device error, the data invalid flag is turned on (2.5.3-read). In order to prevent unnecessary running of modules with invalid data, the following read call blocks until the device has recovered.
After recovery the DataFaultCounter is informed that the device is OK again, and the received DataValidity of the variable is propagated (usually 'ok', but if 'faulty' is received, the data validity stays faulty).
<b>9. Device initialisation </b>
This is partly a specification of the DeviceModule. As it is strongly connected with exception handling, and in fact handled by the same code, it is mentioned here.
- 9.1 The user code can register initialisation handlers (in the constructor of the DeviceModule or using DeviceModule::addInitialisationHandler). They are executed each time after the device has successfully been opened (*)
- 9.2 Sometimes it is only possible to write parts of the device after a proper initialisation sequence (for instance reset-registers must be cleared, or communication clocks to sub-devices must be set). Hence no read or write operations may take place before this point, not even writing recovery accessors (implements 1.c, implemented by 4.2.3, 5.3.1.2 and 5.3.2.1).
- 9.3 The recovery accessors are written after the initialisation (implements 1.l).
- 9.4 The lock 4.2.3 is only released after all recovery accessors have been written, so ApplicationModules which continue afterwards find the same state as before the error when writing or reading.
Comments:
- 9.1 Successfully opened means open() did not throw, and the device reports isFunctional() as true.
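
The order of steps 9.1 to 9.4 can be sketched as follows. This is only an illustration under simplifying assumptions: DeviceSketch and RecoverySequenceSketch are hypothetical stand-ins, the handler signature is invented for the sketch, and the lock 4.2.3 is represented only by the return value.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical minimal device, only for this sketch.
struct DeviceSketch {
  bool failOnOpen{false};
  bool opened{false};
  void open() {
    if(failOnOpen) throw std::runtime_error("device not reachable");
    opened = true;
  }
  bool isFunctional() const { return opened; }
};

// Hypothetical model of one pass through the recovery sequence.
class RecoverySequenceSketch {
 public:
  using Step = std::function<void(DeviceSketch&)>;

  void addInitialisationHandler(Step handler) {
    assert(handler != nullptr);
    _initialisationHandlers.push_back(std::move(handler));
  }

  void addRecoveryWrite(Step write) {
    assert(write != nullptr);
    _recoveryWrites.push_back(std::move(write));
  }

  // Returns true if the device is usable afterwards. Only then would the
  // lock 4.2.3 be released (9.4); on failure the caller retries later.
  bool tryRecover(DeviceSketch& device) {
    try {
      device.open();
      if(!device.isFunctional()) return false;                        // comment to 9.1
      for(auto& handler : _initialisationHandlers) handler(device);   // 9.1
      for(auto& write : _recoveryWrites) write(device);               // 9.3
    }
    catch(const std::runtime_error&) {
      return false; // any step may fail; nothing is released in that case
    }
    return true;
  }

 private:
  std::vector<Step> _initialisationHandlers;
  std::vector<Step> _recoveryWrites;
};
```

The point of the sketch is the ordering: the recovery writes never run before all initialisation handlers have completed, regardless of the order in which they were registered.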
<b>10. Recovery accessors</b>
After a device has failed and recovered, it might have re-booted and lost the values of the process variables that live in the server and are written to the device. Hence these values have to be re-written after the device has recovered. The same holds for initial values which have been written before the device thread has started (see 7.), and even normal variables which have been written before the device is available, as several threads start asynchronously.
The writing after the recovery is done in the device thread. The regular register accessors (which are decorated with the ExceptionHandlingDecorator) belong to the ApplicationModule threads (or those of the fanouts), which can modify the user buffer at any time. Hence the device thread cannot use these accessors in a thread-safe way. In addition, the device module has to remember the last value which has been written in order to restore a consistent state. The ApplicationModule might already have modified its user buffer, but not have written it yet. Hence this buffer cannot be used for recovery for logical reasons either.
As a consequence a copy has to be created whenever the data is written to the device. This is implemented by a so-called recovery accessor. It is a regular second accessor to the register whose accessor has been decorated with the ExceptionHandlingDecorator, but with the special usage that the data is set in the application thread and written in the DeviceModule thread.
- 10.1 The recovery accessor is created together with the normal accessor in the connection code (in DeviceModule::writeRecoveryOpen), registered at the DeviceModule and given to the ExceptionHandlingDecorator.
- 10.2 Data is copied in doPreWrite(), before the original accessor's pre-write is called. This is the last occasion where the data is still guaranteed to be in the original accessor's user buffer. The accessor's pre-write might swap the data out, and it might never be available again (in case of a destructive write).
- 10.3 As the user buffer of the recovery accessor is written in an ApplicationModule or fanout thread, but read in the DeviceModule thread when recovering, it has to be protected by a mutex. For efficiency one single shared mutex is used. All ExceptionHandlingDecorators acquire a shared lock, as each decorator only touches its own buffer. The DeviceModule, which writes all recovery accessors, uses the unique lock to prevent any ExceptionHandlingDecorator from modifying the user buffer while doing so.
- 10.4 All valid recovery accessors are written each time the device has been (re)-opened, after the initialisation handlers have been executed. If a recovery accessor has not seen an initial value yet, the version number is still nullptr, and the accessor is invalid. These accessors are not written. (implements 1.l)
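
The locking scheme of 10.3 and the validity check of 10.4 can be sketched with std::shared_mutex (C++17). The class RecoveryStoreSketch and its members are hypothetical, the value type is reduced to a single int, and the flag hasValue stands in for "the version number is no longer nullptr".

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <vector>

// Hypothetical user buffer of one recovery accessor.
struct RecoveryBufferSketch {
  int value{0};
  bool hasValue{false}; // false until the first value has been stored (10.4)
};

// Hypothetical container for all recovery buffers of one device.
class RecoveryStoreSketch {
 public:
  explicit RecoveryStoreSketch(std::size_t nBuffers) : _buffers(nBuffers) {}

  // Called from an ApplicationModule or fanout thread, before the original
  // accessor's pre-write (10.2). A shared lock is sufficient because each
  // caller only touches its own buffer (10.3).
  void storeBeforeWrite(std::size_t index, int value) {
    assert(index < _buffers.size());
    std::shared_lock<std::shared_mutex> lock(_mutex);
    _buffers[index].value = value;
    _buffers[index].hasValue = true;
  }

  // Called from the device thread after the initialisation handlers (10.4).
  // The unique lock keeps all writers out while the buffers are read.
  // Returns the number of buffers which have been written to the device.
  template<typename WriteFunction>
  std::size_t writeAll(WriteFunction writeToDevice) {
    std::unique_lock<std::shared_mutex> lock(_mutex);
    std::size_t nWritten = 0;
    for(std::size_t i = 0; i < _buffers.size(); ++i) {
      if(!_buffers[i].hasValue) continue; // no initial value seen yet: skip
      writeToDevice(i, _buffers[i].value);
      ++nWritten;
    }
    return nWritten;
  }

 private:
  std::shared_mutex _mutex;
  std::vector<RecoveryBufferSketch> _buffers;
};
```

Using the shared lock for the many writers means that ApplicationModule threads do not serialise against each other in the normal case; they only wait while the device thread holds the unique lock during recovery.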
\section spec_execptionHandling_known_issues Known issues
- 11.1 In step 2.1: The initial value of deviceError is not set to 1.
- 11.2 In step 2.2.3: This is not correctly fulfilled, as we only wait for the device to be opened and do not wait for it to be correctly initialised. The lock 4.2.3 is not implemented at all.
- 11.3 In step 2.3.5: is currently being set before initialisationHandlers and writeAfterOpen.
- 11.4 Check the documentation of DataValidity. ...'Note that if the data is distributed through a triggered FanOut....'
- 11.5 Data validity is currently propagated through the "owner", which conceptually does not always work. A DataFaultCounter needs to be introduced and used at the correct places.
- 11.6 In comment to 1.g: recovery accessors are not optional at the moment.
- 11.7 In 1.c: Currently data is transported even if the "value after construction" is still in.
- 11.8 In 1.i, 6: ThreadedFanOut and TriggerFanOut do not use non-blocking write because it does not exist yet.
- 11.9 In 1.j, 2.5.3: Not implemented like that. The first read blocks, and a special mechanism to propagate the flags is triggered only in the next module.
- 11.10 In 2.3: The device module has a special "first opening" sequence. This is not necessary any more. The "writeAfterOpen" list is obsolete. You can always use the recovery accessors.
- 11.11 In 2.3.4: Recovery accessors are always written. It is not checked whether there is valid data (not "value after construction")
- 11.12 In 2.4.1.1: The write is probably re-executed after recovery. This should not happen because the recovery accessor has already done it.
- 11.13 In 2.5.3: The non-blocking read functions always block on exceptions. They should not (only if there is no initial value).
- 11.14 In 2.5.2, 5.1: writeWithoutErrorBlocking is not implemented yet
- 11.15 Asynchronous reads are not working with the current implementation, incl. readAny.
- 11.16 In 3: DeviceAccess : RegisterAccessors throw in doReadTransfer now.
- 11.17 In 4.2.1: reportException does block (should not)
- 11.18 In 4.2.2: blocking wait function does not exist (not needed in current implementation as reportException blocks)
- 11.19 In 5.2.1: Exceptions are caught in doXxxTransfer instead of doPostXxx.
- 11.20 In 5.3.1.2, 5.3.2.1: Decoration of doXxxTransfer does not acquire the lock (which does not even exist yet, see 4.2.3)
- 11.21 In 3.2: Decorators might have to try-catch because they usually can only do their task after calling the delegated postXxx.
- 11.22 In 3.4: The TransferType is not known. Needs to be implemented in TransferElement
- 11.23 In 3.5: PostRead is currently skipped if readNonBlocking or readLatest does not have new data
- 11.24 In 3.6: The waitForNewData calls in the DoocsBackend (using zmq) are currently not interruptible
} // end of namespace ChimeraTK