/**
\page exceptionHandlingDesign Exception Handling Design
\section gen_idea General Idea


Exceptions must be handled by ApplicationCore in a way that the application developer does not have to care much about it.

In case of a ChimeraTK::runtime_error exception the framework must catch the exception and report it to the DeviceModule. The DeviceModule handles this exception and periodically tries to open the device. If there are several devices, only the faulty device is blocked. Even a faulty device should not block the server from starting.
If an input variable is in the error state, it sets the DataValidity flag for its DataValidityPropagationExecutor (see \link spec_dataValidityPropagation \endlink) to faulty and the flag is propagated appropriately. After the exception is cleared and the operation returns without a data fault flag, the DataValidity flag is set back to ok. Furthermore, the device must be reinitialised automatically and the values of process variables must be recovered, as the device might have rebooted and the variables might have been reset.
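
The faulty/ok bookkeeping described above can be pictured as a counter of inputs that are currently in the error state. The following is only a minimal sketch under that assumption. `ValidityOwner` and the local `DataValidity` enum are hypothetical stand-ins written for this illustration, and the authoritative rules are in \link spec_dataValidityPropagation \endlink.

```cpp
#include <cassert>
#include <cstddef>

// Local stand-in for the validity flag, defined here to keep the sketch self-contained.
enum class DataValidity { ok, faulty };

// Hypothetical owner of the validity state (an ApplicationModule or fanout in the text).
class ValidityOwner {
 public:
  // Called when one of the inputs enters the error state.
  void incrementDataInvalidCounter() { ++faultyInputs_; }

  // Called when an input has left the error state again.
  void decrementDataInvalidCounter() {
    assert(faultyInputs_ > 0); // must pair with an earlier increment
    --faultyInputs_;
  }

  // The owner is faulty as long as at least one input is faulty.
  DataValidity getDataValidity() const {
    return faultyInputs_ == 0 ? DataValidity::ok : DataValidity::faulty;
  }

 private:
  std::size_t faultyInputs_ = 0;
};
```

With a counter, the flag only returns to ok once every faulty input has recovered, which matches the requirement that only a clean operation resets the flag.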
<b>1. Genesis</b>
- b. An initialisation handler can be added to the DeviceModule in the user code. Initialisation handlers are callback functions which are executed when a device is opened for the first time and after a device recovers from an exception, before any process variables are written.
- c. Initial values must be correctly propagated after a device is opened. See \link spec_initialValuePropagation \endlink. In particular, no read function (not even readNonBlocking/readLatest) must return before an initial value has been received.
- d. (removed)
- e. A ChimeraTK::ExceptionHandlingDecorator is placed around all ChimeraTK::NDRegisterAccessors which connect a device to a ChimeraTK::ApplicationModule or fanout. (*)
- f. (removed)
- g. By default a recovery accessor is added for each device register when it is obtained. These recovery accessors are used to correctly set the values of variables when the device is opened for the first time and after a device has recovered from an exception. (*)
- h. A ChimeraTK::ExceptionHandlingDecorator for an input knows its DataValidityPropagationExecutor, which lives in the ApplicationModule or fanout that reads the input. This allows it to propagate the
     dataValidity flag. Outputs do not send DataValidity faulty in case of exceptions (see \link spec_dataValidityPropagation \endlink).
- i. Write should not block in case of an exception for the outputs of ThreadedFanOut / TriggerFanOut. (*)
<b>2. The Flow</b>
- 2.1. The application always starts with all devices closed and the initial value of deviceError.status set to 1. The DeviceModule takes care that ExceptionHandlingDecorators do not perform any read or write operations, but block. This must happen before running any prepare() of an ApplicationModule, where the first write calls to ExceptionHandlingDecorators are done.

- 2.2 In ApplicationModule::prepare() some initial values (and constants) are written. As the ExceptionHandlingDecorator must not perform the actual write at this point, it will put the value into the recovery accessor and report an exception to the DeviceModule.
  - 2.2.3 Although ApplicationModule and fanout threads start after the device module threads, the application is now asynchronous and read or write operations can already take place in the main loops, even if the device is not ready yet (it might actually be broken). All read and write operations are blocked by the ExceptionHandlingDecorators at this point.
  - 2.3.1 The DeviceModule tries to open the device until it succeeds. (*)
  - 2.3.2 The device is initialised by iterating over the initialisationHandlers list. If there is an exception go back to 2.2.1. (*)
  - 2.3.3 The list of reported exceptions is cleared. (*)
  - 2.3.4 All valid (*) recovery accessors are written. If there is an exception go back to 2.3.1. (*)
  - 2.3.5 deviceError.status is set to 0.
  - 2.3.6 DeviceModule allows that ExceptionHandlingDecorators execute reads and writes.
  - 2.3.7 All blocked read and write operations (from 2.5.3) are notified. (*)
  - 2.3.8 The DeviceModuleThread waits for the next reported exception.
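
The sequence 2.3.1 to 2.3.6 can be sketched as a single retry loop. This is a minimal, single-threaded illustration and not the DeviceModule implementation. `FakeDevice`, `RecoveryState`, `DeviceAction` and `recoverDevice()` are hypothetical names, `std::runtime_error` stands in for ChimeraTK::runtime_error, and the sketch assumes that every failure restarts the sequence at opening the device. Notification (2.3.7) and waiting for the next exception (2.3.8) are left out.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a device, not a ChimeraTK class.
struct FakeDevice {
  int failuresBeforeOpen = 0;   // open() throws this many times before it succeeds
  std::vector<std::string> log; // records what was done to the device

  void open() {
    if(failuresBeforeOpen > 0) {
      --failuresBeforeOpen;
      throw std::runtime_error("cannot open device");
    }
    log.push_back("open");
  }
};

// Hypothetical bookkeeping kept by the device module thread.
struct RecoveryState {
  int status = 1;                     // deviceError.status, starts as 1 (step 2.1)
  int reportedExceptions = 0;         // exceptions reported but not yet cleared
  bool decoratorsMayTransfer = false; // whether decorators may read and write
};

using DeviceAction = std::function<void(FakeDevice&)>;

// Runs steps 2.3.1 to 2.3.6 until they succeed; returns the number of attempts.
int recoverDevice(FakeDevice& device, RecoveryState& state,
    const std::vector<DeviceAction>& initialisationHandlers,
    const std::vector<DeviceAction>& validRecoveryWrites) {
  assert(!state.decoratorsMayTransfer); // decorators must be blocked during recovery
  int attempts = 0;
  while(true) {
    ++attempts;
    try {
      device.open();                                                     // 2.3.1
      for(const auto& handler : initialisationHandlers) handler(device); // 2.3.2
      state.reportedExceptions = 0;                                      // 2.3.3
      for(const auto& write : validRecoveryWrites) write(device);        // 2.3.4
    }
    catch(const std::runtime_error&) {
      continue; // start over; a real loop would wait before retrying
    }
    state.status = 0;                   // 2.3.5
    state.decoratorsMayTransfer = true; // 2.3.6
    return attempts;
  }
}
```

Clearing the reported exceptions before the recovery writes is what makes the ordering argument in the comment for 2.3.3 work.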

- 2.4 Device and Application are running normally
  - 2.4.1 All blocked ExceptionHandlingDecorators continue (*)
    - 2.4.1.1 write just continues (recovery accessor has done the write)
    - 2.4.1.2 read/readNonBlocking/readLatest
      - 2.4.1.2.1 tells the DataValidityPropagationExecutor that the device error has gone
      - 2.4.1.2.2 (re-)tries to get the value. In case of an exception go to 2.5
  - 2.4.2 In the ExceptionHandlingDecorator, all write calls always fill the value into the recovery accessors before trying to execute the real write. Like this, the recovery accessor always has the last value that should have been written to the device. All recovery accessors become valid over time (see comment for 2.3.4).
    - 2.4.2.1 If a write is not executed because the device is already faulty (from 2.2 or 2.6.1), the recovery accessor has to take care of this. In this case we always have to send another exception notification to the DeviceModule to make sure that the new recovery value is not missed (avoid race condition). (*)


- 2.5. When a read / write operation on the device (1.e) causes a ChimeraTK::runtime_error exception, the exception is caught in the ExceptionHandlingDecorator
  - 2.5.1. If it is a read operation the DataValidityPropagationExecutor is informed that there was a device error. (*)
  - 2.5.2. The error is reported to the DeviceModule
  - 2.5.3. Action depending on the calling operation :
    - write : blocks until the device is recovered.
    - read : If the accessor has already seen its initial value, the first "blocking" read call returns immediately (remember DataValidity is set to faulty). The ExceptionHandlingDecorator remembers that it is in an exception state. The calling module thread will continue and propagate the data invalid flag. The second call will finally block. If there has not been an initial value yet, even the first call will block until it is available.
    - readNonBlocking / readLatest: will always return with data invalid flag (unless there has not been an initial value yet).
    - writeWithoutErrorBlocking: just returns (*) 
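
The case distinction in 2.5.3 can be written down as a small decision function. The sketch below only models which action a call takes while the device is faulty. It performs no transfers, and `Operation`, `Action`, `DecoratorState` and `actionWhileDeviceFaulty()` are hypothetical names introduced for illustration.

```cpp
#include <cassert>

enum class Operation { read, readNonBlocking, readLatest, write, writeWithoutErrorBlocking };
enum class Action { block, returnFaulty, returnImmediately };

// Hypothetical per-accessor state kept by a decorator.
struct DecoratorState {
  bool hasSeenInitialValue = false; // true once an initial value has been received
  bool inExceptionState = false;    // set by the first read that returned faulty data
};

// Decides what a call does while the device is faulty (step 2.5.3).
Action actionWhileDeviceFaulty(Operation op, DecoratorState& state) {
  switch(op) {
    case Operation::write:
      return Action::block; // blocks until the device has recovered
    case Operation::writeWithoutErrorBlocking:
      return Action::returnImmediately;
    case Operation::read:
      if(!state.hasSeenInitialValue) return Action::block; // never return without initial value
      if(!state.inExceptionState) {
        state.inExceptionState = true; // remember, so that the next call blocks
        return Action::returnFaulty;
      }
      return Action::block;
    case Operation::readNonBlocking:
    case Operation::readLatest:
      return state.hasSeenInitialValue ? Action::returnFaulty : Action::block;
  }
  assert(false && "unknown operation");
  return Action::block;
}
```

The first blocking read returns so that the calling module can propagate the faulty flag once. Blocking on the second call keeps the module from spinning on a broken device.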

- 2.6 The exception is received in the DeviceModule thread
  - 2.6.1 deviceError.status will be set to 1. From this point on, all ExceptionHandlingDecorators for this device must block all read and write operations (see also 2.2 and 2.3.6).
  - 2.6.2 The thread goes back to 2.2.1 and tries to re-open the device.
- 1.e. In addition there can be recovery accessors for the same variables, which are not decorated. They are not directly seen by the ApplicationModule and the fanouts.
- 1.g. Output accessors can have the option not to have a recovery accessor. This is needed for instance for "trigger registers" which start an operation on the hardware. Also void registers don't have recovery accessors.
- 1.i. The specification for initial value propagation (\link spec_initialValuePropagation \endlink) also says that writes in ApplicationModules don't block before the first successful read in the main loop.
- 2.3.1 Successful opening includes that the device reports isFunctional() as true.
- 2.3.2 and 2.3.4 Exceptions for re-initialisation and recovery will be reported once, but not if it occurs again before the device has completely recovered.
- 2.3.3 ExceptionHandlingDecorators must always first write the recovery accessor, then report an exception. As the device module clears the exceptions first, then processes the accessors, it is guaranteed that no value is missed. As a side effect it can be that a pending exception triggers an unnecessary recovery loop in the device module.
- 2.3.4 If a recovery accessor has not seen an initial value yet, it will not be written (see \link spec_initialValuePropagation \endlink).
- 2.3.7 This is different from 2.2.6 because 2.2.6 affects accessors which want to perform a read or write, while 2.2.7 affects accessors that failed to do so and are waiting for the device to become available again. This is needed for two cases:
  - 1. A blocking write, where the recovery accessor has already done the job when the device is back to OK.
  - 2. The first blocking read if the data has not seen the initial value yet, and retrieving it caused the exception.
- 2.4.1 writeWithoutErrorBlocking is not mentioned because it never blocks. Although blocked by different mechanisms read/readNonBlocking/readLatest behave the same:
  - read is either the second read call which is expected to deliver the next value, or any of the three are still waiting for the initial value. In any case they have to (re-)try reading.
- 2.4.2.1 Basically after each update of the recovery accessor there has to be a valid write, or an exception has to be reported to the DeviceModule, to make sure the value is seen by the device (unless the recovery accessor is updated before this happens).
- 2.5.1 incrementDataInvalidCounter() is called. See \link spec_dataValidityPropagation \endlink.
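
The ordering argument in the comments for 2.3.3 and 2.4.2.1 can be replayed with a small deterministic model. This is a sketch under simplifying assumptions: one register, no threads, and the interleaving is chosen by the caller. `RecoveryChannel` and its members are hypothetical names.

```cpp
// Hypothetical model of one register with its recovery accessor.
struct RecoveryChannel {
  int recoveryValue = 0;      // user buffer of the recovery accessor
  bool recoveryValid = false; // false until a value has been stored (comment 2.3.4)
  int pendingReports = 0;     // exceptions reported to the device module
  int deviceValue = 0;        // what the device register currently holds

  // Decorator side while the device is faulty: first store the value, then report.
  void decoratorWrite(int value) {
    recoveryValue = value;
    recoveryValid = true;
    ++pendingReports;
  }

  // Device module side, step 2.3.3: clear the list of reported exceptions.
  void clearReports() { pendingReports = 0; }

  // Device module side, step 2.3.4: write the recovery accessor if it is valid.
  void writeRecoveryValue() {
    if(recoveryValid) deviceValue = recoveryValue;
  }

  // A pending report sends the device module through another recovery loop.
  bool needsAnotherRecovery() const { return pendingReports > 0; }
};
```

If a decorator write arrives between `clearReports()` and `writeRecoveryValue()`, the new value is already written in this pass, and the report that is still pending causes one extra recovery loop. This is the harmless side effect mentioned in the comment for 2.3.3. With the opposite order in the decorator (report first, store second) a value could be stored after both steps and would never reach the device.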
<b>Implementation Details</b>
<b>4. Exception handling and reporting mechanism to the device module (DeviceModule).</b>
These variables are automatically connected to the control system in this format:
- /Devices/{AliasName}/message
- /Devices/{AliasName}/status

Add a thread safe function ChimeraTK::DeviceModule::reportException().
A user/application can report an exception by calling reportException() of the DeviceModule with an exception string. reportException() puts the exception into a queue and then blocks the calling thread. The queue is processed by the internal function handleException(), which updates the DeviceError variables (status=1 and message="YourExceptionString") and tries to open the device. Once the device can be opened, the DeviceError variables are updated (status=0 and message="") and the blocked threads are notified to continue. Note that the operation which led to the exception, e.g. a read or write, should be repeated after the exception has been handled.
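
The queue-and-block mechanism can be sketched with a mutex and two condition variables. This is a minimal illustration using only the C++ standard library, not the DeviceModule code. `DeviceErrorReporter`, `processReports()`, `lastReported()` and the `tryOpen` callback are hypothetical names, and only `reportException()` is taken from the text.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class DeviceErrorReporter {
 public:
  // Thread safe: queue the message, then block until the next recovery has completed.
  void reportException(const std::string& message) {
    std::unique_lock<std::mutex> lock(mutex_);
    reports_.push(message);
    const unsigned long waitFor = recoveries_ + 1;
    reportQueued_.notify_one();
    recovered_.wait(lock, [&] { return recoveries_ >= waitFor; });
  }

  // One pass of the handling loop; tryOpen returns true once the device is usable.
  void processReports(const std::function<bool()>& tryOpen) {
    std::unique_lock<std::mutex> lock(mutex_);
    reportQueued_.wait(lock, [&] { return !reports_.empty(); });
    status_ = 1;
    while(!reports_.empty()) { // publish the most recent message
      message_ = reports_.front();
      reports_.pop();
    }
    lastReported_ = message_;
    lock.unlock(); // do not hold the lock while talking to the device
    while(!tryOpen()) std::this_thread::yield(); // a real loop would wait between attempts
    lock.lock();
    status_ = 0;
    message_.clear();
    ++recoveries_;
    recovered_.notify_all(); // wake all threads blocked in reportException()
  }

  int status() const { std::lock_guard<std::mutex> lock(mutex_); return status_; }
  std::string message() const { std::lock_guard<std::mutex> lock(mutex_); return message_; }
  std::string lastReported() const { std::lock_guard<std::mutex> lock(mutex_); return lastReported_; }

 private:
  mutable std::mutex mutex_;
  std::condition_variable reportQueued_;
  std::condition_variable recovered_;
  std::queue<std::string> reports_;
  unsigned long recoveries_ = 0;
  int status_ = 0;
  std::string message_;
  std::string lastReported_;
};
```

A report that arrives while the device is being reopened is released by the same recovery but stays in the queue, so it triggers one further pass.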

Implementation.
- ChimeraTK::DeviceModule
<b>5. Catch ChimeraTK::runtime_error exceptions.</b>
For a device with its deviceError.status = 0 (see 2.4.3), catch all the ChimeraTK::runtime_error exceptions that could be thrown in read and write operations and feed the error state into the DeviceModule through the function ChimeraTK::DeviceModule::reportException().
Retry the failed operation after reportException() returns.
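
The catch, report and retry pattern can be sketched as a small helper. `retryAfterReport()` is a hypothetical name, `std::runtime_error` stands in for ChimeraTK::runtime_error, and the `reportException` callback is assumed to block until the device has recovered, as described in section 4.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Runs the operation until it succeeds, reporting each runtime error in between.
template<typename Result>
Result retryAfterReport(const std::function<Result()>& operation,
    const std::function<void(const std::string&)>& reportException) {
  while(true) {
    try {
      return operation();
    }
    catch(const std::runtime_error& e) {
      // Assumed to return only after the device has been recovered.
      reportException(e.what());
    }
  }
}
```

Other exception types are deliberately not caught in this sketch, so they still reach the caller.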

For a device that has been opened for the first time but has not reached 2.4.3, i.e. its deviceError.status != 0, and that throws a ChimeraTK::runtime_error exception, see 2.3.
Implementation.
- Exceptions are caught as explained in 1.e and 1.f.
- ChimeraTK::NDRegisterAccessors
- ChimeraTK::Application
<b>6. Faulty device should not block any other device.</b>
Each ChimeraTK::TriggerFanOut deals with several variable networks at the same time, which are triggered by the same trigger. Each variable network has its own feeder and one or more consumers. The trigger itself is a variable network, too. One consumer per ChimeraTK::TriggerFanOut is required.
- ChimeraTK::Application::typedMakeConnection()
<b>7. The server must always start even if a device is in error state.</b>
To make sure that the server always starts, the initial opening of the device should take place in ChimeraTK::DeviceModule::handleException(), which contains the exception handling loop. This way the device can go into the error state right at the beginning, and the server can start even if not all of its devices are available.
Does not fit here, but is the only place where handleException is mentioned:
- handleException() must not block.

- ChimeraTK::DeviceModule::handleException()
<b>8. Propagate error flag</b>
See 2.5.1.
For initial error propagation see <a href='spec_initialValuePropagation.html'>spec_initialValuePropagation</a>.
Implementation.
- ChimeraTK::ExceptionHandlingDecorator
- ChimeraTK::TriggerFanOut
<b>9. Initialise the device</b>
The device should be automatically initialised when opened for first time (2.4.1) and automatically re-initialised after recovery (2.5.3.4).

Implementation.

A list of std::function objects is added to the DeviceModule. Initialisation handlers can be added through the constructor and the addInitialisationHandler() function. When the device recovers, all initialisation handlers in the list are executed.
- ChimeraTK::DeviceModule
- ChimeraTK::ExceptionHandlingDecorator
<b>10. Recover process variables after exception.</b>

Background.

After a device has failed and recovered, it might have re-booted and lost the values of the process variables that live in the server and are written to the device. Hence these values have to be re-written after the device has recovered.

Description.
Create a copy of the accessor when writing the data to the device and use it to recover the values when the device is available again. A recovery accessor does not write if the register has never been written before (2.5.3.5.).
- ChimeraTK::DeviceModule
- ChimeraTK::ExceptionHandlingDecorator
- A list of ChimeraTK::TransferElements is created as ChimeraTK::DeviceModule::writeRecoveryOpen which is populated in function ChimeraTK::DeviceModule::addRecoveryAccessor().
ChimeraTK::ExceptionHandlingDecorator is extended by adding a second accessor to the same register as the target accessor it is decorating.
<I> Data is copied in doPreWrite(). [TBD: Do we want this behaviour? => Yes, it has to happen before the original accessor's pre-write because this is the last occasion where the data is still guaranteed to be in our user buffer. The accessor's pre-write might swap the data out, and it might never be available again (in case of write destructively).]</I>
- As the user buffer of the recovery accessor is written in an ApplicationModule or fanout thread, but read in the DeviceModule thread when recovering, it has to be protected by a mutex. For efficiency one single shared mutex is used. All ExceptionHandlingDecorators acquire a shared lock, as each decorator only touches its own buffer. The DeviceModule, which writes all recovery accessors, uses the unique lock to prevent any ExceptionHandlingDecorator from modifying the user buffer while doing so.
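
The locking scheme can be sketched with `std::shared_mutex` (C++17). This is an illustration under simplifying assumptions. `RecoveryAccessorSlot` and `RecoveryStore` are hypothetical names, the buffers are plain integers, and each index is assumed to be used by exactly one decorator.

```cpp
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <vector>

// Hypothetical recovery accessor: a user buffer plus the register it targets.
struct RecoveryAccessorSlot {
  int userBuffer = 0;
  bool valid = false;     // false until the first value has been stored
  int deviceRegister = 0; // stands in for the hardware register
};

class RecoveryStore {
 public:
  explicit RecoveryStore(std::size_t count) : slots_(count) {}

  // Decorator side: a shared lock suffices because each decorator touches only its own slot.
  void store(std::size_t index, int value) {
    std::shared_lock<std::shared_mutex> lock(mutex_);
    slots_.at(index).userBuffer = value;
    slots_.at(index).valid = true;
  }

  // DeviceModule side: the unique lock keeps all decorators out while recovering.
  std::size_t writeAll() {
    std::unique_lock<std::shared_mutex> lock(mutex_);
    std::size_t written = 0;
    for(auto& slot : slots_) {
      if(!slot.valid) continue; // never written before, so nothing to recover
      slot.deviceRegister = slot.userBuffer;
      ++written;
    }
    return written;
  }

  int deviceRegister(std::size_t index) const {
    std::shared_lock<std::shared_mutex> lock(mutex_);
    return slots_.at(index).deviceRegister;
  }

 private:
  mutable std::shared_mutex mutex_;
  std::vector<RecoveryAccessorSlot> slots_;
};
```

The shared lock lets decorators in different threads store values concurrently. It would not protect two threads writing the same slot, which is why the one-decorator-per-slot assumption matters.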
<b> ExceptionHandlingDecorator </b>

- Device accessors must only throw in postRead and postWrite (FIXME: move text from initial value propagation spec)
- The Decorator only decorates postRead / postWrite (FIXME: conceptually, which one is the correct one?)
- The decorator provides a writeWithoutErrorBlocking() function so that even in case of exception write should return. [TBD: name of the function]

Like this the decoration also works for transfer groups and asynchronous transfers.

<b>5. Known Bugs.</b>

-  Step 2.1 The initial value of deviceError is not set to 1.

-  Step 2.2. is not correctly fulfilled as we are only waiting for device to be opened and don't wait for it to be correctly initialised.

-  Step 2.4.3. is currently being set before initialisationHandlers and writeAfterOpen.

-  Step 2.5.3.7. is currently being set before initialisationHandlers and writeRecoveryOpen.

-  Check the comment in Device.h about writeAfterOpen(). 'This is used to write constant feeders to the device.'

-  Check the documentation of DataValidity. ...'Note that if the data is distributed through a triggered FanOut....'