Adds to Questions section of EOS-CTA interface doc

Commit f61009c6 authored 7 years ago by Michael Davis
--- a/doc/CTA_EOS.pdf
+++ b/doc/CTA_EOS.pdf
--- a/doc/latex/CTA_EOS_Questions.tex
+++ b/doc/latex/CTA_EOS_Questions.tex
@@ -11,9 +11,9 @@
 %   workflow engine and CTA.
 %
 % Steve identified one known weakness of the current “archive” protocol.  If the “queue archive” request from the EOS
-% workflow engine to the CTA front-end times out on the client (workflow engine) side then the file never gets an archive
+% workflow engine to the CTA frontend times out on the client (workflow engine) side then the file never gets an archive
 % ID in the EOS namespace but is nonetheless marked as having a valid tape copy safely stored on tape.  No error messages
-% are reported to either the EOS end user or the CTA front-end.  The end user incorrectly thinks that their file can be
+% are reported to either the EOS end user or the CTA frontend.  The end user incorrectly thinks that their file can be
 % retrieved from EOS/CTA because they see a tape replica, but this is not true because of the lack of an archive ID in the
 % EOS namespace which is required by a retrieve operation.

@@ -56,7 +56,7 @@ The generic headers can be reused by other projects which are nothing to do with

 The alternative to using generic code would be to document all the design decisions and leave it to the EOS team to implement them.

-\section{What is the mechanism by which the CTA front-end will determine the individual instance names of the EOS
+\section{What is the mechanism by which the CTA frontend will determine the individual instance names of the EOS
 instances sending it archive, retrieve and delete requests?}

 As CTA will have access to every EOS instance, we want to prevent crosstalk: one VO should not be able to access or
@@ -66,7 +66,7 @@ SSS keys are used for both identification and authentication. This implies that
 for each VO.

 XRootD issue \href{https://github.com/xrootd/xrootd/issues/591}{\#591} prevents having multiple keys in the same keytab
-with the same user ID. The most likely solution is to have a separate user ID for each VO. Then we can have multiple keys
+with the same user ID. I think this could be solved by having a separate user ID for each VO. Then we can have multiple keys
 in the keytab, each with its own unique user ID.

 \section{Will EOS instance names within the CTA catalogue be ``long'' or ``short'', in other words ``eosdev'' or just ``dev''?}
@@ -74,18 +74,55 @@ in the keytab, each with its own unique user ID.
 Following the point above, we should have a separate instance name for each VO (``eosatlas'', ``eoscms'', \textit{etc.}) and
 a unique key for each instance name.

-Personally I prefer the long name as it is clearer that this is an instance name. In any case we should make a decision
-and apply it consistently.
+Personally I prefer the long name as it is clearer that this is an instance name, but it's not a religious issue.

 \section{Do we want the EOS namespace to store CTA archive IDs or not?}

+\begin{description}
+   \item[If no:] we are assuming that the EOS file ID uniquely identifies the file (and that the file is immutable).
+   \item[If yes:] the EOS file ID must somehow be mapped to a CTA archive ID which is returned to EOS.
+\end{description}
+
+Possible implementation:
+\begin{enumerate}
+   \item On OPENW: EOS calls CTA with the file metadata
+   \item CTA Frontend validates metadata (check Storage Class is valid)
+   \item CTA Frontend stores metadata
+   \item CTA Frontend allocates a unique archive ID and returns it to EOS
+   \item EOS attaches archive ID as an xattr
+   \item On CLOSEW\slash PREPARE\slash DELETE: EOS calls CTA with the archive ID
+\end{enumerate}
+If any step fails, EOS will receive an error code and will not have the archive ID, so it cannot request any tape operations.
+EOS can retry obtaining the archive ID or notify the operator\slash user of the failure.
+
+If there is a failure after storing the metadata, then in the worst case we have some stored metadata and no archive file,
+but we know we have 0 tape copies, so the CTA state is always consistent.
+
+Is step (3) guaranteed to complete quickly?
+
 \section{Should the CTA catalogue methods prepareForNewFile() and prepareToRetrieveFile() detect repeated requests from
-EOS instances and if so how should the catalogue communicate such ``duplicate'' requests to the caller (Scheduler / cta
-front-end plugin)?}
+EOS instances?}
+
+EOS does not keep track of requests which have been issued. We have said that CTA should implement idempotent retrieve queuing.
+Presumably also for archives and deletes?
+
+\subsection{If so how should the catalogue communicate such ``duplicate'' requests to the caller (Scheduler\slash cta-frontend plugin)?}
+
+The frontend does not call the catalogue directly; it calls the scheduler, which calls the catalogue. A possible sequence:
+\begin{enumerate}
+   \item Frontend calls Scheduler
+   \item Scheduler checks job is not in its queue\slash return failure if it is
+   \item Scheduler queries the Catalogue to ensure job is not in progress\slash return failure if it is
+   \item Job accepted into Scheduler queue\slash return success
+   \item Frontend logs result and returns success to EOS (or returns failure in the case that the job could not be
+      accepted for some other reason)
+\end{enumerate}
+
+\subsection{If the CTA catalogue keeps an index of ongoing archive and retrieve requests, what will be the new
+protocol additions (EOS, cta-frontend and cta-taped) required to guarantee that ``never completed'' requests are removed
+from the catalogue?}

-\section{If the CTA catalogue keeps an index of ongoing archive requests and retrieve requests, what will be the new
-protocol additions (EOS, cta front-end and cta-taped) required to guarantee that ``never completed'' requests are removed
-from the catalogue?  Such a protocol addition could something as simple as a timeout.}
+Such a protocol addition could be something as simple as a timeout.
                     
 % \section{How do we deal with the fact that the current C++ code of the EOS/CTA interface that needs to be compiled on
 % the EOS side on SLC6 will not compile because it uses std::future?}
@@ -178,7 +215,7 @@ by EOS.

 Chaining of archive and retrieve requests to retrieve requests.

-Excution of retrieve requests as disk to disk copy if possible.
+Execution of retrieve requests as disk to disk copy if possible.

 Catalogue will also keep track of requests for each file (archive and retrieve) so that queueing can be made idempotent.
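
The duplicate-detection sequence proposed in the Questions section (Frontend calls Scheduler, Scheduler checks its own queue and then the Catalogue), combined with the suggestion that stale requests could be cleared by a timeout, could be sketched roughly as below. This is only an illustration under stated assumptions: the class names, method names, request key `(instance, archive_id, operation)` and the timeout value are all hypothetical and are not taken from the CTA code base. It also assumes that a detected duplicate is reported to EOS as success, which is one reading of the last step of the sequence.

```python
import time
from dataclasses import dataclass, field
from enum import Enum


class QueueResult(Enum):
    ACCEPTED = "accepted"
    DUPLICATE = "duplicate"


@dataclass
class Catalogue:
    """Hypothetical stand-in for the catalogue's index of ongoing requests."""
    timeout_s: float = 3600.0  # assumed value, not from the document
    in_progress: dict = field(default_factory=dict)  # request key -> start time

    def is_in_progress(self, key, now):
        started = self.in_progress.get(key)
        if started is None:
            return False
        if now - started > self.timeout_s:
            # "Never completed" request: expire it so that a retry can be queued.
            del self.in_progress[key]
            return False
        return True


@dataclass
class Scheduler:
    """Hypothetical scheduler holding the set of queued request keys."""
    catalogue: Catalogue
    queue: set = field(default_factory=set)

    def queue_request(self, key, now):
        # Check the scheduler's own queue first.
        if key in self.queue:
            return QueueResult.DUPLICATE
        # Then ask the catalogue whether the job is already in progress.
        if self.catalogue.is_in_progress(key, now):
            return QueueResult.DUPLICATE
        # Otherwise accept the job.
        self.queue.add(key)
        return QueueResult.ACCEPTED


def frontend_handle(scheduler, instance, archive_id, operation, now=None):
    """Log the outcome and return True (success) or False (failure) to the caller."""
    now = time.monotonic() if now is None else now
    key = (instance, archive_id, operation)
    try:
        result = scheduler.queue_request(key, now)
    except Exception as exc:  # the job could not be accepted for some other reason
        print(f"{operation} {instance}:{archive_id} -> failed: {exc}")
        return False
    print(f"{operation} {instance}:{archive_id} -> {result.value}")
    # Assumption: a duplicate is still reported as success, making queueing idempotent.
    return True
```

With this shape, a repeated retrieve request from the same EOS instance for the same archive ID does not create a second job, and a request that was recorded as in progress but never completed stops blocking retries once the timeout has elapsed.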