Commit 47743feb authored by Steven Murray

Added the first draft of the CTA paper for CHEP 2016

parent f30a72f6
\documentclass[a4paper]{jpconf}
\usepackage{graphicx}
\begin{document}
\title{An efficient, modular and simple tape archiving solution for LHC Run-3}
\author{S Murray\textsuperscript{1}, V Bahyl\textsuperscript{1}, G Cancio\textsuperscript{1}, E Cano\textsuperscript{1}, V Kotlyar\textsuperscript{2}, D F Kruse\textsuperscript{1}, and J Leduc\textsuperscript{1}}
\address{\textsuperscript{1}European Organization for Nuclear Research (CERN), CH-1211 Gen\`eve 23, Switzerland}
\address{\textsuperscript{2}Institute for High Energy Physics, Russian Federation State Research Centre (IHEP), RU-142 281 Protvino Moscow Region, Russia}
\ead{\{Steven.Murray, Vladimir.Bahyl, German.Cancio.Melia, Eric.Cano, Daniele.Francesco.Kruse, Julien.Leduc\}@cern.ch and Victor.Kotlyar@ihep.ru}
\begin{abstract}
The IT Storage group at CERN develops the software responsible for archiving to
tape the custodial copy of the physics data generated by the LHC experiments.
LHC Run-3 will start in 2021 and will introduce two major challenges for
which the tape archive software must be evolved. Firstly, the software will need
to make more efficient use of tape drives in order to sustain the predicted data
rate of 100 petabytes per year, as opposed to the current 40 petabytes per year
of Run-2. Secondly, the software will need to be seamlessly integrated with EOS,
which has become the de facto disk storage system provided by the IT Storage
group for physics data.
The tape storage software for LHC physics run 3 is code named CTA (the CERN Tape
Archive). This paper describes how CTA will introduce a pre-emptive drive
scheduler to use tape drives more efficiently, will encapsulate all tape
software into a single module that will sit behind one or more EOS systems, and
will be simpler by dropping support for obsolete backwards compatibility.
\end{abstract}
\section{Introduction} \label{introduction}
The IT Storage group at CERN is currently designing and developing the CERN
Tape Archive (CTA) storage system. The primary goal of CTA is to provide the
EOS\cite{EOS} disk storage system with a tape backend. EOS is the de facto disk
storage system for physics data at CERN. EOS combined with CTA will eventually
replace the CERN Advanced STORage manager (CASTOR\cite{CASTOR}) system which is
the current system used to archive physics data to tape. The IT Storage group
plans to put CTA into production by the beginning of 2019, ready for experiments
to start migrating to it during the long shutdown period between LHC physics
runs 2 and 3.
During 2016, LHC physics run 2 archived over 40 petabytes of physics data to
tape using CASTOR. LHC physics run 2 will continue at similar data rates during
2017 and 2018.
LHC physics run 3 will start in 2021 and is predicted to store over 100
petabytes per year. Run 3 will use EOS and CTA as opposed to CASTOR to archive
data to tape. The CTA project will make more efficient use of tape drives in
order to deal with the predicted 100 petabytes of physics data per year. CTA
will accomplish this by introducing a pre-emptive drive scheduler that will keep
tape drives running at full speed all of the time. Physics data are not only
read back from tape by physicists. These data are also read back for tape media
repack\cite{repack} campaigns and data verification. A repack campaign reads
data from older lower capacity tapes and writes them to newer higher capacity
tapes. This enables the CERN computing centre to store more data whilst
occupying the same amount of floor space. The pre-emptive scheduler will improve
tape drive efficiency by automatically filling otherwise idle tape drive time
with background jobs for tape media repacking and data verification.
The next two sections are intended to give a more concrete idea of what CTA is.
Section \ref{cta_architecture} describes the architecture of CTA and lists the
steps taken to archive a file to tape and then to retrieve it back to disk.
Section \ref{concepts} describes the concepts that an operator needs to
understand in order to configure and work with CTA. Section \ref{scheduler}
describes the pre-emptive drive scheduler of CTA and how it will enable the IT
Storage group of CERN to handle the 100 petabytes per year data rate of LHC
physics run 3. Section \ref{migrating} describes how and when LHC experiments
and other users of CASTOR will migrate to EOS and CTA. Finally, section
\ref{conclusion} draws the paper to a close with its conclusions.
\section{CTA architecture} \label{cta_architecture}
Figure \ref{architecture} shows the architecture of CTA. At a high level, the
architecture consists of many EOS instances connected to a single CTA instance. Each
LHC experiment at CERN has its own EOS disk storage system and its own set of
tapes. However, in order to make cost-effective use of tape hardware, all
of the tape drives at CERN are shared by all of the experiments. During any one
day, an individual tape drive may mount and transfer data to and from tapes
belonging to many different experiments.
\begin{figure}[h]
\includegraphics[width=\textwidth, trim=0mm 35mm 0mm 0mm, clip]{CTA_architecture_A4.pdf}
\caption{\label{architecture}The CTA architecture}
\end{figure}
Starting on the left-hand side of figure \ref{architecture}, EOS clients such
as the \texttt{eos} and \texttt{xrdcp} command-line tools send commands to the
EOS manager server and transfer the contents of files to and from EOS disk
servers. EOS users see their disk files listed in the EOS namespace. The copies
of their files on tape also appear in the namespace, where they are seen as
additional replicas. For example, the \texttt{eos file info} command will
display the tape copies of a given EOS file.
In addition to its normal EOS duties, the EOS manager server also queues
requests with the CTA front-end in order to have EOS disk files archived to tape
or retrieved back to disk. The EOS workflow engine is the internal EOS
component that is responsible for queuing requests with the CTA front-end. The
EOS workflow engine and its configuration are the glue that holds EOS and CTA
together.
The EOS workflow engine can be configured to provide end users with the
following storage system behaviours:
\begin{itemize}
\item D1T0 - Disk-only files.
\item D1T1 - Files replicated on both disk and tape.
\item D0T1 - Tape files cached on disk.
\item Asynchronous tape file retrievals.
\item Synchronous tape file retrievals.
\end{itemize}
In the case of an asynchronous tape file retrieval, a user issues a bring-online
request for an EOS file that is on tape but not on EOS disk. They then poll EOS
until the file has been retrieved from tape. Once the file has been retrieved,
the user copies the file from EOS disk to their local storage. In the case of a
synchronous tape retrieval, a user is blocked when they try to open an EOS file
that is on tape but not on EOS disk. They are unblocked when the file has been
retrieved from tape.
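To make the asynchronous case more concrete, the following sketch shows the
shape of a user-side polling loop. It is only an illustration: the helper
functions \texttt{request\_bring\_online}, \texttt{is\_on\_eos\_disk} and
\texttt{copy\_from\_eos} are hypothetical placeholders for the corresponding
EOS client operations and are stubbed out here.
\begin{verbatim}
import time

def request_bring_online(eos_path):
    pass  # placeholder for the real EOS bring-online (prepare) request

def is_on_eos_disk(eos_path):
    return True  # placeholder: would ask EOS whether a disk replica exists

def copy_from_eos(eos_path, local_path):
    pass  # placeholder for the real copy, for example via xrdcp

def retrieve_asynchronously(eos_path, local_path):
    request_bring_online(eos_path)       # queue the retrieve with CTA via EOS
    while not is_on_eos_disk(eos_path):  # poll until the file is back on disk
        time.sleep(60)                   # it may wait until a mount is worthwhile
    copy_from_eos(eos_path, local_path)  # fetch the staged copy to local storage
\end{verbatim}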
Moving to the middle of figure \ref{architecture}, the CTA front-end server
provides a network-based interface to EOS and to the CTA administration tool used
by tape operators (not shown in figure \ref{architecture}). The CTA front-end
stores requests from EOS in the persistent CTA metadata system. The CTA
front-end also queries the CTA metadata system in order to answer the query
commands of the CTA administration tool.
The CTA metadata system is composed of two parts: a relational database that
stores the tape file catalogue and an object store that persistently queues data
transfer requests from EOS. A relational database was chosen for the tape file
catalogue in order to minimise risk. Relational database technology has a
proven track record of safely storing, and recovering in the event of a
disaster, mission-critical information such as the location of every LHC physics
file on tape. An object store was chosen to implement the persistent queues of
CTA because, without special tricks, relational database tables do not perform
well when their rows are continuously inserted and deleted, as is the case for a
queue.
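As an illustration of this split, the sketch below models the two kinds of
metadata with two record types: a durable catalogue entry recording where a
copy of a file sits on tape, and a transient queue entry representing a pending
transfer request. The field names are illustrative assumptions and not the
actual CTA schema.
\begin{verbatim}
from dataclasses import dataclass

@dataclass
class TapeFileCatalogueEntry:   # kept forever in the relational database
    archive_file_id: int        # identifier of the archived file within CTA
    storage_class: str          # for example "raw_data"
    copy_number: int            # 1 or 2 for dual-copy storage classes
    vid: str                    # volume identifier of the tape holding the copy
    f_seq: int                  # position of the file on that tape

@dataclass
class QueuedTransferRequest:    # kept in the object store, deleted once done
    request_type: str           # "archive" or "retrieve"
    archive_file_id: int
    eos_disk_url: str           # where to read from or write back to on EOS disk
    queue_time: float           # used to enforce the maximum pending time
\end{verbatim}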
The CTA tape drive daemon is based on the CASTOR tape server daemon. This
means the CTA tape drive daemon benefits from the know-how and experience gained with
CASTOR. The CTA tape drive daemon queries the CTA metadata system for tapes
to be mounted and files to be transferred. The daemon transfers files
between tapes and the EOS disk servers. When the daemon has completed a
transfer it writes the result back to the CTA metadata system.
The next two subsections explain step by step how a file is archived to tape and
how a file is retrieved back to EOS disk.
\pagebreak
\subsection{Archiving a file to tape}
The following steps describe how a file is archived to tape:
\begin{enumerate}
\item The user writes the file to EOS disk.
\item On the close of the file, the EOS workflow engine queues a request with
the CTA front-end to archive the file.
\item The CTA front-end stores the archive request in the CTA metadata system.
\item The file becomes eligible for archival: either the file has been
queued for the maximum permissible amount of time or there is enough data to
be transferred to warrant the cost of mounting a tape.
\item A tape server connected to a free tape drive queries the CTA metadata
system for more work and determines that a tape needs to be mounted for writing.
\item The tape server sends a request to the tape library to mount the tape.
\item Whilst the tape is still being mounted, the tape server queries the CTA
metadata system for the files to be written to the tape.
\item Still whilst the tape is being mounted, the tape server starts to read
the file from EOS disk into main memory.
\item The tape is mounted into the tape drive.
\item The tape server starts to write the file from main memory to tape.
\item On the close of the tape file, the tape server notifies EOS that the
file is safely archived.
\item EOS updates its file namespace to reflect the fact that the file is now
safely archived.
\end{enumerate}
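The sketch below condenses the tape-server side of these steps, in particular
the overlap of mounting the tape with reading files from EOS disk into main
memory (steps 6 to 10). All of the object and method names are hypothetical;
the real tape server is multi-threaded C++ rather than this sequential
illustration.
\begin{verbatim}
import threading

def archive_session(metadata, library, drive, eos):
    # Steps 5 and 7: query the CTA metadata system for a tape to mount for
    # writing and for the files to be written to it.
    work = metadata.get_archive_work(drive)
    if work is None:
        return
    # Step 6: ask the tape library to mount the tape.  The mount runs in the
    # background so that step 8 can proceed while the robot is still working.
    mount = threading.Thread(target=library.mount, args=(work.tape, drive))
    mount.start()
    # Step 8: read the files from EOS disk into main memory during the mount.
    buffers = [eos.read_into_memory(f.disk_url) for f in work.files]
    mount.join()  # step 9: the tape is now in the drive
    for f, data in zip(work.files, buffers):
        drive.write(f, data)         # step 10: main memory to tape
        metadata.report_archived(f)  # steps 11 and 12: EOS is notified and
                                     # updates its namespace
\end{verbatim}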
\subsection{Retrieving a file from tape}
The following steps describe how a file is retrieved from tape:
\begin{enumerate}
\item The user sends EOS a prepare request that instructs EOS to retrieve a
file from tape.
\item The EOS workflow engine queues the prepare request with the CTA
front-end.
\item The CTA front-end stores the retrieve request in the CTA metadata
system.
\item The file becomes eligible for recall: either the file has been
queued for the maximum permissible amount of time or there is enough data to
be transferred to warrant the cost of mounting a tape.
\item A tape server connected to a free tape drive queries the CTA metadata
system for more work and determines that a tape needs to be mounted for reading.
\item The tape server sends a request to the tape library to mount the tape.
\item The tape is mounted into the tape drive.
\item The tape server starts transferring the file from tape to EOS disk.
\item On the close of the disk file, EOS updates its namespace to reflect the
fact that there is now a copy of the file back on EOS disk.
\end{enumerate}
\section{Configuration and operations concepts} \label{concepts}
This section is divided into two subsections whose combined goal is to further
explain CTA by describing the concepts an operator needs to understand in order
to accomplish two typical operator tasks. The first subsection describes the
concepts required to specify which EOS disk file gets archived to which set of
tapes. The second subsection covers the concepts required to give users a
tailored quality of service.
\subsection{The concepts behind routing EOS disk files to tape}
An operator needs to understand the following three concepts in order to be able
to route EOS disk files to specific tape pools:
\begin{itemize}
\item CTA storage class.
\item Tape pool.
\item Archive route.
\end{itemize}
An EOS user archives a file to tape by copying the file into an EOS directory
that has been tagged with a CTA storage class. Within the EOS namespace a CTA
storage class is simply a name. For example the \texttt{raw\_data} storage
class could be used to refer to user files that should have one copy on tape and
should be written to tapes that are dedicated to raw physics data. Within CTA a
storage class name is mapped to the number of required copies on tape and to one
or more archive routes. An archive route specifies the tape pool to which a copy
of a file should be written. A tape pool is a logical grouping of tapes. For
example the \texttt{raw\_data\_of\_experiment\_A} tape pool could contain all of
the tapes owned by experiment A that are dedicated to storing raw physics data.
A storage class specifying two copies on tape will have two associated archive
routes, one for each copy to be written to tape. This enables operators to
direct the two copies to two different tape pools, for example with each pool in
a different building.
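A minimal sketch of this mapping is shown below. The single-copy
\texttt{raw\_data} example from above is reused, together with a hypothetical
dual-copy storage class named \texttt{dual\_copy\_data}; the table contents are
illustrative assumptions, not a real CTA configuration.
\begin{verbatim}
# Archive routes: (storage class, copy number) -> destination tape pool.
archive_routes = {
    ("raw_data", 1):       "raw_data_of_experiment_A",
    ("dual_copy_data", 1): "pool_in_building_1",
    ("dual_copy_data", 2): "pool_in_building_2",
}

# Number of tape copies required by each storage class.
copies_per_storage_class = {"raw_data": 1, "dual_copy_data": 2}

def tape_pools_for(storage_class):
    """Return the destination tape pool of each required copy of a file."""
    n_copies = copies_per_storage_class[storage_class]
    return [archive_routes[(storage_class, copy)]
            for copy in range(1, n_copies + 1)]

print(tape_pools_for("dual_copy_data"))
# ['pool_in_building_1', 'pool_in_building_2']
\end{verbatim}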
\subsection{The concepts behind giving users a tailored quality of service}
An operator needs to understand the following two concepts in order to tailor
the quality of service delivered to individual EOS users and groups of EOS
users:
\begin{itemize}
\item Mount policy.
\item Mount rule.
\end{itemize}
Mounting and dismounting a tape to read or write a single file is a
considerable waste of tape drive time and a sure way to create a queue of tape
mount and dismount requests behind the robotics of a tape library.
Mounting a tape, positioning it for reading or writing, rewinding it and
dismounting it can take 4 minutes. The tape drives used at CERN have data
transfer rates of around 300 megabytes per second. Four minutes of drive time is
therefore equivalent to transferring a 72 gigabyte file.
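Expressed as a worked calculation:
\[
4~\mathrm{min} \times 60~\mathrm{s/min} \times 300~\mathrm{MB/s}
  = 72\,000~\mathrm{MB} = 72~\mathrm{GB}.
\]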
CTA transfers files to and from tape asynchronously so that it can queue up
enough file transfers to warrant the cost of mounting, positioning, rewinding
and dismounting a tape. Files are copied to and read from tape in batches.
An operator creates named mount policies in order to define the following:
\begin{itemize}
\item Mount priority: the relative priority of competing tape mounts.
\item Maximum amount of time a file transfer can be kept pending.
\item The maximum number of tape drives that can be used when the system is
under load.
\end{itemize}
An operator creates mount rules in order to assign mount policies to individual
EOS users and groups of EOS users.
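The sketch below illustrates how a mount policy might be applied when deciding
whether the requests queued for a single tape justify a mount. The field names
and thresholds are hypothetical and are chosen only to mirror the concepts
above, not to describe the actual CTA scheduler.
\begin{verbatim}
import time

def mount_is_warranted(queue, policy, drives_in_use):
    """Decide whether the queued requests for one tape justify a mount.

    queue  - pending requests, each with .size_bytes and .queue_time
    policy - mount policy with .max_drives, .max_wait_seconds and
             .min_bytes_per_mount (hypothetical field names)
    """
    if not queue:
        return False
    if drives_in_use >= policy.max_drives:
        return False  # the policy caps how many drives this activity may use
    oldest_wait = time.time() - min(r.queue_time for r in queue)
    queued_bytes = sum(r.size_bytes for r in queue)
    # Mount either because a request has waited the maximum permissible time,
    # or because enough data is queued to amortise the roughly four minutes of
    # mount, position and rewind overhead.
    return (oldest_wait >= policy.max_wait_seconds or
            queued_bytes >= policy.min_bytes_per_mount)
\end{verbatim}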
\pagebreak
\section{The pre-emptive tape drive scheduler} \label{scheduler}
TO BE DONE
\pagebreak
\section{Migrating from CASTOR to EOS and CTA} \label{migrating}
Figure \ref{current} gives a very high-level view of how experiments transfer
files to and from tape today using CASTOR. CASTOR and EOS are used by the
experiments as separate Grid storage elements. Experiments store the files
they work on and modify into EOS and they store the files they need archived to
tape into CASTOR. Files can also be transferred directly between EOS and
CASTOR without having to go via an experiment.
\begin{figure}[h]
\centering
\includegraphics[scale=0.18, trim=0mm 60mm 0mm 0mm, clip]{CTA_current_deployment.pdf}
\caption{\label{current}The current deployment of CASTOR}
\end{figure}
Figure \ref{replacement} shows that an EOS instance together with CTA will be
a drop-in replacement for a CASTOR storage element.
\begin{figure}[h]
\centering
\includegraphics[scale=0.18, trim=0mm 60mm 0mm 0mm, clip]{CTA_drop_in_replacement_deployment.pdf}
\caption{\label{replacement}EOS and CTA as a drop-in replacement for CASTOR}
\end{figure}
Once experiments have replaced their CASTOR instance with EOS and CTA, they
will in fact have two EOS instances: the original EOS instance for storing
the files they work on and modify, and the new EOS instance acting as a staging
area for the CTA tape backend. Figure \ref{consolidated} simply shows that,
at the choice of an experiment and the IT operations teams, the two EOS
instances could be merged into one.
\begin{figure}[h]
\centering
\includegraphics[scale=0.18, trim=0mm 60mm 0mm 0mm, clip]{CTA_consolidated_deployment.pdf}
\caption{\label{consolidated}Consolidate EOS if desired}
\end{figure}
CASTOR currently stores the custodial copy of all LHC physics data on tape.
Migrating data from CASTOR to EOS and CTA will be very efficient because
CASTOR and CTA share the same tape format. This means that only the metadata
needs to be copied from CASTOR to EOS and CTA; no files need to be copied
between CASTOR and CTA tapes. CTA will simply take ownership of CASTOR
tapes as they are migrated.
The milestones for the CTA project are as follows. In the second quarter of
2017 an internal release of CTA will be made that does not have the ability to
repack tape media. This release is intended for redundant use cases within the
IT Storage group, such as additional backups of filer data (AFS/NFS) and
additional copies of data from the Large Electron Positron collider (LEP). In
the second quarter of 2018 the first production release of CTA will be made.
This release will have the ability to repack tape media and is intended to
migrate small virtual organizations such as non-LHC experiments from CASTOR to
EOS and CTA. Finally in the fourth quarter of 2018, the second production
release of CTA will be made. This release will be used to migrate large virtual
organizations such as LHC experiments from CASTOR to EOS and CTA.
\section{Conclusion} \label{conclusion}
CTA will avoid functional duplication with EOS through a clean, consolidated
separation between disk and tape. EOS will focus on providing high-performance
disk storage, data transfer protocols and metadata operations. CTA will focus
on providing efficient tape backend storage.
CTA will introduce pre-emptive drive scheduling, which will automatically
schedule the background tasks of tape media repacking and data verification.
This automatic scheduling will keep the tape drives running at full speed all of
the time and therefore enable CTA to cope with the 100 petabytes per year data
rate of LHC physics run 3.
CASTOR and CTA share the same tape format. This means migrating data from
CASTOR to EOS and CTA only requires the metadata to be copied and CTA taking
ownership of CASTOR tapes.
In addition to the hierarchical namespace of EOS, CTA will have its own flat
catalogue of every file archived to tape. This redundancy in metadata will
provide an additional tool in the case of disaster recovery.
The architecture of CTA has benefited from a fresh start. Without the need to
preserve the internal interfaces of the CASTOR networked components, CTA has
been able to reduce the total number of networked components in the tape storage
system.
LHC experiments can expect to start migrating from CASTOR to EOS and CTA at the
beginning of 2019, which is the beginning of the long shutdown period between
LHC physics runs 2 and 3.
\section{References}
\begin{thebibliography}{9}
\bibitem{EOS} EOS homepage {\it http://cern.ch/eos}
\bibitem{CASTOR} CASTOR homepage {\it http://cern.ch/castor}
\bibitem{repack} Kruse D F 2013 The repack challenge {\it Jour. of Phys.: Conf. Ser.} 513 042028
\bibitem{SHIFT} Baud J-P et al. 1991 SHIFT, the Scalable Heterogeneous Integrated Facility {\it Proc. of the Int. Conf. on CHEP'91, Univ. Acad. Press, Tokyo} 571-82
\end{thebibliography}
\end{document}
#!/bin/sh
pdflatex CHEP_2016_paper_CTA.tex
%%
%% This is file `jpconf11.clo'
%%
%% This file is distributed in the hope that it will be useful,
%% but WITHOUT ANY WARRANTY; without even the implied warranty of
%% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
%%
%% \CharacterTable
%% {Upper-case \A\B\C\D\E\F\G\H\I\J\K\L\M\N\O\P\Q\R\S\T\U\V\W\X\Y\Z
%% Lower-case \a\b\c\d\e\f\g\h\i\j\k\l\m\n\o\p\q\r\s\t\u\v\w\x\y\z
%% Digits \0\1\2\3\4\5\6\7\8\9
%% Exclamation \! Double quote \" Hash (number) \#
%% Dollar \$ Percent \% Ampersand \&
%% Acute accent \' Left paren \( Right paren \)
%% Asterisk \* Plus \+ Comma \,
%% Minus \- Point \. Solidus \/
%% Colon \: Semicolon \; Less than \<
%% Equals \= Greater than \> Question mark \?
%% Commercial at \@ Left bracket \[ Backslash \\
%% Right bracket \] Circumflex \^ Underscore \_
%% Grave accent \` Left brace \{ Vertical bar \|
%% Right brace \} Tilde \~}
\ProvidesFile{jpconf11.clo}[2005/05/04 v1.0 LaTeX2e file (size option)]
\renewcommand\normalsize{%
\@setfontsize\normalsize\@xipt{13}%
\abovedisplayskip 12\p@ \@plus3\p@ \@minus7\p@
\abovedisplayshortskip \z@ \@plus3\p@
\belowdisplayshortskip 6.5\p@ \@plus3.5\p@ \@minus3\p@
\belowdisplayskip \abovedisplayskip
\let\@listi\@listI}
\normalsize
\newcommand\small{%
\@setfontsize\small\@xpt{12}%
\abovedisplayskip 11\p@ \@plus3\p@ \@minus6\p@
\abovedisplayshortskip \z@ \@plus3\p@
\belowdisplayshortskip 6.5\p@ \@plus3.5\p@ \@minus3\p@
\def\@listi{\leftmargin\leftmargini
\topsep 9\p@ \@plus3\p@ \@minus5\p@
\parsep 4.5\p@ \@plus2\p@ \@minus\p@
\itemsep \parsep}%
\belowdisplayskip \abovedisplayskip}
\newcommand\footnotesize{%
% \@setfontsize\footnotesize\@xpt\@xiipt
\@setfontsize\footnotesize\@ixpt{11}%
\abovedisplayskip 10\p@ \@plus2\p@ \@minus5\p@
\abovedisplayshortskip \z@ \@plus3\p@
\belowdisplayshortskip 6\p@ \@plus3\p@ \@minus3\p@
\def\@listi{\leftmargin\leftmargini
\topsep 6\p@ \@plus2\p@ \@minus2\p@
\parsep 3\p@ \@plus2\p@ \@minus\p@
\itemsep \parsep}%
\belowdisplayskip \abovedisplayskip
}
\newcommand\scriptsize{\@setfontsize\scriptsize\@viiipt{9.5}}
\newcommand\tiny{\@setfontsize\tiny\@vipt\@viipt}
\newcommand\large{\@setfontsize\large\@xivpt{18}}
\newcommand\Large{\@setfontsize\Large\@xviipt{22}}
\newcommand\LARGE{\@setfontsize\LARGE\@xxpt{25}}
\newcommand\huge{\@setfontsize\huge\@xxvpt{30}}
\let\Huge=\huge
\if@twocolumn
\setlength\parindent{14\p@}
\else
\setlength\parindent{18\p@}
\fi
\if@letterpaper%
%\input{letmarg.tex}%
\setlength{\hoffset}{0mm}
\setlength{\marginparsep}{0mm}
\setlength{\marginparwidth}{0mm}
\setlength{\textwidth}{160mm}
\setlength{\oddsidemargin}{-0.4mm}
\setlength{\evensidemargin}{-0.4mm}
\setlength{\voffset}{0mm}
\setlength{\headheight}{8mm}
\setlength{\headsep}{5mm}
\setlength{\footskip}{0mm}
\setlength{\textheight}{230mm}
\setlength{\topmargin}{1.6mm}
\else
%\input{a4marg.tex}%
\setlength{\hoffset}{0mm}
\setlength{\marginparsep}{0mm}
\setlength{\marginparwidth}{0mm}
\setlength{\textwidth}{160mm}
\setlength{\oddsidemargin}{-0.4mm}
\setlength{\evensidemargin}{-0.4mm}
\setlength{\voffset}{0mm}
\setlength{\headheight}{8mm}
\setlength{\headsep}{5mm}
\setlength{\footskip}{0mm}
\setlength{\textheight}{230mm}
\setlength{\topmargin}{1.6mm}
\fi
\setlength\maxdepth{.5\topskip}
\setlength\@maxdepth\maxdepth
\setlength\footnotesep{8.4\p@}
\setlength{\skip\footins} {10.8\p@ \@plus 4\p@ \@minus 2\p@}
\setlength\floatsep {14\p@ \@plus 2\p@ \@minus 4\p@}
\setlength\textfloatsep {24\p@ \@plus 2\p@ \@minus 4\p@}
\setlength\intextsep {16\p@ \@plus 4\p@ \@minus 4\p@}
\setlength\dblfloatsep {16\p@ \@plus 2\p@ \@minus 4\p@}
\setlength\dbltextfloatsep{24\p@ \@plus 2\p@ \@minus 4\p@}
\setlength\@fptop{0\p@}
\setlength\@fpsep{10\p@ \@plus 1fil}
\setlength\@fpbot{0\p@}
\setlength\@dblfptop{0\p@}
\setlength\@dblfpsep{10\p@ \@plus 1fil}
\setlength\@dblfpbot{0\p@}
\setlength\partopsep{3\p@ \@plus 2\p@ \@minus 2\p@}
\def\@listI{\leftmargin\leftmargini
\parsep=\z@
\topsep=6\p@ \@plus3\p@ \@minus3\p@
\itemsep=3\p@ \@plus2\p@ \@minus1\p@}
\let\@listi\@listI
\@listi
\def\@listii {\leftmargin\leftmarginii
\labelwidth\leftmarginii
\advance\labelwidth-\labelsep
\topsep=3\p@ \@plus2\p@ \@minus\p@
\parsep=\z@
\itemsep=\parsep}
\def\@listiii{\leftmargin\leftmarginiii
\labelwidth\leftmarginiii
\advance\labelwidth-\labelsep
\topsep=\z@
\parsep=\z@
\partopsep=\z@
\itemsep=\z@}
\def\@listiv {\leftmargin\leftmarginiv
\labelwidth\leftmarginiv
\advance\labelwidth-\labelsep}
\def\@listv{\leftmargin\leftmarginv
\labelwidth\leftmarginv
\advance\labelwidth-\labelsep}
\def\@listvi {\leftmargin\leftmarginvi
\labelwidth\leftmarginvi
\advance\labelwidth-\labelsep}
\endinput
%%
%% End of file `iopart12.clo'.