Commit 593a3373 authored by Daniele Kruse

Merged some of the documentation in the main doc file (cta.tex)

parent f0cd014e
CTA basic concepts
CTA is operated by authorized administrators (AdminUsers) who issue CTA commands from authorized machines (AdminHosts), using the CTA command line interface. All administrative metadata (such as tapes, tape pools, storage classes, etc.) is tagged with a "creationLog" and a "lastModificationLog", which record who created or last modified each item, when, and from where. An administrator may create ("add"), delete ("rm"), change ("ch") or list ("ls") any of the administrative metadata.
Tape pools are logical groupings of tapes that are used by operators to separate data belonging to different VOs. They are also used to categorize types of data and to separate multiple copies of files so that they end up in different buildings. Each tape belongs to one and only one tape pool.
Logical libraries are the concept used to link tapes and drives together. We use logical libraries to specify which tapes are mountable in which drives. Normally this mountability criterion is based on location (the tape has to be in the same physical library as the drive) and on read/write compatibility. Each tape and each drive belongs to one and only one logical library.
The storage class is what we assign to each archive file to specify how many tape copies the file is expected to have. Archive routes link storage classes to tape pools. An archive route specifies onto which set of tapes the copies will be written. There is an archive route for each copy in each storage class, and normally there should be a single archive route per tape pool.
To summarize, an archive file has a storage class that determines how many copies on tape that file should have. A storage class has one archive route per tape copy, specifying into which tape pool each copy goes. Each tape pool is made of a disjoint set of tapes, and tapes can be mounted on drives that are in the same logical library.
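The relationships summarized above can be sketched as a small data model. This is only an illustration under stated assumptions: the class names, field names (`vid`, `nb_copies`) and the `pools_for` and `is_mountable` helpers are hypothetical and are not taken from the CTA code base.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Tape:
    vid: str               # hypothetical tape identifier
    tape_pool: str         # each tape belongs to exactly one tape pool
    logical_library: str   # each tape has exactly one logical library


@dataclass(frozen=True)
class Drive:
    name: str
    logical_library: str   # each drive has exactly one logical library


@dataclass(frozen=True)
class StorageClass:
    name: str
    nb_copies: int         # number of tape copies expected for each file


# Archive routes: (storage class name, copy number) -> tape pool name.
ArchiveRoutes = Dict[Tuple[str, int], str]


def pools_for(storage_class: StorageClass, routes: ArchiveRoutes) -> List[str]:
    """Return the tape pool of each copy, failing if a route is missing."""
    pools = []
    for copy_nb in range(1, storage_class.nb_copies + 1):
        key = (storage_class.name, copy_nb)
        if key not in routes:
            raise LookupError(f"no archive route for {key}")
        pools.append(routes[key])
    return pools


def is_mountable(tape: Tape, drive: Drive) -> bool:
    """A tape can be mounted only on a drive in the same logical library."""
    return tape.logical_library == drive.logical_library
```

For a storage class with two copies, `pools_for` returns two pool names, one per archive route.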
Archiving a file with CTA
CTA has a CLI for archiving and retrieving files to/from tape, which is meant to be used by an external disk-based storage system with an archiving workflow engine, such as EOS. A non-administrative "User" in CTA is an EOS user who triggers the need for archiving or retrieving a file to/from tape. A user normally belongs to a specific CTA "mount group", which specifies the mount criteria and limitations (together called the "mount policy") that trigger a tape mount. Here we offer a simplified description of the archive process:
1. EOS issues an archive command for a specific file, providing its source path, its storage class (see above), and the user requesting the archival
2. CTA immediately returns an "ArchiveFileID", which CTA uses to uniquely identify files archived on tape. This ID will be kept by EOS for any operations on this file (such as retrieval)
3. Asynchronously, CTA carries out the archival of the file to tape, in the following steps:
a. CTA looks up the storage class provided by EOS and makes sure it has correct routings to one or more tape pools (more than one when multiple copies are required by the storage class)
b. CTA queues the corresponding archive job(s) to the proper tape pool(s)
c. in the meantime each free tape drive queries the central "scheduler" for work to be done, by communicating its name and its logical library
d. for each work request CTA checks whether there is a free tape in the required pool (as specified in b.), that belongs to the desired logical library (as specified in c.)
e. if that is the case, CTA checks whether the work queued for that tape pool is worth a mount, i.e. if it meets the archive criteria specified in the mount group to which the user (as specified in 1.) belongs
f. if that is the case, the tape is mounted in the drive and the file gets written from the source path specified in 1. to the tape
g. after a successful archival CTA notifies EOS through an asynchronous callback
An archival process can be canceled at any moment through the "delete archive" command. If the file has already been archived correctly, the same command acts as a delete.
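Steps c to f of the archive process can be sketched as a single scheduling decision. This is a minimal illustration, not the CTA scheduler: the mount criteria shown (`min_bytes`, `max_age_seconds`) are hypothetical, since the text does not list the actual criteria, and a single policy is used instead of one per mount group.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ArchiveJob:
    archive_file_id: int
    size: int            # bytes
    queued_at: float     # seconds since the epoch


@dataclass
class FreeTape:
    vid: str
    tape_pool: str
    logical_library: str


@dataclass
class MountPolicy:
    # Hypothetical criteria; the real mount criteria are not specified here.
    min_bytes: int
    max_age_seconds: float


def worth_a_mount(jobs: List[ArchiveJob], policy: MountPolicy, now: float) -> bool:
    """Step e: enough data is queued, or the oldest job has waited too long."""
    if not jobs:
        return False
    total = sum(job.size for job in jobs)
    oldest = min(job.queued_at for job in jobs)
    return total >= policy.min_bytes or now - oldest >= policy.max_age_seconds


def next_archive_mount(drive_library: str,
                       queues: Dict[str, List[ArchiveJob]],
                       free_tapes: List[FreeTape],
                       policy: MountPolicy,
                       now: float) -> Optional[FreeTape]:
    """Steps c to f: pick a tape to mount for a free drive, or None."""
    for tape in free_tapes:
        if tape.logical_library != drive_library:   # step d: wrong library
            continue
        jobs = queues.get(tape.tape_pool, [])
        if worth_a_mount(jobs, policy, now):        # step e
            return tape                             # step f: mount this tape
    return None
```

Here `queues` maps a tape pool name to the archive jobs queued for it in step b, and the drive only reports its logical library, as in step c.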
Retrieving a file with CTA
Here we offer a simplified description of the retrieve process:
1. EOS issues a retrieve command for a specific file, providing its ArchiveFileID and desired destination path, and the user requesting the retrieval
2. CTA returns immediately
3. Asynchronously, CTA carries out the retrieval of the file from tape, in the following steps:
a. CTA queues the corresponding retrieve job(s) to the proper tape(s) (depending on where the tape copies are located)
b. in the meantime each free tape drive queries the central "scheduler" for work to be done, by communicating its name and its logical library
c. for each work request CTA checks whether the logical library (as specified in b.) is the same as that of (one of) the tape(s) (as specified in a.)
d. if that is the case, CTA checks whether the work queued for that tape is worth the mount, i.e. if it meets the retrieve criteria specified in the mount group to which the user (as specified in 1.) belongs
e. if that is the case, the tape is mounted in the drive and the file gets read from tape to the destination specified in 1.
f. after a successful retrieval CTA notifies EOS through an asynchronous callback
A retrieval process can be canceled at any moment prior to correct retrieval through the "cancel retrieve" command
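Steps a and c of the retrieve process can be sketched as follows. This is an illustration only: the function names are hypothetical, and queueing a request on every tape that holds a copy is an assumption made for simplicity.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def queue_retrieves(requests: List[Tuple[int, List[str]]]) -> Dict[str, List[int]]:
    """Step a: queue each ArchiveFileID on every tape holding one of its copies."""
    queues: Dict[str, List[int]] = defaultdict(list)
    for archive_file_id, tape_vids in requests:
        for vid in tape_vids:
            queues[vid].append(archive_file_id)
    return dict(queues)


def tapes_for_drive(queues: Dict[str, List[int]],
                    tape_library: Dict[str, str],
                    drive_library: str) -> List[str]:
    """Step c: tapes with queued work that share the drive's logical library."""
    return sorted(vid for vid in queues if tape_library[vid] == drive_library)
```

Each request is an (ArchiveFileID, tapes) pair, and `tape_library` maps a tape to its logical library. The mount criteria check of step d would then be applied to each tape returned by `tapes_for_drive`.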
EOS communicates with CTA by issuing commands on trusted hosts. EOS can archive a file, retrieve it, update its
information/storage class, delete it or simply list the available storage classes. See the LimitingInstanceCrosstalk.txt
file for more details on how these commands are authorized by CTA.
**********ARCHIVING from EOS to CTA**********
1) EOS REQUEST: cta a/archive --encoded <"true" or "false"> // true if all following arguments are base64 encoded, false if all following arguments are in clear (no mixing of encoded and clear arguments)
--user <user> // string name of the requester of the action (archival), used for SLAs and logging, not kept by CTA after successful operation
--group <group> // string group of the requester of the action (archival), used for SLAs and logging, not kept by CTA after successful operation
--diskid <disk_id> // string disk id of the file to be archived, kept by CTA for reconciliation purposes
--instance <instance> // string kept by CTA for authorizing the request and for disaster recovery
--srcurl <src_URL> // string source URL of the file to archive of the form scheme://host:port/opaque_part, not kept by CTA after successful archival
--size <size> // uint64_t size in bytes kept by CTA for correct archival and disaster recovery
--checksumtype <checksum_type> // string checksum type (ex. ADLER32) kept by CTA for correct archival and disaster recovery
--checksumvalue <checksum_value> // string checksum value kept by CTA for correct archival and disaster recovery
--storageclass <storage_class> // string that determines how many copies and which tape pools will be used for archival kept by CTA for routing and authorization
--diskfilepath <disk_filepath> // string the disk logical path kept by CTA for disaster recovery and for logging
--diskfileowner <disk_fileowner> // string owner username kept by CTA for disaster recovery and for logging
--diskfilegroup <disk_filegroup> // string owner group kept by CTA for disaster recovery and for logging
--recoveryblob <recovery_blob> // 2KB string kept by CTA for disaster recovery (opaque string controlled by EOS)
--diskpool <diskpool_name> // string used (and possibly kept) by CTA for proper drive allocation
--throughput <diskpool_throughput> // uint64_t (in bytes) used (and possibly kept) by CTA for proper drive allocation
2) CTA IMMEDIATE REPLY: CTA_ArchiveFileID or Error
CTA_ArchiveFileID: string which is the unique ID of the CTA file to be kept by EOS while file exists (for future retrievals).
In case of retries, a new ID will be given by CTA (as if it were a new file); the old one can be discarded by EOS.
3) CTA CALLBACK WHEN ARCHIVED SUCCESSFULLY: src_URL and copy_number with or without Error
src_URL: this is the same string provided in the EOS archival request
copy_number: indicates which copy number was archived
note: if multiple copies are archived there will be one callback per copy
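As an illustration of the all-or-nothing "--encoded" rule, the sketch below builds the argument list of an archive request. It rests on several assumptions: the long command name "archive" is used, every listed option is treated as mandatory, and values are encoded as UTF-8 with the standard base64 alphabet. The helper `build_archive_argv` is hypothetical and not part of CTA.

```python
import base64
from typing import Dict, List

# Option names in the order listed above.
ARCHIVE_OPTIONS = ["user", "group", "diskid", "instance", "srcurl", "size",
                   "checksumtype", "checksumvalue", "storageclass",
                   "diskfilepath", "diskfileowner", "diskfilegroup",
                   "recoveryblob", "diskpool", "throughput"]


def build_archive_argv(values: Dict[str, str], encoded: bool) -> List[str]:
    """Build the argument list; either all values are base64 encoded or none."""
    missing = [name for name in ARCHIVE_OPTIONS if name not in values]
    if missing:
        raise ValueError(f"missing options: {missing}")
    argv = ["cta", "archive", "--encoded", "true" if encoded else "false"]
    for name in ARCHIVE_OPTIONS:
        value = str(values[name])
        if encoded:
            # Assumption: UTF-8 text and the standard base64 alphabet.
            value = base64.b64encode(value.encode("utf-8")).decode("ascii")
        argv += [f"--{name}", value]
    return argv
```

Encoding every value in one place makes it impossible to mix encoded and clear arguments in the same request.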
**********RETRIEVING from CTA to EOS**********
1) EOS REQUEST: cta r/retrieve --encoded <"true" or "false"> // true if all following arguments are base64 encoded, false if all following arguments are in clear (no mixing of encoded and clear arguments)
--user <user> // string name of the requester of the action (retrieval), used for SLAs and logging, not kept by CTA after successful operation
--group <group> // string group of the requester of the action (retrieval), used for SLAs and logging, not kept by CTA after successful operation
--id <CTA_ArchiveFileID> // uint64_t which is the unique ID of the CTA file
--dsturl <dst_URL> // string of the form scheme://host:port/opaque_part (NOT kept by CTA after successful retrieval)
--diskfilepath <disk_filepath> // string the disk logical path kept by CTA for disaster recovery and for logging
--diskfileowner <disk_fileowner> // string owner username kept by CTA for disaster recovery and for logging
--diskfilegroup <disk_filegroup> // string owner group kept by CTA for disaster recovery and for logging
--recoveryblob <recovery_blob> // 2KB string kept by CTA for disaster recovery (opaque string controlled by EOS)
--diskpool <diskpool_name> // string used (and possibly kept) by CTA for proper drive allocation
--throughput <diskpool_throughput> // uint64_t (in bytes) used (and possibly kept) by CTA for proper drive allocation
Note: disk info is piggybacked
2) CTA IMMEDIATE REPLY: Empty or Error
3) CTA CALLBACK WHEN RETRIEVED SUCCESSFULLY: dst_URL with or without Error
dst_URL: this is the same string provided in the EOS retrieval request
**********DELETING an ARCHIVE FILE**********
1) EOS REQUEST: cta da/deletearchive --encoded <"true" or "false"> // true if all following arguments are base64 encoded, false if all following arguments are in clear (no mixing of encoded and clear arguments)
--user <user> // string name of the requester of the action (deletion), used for logging, not kept by CTA after successful operation
--group <group> // string group of the requester of the action (deletion), used for logging, not kept by CTA after successful operation
--id <CTA_ArchiveFileID> // uint64_t which is the unique ID of the CTA file
Note: This command may be issued even before the actual archival process has begun
2) CTA IMMEDIATE REPLY: Empty or Error
**********CANCELING a SCHEDULED RETRIEVAL**********
1) EOS REQUEST: cta cr/cancelretrieve --encoded <"true" or "false"> // true if all following arguments are base64 encoded, false if all following arguments are in clear (no mixing of encoded and clear arguments)
--user <user> // string name of the requester of the action (cancel), used for logging, not kept by CTA after successful operation
--group <group> // string group of the requester of the action (cancel), used for logging, not kept by CTA after successful operation
--id <CTA_ArchiveFileID> // uint64_t which is the unique ID of the CTA file
--dsturl <dst_URL> // this is the same string provided in the EOS retrieval request
--diskfilepath <disk_filepath> // string the disk logical path kept by CTA for disaster recovery and for logging
--diskfileowner <disk_fileowner> // string owner username kept by CTA for disaster recovery and for logging
--diskfilegroup <disk_filegroup> // string owner group kept by CTA for disaster recovery and for logging
--recoveryblob <recovery_blob> // 2KB string kept by CTA for disaster recovery (opaque string controlled by EOS)
Note: This command will succeed ONLY before the actual retrieval process has begun
Note: disk info is piggybacked
2) CTA IMMEDIATE REPLY: Empty or Error
**********UPDATE the STORAGE CLASS of a FILE**********
1) EOS REQUEST: cta ufsc/updatefilestorageclass --encoded <"true" or "false"> // true if all following arguments are base64 encoded, false if all following arguments are in clear (no mixing of encoded and clear arguments)
--user <user> // string name of the requester of the action (update), used for logging, not kept by CTA after successful operation
--group <group> // string group of the requester of the action (update), used for logging, not kept by CTA after successful operation
--id <CTA_ArchiveFileID> // uint64_t which is the unique ID of the CTA file
--storageclass <storage_class> // updated storage class which may or may not have a different routing
--diskfilepath <disk_filepath> // string the disk logical path kept by CTA for disaster recovery and for logging
--diskfileowner <disk_fileowner> // string owner username kept by CTA for disaster recovery and for logging
--diskfilegroup <disk_filegroup> // string owner group kept by CTA for disaster recovery and for logging
--recoveryblob <recovery_blob> // 2KB string kept by CTA for disaster recovery (opaque string controlled by EOS)
Note: This command DOES NOT change the number of tape copies! The number will change asynchronously (next repack or "reconciliation").
Note: disk info is piggybacked
2) CTA IMMEDIATE REPLY: Empty or Error
**********UPDATE INFO of a FILE**********
1) EOS REQUEST: cta ufi/updatefileinfo --encoded <"true" or "false"> // true if all following arguments are base64 encoded, false if all following arguments are in clear (no mixing of encoded and clear arguments)
--id <CTA_ArchiveFileID> // uint64_t which is the unique ID of the CTA file
--diskfilepath <disk_filepath> // string the disk logical path kept by CTA for disaster recovery and for logging
--diskfileowner <disk_fileowner> // string owner username kept by CTA for disaster recovery and for logging
--diskfilegroup <disk_filegroup> // string owner group kept by CTA for disaster recovery and for logging
--recoveryblob <recovery_blob> // 2KB string kept by CTA for disaster recovery (opaque string controlled by EOS)
Note: This command is not executed on behalf of an EOS user. Instead it is part of a resynchronization process initiated by EOS.
2) CTA IMMEDIATE REPLY: Empty or Error
**********LISTING all STORAGE CLASSES available**********
1) EOS REQUEST: cta lsc/liststorageclass --encoded <"true" or "false"> // true if all following arguments are base64 encoded, false if all following arguments are in clear (no mixing of encoded and clear arguments)
--user <user> // string name of the requester of the action (listing), used for logging, not kept by CTA after successful operation
--group <group> // string group of the requester of the action (listing), used for logging, not kept by CTA after successful operation
2) CTA IMMEDIATE REPLY: storage class list
Limiting crosstalk in CTA
One of the requirements of CTA is to limit the crosstalk among different EOS
instances. In more detail:
1) A listStorageClass command should return only the storage classes
belonging to the instance from where the command was executed
2) A queueArchive command should be authorized only if:
- the instance provided in the command line coincides with the instance from
where the command was executed
- the storage class provided in the command line belongs to the instance from
where the command was executed
- the EOS username and/or group (of the original archive requester) provided
in the command line belongs to the instance from where the command was
executed
3) A queueRetrieve command should be authorized only if:
- the instance of the requested file coincides with the instance from where
the command was executed
- the EOS username and/or group (of the original retrieve requester) provided
in the command line belongs to the instance from where the command was
executed
4) A deleteArchive command should be authorized only if:
- the instance of the file to be deleted coincides with the instance from
where the command was executed
- the EOS username and/or group (of the original delete requester) provided
in the command line belongs to the instance from where the command was
executed
5) A cancelRetrieve command should be authorized only if:
- the instance of the file to be canceled coincides with the instance from
where the command was executed
- the EOS username and/or group (of the original cancel requester) provided
in the command line belongs to the instance from where the command was
executed
6) An updateFileStorageClass command should be authorized only if:
- the instance of the file to be updated coincides with the instance from
where the command was executed
- the storage class provided in the command line belongs to the instance from
where the command was executed
- the EOS username and/or group (of the original update requester) provided
in the command line belongs to the instance from where the command was
executed
7) An updateFileInfo command should be authorized only if:
- the instance of the file to be updated coincides with the instance from
where the command was executed
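The authorization rules above can be sketched as simple predicates. This is an illustration under assumptions: the `Instance` record and both function names are hypothetical, and "username and/or group" is read here as "at least one of the two belongs to the instance".

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Instance:
    """What CTA knows about the EOS instance a command was executed from."""
    name: str
    storage_classes: Set[str] = field(default_factory=set)
    users: Set[str] = field(default_factory=set)
    groups: Set[str] = field(default_factory=set)


def authorize_queue_archive(caller: Instance, instance_arg: str,
                            storage_class: str, user: str, group: str) -> bool:
    """Rule 2: instance, storage class and requester must match the caller."""
    return (instance_arg == caller.name
            and storage_class in caller.storage_classes
            and (user in caller.users or group in caller.groups))


def authorize_file_command(caller: Instance, file_instance: str,
                           user: str, group: str) -> bool:
    """Rules 3 to 5: the file and the requester must belong to the caller."""
    return (file_instance == caller.name
            and (user in caller.users or group in caller.groups))
```

Rule 6 would combine `authorize_file_command` with the storage class check of rule 2, and rule 7 only compares the instance of the file with that of the caller.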
CTA-EOS Reconciliation Strategy
Use-case 1: Reconciling EOS file info and CTA disk file info
This should be the most common scenario causing discrepancies between the EOS namespace and the disk file info within
the CTA catalogue. The proposal is to attack this in two ways: first (already done) we piggyback disk file info on most
commands acting on CTA Archive files ("archive", "retrieve", "cancelretrieve", etc.), second (to be agreed with Andreas)
EOS could have a trigger on file renames or other file information changes (owner, group, path, etc.) that calls our
updatefileinfo command with the updated fields. In addition (also to be agreed with Andreas) there should be a
separate low-priority process (a sort of EOS-side reconciliation process) that goes through the entire EOS namespace,
periodically calling updatefileinfo on each of the known files. We would also store the date when this update function
was last called (see below for the reason).
Use-case 2: Reconciling EOS deletes which haven't been propagated to CTA
Say that the above EOS-side low-priority reconciliation process takes on average 3 months and it is run continuously. We
could use the last reconciliation date to determine the list of possible candidates of files which EOS does not know
about anymore, by just taking the ones which haven't been updated say in the last 6 months. Since we have the EOS
instance name and EOS file id for each file (and Andreas confirmed that IDs are unique and never reused within a single
instance), we can then automatically check (through our own CTA-side reconciliation process) whether indeed these files
exist or not. For the ones that still exist we notify the EOS admins of a possible bug in their reconciliation process
and ask them to issue the updatefileinfo command; for the ones that no longer exist we double-check with their
owners before deleting them from CTA.
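The candidate selection described in Use-case 2 can be sketched as a filter on the last reconciliation date. This is a minimal illustration: the function name, the in-memory dictionary and the use of 180 days for "6 months" are assumptions, not part of the CTA catalogue.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# A file is identified by (EOS instance name, EOS file id).
FileKey = Tuple[str, str]


def stale_candidates(last_update: Dict[FileKey, datetime],
                     now: datetime,
                     max_age: timedelta = timedelta(days=180)) -> List[FileKey]:
    """Return files whose last updatefileinfo call is older than max_age."""
    cutoff = now - max_age
    return sorted(key for key, updated in last_update.items() if updated < cutoff)
```

Each returned file would then be checked against its EOS instance, and only files confirmed missing there would be proposed to their owners for deletion.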
Note: It's important to note that we do not reconcile storage class information. Any storage class change is triggered
by the EOS user and it is synchronous: once we successfully record the change our command returns.