- Jul 31, 2017
-
Julien Leduc authored
Purging the Oracle DB recycle bin in init, otherwise the CI DB size explodes because of the recycle bin content.
-
- Jul 30, 2017
-
Eric Cano authored
The locks in Rados have timeouts. They are needed in case a locker process dies without releasing its lock. As we have some contention in heavily loaded situations, it can happen that a process is still accessing objects while the lock has expired. To make this situation less likely, the timeout has been increased from 10s to 60s. The backoff was adjusted using the MultithreadLockingInterface unit test, with printouts allowing the effect of the backoff strategy to be seen visually. The printouts are committed, but commented out. The same unit test was fixed, as it used to create an empty object, which is not supported anymore in order to be able to detect locking of non-existing objects (the lock creates the object, but we detect non-existence as it is empty and re-delete it). This mechanism of empty-object locking detection is also added to the async update of objects, as it was missing there (and the backoff has been added there too).
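The backoff strategy described above can be sketched as follows. This is a minimal illustration, not the CTA code: `lockWithBackoff` and its parameters are hypothetical names, the lock's 60s TTL itself would be enforced by the lock server, and the try-lock operation is abstracted as a callable.

```cpp
#include <chrono>
#include <functional>
#include <random>
#include <thread>

// Sketch: retry a try-lock with randomized exponential backoff.
// tryLock is any callable returning true once the lock is acquired.
bool lockWithBackoff(const std::function<bool()>& tryLock,
                     int maxAttempts = 10,
                     std::chrono::milliseconds base = std::chrono::milliseconds(10)) {
  std::mt19937 gen(std::random_device{}());
  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    if (tryLock()) return true;
    // Randomized exponential backoff: sleep in [0, base * 2^attempt).
    long long ceiling = static_cast<long long>(base.count()) << attempt;
    std::uniform_int_distribution<long long> dist(0, ceiling);
    std::this_thread::sleep_for(std::chrono::milliseconds(dist(gen)));
  }
  return false;  // lock still held by someone else after all attempts
}
```

The jitter spreads out retries from contending processes so they do not all hammer the lock at the same instant, which is the effect the unit-test printouts were used to check visually.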
-
- Jul 29, 2017
-
Eric Cano authored
Added unlocking a non-scoped lock if needed. Added more information in logs.
-
- Jul 28, 2017
-
Eric Cano authored
The name of the object was already present in some errors, but not all.
-
Victor Kotlyar authored
DriveState.
-
Eric Cano authored
- when failing to schedule. - now lists which drive has an existing mount (at schedule time as well).
-
Vladimir Bahyl authored
-
Julien Leduc authored
-
Vladimir Bahyl authored
-
Vladimir Bahyl authored
-
Julien Leduc authored
Timing out full runs after 50 minutes: 10 minutes for namespace creation and 40 minutes for the test, so that gitlab does not time them out and leave a dirty CI runner.
-
Julien Leduc authored
Performing 100 rm operations in parallel for rados; this should not be painful, as those synchronous rm calls are mostly waiting.
-
Julien Leduc authored
-
Vladimir Bahyl authored
-
Vladimir Bahyl authored
XRD_TIMEOUTRESOLUTION=600 # increased from 15s
-
Julien Leduc authored
Replaced the eosh script with ls -y, except after retrieval, as archived and retrieved are the same status regarding eos... It looks like the number of archived files determined by ls -y is sometimes not a growing function...
-
- Jul 27, 2017
-
Julien Leduc authored
client_ar.sh can now write to /eos/ctaeos/preprod with the -d option; it just complains: "Could not remove disk replica for /eos/ctaeos/preprod/", as the drop is already done in the wfe script. Should test for the disk replica before trying to drop, using ls -y on the directory.
-
Eric Cano authored
-
Eric Cano authored
In the MemQueue, the promise for the next batch was set after the queue was committed, but before the lock was released (by the last user of the queue, through a shared pointer). This would lead to a uselessly early start of the next queue batch for writing, and avoidable contention on the object store lock. This would not lead to a pile-up though, as only 2 threads would contend (the previous one and the early-starting next one).
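The ordering fix can be sketched with a `std::promise`. This is an illustration with invented names (`QueueBatch`, `commitBatch`), not the actual MemQueue code: the point is that the next batch's promise is fulfilled only after the current batch's lock is released, so the woken writer does not immediately contend on a still-held lock.

```cpp
#include <future>
#include <mutex>

// Illustrative stand-in for a queue batch: an object-store lock plus
// a promise that wakes the writer of the next batch.
struct QueueBatch {
  std::mutex objectStoreLock;
  std::promise<void> nextBatchReady;
};

void commitBatch(QueueBatch& b) {
  {
    std::lock_guard<std::mutex> lk(b.objectStoreLock);
    // ... commit the queue to the object store under the lock ...
  }  // The lock is released here, FIRST...
  b.nextBatchReady.set_value();  // ...and only THEN is the next batch woken.
}
```

With the buggy ordering (set_value inside the locked scope), the next batch's thread wakes up and blocks on `objectStoreLock` while it is still held; releasing first makes the wake-up contention-free.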
-
Eric Cano authored
-
Eric Cano authored
-
Eric Cano authored
... in preparation for replacement of RetrieveMount::getNextJob().
-
Julien Leduc authored
-
Julien Leduc authored
-
Eric Cano authored
The retrieve request now gets properly queued in case of retrieve error. The errors are counted and the request eventually gets deleted. A new field was added to the retrieve request in the object store. This commit will fail on upgrade if there are retrieve requests still queued at update time. Cleaned up some unused structures in cta.proto. Minor modifications to ArchiveJobs.
-
Victor Kotlyar authored
Converted all bytes to Mbytes. Removed extra space in the output. Reordered fields.
-
Eric Cano authored
This is a stop gap solution while we wait for efficient archive/retrieve reporting.
-
Eric Cano authored
-
- Jul 26, 2017
-
Eric Cano authored
This affects only unit tests as taped already relied on getNextJobBatch().
-
Vladimir Bahyl authored
-
Eric Cano authored
As rados re-creates an object when trying to lock it, we used to test for presence before locking. This is racy, as the object could be deleted in the meantime. Instead, we now lock blindly and delete the object if we find it has zero size. As we own the lock, this is safe. This problem led to issues in the garbage collector, where the agent gets polled while it could disappear.
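The blind-lock scheme can be sketched with an in-memory stand-in for the object store. This is not the Rados API: `FakeStore` and `lockAndCheckExists` are invented names, and only the logic is meant to match the description above, where a lock auto-creates the object and a zero-size object is taken to mean it did not really exist.

```cpp
#include <map>
#include <string>

// Illustrative in-memory object store. Locking creates the object if
// absent, mimicking the Rados behaviour described in the commit.
struct FakeStore {
  std::map<std::string, std::string> objects;  // name -> payload

  void lock(const std::string& name) { objects.emplace(name, std::string()); }
  void unlock(const std::string&) {}

  // Lock blindly, then detect non-existence: a zero-size object was
  // created by the lock itself, so delete it (safe: we own the lock)
  // and report that the object does not exist.
  bool lockAndCheckExists(const std::string& name) {
    lock(name);
    if (objects[name].empty()) {
      objects.erase(name);
      unlock(name);
      return false;
    }
    return true;  // caller proceeds, still holding the lock
  }
};
```

Compared with a check-then-lock sequence, there is no window in which another process can delete the object between the presence test and the lock, which is exactly the race the commit removes.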
-
- Jul 25, 2017