[Rdap] Actual cases of data management SNAFUs
oneiros at grace.nascom.nasa.gov
Fri Sep 25 12:20:02 EDT 2015
On Fri, 25 Sep 2015, Stefan Ekman wrote:
> Dear list members,
> I'm looking for actual cases where research has been prevented or
> impaired by poor data management and/or sharing practice, preferably in
> Europe but anywhere is fine. Created scenarios (such as this:
> https://www.youtube.com/watch?v=N2zK3sAtr-4) are easy to find, or indeed
> imagine, but real-life examples would be valuable in discussing RDM with
> researchers (and funders).
What do you consider to be 'impaired'?
One of the distribution systems that I deal with had multiple errors the
year before last, which caused problems for people trying to get the data:
1. The authoritative site had one of their machines that managed a number
of disk arrays fail.
2. Someone accidentally issued a command that deleted almost all of the
   contents of a disk array.
3. There *were* backups (tape), but the distribution system wouldn't
perform a tape load when remote sites attempted to access the file.
(we had to e-mail the list of objects that we were trying to retrieve,
and someone would manually issue the command to load them from tape).
Restoring the deleted data took ~2 weeks.
4. Another (larger) disk array failed.
5. The authoritative system didn't mark the data as unavailable, so remote
systems kept trying to retrieve the files repeatedly (slowing down
total throughput so much that we couldn't keep up w/ the data stream
for new data).
6. The distribution system had no fail-back for what to do when data
wasn't available, so we had no mechanism to inform users what was
going on and why their requests were failing.
7. I was given access to a CGI to call to request a tape load at the
authoritative site & modified my system to make requests if the data
was from the failed storage array.
8. The tape drivers were written in-house, and would only process loads in
   order, rather than doing opportunistic reads (if a tape is already
   loaded and there are requests queued for that tape, serve them
   out-of-order).
9. My sending tape load requests (batched up once per hour) was causing
   the system to thrash so badly it was affecting their normal use of the
   system.
10. They finally agreed to do a full restore of the failed array, about 2-3
months after it had failed. (and it took another month or so to finish
restoring the data)
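To illustrate what I mean by "opportunistic reads" in item 8: a rough sketch (all names here are made up, not the site's actual code) of a load queue that prefers whatever tape is already mounted, rather than strictly processing requests in arrival order:

```python
from collections import deque

def serve_requests(requests, load_tape, read_file):
    """Serve (tape_id, filename) requests, preferring any queued
    request for the tape that is already loaded instead of
    strictly following FIFO order."""
    queue = deque(requests)
    loaded = None          # tape currently in the drive
    served = []            # order in which files were actually read
    while queue:
        # Prefer any queued request for the already-loaded tape.
        match = next((r for r in queue if r[0] == loaded), None)
        if match is None:
            match = queue[0]        # nothing matches: fall back to FIFO
            loaded = match[0]
            load_tape(loaded)       # expensive: mount a different tape
        queue.remove(match)
        served.append(read_file(match[1]))
    return served
```

With requests interleaved across tapes A, B, A, this serves both A files on a single mount, whereas the in-order behavior I described would mount A, then B, then A again.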
I don't know if this is a failure of a data management plan ... I've been
putting the blame on the site for deciding to implement their own system
for managing the storage, rather than using something that's in use at
other sites (and therefore, well-tested).
There was another incident with that site just before the embargo was to
be lifted -- someone issued a command to delete almost all of the data
from one of the instruments, and they had to restore from tape. However,
they had written the controller for the tape drives single-threaded, and
so it was taking more than 15hrs to restore a day's data (and they had
~4 months of data to restore) ... and the tape drives were sitting 2/3
idle, because the contention was in the controller, not the drives.
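The fix for that kind of bottleneck is old hat: dispatch one worker per drive so the drives stay busy while the controller just hands out work. A minimal sketch (hypothetical names, not their code) of what a multi-drive restore dispatcher could look like:

```python
import queue
import threading

def parallel_restore(tapes, restore_one, n_drives=3):
    """Restore `tapes` with up to n_drives concurrent workers,
    instead of one serialized controller leaving drives idle."""
    work = queue.Queue()
    for t in tapes:
        work.put(t)
    done = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                t = work.get_nowait()
            except queue.Empty:
                return              # no work left: this drive is finished
            result = restore_one(t)  # the slow, per-tape restore
            with lock:
                done.append(result)

    threads = [threading.Thread(target=worker) for _ in range(n_drives)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return done
```

With three drives that are each 2/3 idle under a single-threaded controller, something along these lines gets you roughly a 3x speedup for free.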
This is also the group that I had to argue with for 3+ months to try to
get them to put checksums in their data.
They insisted they were using RAID6, and so they didn't need them as
they'd have to lose 2 drives per array for there to be a problem. Of
course, 2 weeks before launch, they had a power outage in their data
center and a third of the arrays lost 2 disks. Suddenly the PI got the
idea that checksums would be a good thing.
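For anyone wondering what "checksums in their data" amounts to in practice: nothing exotic, just a digest stored alongside each file so you can detect the corruption that RAID parity never sees (bit rot, bad restores, truncated transfers). A minimal sketch, independent of any particular archive format:

```python
import hashlib

def checksum(data, algo="sha256"):
    """Return a hex digest of a data block; stored alongside the
    file, it lets you verify the bytes independently of the RAID."""
    h = hashlib.new(algo)
    h.update(data)
    return h.hexdigest()

def verify(data, expected, algo="sha256"):
    """True if the bytes on disk still match the stored digest."""
    return checksum(data, algo) == expected
```

RAID tells you a *drive* failed; a per-file checksum tells you whether the *data* you got back is the data you put in, which is the question that actually matters after a restore.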
The problem is also related to other management issues resulting in
their two main software developers quitting a good year or so before
launch. I suspect it was because of how dismissive the PI was towards IT
expertise. (as was the case when our group was brought in ... and I tried
getting them to implement checksums and add identifiers to allow the
scientific community to better track data from their system, but got
nothing but push-back).
And even if you don't consider that 'impaired' -- dealing with that group
has eaten up ~6 years of effort to support them, and as such, the Virtual
Solar Observatory hasn't integrated a number of other solar data sources
that aren't well-distributed.